The landscape of artificial intelligence is constantly shifting, with Large Language Models (LLMs) becoming increasingly sophisticated and integrated into our digital lives. While cloud-based AI services offer convenience, a growing number of users are turning towards running these powerful models directly on their own computers. This approach offers enhanced privacy, cost savings, and greater control. Facilitating this shift is Ollama, a revolutionary tool designed to drastically simplify the complex process of downloading, configuring, and operating cutting-edge LLMs like Llama 3, Mistral, Gemma, Phi, and many others locally.
This comprehensive guide serves as your starting point for mastering Ollama. We will journey from the initial installation steps and basic model interactions to more advanced customization techniques, API usage, and essential troubleshooting. Whether you are a software developer seeking to weave local AI into your applications, a researcher keen on experimenting with diverse model architectures, or simply an AI enthusiast eager to explore the potential of running powerful models offline, Ollama provides an exceptionally streamlined and efficient gateway.
Want an integrated, All-in-One platform for your Developer Team to work together with maximum productivity?
Apidog delivers on all your demands and replaces Postman at a much more affordable price!
Why Choose Ollama to Run AI Models Locally?
Why opt for this approach instead of relying solely on readily available cloud APIs? Well, here are the reasons:
"The beauty of having small but smart models like Gemma 2 9B is that it allows you to build all sorts of fun stuff locally. Like this script that uses @ollama to fix typos or improve any text on my Mac just by pressing dedicated keys. A local Grammarly but super fast. ⚡️"
— Pietro Schirano (@skirano), July 12, 2024
- Ollama gives you privacy and security when running LLMs locally, with everything under your control: When you execute an LLM using Ollama on your machine, every piece of data – your prompts, the documents you provide, and the text generated by the model – remains confined to your local system. It never leaves your hardware. This ensures the highest level of privacy and data control, a critical factor when dealing with sensitive personal information, confidential business data, or proprietary research.
- It's simply cheaper to run LLMs locally: Cloud-based LLM APIs often operate on pay-per-use models or require ongoing subscription fees. These costs can accumulate rapidly, especially with heavy usage. Ollama eliminates these recurring expenses. Apart from the initial investment in suitable hardware (which you might already possess), running models locally is effectively free, allowing for unlimited experimentation and generation without the looming concern of API bills.
- Ollama allows you to run LLMs offline without relying on commercial APIs: Once an Ollama model is downloaded to your local storage, it's yours to use anytime, anywhere, completely independent of an internet connection. This offline access is invaluable for developers working in environments with restricted connectivity, researchers in the field, or anyone who needs reliable AI access on the move.
- Ollama allows you to run customized LLMs: Ollama distinguishes itself with its powerful Modelfile system. This allows users to easily modify model behavior by tweaking parameters (like creativity levels or output length), defining custom system prompts to shape the AI's persona, or even integrating specialized fine-tuned adapters (LoRAs). You can also import model weights directly from standard formats like GGUF or Safetensors. This granular level of control and flexibility is rarely offered by closed-source cloud API providers.
- Ollama allows you to run LLMs on your own server: Depending on your local hardware configuration, particularly the presence of a capable Graphics Processing Unit (GPU), Ollama can deliver significantly faster response times (inference speed) compared to cloud services, which might be subject to network latency, rate limiting, or variable load on shared resources. Leveraging your dedicated hardware can lead to a much smoother and more interactive experience.
- Ollama is Open Source: Ollama itself is an open-source project, fostering transparency and community contribution. Furthermore, it primarily serves as a gateway to a vast and rapidly expanding library of openly accessible LLMs. By using Ollama, you become part of this dynamic ecosystem, benefiting from shared knowledge, community support, and the constant innovation driven by open collaboration.
Ollama's primary achievement is masking the inherent complexities involved in setting up the necessary software environments, managing dependencies, and configuring the intricate settings required to run these sophisticated AI models. It cleverly utilizes highly optimized backend inference engines, most notably the renowned llama.cpp library, to ensure efficient execution on standard consumer hardware, supporting both CPU and GPU acceleration.
Ollama vs. Llama.cpp: What Are the Differences?
It's beneficial to clarify the relationship between Ollama and llama.cpp, as they are closely related yet serve different purposes.
llama.cpp: This is the foundational, high-performance C/C++ library responsible for the core task of LLM inference. It handles loading model weights, processing input tokens, and generating output tokens efficiently, with optimizations for various hardware architectures (CPU instruction sets like AVX, GPU acceleration via CUDA, Metal, ROCm). It's the powerful engine doing the computational heavy lifting.
Ollama: This is a comprehensive application built around llama.cpp (and potentially other future backends). Ollama provides a user-friendly layer on top, offering:
- A simple Command-Line Interface (CLI) for easy interaction (ollama run, ollama pull, etc.).
- A built-in REST API server for programmatic integration.
- Streamlined model management (downloading from a library, local storage, updates).
- The Modelfile system for customization and creating model variants.
- Cross-platform installers (macOS, Windows, Linux) and Docker images.
- Automatic hardware detection and configuration (CPU/GPU).
In essence, while technically you could use llama.cpp directly by compiling it and running its command-line tools, this requires significantly more technical effort regarding setup, model conversion, and parameter management. Ollama packages this power into an accessible, easy-to-use application, making local LLMs practical for a much broader audience, especially beginners. Think of llama.cpp as the high-performance engine components, and Ollama as the fully assembled, user-friendly vehicle ready to drive.
How to Install Ollama on Mac, Windows, Linux
Ollama is designed for accessibility, offering straightforward installation procedures for macOS, Windows, Linux, and Docker environments.
General System Requirements for Ollama:
RAM (Memory): This is often the most critical factor.
- Minimum 8 GB: Sufficient for smaller models (e.g., 1B, 3B, 7B parameters), though performance might be slow.
- Recommended 16 GB: A good starting point for running 7B and 13B models comfortably.
- Ideal 32 GB or more: Necessary for larger models (30B, 40B, 70B+) and allows for larger context windows. More RAM generally leads to better performance and the ability to run larger, more capable models.
Disk Space: The Ollama application itself is relatively small (a few hundred MB). However, the LLMs you download require substantial space. Model sizes vary greatly:
- Small Quantized Models (e.g., ~3B Q4): ~2 GB
- Medium Quantized Models (e.g., 7B/8B Q4): 4-5 GB
- Large Quantized Models (e.g., 70B Q4): ~40 GB
- Very Large Models (e.g., 405B): Over 200 GB!
Ensure you have sufficient free space on the drive where Ollama stores models (see section below).
Operating System:
- macOS: Version 11 Big Sur or newer. Apple Silicon (M1/M2/M3/M4) is recommended for GPU acceleration.
- Windows: Windows 10 version 22H2 or newer, or Windows 11. Both Home and Pro editions are supported.
- Linux: A modern distribution (e.g., Ubuntu 20.04+, Fedora 38+, Debian 11+). Kernel requirements may apply, especially for AMD GPU support.

Installing Ollama on macOS
- Download: Obtain the Ollama macOS application DMG file directly from the official Ollama website.
- Mount: Double-click the downloaded .dmg file to open it.
- Install: Drag the Ollama.app icon into your Applications folder.
- Launch: Open the Ollama application from your Applications folder. You may need to grant it permission to run the first time.
- Background Service: Ollama will start running as a background service, indicated by an icon in your menu bar. Clicking this icon provides options to quit the application or view logs.
Launching the application automatically initiates the Ollama server process and adds the ollama command-line tool to your system's PATH, making it immediately available in your terminal of choice (Terminal.app, iTerm2, etc.). On Macs equipped with Apple Silicon (M1, M2, M3, M4 chips), Ollama seamlessly utilizes the built-in GPU for acceleration via Apple's Metal graphics API without requiring any manual configuration.
Installing Ollama on Windows
- Download: Get the OllamaSetup.exe installer file from the Ollama website.
- Run Installer: Double-click the downloaded .exe file to launch the setup wizard. Ensure you meet the minimum Windows version requirement (10 22H2+ or 11).
- Follow Prompts: Proceed through the installation steps, accepting the license agreement and choosing the installation location if desired (though the default is usually fine).
The installer configures Ollama to run automatically as a background service when your system starts. It also adds the ollama.exe executable to your system's PATH, allowing you to use the ollama command in standard Windows terminals like Command Prompt (cmd.exe), PowerShell, or the newer Windows Terminal. The Ollama API server starts automatically and listens on http://localhost:11434.
Windows GPU Acceleration for Ollama:
- NVIDIA: Install the latest GeForce Game Ready or NVIDIA Studio drivers from the NVIDIA website. Driver version 452.39 or newer is required. Ollama should automatically detect and utilize compatible GPUs.
- AMD: Install the latest AMD Software: Adrenalin Edition drivers from the AMD support website. Compatible Radeon GPUs (typically RX 6000 series and newer) will be used automatically.
Installing Ollama on Linux
The most convenient method for most Linux distributions is using the official installation script:
curl -fsSL https://ollama.com/install.sh | sh
This command downloads the script and executes it using sh. The script performs the following actions:
- Detects your system architecture (x86_64, ARM64).
- Downloads the appropriate Ollama binary.
- Installs the binary to /usr/local/bin/ollama.
- Checks for necessary GPU drivers (NVIDIA CUDA, AMD ROCm) and installs dependencies if possible (this part may vary by distribution).
- Creates a dedicated ollama system user and group.
- Sets up a systemd service file (/etc/systemd/system/ollama.service) to manage the Ollama server process.
- Enables and starts the ollama service, so it runs automatically on boot and in the background.
Manual Linux Installation & Systemd Configuration for Ollama:
If the script fails, or if you prefer manual control (e.g., installing to a different location, managing users differently, ensuring specific ROCm versions), consult the detailed Linux installation guide on the Ollama GitHub repository. The general steps involve:
- Downloading the correct binary for your architecture.
- Making the binary executable (chmod +x ollama) and moving it to a location in your PATH (e.g., /usr/local/bin).
- (Recommended) Creating a system user/group: sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama and sudo groupadd ollama, then sudo usermod -a -G ollama ollama. Add your own user to the group: sudo usermod -a -G ollama $USER.
- Creating the systemd service file (/etc/systemd/system/ollama.service) with appropriate settings (user, group, executable path, environment variables if needed). Example snippets are usually provided in the documentation.
- Reloading the systemd daemon: sudo systemctl daemon-reload.
- Enabling the service to start on boot: sudo systemctl enable ollama.
- Starting the service immediately: sudo systemctl start ollama. You can check its status with sudo systemctl status ollama.
Essential Linux GPU Drivers for Ollama:
For optimal performance, installing GPU drivers is highly recommended:
- NVIDIA: Install the official proprietary NVIDIA drivers for your distribution (e.g., via a package manager like apt or dnf, or downloaded from NVIDIA's website). Verify the installation using the nvidia-smi command.
- AMD (ROCm): Install the ROCm toolkit. AMD provides official repositories and installation guides for supported distributions (like Ubuntu, RHEL/CentOS, SLES). Ensure your kernel version is compatible. The Ollama installation script might handle some ROCm dependencies, but manual installation guarantees correct setup. Verify with rocminfo.
How to Use Ollama with the Docker Image
Docker offers a platform-agnostic way to run Ollama in an isolated container, simplifying dependency management, especially for complex GPU setups.
CPU-Only Ollama Container:
docker run -d \
-v ollama_data:/root/.ollama \
-p 127.0.0.1:11434:11434 \
--name my_ollama \
ollama/ollama
- -d: Runs the container in detached (background) mode.
- -v ollama_data:/root/.ollama: Creates a Docker named volume called ollama_data on your host system and maps it to the /root/.ollama directory inside the container. This is crucial for persisting your downloaded models. If you omit this, models will be lost when the container is removed. You can choose any name for the volume.
- -p 127.0.0.1:11434:11434: Maps port 11434 on your host machine's loopback interface (127.0.0.1) to port 11434 inside the container. Use 0.0.0.0:11434:11434 if you need to access the Ollama container from other machines on your network.
- --name my_ollama: Assigns a custom, memorable name to the running container.
- ollama/ollama: Specifies the official Ollama image from Docker Hub.
NVIDIA GPU Ollama Container:
- First, ensure the NVIDIA Container Toolkit is properly installed on your host machine and that Docker is configured to use the NVIDIA runtime.
- Run the container, adding the --gpus=all flag:
docker run -d \
--gpus=all \
-v ollama_data:/root/.ollama \
-p 127.0.0.1:11434:11434 \
--name my_ollama_gpu \
ollama/ollama
This flag grants the container access to all compatible NVIDIA GPUs detected by the toolkit. You can specify particular GPUs if needed (e.g., --gpus '"device=0,1"').
AMD GPU (ROCm) Ollama Container:
- Use the ROCm-specific image tag: ollama/ollama:rocm.
- Map the required ROCm device nodes from the host into the container:
docker run -d \
--device /dev/kfd \
--device /dev/dri \
-v ollama_data:/root/.ollama \
-p 127.0.0.1:11434:11434 \
--name my_ollama_rocm \
ollama/ollama:rocm
- /dev/kfd: The kernel fusion driver interface.
- /dev/dri: Direct Rendering Infrastructure devices (often includes render nodes like /dev/dri/renderD128).
- Ensure the user running the docker command on the host has appropriate permissions to access these device files (membership in groups like render and video might be required).
Once the Ollama container is running, you can interact with it using the docker exec command to run ollama CLI commands inside the container:
docker exec -it my_ollama ollama list
docker exec -it my_ollama ollama pull llama3.2
docker exec -it my_ollama ollama run llama3.2
Alternatively, if you mapped the port (-p), you can interact with the Ollama API directly from your host machine or other applications pointing to http://localhost:11434 (or the IP/port you mapped).
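For instance, here's a minimal Python sketch that checks the mapped API from the host and lists the models the containerized instance already has. It assumes the requests library is installed and uses the GET /api/tags endpoint described later in this guide; the exact response field names ("models", "name", "size") are taken from the current API and may differ between versions.
import requests

OLLAMA_URL = "http://localhost:11434"  # the host address/port you mapped with -p

# GET /api/tags returns the models currently stored by this Ollama instance
resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10)
resp.raise_for_status()

for model in resp.json().get("models", []):
    # Each entry includes the model's name and its size on disk in bytes
    print(f"{model['name']}  (~{model.get('size', 0) / 1e9:.1f} GB)")
If this prints your pulled models, the container is reachable and any local tool can be pointed at the same URL.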
Where Does Ollama Store Models?

Knowing where Ollama keeps its downloaded models is essential for managing disk space and backups. The default location varies by operating system and installation method:
- Ollama on macOS: Models reside within your user's home directory at ~/.ollama/models. The ~ represents /Users/<YourUsername>.
- Ollama on Windows: Models are stored in your user profile directory at C:\Users\<YourUsername>\.ollama\models.
- Ollama on Linux (User Install / Manual): Similar to macOS, models are typically stored in ~/.ollama/models.
- Ollama on Linux (Systemd Service Install): When installed via the script or configured as a system-wide service running as the ollama user, models are often stored in /usr/share/ollama/.ollama/models. Check the service's configuration or documentation if you used a non-standard setup.
- Ollama via Docker: Inside the container, the path is /root/.ollama/models. However, if you correctly used the -v flag to mount a Docker volume (e.g., -v ollama_data:/root/.ollama), the actual model files are stored within Docker's managed volume area on your host machine. The exact location of Docker volumes depends on your Docker setup, but they are designed to persist independently of the container.
You can redirect the model storage location using the OLLAMA_MODELS environment variable, which we'll cover in the Configuration section. This is useful if your primary drive is low on space and you want to store large models on a secondary drive.
Your First Steps with Ollama: Running an LLM
Now that Ollama is installed and the server is active (running via the desktop app, systemd service, or Docker container), you can begin interacting with LLMs using the straightforward ollama command in your terminal.
Downloading Ollama Models: The pull Command
Before running any specific LLM, you must first download its weights and configuration files. Ollama provides a curated library of popular open models, easily accessible via the ollama pull command. You can browse the available models on the Ollama website's library page.
# Example 1: Pull the latest Llama 3.2 model (a small instruct-tuned variant by default)
# This is often tagged as 'latest' or simply by the base name.
ollama pull llama3.2
# Example 2: Pull a specific version of Mistral (7 Billion parameters, base model)
ollama pull mistral:7b
# Example 3: Pull Google's Gemma 3 4B model
ollama pull gemma3
# Example 4: Pull Microsoft's smaller Phi-4 Mini model (efficient)
ollama pull phi4-mini
# Example 5: Pull a vision model (can process images)
ollama pull llava
You can browse all available and trending Ollama models in the official library at https://ollama.com/library:

Understanding Ollama Model Tags:
Models in the Ollama library utilize a model_family_name:tag naming convention. The tag specifies variations like:
- Size: 1b, 3b, 7b, 8b, 13b, 34b, 70b, 405b (indicating billions of parameters). Larger models generally have more knowledge but require more resources.
- Quantization: q2_K, q3_K_S, q4_0, q4_K_M, q5_1, q5_K_M, q6_K, q8_0, f16 (float16), f32 (float32). Quantization reduces model size and computational needs, often with a slight trade-off in precision. Lower numbers (e.g., q4) mean more compression. K-variants (_K_S, _K_M, _K_L) are generally considered good balances of size and quality. f16/f32 are unquantized or minimally compressed, offering the highest fidelity but requiring the most resources.
- Variant: instruct (tuned for following instructions), chat (tuned for conversation), code (optimized for programming tasks), vision (multimodal, handles images), uncensored (less safety filtering, use responsibly).
- latest: If you omit a tag (e.g., ollama pull llama3.2), Ollama typically defaults to the latest tag, which usually points to a commonly used, well-balanced version (often a medium-sized, quantized, instruct-tuned model).
The pull command downloads the required files (which can be several gigabytes) into your designated Ollama models directory. You only need to pull a specific model:tag combination once. Ollama can also update models; running pull again on an existing model will download only the changed layers (diffs), making updates efficient.
How to Chat with LLMs Locally with the Ollama run Command
The most direct way to converse with a downloaded model is using the ollama run command:
ollama run llama3.2
If the specified model (llama3.2:latest in this case) hasn't been downloaded yet, ollama run will conveniently trigger ollama pull first. Once the model is ready and loaded into memory (which might take a few seconds, especially for larger models), you'll be presented with an interactive prompt:
>>> Send a message (/? for help)
Now, you can simply type your question or instruction, press Enter, and wait for the AI to generate a response. The output typically streams token by token, providing a responsive feel.
>>> Explain the concept of quantum entanglement in simple terms.
Okay, imagine you have two special coins that are linked together in a magical way. Let's call them Coin A and Coin B. Before you look at them, neither coin is heads or tails – they're in a fuzzy mix of both possibilities.
Now, you give Coin A to a friend and travel light-years away with Coin B. The instant you look at your Coin B and see it's, say, heads, you instantly know *for sure* that your friend's Coin A is tails. And if you saw tails, you'd know theirs is heads.
That's kind of like quantum entanglement! Two particles (like our coins) become linked, and their properties (like heads/tails) remain correlated no matter how far apart they are. Measuring one instantly influences the property of the other, faster than light could travel between them. It's one of the weirdest and coolest parts of quantum physics!
>>> Send a message (/? for help)
Helpful Commands within Ollama's Interactive Mode:
While interacting with a model via ollama run, you can use special commands prefixed with /:
- /?: Displays a helpful menu listing all available slash commands.
- /set parameter <parameter_name> <value>: Temporarily modifies a model's runtime parameter for the current chat session. For example, /set parameter temperature 0.9 increases creativity, while /set parameter num_ctx 8192 increases the context window for this session.
- /show info: Prints detailed information about the currently loaded model, including its parameters, template structure, and license.
- /show modelfile: Displays the contents of the Modelfile that was used to create the currently running model. This is useful for understanding its base model, parameters, and prompt template.
- /save <session_name>: Saves the current chat history to a named session file.
- /load <session_name>: Loads a previously saved chat session, restoring the conversation history.
- /bye or /exit: Gracefully exits the interactive chat session and unloads the model from memory (if no other sessions are using it). You can also usually exit using Ctrl+D.
How to Manage Your Local Ollama Models
As you download and create models, you'll need ways to manage them:
Listing Downloaded Ollama Models: To see all the models currently stored locally, use:
ollama list
This command outputs a table showing the model name (NAME), unique ID, size on disk (SIZE), and modification time (MODIFIED).
Showing Detailed Ollama Model Information: To inspect the specifics of a particular model (its parameters, system prompt, template, license, etc.), use:
ollama show llama3.1:8b-instruct-q5_K_M
This will print the Modelfile contents, parameter settings, template details, and other metadata associated with that specific model tag.
Removing an Ollama Model: If you no longer need a model and want to free up disk space, use:
ollama rm mistral:7b
This permanently deletes the specified model:tag combination from your storage. Use with caution!
Copying/Renaming an Ollama Model: To create a duplicate of an existing model, perhaps as a starting point for customization or simply to give it a different name, use:
ollama cp llama3.2 my-custom-llama3.2-setup
This creates a new model entry named my-custom-llama3.2-setup based on the original llama3.2.
Checking Currently Loaded Ollama Models: To see which models are actively loaded into your RAM or VRAM and ready for immediate inference, use:
ollama ps
This command shows the model name, ID, size, processor used (CPU/GPU), and how long ago it was last accessed. Models usually stay loaded for a short period after use (e.g., 5 minutes) to speed up subsequent requests, then unload automatically to free up resources.
What are the Best Ollama Models? Selecting the Right LLM
This is a frequent and important question, but the answer is nuanced. There isn't a single "best" Ollama model for everyone or every task. The optimal choice hinges on several factors:
- Your Specific Task: What do you primarily want the AI to do?
  - General Chat/Assistance: llama3.2, llama3.1, mistral, gemma3, qwq are strong contenders.
  - Coding/Programming: codellama, phi4, phi4-mini, starcoder2, deepseek-coder are specifically tuned for code generation, explanation, and debugging.
  - Creative Writing: Models with higher default temperatures or larger parameter counts (like 70B models) might yield more imaginative results. Experimentation is key.
  - Summarization/Analysis: Instruction-tuned models (:instruct tag) often excel here.
  - Multimodal (Image Input): llava, moondream, llama3.2-vision.
- Your Hardware Resources (RAM and VRAM): This is a major constraint.
  - Low Resources (8GB RAM, No/Weak GPU): Stick to smaller models (1B-3B parameters) or heavily quantized versions (e.g., q2_K, q3_K_S) of 7B models. Performance will be slower. Examples: gemma3:1b, llama3.2:1b, phi4-mini:q4_0.
  - Mid-Range Resources (16GB RAM, Basic GPU): You can comfortably run 7B/8B models (e.g., llama3.2, mistral, gemma3) with good quantization (q4_K_M, q5_K_M). You might be able to run 13B models (q4_0) slowly.
  - High-End Resources (32GB+ RAM, Strong GPU w/ 12GB+ VRAM): You can run larger models like 13B, 30B, or even 70B (q4_K_M or better quantization like q5_K_M, q6_K). These offer significantly better reasoning and knowledge. Examples: llama3.3, llama3.1:405b (requires massive resources), gemma3:27b, deepseek-r1:671b (extreme requirements).
- Desired Quality vs. Speed Trade-off:
  - Higher Quality: Larger parameter counts and less aggressive quantization (e.g., q6_K, q8_0, f16) generally provide better results but are slower and require more resources.
  - Faster Speed/Lower Resource Use: Smaller parameter counts and more aggressive quantization (e.g., q4_0, q4_K_M, q3_K_M) are faster and lighter but might exhibit slightly reduced coherence or accuracy. K-quants (_K_M, _K_S) often provide the best balance.
- Model Tuning and Alignment: Some models are base models, while others are instruction-tuned (instruct) or chat-tuned (chat). Instruct/chat models are generally better at following directions and engaging in conversation. Uncensored models have fewer safety guardrails.
Recommendations for Beginners (Late 2024):
- Good All-Rounder (Mid-Range Hardware): llama3.1:8b-instruct-q5_K_M or mistral:7b-instruct-v0.2-q5_K_M. These offer a great balance of capability, speed, and resource usage.
- Efficient Option (Lower-End Hardware): phi4-mini:q4_K_M or gemma3:1b. Surprisingly capable for their size.
- Coding Focus: codellama:7b-instruct-q5_K_M.
- Vision Needs: llava:13b (if resources allow) or moondream.
The best approach is empirical: read model descriptions on the Ollama library, consider your hardware, download a few likely candidates using ollama pull, test them with your typical prompts using ollama run, and see which one performs best for you. Don't hesitate to ollama rm models that don't meet your needs to save space.
Ollama Context Length: Explained
Ollama Context Length: The num_ctx Parameter
The context length, often referred to as the context window or num_ctx in Ollama and llama.cpp settings, is one of the most critical architectural limitations of an LLM.
- What it represents: num_ctx defines the maximum number of tokens the model can "see" or process simultaneously. This includes everything: the initial system prompt, all previous user messages and assistant responses in the current chat history, and the user's latest input prompt.
- Why it's crucial: A model's ability to generate relevant, coherent, and contextually appropriate responses depends heavily on its context window.
  - Long Conversations: A larger context window allows the model to "remember" information from earlier in the conversation, preventing it from losing track or repeating itself.
  - Document Analysis: When processing large documents (e.g., for summarization or question-answering), the context window determines how much of the document the model can consider at once.
  - Complex Instructions: Instructions that rely on details provided much earlier in the prompt require a sufficient context window to be understood correctly.
- Inherent Model Limits: Every LLM is pre-trained with a specific maximum context length (e.g., 2048, 4096, 8192, 32k, 128k, or even millions for cutting-edge research models). While you can set num_ctx in Ollama, setting it higher than the model's original training limit might lead to unpredictable behavior, degraded performance (the model might "forget" things outside its trained window), or errors. Setting it lower is always safe but limits the model's capability.
- Resource Consumption: Processing a larger context window requires significantly more RAM and VRAM (GPU memory) and takes longer computationally. You need to balance the desired context capability with your hardware limitations (see the rough sizing sketch after this list).
- Finding the Default num_ctx for an Ollama Model: Use the ollama show <model_name:tag> command and look for the PARAMETER num_ctx line in the displayed Modelfile section. Ollama models usually ship with a reasonable default (e.g., 4096 or 8192).
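To get a feel for that memory cost, here is a rough, back-of-the-envelope Python sketch. The layer and head counts are assumptions for a Llama-3-8B-class model (32 layers, 8 grouped-query KV heads of dimension 128, FP16 cache); actual usage depends on the model's architecture and any KV-cache optimizations the runtime applies.
# Rough KV-cache estimate: 2 (K and V) x layers x kv_heads x head_dim x bytes, per token
N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES_FP16 = 32, 8, 128, 2  # assumed Llama-3-8B-class values

bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16  # ~128 KiB per token

for num_ctx in (2048, 4096, 8192, 16384):
    gib = bytes_per_token * num_ctx / (1024 ** 3)
    print(f"num_ctx={num_ctx:>6}: ~{gib:.2f} GiB of KV cache (on top of the model weights)")
Roughly speaking, doubling num_ctx doubles the cache footprint, which is why very large context windows demand so much extra RAM or VRAM.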
Changing the num_ctx for Ollama:
- Temporary (during ollama run): Use the slash command: /set parameter num_ctx 8192
- Per API Request: Include it in the options JSON object: curl ... -d '{ "model": "...", "prompt": "...", "options": { "num_ctx": 8192 } }'
- Persistent (Custom Ollama Model): Create or modify a Modelfile and add or change the PARAMETER num_ctx <value> line, then build the model using ollama create. This sets the default context size for that custom model.
Choose a num_ctx value that suits your typical tasks. For simple Q&A, a smaller window (e.g., 4096) might suffice. For long chats or summarizing large documents, you'll benefit from the largest context window your hardware and the model can reasonably support (e.g., 8192, 16384, or more if available).
Ollama Model Parameters Explained
LLMs have internal settings, or parameters, that you can adjust to influence how they generate text. Ollama allows you to control many of these:
- temperature: (Default: ~0.7-0.8) Controls the randomness or "creativity" of the output.
  - Lower values (e.g., 0.2): Make the output more deterministic, focused, and predictable. Good for factual answers or code generation.
  - Higher values (e.g., 1.0, 1.2): Increase randomness, making the output more diverse and creative, but potentially less coherent. Good for brainstorming or story writing.
- top_p (Nucleus Sampling): (Default: ~0.9) Sets a probability threshold. The model considers only the most probable next tokens whose cumulative probability mass exceeds top_p. Lowering top_p (e.g., 0.5) restricts the choices to more likely words, increasing coherence but potentially reducing novelty. Higher values allow more diverse choices.
- top_k: (Default: ~40) Limits the model's choices to the k most probable next tokens. A lower top_k (e.g., 10) makes the output more focused; a higher top_k allows more variety. top_p is often considered more effective than top_k. Usually, you set one or the other, not both low.
- num_predict: (Default: ~128, -1 for infinite) Maximum number of tokens (roughly words/sub-words) the model will generate in a single response. Set to -1 for unlimited generation until a stop condition is met.
- stop: A list of specific text sequences. If the model generates one of these sequences, it will immediately stop producing further output. Useful for preventing run-on sentences or ensuring the model stops after answering. Example: ["\n", "User:", "<|eot_id|>"].
- num_ctx: Defines the model's context window size. See the detailed explanation above.
- Other parameters: Ollama exposes many other parameters inherited from llama.cpp (like repeat_penalty, seed, mirostat, the GPU layer count num_gpu, etc.) for fine-grained control. Refer to the Ollama and llama.cpp documentation for details.
You can set these temporarily using /set parameter in ollama run, persistently in a Modelfile using the PARAMETER instruction, or per-request via the options object in the Ollama API.
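As an illustration of the per-request route, here is a small Python sketch that overrides several of these options through the /api/generate endpoint covered in the next section. It assumes the requests library is installed and that the model named below has already been pulled; the specific parameter values are arbitrary examples, not recommendations.
import requests

payload = {
    "model": "llama3.2",                      # any model you have pulled locally
    "prompt": "List three uses for a local LLM.",
    "stream": False,                          # return one JSON object instead of a stream
    "options": {
        "temperature": 0.3,                   # more deterministic output
        "top_p": 0.9,
        "num_ctx": 4096,                      # context window for this request only
        "num_predict": 120,                   # cap the response length
        "stop": ["\n\n\n"]                    # stop on a long blank run (arbitrary example)
    },
}

resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["response"])
Options passed this way apply only to the single request, leaving the model's stored defaults untouched.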
How to Use Ollama API
While the ollama CLI offers easy direct interaction, the true potential for integrating Ollama into workflows and applications lies in its built-in REST API and the Modelfile customization system.
Interacting Programmatically with the Ollama API
By default, the Ollama server process (whether running via the desktop app, systemd, or Docker) listens for incoming HTTP requests on port 11434 of your local machine (http://localhost:11434 or http://127.0.0.1:11434). This API allows other programs, scripts, or web interfaces running on the same machine (or others on the network, if configured) to interact with Ollama models programmatically.
Key Ollama API Endpoints:
- POST /api/generate: Used for generating text completions based on a single, non-conversational prompt. Suitable for tasks like text expansion, simple translation, or quick code snippets. Requires a JSON body with model and prompt.
- POST /api/chat: Designed for conversational interactions. It accepts a list of messages, each with a role (system, user, or assistant) and content. This allows the model to maintain context across multiple turns. Requires a JSON body with model and messages.
- POST /api/embeddings: Generates numerical vector representations (embeddings) for input text. These vectors capture semantic meaning and are fundamental for tasks like Retrieval-Augmented Generation (RAG), semantic search, text classification, and clustering. Requires a JSON body with model and prompt.
- GET /api/tags: Retrieves a list of all models currently available in your local Ollama storage (equivalent to ollama list). Returns a JSON array of model objects.
- POST /api/show: Fetches detailed information about a specific local model, including its parameters, template, license, etc. (equivalent to ollama show). Requires a JSON body with name (model:tag).
- DELETE /api/delete: Removes a specified model from local storage (equivalent to ollama rm). Requires a JSON body with name (model:tag).
- POST /api/pull: Initiates the download of a model from the Ollama library (equivalent to ollama pull). Requires a JSON body with name (model:tag). Can also stream progress information.
- POST /api/create: Creates a new custom model based on provided Modelfile content (equivalent to ollama create -f). Requires a JSON body with name (the new model name) and modelfile (the content of the Modelfile as a string).
- POST /api/copy: Duplicates an existing local model under a new name (equivalent to ollama cp). Requires a JSON body with source and destination names.
- POST /api/push: Uploads a custom local model to your account on the Ollama registry (requires prior login/setup). Requires a JSON body with name (namespaced model name).
API Request/Response Format:
Most POST and DELETE requests expect a JSON payload in the request body. Responses are typically returned as JSON objects. For the generate and chat endpoints, you can control the response format:
"stream": false
(Default): The API waits until the entire response is generated and returns it as a single JSON object containing the full text, completion details, and performance statistics."stream": true
: The API returns a stream of JSON objects, one for each generated token (or small chunk of tokens). This allows applications to display the response progressively, providing a more interactive user experience. The final JSON object in the stream contains the overall statistics.
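Because a streamed reply arrives as newline-delimited JSON objects, client code has to parse it line by line. Here's a minimal Python sketch of that pattern against /api/chat, assuming the requests library is installed; the "message", "content", and "done" field names follow the documented response format and should be treated as assumptions if your Ollama version differs.
import json
import requests

payload = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Give me one fun fact about octopuses."}],
    "stream": True,
}

# stream=True makes requests yield the body incrementally instead of buffering it all
with requests.post("http://localhost:11434/api/chat", json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # Each chunk carries a partial assistant message until "done" is true
        print(chunk.get("message", {}).get("content", ""), end="", flush=True)
        if chunk.get("done"):
            print()  # final newline once the stream finishes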
Example API Interaction using curl:
1. Simple Generation Request (Non-Streaming):
curl http://localhost:11434/api/generate -d '{
"model": "phi4-mini",
"prompt": "Write a short Python function to calculate factorial:",
"stream": false,
"options": {
"temperature": 0.3,
"num_predict": 80
}
}'
2. Conversational Chat Request (Streaming):
# Note: Streaming output will appear as multiple JSON lines
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2:8b-instruct-q5_K_M",
"messages": [
{ "role": "system", "content": "You are a knowledgeable historian." },
{ "role": "user", "content": "What were the main causes of World War 1?" }
],
"stream": true,
"options": {
"num_ctx": 4096
}
}'
3. Embedding Generation Request:
# Use a dedicated embedding model such as mxbai-embed-large (pull it first)
curl http://localhost:11434/api/embeddings -d '{
  "model": "mxbai-embed-large",
  "prompt": "Ollama makes running LLMs locally easy."
}'
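To show what you might do with those vectors, here is a short Python sketch that embeds two sentences and compares them with cosine similarity. It assumes the requests library, that an embedding model such as mxbai-embed-large has been pulled, and that the response contains an "embedding" array of floats (an assumption based on the endpoint's documented behavior).
import math
import requests

def embed(text: str) -> list[float]:
    # POST /api/embeddings is assumed to return {"embedding": [...]} for the given prompt
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "mxbai-embed-large", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

v1 = embed("Ollama makes running LLMs locally easy.")
v2 = embed("You can run large language models on your own machine with Ollama.")
print(f"Cosine similarity: {cosine(v1, v2):.3f}")  # closer to 1.0 means more semantically similar
This same pattern, comparing a query embedding against a store of document embeddings, is the core of RAG and semantic search pipelines.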
This versatile API forms the backbone for countless community integrations, including web UIs, development tools, backend services, automation scripts, and more, all powered by your local Ollama instance.
Leveraging the Ollama OpenAI Compatibility API
Recognizing the widespread adoption of OpenAI's API standards, Ollama thoughtfully includes an experimental compatibility layer. This allows many tools, libraries, and applications designed for OpenAI's services to work with your local Ollama instance with minimal, often trivial, modifications.
How it Works:
The Ollama server exposes endpoints under the /v1/ path (e.g., http://localhost:11434/v1/) that mirror the structure and expected request/response formats of key OpenAI API endpoints.
Key Compatible Endpoints:
- /v1/chat/completions: Mirrors OpenAI's chat completion endpoint.
- /v1/embeddings: Mirrors OpenAI's embeddings endpoint.
- /v1/models: Mirrors OpenAI's model listing endpoint (returns your local Ollama models).
Using OpenAI Client Libraries with Ollama:
The primary advantage is that you can use standard OpenAI client libraries (like openai-python, openai-node, etc.) by simply changing two configuration parameters when initializing the client:
- base_url (or api_base): Set this to your local Ollama v1 endpoint: http://localhost:11434/v1/.
- api_key: Provide any non-empty string. Ollama's /v1/ endpoint does not actually perform authentication and ignores the key value, but most OpenAI client libraries require the parameter to be present. Common practice is to use the string "ollama" or "nokey".
Python Example using openai-python:
# Ensure you have the openai library installed: pip install openai
from openai import OpenAI
import os
# Define the Ollama endpoint and a dummy API key
OLLAMA_BASE_URL = "http://localhost:11434/v1"
OLLAMA_API_KEY = "ollama" # Placeholder, value ignored by Ollama
# Specify the local Ollama model you want to use
OLLAMA_MODEL = "llama3.2"
try:
    # Initialize the OpenAI client, pointing it to the Ollama server
    client = OpenAI(
        base_url=OLLAMA_BASE_URL,
        api_key=OLLAMA_API_KEY,
    )

    print(f"Sending request to Ollama model: {OLLAMA_MODEL} via OpenAI compatibility layer...")

    # Make a standard chat completion request
    chat_completion = client.chat.completions.create(
        model=OLLAMA_MODEL,  # Use the name of your local Ollama model
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain the difference between Ollama and llama.cpp."}
        ],
        temperature=0.7,
        max_tokens=250,  # Note: 'max_tokens' corresponds roughly to Ollama's 'num_predict'
        stream=False  # Set to True for streaming responses
    )

    # Process the response
    if chat_completion.choices:
        response_content = chat_completion.choices[0].message.content
        print("\nOllama Response:")
        print(response_content)
        print("\nUsage Stats:")
        print(f"  Prompt Tokens: {chat_completion.usage.prompt_tokens}")
        print(f"  Completion Tokens: {chat_completion.usage.completion_tokens}")
        print(f"  Total Tokens: {chat_completion.usage.total_tokens}")
    else:
        print("No response choices received from Ollama.")

except Exception as e:
    print("\nAn error occurred:")
    print(f"  Error Type: {type(e).__name__}")
    print(f"  Error Details: {e}")
    print(f"\nPlease ensure the Ollama server is running and accessible at {OLLAMA_BASE_URL}.")
    print(f"Also verify the model '{OLLAMA_MODEL}' is available locally ('ollama list').")
This compatibility significantly simplifies migrating existing OpenAI-based projects to use local models via Ollama or building new applications that can flexibly switch between cloud and local backends. While not all obscure OpenAI features might be perfectly mirrored, the core chat, embedding, and model listing functionalities are well-supported.
How to Use Ollama Modelfiles
The Modelfile is the cornerstone of Ollama's customization capabilities. It acts as a blueprint or recipe, defining precisely how an Ollama model should be constructed or modified. By creating and editing these simple text files, you gain fine-grained control over model behavior, parameters, and structure.
Core Ollama Modelfile Instructions:
- FROM <base_model_reference>: (Mandatory First Instruction) Specifies the foundation upon which your new model is built. The reference can point to:
  - An existing model in your local Ollama library or the official registry (e.g., FROM llama3.1:8b-instruct-q5_K_M).
  - A relative or absolute path to a local directory containing unpacked model weights in the Safetensors format, along with necessary configuration files (config.json, tokenizer.json, etc.). Example: FROM /mnt/models/my_downloaded_llama/. Ollama will attempt to load supported architectures (Llama, Mistral, Phi, Gemma, etc.).
  - A relative or absolute path to a single model file in the popular GGUF format (which packages weights and metadata). Example: FROM ./models/mistral-7b-instruct-v0.2.Q5_K_M.gguf.
- PARAMETER <parameter_name> <value>: Sets a default runtime parameter value for the model being created. These defaults will be used unless overridden during ollama run (via /set parameter), through API call options, or by subsequent PARAMETER instructions in the same Modelfile. Examples:
  PARAMETER temperature 0.6
  PARAMETER num_ctx 8192 (sets the default context window)
  PARAMETER stop "<|user|>"
  PARAMETER stop "<|end_of_turn|>" (you can have multiple stop parameters)
  PARAMETER repeat_penalty 1.15
- TEMPLATE "<prompt_template_string>": Defines the specific format Ollama should use to structure the input prompt before feeding it to the underlying LLM. This is critically important for chat and instruction-following models, as they are trained to expect special tokens or markers delineating system messages, user turns, and assistant turns. The template uses Go's text/template syntax. Key variables available within the template string include:
  - {{ .System }}: Placeholder for the system message.
  - {{ .Prompt }}: Placeholder for the user's current input (used in /api/generate).
  - {{ .Response }}: Placeholder where the model's generated output will be appended (often used internally or for specific template structures).
  - {{ range .Messages }} ... {{ .Role }} ... {{ .Content }} ... {{ end }}: Used in chat templates (/api/chat) to iterate over the entire message history (system, user, assistant turns) and format it correctly.
  - {{ .First }}: A boolean indicating if the current message is the very first one (useful for adding special beginning-of-sequence tokens).
Example (simplified ChatML template):
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>{{ end }}{{ range .Messages }}
<|im_start|>{{ .Role }}
{{ .Content }}<|im_end|>{{ end }}
<|im_start|>assistant
"""
Getting the template right is essential for making a model follow instructions or converse naturally. You can view a model's default template using ollama show --modelfile <model_name>.
- SYSTEM "<default_system_message>": Provides a default system prompt that will be used if no other system message is provided via the API or CLI. This is useful for setting a consistent persona or instruction set for the model. Example: SYSTEM "You are a helpful AI assistant focused on providing concise answers."
- ADAPTER </path/to/adapter_weights>: Applies a LoRA (Low-Rank Adaptation) or QLoRA adapter to the base model specified in the FROM instruction. Adapters are small sets of weights trained to modify or specialize a pre-trained LLM for a specific task, style, or knowledge domain without retraining the entire massive model. The path can point to:
  - A single .gguf file containing the adapter.
  - A directory containing the adapter weights in Safetensors format (adapter_model.safetensors, adapter_config.json).
  Important: The base model (FROM) must be the same one the adapter was originally trained on for it to work correctly.
- LICENSE "<license_details>": Embeds license information within the model's metadata. Can be a short identifier (like "MIT") or a longer text block.
- MESSAGE user "<example_user_message>" / MESSAGE assistant "<example_assistant_response>": Defines example conversational turns. These can sometimes help guide the model's tone, style, or expected output format, especially for few-shot prompting scenarios, although their effect varies between models.
Building an Ollama Model from a Modelfile:
Once you have created your Modelfile (e.g., saved as MyCustomModel.modelfile), you use the ollama create command to build the corresponding Ollama model:
ollama create my-new-model-name -f MyCustomModel.modelfile
Ollama processes the instructions, potentially combines layers, applies adapters, sets parameters, and registers the new model (my-new-model-name) in your local library. You can then run it like any other model: ollama run my-new-model-name.
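The same build can also be driven programmatically through the /api/create endpoint described earlier, which accepts the Modelfile content as a string. Here's a hedged Python sketch of that, assuming the requests library is installed; the exact request fields ("name", "modelfile") follow the endpoint description above and may differ between Ollama versions, so check the API docs for yours.
import json
import requests

# A minimal Modelfile, passed as a string instead of a file on disk
modelfile = """
FROM llama3.2
PARAMETER temperature 0.6
SYSTEM "You are a terse assistant that answers in one sentence."
"""

payload = {"name": "my-new-model-name", "modelfile": modelfile}

# /api/create streams status objects while the model layers are assembled
with requests.post("http://localhost:11434/api/create", json=payload, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(json.loads(line).get("status", line.decode()))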
How to Import External Models into Ollama (GGUF, Safetensors)
Ollama's Modelfile system provides a seamless way to import models obtained from other sources (like Hugging Face, independent researchers, etc.) that are distributed in standard formats.
Importing GGUF Models into Ollama: GGUF is a popular format designed specifically for llama.cpp and similar inference engines. It packages model weights (often pre-quantized), tokenizer information, and metadata into a single file. This is often the easiest format to import.
- Download the .gguf file (e.g., zephyr-7b-beta.Q5_K_M.gguf).
- Create a minimal Modelfile (e.g., ZephyrImport.modelfile):
# ZephyrImport.modelfile
FROM ./zephyr-7b-beta.Q5_K_M.gguf
# Crucial: Add the correct prompt template for this model!
# (Look up the model's required template format)
TEMPLATE """<|system|>
{{ .System }}</s>
<|user|>
{{ .Prompt }}</s>
<|assistant|>
{{ .Response }}</s>
"""
PARAMETER num_ctx 4096 # Set a reasonable default context
SYSTEM "You are a friendly chatbot." # Optional default system prompt
- Build the Ollama model: ollama create my-zephyr-gguf -f ZephyrImport.modelfile.
Importing Safetensors Models (Full Weights) into Ollama: Safetensors is a secure and fast format for storing model tensors. If you have the complete set of weights and configuration files for a model in this format:
- Ensure all necessary files (*.safetensors weight files, config.json, tokenizer.json, special_tokens_map.json, tokenizer_config.json, etc.) are located within a single directory (e.g., /data/models/Mistral-7B-v0.1-full/).
- Create a Modelfile referencing this directory:
# MistralImport.modelfile
FROM /data/models/Mistral-7B-v0.1-full/
# Add required TEMPLATE, PARAMETER, SYSTEM instructions
TEMPLATE """[INST] {{ if .System }}{{ .System }} \n{{ end }}{{ .Prompt }} [/INST]
{{ .Response }}"""
PARAMETER num_ctx 4096
PARAMETER temperature 0.7
- Build the model: ollama create my-mistral-safetensors -f MistralImport.modelfile. Ollama will attempt to load compatible architectures. If the model is unquantized (e.g., FP16), you can optionally quantize it during creation (see below).
Applying Safetensors LoRA Adapters via Ollama Modelfile:
- First, ensure you have the exact base Ollama model that the LoRA adapter was trained for. Pull it if necessary (e.g., ollama pull llama3.1:8b).
- Place the LoRA adapter files (e.g., adapter_model.safetensors, adapter_config.json) in their own directory (e.g., /data/adapters/my_llama3_lora/).
- Create a Modelfile specifying both the base and the adapter:
# ApplyLora.modelfile
FROM llama3.1:8b # Must match the adapter's base!
ADAPTER /data/adapters/my_llama3_lora/
# Adjust parameters or template if the LoRA requires it
PARAMETER temperature 0.5
SYSTEM "You now respond in the style taught by the LoRA."
- Build the adapted model: ollama create llama3-with-my-lora -f ApplyLora.modelfile.
How to Quantize Models with Ollama
Quantization is the process of reducing the numerical precision of a model's weights (e.g., converting 16-bit floating-point numbers to 4-bit integers). This significantly shrinks the model's file size and memory footprint (RAM/VRAM usage) and speeds up inference, making it possible to run larger, more capable models on consumer hardware. The trade-off is usually a small, often imperceptible, reduction in output quality.
Ollama can perform quantization during the model creation process if the FROM instruction in your Modelfile points to unquantized or higher-precision model weights (typically FP16 or FP32 Safetensors).
How to Quantize using ollama create:
- Create a Modelfile that points to the directory containing the unquantized model weights:
# QuantizeMe.modelfile
FROM /path/to/my/unquantized_fp16_model/
# Add TEMPLATE, PARAMETER, SYSTEM as needed
- Run the ollama create command, adding the -q (or --quantize) flag followed by the desired quantization level identifier:
# Quantize to Q4_K_M (popular balance of size/quality)
ollama create my-quantized-model-q4km -f QuantizeMe.modelfile -q q4_K_M
# Quantize to Q5_K_M (slightly larger, potentially better quality)
ollama create my-quantized-model-q5km -f QuantizeMe.modelfile -q q5_K_M
# Quantize to Q8_0 (largest common quantization, best quality among quantized)
ollama create my-quantized-model-q8 -f QuantizeMe.modelfile -q q8_0
# Quantize to Q3_K_S (very small, more quality loss)
ollama create my-quantized-model-q3ks -f QuantizeMe.modelfile -q q3_K_S
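To get a rough sense of what these levels mean in practice, here is a back-of-the-envelope Python sketch of approximate file sizes for a 7B-parameter model. The bits-per-weight figures are approximations, and real GGUF files also carry metadata and mixed-precision layers, so treat the output as ballpark numbers consistent with the size table earlier in this guide.
PARAMS = 7e9  # a 7B-parameter model

# Approximate effective bits per weight at different precisions (rough figures)
bits_per_weight = {"f16": 16, "q8_0": 8.5, "q5_K_M": 5.7, "q4_K_M": 4.8, "q3_K_S": 3.5}

for name, bits in bits_per_weight.items():
    gib = PARAMS * bits / 8 / (1024 ** 3)
    print(f"{name:>7}: ~{gib:.1f} GB")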
Ollama uses the quantization routines from llama.cpp to perform the conversion and saves the newly quantized model under the specified name.
Common Quantization Levels:
- q4_0, q4_1: Basic 4-bit quantization.
- q5_0, q5_1: Basic 5-bit quantization.
- q8_0: 8-bit quantization (larger file size, closer to original quality).
- q3_K_S, q3_K_M, q3_K_L: More advanced 3-bit "K-Quant" methods (Small/Medium/Large internal variations). Very small, noticeable quality trade-off.
- q4_K_S, q4_K_M: Advanced 4-bit K-Quants. q4_K_M is often a recommended sweet spot.
- q5_K_S, q5_K_M: Advanced 5-bit K-Quants. Good balance, slightly larger than Q4_K.
- q6_K: Advanced 6-bit K-Quant. Larger size, approaching Q8 quality.
Choosing the right quantization level depends on your hardware constraints and tolerance for potential quality reduction. It's often worth trying q4_K_M or q5_K_M first.
How to Create and Share Your Own Ollama Models
If you've crafted a unique model variant using a Modelfile – perhaps by applying a specific LoRA, setting a creative system prompt and template, or fine-tuning parameters – you can share your creation with the broader Ollama community via the official Ollama model registry website.
Steps to Share an Ollama Model:
- Create an Ollama Account: Sign up for a free account on the Ollama website (ollama.com). Your chosen username will become the namespace for your shared models.
- Link Your Local Ollama: You need to associate your local Ollama installation with your online account. This involves adding your local machine's Ollama public key to your account settings on the website. The website provides specific instructions on how to find your local public key file (id_ed25519.pub) based on your operating system.
- Name Your Model Correctly: Shared models must be namespaced with your Ollama username, following the format yourusername/yourmodelname. If your local custom model has a different name (e.g., mario), you first need to copy it to the correct namespaced name using ollama cp:
# Assuming your username is 'luigi' and local model is 'mario'
ollama cp mario luigi/mario
- Push the Model to the Registry: Once the model is correctly named locally and your key is linked, use the ollama push command:
ollama push luigi/mario
Ollama will upload the necessary model layers and metadata to the registry.
After the push is complete, other Ollama users worldwide can easily download and run your shared model simply by using its namespaced name:
ollama run luigi/mario
This sharing mechanism fosters collaboration and allows the community to benefit from specialized or creatively customized models.
How to Optimize Ollama Performance with GPU Acceleration
While Ollama can run models purely on your computer's CPU, leveraging a compatible Graphics Processing Unit (GPU) provides a dramatic performance boost, significantly accelerating the speed at which models generate text (inference speed). Ollama is designed to automatically detect and utilize supported GPUs whenever possible.
Ollama with NVIDIA GPUs: Ollama offers excellent support for NVIDIA GPUs, requiring:
- A GPU with CUDA Compute Capability 5.0 or higher (most GeForce GTX 900 series / Quadro M series and newer cards). Check NVIDIA's CUDA GPU list for compatibility.
- The official proprietary NVIDIA drivers installed correctly on your host system (Linux or Windows).
- On Linux with Docker, the NVIDIA Container Toolkit must be installed and configured.
Ollama should automatically detect compatible hardware and drivers and offload computation to the GPU.
Ollama with AMD Radeon GPUs: Support for modern AMD GPUs is available on both Windows and Linux:
- Requires recent AMD Radeon Software (Adrenalin Edition) drivers on Windows.
- Requires the ROCm (Radeon Open Compute platform) software stack installed on Linux (version 5 or 6+ recommended). The Ollama Linux installer or the :rocm Docker image often helps with dependencies, but manual ROCm installation might be needed for full compatibility on certain distributions.
- Supported GPUs generally include the RX 6000 series, RX 7000 series, PRO W6000/W7000 series, and some Instinct accelerators. Check the official Ollama GPU documentation for a detailed list.
Ollama with Apple Silicon (macOS): On Macs equipped with M1, M2, M3, or M4 series chips, Ollama automatically utilizes the built-in GPU capabilities via Apple's Metal graphics API. No additional driver installation or configuration is typically required; GPU acceleration works out of the box.
Verifying Ollama GPU Usage:
The easiest way to check if Ollama is actually using your GPU is to run the ollama ps command while a model is loaded (e.g., immediately after starting ollama run <model> in another terminal, or while an API request is being processed). Examine the PROCESSOR column in the output:
- gpu: Indicates the model's layers are primarily loaded onto the GPU.
- cpu: Indicates the model is running solely on the CPU. This could mean no compatible GPU was detected, drivers are missing or incorrect, or there was an issue initializing GPU support. Check the Ollama server logs for errors.
Selecting Specific GPUs in Multi-GPU Ollama Setups:
If your system contains multiple compatible GPUs, you can instruct Ollama (and the underlying llama.cpp) which specific device(s) to use by setting environment variables before launching the Ollama server process:
- NVIDIA (CUDA_VISIBLE_DEVICES): Set this variable to a comma-separated list of GPU indices (starting from 0) or, preferably, GPU UUIDs.
  - export CUDA_VISIBLE_DEVICES=0 (use only the first GPU detected by NVIDIA).
  - export CUDA_VISIBLE_DEVICES=1 (use only the second GPU).
  - export CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx,GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy (use specific GPUs identified by their UUIDs, which you can find using nvidia-smi -L).
- AMD (ROCR_VISIBLE_DEVICES): Set this variable to a comma-separated list of GPU indices as reported by the rocminfo command (usually starting from 0).
  - export ROCR_VISIBLE_DEVICES=0 (use the first detected AMD GPU).
  - export ROCR_VISIBLE_DEVICES=1,2 (use the second and third detected AMD GPUs).
Setting an invalid device ID (e.g., export CUDA_VISIBLE_DEVICES=-1) is often used as a way to deliberately force Ollama to use only the CPU, which can be useful for debugging. Remember to restart the Ollama server/app after setting these environment variables for them to take effect.
Configuring Your Ollama Environment
Beyond the default settings, Ollama's behavior can be fine-tuned using various environment variables. These allow you to customize network settings, storage locations, logging levels, and more.
Key Ollama Environment Variables for Configuration
- `OLLAMA_HOST`: Controls the network interface and port the Ollama API server listens on.
  - Default: `127.0.0.1:11434` (listens only on the loopback interface, accessible only from the same machine).
  - Example: `0.0.0.0:11434` (listens on all available network interfaces, making Ollama accessible from other devices on your local network; a quick reachability check is sketched after this list. Warning: Ensure proper firewall rules are in place if exposing Ollama externally).
  - Example: `192.168.1.100:11500` (listens only on a specific local IP address and a custom port).
- `OLLAMA_MODELS`: Crucially, this variable allows you to specify a custom directory path where Ollama should store and look for downloaded models. This is extremely useful if your default drive (where `~/.ollama` or `C:\Users\<User>\.ollama` resides) is low on space, or if you prefer to organize models on a dedicated SSD or larger drive.
  - Example (Linux/macOS): `export OLLAMA_MODELS=/mnt/large_drive/my_ollama_models`
  - Example (Windows): Set `OLLAMA_MODELS` to `D:\ollama_data` via System Properties.
  - Important: Ensure the directory exists and that the user account running the Ollama server process has full read and write permissions for this custom path.
- `OLLAMA_ORIGINS`: Manages Cross-Origin Resource Sharing (CORS) for the Ollama API. By default, web browsers restrict web pages from making API requests to different domains (origins) than the one the page was served from. If you are running a separate web UI (like Open WebUI or Lobe Chat) served from a different origin (e.g., `http://localhost:3000`) that needs to call your Ollama API (at `http://localhost:11434`), you must add the UI's origin to this variable.
  - Example: `export OLLAMA_ORIGINS=http://localhost:3000,http://192.168.1.50:8080` (allows requests from these two specific origins).
  - Example: `export OLLAMA_ORIGINS='*'` (allows requests from any origin. Use with caution, especially if `OLLAMA_HOST` is not `127.0.0.1`, as this could expose your API widely). The list can also include protocols such as `chrome-extension://*`.
- `OLLAMA_DEBUG`: Set to `1` to enable verbose debug logging. This provides much more detailed information about Ollama's internal operations, including GPU detection steps, model loading details, and potential errors, which is invaluable for troubleshooting.
  - Example: `export OLLAMA_DEBUG=1`
- `OLLAMA_KEEP_ALIVE`: Controls how long Ollama keeps a model loaded in memory after its last request. By default, it might be around 5 minutes (`5m`). Setting it to `0` unloads the model immediately after use (saves RAM/VRAM but increases load time for the next request). Setting it to a longer duration (e.g., `30m`) or `-1` (keeps the model loaded indefinitely until the server stops) can speed up frequent requests to the same model but consumes resources constantly.
  - Example: `export OLLAMA_KEEP_ALIVE=15m`
- `HTTPS_PROXY` / `HTTP_PROXY` / `NO_PROXY`: Standard networking environment variables used if Ollama needs to route its outgoing internet requests (e.g., when running `ollama pull` to download models from ollama.com) through a proxy server, common in corporate environments.
  - Example: `export HTTPS_PROXY=http://proxy.mycompany.com:8080`
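For example, after setting `OLLAMA_HOST=0.0.0.0:11434` and restarting the server, a simple request to the model-listing endpoint from another machine confirms it is reachable (the IP address below is an illustrative placeholder):

```bash
# List locally installed models over the LAN; replace the IP with the address
# of the machine running Ollama (illustrative value):
curl http://192.168.1.100:11434/api/tags
# A JSON response containing a "models" array confirms the server is reachable.
```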
Methods for Setting Ollama Environment Variables
The correct way to set these variables depends on how you installed and run Ollama:
Ollama on macOS (Using the App): Environment variables for GUI applications on macOS are best set using `launchctl`. Open Terminal and use:
launchctl setenv OLLAMA_MODELS "/Volumes/ExternalSSD/OllamaStorage"
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
# Repeat for other variables
After setting the variables, you must Quit and restart the Ollama application from the menu bar icon for the changes to take effect.
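To confirm a value was registered before restarting the app, you can read it back with the standard `launchctl getenv` subcommand (this is a general macOS check, not Ollama-specific):

```bash
# Prints the value set earlier, or nothing if the variable is unset:
launchctl getenv OLLAMA_MODELS
```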
Ollama on Linux (Using Systemd Service): The recommended method is to create an override file for the service:
- Run `sudo systemctl edit ollama.service`. This opens an empty text editor.
- Add the following lines, modifying the variables and values as needed:
[Service]
Environment="OLLAMA_MODELS=/path/to/custom/model/dir"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_DEBUG=1"
- Save and close the editor.
- Apply the changes: `sudo systemctl daemon-reload`
- Restart the Ollama service: `sudo systemctl restart ollama`
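To verify the override was applied after the restart, standard systemd inspection commands can be used (nothing Ollama-specific here):

```bash
# Show the unit file plus any override snippets created by `systemctl edit`:
systemctl cat ollama.service
# Show the environment systemd passes to the service:
systemctl show ollama --property=Environment
```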
Ollama on Windows: Use the built-in Environment Variables editor:
- Search for "Edit the system environment variables" in the Start menu and open it.
- Click the "Environment Variables..." button.
- You can set variables for your specific user ("User variables") or for all users ("System variables"). System variables usually require administrator privileges.
- Click "New..." under the desired section.
- Enter the `Variable name` (e.g., `OLLAMA_MODELS`) and `Variable value` (e.g., `D:\OllamaData`).
- Click OK on all open dialogs.
- Crucially, you must restart the Ollama background process. Open Task Manager (Ctrl+Shift+Esc), go to the "Services" tab, find "Ollama", right-click, and select "Restart". Alternatively, reboot your computer.
Ollama via Docker: Pass environment variables directly in the `docker run` command using the `-e` flag for each variable:
docker run -d \
--gpus=all \
-v ollama_data:/root/.ollama \
-p 127.0.0.1:11434:11434 \
-e OLLAMA_HOST="0.0.0.0:11434" \
-e OLLAMA_DEBUG="1" \
-e OLLAMA_KEEP_ALIVE="10m" \
--name my_ollama_configured \
ollama/ollama
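Once the container is running, a quick way to confirm the variables reached the process is to list the container's environment with standard Docker tooling:

```bash
# Print the OLLAMA_* variables inside the running container:
docker exec my_ollama_configured env | grep OLLAMA
```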
Ollama via Manual `ollama serve` in the Terminal: Simply prefix the command with the variable assignments on the same line:
OLLAMA_DEBUG=1 OLLAMA_HOST=0.0.0.0:11434 OLLAMA_MODELS=/data/ollama ollama serve
These variables will only apply to that specific server instance.
Choose the method appropriate for your setup and remember to restart the Ollama server process after making changes for them to become active.
How to Check Ollama Logs for Troubleshooting
Your primary diagnostic tool is the Ollama server log file. It records startup information, model loading attempts, GPU detection results, API requests, and, most importantly, detailed error messages.
Default Log File Locations:
- Ollama on macOS: `~/.ollama/logs/server.log`
- Ollama on Linux (Systemd): Use the journal control command: `journalctl -u ollama`. Add `-f` to follow logs in real time (`journalctl -u ollama -f`). Use `-n 100` to see the last 100 lines (`journalctl -u ollama -n 100`).
- Ollama on Windows: The log file is `server.log`, located inside `%LOCALAPPDATA%\Ollama`. You can easily open this folder by pasting `%LOCALAPPDATA%\Ollama` into the File Explorer address bar or the Run dialog (Win+R).
- Ollama via Docker: Use the Docker logs command: `docker logs my_ollama` (replace `my_ollama` with your container name). Add `-f` to follow (`docker logs -f my_ollama`).
- Ollama via Manual `ollama serve`: Logs are printed directly to the terminal window where you launched the command.
Tip: For more detailed troubleshooting, always enable debug logging by setting the `OLLAMA_DEBUG=1` environment variable before starting the Ollama server, then check the logs again.
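On a Linux systemd install, for example, you might follow the log live while reproducing the problem, or filter recent output for likely keywords (the keyword list is just a starting point):

```bash
# Follow the service log in real time while reproducing the issue:
journalctl -u ollama -f
# Or scan the last few hundred lines for GPU- and error-related messages:
journalctl -u ollama -n 300 --no-pager | grep -iE 'error|cuda|rocm|gpu'
```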
How to Fix Ollama Error: listen tcp 127.0.0.1:11434: bind: address already in use
This specific error message is one of the most common issues new users encounter. It means Ollama cannot start its API server because another process is already occupying the network port (default `11434`) that Ollama needs to listen on.
- Likely Cause:
- Another instance of the Ollama server is already running in the background (perhaps from a previous session that didn't shut down cleanly).
- A completely different application on your system happens to be using port 11434.
- Solution 1: Identify and Stop the Conflicting Process: You need to find out what process is using the port and stop it (if it's safe to do so).
- Linux / macOS: Open a terminal and run `sudo lsof -i :11434` or `sudo netstat -tulnp | grep 11434`. These commands should show the Process ID (PID) of the program using the port. You can then try to stop it gracefully or use `sudo kill <PID>`. If it's an old Ollama process, killing it should resolve the conflict.
- Windows: Open Command Prompt as Administrator and run `netstat -ano | findstr "11434"`. Look at the last column for the PID. Open Task Manager (Ctrl+Shift+Esc), go to the "Details" tab (or "Processes" and add the PID column), find the process with that PID, and end it if appropriate.
- Solution 2: Change Ollama's Listening Port: If you cannot stop the conflicting process, or if you intentionally want Ollama to run on a different port, you can configure Ollama to use an alternative port. Set the `OLLAMA_HOST` environment variable to include your desired port before starting Ollama.
- Example: Set `OLLAMA_HOST` to `127.0.0.1:11435` (using the methods described in the Configuration section).
- Remember to adjust any API client configurations, web UIs, or scripts to point to the new port (e.g., `http://localhost:11435`) after making this change. A short command sketch follows this list.
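Putting those steps together on Linux/macOS, a typical session might look like the following sketch (only kill a PID if you are sure it belongs to a stale Ollama process):

```bash
# Find the process holding port 11434:
sudo lsof -i :11434
# Stop a stale Ollama process using the PID from the output above:
sudo kill <PID>
# Alternatively, leave the port alone and start a one-off server elsewhere:
OLLAMA_HOST=127.0.0.1:11435 ollama serve
```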
How to Fix Ollama GPU Detection and Usage Problems
If `ollama ps` shows `cpu` instead of `gpu`, or if you encounter specific GPU-related errors in the logs (like `CUDA error` or `ROCm error`), follow these steps:
Confirm GPU Compatibility: Double-check that your specific GPU model is listed as supported in the official Ollama GPU documentation on GitHub.
Update Drivers: Ensure you have the very latest stable official drivers installed directly from NVIDIA or AMD's websites. Generic drivers included with the OS are often insufficient. A full system reboot after driver installation is highly recommended.
Check Ollama Logs (Debug Mode): Set `OLLAMA_DEBUG=1`, restart the Ollama server, and carefully examine the startup logs. Look for messages related to GPU detection, library loading (CUDA, ROCm), and any specific error codes.
NVIDIA Specifics (Linux):
- Verify that the `nvidia-smi` command works and shows your GPU and driver version.
- If using Docker, confirm the NVIDIA Container Toolkit is installed and functional (`docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi`).
- Check if the necessary kernel modules are loaded (`lsmod | grep nvidia`). Sometimes reloading them (`sudo rmmod nvidia_uvm nvidia && sudo modprobe nvidia_uvm nvidia`) can help.
- Check system logs (`dmesg | grep -iE 'nvidia|nvrm'`) for hardware or driver errors.
AMD Specifics (Linux):
- Verify that the `rocminfo` command works and shows your GPU.
- Ensure the user running Ollama (or the Docker container) has correct permissions for the `/dev/kfd` and `/dev/dri/renderD*` devices (this often requires membership in the `render` and `video` groups). Check group membership with `groups $USER` or `groups ollama`. You might need `--group-add render --group-add video` in your Docker command.
- Check system logs (`dmesg | grep -iE 'amdgpu|kfd'`) for errors.
Force CPU (for testing): As a temporary diagnostic step, try forcing CPU usage by setting `CUDA_VISIBLE_DEVICES=-1` or `ROCR_VISIBLE_DEVICES=-1`. If Ollama runs correctly on the CPU, it confirms the issue is related to GPU setup.
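For a manually launched server, that looks like this (pick the variable that matches your GPU vendor):

```bash
# Force CPU-only inference for this server instance (NVIDIA):
CUDA_VISIBLE_DEVICES=-1 ollama serve
# ...or for AMD:
ROCR_VISIBLE_DEVICES=-1 ollama serve
```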
Addressing Other Common Ollama Issues
Permission Errors (Model Directory): Especially on Linux with the systemd service, if Ollama fails to pull or create models, it might lack write permissions for the model storage directory (`OLLAMA_MODELS` or the default). Ensure the directory exists and is owned or writable by the `ollama` user/group (`sudo chown -R ollama:ollama /path/to/models` and `sudo chmod -R 775 /path/to/models`), as sketched below.
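For instance, if you pointed `OLLAMA_MODELS` at a custom directory for the systemd service, a typical fix looks like this (the path is illustrative; substitute your own):

```bash
# Create the custom model directory and hand ownership to the ollama user:
sudo mkdir -p /data/ollama-models
sudo chown -R ollama:ollama /data/ollama-models
sudo chmod -R 775 /data/ollama-models
# Restart the service so it picks up the corrected permissions:
sudo systemctl restart ollama
```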
Slow Model Downloads (`ollama pull`):
- Check your basic internet connection speed.
- If behind a proxy, ensure `HTTPS_PROXY` is correctly configured.
- Check for firewall rules potentially blocking or throttling connections to `ollama.com`.
- On Windows with WSL2, the known LSO (Large Send Offload) issue can sometimes impact network performance within WSL. Disabling LSOv2 for the `vEthernet (WSL)` adapter in the Windows Network Adapter settings might help, though this is less common now.
Garbled Terminal Output (`ollama run` on older Windows): If you see strange characters like `←[?25h...` in `cmd.exe` or PowerShell on older Windows 10 versions, it's likely due to poor ANSI escape code support. The best solutions are:
- Upgrade Windows 10 to version 22H2 or later.
- Use Windows 11.
- Use the modern Windows Terminal application, which has excellent ANSI support.
If you've exhausted these troubleshooting steps and checked the debug logs without success, the Ollama community is a great resource. Prepare a clear description of the problem, include relevant details about your OS, Ollama version, hardware (CPU/GPU/RAM), the specific model you're using, the command you ran, and crucially, the relevant sections from your debug logs. Post your question on the Ollama Discord or file a well-documented issue on the Ollama GitHub repository.
How to Uninstall Ollama Completely
If you need to remove Ollama from your system, the process varies based on your initial installation method. It typically involves removing the application/binary, the background service (if applicable), and the stored models/configuration files.
Uninstalling Ollama on macOS (Installed via .app):
- Quit Ollama: Click the Ollama menu bar icon and select "Quit Ollama".
- Remove Application: Drag `Ollama.app` from your `/Applications` folder to the Trash/Bin.
- Remove Data and Config: Open Terminal and execute `rm -rf ~/.ollama`. Warning: This deletes all downloaded models and configuration permanently. Double-check the command before running it.
- (Optional) Unset Environment Variables: If you manually set variables using `launchctl setenv`, you can unset them: `launchctl unsetenv OLLAMA_HOST`, `launchctl unsetenv OLLAMA_MODELS`, etc.
Uninstalling Ollama on Windows (Installed via .exe):
- Use Windows Uninstaller: Go to "Settings" > "Apps" > "Installed apps". Locate "Ollama" in the list, click the three dots (...) next to it, and select "Uninstall". Follow the uninstallation prompts.
- Remove Data and Config: After the uninstaller finishes, manually delete the Ollama data directory. Open File Explorer, type `%USERPROFILE%\.ollama` into the address bar, press Enter, and delete the entire `.ollama` folder. Warning: This deletes all models.
- (Optional) Remove Environment Variables: If you manually added `OLLAMA_HOST`, `OLLAMA_MODELS`, etc., via System Properties, go back there ("Edit the system environment variables") and delete them.
Uninstalling Ollama on Linux (Installed via Script or Manual Binary):
- Stop the Service: `sudo systemctl stop ollama`
- Disable the Service: `sudo systemctl disable ollama`
- Remove Binary: `sudo rm /usr/local/bin/ollama` (or the path where you installed it).
- Remove Service File: `sudo rm /etc/systemd/system/ollama.service`
- Reload Systemd: `sudo systemctl daemon-reload`
- (Optional) Remove User/Group: If the `ollama` user/group were created: `sudo userdel ollama`, `sudo groupdel ollama`.
- Remove Data and Config: Delete the model storage directory. This depends on where it was stored:
  - If run as your user: `rm -rf ~/.ollama`
  - If run as the system service (`ollama` user): `sudo rm -rf /usr/share/ollama/.ollama` (or the path specified by `OLLAMA_MODELS` in the service file).

Warning: Use extreme caution with `sudo rm -rf`. Verify the path is correct before executing. A consolidated command sequence is sketched below.
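For convenience, the same steps can be run as one sequence; review every line before executing, since the paths assume the default install-script locations and the final `rm -rf` step is irreversible:

```bash
# Stop and disable the service, then remove the binary and unit file:
sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /usr/local/bin/ollama
sudo rm /etc/systemd/system/ollama.service
sudo systemctl daemon-reload
# Optional: remove the dedicated user/group if the installer created them:
sudo userdel ollama
sudo groupdel ollama
# Destructive: delete all downloaded models and configuration:
sudo rm -rf /usr/share/ollama/.ollama
```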
Uninstalling Ollama via Docker:
- Stop the Container: `docker stop my_ollama` (use your container name).
- Remove the Container: `docker rm my_ollama`
- Remove the Image: `docker rmi ollama/ollama` (and `ollama/ollama:rocm` if you used it).
- (Optional, Destructive) Remove the Volume: If you want to delete all downloaded models stored in the Docker volume, run `docker volume rm ollama_data` (use the volume name you created). Warning: This is irreversible.
Conclusion: Embracing the Power of Local AI with Ollama
Ollama stands as a pivotal tool in democratizing access to the immense power of modern Large Language Models. By elegantly abstracting away the complexities of setup, configuration, and execution, it empowers a diverse range of users – from seasoned developers and researchers to curious enthusiasts – to run sophisticated AI directly on their own hardware. The advantages are clear: unparalleled privacy, freedom from recurring API costs, reliable offline operation, and the liberating ability to deeply customize and experiment with models using the intuitive `Modelfile` system and robust API.
Whether your goal is to build the next generation of AI-driven applications, conduct cutting-edge research while maintaining data sovereignty, or simply explore the fascinating capabilities of language generation without external dependencies, Ollama provides a stable, efficient, and user-friendly foundation. It successfully bridges the gap between the raw power of inference engines like `llama.cpp` and the practical needs of users, fostering innovation within the vibrant open-source AI landscape.
The journey into the world of local LLMs is both accessible and deeply rewarding, thanks to Ollama. Download the application, pull your first model using `ollama pull`, start a conversation with `ollama run`, and begin unlocking the vast potential of artificial intelligence, right on your own machine.