Unlocking the full power of large language models (LLMs) on your own hardware is now within reach for developers and API teams. Ollama radically simplifies local LLM deployment, making it practical to run, customize, and integrate advanced models like Llama 3, Mistral, Gemma, and Phi—no cloud dependency required.
Whether you're building internal tools, automating workflows, or experimenting with AI-powered apps, this guide walks you through installing Ollama, managing models, leveraging APIs, and troubleshooting—all with a focus on developer productivity and security.
💡 Looking for an API platform that generates beautiful API Documentation and boosts team productivity? Try Apidog, the all-in-one API solution that replaces Postman at a better price.
Why Run LLMs Locally with Ollama?
Relying on cloud-based AI APIs can be convenient, but local LLM deployment offers compelling advantages for engineering teams:
- Maximum Data Privacy: All prompts, context, and model outputs stay on your machine—critical for sensitive code, user data, or internal documents.
- Cost Control: No per-use or subscription fees. Once your hardware is ready, model runs are unlimited and free.
- Offline Reliability: Use models anytime, even without internet—ideal for air-gapped environments, on-premise development, or field research.
- Model Customization: Ollama’s Modelfile system lets you tweak parameters, system prompts, and add fine-tuned adapters (LoRAs) for specialized use cases. Import weights from GGUF or Safetensors formats with ease.
- Performance on Your Hardware: Run LLMs on your workstation or server, leveraging CPU or GPU for fast inference—no network latency, no shared resources.
- Open Source and Community-Driven: Ollama is open source, with a growing ecosystem of models and contributors.
“The beauty of having small but smart models like Gemma 2 9B is that it allows you to build all sorts of fun stuff locally. Like this script that uses @ollama to fix typos or improve any text on my Mac just by pressing dedicated keys. A local Grammarly but super fast. ⚡️”
— Pietro Schirano (tweet)
Ollama vs. llama.cpp: What’s the Difference?
Understanding Ollama’s architecture helps developers choose the right tool:
- llama.cpp: The C/C++ inference engine powering efficient LLM execution, optimized for CPU and GPU across platforms.
- Ollama: A higher-level application built on llama.cpp, providing:
  - Command-line interface (`ollama run`, `ollama pull`)
  - REST API server for integration
  - Streamlined model management and customization
  - Cross-platform installers (macOS, Windows, Linux, Docker)
  - Automatic hardware detection
Use llama.cpp directly for low-level control or research; use Ollama for a developer-friendly, production-ready local LLM platform.
How to Install Ollama on macOS, Windows, Linux, and Docker
Ollama makes setup frictionless across environments.
System Requirements
- RAM:
  - Minimum 8GB for small models (1B–3B).
  - 16GB recommended for 7B/13B models.
  - 32GB+ for 30B+ models and large context windows.
- Disk Space:
  - Model sizes vary: small (2GB), mid (4–5GB), large (40GB+). Ensure enough free space.
- OS:
  - macOS 11+ (Apple Silicon recommended)
  - Windows 10 22H2+ or 11
  - Modern Linux (Ubuntu 20.04+, Fedora 38+, Debian 11+)
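As a rough sanity check before pulling a model, you can estimate its memory footprint from parameter count and quantization level. The helper below is an illustrative sketch, not an official Ollama formula; the ~20% overhead factor for KV cache and runtime buffers is an assumption.

```python
def estimated_size_gb(params_billions: float, bits_per_weight: float,
                      overhead: float = 1.2) -> float:
    """Rough memory estimate: parameters * bits-per-weight / 8 bytes,
    plus ~20% headroom for KV cache and runtime buffers (assumed)."""
    return params_billions * 1e9 * bits_per_weight / 8 * overhead / 1e9

# A 7B model at ~4.5 bits/weight (typical of q4_K_M) lands around 4-5 GB,
# which is why 16GB of RAM is a comfortable fit for quantized 7B models.
print(f"7B @ ~q4_K_M: {estimated_size_gb(7, 4.5):.1f} GB")
print(f"7B @ fp16:    {estimated_size_gb(7, 16):.1f} GB")
```

Actual usage varies with context window size and runtime, so treat these numbers as ballpark figures.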
macOS
- Download the DMG from Ollama’s official site.
- Mount the image and drag `Ollama.app` to Applications.
- Launch the app—Ollama runs as a background service and is accessible via the menu bar and Terminal.
Apple Silicon Macs (M1/M2/M3/M4) are GPU-accelerated by default via Metal.
Windows
- Download and run `OllamaSetup.exe`.
- Follow the installer prompts.
- Ollama runs as a background service and is available in Command Prompt or PowerShell.
For GPU acceleration:
- NVIDIA: Latest GeForce drivers (version 452.39+).
- AMD: Latest Adrenalin drivers. Supported GPUs: RX 6000 series and newer.
Linux
Install via script:
curl -fsSL https://ollama.com/install.sh | sh
- Installs the binary, configures systemd service, checks GPU drivers, and starts the service.
- For advanced/manual installs, see Ollama’s GitHub documentation.
For GPU:
- NVIDIA: Proprietary drivers; verify with `nvidia-smi`.
- AMD: ROCm toolkit; verify with `rocminfo`.
Docker
CPU-only:
docker run -d \
-v ollama_data:/root/.ollama \
-p 127.0.0.1:11434:11434 \
--name my_ollama \
ollama/ollama
NVIDIA GPU:
docker run -d \
--gpus=all \
-v ollama_data:/root/.ollama \
-p 127.0.0.1:11434:11434 \
--name my_ollama_gpu \
ollama/ollama
AMD GPU (ROCm):
docker run -d \
--device /dev/kfd \
--device /dev/dri \
-v ollama_data:/root/.ollama \
-p 127.0.0.1:11434:11434 \
--name my_ollama_rocm \
ollama/ollama:rocm
Access the Ollama API at http://localhost:11434.
Where Does Ollama Store Models?

- macOS/Linux: `~/.ollama/models`
- Windows: `C:\Users\<YourUsername>\.ollama\models`
- Linux (systemd service): `/usr/share/ollama/.ollama/models`
- Docker: `/root/.ollama/models` inside the container (use `-v` to persist)
Change storage location with the OLLAMA_MODELS environment variable if needed.
Getting Started: Running Your First LLM with Ollama
Once installed, start using LLMs in your terminal.
Downloading Models: ollama pull
Browse available models in the Ollama model library.
Examples:
ollama pull llama3.2 # Llama 3.2 3B Instruct
ollama pull mistral:7b # Mistral 7B base
ollama pull gemma3 # Gemma 4B
ollama pull phi4-mini # Phi-4 Mini
ollama pull llava # Vision model
Understanding tags:
- Size: 1b, 3b, 7b, 13b, etc. (parameters in billions)
- Quantization: q4_K_M, q5_K_M, q8_0, etc. for size/performance trade-offs
- Variants: `instruct`, `chat`, `code`, `vision`, etc.
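Because tags follow a loose `name:size-variant-quant` convention, a small parser is handy when scripting over many models. This is a best-effort sketch of my own, not an official Ollama utility; tag formats vary, so treat it as illustrative.

```python
import re

def parse_tag(model: str) -> dict:
    """Best-effort split of a model reference like
    'llama3.2:8b-instruct-q5_K_M' into name, size, variant, and quantization."""
    name, _, tag = model.partition(":")
    parts = tag.split("-") if tag else []
    info = {"name": name, "size": None, "variant": None, "quant": None}
    for part in parts:
        if re.fullmatch(r"\d+(\.\d+)?b", part):    # e.g. 7b, 8b, 1.5b
            info["size"] = part
        elif re.fullmatch(r"q\d.*", part):          # e.g. q4_K_M, q8_0
            info["quant"] = part
        elif info["variant"] is None:               # e.g. instruct, chat, code
            info["variant"] = part
    return info

print(parse_tag("llama3.2:8b-instruct-q5_K_M"))
```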

Chatting with LLMs: ollama run
ollama run llama3.2
Type your prompt at the interactive prompt. Use /set parameter to adjust on the fly (e.g., /set parameter temperature 0.9).
Common interactive commands:
- `/set parameter <name> <value>` – e.g., temperature, num_ctx
- `/show info` – model details
- `/save <session>` and `/load <session>` – save/restore a session
- `/bye` or `/exit` – quit
Managing Your Ollama Models
- List models: `ollama list`
- Show detailed info: `ollama show llama3.2:8b-instruct-q5_K_M`
- Delete a model: `ollama rm mistral:7b`
- Copy/rename a model: `ollama cp llama3.2 my-custom-llama3.2`
- Check loaded models: `ollama ps`
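The same information `ollama list` prints is available programmatically from the `GET /api/tags` endpoint, which is useful in scripts and CI. A minimal standard-library sketch; the parsing assumes the documented `{"models": [...]}` response shape.

```python
import json
from urllib.request import urlopen

def model_names(tags_response: dict) -> list:
    """Extract model names from a /api/tags response body."""
    return [m["name"] for m in tags_response.get("models", [])]

def list_local_models(host: str = "http://localhost:11434") -> list:
    """Query a running Ollama server for its locally available models."""
    with urlopen(f"{host}/api/tags") as resp:
        return model_names(json.load(resp))

# With the server running: print(list_local_models())
```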
Choosing the Right Ollama Model for Your Use Case
Selection depends on:
- Task:
  - General chat: `llama3.2`, `mistral`, `gemma3`
  - Coding: `codellama`, `phi4`, `starcoder2`
  - Summarization/analysis: instruction-tuned models
  - Multimodal/image: `llava`, `moondream`
- Hardware:
- 8GB RAM: Small/quantized (1B–3B)
- 16GB RAM: 7B/8B quantized
- 32GB+ RAM: 13B or larger
- Quality vs. Speed:
- Less quantization = higher quality, more RAM/VRAM needed
- More quantization = smaller, faster, less resource use
Recommended starting points for most teams:
- General use: `llama3.2:8b-instruct-q5_K_M` or `mistral:7b-instruct-v0.2-q5_K_M`
- For low-resource machines: `phi4-mini:q4_K_M`
- For coding: `codellama:7b-instruct-q5_K_M`
- For vision: `llava:13b` (if hardware allows)
Test several and measure against your real prompts and workloads.
Understanding Context Length in Ollama
What is num_ctx?
- The context window: How many tokens the model can “see” at once (prompts + history).
- Why it matters: Longer context = better handling of long chats, documents, or code blocks.
- Limits: Each model has a max value (e.g., 4096, 8192, 32k). Setting above the trained limit can cause issues; lower is always safe.
See the default with `ollama show <model>` and look for `num_ctx`.
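If you script around `ollama show <model> --modelfile`, a small parser can pull the default out of the Modelfile text. A sketch assuming an explicit `PARAMETER num_ctx <n>` line is present; models without one fall back to their built-in default.

```python
import re

def find_num_ctx(modelfile_text: str):
    """Return the num_ctx value declared in a Modelfile, or None if absent."""
    match = re.search(r"^PARAMETER\s+num_ctx\s+(\d+)", modelfile_text, re.MULTILINE)
    return int(match.group(1)) if match else None

example = """FROM llama3.2
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
"""
print(find_num_ctx(example))  # 8192
```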
Adjust context:
- In interactive mode: `/set parameter num_ctx 8192`
- Per API request: the JSON `options` field
- In a custom Modelfile: `PARAMETER num_ctx 8192`
Key Ollama Model Parameters
- temperature: Controls creativity (lower values like 0.2 are more deterministic; 1.0+ is more random)
- top_p: Nucleus sampling, restricts to most likely tokens
- top_k: Limits to top-k probable tokens
- num_predict: Max tokens to generate in a response
- stop: When to halt generation (e.g., after code block)
- repeat_penalty, seed, etc.: Fine-grained controls
Set parameters:
- Temporarily with
/set parameter - Per-request in API JSON
- Persistently in Modelfile
Integrating with Ollama’s REST API
Ollama’s API enables automated and production integration:
API Endpoints:
- `POST /api/generate` — single prompt completion
- `POST /api/chat` — conversational context
- `POST /api/embeddings` — semantic vectors for text
- `GET /api/tags` — list local models
- `POST /api/pull`, `/api/delete`, etc. — manage models
Basic example:
curl http://localhost:11434/api/generate -d '{
"model": "phi4-mini",
"prompt": "Write a short Python function to calculate factorial:",
"stream": false,
"options": {"temperature": 0.3, "num_predict": 80}
}'
Streaming and full chat API supported.
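When `stream` is true (the default), `/api/generate` returns one JSON object per line, each carrying a `response` fragment, ending with an object where `done` is true. A standard-library sketch of consuming that stream; the field names follow the Ollama API docs, but treat the helper itself as illustrative.

```python
import json
from urllib.request import Request, urlopen

def join_stream(lines) -> str:
    """Concatenate 'response' fragments from newline-delimited JSON chunks."""
    text = []
    for line in lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

def generate_streaming(prompt: str, model: str = "phi4-mini",
                       host: str = "http://localhost:11434") -> str:
    """Stream a completion from a running Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = Request(f"{host}/api/generate", data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:  # iterating the response yields one line per chunk
        return join_stream(resp)

# With the server running: print(generate_streaming("Say hello in one word."))
```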
Works with any HTTP client—curl, Postman, or API platforms like Apidog for automated testing and documentation.
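The `/api/embeddings` endpoint returns a vector you can compare with cosine similarity for semantic search. A minimal sketch: the request shape follows the documented `{"model", "prompt"}` fields and `cosine` is ordinary math, but treat the wiring as illustrative rather than production code.

```python
import json
import math
from urllib.request import Request, urlopen

def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def embed(text: str, model: str = "llama3.2",
          host: str = "http://localhost:11434"):
    """Fetch an embedding vector from a running Ollama server."""
    body = json.dumps({"model": model, "prompt": text}).encode()
    req = Request(f"{host}/api/embeddings", data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)["embedding"]

# With the server running:
# print(cosine(embed("local LLMs"), embed("running models on-device")))
```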
OpenAI Compatibility
Ollama’s /v1/ endpoints mimic OpenAI’s API, so you can drop it in as a local replacement.
- Update your OpenAI client's `base_url` to `http://localhost:11434/v1`
- Use any string for `api_key` (e.g., `"ollama"`)
Python example:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
chat_completion = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the difference between Ollama and llama.cpp."},
    ],
    temperature=0.7,
    max_tokens=250,
    stream=False,
)
This allows you to test and swap between local and cloud LLMs with minimal code change.
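If you'd rather avoid the `openai` package entirely, the same `/v1/chat/completions` endpoint answers plain HTTP. A standard-library sketch; the response parsing assumes the standard OpenAI `choices[0].message.content` shape.

```python
import json
from urllib.request import Request, urlopen

def first_message(completion: dict) -> str:
    """Pull the assistant's text out of an OpenAI-format chat completion."""
    return completion["choices"][0]["message"]["content"]

def chat(messages, model: str = "llama3.2",
         host: str = "http://localhost:11434") -> str:
    """Call Ollama's OpenAI-compatible chat endpoint over plain HTTP."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    req = Request(f"{host}/v1/chat/completions", data=body,
                  headers={"Content-Type": "application/json",
                           "Authorization": "Bearer ollama"})  # any key works locally
    with urlopen(req) as resp:
        return first_message(json.load(resp))

# With the server running:
# print(chat([{"role": "user", "content": "One-line summary of Ollama?"}]))
```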
Advanced Customization: Modelfiles, GGUF, Safetensors, and Quantization
Modelfile: The Blueprint for Custom Models
A Modelfile lets you:
- Specify a base model (`FROM llama3.2:8b-instruct-q5_K_M`, or a path to GGUF/Safetensors)
- Set default parameters (e.g., `PARAMETER temperature 0.7`)
- Add adapters (LoRA/QLoRA) for fine-tuning
Example:
FROM ./zephyr-7b-beta.Q5_K_M.gguf
TEMPLATE """<|system|>
{{ .System }}</s>
<|user|>
{{ .Prompt }}</s>
<|assistant|>
{{ .Response }}</s>
"""
PARAMETER num_ctx 4096
SYSTEM "You are a friendly chatbot."
Build with:
ollama create my-zephyr-gguf -f ZephyrImport.modelfile
Importing External Models
- GGUF: Download the `.gguf` file and reference it in a Modelfile (`FROM ./model.gguf`)
- Safetensors: Point to a directory containing all required files
- LoRA adapters: Use `ADAPTER /path/to/adapter/` in the Modelfile; the base model must match
Quantizing Models
Reduce RAM/VRAM use by quantizing during ollama create:
ollama create my-quantized-model-q4km -f MyModelfile -q q4_K_M
Popular quantization levels:
- `q4_K_M`, `q5_K_M`, `q8_0` (balance size/speed/quality)
- More quantization = smaller and faster, but potentially slight quality loss
Sharing Your Custom Ollama Models
- Create an Ollama account and link your local public key.
- Namespace your model: `yourusername/modelname`
- Push to the registry: `ollama push yourusername/modelname`
Others can then pull and run your model directly.
Optimizing Ollama Performance with GPU Acceleration
- NVIDIA: Use proprietary drivers; CUDA Compute Capability 5.0+ required
- AMD: Adrenalin (Win), ROCm (Linux); RX 6000+ series supported
- Apple Silicon: GPU acceleration via Metal works out of the box
Verify usage:
Run `ollama ps` during inference; the PROCESSOR column shows whether the model is loaded on GPU or CPU.
Multi-GPU control:
Set CUDA_VISIBLE_DEVICES (NVIDIA) or ROCR_VISIBLE_DEVICES (AMD) before starting Ollama.
Fine-Tuning Ollama with Environment Variables
Key variables:
- OLLAMA_HOST: API server IP/port (default: `127.0.0.1:11434`)
- OLLAMA_MODELS: Custom model storage path
- OLLAMA_ORIGINS: Allowed CORS origins for web UIs
- OLLAMA_DEBUG: Enable detailed logging
- OLLAMA_KEEP_ALIVE: Model RAM retention period
Set via:
- `launchctl` for macOS apps
- systemd override (`systemctl edit ollama.service`) for Linux
- Windows Environment Variables dialog
- Docker `-e` flags
- Inline for manual runs: `OLLAMA_DEBUG=1 ollama serve`
Restart Ollama after changes to apply.
Troubleshooting Ollama
How to Check Logs
- macOS: `~/.ollama/logs/server.log`
- Linux (systemd): `journalctl -u ollama`
- Windows: `%LOCALAPPDATA%\Ollama\server.log`
- Docker: `docker logs my_ollama`
- Manual runs: terminal output
Enable debug logs with OLLAMA_DEBUG=1.
Common Issues and Fixes
- Port conflict (`listen tcp 127.0.0.1:11434: bind: address already in use`):
  - Kill the existing process using the port, or set `OLLAMA_HOST` to a different port.
- GPU not detected:
- Update drivers, check hardware compatibility, verify permissions, or force CPU use for diagnostics.
- Permission errors:
- Ensure correct ownership of model directories.
- Slow downloads:
- Verify internet connection, proxy settings, and firewall rules.
- Terminal output issues (Windows):
- Use Windows Terminal or update Windows 10.
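For the port-conflict case above, a quick TCP probe confirms whether something is already listening on 11434 before you start hunting through process lists. This is a generic socket check, not an Ollama-specific tool.

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0

print("Port 11434 busy:", port_in_use(11434))
```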
If stuck, collect logs, environment info, and error messages, then consult the Ollama community or GitHub issues.
Uninstalling Ollama
macOS
- Quit Ollama, remove `Ollama.app` from Applications, and delete `~/.ollama`.
Windows
- Uninstall via “Apps”, delete `%USERPROFILE%\.ollama`, and remove environment variables if set.
Linux (systemd/manual)
- Stop and disable the service, remove the binary and service file, and delete the model data directory. Use caution with `sudo rm -rf`.
Docker
- Stop and remove the container/image; `docker volume rm ollama_data` deletes models (irreversible).
Conclusion: Unlock Local AI for API Development
Ollama bridges the gap between raw LLM engines and developer-friendly workflows, letting you run advanced models securely, cost-effectively, and flexibly on your own hardware. With robust command-line tools, APIs, and customization options, Ollama is ideal for API developers, backend teams, and anyone building AI-powered products where privacy, control, and integration matter.
Experiment with a few models, wire Ollama into your API workflows, and see how far local LLMs can take your next project.