How to Run LLMs Locally with Ollama: Developer’s Guide


Mark Ponomarev


31 January 2026


Unlocking the full power of large language models (LLMs) on your own hardware is now within reach for developers and API teams. Ollama radically simplifies local LLM deployment, making it practical to run, customize, and integrate advanced models like Llama 3, Mistral, Gemma, and Phi—no cloud dependency required.

Whether you build internal tools, automate workflows, or experiment with AI-powered apps, this guide walks you through installing Ollama, managing models, leveraging APIs, and troubleshooting—all with a focus on developer productivity and security.

💡 Looking for an API platform that generates beautiful API Documentation and boosts team productivity? Try Apidog, the all-in-one API solution that replaces Postman at a better price.


Why Run LLMs Locally with Ollama?

Relying on cloud-based AI APIs can be convenient, but local LLM deployment offers compelling advantages for engineering teams:

  • Maximum Data Privacy: All prompts, context, and model outputs stay on your machine—critical for sensitive code, user data, or internal documents.
  • Cost Control: No per-use or subscription fees. Once your hardware is ready, model runs are unlimited and free.
  • Offline Reliability: Use models anytime, even without internet—ideal for air-gapped environments, on-premise development, or field research.
  • Model Customization: Ollama’s Modelfile system lets you tweak parameters, system prompts, and add fine-tuned adapters (LoRAs) for specialized use cases. Import weights from GGUF or Safetensors formats with ease.
  • Performance on Your Hardware: Run LLMs on your workstation or server, leveraging CPU or GPU for fast inference—no network latency, no shared resources.
  • Open Source and Community-Driven: Ollama is open source, with a growing ecosystem of models and contributors.

“The beauty of having small but smart models like Gemma 2 9B is that it allows you to build all sorts of fun stuff locally. Like this script that uses @ollama to fix typos or improve any text on my Mac just by pressing dedicated keys. A local Grammarly but super fast. ⚡️”
— Pietro Schirano (tweet)


Ollama vs. llama.cpp: What’s the Difference?

Understanding Ollama’s architecture helps developers choose the right tool:

  • llama.cpp: The C/C++ inference engine powering efficient LLM execution, optimized for CPU and GPU across platforms.
  • Ollama: A higher-level application built on llama.cpp, providing:
    • Command-line interface (ollama run, ollama pull)
    • REST API server for integration
    • Streamlined model management and customization
    • Cross-platform installers (macOS, Windows, Linux, Docker)
    • Automatic hardware detection

Use llama.cpp directly for low-level control or research; use Ollama for a developer-friendly, production-ready local LLM platform.


How to Install Ollama on macOS, Windows, Linux, and Docker

Ollama makes setup frictionless across environments.

System Requirements

  • RAM:
    • Minimum 8GB for small models (1B–3B).
    • 16GB recommended for 7B/13B models.
    • 32GB+ for 30B+ models and large context windows.
  • Disk Space:
    • Model sizes vary: small (2GB), mid (4–5GB), large (40GB+). Ensure enough free space.
  • OS:
    • macOS 11+ (Apple Silicon recommended)
    • Windows 10 22H2+ or 11
    • Modern Linux (Ubuntu 20.04+, Fedora 38+, Debian 11+)

macOS

  1. Download the DMG from Ollama’s official site.
  2. Mount and drag Ollama.app to Applications.
  3. Launch the app—Ollama runs as a background service and is accessible via the menu bar and Terminal.

Apple Silicon Macs (M1/M2/M3/M4) are GPU-accelerated by default via Metal.

Windows

  1. Download and run OllamaSetup.exe.
  2. Follow the installer prompts.
  3. Ollama runs as a background service and is available in Command Prompt or PowerShell.

For GPU acceleration:

  • NVIDIA: Latest GeForce drivers (version 452.39+).
  • AMD: Latest Adrenalin drivers. Supported GPUs: RX 6000 series and newer.

Linux

Install via script:

curl -fsSL https://ollama.com/install.sh | sh
  • Installs the binary, configures systemd service, checks GPU drivers, and starts the service.
  • For advanced/manual installs, see Ollama’s GitHub documentation.

For GPU:

  • NVIDIA: Proprietary drivers, verify with nvidia-smi.
  • AMD: ROCm toolkit, verify with rocminfo.

Docker

CPU-only:

docker run -d \
  -v ollama_data:/root/.ollama \
  -p 127.0.0.1:11434:11434 \
  --name my_ollama \
  ollama/ollama

NVIDIA GPU:

docker run -d \
  --gpus=all \
  -v ollama_data:/root/.ollama \
  -p 127.0.0.1:11434:11434 \
  --name my_ollama_gpu \
  ollama/ollama

AMD GPU (ROCm):

docker run -d \
  --device /dev/kfd \
  --device /dev/dri \
  -v ollama_data:/root/.ollama \
  -p 127.0.0.1:11434:11434 \
  --name my_ollama_rocm \
  ollama/ollama:rocm

Access the Ollama API at http://localhost:11434.


Where Does Ollama Store Models?


  • macOS/Linux: ~/.ollama/models
  • Windows: C:\Users\<YourUsername>\.ollama\models
  • Linux systemd: /usr/share/ollama/.ollama/models
  • Docker: /root/.ollama/models inside the container (use -v to persist)

Change storage location with the OLLAMA_MODELS environment variable if needed.
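As a sketch of how that lookup works, here is a small Python helper that mirrors the resolution order described above: an OLLAMA_MODELS override wins, otherwise the per-OS default applies. The function name is illustrative, not part of Ollama, and it ignores the Linux systemd case, which uses a service-account home directory.

```python
import os
import sys
from pathlib import Path

def model_store_dir(env=os.environ, platform=sys.platform):
    """Resolve where Ollama keeps model data: the OLLAMA_MODELS
    override if set, otherwise the per-OS default location.
    (A systemd install on Linux instead uses
    /usr/share/ollama/.ollama/models.)"""
    override = env.get("OLLAMA_MODELS")
    if override:
        return Path(override)
    if platform.startswith("win"):
        # Windows default: %USERPROFILE%\.ollama\models
        return Path(env.get("USERPROFILE", "")) / ".ollama" / "models"
    # macOS and Linux (user install) default: ~/.ollama/models
    return Path.home() / ".ollama" / "models"
```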


Getting Started: Running Your First LLM with Ollama

Once installed, start using LLMs in your terminal.

Downloading Models: ollama pull

Browse available models in the Ollama model library.

Examples:

ollama pull llama3.2            # Llama 3.2 3B Instruct (default tag)
ollama pull mistral:7b          # Mistral 7B
ollama pull gemma3              # Gemma 3 4B (default tag)
ollama pull phi4-mini           # Phi-4 Mini
ollama pull llava               # LLaVA vision model

Understanding tags:

  • Size: 1b, 3b, 7b, 13b, etc. (parameters in billions)
  • Quantization: q4_K_M, q5_K_M, q8_0, etc. for size/performance trade-offs
  • Variants: instruct, chat, code, vision, etc.
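Model references follow a simple name:tag shape, with the tag packing size, variant, and quantization into one string. A tiny illustrative parser (the function name is my own, not an Ollama API):

```python
def parse_model_tag(ref):
    """Split an Ollama model reference like 'mistral:7b' into
    (name, tag). A bare name such as 'llava' implies the
    'latest' tag, exactly as 'ollama pull' treats it."""
    name, _, tag = ref.partition(":")
    return name, (tag or "latest")
```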


Chatting with LLMs: ollama run

ollama run llama3.2

Type your prompt at the interactive prompt. Use /set parameter to adjust on the fly (e.g., /set parameter temperature 0.9).

Common interactive commands:

  • /set parameter <name> <value> – e.g., temperature, num_ctx
  • /show info – model details
  • /save <session> and /load <session>
  • /bye – exit the session (Ctrl+D also works)

Managing Your Ollama Models

  • List models:
    ollama list
  • Show detailed info:
    ollama show llama3.1:8b-instruct-q5_K_M
  • Delete model:
    ollama rm mistral:7b
  • Copy/rename model:
    ollama cp llama3.2 my-custom-llama3.2
  • Check loaded models:
    ollama ps

Choosing the Right Ollama Model for Your Use Case

Selection depends on:

  • Task:
    • General chat: llama3.2, mistral, gemma3
    • Coding: codellama, phi4, starcoder2
    • Summarization/analysis: instruction-tuned models
    • Multimodal/image: llava, moondream
  • Hardware:
    • 8GB RAM: Small/quantized (1B–3B)
    • 16GB RAM: 7B/8B quantized
    • 32GB+ RAM: 13B or larger
  • Quality vs. Speed:
    • Less quantization = higher quality, more RAM/VRAM needed
    • More quantization = smaller, faster, less resource use

Recommended starting points for most teams:

  • llama3.1:8b-instruct-q5_K_M or mistral:7b-instruct-v0.2-q5_K_M
  • For low-resource: phi4-mini:q4_K_M
  • For coding: codellama:7b-instruct-q5_K_M
  • For vision: llava:13b (if hardware allows)

Test several and measure against your real prompts and workloads.
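The RAM tiers above can be captured in a quick helper for scripts or internal tooling. The thresholds below restate this guide's rule of thumb only; they are not hard limits, and the helper name is illustrative:

```python
def suggest_size_class(ram_gb):
    """Map available system RAM to a rough model-size tier,
    following the guide's rule of thumb (not a guarantee)."""
    if ram_gb >= 32:
        return "13B or larger (quantized)"
    if ram_gb >= 16:
        return "7B-8B quantized (e.g. q5_K_M)"
    return "1B-3B small/quantized"
```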


Understanding Context Length in Ollama

What is num_ctx?

  • The context window: How many tokens the model can “see” at once (prompts + history).
  • Why it matters: Longer context = better handling of long chats, documents, or code blocks.
  • Limits: Each model has a trained maximum (e.g., 4096, 8192, 32k). Setting num_ctx above that limit can degrade output or cause errors; lower values are always safe and use less RAM.

See the default with ollama show <model> and look for num_ctx in the output.

Adjust context:

  • In interactive mode: /set parameter num_ctx 8192
  • Per API request: set num_ctx in the options JSON object
  • In a custom Modelfile: PARAMETER num_ctx 8192
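For the API route, num_ctx goes inside the request's options object and applies to that request only. A minimal Python sketch of a /api/chat body (chat_payload is an illustrative helper, not part of any SDK):

```python
import json

def chat_payload(model, messages, num_ctx=8192):
    """Request body for POST /api/chat with a per-request
    context window. Ollama reads num_ctx from 'options';
    it must not exceed the model's trained maximum."""
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }

body = chat_payload("llama3.2", [{"role": "user", "content": "Summarize this file."}])
# json.dumps(body) is what you would POST to http://localhost:11434/api/chat
```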

Key Ollama Model Parameters

  • temperature: Controls creativity (0.2 = deterministic, 1.0+ = more random)
  • top_p: Nucleus sampling, restricts to most likely tokens
  • top_k: Limits to top-k probable tokens
  • num_predict: Max tokens to generate in a response
  • stop: When to halt generation (e.g., after code block)
  • repeat_penalty, seed, etc.: Fine-grained controls

Set parameters:

  • Temporarily with /set parameter
  • Per-request in API JSON
  • Persistently in Modelfile

Integrating with Ollama’s REST API

Ollama’s API enables automated and production integration:

API Endpoints:

  • POST /api/generate — single prompt completion
  • POST /api/chat — conversational context
  • POST /api/embeddings — semantic vectors for text
  • GET /api/tags — list local models
  • POST /api/pull, /api/delete, etc. — manage models

Basic example:

curl http://localhost:11434/api/generate -d '{
  "model": "phi4-mini",
  "prompt": "Write a short Python function to calculate factorial:",
  "stream": false,
  "options": {"temperature": 0.3, "num_predict": 80}
}'

Streaming and full chat API supported.
Works with any HTTP client—curl, Postman, or API platforms like Apidog for automated testing and documentation.
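The same request can be made from Python with only the standard library. This is a sketch: build_generate_payload and generate are illustrative helper names, and the call at the bottom assumes an Ollama server is listening on the default port, so it is left commented out.

```python
import json
import urllib.request

def build_generate_payload(model, prompt, **options):
    """JSON body for POST /api/generate, mirroring the curl example:
    non-streaming, with sampling parameters under 'options'."""
    return {"model": model, "prompt": prompt, "stream": False, "options": options}

def generate(model, prompt, host="http://localhost:11434", **options):
    """Blocking, non-streaming completion against a local Ollama server."""
    data = json.dumps(build_generate_payload(model, prompt, **options)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running Ollama server:
# generate("phi4-mini", "Write a short factorial function:",
#          temperature=0.3, num_predict=80)
```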


OpenAI Compatibility

Ollama’s /v1/ endpoints mimic OpenAI’s API, so you can drop it in as a local replacement.

  • Update your OpenAI client’s base_url to http://localhost:11434/v1
  • Use any non-empty string for api_key (e.g., “ollama”); Ollama ignores the value

Python example:

from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
chat_completion = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the difference between Ollama and llama.cpp."},
    ],
    temperature=0.7,
    max_tokens=250,
    stream=False
)
print(chat_completion.choices[0].message.content)

This allows you to test and swap between local and cloud LLMs with minimal code change.


Advanced Customization: Modelfiles, GGUF, Safetensors, and Quantization

Modelfile: The Blueprint for Custom Models

A Modelfile lets you:

  • Specify a base model (FROM llama3.1:8b-instruct-q5_K_M, or a path to GGUF/Safetensors weights)
  • Set default parameters (e.g., PARAMETER temperature 0.7)
  • Define system prompts or prompt templates
  • Add adapters (LoRA/QLoRA) for fine-tuning

Example:

FROM ./zephyr-7b-beta.Q5_K_M.gguf
TEMPLATE """<|system|>
{{ .System }}</s>
<|user|>
{{ .Prompt }}</s>
<|assistant|>
{{ .Response }}</s>
"""
PARAMETER num_ctx 4096
SYSTEM "You are a friendly chatbot."

Build with:
ollama create my-zephyr-gguf -f ZephyrImport.modelfile

Importing External Models

  • GGUF: Download .gguf file, reference in Modelfile (FROM ./model.gguf)
  • Safetensors: Point to directory with all required files
  • LoRA adapters: Use ADAPTER /path/to/adapter/ in Modelfile; base model must match

Quantizing Models

Reduce RAM/VRAM use by quantizing during ollama create:

ollama create my-quantized-model-q4km -f MyModelfile -q q4_K_M

Popular quantization levels:

  • q4_K_M, q5_K_M, q8_0 (balance size/speed/quality)
  • More quantization = smaller, faster, but potentially slight quality loss
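As a back-of-envelope check before quantizing or pulling a model, you can estimate weight size from parameter count and bits per weight. The bits-per-weight figures below are approximations for common llama.cpp quant types (actual GGUF files vary slightly by tensor mix, and this excludes KV-cache memory); the function name is illustrative.

```python
# Approximate bits per weight for common llama.cpp quantization types;
# real GGUF files add metadata and vary slightly by tensor mix.
BITS_PER_WEIGHT = {"q4_K_M": 4.8, "q5_K_M": 5.7, "q8_0": 8.5, "f16": 16.0}

def approx_weights_gb(params_billion, quant):
    """Rough size of the weights alone in GB, excluding KV cache
    and runtime overhead."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

# e.g. a 7B model at q4_K_M works out to roughly 4.2 GB of weights,
# which matches the ~4-5 GB "mid" tier in the disk-space guidance above.
```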

Sharing Your Custom Ollama Models

  • Create an Ollama account and link your local public key.
  • Namespace your model: yourusername/modelname
  • Push to the registry:
    ollama push yourusername/modelname

Others can then pull and run your model directly.


Optimizing Ollama Performance with GPU Acceleration

  • NVIDIA: Use proprietary drivers; CUDA Compute Capability 5.0+ required
  • AMD: Adrenalin (Win), ROCm (Linux); RX 6000+ series supported
  • Apple Silicon: GPU acceleration via Metal works out of the box

Verify usage:
Run ollama ps while a model is loaded; the PROCESSOR column shows whether it is running on GPU, CPU, or split across both.

Multi-GPU control:
Set CUDA_VISIBLE_DEVICES (NVIDIA) or ROCR_VISIBLE_DEVICES (AMD) before starting Ollama.


Fine-Tuning Ollama with Environment Variables

Key variables:

  • OLLAMA_HOST: API server IP/port (default: 127.0.0.1:11434)
  • OLLAMA_MODELS: Custom model storage path
  • OLLAMA_ORIGINS: Allowed CORS origins for web UIs
  • OLLAMA_DEBUG: Enable detailed logging
  • OLLAMA_KEEP_ALIVE: Model RAM retention period

Set via:

  • launchctl for macOS apps
  • systemd override (systemctl edit ollama.service) for Linux
  • Windows Environment Variables dialog
  • Docker -e flags
  • Inline for manual runs:
    OLLAMA_DEBUG=1 ollama serve

Restart Ollama after changes to apply.


Troubleshooting Ollama

How to Check Logs

  • macOS: ~/.ollama/logs/server.log
  • Linux (systemd): journalctl -u ollama
  • Windows: %LOCALAPPDATA%\Ollama\server.log
  • Docker: docker logs my_ollama
  • Manual: Terminal output

Enable debug logs with OLLAMA_DEBUG=1.

Common Issues and Fixes

  • Port conflict (listen tcp 127.0.0.1:11434: bind: address already in use):
    • Kill the existing process using the port, or set OLLAMA_HOST to a different port.
  • GPU not detected:
    • Update drivers, check hardware compatibility, verify permissions, or force CPU use for diagnostics.
  • Permission errors:
    • Ensure correct ownership of model directories.
  • Slow downloads:
    • Verify internet connection, proxy settings, and firewall rules.
  • Terminal output issues (Windows):
    • Use Windows Terminal or update Windows 10.

If stuck, collect logs, environment info, and error messages, then consult the Ollama community or GitHub issues.


Uninstalling Ollama

macOS

  • Quit Ollama, remove Ollama.app from Applications, and delete ~/.ollama.

Windows

  • Uninstall via “Apps”, delete %USERPROFILE%\.ollama, and remove environment variables if set.

Linux (systemd/manual)

  • Stop and disable the service, remove the binary and service file, delete model data directory. Use caution with sudo rm -rf.

Docker

  • Stop and remove container/image; docker volume rm ollama_data to delete models (irreversible).

Conclusion: Unlock Local AI for API Development

Ollama bridges the gap between raw LLM engines and developer-friendly workflows, letting you run advanced models securely, cost-effectively, and flexibly on your own hardware. With robust command-line tools, APIs, and customization options, Ollama is ideal for API developers, backend teams, and anyone building AI-powered products where privacy, control, and integration matter.
