Large Language Models (LLMs) are transforming how we build applications, but relying solely on cloud-based APIs isn't always ideal. Latency, cost, data privacy, and the need for offline capabilities often drive developers towards running models locally. Ollama has emerged as a fantastic tool for easily running powerful open-source LLMs like Llama 3, Mistral, and Phi-3 directly on your machine (macOS, Linux, Windows).
However, interacting with different LLMs, whether local or remote, often requires writing model-specific code. This is where LiteLLM comes in. LiteLLM provides a standardized, lightweight interface to interact with over 100 different LLM APIs, including Ollama, OpenAI, Anthropic, Cohere, and many more. By using LiteLLM, you can write code once and seamlessly switch between different models and providers—including your locally running Ollama models—with minimal changes.
This tutorial provides a detailed, step-by-step guide on how to set up and use LiteLLM to interact with Ollama models running on your local machine. We'll cover everything from installation and basic configuration to making API calls, streaming responses, and leveraging more advanced LiteLLM features.
Introduction: Why LiteLLM and Ollama?
Before diving into the technical steps, let's understand the tools we're working with and why their combination is powerful.
What is LiteLLM?
LiteLLM is a lightweight Python library that acts as a unified interface for interacting with a vast array of LLM APIs. Instead of learning and implementing different SDKs or API request formats for OpenAI, Anthropic, Cohere, Google Gemini, Azure OpenAI, Replicate, Hugging Face, and local models like Ollama, you use a single, consistent function call: litellm.completion(). LiteLLM handles the underlying complexity of translating your request into the specific format required by the target model provider.
Key Features of LiteLLM:
- Unified Interface: Consistent API calls (litellm.completion, litellm.embedding) across 100+ LLM providers.
- Provider Agnostic: Easily switch between models (e.g., from gpt-4o to ollama/llama3) by changing a single model string (see the short sketch after this list).
- Robustness: Built-in support for timeouts, retries, and fallbacks.
- Observability: Integrated logging, callbacks, and support for platforms like Langfuse, Helicone, and PromptLayer.
- Proxy Server: Offers a standalone proxy server for centralized API key management, load balancing, routing, and consistent access across different applications or services.
- Cost Tracking: Helps monitor spending across various LLM APIs.
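To make that provider-agnostic claim concrete, here is a minimal sketch of the unified interface (it assumes an OPENAI_API_KEY is set for the cloud call and that a local llama3 model has already been pulled into Ollama; both setups are covered later in this tutorial):

import litellm

messages = [{"role": "user", "content": "Summarize what LiteLLM does in one sentence."}]

# The same function reaches a cloud provider and a local Ollama model;
# only the model string changes between the two calls.
for model in ["gpt-4o", "ollama/llama3"]:
    response = litellm.completion(model=model, messages=messages)
    print(f"{model}: {response.choices[0].message.content[:80]}")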
Why Use LiteLLM with Ollama?
Combining Ollama and LiteLLM offers several advantages:
- Standardization: Write application logic using LiteLLM's standard interface, allowing you to seamlessly switch between a local Ollama model (for development, privacy, offline use) and a powerful cloud model (for production, specific capabilities) without rewriting core code.
- Simplified Local Integration: LiteLLM makes interacting with Ollama's API straightforward within your Python applications, handling request formatting and response parsing.
- Flexibility: Easily experiment with different local models available through Ollama just by changing the model name in your LiteLLM call (e.g., ollama/llama3 vs. ollama/mistral).
- Leverage LiteLLM Features: Benefit from LiteLLM's features like retries, fallbacks, logging, and cost tracking, even when using local Ollama models.
- Hybrid Approaches: Build applications that can intelligently route requests to either local Ollama models or remote APIs based on factors like cost, latency, or task requirements, all managed through LiteLLM (especially via its proxy).
Getting Ready to Use LiteLLM and Ollama
Before we begin, ensure you have the necessary tools installed and set up.
LiteLLM is a Python library, so you need Python installed on your system. LiteLLM requires Python 3.8 or higher.
- Verify Installation: Open your terminal or command prompt and run:
python --version
# or
python3 --version
- Installation: If you don't have Python or need a newer version, download it from the official Python website (python.org) or use a package manager like Homebrew (macOS), apt (Debian/Ubuntu), or Chocolatey (Windows).
- Pip: Ensure you have pip, the Python package installer. It usually comes bundled with Python. You can check with pip --version or pip3 --version.
Install and Setup Ollama
You need Ollama installed and running on the same machine where you plan to run your LiteLLM Python code (or accessible over the network if running on a different machine).
- Download and Install: Visit the Ollama website (ollama.com) and download the installer for your operating system (macOS, Linux, Windows). Follow the installation instructions.
- Verify Installation: After installation, open a new terminal window and run:
ollama --version
This should display the installed Ollama version.
Pulling an Ollama Model
Ollama needs at least one model downloaded to be able to serve requests. Let's pull a popular model like Llama 3 (8B instruct variant).
- Pull Command: In your terminal, run:
ollama pull llama3
This will download the Llama 3 model files. It might take some time depending on your internet connection and the model size. You can replace llama3 with other models available in the Ollama library (e.g., mistral, phi3, gemma:2b). Check the Ollama website for a full list.
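Once the download finishes, you can confirm which models are available locally:

ollama list

The output lists each model's name, tag, and size; any name shown there can be used with LiteLLM later by adding the ollama/ prefix.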
Install LiteLLM
With Python and Ollama ready, install LiteLLM using pip.
- Installation Command: Open your terminal or command prompt and run:
pip install litellm
# or if you use pip3
# pip3 install litellm
- Verification (Optional): You can verify the installation by importing LiteLLM in a Python interpreter:
python # or python3
>>> import litellm
>>> litellm.__version__
# This should print the installed version without errors.
>>> exit()
Run Ollama Locally
LiteLLM needs to connect to the Ollama API server. You need to ensure Ollama is running.
Verifying Ollama is Running
- macOS/Windows Desktop Apps: If you installed the desktop application, Ollama usually runs automatically in the background after installation. You should see an Ollama icon in your menu bar (macOS) or system tray (Windows).
- Linux / Manual Start: On Linux, or if you prefer manual control, you might need to start the Ollama server explicitly. Open a terminal and run:
ollama serve
This command will start the server, and it will typically keep running in that terminal window until you stop it (Ctrl+C). You might want to run this in the background or as a system service for long-term use.
- Check Status: You can check if the server is responsive by trying to access its default endpoint using curl (if available) or a web browser:
curl http://localhost:11434
You should get a response like "Ollama is running". If you get a "connection refused" error, Ollama is likely not running or is configured on a different port/address.
By default, Ollama exposes its API at:
- URL: http://localhost:11434
LiteLLM is pre-configured to know this default address. If your Ollama instance is running on a different host or port (e.g., inside a Docker container with port mapping, or on a remote server), you'll need to tell LiteLLM where to find it (covered in the Configuration section). For most standard local setups, the default works out of the box.
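As an extra sanity check, you can also ask the Ollama API which models it can serve; the /api/tags route returns the locally available models as JSON:

curl http://localhost:11434/api/tags

If the model you pulled earlier appears in that list, LiteLLM calls against it should succeed.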
Make Your First LiteLLM Call to Ollama
Now, let's write a simple Python script to send a request to the llama3 model running via Ollama, using LiteLLM.
Basic Python Script
Create a new Python file (e.g., ollama_test.py) and add the following code:
import litellm
import os
# Optional: Set verbose logging for LiteLLM to see what's happening
# litellm.set_verbose = True
# Define the model name - important: prefix with "ollama/"
# Ensure 'llama3' is the model you pulled with `ollama pull llama3`
model_name = "ollama/llama3"
# Define the messages for the chat completion
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is the sky blue?"}
]

try:
    print(f"--- Sending request to {model_name} ---")

    # Call the litellm.completion function
    response = litellm.completion(
        model=model_name,
        messages=messages
    )

    # Print the response
    print("--- Response ---")
    # The response object mirrors the OpenAI structure
    # Access the message content like this:
    message_content = response.choices[0].message.content
    print(message_content)

    # You can also print the entire response object for inspection
    # print("\n--- Full Response Object ---")
    # print(response)

except Exception as e:
    print(f"An error occurred: {e}")
Let's break down the function:
- litellm.completion(): This is the core function in LiteLLM for generating text completions (including chat-style completions).
- model: This parameter specifies which model/provider to use. Critically, for Ollama models, you must prefix the model name (as known by Ollama) with ollama/. So, llama3 becomes ollama/llama3, mistral becomes ollama/mistral, etc. This tells LiteLLM to route the request to an Ollama-compatible endpoint.
- messages: This follows the standard OpenAI chat completion format—a list of dictionaries, each with a role (system, user, or assistant) and content.
- Return Value: The response object returned by litellm.completion mimics the structure of the OpenAI API response object. This consistency is a key benefit of LiteLLM. You typically access the generated text via response.choices[0].message.content.
With these points in mind, let's run the script.
Ensure Ollama is running (as verified earlier) and that you have the llama3 model pulled. Then, run the script from your terminal:
python ollama_test.py
You should see output similar to this (the exact text will vary based on the model's response):
--- Sending request to ollama/llama3 ---
--- Response ---
The sky appears blue because of a phenomenon called Rayleigh scattering. Sunlight reaching Earth's atmosphere is composed of different colors of light, which have different wavelengths. Blue and violet light have shorter wavelengths, while red and orange light have longer wavelengths.
As sunlight travels through the atmosphere, it collides with tiny gas molecules (mostly nitrogen and oxygen). These molecules scatter the sunlight in all directions. Shorter wavelengths (blue and violet) are scattered more effectively than longer wavelengths (red and orange).
Our eyes are more sensitive to blue than violet, and some violet light is absorbed higher in the atmosphere, so we perceive the sky as primarily blue during the daytime when the sun is high.
Near sunrise and sunset, sunlight passes through more of the atmosphere to reach our eyes. Most of the blue light is scattered away, allowing more of the longer wavelength red and orange light to reach us, which is why sunrises and sunsets often appear reddish or orange.
Congratulations! You've successfully used LiteLLM to communicate with your local Ollama model.
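Before moving on, note that the response carries more than the generated text. Because LiteLLM mirrors the OpenAI response schema, metadata such as the model name, finish reason, and token usage is available as well (token counts depend on what the provider reports). A small sketch:

import litellm

response = litellm.completion(
    model="ollama/llama3",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)

# Metadata travels alongside the text in the OpenAI-style response object.
print(response.model)                     # which model handled the request
print(response.choices[0].finish_reason)  # e.g. "stop"
print(response.usage)                     # token counts, when reported by the provider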
Stream Responses from Ollama
For interactive applications (like chatbots) or when generating long responses, waiting for the entire completion can lead to a poor user experience. Streaming allows you to receive the response token by token as it's generated. LiteLLM makes this easy.
Why Streaming?
- Improved Perceived Performance: Users see output immediately, making the application feel more responsive.
- Handling Long Responses: Process parts of the response without waiting for the whole thing, useful for very long text generation.
- Real-time Interaction: Enables building real-time conversational interfaces.
Implementing Streaming with LiteLLM
Modify the previous script (ollama_test.py) to enable streaming:
import litellm
import os
# litellm.set_verbose = True # Optional for debugging
model_name = "ollama/llama3"
messages = [
    {"role": "system", "content": "You are a concise poet."},
    {"role": "user", "content": "Write a short haiku about a cat."}
]

try:
    print(f"--- Sending streaming request to {model_name} ---")

    # Set stream=True
    response_stream = litellm.completion(
        model=model_name,
        messages=messages,
        stream=True  # Enable streaming
    )

    print("--- Streaming Response ---")
    full_response = ""
    # Iterate through the stream chunks
    for chunk in response_stream:
        # Each chunk mimics the OpenAI streaming chunk structure
        # Access the content delta like this:
        content_delta = chunk.choices[0].delta.content
        if content_delta:  # Check if there's new content in this chunk
            print(content_delta, end="", flush=True)  # Print immediately without newline
            full_response += content_delta

    print("\n--- End of Stream ---")
    # print(f"\nFull reconstructed response:\n{full_response}") # Optional: show full response at the end

except Exception as e:
    print(f"\nAn error occurred: {e}")
Changes:
- stream=True: Added this parameter to the litellm.completion call.
- Iteration: The function now returns an iterator (response_stream). We loop through this iterator.
- Chunk Handling: Each chunk in the loop represents a small part of the response. We access the new text fragment using chunk.choices[0].delta.content. The delta attribute contains the difference from the previous chunk (usually a few characters or a word).
- print(..., end="", flush=True): Prints the chunk content immediately without adding a newline, flushing the output buffer to ensure it appears in the terminal right away.
Run this updated script:
python ollama_test.py
You'll see the haiku appear word by word or character by character in your terminal, demonstrating the streaming behavior.
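If your application is asynchronous (for example, a FastAPI service), LiteLLM also provides an async counterpart, litellm.acompletion, which accepts the same stream=True flag. A minimal sketch, assuming the same ollama/llama3 model:

import asyncio
import litellm

async def stream_haiku():
    # acompletion is the async twin of completion; with stream=True the
    # awaited result can be consumed with `async for`.
    response = await litellm.acompletion(
        model="ollama/llama3",
        messages=[{"role": "user", "content": "Write a short haiku about a cat."}],
        stream=True,
    )
    async for chunk in response:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()

asyncio.run(stream_haiku())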
Configuring the LiteLLM + Ollama Setup
While LiteLLM works with default Ollama setups out-of-the-box, you might need to configure it if your setup deviates from the standard http://localhost:11434.
Default Behavior (Localhost)
As mentioned, LiteLLM automatically assumes Ollama is running at http://localhost:11434 when you use the ollama/ prefix. If this is your setup, no extra configuration is needed.
Using Environment Variables (Optional)
If Ollama is running on a different host or port, you can tell LiteLLM where to find it using environment variables. LiteLLM checks for specific environment variables to configure API endpoints. For a specific Ollama model (or all Ollama models if you want a general override), you can set its base URL.
For example, if your Ollama instance is running at http://192.168.1.100:11434, you could set an environment variable before running your Python script:
Linux/macOS:
export OLLAMA_API_BASE="http://192.168.1.100:11434"
python your_script.py
Windows (Command Prompt):
set OLLAMA_API_BASE=http://192.168.1.100:11434
python your_script.py
Windows (PowerShell):
$env:OLLAMA_API_BASE = "http://192.168.1.100:11434"
python your_script.py
Now, when your script calls litellm.completion(model="ollama/llama3", ...), LiteLLM will look for the OLLAMA_API_BASE environment variable and use that URL instead of the default localhost.
Note: Setting OLLAMA_API_BASE overrides the base URL for all models starting with ollama/. Consult the LiteLLM documentation for more granular environment variable controls if needed (e.g., setting the base URL for a specific model alias).
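If you'd rather not touch the environment at all, litellm.completion also accepts an api_base argument on a per-call basis, which can be handy for one-off scripts or tests. A small sketch, reusing the remote address from the example above:

import litellm

# Point this single call at a non-default Ollama instance.
response = litellm.completion(
    model="ollama/llama3",
    messages=[{"role": "user", "content": "Hello from a remote Ollama instance"}],
    api_base="http://192.168.1.100:11434",  # same address as in the environment variable example
)
print(response.choices[0].message.content)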
Using a Configuration File (Optional)
For more complex configurations involving multiple models, custom settings, or avoiding environment variables, LiteLLM supports a configuration file (config.yaml or a specified path).
Create a config.yaml file in the same directory as your script (or elsewhere, and point LiteLLM to it):
# config.yaml
model_list:
  - model_name: ollama/llama3          # The name you use in litellm.completion
    litellm_params:
      model: ollama/llama3             # The actual model identifier for the provider
      api_base: "http://localhost:11434"  # Default, change if needed
  - model_name: ollama/mistral-remote  # Example: Alias for a remote Ollama
    litellm_params:
      model: ollama/mistral            # Ollama expects 'mistral'
      api_base: "http://192.168.1.100:11434"
  - model_name: gpt-4o-mini            # Example: Including a non-Ollama model
    litellm_params:
      model: gpt-4o-mini
      api_key: "os.environ/OPENAI_API_KEY"  # Securely read from env var

# Optional: Set global settings like timeouts
# general_settings:
#   timeout: 300  # 5 minutes

# You can define custom environment variables within the config too
environment_variables:
  OPENAI_API_KEY: ""  # Define placeholders or actual keys (less secure)
To use this config file, you need to tell LiteLLM to load it, typically at the start of your application:
import litellm
import os
# Load configuration from config.yaml in the current directory
# You can also provide an absolute path: litellm.load_config("path/to/your/config.yaml")
try:
    litellm.load_config()
    print("LiteLLM configuration loaded successfully.")
except Exception as e:
    print(f"Warning: Could not load LiteLLM config. Using defaults. Error: {e}")

# Now you can use the model names defined in the config
try:
    # Using the standard ollama/llama3 which might pick up the api_base from the config
    response_local = litellm.completion(model="ollama/llama3", messages=[{"role": "user", "content": "Test local"}])
    print("Local Llama3 Response:", response_local.choices[0].message.content[:50], "...")  # Print snippet

    # Using the alias defined for the remote mistral model
    # response_remote = litellm.completion(model="ollama/mistral-remote", messages=[{"role": "user", "content": "Test remote"}])
    # print("Remote Mistral Response:", response_remote.choices[0].message.content[:50], "...")
except Exception as e:
    print(f"An error occurred during completion: {e}")
The configuration file offers a structured way to manage settings for multiple models, including Ollama instances potentially running on different machines.
Leveraging Advanced LiteLLM Features with Ollama
Beyond basic calls and streaming, LiteLLM offers features that enhance robustness and observability, which work seamlessly with Ollama.
Model Aliasing
While the config file allows defining aliases, you can also register them programmatically. This is useful for simplifying model names or mapping generic names to specific Ollama models.
import litellm
# Define an alias: map "my-local-llm" to "ollama/llama3"
litellm.register_model({
    "my-local-llm": {
        "model": "ollama/llama3",
        # You could also specify api_base here if needed for this alias specifically
        # "api_base": "http://localhost:11434"
    }
})
# Now use the alias in your completion call
messages = [{"role": "user", "content": "Tell me about model aliasing."}]
response = litellm.completion(model="my-local-llm", messages=messages)
print(response.choices[0].message.content)
Error Handling and Retries
Network glitches or temporary Ollama issues can occur. LiteLLM has built-in retry logic.
import litellm
import time
# Example: Make Ollama temporarily unavailable (e.g., stop `ollama serve`)
print("Stopping Ollama for 10 seconds (if possible)... You might need to do this manually.")
# os.system("ollama stop") # This command might not exist; manual stop is safer
# time.sleep(10)
# print("Restarting Ollama... You might need to do this manually.")
# os.system("ollama serve &") # Start in background
# time.sleep(5) # Give it time to start
messages = [{"role": "user", "content": "Does retry work?"}]
try:
    # LiteLLM will automatically retry on specific connection errors
    # You can configure the number of retries, backoff factor, etc.
    response = litellm.completion(
        model="ollama/llama3",
        messages=messages,
        num_retries=3,  # Attempt up to 3 retries
        timeout=10      # Set a timeout for each request attempt (seconds)
    )
    print("Success after retries (or on first try):")
    print(response.choices[0].message.content)
except Exception as e:
    # This will catch errors that persist after retries (e.g., model not found, config error)
    # or if all retries fail for connection errors.
    print(f"An error occurred after retries: {e}")
LiteLLM intelligently retries common transient network errors. You can customize retry behavior globally or per call.
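For the global route, LiteLLM exposes module-level defaults that apply to any call that doesn't set these options explicitly. The attribute names below are taken from recent LiteLLM releases, so treat them as an assumption and double-check them against the documentation for your installed version:

import litellm

# Assumed module-level defaults; verify the exact names for your LiteLLM version.
litellm.num_retries = 3        # retry transient failures up to 3 times by default
litellm.request_timeout = 30   # default per-request timeout in seconds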
Logging and Callbacks
LiteLLM provides hooks to log request/response data or trigger custom functions (callbacks) on successful calls or errors. This is invaluable for monitoring, debugging, and cost tracking (though cost tracking is less relevant for local Ollama unless you assign virtual costs).
import litellm
import logging
# Configure basic logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')
logger = logging.getLogger(__name__)
# Define a simple success callback function
def log_success(kwargs, completion_response, start_time, end_time):
    """Logs details about a successful LLM call."""
    model = kwargs.get("model", "unknown_model")
    input_text = kwargs.get("messages", [])[-1]['content'] if kwargs.get("messages") else "N/A"
    output_text = completion_response.choices[0].message.content[:50] + "..."  # Snippet
    duration = (end_time - start_time).total_seconds()
    logger.info(f"Success! Model: {model}, Duration: {duration:.2f}s, Input: '{input_text[:30]}...', Output: '{output_text}'")

# Define a simple failure callback function
def log_failure(kwargs, exception, start_time, end_time):
    """Logs details about a failed LLM call."""
    model = kwargs.get("model", "unknown_model")
    duration = (end_time - start_time).total_seconds()
    logger.error(f"Failure! Model: {model}, Duration: {duration:.2f}s, Error: {exception}")

# Register the callbacks with LiteLLM
litellm.success_callback = [log_success]
litellm.failure_callback = [log_failure]

# Now make a call - callbacks will trigger automatically
messages = [{"role": "user", "content": "How do callbacks work in LiteLLM?"}]
try:
    response = litellm.completion(model="ollama/llama3", messages=messages)
    # print(response.choices[0].message.content) # You can still use the response
except Exception as e:
    pass  # Failure callback already handled the logging

# Example of a call that might fail (e.g., model not pulled)
# try:
#     response_fail = litellm.completion(model="ollama/nonexistent-model", messages=messages)
# except Exception as e:
#     pass  # Failure callback will log
Run this script, and you'll see INFO or ERROR messages logged by your callback functions, providing visibility into the LLM interactions. LiteLLM also integrates with platforms like Langfuse, Helicone, PromptLayer, etc., for more advanced observability.
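As an alternative (or complement) to hand-written callbacks, LiteLLM can forward call data to supported observability platforms simply by naming them in success_callback. The sketch below targets Langfuse and assumes a Langfuse account, the langfuse package installed, and the standard Langfuse key variables; confirm the details in the LiteLLM and Langfuse docs before relying on it:

import os
import litellm

# Placeholder credentials for illustration only; use your real Langfuse keys.
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."

# Naming the integration sends call metadata to Langfuse automatically.
litellm.success_callback = ["langfuse"]

response = litellm.completion(
    model="ollama/llama3",
    messages=[{"role": "user", "content": "Trace this call."}],
)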
Managing API Keys
While Ollama itself doesn't typically require API keys for local access, if your application also uses cloud providers (OpenAI, Anthropic, etc.) via LiteLLM, you'll need to manage those keys. LiteLLM looks for keys in standard environment variables (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY), or they can be set in the config.yaml or passed directly to the completion call (less recommended for security). Using a config.yaml or environment variables is the preferred approach.
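In practice the pattern looks like this: Ollama calls stay key-free, while cloud calls pick their keys up from the environment. A short sketch (the cloud call is skipped if no key is present):

import os
import litellm

# Ollama runs locally and needs no API key.
local = litellm.completion(model="ollama/llama3",
                           messages=[{"role": "user", "content": "Hi"}])
print(local.choices[0].message.content[:60])

# Cloud providers read their keys from standard environment variables.
if os.environ.get("OPENAI_API_KEY"):
    cloud = litellm.completion(model="gpt-4o-mini",
                               messages=[{"role": "user", "content": "Hi"}])
    print(cloud.choices[0].message.content[:60])
else:
    print("Set OPENAI_API_KEY to enable cloud calls.")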
How to Use the LiteLLM Proxy Server (Recommended for Robustness)
While directly using the litellm library in your Python code is great for simple scripts or single applications, the LiteLLM Proxy offers a more robust and scalable solution, especially for microservices or when multiple applications need access to LLMs (including Ollama).
What is the LiteLLM Proxy?
The LiteLLM Proxy is a standalone server you run separately. Your applications then make standard OpenAI-compatible API requests to the Proxy's endpoint. The Proxy, configured with your model details (including Ollama instances, API keys for cloud providers, etc.), handles routing the request to the correct underlying LLM, managing keys, enforcing rate limits, logging, retries, and more.
Setting up the Proxy with Ollama
Install Proxy Dependencies:
pip install 'litellm[proxy]'
Create a Proxy Config File (proxy_config.yaml):
This file tells the proxy about your available models.
# proxy_config.yaml
model_list:
  - model_name: local-llama3           # How users will call this model via the proxy
    litellm_params:
      model: ollama/llama3             # Tells the proxy to use the ollama driver for 'llama3'
      api_base: "http://localhost:11434"  # Ensure this points to your Ollama
  - model_name: local-mistral
    litellm_params:
      model: ollama/mistral
      api_base: "http://localhost:11434"  # Assuming same Ollama instance
  # Example: Adding an OpenAI model alongside Ollama
  - model_name: cloud-gpt4o
    litellm_params:
      model: gpt-4o-mini
      api_key: "os.environ/OPENAI_API_KEY"  # Proxy reads env var

# Optional: Set proxy-level settings
# litellm_settings:
#   drop_params: True   # Removes unsupported params before sending to model
#   set_verbose: True   # Enable detailed proxy logging

# environment_variables:  # Define env vars for the proxy if needed
#   OPENAI_API_KEY: "your_openai_key_here"  # Less secure - prefer real env vars
Run the Proxy Server:
Open a terminal, ensure your OPENAI_API_KEY environment variable is set (if using OpenAI models), and run:
litellm --config /path/to/proxy_config.yaml --port 8000
# Replace /path/to/proxy_config.yaml with the actual path
# --port 8000 specifies the port the proxy listens on (default is 4000)
The proxy server will start and log information, indicating it's ready to receive requests.
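Before wiring up a client library, you can sanity-check the proxy with curl. It speaks the OpenAI chat completions format, so a request to its /chat/completions route using one of the model names from proxy_config.yaml should return a normal completion:

curl http://localhost:8000/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local-llama3",
        "messages": [{"role": "user", "content": "Hello from curl"}]
      }'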
Making Calls Through the Proxy
Now, instead of using litellm.completion directly, your application code (which could even be in a different language) makes standard OpenAI API calls to the proxy's endpoint (http://localhost:8000 in this example). You can use the openai Python library or simple requests.
Using the openai library:
First, install it: pip install openai
import openai
import os
# Point the OpenAI client to the LiteLLM Proxy endpoint
client = openai.OpenAI(
    base_url="http://localhost:8000",  # URL of YOUR LiteLLM Proxy
    api_key="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # Required by OpenAI client, but LiteLLM Proxy ignores it by default
)

# Define messages
messages = [
    {"role": "system", "content": "You are a proxy-powered assistant."},
    {"role": "user", "content": "Explain the benefit of using a proxy for LLM calls."}
]

try:
    print("--- Sending request via LiteLLM Proxy ---")
    # Use the model name defined in proxy_config.yaml ('local-llama3')
    response = client.chat.completions.create(
        model="local-llama3",  # Use the alias from proxy_config.yaml
        messages=messages,
        stream=False  # Set to True for streaming via proxy
    )
    print("--- Response from Proxy ---")
    print(response.choices[0].message.content)

    # Example Streaming via Proxy
    print("\n--- Streaming via Proxy ---")
    stream_response = client.chat.completions.create(
        model="local-llama3",
        messages=messages,
        stream=True
    )
    for chunk in stream_response:
        if chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print("\n--- End of Stream ---")

except openai.APIConnectionError as e:
    print(f"Connection Error: Could not connect to LiteLLM Proxy at {client.base_url}. Is it running?")
except openai.APIStatusError as e:
    print(f"API Error: Status={e.status_code}, Response={e.response}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
Benefits of the Proxy Approach
- Centralized Management: Manage all model configurations (Ollama, OpenAI, etc.) and API keys in one place.
- Standard OpenAI Interface: All your applications interact using the familiar OpenAI SDK/API format, regardless of the backend model.
- Language Agnostic: Any application that can make HTTP requests can use the proxy.
- Load Balancing/Routing: Configure routing rules, fallbacks, and load balancing across multiple model deployments (including multiple Ollama instances).
- Consistent Logging/Monitoring: Centralized logging and observability across all LLM calls.
- Security: API keys for cloud providers are stored securely on the proxy server, not distributed in client applications.
For anything beyond simple scripts, using the LiteLLM Proxy is highly recommended when integrating Ollama into a larger ecosystem.
Troubleshooting Common Issues
Here are some common problems you might encounter and how to fix them:
Connection Errors (connection refused, timeout)
- Is Ollama Running? Verify that the Ollama server is active (see Run Ollama Locally above). Check the Ollama logs for errors.
- Correct API Base? Are you sure Ollama is at http://localhost:11434? If not, configure LiteLLM using the OLLAMA_API_BASE environment variable, the config file, or the proxy config (see the configuration and proxy sections above).
- Firewall Issues? If Ollama is running on a different machine or inside Docker, ensure firewall rules allow connections to the Ollama port (default 11434) from where your LiteLLM code/proxy is running.
- LiteLLM Proxy Running? If using the proxy, ensure the litellm --config ... command is running and accessible on the specified host/port.
Model Not Found Errors (litellm.exceptions.NotFoundError, 404 Model Not Found)
- Correct Model Name? Did you use the ollama/ prefix (e.g., ollama/llama3) in your litellm.completion call or proxy config?
- Model Pulled in Ollama? Did you actually run ollama pull <model_name> (e.g., ollama pull llama3)? Verify with ollama list.
- Typo? Double-check the spelling of the model name against the output of ollama list.
- Proxy Alias: If using the proxy, ensure the model name in your client call (e.g., local-llama3) exactly matches a model_name defined in the proxy_config.yaml.
Timeout Issues (litellm.exceptions.Timeout)
- Model Overloaded/Slow Hardware: Your machine might be taking too long to process the request. Try a smaller model (e.g., ollama pull phi3) or increase the timeout in LiteLLM:
# Direct call
response = litellm.completion(model="ollama/llama3", messages=messages, timeout=300) # 300 seconds
# Or set globally in config.yaml or proxy_config.yaml
# general_settings:
# timeout: 300
- Complex Prompt: Very long or complex prompts can take longer.
Dependency Conflicts:
- Ensure you are using compatible versions of Python, LiteLLM, and potentially the openai library if using the proxy. Consider using Python virtual environments (venv) to isolate project dependencies:
python -m venv myenv
source myenv/bin/activate # Linux/macOS
# myenv\Scripts\activate # Windows
pip install litellm 'litellm[proxy]' openai # Install within the venv
Verbose Logging: For deeper debugging, enable verbose logging in LiteLLM:
# For direct library use
litellm.set_verbose = True
# For proxy (in proxy_config.yaml)
# litellm_settings:
# set_verbose: True
This will print detailed information about the requests being made, headers, responses, and potential errors.
Conclusion and Next Steps for You
You have now learned how to effectively bridge the gap between the ease of running local models with Ollama and the standardized, feature-rich interface of LiteLLM. By prefixing your Ollama model names with ollama/, you can leverage LiteLLM's completion function, streaming capabilities, configuration options, error handling, and observability features with your local LLMs.
Using the LiteLLM Proxy provides an even more robust solution for managing Ollama instances alongside potentially many other cloud or local LLM providers through a unified, OpenAI-compatible API endpoint.
Where to go from here?
- Explore More Ollama Models: Try different models from the Ollama library (ollama pull <model>) and access them via LiteLLM (ollama/<model>).
- Dive Deeper into LiteLLM Proxy: Explore advanced proxy features like routing strategies, budget management, and UI dashboards (LiteLLM UI).
- Integrate with Frameworks: Use LiteLLM with frameworks like LangChain or LlamaIndex, specifying ollama/<model> as the model name when configuring the LLM component (see the sketch after this list).
- Experiment with Callbacks: Integrate LiteLLM's callbacks with monitoring tools (Langfuse, Helicone, etc.) for better tracking of your local model usage.
- Contribute: Both LiteLLM (github.com/BerriAI/litellm) and Ollama (github.com/ollama/ollama) are open-source projects.
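For the framework route mentioned above, here is a rough sketch of what the LangChain integration can look like; it assumes the langchain-community package and its ChatLiteLLM wrapper, so check the current LangChain documentation for the up-to-date import path before relying on it:

# pip install langchain-community litellm
from langchain_community.chat_models import ChatLiteLLM

# ChatLiteLLM routes calls through LiteLLM, so the ollama/ prefix works here too.
llm = ChatLiteLLM(model="ollama/llama3")
print(llm.invoke("Give me one reason to run LLMs locally.").content)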
By combining LiteLLM and Ollama, you gain significant flexibility and control over how you integrate large language models into your applications, enabling powerful local-first AI solutions.