This guide will walk you through how to use one of Ollama's powerful new features: the ability to stream responses and call tools (like functions or APIs) in real time. This is a game-changer for building chat applications that feel alive and can interact with the world around them.
What you'll learn in this tutorial:
- What streaming responses and tool calling mean in Ollama.
- Why this combination is super useful for your AI projects.
- Step-by-step instructions to implement this using:
- cURL (for quick tests and universal access)
- Python (for backend applications)
- JavaScript (for web and Node.js applications)
- A peek into how Ollama cleverly handles these features.
- Tips for getting the best performance.
Want an integrated, All-in-One platform for your Developer Team to work together with maximum productivity?
Apidog meets all your demands and replaces Postman at a much more affordable price!
Getting Started: What You'll Need
To follow along, you'll need a few things:
- Ollama Installed: Make sure you have the latest version of Ollama running on your system. If not, head over to the official Ollama website to download and install it.
- Basic Command-Line Knowledge: For the cURL examples.
- Python Environment (for the Python section): Python 3.x installed, along with pip for managing packages.
- Node.js Environment (for the JavaScript section): Node.js and npm installed.
- Understanding of JSON: Ollama uses JSON for structuring data and tool calls.
Understanding Key Ideas: Streaming and Tool Calls
Let's break down what we mean by "streaming responses" and "tool calling."
What is Response Streaming?
Imagine you're chatting with an AI. Instead of waiting for it to think and type out its entire answer before you see anything, streaming means the AI sends its response to you piece by piece, word by word, as it generates it. This makes the interaction feel much faster and more natural, like a real conversation.
With Ollama, when you enable streaming ("stream": true), you get these incremental updates.
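With streaming enabled, the chat endpoint returns a sequence of newline-delimited JSON objects, each carrying a small slice of the reply. A single chunk looks roughly like this (the full set of fields appears in the cURL example later in this guide):

{
  "model": "qwen3",
  "created_at": "2025-05-27T22:54:58.100509Z",
  "message": { "role": "assistant", "content": "Hello" },
  "done": false
}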
How Does Tool Calling Work?
Tool calling allows your AI models to do more than just generate text. You can define "tools" – which are essentially functions or external APIs – that the AI can decide to use to get information or perform actions.
For example, a tool could be:
- get_current_weather(location): Fetches the current weather.
- calculate_sum(number1, number2): Performs a calculation.
- search_web(query): Gets information from the internet.
You describe these tools to Ollama, and when the AI determines that using a tool would help answer the user's query, it signals its intent to call that tool with specific arguments. Your application then executes the tool and can send the results back to the AI to continue the conversation.
Why Combine Streaming with Tool Calling?
Ollama's big upgrade is that it can now handle tool calling while streaming responses. This means your application can:
- Receive initial text from the model (streamed).
- At some point, the stream might indicate that a tool call is needed.
- Your app processes the tool call.
- Meanwhile, the model might even stream more text (e.g., "Okay, I'll get the weather for you...").
- Once your app gets the tool's result, you can send it back to the model, and it will continue streaming its response, now informed by the tool's output.
This creates highly responsive and capable AI applications.
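To make that flow concrete, here is a hedged sketch of how the conversation's message list might look by the end of one tool-assisted turn (the field names match Ollama's chat API as used in the examples below; the weather value is invented):

[
  { "role": "user", "content": "What is the weather today in Toronto?" },
  {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      { "function": { "name": "get_current_weather", "arguments": { "location": "Toronto", "format": "celsius" } } }
    ]
  },
  { "role": "tool", "content": "20 degrees Celsius" },
  { "role": "assistant", "content": "It is currently about 20 degrees Celsius in Toronto." }
]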
Which Models Support These Features?
Ollama has enabled this for several popular models, including:
- Qwen 3
- Devstral
- Qwen2.5 and Qwen2.5-coder
- Llama 3.1
- Llama 4
- ...and more are continually being added!
How to Make Your First Streaming Tool Call with cURL
cURL is a great way to quickly test Ollama's API. Let's ask for the weather in Toronto.
Step 1: Conceptualizing Your Tool
Our tool will be get_current_weather. It needs:
- location (string): e.g., "Toronto"
- format (string): e.g., "celsius" or "fahrenheit"
Step 2: Building the cURL Command
Open your terminal and prepare the following command. We'll break it down:
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather today in Toronto?"
    }
  ],
  "stream": true,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The location to get the weather for, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "description": "The format to return the weather in, e.g. celsius or fahrenheit",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}'
Breakdown:
- curl http://localhost:11434/api/chat: The command and Ollama's chat API endpoint.
- -d '{...}': Sends the JSON data in the request body.
- "model": "qwen3": Specifies which AI model to use.
- "messages": [...]: The conversation history. Here, just the user's question.
- "stream": true: This is key! It tells Ollama to stream the response.
- "tools": [...]: An array where we define the tools available to the model.
- "type": "function": Specifies the tool type.
- "function": {...}: Describes the function.
- "name": "get_current_weather": The tool's name.
- "description": "...": Helps the model understand what the tool does.
- "parameters": {...}: Defines the arguments the tool accepts (using JSON Schema).
Step 3: Execute and Observe the Output
Press Enter. You'll see a series of JSON objects appear one after another. This is the stream!
Example snippets from the stream:
{
"model": "qwen3", "created_at": "...",
"message": { "role": "assistant", "content": "Okay, " }, "done": false
}
{
"model": "qwen3", "created_at": "...",
"message": { "role": "assistant", "content": "I will " }, "done": false
}
{
"model": "qwen3", "created_at": "...",
"message": { "role": "assistant", "content": "try to get that for you." }, "done": false
}
(Depending on its internal process, the model might also output some "thinking" tokens like <think>...celsius...</think>; these are part of the stream as well.)
Then, critically, you might see something like this:
{
  "model": "qwen3",
  "created_at": "2025-05-27T22:54:58.100509Z",
  "message": {
    "role": "assistant",
    "content": "", // Content might be empty when a tool call is made
    "tool_calls": [
      {
        "function": {
          "name": "get_current_weather",
          "arguments": { // The arguments the model decided on!
            "format": "celsius",
            "location": "Toronto"
          }
        }
      }
    ]
  },
  "done": false // Still not done, awaiting tool result
}
What to Notice:
- Each chunk is a JSON object.
- "done": false means the stream is ongoing. The final chunk will have "done": true.
- The "message" object contains:
  - "role": "assistant"
  - "content": The text part of the stream.
  - "tool_calls": An array that appears when the model wants to use a tool. It includes the tool's name and the arguments it decided on.
In a real application, when you see a tool_calls chunk, your code would:
- Pause processing the stream (or handle it asynchronously).
- Execute the actual get_current_weather function/API with "Toronto" and "celsius".
- Get the result (e.g., "20 degrees Celsius").
- Send this result back to Ollama in a new message with role: "tool" (a sketch of this follow-up request is shown below).
- The model will then use this information to continue generating its response, also streamed.
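Here is a hedged sketch of what that follow-up request could look like with cURL. The assistant message that triggered the tool call is echoed back along with a new "tool" message carrying the (made-up) result; the exact shape you need to echo may vary by model and Ollama version:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [
    { "role": "user", "content": "What is the weather today in Toronto?" },
    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        { "function": { "name": "get_current_weather", "arguments": { "location": "Toronto", "format": "celsius" } } }
      ]
    },
    { "role": "tool", "content": "20 degrees Celsius" }
  ],
  "stream": true
}'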
How to Stream Tool Calls Using Python
Let's implement a similar idea in Python using Ollama's official library.
Step 1: Installing the Ollama Python Library
If you haven't already, install or upgrade the library:
pip install -U ollama
Step 2: Defining Your Tool and Coding in Python
The Ollama Python SDK cleverly allows you to pass Python functions directly as tools. It inspects the function signature and docstring to create the schema for the AI.
Let's create a simple math tool. (The example below defines add_two_numbers, while the original material's sample output shows the model calling subtract_two_numbers; we'll stick with the add_two_numbers definition and let the model decide what to do based on the prompt.)
import ollama

# Define the python function that can be used as a tool
def add_two_numbers(a: int, b: int) -> int:
    """
    Add two numbers.

    Args:
        a (int): The first number as an int.
        b (int): The second number as an int.

    Returns:
        int: The sum of the two numbers.
    """
    print(f"--- Tool 'add_two_numbers' called with a={a}, b={b} ---")
    return a + b

# --- Main conversation logic ---
messages = [{'role': 'user', 'content': 'What is three plus one?'}]
# Or, for the subtraction example in the original output:
# messages = [{'role': 'user', 'content': 'what is three minus one?'}]

print(f"User: {messages[0]['content']}")

# Make the chat request with streaming and the tool
# Note: ChatResponse type hint might be ollama.ChatResponse or similar depending on library version
response_stream = ollama.chat(
    model='qwen3',  # Or another capable model
    messages=messages,
    tools=[
        {  # You can also define the tool explicitly if needed, or pass the function directly
            'type': 'function',
            'function': {
                'name': 'add_two_numbers',  # Must match the Python function name if you want it to be called directly by your code later
                'description': 'Add two integer numbers together.',
                'parameters': {
                    'type': 'object',
                    'properties': {
                        'a': {'type': 'integer', 'description': 'The first number'},
                        'b': {'type': 'integer', 'description': 'The second number'}
                    },
                    'required': ['a', 'b']
                }
            }
        }
        # Simpler way for Python: pass the function directly if the library supports easy schema generation from it
        # tools=[add_two_numbers]  # The SDK can often create the schema from this
    ],
    stream=True
)

print("Assistant (streaming):")
full_response_content = ""
tool_call_info = None

for chunk in response_stream:
    # Print the streamed content part
    if chunk['message']['content']:
        print(chunk['message']['content'], end='', flush=True)
        full_response_content += chunk['message']['content']

    # Check for tool calls in the chunk
    if 'tool_calls' in chunk['message'] and chunk['message']['tool_calls']:
        tool_call_info = chunk['message']['tool_calls'][0]  # Assuming one tool call for simplicity
        print(f"\n--- Detected Tool Call: {tool_call_info['function']['name']} ---")
        break  # Stop processing stream for now, handle tool call

    if chunk.get('done'):
        print("\n--- Stream finished ---")
        if not tool_call_info:
            print("No tool call was made.")

# --- If a tool call was detected, handle it ---
if tool_call_info:
    tool_name = tool_call_info['function']['name']
    tool_args = tool_call_info['function']['arguments']
    print(f"Arguments for the tool: {tool_args}")

    # Here, you'd actually call your Python tool function
    if tool_name == "add_two_numbers":
        # For safety, ensure arguments are of correct type if necessary
        try:
            arg_a = int(tool_args.get('a'))
            arg_b = int(tool_args.get('b'))
            tool_result = add_two_numbers(a=arg_a, b=arg_b)
            print(f"--- Tool execution result: {tool_result} ---")

            # Now, send this result back to Ollama to continue the conversation
            messages.append({'role': 'assistant', 'content': full_response_content, 'tool_calls': [tool_call_info]})
            messages.append({
                'role': 'tool',
                'content': str(tool_result),  # Result must be a string
                'tool_call_id': tool_call_info.get('id', '')  # If your library/model provides a tool_call_id
            })

            print("\n--- Sending tool result back to model ---")
            follow_up_response_stream = ollama.chat(
                model='qwen3',
                messages=messages,
                stream=True
                # No tools needed here unless you expect another tool call
            )

            print("Assistant (after tool call):")
            for follow_up_chunk in follow_up_response_stream:
                if follow_up_chunk['message']['content']:
                    print(follow_up_chunk['message']['content'], end='', flush=True)
                if follow_up_chunk.get('done'):
                    print("\n--- Follow-up stream finished ---")
                    break
        except ValueError:
            print("Error: Could not parse tool arguments as integers.")
        except Exception as e:
            print(f"An error occurred during tool execution or follow-up: {e}")
    else:
        print(f"Error: Unknown tool '{tool_name}' requested by the model.")
Explanation of the Python Code:
- Import ollama.
- add_two_numbers function: This is our tool. The docstring and type hints help Ollama understand its purpose and parameters.
- messages: We start the conversation with the user's query.
- ollama.chat(...):
  - model, messages, and stream=True work just as in the cURL example.
  - tools=[...]: We provide the tool definition. The Python SDK is quite flexible; you can pass the function object directly (e.g., tools=[add_two_numbers]) if it can infer the schema, or define it explicitly as shown.
- Looping through response_stream:
  - chunk['message']['content']: This is the streamed text. We print it immediately.
  - chunk['message']['tool_calls']: If this key exists and has content, the AI wants to use a tool. We store this tool_call_info and break the loop to handle it.
- Handling the Tool Call:
  - We extract the tool_name and tool_args.
  - We call our actual Python function (add_two_numbers) with these arguments.
  - Crucially: We then append to the messages list the assistant's partial response (the one that led to the tool call) and a new message with role: "tool" whose content is the stringified result of our function.
  - We make another ollama.chat call with these updated messages to get the AI's final response based on the tool's output.
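If your installed version of the ollama Python library supports building the schema from the function itself (recent versions can read the type hints and docstring), the call can be a bit shorter. A minimal sketch, assuming add_two_numbers is defined as above:

# Minimal sketch: let the SDK derive the tool schema from the function's
# signature and docstring instead of writing the JSON schema by hand.
response_stream = ollama.chat(
    model='qwen3',
    messages=[{'role': 'user', 'content': 'What is three plus one?'}],
    tools=[add_two_numbers],  # pass the Python function directly
    stream=True
)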
Expected Output Flow: You'll see the initial user question, then the assistant's response streaming in. If it decides to call add_two_numbers (or subtract_two_numbers, as in the original material's sample output where the prompt asked for subtraction), you'll see the "Detected Tool Call" message, the arguments, the result of your Python function, and then the assistant continuing its response using that result.
(The original sample output showed:
<think>
Okay, the user is asking ...
</think>
[ToolCall(function=Function(name='subtract_two_numbers', arguments={'a': 3, 'b': 1}))]
This indicates the AI's internal "thought" process and then the structured tool call object that the Python SDK provides.)
How to Stream Tool Calls Using JavaScript (Node.js)
Now, let's do the same with JavaScript, typically for a Node.js backend or web application.
Step 1: Installing the Ollama JavaScript Library
In your project directory, run:
npm i ollama
Step 2: Defining the Tool Schema and Coding in JavaScript
In JavaScript, you usually define the tool schema as a JSON object.
import ollama from 'ollama';

// Describe the tool schema (e.g., for adding two numbers)
const addTool = {
  type: 'function',
  function: {
    name: 'addTwoNumbers',
    description: 'Add two numbers together',
    parameters: {
      type: 'object',
      required: ['a', 'b'],
      properties: {
        a: { type: 'number', description: 'The first number' },
        b: { type: 'number', description: 'The second number' }
      }
    }
  }
};

// Your actual JavaScript function that implements the tool
function executeAddTwoNumbers(a, b) {
  console.log(`--- Tool 'addTwoNumbers' called with a=${a}, b=${b} ---`);
  return a + b;
}

async function main() {
  const messages = [{ role: 'user', content: 'What is 2 plus 3?' }];
  console.log('User:', messages[0].content);
  console.log('Assistant (streaming):');

  let assistantResponseContent = "";
  let toolToCallInfo = null;

  try {
    const responseStream = await ollama.chat({
      model: 'qwen3', // Or another capable model
      messages: messages,
      tools: [addTool],
      stream: true
    });

    for await (const chunk of responseStream) {
      if (chunk.message.content) {
        process.stdout.write(chunk.message.content);
        assistantResponseContent += chunk.message.content;
      }
      if (chunk.message.tool_calls && chunk.message.tool_calls.length > 0) {
        toolToCallInfo = chunk.message.tool_calls[0]; // Assuming one tool call
        process.stdout.write(`\n--- Detected Tool Call: ${toolToCallInfo.function.name} ---\n`);
        break; // Stop processing stream to handle tool call
      }
      if (chunk.done) {
        process.stdout.write('\n--- Stream finished ---\n');
        if (!toolToCallInfo) {
          console.log("No tool call was made.");
        }
        break;
      }
    }

    // --- If a tool call was detected, handle it ---
    if (toolToCallInfo) {
      const toolName = toolToCallInfo.function.name;
      const toolArgs = toolToCallInfo.function.arguments;
      console.log(`Arguments for the tool:`, toolArgs);

      let toolResult;
      if (toolName === 'addTwoNumbers') {
        toolResult = executeAddTwoNumbers(toolArgs.a, toolArgs.b);
        console.log(`--- Tool execution result: ${toolResult} ---`);

        // Append assistant's partial message and the tool message
        messages.push({
          role: 'assistant',
          content: assistantResponseContent, // Include content leading up to tool call
          tool_calls: [toolToCallInfo]
        });
        messages.push({
          role: 'tool',
          content: toolResult.toString(), // Result must be a string
          // tool_call_id: toolToCallInfo.id // If available and needed
        });

        console.log("\n--- Sending tool result back to model ---");
        const followUpStream = await ollama.chat({
          model: 'qwen3',
          messages: messages,
          stream: true
        });

        console.log("Assistant (after tool call):");
        for await (const followUpChunk of followUpStream) {
          if (followUpChunk.message.content) {
            process.stdout.write(followUpChunk.message.content);
          }
          if (followUpChunk.done) {
            process.stdout.write('\n--- Follow-up stream finished ---\n');
            break;
          }
        }
      } else {
        console.error(`Error: Unknown tool '${toolName}' requested.`);
      }
    }
  } catch (error) {
    console.error('Error during Ollama chat:', error);
  }
}

main().catch(console.error);
Explanation of the JavaScript Code:
- Import ollama.
- addTool object: This is the JSON schema describing our tool to Ollama.
- executeAddTwoNumbers function: Our actual JavaScript function for the tool.
- main async function:
  - The messages array starts the conversation.
  - await ollama.chat({...}): Makes the call.
  - tools: [addTool]: Passes our tool schema.
  - stream: true: Enables streaming.
  - for await (const chunk of responseStream): This loop processes each streamed chunk.
  - chunk.message.content: Text part of the stream.
  - chunk.message.tool_calls: If present, the AI wants to use a tool. We store toolToCallInfo.
- Handling the Tool Call: Similar to Python, if toolToCallInfo is set:
  - Extract the name and arguments.
  - Call executeAddTwoNumbers().
  - Append the assistant's message (the one that included the tool call request) and a new role: "tool" message with the result to the messages array.
  - Make another ollama.chat call with the updated messages to get the final response.
Expected Output Flow (similar to the cURL and Python examples): You'll see the user's question, then the assistant's response streaming. When it decides to call addTwoNumbers, it will print the tool call information, the result from your JavaScript function, and then continue streaming the AI's answer based on that result.
The original sample output for JS looked like:
Question: What is 2 plus 3?
<think>
Okay, the user is asking...
</think>
Tool call: {
function: {
name: "addTwoNumbers",
arguments: { a: 2, b: 3 },
},
}
How Ollama Handles Tool Parsing During Streaming
You might wonder how Ollama manages to stream text and identify tool calls so smoothly. It uses a clever new incremental parser.
- Old Way: Many systems had to wait for the entire AI response, then scan it for tool calls (usually formatted as JSON). This blocked streaming because a tool call could appear anywhere.
- Ollama's New Way:
- The parser looks at each model's specific template to understand how it signals a tool call (e.g., special tokens or prefixes).
- This allows Ollama to identify tool calls "incrementally" as the data streams in, separating them from regular text content.
- It's smart enough to handle models that weren't explicitly trained with tool prefixes but still manage to output valid tool call structures. It can even fall back to looking for JSON-like structures if needed, but intelligently, so it doesn't just grab any JSON.
Why is this better?
- True Streaming: You get text immediately, and tool calls are identified on the fly.
- Accuracy: It's better at avoiding false positives (e.g., if the AI talks about a tool call it made previously, the new parser is less likely to mistakenly trigger it again).
Tip: Improving Performance with the Context Window
For more complex interactions, especially with tool calling, the size of the "context window" the model uses can matter. A larger context window means the model remembers more of the current conversation.
- Model Context Protocol (MCP): Ollama's improvements work well with MCP.
- num_ctx: You can often suggest a context window size. For example, 32,000 (32k) tokens or higher might improve tool calling performance and the quality of results.
- Trade-off: Larger context windows use more memory.
Example: Setting the Context Window with cURL (use a model that supports larger contexts, such as llama3.1 or llama4):
curl -X POST "http://localhost:11434/api/chat" -d '{
  "model": "llama3.1",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ],
  "options": {
    "num_ctx": 32000
  }
}'
Experiment with this setting if you find tool calling isn't as reliable as you'd like.
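The same option can be passed from the Python library as well; a minimal sketch (the 32,000-token value is just a starting point to experiment with):

import ollama

# Request a larger context window via the options dict; larger values use more memory.
response = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'why is the sky blue?'}],
    options={'num_ctx': 32000}
)
print(response['message']['content'])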
Where to Go From Here?
You now have the fundamentals to build sophisticated, real-time AI applications with Ollama using streaming responses and tool calling!
Ideas to explore:
- Connect tools to real-world APIs (weather, stocks, search engines).
- Build agents that can perform multi-step tasks.
- Create more natural and responsive chatbots.
Refer to the official Ollama documentation and its GitHub repository (including the "Tool Streaming Pull Request" mentioned in the original source material for deeper technical dives) for the latest updates and more advanced examples.
Want an integrated, All-in-One platform for your Developer Team to work together with maximum productivity?
Apidog meets all your demands and replaces Postman at a much more affordable price!