How to Use Ollama for Streaming Responses and Tool Calling

Mark Ponomarev

Updated on May 29, 2025

This guide will walk you through how to use one of Ollama's powerful new features: the ability to stream responses and call tools (like functions or APIs) in real time. This is a game-changer for building chat applications that feel alive and can interact with the world around them.

What you'll learn in this tutorial:

  • What streaming responses and tool calling mean in Ollama.
  • Why this combination is super useful for your AI projects.
  • Step-by-step instructions to implement this using:
      • cURL (for quick tests and universal access)
      • Python (for backend applications)
      • JavaScript (for web and Node.js applications)
  • A peek into how Ollama cleverly handles these features.
  • Tips for getting the best performance.
💡
Want a great API Testing tool that generates beautiful API Documentation?

Want an integrated, All-in-One platform for your Developer Team to work together with maximum productivity?

Apidog delivers all your demands and replaces Postman at a much more affordable price!

Getting Started: What You'll Need

To follow along, you'll need a few things:

  • Ollama Installed: Make sure you have the latest version of Ollama running on your system. If not, head over to the official Ollama website to download and install it.
  • Basic Command-Line Knowledge: For the cURL examples.
  • Python Environment (for Python section): Python 3.x installed, along with pip for managing packages.
  • Node.js Environment (for JavaScript section): Node.js and npm installed.
  • Understanding of JSON: Ollama uses JSON for structuring data and tool calls.

Understanding Key Ideas: Streaming and Tool Calls

Let's break down what we mean by "streaming responses" and "tool calling."

What is Response Streaming?

Imagine you're chatting with an AI. Instead of waiting for it to think and type out its entire answer before you see anything, streaming means the AI sends its response to you piece by piece, word by word, as it generates it. This makes the interaction feel much faster and more natural, like a real conversation.

With Ollama, when you enable streaming ("stream": true), you get these incremental updates.
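
As a quick preview, here is a minimal sketch of consuming a streamed reply with the Ollama Python library (installed and explained in the Python section below); the model name is just an example:

import ollama

# Each chunk arrives as soon as the model produces it, rather than as one big reply.
stream = ollama.chat(
  model='qwen3',
  messages=[{'role': 'user', 'content': 'Tell me a short joke.'}],
  stream=True
)

for chunk in stream:
  # Print each fragment immediately for a "live typing" effect.
  print(chunk['message']['content'], end='', flush=True)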

How Does Tool Calling Work?

Tool calling allows your AI models to do more than just generate text. You can define "tools" – which are essentially functions or external APIs – that the AI can decide to use to get information or perform actions.

For example, a tool could be:

  • get_current_weather(location): Fetches the current weather.
  • calculate_sum(number1, number2): Performs a calculation.
  • search_web(query): Gets information from the internet.

You describe these tools to Ollama, and when the AI determines that using a tool would help answer the user's query, it signals its intent to call that tool with specific arguments. Your application then executes the tool and can send the results back to the AI to continue the conversation.
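
Conceptually, the round trip looks roughly like this (a sketch of the message shapes written as Python dictionaries; the values are illustrative, and the exact request format is shown in the cURL and Python sections below):

# 1. The model signals, in its streamed response, that it wants to call a tool:
assistant_message = {
  'role': 'assistant',
  'tool_calls': [{
    'function': {
      'name': 'get_current_weather',
      'arguments': {'location': 'Toronto', 'format': 'celsius'}
    }
  }]
}

# 2. Your application runs the real function and answers with a role "tool" message:
tool_message = {
  'role': 'tool',
  'content': '20 degrees Celsius'  # tool results are sent back as strings
}

# 3. Both messages are appended to the conversation and sent back to the model,
#    which then continues its (streamed) answer using the tool's output.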

Why Combine Streaming with Tool Calling?

Ollama's big upgrade is that it can now handle tool calling while streaming responses. This means your application can:

  1. Receive initial text from the model (streamed).
  2. Suddenly, the stream might indicate a tool call is needed.
  3. Your app processes the tool call.
  4. Meanwhile, the model might even stream more text (e.g., "Okay, I'll get the weather for you...").
  5. Once your app gets the tool's result, you can send it back to the model, and it will continue streaming its response, now informed by the tool's output.

This creates highly responsive and capable AI applications.

Which Models Support These Features?

Ollama has enabled this for several popular models, including:

  • Qwen 3
  • Devstral
  • Qwen2.5 and Qwen2.5-coder
  • Llama 3.1
  • Llama 4
  • ...and more are continually being added!

How to Make Your First Streaming Tool Call with cURL

cURL is a great way to quickly test Ollama's API. Let's ask for the weather in Toronto.

Step 1: Conceptualizing Your Tool

Our tool will be get_current_weather. It needs:

  • location (string): e.g., "Toronto"
  • format (string): e.g., "celsius" or "fahrenheit"

Step 2: Building the cURL Command

Open your terminal and prepare the following command. We'll break it down:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather today in Toronto?"
    }
  ],
  "stream": true,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The location to get the weather for, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "description": "The format to return the weather in, e.g. \\\\\\\\'celsius\\\\\\\\' or \\\\\\\\'fahrenheit\\\\\\\\'",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}'

Breakdown:

  • curl http://localhost:11434/api/chat: The command and Ollama's chat API endpoint.
  • -d '{...}': Sends the JSON data in the request body.
  • "model": "qwen3": Specifies which AI model to use.
  • "messages": [...]: The conversation history. Here, just the user's question.
  • "stream": true: This is key! It tells Ollama to stream the response.
  • "tools": [...]: An array where we define the tools available to the model.
  • "type": "function": Specifies the tool type.
  • "function": {...}: Describes the function.
  • "name": "get_current_weather": The tool's name.
  • "description": "...": Helps the model understand what the tool does.
  • "parameters": {...}: Defines the arguments the tool accepts (using JSON Schema).

Step 3: Execute and Observe the Output

Press Enter. You'll see a series of JSON objects appear one after another. This is the stream!

Example snippets from the stream:

{
  "model": "qwen3", "created_at": "...",
  "message": { "role": "assistant", "content": "Okay, " }, "done": false
}

{
  "model": "qwen3", "created_at": "...",
  "message": { "role": "assistant", "content": "I will " }, "done": false
}

{
  "model": "qwen3", "created_at": "...",
  "message": { "role": "assistant", "content": "try to get that for you." }, "done": false
}

(The model might output some "thinking" tokens like <think>...celsius...</think> depending on its internal process; these are also part of the stream.)

Then, critically, you might see something like this:

{
  "model": "qwen3",
  "created_at": "2025-05-27T22:54:58.100509Z",
  "message": {
    "role": "assistant",
    "content": "", // Content might be empty when a tool call is made
    "tool_calls": [
      {
        "function": {
          "name": "get_current_weather",
          "arguments": { // The arguments the model decided on!
            "format": "celsius",
            "location": "Toronto"
          }
        }
      }
    ]
  },
  "done": false // Still not done, awaiting tool result
}

What to Notice:

  • Each chunk is a JSON object.
  • "done": false means the stream is ongoing. The final chunk will have "done": true.
  • The "message" object contains:
  • "role": "assistant"
  • "content": The text part of the stream.
  • "tool_calls": An array that appears when the model wants to use a tool. It includes the tool's name and the arguments it decided on.

In a real application, when you see a tool_calls chunk, your code would:

  1. Pause processing the stream (or handle it async).
  2. Execute the actual get_current_weather function/API with "Toronto" and "celsius".
  3. Get the result (e.g., "20 degrees Celsius").
  4. Send this result back to Ollama in a new message with role: "tool".
  5. The model will then use this information to continue generating its response, also streamed.

How to Stream Tool Calls Using Python

Let's implement a similar idea in Python using Ollama's official library.

Step 1: Installing the Ollama Python Library

If you haven't already, install or upgrade the library:

pip install -U ollama

Step 2: Defining Your Tool and Coding in Python

The Ollama Python SDK cleverly allows you to pass Python functions directly as tools. It inspects the function signature and docstring to create the schema for the AI.
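
For instance, a minimal sketch of that shortcut (assuming a recent version of the ollama package that can build the tool schema from a callable's type hints and docstring):

import ollama

def add_two_numbers(a: int, b: int) -> int:
  """Add two numbers."""
  return a + b

# Pass the function itself; the SDK derives the JSON Schema for the tool
# from the signature and docstring, so no explicit definition is needed.
stream = ollama.chat(
  model='qwen3',
  messages=[{'role': 'user', 'content': 'What is three plus one?'}],
  tools=[add_two_numbers],
  stream=True
)

In the full example below we define the schema explicitly instead, which makes the structure easier to see and doesn't depend on the SDK's schema generation.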

Let's create a simple math tool. (The original material's sample output shows the model calling subtract_two_numbers, but we'll stick with add_two_numbers for the definition and let the model decide what to do based on the prompt.)

import ollama

# Define the python function that can be used as a tool
def add_two_numbers(a: int, b: int) -> int:
  """
  Add two numbers.

  Args:
    a (int): The first number as an int.
    b (int): The second number as an int.

  Returns:
    int: The sum of the two numbers.
  """
  print(f"--- Tool 'add_two_numbers' called with a={a}, b={b} ---")
  return a + b

# --- Main conversation logic ---
messages = [{'role': 'user', 'content': 'What is three plus one?'}]
# Or, for the subtraction example in the original output:
# messages = [{'role': 'user', 'content': 'what is three minus one?'}]

print(f"User: {messages[0]['content']}")

# Make the chat request with streaming and the tool
# Note: ChatResponse type hint might be ollama.ChatResponse or similar depending on library version
response_stream = ollama.chat(
  model='qwen3', # Or another capable model
  messages=messages,
  tools=[
      { # You can also define the tool explicitly if needed, or pass the function directly
          'type': 'function',
          'function': {
              'name': 'add_two_numbers', # Must match the Python function name if you want it to be called directly by your code later
              'description': 'Add two integer numbers together.',
              'parameters': {
                  'type': 'object',
                  'properties': {
                      'a': {'type': 'integer', 'description': 'The first number'},
                      'b': {'type': 'integer', 'description': 'The second number'}
                  },
                  'required': ['a', 'b']
              }
          }
      }
      # Simpler way for Python: pass the function directly if the library supports easy schema generation from it
      # tools=[add_two_numbers] # The SDK can often create the schema from this
  ],
  stream=True
)

print("Assistant (streaming):")
full_response_content = ""
tool_call_info = None

for chunk in response_stream:
  # Print the streamed content part
  if chunk['message']['content']:
    print(chunk['message']['content'], end='', flush=True)
    full_response_content += chunk['message']['content']

  # Check for tool calls in the chunk
  if 'tool_calls' in chunk['message'] and chunk['message']['tool_calls']:
    tool_call_info = chunk['message']['tool_calls'][0] # Assuming one tool call for simplicity
    print(f"\\\\n--- Detected Tool Call: {tool_call_info['function']['name']} ---")
    break # Stop processing stream for now, handle tool call

  if chunk.get('done'):
      print("\\\\n--- Stream finished ---")
      if not tool_call_info:
          print("No tool call was made.")

# --- If a tool call was detected, handle it ---
if tool_call_info:
  tool_name = tool_call_info['function']['name']
  tool_args = tool_call_info['function']['arguments']

  print(f"Arguments for the tool: {tool_args}")

  # Here, you'd actually call your Python tool function
  if tool_name == "add_two_numbers":
    # For safety, ensure arguments are of correct type if necessary
    try:
        arg_a = int(tool_args.get('a'))
        arg_b = int(tool_args.get('b'))
        tool_result = add_two_numbers(a=arg_a, b=arg_b)
        print(f"--- Tool execution result: {tool_result} ---")

        # Now, send this result back to Ollama to continue the conversation
        messages.append({'role': 'assistant', 'content': full_response_content, 'tool_calls': [tool_call_info]})
        messages.append({
            'role': 'tool',
            'content': str(tool_result), # Result must be a string
            'tool_call_id': tool_call_info.get('id', '') # If your library/model provides a tool_call_id
        })

        print("\\\\n--- Sending tool result back to model ---")

        follow_up_response_stream = ollama.chat(
            model='qwen3',
            messages=messages,
            stream=True
            # No tools needed here unless you expect another tool call
        )

        print("Assistant (after tool call):")
        for follow_up_chunk in follow_up_response_stream:
            if follow_up_chunk['message']['content']:
                print(follow_up_chunk['message']['content'], end='', flush=True)
            if follow_up_chunk.get('done'):
                print("\\\\n--- Follow-up stream finished ---")
                break
    except ValueError:
        print("Error: Could not parse tool arguments as integers.")
    except Exception as e:
        print(f"An error occurred during tool execution or follow-up: {e}")
  else:
    print(f"Error: Unknown tool '{tool_name}' requested by the model.")

Explanation of the Python Code:

  1. Import ollama.
  2. add_two_numbers function: This is our tool. The docstring and type hints help Ollama understand its purpose and parameters.
  3. messages: We start the conversation with the user's query.
  4. ollama.chat(...):
      • model, messages, stream=True are similar to cURL.
      • tools=[...]: We provide the tool definition. The Python SDK is quite flexible; you can pass the function object directly (e.g., tools=[add_two_numbers]) if it can infer the schema, or define it explicitly as shown.
  5. Looping through response_stream:
      • chunk['message']['content']: This is the streamed text. We print it immediately.
      • chunk['message']['tool_calls']: If this key exists and has content, the AI wants to use a tool. We store this tool_call_info and break the loop to handle it.
  6. Handling the Tool Call:
      • We extract the tool_name and tool_args.
      • We call our actual Python function (add_two_numbers) with these arguments.
      • Crucially: We then append the assistant's partial response (that led to the tool call) and a new message with role: "tool" and the content as the stringified result of our function to the messages list.
      • We make another ollama.chat call with these updated messages to get the AI's final response based on the tool's output.

Expected Output Flow: You'll see the initial user question, then the assistant's response streaming in. If it decides to call add_two_numbers (or subtract_two_numbers, as in the original material's sample output when the prompt asks for subtraction), you'll see the "Detected Tool Call" message, the arguments, the result of your Python function, and then the assistant continuing its response using that result.

(The original sample output showed:

<think>
Okay, the user is asking ...
</think>

[ToolCall(function=Function(name='subtract_two_numbers', arguments={'a': 3, 'b': 1}))]

This indicates the AI's internal "thought" process and then the structured tool call object that the Python SDK provides.)

How to Stream Tool Calls Using JavaScript (Node.js)

Now, let's do the same with JavaScript, typically for a Node.js backend or web application.

Step 1: Installing the Ollama JavaScript Library

In your project directory, run:

npm i ollama

Step 2: Defining the Tool Schema and Coding in JavaScript

In JavaScript, you usually define the tool schema as a JSON object.

import ollama from 'ollama';

// Describe the tool schema (e.g., for adding two numbers)
const addTool = {
    type: 'function',
    function: {
        name: 'addTwoNumbers',
        description: 'Add two numbers together',
        parameters: {
            type: 'object',
            required: ['a', 'b'],
            properties: {
                a: { type: 'number', description: 'The first number' },
                b: { type: 'number', description: 'The second number' }
            }
        }
    }
};

// Your actual JavaScript function that implements the tool
function executeAddTwoNumbers(a, b) {
    console.log(`--- Tool 'addTwoNumbers' called with a=${a}, b=${b} ---`);
    return a + b;
}

async function main() {
    const messages = [{ role: 'user', content: 'What is 2 plus 3?' }];
    console.log('User:', messages[0].content);

    console.log('Assistant (streaming):');
    let assistantResponseContent = "";
    let toolToCallInfo = null;

    try {
        const responseStream = await ollama.chat({
            model: 'qwen3', // Or another capable model
            messages: messages,
            tools: [addTool],
            stream: true
        });

        for await (const chunk of responseStream) {
            if (chunk.message.content) {
                process.stdout.write(chunk.message.content);
                assistantResponseContent += chunk.message.content;
            }
            if (chunk.message.tool_calls && chunk.message.tool_calls.length > 0) {
                toolToCallInfo = chunk.message.tool_calls[0]; // Assuming one tool call
                process.stdout.write(`\n--- Detected Tool Call: ${toolToCallInfo.function.name} ---\n`);
                break; // Stop processing stream to handle tool call
            }
            if (chunk.done) {
                process.stdout.write('\n--- Stream finished ---\n');
                if (!toolToCallInfo) {
                    console.log("No tool call was made.");
                }
                break;
            }
        }

        // --- If a tool call was detected, handle it ---
        if (toolToCallInfo) {
            const toolName = toolToCallInfo.function.name;
            const toolArgs = toolToCallInfo.function.arguments;

            console.log(`Arguments for the tool:`, toolArgs);

            let toolResult;
            if (toolName === 'addTwoNumbers') {
                toolResult = executeAddTwoNumbers(toolArgs.a, toolArgs.b);
                console.log(`--- Tool execution result: ${toolResult} ---`);

                // Append assistant's partial message and the tool message
                messages.push({
                    role: 'assistant',
                    content: assistantResponseContent, // Include content leading up to tool call
                    tool_calls: [toolToCallInfo]
                });
                messages.push({
                    role: 'tool',
                    content: toolResult.toString(), // Result must be a string
                    // tool_call_id: toolToCallInfo.id // If available and needed
                });

                console.log("\\\\n--- Sending tool result back to model ---");
                const followUpStream = await ollama.chat({
                    model: 'qwen3',
                    messages: messages,
                    stream: true
                });

                console.log("Assistant (after tool call):");
                for await (const followUpChunk of followUpStream) {
                    if (followUpChunk.message.content) {
                        process.stdout.write(followUpChunk.message.content);
                    }
                    if (followUpChunk.done) {
                        process.stdout.write('\n--- Follow-up stream finished ---\n');
                        break;
                    }
                }
            } else {
                console.error(`Error: Unknown tool '${toolName}' requested.`);
            }
        }

    } catch (error) {
        console.error('Error during Ollama chat:', error);
    }
}

main().catch(console.error);

Explanation of the JavaScript Code:

  1. Import ollama.
  2. addTool object: This is the JSON schema describing our tool to Ollama.
  3. executeAddTwoNumbers function: Our actual JavaScript function for the tool.
  4. main async function:
      • messages array starts the conversation.
      • await ollama.chat({...}): Makes the call.
      • tools: [addTool]: Passes our tool schema.
      • stream: true: Enables streaming.
      • for await (const chunk of responseStream): This loop processes each streamed chunk.
      • chunk.message.content: Text part of the stream.
      • chunk.message.tool_calls: If present, the AI wants to use a tool. We store toolToCallInfo.
  5. Handling the Tool Call: Similar to Python, if toolToCallInfo is set:
      • Extract the name and arguments.
      • Call executeAddTwoNumbers().
      • Append the assistant's message (that included the tool call request) and a new role: "tool" message with the result to the messages array.
      • Make another ollama.chat call with the updated messages to get the final response.

Expected Output Flow (similar to the cURL and Python examples): You'll see the user's question, then the assistant's response streaming. When it decides to call addTwoNumbers, it will print the tool call information, the result from your JavaScript function, and then continue streaming the AI's answer based on that result.

The original sample output for JS looked like:

Question: What is 2 plus 3?
<think>
Okay, the user is asking...
</think>
Tool call: {
  function: {
    name: "addTwoNumbers",
    arguments: { a: 2, b: 3 },
  },
}

How Ollama Handles Tool Parsing During Streaming

You might wonder how Ollama manages to stream text and identify tool calls so smoothly. It uses a clever new incremental parser.

  • Old Way: Many systems had to wait for the entire AI response, then scan it for tool calls (usually formatted as JSON). This blocked streaming because a tool call could appear anywhere.
  • Ollama's New Way:
      • The parser looks at each model's specific template to understand how it signals a tool call (e.g., special tokens or prefixes).
      • This allows Ollama to identify tool calls "incrementally" as the data streams in, separating them from regular text content.
      • It's smart enough to handle models that weren't explicitly trained with tool prefixes but still manage to output valid tool call structures. It can even fall back to looking for JSON-like structures if needed, but intelligently, so it doesn't just grab any JSON.

Why is this better?

  • True Streaming: You get text immediately, and tool calls are identified on the fly.
  • Accuracy: It's better at avoiding false positives (e.g., if the AI talks about a tool call it made previously, the new parser is less likely to mistakenly trigger it again).

Tip: Improving Performance with the Context Window

For more complex interactions, especially with tool calling, the size of the "context window" the model uses can matter. A larger context window means the model remembers more of the current conversation.

  • Model Context Protocol (MCP): Ollama's streaming and tool-calling improvements also work well with applications built on MCP.
  • num_ctx: You can set the context window size with this option. For example, 32,000 (32k) tokens or higher might improve tool calling performance and the quality of results.
  • Trade-off: Larger context windows use more memory.

Example: Setting the Context Window with cURL (use a model that supports larger contexts, such as llama3.1 or llama4):

curl -X POST "<http://localhost:11434/api/chat>" -d '{
  "model": "llama3.1",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ],
  "options": {
    "num_ctx": 32000
  }
}'
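
If you're using the Python library instead, the equivalent is the options parameter; here's a minimal sketch (the model name and num_ctx value are just starting points to experiment with):

import ollama

response = ollama.chat(
  model='llama3.1',
  messages=[{'role': 'user', 'content': 'why is the sky blue?'}],
  options={'num_ctx': 32000}  # larger context window; uses more memory
)

print(response['message']['content'])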

Experiment with this setting if you find tool calling isn't as reliable as you'd like.

Where to Go From Here?

You now have the fundamentals to build sophisticated, real-time AI applications with Ollama using streaming responses and tool calling!

Ideas to explore:

  • Connect tools to real-world APIs (weather, stocks, search engines).
  • Build agents that can perform multi-step tasks.
  • Create more natural and responsive chatbots.

Refer to the official Ollama documentation and its GitHub repository (including the "Tool Streaming" pull request mentioned in the original source material) for deeper technical dives, the latest updates, and more advanced examples.

💡
Want a great API Testing tool that generates beautiful API Documentation?

Want an integrated, All-in-One platform for your Developer Team to work together with maximum productivity?

Apidog delivers all your demands and replaces Postman at a much more affordable price!