How to Use Ollama for Streaming Responses and Tool Calling

Mark Ponomarev

29 May 2025


This guide will walk you through one of Ollama's most powerful new features: the ability to stream responses and call tools (like functions or APIs) in real time. This is a game-changer for building chat applications that feel alive and can interact with the world around them.

What you'll learn in this tutorial:

- What streaming responses and tool calling are, and why they work so well together
- How to make your first streaming tool call with cURL
- How to stream tool calls using the official Python and JavaScript libraries
- How Ollama parses tool calls during streaming, and how the context window can affect reliability


Getting Started: What You'll Need

To follow along, you'll need a few things:

- Ollama installed and running: Use a recent release, since streaming with tool calling is a newer feature.
- A tool-capable model pulled locally: For example, run ollama pull qwen3 to get the model used in this guide's examples.
- Basic command-line familiarity, plus Python or Node.js if you want to follow those sections.

Understanding Key Ideas: Streaming and Tool Calls

Let's break down what we mean by "streaming responses" and "tool calling."

What is Response Streaming?

Imagine you're chatting with an AI. Instead of waiting for it to think and type out its entire answer before you see anything, streaming means the AI sends its response to you piece by piece, word by word, as it generates it. This makes the interaction feel much faster and more natural, like a real conversation.

With Ollama, when you enable streaming ("stream": true), you get these incremental updates.
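
For example, here is a minimal streaming-only request using the Python client (a sketch, assuming the ollama package is installed and a qwen3 model has been pulled locally):

import ollama

# Each chunk arrives as soon as the model produces it, so the text
# appears word by word instead of all at once.
for chunk in ollama.chat(
    model='qwen3',
    messages=[{'role': 'user', 'content': 'Tell me a short joke.'}],
    stream=True,
):
    print(chunk['message']['content'], end='', flush=True)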

How Does Tool Calling Work?

Tool calling allows your AI models to do more than just generate text. You can define "tools" – which are essentially functions or external APIs – that the AI can decide to use to get information or perform actions.

For example, a tool could be:

- A function that fetches the current weather for a city
- A calculator that adds or subtracts numbers
- An external API that searches a database or the web

You describe these tools to Ollama, and when the AI determines that using a tool would help answer the user's query, it signals its intent to call that tool with specific arguments. Your application then executes the tool and can send the results back to the AI to continue the conversation.
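
Conceptually, one tool round-trip adds three entries to the conversation history. Here is a sketch of the shapes involved (the field names match the examples later in this guide):

# A sketch of the messages exchanged during one tool round-trip
conversation = [
    # 1. The user asks something the model can't answer on its own
    {'role': 'user', 'content': 'What is the weather today in Toronto?'},
    # 2. The model replies with a structured tool call instead of plain text
    {'role': 'assistant', 'content': '', 'tool_calls': [
        {'function': {'name': 'get_current_weather',
                      'arguments': {'location': 'Toronto', 'format': 'celsius'}}},
    ]},
    # 3. Your application runs the tool and reports the result back
    {'role': 'tool', 'content': '20 degrees Celsius'},
    # ...after which the model continues, now informed by the tool's output.
]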

Why Combine Streaming with Tool Calling?

Ollama's big upgrade is that it can now handle tool calling while streaming responses. This means your application can:

  1. Receive initial text from the model (streamed).
  2. Partway through, the stream might indicate that a tool call is needed.
  3. Your app processes the tool call.
  4. Meanwhile, the model might even stream more text (e.g., "Okay, I'll get the weather for you...").
  5. Once your app gets the tool's result, you can send it back to the model, and it will continue streaming its response, now informed by the tool's output.

This creates highly responsive and capable AI applications.

Which Models Support These Features?

Ollama has enabled this for several popular models, including:

- qwen3 (used in the examples throughout this guide)
- llama3.1 and llama4
- Other recent tool-capable models; check the Ollama model library for the current list.

How to Make Your First Streaming Tool Call with cURL

cURL is a great way to quickly test Ollama's API. Let's ask for the weather in Toronto.

Step 1: Conceptualizing Your Tool

Our tool will be get_current_weather. It needs:

- location (string): The city to get the weather for, e.g., "Toronto"
- format (string): The unit for the result, either "celsius" or "fahrenheit"

Step 2: Building the cURL Command

Open your terminal and prepare the following command. We'll break it down:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather today in Toronto?"
    }
  ],
  "stream": true,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The location to get the weather for, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "description": "The format to return the weather in, e.g. \\\\\\\\'celsius\\\\\\\\' or \\\\\\\\'fahrenheit\\\\\\\\'",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}'

Breakdown:

- "model": "qwen3": The tool-capable model to use.
- "messages": The conversation history, starting with the user's question.
- "stream": true: Tells Ollama to send the response back in incremental chunks.
- "tools": An array of tools the model may call. Here it holds a single function, get_current_weather, described with a JSON Schema so the model knows what arguments it takes.

Step 3: Execute and Observe the Output

Press Enter. You'll see a series of JSON objects appear one after another. This is the stream!

Example snippets from the stream:

{
  "model": "qwen3", "created_at": "...",
  "message": { "role": "assistant", "content": "Okay, " }, "done": false
}

{
  "model": "qwen3", "created_at": "...",
  "message": { "role": "assistant", "content": "I will " }, "done": false
}

{
  "model": "qwen3", "created_at": "...",
  "message": { "role": "assistant", "content": "try to get that for you." }, "done": false
}

(The model might output some "thinking" tokens like <think>...celsius...</think> depending on its internal process; these are also part of the stream.)

Then, critically, you might see something like this:

{
  "model": "qwen3",
  "created_at": "2025-05-27T22:54:58.100509Z",
  "message": {
    "role": "assistant",
    "content": "", // Content might be empty when a tool call is made
    "tool_calls": [
      {
        "function": {
          "name": "get_current_weather",
          "arguments": { // The arguments the model decided on!
            "format": "celsius",
            "location": "Toronto"
          }
        }
      }
    ]
  },
  "done": false // Still not done, awaiting tool result
}

What to Notice:

- "content" is empty: when the model decides to call a tool, the chunk carries tool_calls instead of text.
- "tool_calls" holds the function name and the arguments the model chose ("Toronto", "celsius") as structured JSON, not a raw string you have to parse.
- "done" is still false: the model is waiting for your tool's result before it can finish its answer.

In a real application, when you see a tool_calls chunk, your code would:

  1. Pause processing the stream (or handle it async).
  2. Execute the actual get_current_weather function/API with "Toronto" and "celsius".
  3. Get the result (e.g., "20 degrees Celsius").
  4. Send this result back to Ollama in a new message with role: "tool".
  5. The model will then use this information to continue generating its response, also streamed.

How to Stream Tool Calls Using Python

Let's implement a similar idea in Python using Ollama's official library.

Step 1: Installing the Ollama Python Library

If you haven't already, install or upgrade the library:

pip install -U ollama

Step 2: Defining Your Tool and Coding in Python

The Ollama Python SDK cleverly allows you to pass Python functions directly as tools. It inspects the function signature and docstring to create the schema for the AI.
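
For instance, here is a minimal sketch of that shortcut (assuming a recent ollama-python release that can generate schemas from plain functions):

import ollama

def add_two_numbers(a: int, b: int) -> int:
  """Add two numbers."""
  return a + b

# Pass the function itself: the SDK builds the tool schema
# from the signature, type hints, and docstring.
response = ollama.chat(
  model='qwen3',
  messages=[{'role': 'user', 'content': 'What is three plus one?'}],
  tools=[add_two_numbers],
)
print(response['message'].get('tool_calls'))

The full example below defines the schema explicitly instead, which is more verbose but works the same way in every language.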

Let's build the example around a simple math tool. (The original material defines add_two_numbers, while its sample output shows the model calling subtract_two_numbers; we'll stick with add_two_numbers for the definition and let the model decide what to do based on the prompt.)

import ollama

# Define the python function that can be used as a tool
def add_two_numbers(a: int, b: int) -> int:
  """
  Add two numbers.

  Args:
    a (int): The first number as an int.
    b (int): The second number as an int.

  Returns:
    int: The sum of the two numbers.
  """
  print(f"--- Tool 'add_two_numbers' called with a={a}, b={b} ---")
  return a + b

# --- Main conversation logic ---
messages = [{'role': 'user', 'content': 'What is three plus one?'}]
# Or, for the subtraction example in the original output:
# messages = [{'role': 'user', 'content': 'what is three minus one?'}]

print(f"User: {messages[0]['content']}")

# Make the chat request with streaming and the tool
# Note: ChatResponse type hint might be ollama.ChatResponse or similar depending on library version
response_stream = ollama.chat(
  model='qwen3', # Or another capable model
  messages=messages,
  tools=[
      { # You can also define the tool explicitly if needed, or pass the function directly
          'type': 'function',
          'function': {
              'name': 'add_two_numbers', # Must match the Python function name if you want it to be called directly by your code later
              'description': 'Add two integer numbers together.',
              'parameters': {
                  'type': 'object',
                  'properties': {
                      'a': {'type': 'integer', 'description': 'The first number'},
                      'b': {'type': 'integer', 'description': 'The second number'}
                  },
                  'required': ['a', 'b']
              }
          }
      }
      # Simpler way for Python: pass the function directly if the library supports easy schema generation from it
      # tools=[add_two_numbers] # The SDK can often create the schema from this
  ],
  stream=True
)

print("Assistant (streaming):")
full_response_content = ""
tool_call_info = None

for chunk in response_stream:
  # Print the streamed content part
  if chunk['message']['content']:
    print(chunk['message']['content'], end='', flush=True)
    full_response_content += chunk['message']['content']

  # Check for tool calls in the chunk
  if 'tool_calls' in chunk['message'] and chunk['message']['tool_calls']:
    tool_call_info = chunk['message']['tool_calls'][0] # Assuming one tool call for simplicity
    print(f"\\\\n--- Detected Tool Call: {tool_call_info['function']['name']} ---")
    break # Stop processing stream for now, handle tool call

  if chunk.get('done'):
      print("\\\\n--- Stream finished ---")
      if not tool_call_info:
          print("No tool call was made.")

# --- If a tool call was detected, handle it ---
if tool_call_info:
  tool_name = tool_call_info['function']['name']
  tool_args = tool_call_info['function']['arguments']

  print(f"Arguments for the tool: {tool_args}")

  # Here, you'd actually call your Python tool function
  if tool_name == "add_two_numbers":
    # For safety, ensure arguments are of correct type if necessary
    try:
        arg_a = int(tool_args.get('a'))
        arg_b = int(tool_args.get('b'))
        tool_result = add_two_numbers(a=arg_a, b=arg_b)
        print(f"--- Tool execution result: {tool_result} ---")

        # Now, send this result back to Ollama to continue the conversation
        messages.append({'role': 'assistant', 'content': full_response_content, 'tool_calls': [tool_call_info]})
        messages.append({
            'role': 'tool',
            'content': str(tool_result), # Result must be a string
            'tool_call_id': tool_call_info.get('id', '') # If your library/model provides a tool_call_id
        })

        print("\\\\n--- Sending tool result back to model ---")

        follow_up_response_stream = ollama.chat(
            model='qwen3',
            messages=messages,
            stream=True
            # No tools needed here unless you expect another tool call
        )

        print("Assistant (after tool call):")
        for follow_up_chunk in follow_up_response_stream:
            if follow_up_chunk['message']['content']:
                print(follow_up_chunk['message']['content'], end='', flush=True)
            if follow_up_chunk.get('done'):
                print("\\\\n--- Follow-up stream finished ---")
                break
    except ValueError:
        print("Error: Could not parse tool arguments as integers.")
    except Exception as e:
        print(f"An error occurred during tool execution or follow-up: {e}")
  else:
    print(f"Error: Unknown tool '{tool_name}' requested by the model.")

Explanation of the Python Code:

  1. Import ollama.
  2. add_two_numbers function: This is our tool. The docstring and type hints help Ollama understand its purpose and parameters.
  3. messages: We start the conversation with the user's query.
  4. ollama.chat(...): We pass the model, the message history, the tool definition, and stream=True so the response arrives in chunks.
  5. Looping through response_stream: Each chunk's content is printed as it arrives. If a chunk carries tool_calls, we record the call and break out of the loop to handle it.
  6. Handling the Tool Call: We run the matching Python function with the parsed arguments, append the assistant's tool call and a role: "tool" message containing the result to the history, then make a second streaming request so the model can finish its answer using the tool's output.

Expected Output Flow: You'll see the initial user question, then the assistant's response streaming in. If it decides to call add_two_numbers (or subtract_two_numbers, as in the original material's sample output for a subtraction prompt), you'll see the "Detected Tool Call" message, the arguments, the result of your Python function, and then the assistant continuing its response using that result.

(The original sample output showed:

<think>
Okay, the user is asking ...
</think>

[ToolCall(function=Function(name='subtract_two_numbers', arguments={'a': 3, 'b': 1}))]

This indicates the AI's internal "thought" process and then the structured tool call object that the Python SDK provides.)
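
If you'd rather not show these <think> blocks to end users, you can strip them from the accumulated text. A rough sketch (assuming the tags always arrive as a matched pair):

import re

def strip_think_blocks(text: str) -> str:
  """Remove <think>...</think> spans from accumulated model output."""
  return re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)

# strip_think_blocks("<think>user wants a sum...</think>The answer is 4.")
# -> "The answer is 4."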

How to Stream Tool Calls Using JavaScript (Node.js)

Now, let's do the same with JavaScript, typically for a Node.js backend or web application.

Step 1: Installing the Ollama JavaScript Library

In your project directory, run:

npm i ollama

Step 2: Defining the Tool Schema and Coding in JavaScript

In JavaScript, you usually define the tool schema as a JSON object.

import ollama from 'ollama';

// Describe the tool schema (e.g., for adding two numbers)
const addTool = {
    type: 'function',
    function: {
        name: 'addTwoNumbers',
        description: 'Add two numbers together',
        parameters: {
            type: 'object',
            required: ['a', 'b'],
            properties: {
                a: { type: 'number', description: 'The first number' },
                b: { type: 'number', description: 'The second number' }
            }
        }
    }
};

// Your actual JavaScript function that implements the tool
function executeAddTwoNumbers(a, b) {
    console.log(`--- Tool 'addTwoNumbers' called with a=${a}, b=${b} ---`);
    return a + b;
}

async function main() {
    const messages = [{ role: 'user', content: 'What is 2 plus 3?' }];
    console.log('User:', messages[0].content);

    console.log('Assistant (streaming):');
    let assistantResponseContent = "";
    let toolToCallInfo = null;

    try {
        const responseStream = await ollama.chat({
            model: 'qwen3', // Or another capable model
            messages: messages,
            tools: [addTool],
            stream: true
        });

        for await (const chunk of responseStream) {
            if (chunk.message.content) {
                process.stdout.write(chunk.message.content);
                assistantResponseContent += chunk.message.content;
            }
            if (chunk.message.tool_calls && chunk.message.tool_calls.length > 0) {
                toolToCallInfo = chunk.message.tool_calls[0]; // Assuming one tool call
                process.stdout.write(`\n--- Detected Tool Call: ${toolToCallInfo.function.name} ---\n`);
                break; // Stop processing stream to handle tool call
            }
            if (chunk.done) {
                process.stdout.write('\n--- Stream finished ---\n');
                if (!toolToCallInfo) {
                    console.log("No tool call was made.");
                }
                break;
            }
        }

        // --- If a tool call was detected, handle it ---
        if (toolToCallInfo) {
            const toolName = toolToCallInfo.function.name;
            const toolArgs = toolToCallInfo.function.arguments;

            console.log(`Arguments for the tool:`, toolArgs);

            let toolResult;
            if (toolName === 'addTwoNumbers') {
                toolResult = executeAddTwoNumbers(toolArgs.a, toolArgs.b);
                console.log(`--- Tool execution result: ${toolResult} ---`);

                // Append assistant's partial message and the tool message
                messages.push({
                    role: 'assistant',
                    content: assistantResponseContent, // Include content leading up to tool call
                    tool_calls: [toolToCallInfo]
                });
                messages.push({
                    role: 'tool',
                    content: toolResult.toString(), // Result must be a string
                    // tool_call_id: toolToCallInfo.id // If available and needed
                });

                console.log("\n--- Sending tool result back to model ---");
                const followUpStream = await ollama.chat({
                    model: 'qwen3',
                    messages: messages,
                    stream: true
                });

                console.log("Assistant (after tool call):");
                for await (const followUpChunk of followUpStream) {
                    if (followUpChunk.message.content) {
                        process.stdout.write(followUpChunk.message.content);
                    }
                    if (followUpChunk.done) {
                        process.stdout.write('\n--- Follow-up stream finished ---\n');
                        break;
                    }
                }
            } else {
                console.error(`Error: Unknown tool '${toolName}' requested.`);
            }
        }

    } catch (error) {
        console.error('Error during Ollama chat:', error);
    }
}

main().catch(console.error);

Explanation of the JavaScript Code:

  1. Import ollama.
  2. addTool object: This is the JSON schema describing our tool to Ollama.
  3. executeAddTwoNumbers function: Our actual JavaScript function for the tool.
  4. main async function: Sends the chat request with stream: true and the tool schema, then iterates the stream with for await, printing content as it arrives and watching each chunk for tool_calls.
  5. Handling the Tool Call: Similar to Python, if toolToCallInfo is set, we run the matching function, push the assistant's tool call and a role: 'tool' message with the result onto the history, and start a second streaming request so the model can finish its answer.

Expected Output Flow (similar to the cURL and Python examples): You'll see the user's question, then the assistant's response streaming. When it decides to call addTwoNumbers, it will print the tool call information, the result from your JavaScript function, and then continue streaming the AI's answer based on that result.

The original sample output for JS looked like:

Question: What is 2 plus 3?
<think>
Okay, the user is asking...
</think>
Tool call: {
  function: {
    name: "addTwoNumbers",
    arguments: { a: 2, b: 3 },
  },
}

How Ollama Handles Tool Parsing During Streaming

You might wonder how Ollama manages to stream text and identify tool calls so smoothly. It uses a clever new incremental parser.

Why is this better?

- Tool calls are recognized as they stream in, so Ollama doesn't need to buffer the model's entire output before deciding whether it contains a tool call.
- Plain text and tool calls are cleanly separated, so partial text can be shown to the user immediately while tool calls are routed to your code.
- It is more tolerant of models that don't format tool calls exactly as their chat template prescribes.

Tip: Improving Performance with the Context Window

For more complex interactions, especially with tool calling, the size of the "context window" the model uses can matter. A larger context window means the model remembers more of the current conversation.

Example: Setting the Context Window with cURL (use a model that supports larger contexts, such as llama3.1 or llama4):

curl -X POST "http://localhost:11434/api/chat" -d '{
  "model": "llama3.1",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ],
  "options": {
    "num_ctx": 32000
  }
}'

Experiment with this setting if you find tool calling isn't as reliable as you'd like.
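
The same option is available through the client libraries. For example, in Python (a sketch; the options dict mirrors the JSON above):

import ollama

# A larger num_ctx lets the model keep more of the conversation
# (including tool schemas and results) in view.
response = ollama.chat(
  model='llama3.1',
  messages=[{'role': 'user', 'content': 'why is the sky blue?'}],
  options={'num_ctx': 32000},
)
print(response['message']['content'])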

Where to Go From Here?

You now have the fundamentals to build sophisticated, real-time AI applications with Ollama using streaming responses and tool calling!

Ideas to explore:

- Connect real APIs (weather, search, databases) as tools instead of mock functions.
- Define multiple tools in a single request and let the model choose between them.
- Chain tool calls, feeding one tool's result into the next request.
- Add robust error handling for tools that fail or time out.
- Stream the output into a web UI for a live chat experience.

Refer to the official Ollama documentation and its GitHub repository (including the "Tool Streaming Pull Request" mentioned in the original source material for deeper technical dives) for the latest updates and more advanced examples.
