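How to Use Ollama Streaming Responses and Tool Calling

This guide shows how to use one of Ollama's newer features: streaming a response while calling tools (functions or APIs) in real time. It makes it possible to build chat applications that feel responsive and can interact with the world around them.

What you will learn in this tutorial:

What response streaming and tool calling mean in Ollama
Why the combination is useful for AI projects
Step-by-step instructions for implementing it with:
cURL (for quick tests and universal access)
Python (for backend applications)
JavaScript (for web and Node.js applications)
How Ollama handles these features under the hood
Tips for getting the best performance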

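Getting Started: What You Need

To follow this guide, you need the following.

Ollama installed: Make sure the latest version of Ollama is installed and running on your system. If it is not, download and install it from the official Ollama website.
Basic command-line knowledge: For the cURL examples.
A Python environment (for the Python section): Python 3.x with pip for package management.
A Node.js environment (for the JavaScript section): Node.js and npm installed.
Familiarity with JSON: Ollama uses JSON to structure data and tool calls.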

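Understanding the Key Concepts: Streaming and Tool Calling

Let's look at what "response streaming" and "tool calling" mean.

What Is Response Streaming?

Imagine you are chatting with an AI. Instead of waiting for the model to finish the whole response before you see anything, streaming sends the response in small pieces, roughly word by word, as the model generates them. The interaction feels much faster and closer to a real conversation.

In Ollama, enabling streaming ("stream": true) gives you these incremental updates.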
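
To make the idea concrete, here is a minimal Python sketch of consuming such a stream. It assumes each update arrives as one JSON object per line, shaped like the chunks shown later in this guide. The function name collect_streamed_text and the sample lines are made up for illustration, and the stream is simulated rather than read from a live server.

```python
import json
from typing import Iterable


def collect_streamed_text(lines: Iterable[str]) -> str:
    """Concatenate the text fragments from a stream of JSON lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk carries "done": true
    return "".join(parts)


# Simulated stream; a real one would be read line by line from the HTTP response.
sample_stream = [
    '{"message": {"role": "assistant", "content": "Okay, "}, "done": false}',
    '{"message": {"role": "assistant", "content": "I will "}, "done": false}',
    '{"message": {"role": "assistant", "content": "check."}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_streamed_text(sample_stream))
```

In a chat UI you would display each fragment as soon as it is parsed instead of joining them at the end.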

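How Does Tool Calling Work?

Tool calling lets a model do more than generate text. You define "tools" (essentially functions or external APIs) that the model can decide to use to fetch information or perform an action.

For example, tools might look like this:

get_current_weather(location): Gets the current weather.
calculate_sum(number1, number2): Performs a calculation.
search_web(query): Retrieves information from the internet.

You describe these tools to Ollama. When the model decides that a tool would help answer the user's query, it signals its intent to call that tool with specific arguments. Your application runs the tool and sends the result back to the model, which then continues the conversation.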
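
On the application side, "running the tool" usually means mapping the tool name the model asked for to a local function. The sketch below is one possible way to do that. The TOOLS registry and run_tool_call helper are hypothetical names, and the weather function is a placeholder that calls no real API.

```python
from typing import Any, Callable, Dict


def get_current_weather(location: str) -> str:
    # Placeholder implementation; a real tool would call a weather API.
    return f"Weather lookup for {location} is not implemented in this sketch"


def calculate_sum(number1: float, number2: float) -> float:
    return number1 + number2


# Hypothetical registry mapping tool names to local functions.
TOOLS: Dict[str, Callable[..., Any]] = {
    "get_current_weather": get_current_weather,
    "calculate_sum": calculate_sum,
}


def run_tool_call(tool_call: Dict[str, Any]) -> Any:
    """Execute the function named in a tool call with the model's arguments."""
    function = tool_call["function"]
    name = function["name"]
    if name not in TOOLS:
        raise KeyError(f"Unknown tool requested: {name}")
    return TOOLS[name](**function["arguments"])


call = {"function": {"name": "calculate_sum",
                     "arguments": {"number1": 2, "number2": 3}}}
print(run_tool_call(call))
```

Rejecting unknown names explicitly matters because the model, not your code, chooses which tool to request.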

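Why Combine Streaming and Tool Calling?

The major upgrade in Ollama is that it can now handle tool calls while streaming a response. Your application can therefore work through the following sequence:

It receives the first text from the model (streamed).
At some point, the stream may indicate that a tool call is needed.
The app handles the tool call.
Meanwhile, the model may stream more text (for example, "Okay, I'll get the weather forecast...").
Once the app has the tool result, it sends it back to the model, and the model continues streaming its response based on that result.

The result is an AI application that is both highly responsive and capable.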
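
The sequence above can be sketched as a small event loop that separates text from tool call requests as chunks arrive. This is an illustrative sketch over simulated chunks. The stream_events helper and its event names are assumptions for this example, not part of any Ollama library.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def stream_events(chunks: Iterable[Dict[str, Any]]) -> Iterator[Tuple[str, Any]]:
    """Yield ("text", str) and ("tool_call", dict) events as chunks arrive."""
    for chunk in chunks:
        message = chunk.get("message", {})
        if message.get("content"):
            yield ("text", message["content"])
        for tool_call in message.get("tool_calls") or []:
            yield ("tool_call", tool_call)
        if chunk.get("done"):
            return


# Simulated chunks shaped like the ones shown in the cURL section.
simulated = [
    {"message": {"role": "assistant", "content": "Let me check. "}, "done": False},
    {"message": {"role": "assistant", "content": "", "tool_calls": [
        {"function": {"name": "get_current_weather",
                      "arguments": {"location": "Toronto", "format": "celsius"}}}
    ]}, "done": False},
    {"message": {"role": "assistant", "content": ""}, "done": True},
]

for kind, payload in stream_events(simulated):
    print(kind, payload)
```

Text events can go straight to the UI, while tool call events are routed to your own code.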

Which Models Support These Features?

Ollama has enabled this for several popular models, including:

Qwen 3
Devstral
Qwen2.5 and Qwen2.5-coder
Llama 3.1
Llama 4
...with more being added over time.

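How to Make Your First Streaming Tool Call with cURL

cURL is a quick way to test Ollama's API. Let's ask for the weather in Toronto.

Step 1: Conceptualize the Tool

Our tool will be get_current_weather. It needs:

location (string): for example, "Toronto"
format (string): for example, "celsius" or "fahrenheit"

Step 2: Build the cURL Command

Open a terminal and prepare the following command. A breakdown follows.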

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather today in Toronto?"
    }
  ],
  "stream": true,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The location to get the weather for, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "description": "The format to return the weather in, e.g. celsius or fahrenheit",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}'

Breakdown:

curl http://localhost:11434/api/chat: The command and Ollama's chat API endpoint.
-d '{...}': Sends JSON data in the request body.
"model": "qwen3": Specifies the model to use.
"messages": [...]: The conversation history. Here, only the user's question.
"stream": true: This is the key setting. It tells Ollama to stream the response.
"tools": [...]: An array defining the tools available to the model.
"type": "function": Specifies the tool type.
"function": {...}: Describes the function.
"name": "get_current_weather": The tool's name.
"description": "...": Helps the model understand what the tool does.
"parameters": {...}: Defines the arguments the tool accepts (using JSON Schema).

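If you prefer to build the same request from code, the following sketch assembles the request body in Python and, optionally, posts it with the standard library. The helper names build_weather_request and post_chat are hypothetical. post_chat assumes Ollama is running locally on the endpoint used above and is not called here.

```python
import json
import urllib.request
from typing import Any, Dict, Iterator

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # endpoint from the cURL example


def build_weather_request(question: str, model: str = "qwen3") -> Dict[str, Any]:
    """Build the same request body as the cURL command above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": True,
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string",
                                     "description": "The location to get the weather for"},
                        "format": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                    },
                    "required": ["location", "format"],
                },
            },
        }],
    }


def post_chat(payload: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Send the payload and yield each streamed JSON chunk (needs a running Ollama)."""
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        for raw_line in response:  # one JSON object per line
            if raw_line.strip():
                yield json.loads(raw_line)


payload = build_weather_request("What is the weather today in Toronto?")
print(json.dumps(payload, indent=2))
```

Building the body as a dictionary avoids the shell quoting problems that come with long inline JSON.
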
Step 3: Run It and Observe the Output

Press Enter. A series of JSON objects appears one after another. That is the stream.

Example snippets from the stream:

{
  "model": "qwen3", "created_at": "...",
  "message": { "role": "assistant", "content": "Okay, " }, "done": false
}

{
  "model": "qwen3", "created_at": "...",
  "message": { "role": "assistant", "content": "I will " }, "done": false
}

{
  "model": "qwen3", "created_at": "...",
  "message": { "role": "assistant", "content": "try to get that for you." }, "done": false
}

(Depending on its internal process, the model may also output "thinking" tokens such as <think>...celsius...</think>. These are part of the stream too.)

Then, importantly, you may see something like this:

{
  "model": "qwen3",
  "created_at": "2025-05-27T22:54:58.100509Z",
  "message": {
    "role": "assistant",
    "content": "", // Content might be empty when a tool call is made
    "tool_calls": [
      {
        "function": {
          "name": "get_current_weather",
          "arguments": { // The arguments the model decided on!
            "format": "celsius",
            "location": "Toronto"
          }
        }
      }
    ]
  },
  "done": false // Still not done, awaiting tool result
}

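Things to note:

Each chunk is a JSON object.
"done": false means the stream is still in progress. The last chunk has "done": true.
The "message" object contains:
"role": "assistant"
"content": The text portion of the stream.
"tool_calls": An array that appears when the model wants to use a tool.
It contains the tool's name and the arguments the model chose.

In a real application, when a tool_calls chunk appears, your code would do the following:

Pause processing of the stream (or handle the call asynchronously).
Run the actual get_current_weather function or API with "Toronto" and "celsius".
Get the result (for example, "20 degrees Celsius").
Send the result back to Ollama in a new message with role: "tool".
The model then uses this information to continue generating its response, which is also streamed.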
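
The last two steps can be sketched as a small helper that extends the conversation history before the follow-up request. The function name with_tool_result is hypothetical, and the weather result is a made-up value. The message shapes follow the role: "tool" convention described above.

```python
from typing import Any, Dict, List


def with_tool_result(
    messages: List[Dict[str, Any]],
    assistant_text: str,
    tool_call: Dict[str, Any],
    result: Any,
) -> List[Dict[str, Any]]:
    """Return a new history with the assistant's tool request and the tool's reply."""
    return messages + [
        {"role": "assistant", "content": assistant_text, "tool_calls": [tool_call]},
        {"role": "tool", "content": str(result)},  # tool output is sent as a string
    ]


history = [{"role": "user", "content": "What is the weather today in Toronto?"}]
tool_call = {"function": {"name": "get_current_weather",
                          "arguments": {"location": "Toronto", "format": "celsius"}}}

# "20 degrees Celsius" is an example value, not real weather data.
updated = with_tool_result(history, "", tool_call, "20 degrees Celsius")
print([message["role"] for message in updated])
```

The updated list is what you would send as "messages" in the next chat request.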

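How to Stream Tool Calls with Python

Let's implement the same idea in Python using Ollama's official library.

Step 1: Install the Ollama Python Library

If you have not already, install or upgrade the library.

pip install -U ollama

Step 2: Define the Tool and Write the Code in Python

The Ollama Python SDK is designed so that you can pass Python functions directly as tools. It inspects the function's signature and docstring to create a schema for the model.

Let's create a simple math tool that adds two numbers. The example below defines add_two_numbers and lets the model decide, based on the prompt, whether to call it. (The sample output shown later comes from a subtraction prompt and therefore shows a call to subtract_two_numbers.)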

import ollama

# Define the python function that can be used as a tool
def add_two_numbers(a: int, b: int) -> int:
  """
  Add two numbers.

  Args:
    a (int): The first number as an int.
    b (int): The second number as an int.

  Returns:
    int: The sum of the two numbers.
  """
  print(f"--- Tool 'add_two_numbers' called with a={a}, b={b} ---")
  return a + b

# --- Main conversation logic ---
messages = [{'role': 'user', 'content': 'What is three plus one?'}]
# Or, to match the subtraction prompt behind the sample output shown later:
# messages = [{'role': 'user', 'content': 'what is three minus one?'}]

print(f"User: {messages[0]['content']}")

# Make the chat request with streaming and the tool
# Note: ChatResponse type hint might be ollama.ChatResponse or similar depending on library version
response_stream = ollama.chat(
  model='qwen3', # Or another capable model
  messages=messages,
  tools=[
      { # You can also define the tool explicitly if needed, or pass the function directly
          'type': 'function',
          'function': {
              'name': 'add_two_numbers', # Must match the Python function name if you want it to be called directly by your code later
              'description': 'Add two integer numbers together.',
              'parameters': {
                  'type': 'object',
                  'properties': {
                      'a': {'type': 'integer', 'description': 'The first number'},
                      'b': {'type': 'integer', 'description': 'The second number'}
                  },
                  'required': ['a', 'b']
              }
          }
      }
      # Simpler way for Python: pass the function directly if the library supports easy schema generation from it
      # tools=[add_two_numbers] # The SDK can often create the schema from this
  ],
  stream=True
)

print("Assistant (streaming):")
full_response_content = ""
tool_call_info = None

for chunk in response_stream:
  # Print the streamed content part
  if chunk['message']['content']:
    print(chunk['message']['content'], end='', flush=True)
    full_response_content += chunk['message']['content']

  # Check for tool calls in the chunk
  if 'tool_calls' in chunk['message'] and chunk['message']['tool_calls']:
    tool_call_info = chunk['message']['tool_calls'][0] # Assuming one tool call for simplicity
    print(f"\n--- Detected Tool Call: {tool_call_info['function']['name']} ---")
    break # Stop processing stream for now, handle tool call

  if chunk.get('done'):
      print("\n--- Stream finished ---")
      if not tool_call_info:
          print("No tool call was made.")

# --- If a tool call was detected, handle it ---
if tool_call_info:
  tool_name = tool_call_info['function']['name']
  tool_args = tool_call_info['function']['arguments']

  print(f"Arguments for the tool: {tool_args}")

  # Here, you'd actually call your Python tool function
  if tool_name == "add_two_numbers":
    # For safety, ensure arguments are of correct type if necessary
    try:
        arg_a = int(tool_args.get('a'))
        arg_b = int(tool_args.get('b'))
        tool_result = add_two_numbers(a=arg_a, b=arg_b)
        print(f"--- Tool execution result: {tool_result} ---")

        # Now, send this result back to Ollama to continue the conversation
        messages.append({'role': 'assistant', 'content': full_response_content, 'tool_calls': [tool_call_info]})
        messages.append({
            'role': 'tool',
            'content': str(tool_result), # Result must be a string
            'tool_call_id': tool_call_info.get('id', '') # If your library/model provides a tool_call_id
        })

        print("\n--- Sending tool result back to model ---")

        follow_up_response_stream = ollama.chat(
            model='qwen3',
            messages=messages,
            stream=True
            # No tools needed here unless you expect another tool call
        )

        print("Assistant (after tool call):")
        for follow_up_chunk in follow_up_response_stream:
            if follow_up_chunk['message']['content']:
                print(follow_up_chunk['message']['content'], end='', flush=True)
            if follow_up_chunk.get('done'):
                print("\n--- Follow-up stream finished ---")
                break
    except ValueError:
        print("Error: Could not parse tool arguments as integers.")
    except Exception as e:
        print(f"An error occurred during tool execution or follow-up: {e}")
  else:
    print(f"Error: Unknown tool '{tool_name}' requested by the model.")

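Explanation of the Python code:

Import ollama.
add_two_numbers function: This is our tool. The docstring and type hints help Ollama understand its purpose and parameters.
messages: Starts the conversation with the user's query.
ollama.chat(...):

model, messages, and stream=True work as in the cURL example.
tools=[...]: Provides the tool definitions. The Python SDK is flexible here: when it can infer the schema, you can also pass the function object directly (for example, tools=[add_two_numbers]).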
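
To give a feel for what inferring a schema from a function involves, here is a simplified sketch built on Python's inspect module. It is not the SDK's actual implementation. The schema_from_function helper and the JSON_TYPES mapping are assumptions made for illustration, and only a few basic types are handled.

```python
import inspect
from typing import Any, Callable, Dict

# Simplified mapping from Python annotations to JSON Schema type names.
JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def schema_from_function(func: Callable[..., Any]) -> Dict[str, Any]:
    """Build a tool schema from a function's signature and docstring."""
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        # Unknown or missing annotations fall back to "string" in this sketch.
        properties[name] = {"type": JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # parameters without defaults are required
    doc = inspect.getdoc(func) or ""
    description = doc.splitlines()[0] if doc else ""
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def add_two_numbers(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b


print(schema_from_function(add_two_numbers))
```

The result has the same shape as the explicit dictionary passed to tools=[...] in the example above, which is why clear type hints and docstrings matter.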

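Looping over response_stream:

chunk['message']['content']: The streamed text. Display it immediately.
chunk['message']['tool_calls']: If this key is present and non-empty, the model wants to use a tool. Save it as tool_call_info and break out of the loop to handle it.

Handling the tool call: If tool_call_info is set:

Extract tool_name and tool_args.
Call the actual Python function (add_two_numbers) with those arguments.
**Important**: Append to the messages list the assistant's partial response (the one that led to the tool call) and a new message with role: "tool" whose content is the stringified function result.
Make another ollama.chat call with the updated messages to get the model's final response based on the tool output.

Expected output flow: The user's question is shown first, then the assistant's response streams in. If the model decides to call add_two_numbers (or subtract_two_numbers, as in the sample output below, where the prompt was about subtraction), you see the "Detected Tool Call" message, the arguments, and the result of the Python function, after which the assistant continues its response using that result.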

(The sample output looks like this:

<think>
Okay, the user is asking ...
</think>

[ToolCall(function=Function(name='subtract_two_numbers', arguments={'a': 3, 'b': 1}))]

This shows the model's internal "thinking" process and the structured tool call object provided by the Python SDK.)

How to Stream Tool Calls with JavaScript (Node.js)

Now let's do the same thing in JavaScript, typically used in a Node.js backend or a web application.

Step 1: Install the Ollama JavaScript Library

In your project directory, run:

npm i ollama

Step 2: Define the Tool Schema and Write the Code in JavaScript

In JavaScript, you typically define the tool schema as a JSON-style object.

import ollama from 'ollama';

// Describe the tool schema (e.g., for adding two numbers)
const addTool = {
    type: 'function',
    function: {
        name: 'addTwoNumbers',
        description: 'Add two numbers together',
        parameters: {
            type: 'object',
            required: ['a', 'b'],
            properties: {
                a: { type: 'number', description: 'The first number' },
                b: { type: 'number', description: 'The second number' }
            }
        }
    }
};

// Your actual JavaScript function that implements the tool
function executeAddTwoNumbers(a, b) {
    console.log(`--- Tool 'addTwoNumbers' called with a=${a}, b=${b} ---`);
    return a + b;
}

async function main() {
    const messages = [{ role: 'user', content: 'What is 2 plus 3?' }];
    console.log('User:', messages[0].content);

    console.log('Assistant (streaming):');
    let assistantResponseContent = "";
    let toolToCallInfo = null;

    try {
        const responseStream = await ollama.chat({
            model: 'qwen3', // Or another capable model
            messages: messages,
            tools: [addTool],
            stream: true
        });

        for await (const chunk of responseStream) {
            if (chunk.message.content) {
                process.stdout.write(chunk.message.content);
                assistantResponseContent += chunk.message.content;
            }
            if (chunk.message.tool_calls && chunk.message.tool_calls.length > 0) {
                toolToCallInfo = chunk.message.tool_calls[0]; // Assuming one tool call
                process.stdout.write(`\n--- Detected Tool Call: ${toolToCallInfo.function.name} ---\n`);
                break; // Stop processing stream to handle tool call
            }
            if (chunk.done) {
                process.stdout.write('\n--- Stream finished ---\n');
                if (!toolToCallInfo) {
                    console.log("No tool call was made.");
                }
                break;
            }
        }

        // --- If a tool call was detected, handle it ---
        if (toolToCallInfo) {
            const toolName = toolToCallInfo.function.name;
            const toolArgs = toolToCallInfo.function.arguments;

            console.log(`Arguments for the tool:`, toolArgs);

            let toolResult;
            if (toolName === 'addTwoNumbers') {
                toolResult = executeAddTwoNumbers(toolArgs.a, toolArgs.b);
                console.log(`--- Tool execution result: ${toolResult} ---`);

                // Append assistant's partial message and the tool message
                messages.push({
                    role: 'assistant',
                    content: assistantResponseContent, // Include content leading up to tool call
                    tool_calls: [toolToCallInfo]
                });
                messages.push({
                    role: 'tool',
                    content: toolResult.toString(), // Result must be a string
                    // tool_call_id: toolToCallInfo.id // If available and needed
                });

                console.log("\n--- Sending tool result back to model ---");
                const followUpStream = await ollama.chat({
                    model: 'qwen3',
                    messages: messages,
                    stream: true
                });

                console.log("Assistant (after tool call):");
                for await (const followUpChunk of followUpStream) {
                    if (followUpChunk.message.content) {
                        process.stdout.write(followUpChunk.message.content);
                    }
                     if (followUpChunk.done) {
                        process.stdout.write('\n--- Follow-up stream finished ---\n');
                        break;
                    }
                }
            } else {
                console.error(`Error: Unknown tool '${toolName}' requested.`);
            }
        }

    } catch (error) {
        console.error('Error during Ollama chat:', error);
    }
}

main().catch(console.error);

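Explanation of the JavaScript code:

Import ollama.
addTool object: The JSON schema that describes the tool to Ollama.
executeAddTwoNumbers function: The actual JavaScript function behind the tool.
main async function:

Starts the conversation with the messages array.
await ollama.chat({...}): Makes the call.
tools: [addTool]: Passes the tool schema.
stream: true: Enables streaming.
for await (const chunk of responseStream): This loop processes each streamed chunk.
chunk.message.content: The text portion of the stream.
chunk.message.tool_calls: If present, the model wants to use a tool. Save it as toolToCallInfo.

Handling the tool call: As in the Python example, if toolToCallInfo is set:

Extract the name and arguments.
Call executeAddTwoNumbers().
Append to the messages array the assistant's message (including the tool call request) and a new role: "tool" message containing the result.
Make another ollama.chat call with the updated messages to get the final response.

Expected output flow (similar to the cURL and Python examples): The user's question is shown, then the assistant's response streams in. If the model decides to call addTwoNumbers, you see the tool call information and the result of the JavaScript function, after which the model continues streaming its response based on that result.

The sample JavaScript output looks like this: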

Question: What is 2 plus 3?
<think>
Okay, the user is asking...
</think>
Tool call: {
  function: {
    name: "addTwoNumbers",
    arguments: { a: 2, b: 3 },
  },
}

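How Ollama Parses Tool Calls While Streaming

You may wonder how Ollama streams text and still identifies tool calls so smoothly. It uses a new **incremental parser**.

The old approach: Many systems had to wait for the complete response and then scan it for tool calls (usually formatted as JSON). Because a tool call could appear anywhere in the output, this blocked streaming.
Ollama's new approach:
The parser looks at each model's template to understand how that model signals a tool call (for example, with special tokens or a prefix).
This lets Ollama identify tool calls incrementally as data streams in and separate them from regular text content.
It can also handle models that output a valid tool call structure even though they were not explicitly trained with a tool prefix. If needed, it looks for JSON-like structures, but it does so selectively rather than grabbing any JSON it sees.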
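
To illustrate the general idea of incremental parsing, here is a toy Python parser. It is not Ollama's implementation. The <tool_call> markers, the class name, and the chunk contents are all hypothetical, and real models signal tool calls in model-specific ways. The sketch releases plain text as soon as it cannot be part of a marker and buffers anything that might be a tool call until it is complete.

```python
import json
from typing import Any, Dict, List, Tuple

START, END = "<tool_call>", "</tool_call>"  # hypothetical markers for this toy example


class ToyIncrementalParser:
    def __init__(self) -> None:
        self.buffer = ""

    def feed(self, chunk: str) -> Tuple[str, List[Dict[str, Any]]]:
        """Return (text safe to display now, tool calls completed by this chunk)."""
        self.buffer += chunk
        text, calls = "", []
        while self.buffer:
            start = self.buffer.find(START)
            if start == -1:
                # Hold back a trailing fragment that could be the start of a marker.
                keep = self._partial_marker_length()
                cut = len(self.buffer) - keep
                text += self.buffer[:cut]
                self.buffer = self.buffer[cut:]
                break
            text += self.buffer[:start]
            end = self.buffer.find(END, start)
            if end == -1:
                self.buffer = self.buffer[start:]  # wait for the rest of the call
                break
            calls.append(json.loads(self.buffer[start + len(START):end]))
            self.buffer = self.buffer[end + len(END):]
        return text, calls

    def _partial_marker_length(self) -> int:
        """Length of the longest buffer suffix that is a proper prefix of START."""
        for size in range(min(len(START) - 1, len(self.buffer)), 0, -1):
            if self.buffer.endswith(START[:size]):
                return size
        return 0


chunks = [
    "Let me check. <tool",
    '_call>{"name": "get_current_weather", ',
    '"arguments": {"location": "Toronto"}}</tool_call>',
    "Done.",
]
parser = ToyIncrementalParser()
shown, found = [], []
for piece in chunks:
    text, calls = parser.feed(piece)
    shown.append(text)
    found.extend(calls)
print(shown)
print(found)
```

Note how the first chunk ends in the middle of a marker: the parser shows "Let me check. " right away and holds back "<tool" until the next chunk settles what it is. A production parser would also need a way to flush text that remains buffered when the stream ends.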

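Why Is This Better?

**True streaming:** Text arrives immediately, and tool calls are identified on the fly.
**Accuracy:** It is better at avoiding false positives. For example, if the model talks about a tool call it made earlier, the new parser is less likely to trigger that call again by mistake.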

Tip: Improve Performance with the Context Window

For more complex interactions, especially with tool calling, the size of the "context window" the model uses can matter. A larger context window lets the model keep more of the current conversation in view.

**Model Context Protocol (MCP):** Ollama's improvements work well with MCP.
**num_ctx:** Sets the suggested context window size. For example, 32,000 (32k) tokens or more may improve tool calling performance and result quality.
**Trade-off:** A larger context window uses more memory.

Example: setting the context window with cURL. Use a model that supports a larger context, such as llama3.1 or llama4.

curl -X POST "http://localhost:11434/api/chat" -d '{
  "model": "llama3.1",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ],
  "options": {
    "num_ctx": 32000
  }
}'

If tool calling is not as reliable as you expect, try adjusting this setting.
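
If you build requests in code, the same option can be added to an existing request body. The helper below is a small sketch. The name with_context_window is hypothetical, and it only prepares the payload without sending it.

```python
import copy
import json
from typing import Any, Dict


def with_context_window(payload: Dict[str, Any], num_ctx: int) -> Dict[str, Any]:
    """Return a copy of the request body with options.num_ctx set."""
    updated = copy.deepcopy(payload)  # leave the caller's payload untouched
    updated.setdefault("options", {})["num_ctx"] = num_ctx
    return updated


base = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "why is the sky blue?"}],
}
print(json.dumps(with_context_window(base, 32000), indent=2))
```

Returning a copy makes it easy to compare results with and without the larger context window.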

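What's Next?

You now have the fundamentals for building sophisticated, real-time AI applications with Ollama using streaming responses and tool calling.

Ideas to explore:

Connect tools to real-world APIs (weather, stock prices, search engines).
Build agents that can carry out multi-step tasks.
Create chatbots that feel more natural and responsive.

For the latest updates and more advanced examples, see the official Ollama documentation and its GitHub repository, including the "Tool Streaming Pull Request" for a deeper technical dive.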
