How to Stream Responses and Use Tool Calling with Ollama

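This guide shows you how to use one of Ollama's powerful new features: streaming responses while calling tools (functions, APIs, and so on) in real time. It is a game-changer for building chat applications that feel alive and can interact with the world around them.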

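In this tutorial, you'll learn:

What response streaming and tool calling mean in Ollama.
Why this combination is so useful for your AI projects.
Step-by-step instructions for implementing it with:
cURL (for quick tests and universal access)
Python (for backend applications)
JavaScript (for web and Node.js applications)
A peek at how Ollama cleverly handles these features under the hood.
Tips for getting the best performance.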

💡

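Want a great API testing tool that generates beautiful API documentation?

Want an integrated, all-in-one platform for your developer team to work together with maximum productivity?

Apidog delivers all your demands, and replaces Postman at a much more affordable price!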

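Getting Started: What You'll Need

To follow along, you'll need a few things:

Ollama installed: make sure a recent version of Ollama is installed and running on your system. If not, download and install it from the official Ollama website.
Basic command-line knowledge: needed for the cURL examples.
A Python environment (for the Python section): Python 3.x with pip for managing packages.
A Node.js environment (for the JavaScript section): Node.js and npm installed.
An understanding of JSON: Ollama uses JSON to structure data and tool calls.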

Understanding the Core Ideas: Streaming and Tool Calling

Let's take a closer look at what we mean by "response streaming" and "tool calling".

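What Is Response Streaming?

Imagine you're chatting with an AI. Instead of waiting for it to think and type out its entire answer, streaming means the AI sends its response piece by piece (token by token, roughly word by word) as it generates it. This makes the interaction feel much faster and more natural, like a real conversation.

When you enable streaming in Ollama ("stream": true), you receive these incremental updates, each one a separate JSON object on its own line.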
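To make the incremental updates concrete, here is a minimal sketch that reads newline-delimited JSON chunks and yields the text pieces in order. The sample lines are hand-written stand-ins shaped like the chunks shown later in this guide, not captured Ollama output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield each non-empty message.content piece from newline-delimited JSON chunks."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text  # in a real app, print or render this piece immediately
        if chunk.get("done"):
            return  # the final chunk carries "done": true


# Hand-written sample lines shaped like Ollama's streamed chunks (not captured output).
sample = [
    '{"model": "qwen3", "message": {"role": "assistant", "content": "Okay, "}, "done": false}',
    '{"model": "qwen3", "message": {"role": "assistant", "content": "I will "}, "done": false}',
    '{"model": "qwen3", "message": {"role": "assistant", "content": "try to get that for you."}, "done": false}',
    '{"model": "qwen3", "message": {"role": "assistant", "content": ""}, "done": true}',
]
print("".join(iter_stream_text(sample)))  # Okay, I will try to get that for you.
```

Using a generator keeps the consumer in control: each piece can be displayed as soon as it is decoded, instead of after the whole answer has arrived.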

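How Does Tool Calling Work?

Tool calling lets an AI model do more than just generate text. You can define "tools" (essentially functions or external APIs) that the AI can decide to use to get information or perform actions.

For example, tools might include:

get_current_weather(location): fetches the current weather.
calculate_sum(number1, number2): performs a calculation.
search_web(query): retrieves information from the internet.

Once you describe these tools to Ollama, the model, whenever it judges that a tool would help answer the user's query, signals its intent to call that tool with specific arguments. The model does not run the tool itself: your application executes it and sends the result back to the model so the conversation can continue.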
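The application side of that loop is ordinary code: look up the requested function by name, call it with the model's arguments, and wrap the result in a role: "tool" message. This is a minimal sketch, assuming tool calls are dicts shaped like the ones Ollama streams back ({"function": {"name": ..., "arguments": {...}}}); calculate_sum is the hypothetical tool from the list above, implemented locally.

```python
from typing import Any, Callable, Dict


def calculate_sum(number1: float, number2: float) -> float:
    """Hypothetical local implementation of the calculate_sum tool."""
    return number1 + number2


# Map tool names (as advertised to the model) to the Python functions that implement them.
TOOLS: Dict[str, Callable[..., Any]] = {"calculate_sum": calculate_sum}


def run_tool_call(tool_call: Dict[str, Any]) -> Dict[str, str]:
    """Execute one tool call and return the message to append to the conversation."""
    name = tool_call["function"]["name"]
    args = tool_call["function"].get("arguments", {})
    func = TOOLS.get(name)
    if func is None:
        content = f"Error: unknown tool {name!r}"
    else:
        try:
            content = str(func(**args))
        except TypeError as exc:  # wrong or missing argument names
            content = f"Error: bad arguments for {name!r}: {exc}"
    # Tool results go back to the model as plain strings.
    return {"role": "tool", "content": content}


call = {"function": {"name": "calculate_sum", "arguments": {"number1": 2, "number2": 3}}}
print(run_tool_call(call))  # {'role': 'tool', 'content': '5'}
```

Returning errors as text rather than raising is a design choice: the model sees what went wrong and can retry with corrected arguments.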

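Why Combine Streaming and Tool Calling?

The big upgrade in Ollama is that it can now handle tool calls *while* streaming responses. This means your application can:

Receive the model's initial text as it streams in.
Discover, partway through the stream, that the model wants to call a tool.
Handle the tool call.
The same stream may also carry text around the tool call (for example, "OK, let me get the weather for Toronto...").
Once your application has the tool result, send it back to the model, which then continues streaming its response, now informed by the tool output.

This makes for highly responsive and capable AI applications.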

Which Models Support This?

Ollama has enabled this for several popular models, including:

Qwen 3
Devstral
Qwen2.5 and Qwen2.5-coder
Llama 3.1
Llama 4
...and more are being added all the time!

Making Your First Streaming Tool Call with cURL

cURL is a great way to quickly test Ollama's API. Let's ask for the weather in Toronto.

Step 1: Conceptualize the Tool

Our tool is get_current_weather. It needs:

location (string): e.g. "Toronto"
format (string): e.g. "celsius" or "fahrenheit"
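On your side, the tool is just a function. The sketch below is a hypothetical stand-in that returns a canned temperature instead of calling a real weather service; it shows validating the model-supplied format against the same two values the schema will allow, since a model can still produce unexpected arguments.

```python
ALLOWED_FORMATS = ("celsius", "fahrenheit")


# Parameter names mirror the tool schema, so "format" shadows the builtin on purpose.
def get_current_weather(location: str, format: str) -> str:
    """Hypothetical stand-in: returns a fixed temperature instead of querying a weather API."""
    if format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}, got {format!r}")
    temp_c = 20  # canned value for illustration only
    if format == "celsius":
        return f"{location}: {temp_c} degrees Celsius"
    return f"{location}: {temp_c * 9 / 5 + 32:.0f} degrees Fahrenheit"


print(get_current_weather("Toronto", "celsius"))  # Toronto: 20 degrees Celsius
```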

Step 2: Build the cURL Command

Open your terminal and prepare the following command. We'll break it down right after:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather today in Toronto?"
    }
  ],
  "stream": true,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The location to get the weather for, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "description": "The format to return the weather in, e.g. \"celsius\" or \"fahrenheit\"",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}'

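Breakdown:

curl http://localhost:11434/api/chat: the command and Ollama's chat API endpoint.
-d '{...}': sends JSON data in the request body (with -d, cURL makes a POST request).
"model": "qwen3": specifies the AI model to use.
"messages": [...]: the conversation history. Here, it's just the user's question.
"stream": true: this is the key! It tells Ollama to stream the response. (Streaming is already the default for /api/chat; setting it explicitly makes the intent clear.)
"tools": [...]: an array defining the tools the model can use.
"type": "function": specifies the tool type.
"function": {...}: describes the function.
"name": "get_current_weather": the tool's name.
"description": "...": helps the model understand what the tool does.
"parameters": {...}: defines the arguments the tool accepts (using JSON Schema).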

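If you would rather script this request than paste it into a terminal, Python's standard library can send the same body. This is a minimal sketch, assuming Ollama is listening on the default localhost:11434 and the qwen3 model has been pulled; the tool schema repeats the one in the cURL command. build_chat_request only constructs the request, and stream_chunks sends it and yields one decoded chunk per response line.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default local Ollama address (assumption)

WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string",
                             "description": "The location to get the weather for, e.g. San Francisco, CA"},
                "format": {"type": "string",
                           "description": "The format to return the weather in",
                           "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location", "format"],
        },
    },
}


def build_chat_request(messages, tools, model="qwen3"):
    """Build (but do not send) a streaming POST request to Ollama's /api/chat."""
    body = {"model": model, "messages": messages, "stream": True, "tools": tools}
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_chunks(request):
    """Send the request and yield each newline-delimited JSON chunk as a dict."""
    with urllib.request.urlopen(request) as response:
        for raw_line in response:  # the response object iterates line by line
            if raw_line.strip():
                yield json.loads(raw_line)
```

Usage, with the server running: `for chunk in stream_chunks(build_chat_request([{"role": "user", "content": "What is the weather today in Toronto?"}], [WEATHER_TOOL])): print(chunk["message"])`.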
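Step 3: Run It and Watch the Output

Press Enter. You'll see a series of JSON objects appear one after another. That's the stream!

An example snippet from the stream (formatted here for readability; each object actually arrives on a single line):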

{
  "model": "qwen3", "created_at": "...",
  "message": { "role": "assistant", "content": "Okay, " }, "done": false
}

{
  "model": "qwen3", "created_at": "...",
  "message": { "role": "assistant", "content": "I will " }, "done": false
}

{
  "model": "qwen3", "created_at": "...",
  "message": { "role": "assistant", "content": "try to get that for you." }, "done": false
}

(Depending on its internal process, the model may also emit "thinking" tokens such as <think>...celsius...</think>; these are part of the stream as well.)

Then, crucially, you may see something like this:

{
  "model": "qwen3",
  "created_at": "2025-05-27T22:54:58.100509Z",
  "message": {
    "role": "assistant",
    "content": "", // Content might be empty when a tool call is made
    "tool_calls": [
      {
        "function": {
          "name": "get_current_weather",
          "arguments": { // The arguments the model decided on!
            "format": "celsius",
            "location": "Toronto"
          }
        }
      }
    ]
  },
  "done": false // Not the final chunk: a closing chunk with "done": true ends this response
}

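Things to notice:

Each chunk is a JSON object.
"done": false means the stream is still in progress. The final chunk has "done": true.
The "message" object contains:
"role": "assistant"
"content": the text part of the stream.
"tool_calls": an array that appears when the model wants to use a tool. It contains the tool's name and the arguments the model chose.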

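Because text and tool calls can arrive in different chunks, it helps to fold the whole stream into one assistant message that you can later put back into the conversation history. A minimal sketch, assuming the chunks have already been decoded into dicts shaped like the examples above:

```python
from typing import Any, Dict, Iterable, List


def fold_chunks(chunks: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge streamed chunks into a single assistant message (content plus any tool calls)."""
    content_parts: List[str] = []
    tool_calls: List[Dict[str, Any]] = []
    for chunk in chunks:
        message = chunk.get("message", {})
        content_parts.append(message.get("content", ""))
        tool_calls.extend(message.get("tool_calls") or [])  # key may be absent or null
        if chunk.get("done"):
            break
    folded: Dict[str, Any] = {"role": "assistant", "content": "".join(content_parts)}
    if tool_calls:
        folded["tool_calls"] = tool_calls
    return folded


chunks = [
    {"message": {"role": "assistant", "content": "Okay, "}, "done": False},
    {"message": {"role": "assistant", "content": "", "tool_calls": [
        {"function": {"name": "get_current_weather",
                      "arguments": {"format": "celsius", "location": "Toronto"}}}]}, "done": False},
    {"message": {"role": "assistant", "content": ""}, "done": True},
]
assistant_message = fold_chunks(chunks)
print(assistant_message["tool_calls"][0]["function"]["name"])  # get_current_weather
```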
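In a real application, when your code sees a tool_calls chunk, it would:

Pause stream processing (or handle the call asynchronously).
Run the actual get_current_weather function or API with the arguments "Toronto" and "celsius".
Get the result (e.g. "20 degrees Celsius").
Send that result back to Ollama as a new message with role: "tool", in a new /api/chat request that repeats the conversation so far, including the assistant message that contained the tool call.
The model then uses this information to continue its response, which is streamed as well.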
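Sketched as data, the follow-up request for the Toronto example could look like this. It assumes the tool returned the string "20 degrees Celsius" and that the assistant turn is the one that carried the tool call; the whole history is resent because /api/chat keeps no conversation state between requests.

```python
import json

history = [
    {"role": "user", "content": "What is the weather today in Toronto?"},
    # The assistant turn from the first stream, including the tool call it requested.
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"function": {"name": "get_current_weather",
                          "arguments": {"format": "celsius", "location": "Toronto"}}}
        ],
    },
    # What running the tool locally returned (example value).
    {"role": "tool", "content": "20 degrees Celsius"},
]

# Add "tools": [...] here as well if the model should be able to call another tool.
follow_up_body = {"model": "qwen3", "messages": history, "stream": True}
print(json.dumps(follow_up_body, indent=2))
```

The printed JSON can be passed to cURL's -d option exactly like the first request.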

Streaming Tool Calls with Python

Let's implement a similar idea in Python using Ollama's official library.

Step 1: Install the Ollama Python Library

If you haven't already, install or upgrade the library:

pip install -U ollama

Step 2: Define the Tool and Write the Python Code

The Ollama Python SDK is cleverly designed so that you can pass Python functions directly as tools. It inspects the function signature and docstring to generate the schema for the AI.
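To see roughly what that inspection involves, here is a much-simplified sketch that turns a function into a tool schema using inspect and type hints. It is not the SDK's actual code; the SDK's own conversion is more thorough (it handles more types, for example). function_to_tool is a hypothetical helper written only to illustrate the idea.

```python
import inspect
import json
import typing
from typing import Any, Callable, Dict

# Simplified mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def function_to_tool(fn: Callable[..., Any]) -> Dict[str, Any]:
    """Build an Ollama-style tool schema from a function's signature and docstring (simplified)."""
    hints = typing.get_type_hints(fn)
    doc = inspect.getdoc(fn) or ""
    lines = doc.splitlines()
    summary = lines[0] if lines else ""

    # Pick up "name (type): description" lines, as found in a Google-style Args: section.
    arg_docs: Dict[str, str] = {}
    for line in lines:
        name, sep, rest = line.strip().partition(" (")
        if sep and "): " in rest:
            arg_docs[name] = rest.split("): ", 1)[1]

    properties: Dict[str, Any] = {}
    required = []
    for name, param in inspect.signature(fn).parameters.items():
        prop = {"type": _JSON_TYPES.get(hints.get(name), "string")}  # unknown types fall back to string
        if name in arg_docs:
            prop["description"] = arg_docs[name]
        properties[name] = prop
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default value means the model must supply it

    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": summary,
            "parameters": {"type": "object", "properties": properties, "required": required},
        },
    }


def add_two_numbers(a: int, b: int) -> int:
    """
    Add two numbers.

    Args:
      a (int): The first number as an int.
      b (int): The second number as an int.

    Returns:
      int: The sum of the two numbers.
    """
    return a + b


print(json.dumps(function_to_tool(add_two_numbers), indent=2))
```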

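Let's build a simple math tool example. (In the original source material, the input defined add_two_numbers, but the sample output showed the model calling subtract_two_numbers. Here we use the add_two_numbers definition and let the model decide what to do based on the prompt.)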

import ollama

# Define the python function that can be used as a tool
def add_two_numbers(a: int, b: int) -> int:
  """
  Add two numbers.

  Args:
    a (int): The first number as an int.
    b (int): The second number as an int.

  Returns:
    int: The sum of the two numbers.
  """
  print(f"--- Tool 'add_two_numbers' called with a={a}, b={b} ---")
  return a + b

# --- Main conversation logic ---
messages = [{'role': 'user', 'content': 'What is three plus one?'}]
# Or, for the subtraction example in the original output:
# messages = [{'role': 'user', 'content': 'what is three minus one?'}]

print(f"User: {messages[0]['content']}")

# Make the chat request with streaming and the tool
# Note: ChatResponse type hint might be ollama.ChatResponse or similar depending on library version
response_stream = ollama.chat(
  model='qwen3', # Or another capable model
  messages=messages,
  tools=[
      { # You can also define the tool explicitly if needed, or pass the function directly
          'type': 'function',
          'function': {
              'name': 'add_two_numbers', # Must match the Python function name if you want it to be called directly by your code later
              'description': 'Add two integer numbers together.',
              'parameters': {
                  'type': 'object',
                  'properties': {
                      'a': {'type': 'integer', 'description': 'The first number'},
                      'b': {'type': 'integer', 'description': 'The second number'}
                  },
                  'required': ['a', 'b']
              }
          }
      }
      # Simpler way for Python: pass the function directly if the library supports easy schema generation from it
      # tools=[add_two_numbers] # The SDK can often create the schema from this
  ],
  stream=True
)

print("Assistant (streaming):")
full_response_content = ""
tool_call_info = None

for chunk in response_stream:
  # Print the streamed content part
  if chunk['message']['content']:
    print(chunk['message']['content'], end='', flush=True)
    full_response_content += chunk['message']['content']

  # Check for tool calls in the chunk
  if 'tool_calls' in chunk['message'] and chunk['message']['tool_calls']:
    tool_call_info = chunk['message']['tool_calls'][0] # Assuming one tool call for simplicity
    print(f"\n--- Detected Tool Call: {tool_call_info['function']['name']} ---")
    break # Stop processing stream for now, handle tool call

  if chunk.get('done'):
      print("\n--- Stream finished ---")
      if not tool_call_info:
          print("No tool call was made.")

# --- If a tool call was detected, handle it ---
if tool_call_info:
  tool_name = tool_call_info['function']['name']
  tool_args = tool_call_info['function']['arguments']

  print(f"Arguments for the tool: {tool_args}")

  # Here, you'd actually call your Python tool function
  if tool_name == "add_two_numbers":
    # For safety, ensure arguments are of correct type if necessary
    try:
        arg_a = int(tool_args.get('a'))
        arg_b = int(tool_args.get('b'))
        tool_result = add_two_numbers(a=arg_a, b=arg_b)
        print(f"--- Tool execution result: {tool_result} ---")

        # Now, send this result back to Ollama to continue the conversation
        messages.append({'role': 'assistant', 'content': full_response_content, 'tool_calls': [tool_call_info]})
        messages.append({
            'role': 'tool',
            'content': str(tool_result), # Result must be a string
            'tool_call_id': tool_call_info.get('id', '') # If your library/model provides a tool_call_id
        })

        print("\n--- Sending tool result back to model ---")

        follow_up_response_stream = ollama.chat(
            model='qwen3',
            messages=messages,
            stream=True
            # No tools needed here unless you expect another tool call
        )

        print("Assistant (after tool call):")
        for follow_up_chunk in follow_up_response_stream:
            if follow_up_chunk['message']['content']:
                print(follow_up_chunk['message']['content'], end='', flush=True)
            if follow_up_chunk.get('done'):
                print("\n--- Follow-up stream finished ---")
                break
    except ValueError:
        print("Error: Could not parse tool arguments as integers.")
    except Exception as e:
        print(f"An error occurred during tool execution or follow-up: {e}")
  else:
    print(f"Error: Unknown tool '{tool_name}' requested by the model.")

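Python code explanation:

import ollama: imports the library.
The add_two_numbers function: this is our tool. Its docstring and type hints help Ollama understand its purpose and parameters.
messages: starts the conversation with the user's query.
ollama.chat(...):

model, messages, and stream=True work as in the cURL example.
tools=[...]: supplies the tool definitions. The Python SDK is quite flexible: you can pass the function object directly (e.g. tools=[add_two_numbers]) and let it infer the schema, or define the schema explicitly as shown.

The response_stream loop:

chunk['message']['content']: the streamed text. We print it immediately.
chunk['message']['tool_calls']: if this key is present and non-empty, the AI wants to use a tool. We store this tool_call_info and break out of the loop to handle it.

Handling the tool call: if tool_call_info is set, we follow the same flow as in the cURL example:

Extract the name and arguments.
Call the actual Python function (add_two_numbers) with those arguments.
Important: we then append two messages to the messages list: the assistant's partial response (the one that led to the tool call request) and a new message with role: "tool" whose content is the function result converted to a string.
Make another ollama.chat call with the updated messages to get the AI's final response based on the tool output.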
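The example above handles a single tool call and then stops. A slightly more general loop reads each stream to the end, runs every requested tool, and repeats until the model answers without asking for one. The sketch below takes the chat call as a parameter so it can be exercised without a server; fake_chat and the max_rounds limit are illustrative assumptions. With the real SDK you might pass `lambda msgs: ollama.chat(model='qwen3', messages=msgs, tools=[add_two_numbers], stream=True)`; the SDK's response objects also support dictionary-style access, but treat that compatibility as something to verify against your library version.

```python
from typing import Any, Callable, Dict, Iterable, List

Chunk = Dict[str, Any]
ChatFn = Callable[[List[Dict[str, Any]]], Iterable[Chunk]]


def add_two_numbers(a: int, b: int) -> int:
    return a + b


TOOLS: Dict[str, Callable[..., Any]] = {"add_two_numbers": add_two_numbers}


def chat_with_tools(chat: ChatFn, messages: List[Dict[str, Any]], max_rounds: int = 5) -> str:
    """Stream, run any requested tools, and re-ask until the model replies without tool calls."""
    for _ in range(max_rounds):
        content, tool_calls = "", []
        for chunk in chat(messages):  # read the whole stream; don't stop at the first tool call
            message = chunk["message"]
            content += message.get("content") or ""
            tool_calls.extend(message.get("tool_calls") or [])
        if not tool_calls:
            return content  # final answer
        messages.append({"role": "assistant", "content": content, "tool_calls": tool_calls})
        for call in tool_calls:
            func = TOOLS[call["function"]["name"]]
            result = func(**call["function"]["arguments"])
            messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("model kept requesting tools; giving up")


def fake_chat(messages):
    """Hypothetical stand-in for a streaming ollama.chat call, used only for testing."""
    if messages[-1]["role"] == "user":
        yield {"message": {"content": "", "tool_calls": [
            {"function": {"name": "add_two_numbers", "arguments": {"a": 3, "b": 1}}}]}, "done": False}
        yield {"message": {"content": ""}, "done": True}
    else:
        yield {"message": {"content": f"The answer is {messages[-1]['content']}."}, "done": True}


history = [{"role": "user", "content": "What is three plus one?"}]
print(chat_with_tools(fake_chat, history))  # The answer is 4.
```

The max_rounds guard keeps a model that repeatedly requests tools from looping forever.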

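Expected output flow: first the user's question, then the assistant's response streaming in. When the model decides to call add_two_numbers, you'll see the "Detected Tool Call" message, the arguments, the Python function's result, and then the assistant continuing its response using that result. (The original material's sample output came from a subtraction prompt, where the model called subtract_two_numbers instead.)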

(The original sample output looked like this:

<think>
Okay, the user is asking ...
</think>

[ToolCall(function=Function(name='subtract_two_numbers', arguments={'a': 3, 'b': 1}))]

This shows the AI's internal "thinking" process and the structured tool call object provided by the Python SDK.)

Streaming Tool Calls with JavaScript (Node.js)

Now let's do the same thing in JavaScript, which is typically used for Node.js backends or web applications.

Step 1: Install the Ollama JavaScript Library

In your project directory, run:

npm i ollama

Step 2: Define the Tool Schema and Write the JavaScript Code

In JavaScript, you typically define the tool schema as a JSON object.

// Run as an ES module (a .mjs file, or "type": "module" in package.json) so this import works.
import ollama from 'ollama';

// Describe the tool schema (e.g., for adding two numbers)
const addTool = {
    type: 'function',
    function: {
        name: 'addTwoNumbers',
        description: 'Add two numbers together',
        parameters: {
            type: 'object',
            required: ['a', 'b'],
            properties: {
                a: { type: 'number', description: 'The first number' },
                b: { type: 'number', description: 'The second number' }
            }
        }
    }
};

// Your actual JavaScript function that implements the tool
function executeAddTwoNumbers(a, b) {
    console.log(`--- Tool 'addTwoNumbers' called with a=${a}, b=${b} ---`);
    return a + b;
}

async function main() {
    const messages = [{ role: 'user', content: 'What is 2 plus 3?' }];
    console.log('User:', messages[0].content);

    console.log('Assistant (streaming):');
    let assistantResponseContent = "";
    let toolToCallInfo = null;

    try {
        const responseStream = await ollama.chat({
            model: 'qwen3', // Or another capable model
            messages: messages,
            tools: [addTool],
            stream: true
        });

        for await (const chunk of responseStream) {
            if (chunk.message.content) {
                process.stdout.write(chunk.message.content);
                assistantResponseContent += chunk.message.content;
            }
            if (chunk.message.tool_calls && chunk.message.tool_calls.length > 0) {
                toolToCallInfo = chunk.message.tool_calls[0]; // Assuming one tool call
                process.stdout.write(`\n--- Detected Tool Call: ${toolToCallInfo.function.name} ---\n`);
                break; // Stop processing stream to handle tool call
            }
            if (chunk.done) {
                process.stdout.write('\n--- Stream finished ---\n');
                if (!toolToCallInfo) {
                    console.log("No tool call was made.");
                }
                break;
            }
        }

        // --- If a tool call was detected, handle it ---
        if (toolToCallInfo) {
            const toolName = toolToCallInfo.function.name;
            const toolArgs = toolToCallInfo.function.arguments;

            console.log(`Arguments for the tool:`, toolArgs);

            let toolResult;
            if (toolName === 'addTwoNumbers') {
                toolResult = executeAddTwoNumbers(toolArgs.a, toolArgs.b);
                console.log(`--- Tool execution result: ${toolResult} ---`);

                // Append assistant's partial message and the tool message
                messages.push({
                    role: 'assistant',
                    content: assistantResponseContent, // Include content leading up to tool call
                    tool_calls: [toolToCallInfo]
                });
                messages.push({
                    role: 'tool',
                    content: toolResult.toString(), // Result must be a string
                    // tool_call_id: toolToCallInfo.id // If available and needed
                });

                console.log("\n--- Sending tool result back to model ---");
                const followUpStream = await ollama.chat({
                    model: 'qwen3',
                    messages: messages,
                    stream: true
                });

                console.log("Assistant (after tool call):");
                for await (const followUpChunk of followUpStream) {
                    if (followUpChunk.message.content) {
                        process.stdout.write(followUpChunk.message.content);
                    }
                    if (followUpChunk.done) {
                        process.stdout.write('\n--- Follow-up stream finished ---\n');
                        break;
                    }
                }
            } else {
                console.error(`Error: Unknown tool '${toolName}' requested.`);
            }
        }

    } catch (error) {
        console.error('Error during Ollama chat:', error);
    }
}

main().catch(console.error);

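JavaScript code explanation:

import ollama: imports the library.
The addTool object: the JSON schema that describes the tool to Ollama.
The executeAddTwoNumbers function: the actual JavaScript function that implements the tool.
The async main function:

Starts the conversation with the messages array.
await ollama.chat({...}): makes the call.
tools: [addTool]: passes the tool schema.
stream: true: enables streaming.
for await (const chunk of responseStream): this loop processes each streamed chunk.
chunk.message.content: the text part of the stream.
chunk.message.tool_calls: if present, the AI wants to use a tool. We store toolToCallInfo.

Handling the tool call: as in Python, if toolToCallInfo is set:

Extract the name and arguments.
Call executeAddTwoNumbers().
Append the assistant message (including the tool call request) and a new role: "tool" message containing the result to the messages array.
Make another ollama.chat call with the updated messages to get the final response.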

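Expected output flow (similar to the cURL and Python examples): the user's question, then the assistant's response streaming in. When the model decides to call addTwoNumbers, you'll see the tool call information, the result of the JavaScript function, and the AI's answer continuing to stream based on that result.

The original sample output for JS looked like this: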

Question: What is 2 plus 3?
<think>
Okay, the user is asking...
</think>
Tool call: {
  function: {
    name: "addTwoNumbers",
    arguments: { a: 2, b: 3 },
  },
}

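How Ollama Handles Tool Parsing During Streaming

You might wonder how Ollama manages to stream text and identify tool calls so seamlessly. It uses a clever new incremental parser.

The old way: many systems had to wait for the *entire* AI response and then scan it for tool calls (usually formatted as JSON). This blocked streaming, because a tool call could appear anywhere.
Ollama's new way:
The parser looks at each model's specific template to understand how that model signals a tool call (e.g. special tokens or a prefix).
This lets Ollama identify tool calls "incrementally" as data streams in, separating them from regular text content.
It's smart enough to handle models that weren't explicitly trained with tool prefixes but still output valid tool call structures. If needed, it can fall back to looking for JSON-like structures, but it does so intelligently rather than grabbing just any JSON.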
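To make the idea concrete, here is a toy incremental parser written for this guide; it is not Ollama's parser (which is implemented in Go and derives its markers from each model's template). It assumes a Qwen-style template that wraps each call in <tool_call>...</tool_call>. Plain text is released as soon as it cannot be the start of that marker, and a tool call is emitted only once a complete tagged JSON object has arrived, so JSON that merely appears in prose is passed through as text. Handling malformed JSON is left out for brevity.

```python
import json
from typing import Any, Dict, List, Tuple

OPEN, CLOSE = "<tool_call>", "</tool_call>"  # Qwen-style markers (assumption for this toy)


def _partial_prefix_len(text: str, marker: str) -> int:
    """Length of the longest suffix of text that is a proper prefix of marker."""
    for n in range(min(len(text), len(marker) - 1), 0, -1):
        if text.endswith(marker[:n]):
            return n
    return 0


class ToolCallStreamParser:
    """Toy incremental parser: splits streamed text into plain content and tool calls."""

    def __init__(self) -> None:
        self.buffer = ""
        self.in_call = False

    def feed(self, piece: str) -> Tuple[str, List[Dict[str, Any]]]:
        """Consume one streamed piece; return (text safe to show now, completed tool calls)."""
        self.buffer += piece
        text_out: List[str] = []
        calls: List[Dict[str, Any]] = []
        while True:
            if self.in_call:
                end = self.buffer.find(CLOSE)
                if end == -1:
                    break  # wait for the rest of the call
                calls.append(json.loads(self.buffer[:end]))
                self.buffer = self.buffer[end + len(CLOSE):]
                self.in_call = False
            else:
                start = self.buffer.find(OPEN)
                if start != -1:
                    text_out.append(self.buffer[:start])
                    self.buffer = self.buffer[start + len(OPEN):]
                    self.in_call = True
                    continue
                # Hold back a suffix that could be the beginning of "<tool_call>".
                keep = _partial_prefix_len(self.buffer, OPEN)
                text_out.append(self.buffer[: len(self.buffer) - keep])
                self.buffer = self.buffer[len(self.buffer) - keep:]
                break
        return "".join(text_out), calls

    def flush(self) -> str:
        """Return any held-back text once the stream has ended."""
        rest, self.buffer = self.buffer, ""
        return rest


pieces = ["Let me check. <tool", '_call>{"name": "get_current_weather", ',
          '"arguments": {"location": "Toronto", "format": "celsius"}}</tool_call>']
parser = ToolCallStreamParser()
for piece in pieces:
    text, calls = parser.feed(piece)
    if text:
        print("text:", repr(text))
    for call in calls:
        print("tool call:", call["name"], call["arguments"])
print("leftover:", repr(parser.flush()))
```

Note that the marker itself arrives split across two pieces ("<tool" and "_call>"); holding back the ambiguous suffix is what keeps it from leaking into the visible text.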

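Why is this better?

True streaming: you get text immediately, and tool calls are identified on the fly.
Accuracy: it is better at avoiding false positives (for example, if the AI *talks about* a tool call it made earlier, the new parser is less likely to trigger it again by mistake).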

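Tip: Improve Performance with the Context Window

For more complex interactions, especially tool calling, the size of the "context window" the model uses can matter. The larger the context window, the more of the current conversation the model can remember, and tool definitions and tool results take up room in that window too.

Model Context Protocol (MCP): Ollama's improvements work well with MCP.
num_ctx: you can set the context window size. For example, 32,000 (32k) tokens or more can improve tool-calling performance and the quality of results.
Trade-off: a larger context window uses more memory.

Example: setting the context window with cURL. Use a model that supports a larger context, such as llama3.1 or llama4 (the example in the original material used llama3.2):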

curl -X POST "http://localhost:11434/api/chat" -d '{
  "model": "llama3.1",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ],
  "options": {
    "num_ctx": 32000
  }
}'

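If tool calling isn't as reliable as you'd like, experiment with this setting.

What's Next?

You now have the foundational knowledge to build sophisticated, real-time AI applications with Ollama using streaming responses and tool calling!

Ideas to explore:

Connect tools to real APIs (weather, stocks, search engines).
Build agents that can carry out multi-step tasks.
Create more natural and responsive chatbots.

For the latest updates and more advanced examples, check the official Ollama documentation and GitHub repository (including the "Tool Streaming Pull Request" mentioned in the original source material, for a deeper technical dive).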
