How to Use the DeepSeek V4 API?

Complete DeepSeek V4 API guide: endpoints, authentication, Python and Node examples, thinking modes, tool calling, streaming, JSON mode, and an Apidog workflow for testing without burning credits.

Ashley Innocent

24 April 2026

DeepSeek V4 launched with the API live on day one. The model IDs are deepseek-v4-pro and deepseek-v4-flash, the endpoint is OpenAI-compatible, and the base URL is https://api.deepseek.com. That means any client you already use against GPT-5.5 or other OpenAI-shape APIs works against V4 with a single base-URL swap.

This guide covers authentication, every parameter that matters, Python and Node examples, thinking-mode math, tool calling, streaming, and an Apidog-based workflow that keeps the cost visible while you iterate.

For the product-level overview, see what is DeepSeek V4. For the no-cost path, see how to use DeepSeek V4 for free.

TL;DR

Prerequisites

Before the first request, line up four things.

Export the key once:

export DEEPSEEK_API_KEY="sk-..."

Endpoint and authentication

Two endpoints cover two request shapes.

POST https://api.deepseek.com/v1/chat/completions    # OpenAI format
POST https://api.deepseek.com/anthropic/v1/messages  # Anthropic format

Pick OpenAI-compatible unless you have an existing Anthropic-shape codebase. The rest of this guide uses the OpenAI format.

Authentication is a bearer token on the standard Authorization header. The minimum viable request:

curl https://api.deepseek.com/v1/chat/completions \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [
      {"role": "user", "content": "Explain MoE routing in two sentences."}
    ]
  }'

Successful responses return a JSON body with a choices array, a usage block broken down into input and output tokens (and reasoning_tokens if thinking mode was on), and an id you can use for tracing. Failures return the standard OpenAI envelope with error.code and error.message.
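As a rough sketch of reading both envelopes, the helper below works on the decoded JSON body. The name summarize_body is hypothetical, and the usage keys prompt_tokens and completion_tokens are assumed from the OpenAI schema; the exact key names DeepSeek returns should be checked against a live response.

```python
from typing import Any


def summarize_body(body: dict[str, Any]) -> dict[str, Any]:
    """Flatten a decoded chat-completions body into the fields worth logging."""
    # Failure path: the OpenAI-style envelope carries error.code and error.message.
    if "error" in body:
        err = body["error"]
        return {"ok": False, "code": err.get("code"), "message": err.get("message")}

    usage = body.get("usage", {})
    return {
        "ok": True,
        "id": body.get("id"),  # keep this for tracing
        "text": body["choices"][0]["message"]["content"],
        # Assumed OpenAI key names for the input/output breakdown.
        "input_tokens": usage.get("prompt_tokens"),
        "output_tokens": usage.get("completion_tokens"),
        # Only present when thinking mode was on, so default to 0.
        "reasoning_tokens": usage.get("reasoning_tokens", 0),
    }
```

Branching on the error key first keeps the success path free of defensive checks.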

Request parameters

Every field maps to cost or behavior. Here is the map for deepseek-v4-pro and deepseek-v4-flash.

| Parameter | Type | Values | Notes |
| --- | --- | --- | --- |
| model | string | deepseek-v4-pro, deepseek-v4-flash | Required. |
| messages | array | role/content pairs | Required. Same schema as OpenAI. |
| thinking_mode | string | non-thinking, thinking, thinking_max | Default is non-thinking. |
| temperature | float | 0 to 2 | DeepSeek recommends 1.0. |
| top_p | float | 0 to 1 | DeepSeek recommends 1.0. |
| max_tokens | int | 1 to 131,072 | Caps output length. |
| stream | bool | true or false | Enables SSE streaming. |
| tools | array | OpenAI tool spec | For function calling. |
| tool_choice | string or object | auto, required, none, or specific tool | Controls tool use. |
| response_format | object | {"type": "json_object"} | JSON-mode output. |
| seed | int | any int | For reproducibility. |
| presence_penalty | float | -2 to 2 | Penalize repeated topics. |
| frequency_penalty | float | -2 to 2 | Penalize repeated tokens. |

thinking_mode is the biggest cost lever. non-thinking skips the reasoning trace entirely and returns tokens at roughly V3.2 speed. thinking enables a reasoning block that costs extra tokens but improves accuracy on code and math. thinking_max produces the scores in DeepSeek’s headline table; it also burns the most tokens and is the only mode that requires a 384K+ context budget.
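One way to catch out-of-range values before they cost a round trip is to validate against the ranges listed above on the client. The sketch below is illustrative only: build_payload is a hypothetical helper, and the max_tokens default of 2048 is an assumption rather than an API default.

```python
from typing import Any

MODELS = {"deepseek-v4-pro", "deepseek-v4-flash"}
THINKING_MODES = {"non-thinking", "thinking", "thinking_max"}


def build_payload(
    model: str,
    messages: list[dict[str, str]],
    thinking_mode: str = "non-thinking",
    temperature: float = 1.0,
    top_p: float = 1.0,
    max_tokens: int = 2048,  # assumed default for this sketch
) -> dict[str, Any]:
    """Check values against the documented ranges, then build the request body."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if thinking_mode not in THINKING_MODES:
        raise ValueError(f"unknown thinking_mode: {thinking_mode!r}")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must be between 0 and 1")
    if not 1 <= max_tokens <= 131_072:
        raise ValueError("max_tokens must be between 1 and 131,072")
    return {
        "model": model,
        "messages": messages,
        "thinking_mode": thinking_mode,
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }
```

The returned dictionary can be posted as the JSON body of a raw HTTP request.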

Python client

The official openai SDK works with a base-URL override. Every existing OpenAI-compatible wrapper, including LangChain, LlamaIndex, and DSPy, also works.

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "Reply in code only."},
        {"role": "user", "content": "Write a Rust function that debounces events."},
    ],
    extra_body={"thinking_mode": "thinking"},
    temperature=1.0,
    top_p=1.0,
    max_tokens=2048,
)

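choice = response.choices[0]
print("Content:", choice.message.content)
# reasoning_tokens is not a declared field on the SDK's usage type, so read it defensively.
print("Reasoning tokens:", getattr(response.usage, "reasoning_tokens", None))
print("Total tokens:", response.usage.total_tokens)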

The extra_body trick is how you pass DeepSeek-specific parameters through the OpenAI SDK without patching the library.

Node client

Same structure on Node:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.DEEPSEEK_API_KEY,
  baseURL: "https://api.deepseek.com/v1",
});

const response = await client.chat.completions.create({
  model: "deepseek-v4-flash",
  messages: [
    { role: "user", content: "Explain the Muon optimizer in plain English." },
  ],
  thinking_mode: "thinking",
  temperature: 1.0,
  top_p: 1.0,
});

console.log(response.choices[0].message.content);
console.log("Usage:", response.usage);

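In plain JavaScript the Node SDK forwards fields it does not recognize, so thinking_mode passes through at the top level without extra_body. In TypeScript the extra property may fail type checking, in which case a type assertion on the request object is one workaround.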

Streaming responses

Set stream: true and iterate the SSE chunks. The shape matches OpenAI exactly.

stream = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Stream a 300-word essay on MoE."}],
    stream=True,
    extra_body={"thinking_mode": "non-thinking"},
)

for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)

Reasoning traces stream separately when thinking mode is on; the delta.reasoning_content field carries them and you can surface them in the UI or drop them.
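A minimal sketch of keeping the two streams apart follows. It assumes each chunk exposes choices[0].delta with optional content and reasoning_content attributes, as described above; split_stream and fake_chunk are hypothetical names, and fake_chunk only stands in for SDK chunk objects so the logic can be exercised offline.

```python
from types import SimpleNamespace


def split_stream(chunks):
    """Accumulate answer text and reasoning text from streamed chunks separately."""
    answer, reasoning = [], []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta
        # reasoning_content is absent outside thinking mode, hence getattr.
        reasoning.append(getattr(delta, "reasoning_content", None) or "")
        answer.append(getattr(delta, "content", None) or "")
    return "".join(answer), "".join(reasoning)


def fake_chunk(content=None, reasoning_content=None):
    """Stand-in for an SDK chunk object, used only for this demonstration."""
    delta = SimpleNamespace(content=content, reasoning_content=reasoning_content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
```

In a UI, the reasoning text could feed a collapsible panel while the answer text renders as it arrives.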

Tool calling

V4 supports the standard OpenAI tool-call schema. Functions defined in the tools array become callable, and the model decides when to invoke them.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["c", "f"]},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Weather in Lagos in Celsius?"}],
    tools=tools,
    tool_choice="auto",
    extra_body={"thinking_mode": "thinking"},
)

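# With tool_choice="auto" the model may answer directly, so tool_calls can be None.
tool_calls = response.choices[0].message.tool_calls or []
for tool_call in tool_calls:
    # arguments is a JSON-encoded string; parse it before dispatching.
    print(tool_call.function.name, tool_call.function.arguments)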

From there, call the function, append the result as a role: "tool" message, and call the API again to continue the loop. The pattern is the standard OpenAI tool-use loop; Anthropic's loop follows the same call-and-return cycle with a different message shape.
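The middle step of that loop can be sketched as below. This is an illustration under assumptions: the tool call is handled as a plain dictionary in the OpenAI shape, get_weather is a hypothetical local implementation with a made-up return value, and REGISTRY and tool_result_message are hypothetical names.

```python
import json


def get_weather(city: str, unit: str = "c") -> dict:
    """Hypothetical local implementation; a real one would call a weather service."""
    return {"city": city, "unit": unit, "temp": 30}


# Maps the tool names declared in the tools array to local callables.
REGISTRY = {"get_weather": get_weather}


def tool_result_message(tool_call: dict) -> dict:
    """Run one tool call and build the role: "tool" message to append."""
    fn = REGISTRY[tool_call["function"]["name"]]
    # The model sends arguments as a JSON-encoded string.
    args = json.loads(tool_call["function"]["arguments"])
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],  # ties the result to the call
        "content": json.dumps(fn(**args)),
    }
```

In the OpenAI message format, the assistant message that contains the tool calls is appended to messages first, followed by one tool message per call, before the next request is sent.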

JSON mode

For structured output, ask for JSON explicitly and set the response format.

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "Reply with a single JSON object."},
        {"role": "user", "content": "Summarize this release note as {title, date, bullets}: ..."},
    ],
    response_format={"type": "json_object"},
    extra_body={"thinking_mode": "non-thinking"},
)

JSON mode forces valid JSON but does not enforce a specific schema. For schema validation, pair it with Pydantic or Zod on the client side.
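For a dependency-free picture of what that client-side check does, here is a small sketch using only the standard library in place of Pydantic. ReleaseSummary and parse_summary are hypothetical names, and the field types (strings and a list of strings) are assumptions about the {title, date, bullets} shape in the prompt above.

```python
import json
from dataclasses import dataclass


@dataclass
class ReleaseSummary:
    title: str
    date: str
    bullets: list[str]


def parse_summary(raw: str) -> ReleaseSummary:
    """Parse JSON-mode output and check it has the {title, date, bullets} shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"title", "date", "bullets"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["title"], str) or not isinstance(data["date"], str):
        raise ValueError("title and date must be strings")
    bullets = data["bullets"]
    if not isinstance(bullets, list) or not all(isinstance(b, str) for b in bullets):
        raise ValueError("bullets must be a list of strings")
    return ReleaseSummary(data["title"], data["date"], bullets)
```

A failed check is a reasonable trigger for one re-prompt that includes the validation error text.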

Build the collection in Apidog

Replaying requests from the terminal burns credits and hides the diff between runs. The workflow that survives real use:

  1. Download Apidog and create a project.
  2. Add an environment with {{BASE_URL}} set to https://api.deepseek.com/v1 and {{DEEPSEEK_API_KEY}} stored as a secret variable.
  3. Save a POST request to {{BASE_URL}}/chat/completions with the Authorization: Bearer {{DEEPSEEK_API_KEY}} header.
  4. Parameterize model and thinking_mode so you can A/B across variants without duplicating requests.
  5. Use the response viewer to inspect usage.reasoning_tokens on every run. That is the single clearest signal of whether you are paying for thinking mode you do not need.
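The parameterization in step 4 amounts to a small variant matrix. Outside Apidog, the same matrix can be sketched in a few lines; variant_bodies is a hypothetical helper, and the bodies it builds are raw request bodies with thinking_mode at the top level.

```python
from itertools import product

MODELS = ["deepseek-v4-pro", "deepseek-v4-flash"]
THINKING_MODES = ["non-thinking", "thinking", "thinking_max"]


def variant_bodies(prompt: str) -> list[dict]:
    """One request body per (model, thinking_mode) pair for the same prompt."""
    return [
        {
            "model": model,
            "thinking_mode": mode,
            "messages": [{"role": "user", "content": prompt}],
        }
        for model, mode in product(MODELS, THINKING_MODES)
    ]
```

Two models and three modes give six variants per prompt, which is worth remembering when estimating what an A/B run will cost.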

Teams already running the matching GPT-5.5 API collection in Apidog can duplicate it, swap the base URL to https://api.deepseek.com/v1, swap the model ID, and run comparison prompts across both providers in minutes.

Error handling

The envelope follows OpenAI exactly. The status codes you will hit first:

| Code | Meaning | Fix |
| --- | --- | --- |
| 400 | Bad request | Check JSON schema, especially messages and tools. |
| 401 | Invalid key | Regenerate at platform.deepseek.com. |
| 402 | Insufficient balance | Top up the account. |
| 403 | Model not allowed | Check the key’s scope and the model ID spelling. |
| 422 | Parameter out of range | max_tokens or thinking_mode probably mismatched. |
| 429 | Rate limit | Back off, then retry with exponential backoff and jitter. |
| 500 | Server error | Retry once; if it repeats, check status page. |
| 503 | Overloaded | Fall back to V4-Flash or retry in 30 seconds. |

Wrap calls in a retry helper that handles 429 and 5xx with exponential backoff. Do not retry other 4xx errors automatically; they are logic bugs, not transient failures.
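A retry helper along those lines might look like the sketch below. ApiError is a hypothetical exception type standing in for whatever your HTTP client raises, and the attempt count and base delay are illustrative defaults rather than recommendations from DeepSeek.

```python
import random
import time


class ApiError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_retryable(status_code: int) -> bool:
    """429 and 5xx are transient; every other 4xx is treated as a logic bug."""
    return status_code == 429 or 500 <= status_code < 600


def call_with_retry(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(); retry on 429/5xx with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ApiError as exc:
            last_attempt = attempt == max_attempts - 1
            if not is_retryable(exc.status_code) or last_attempt:
                raise
            # Full jitter: random delay in [0, base_delay * 2**attempt].
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The sleep parameter is injectable so the helper can be tested without waiting. With the openai Python SDK, the status is available on openai.APIStatusError as status_code, so the except clause can be adapted accordingly.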

Cost control patterns

Four patterns keep spend predictable.

  1. Default to V4-Flash. Switch to V4-Pro only for prompts where you have measured a quality lift.
  2. Gate thinking_max behind a flag. It is the most expensive mode by a wide margin; only route to it when correctness beats latency.
  3. Cap max_tokens. Most answers fit in 2,000 output tokens. The 1M context is for input, not output.
  4. Log usage on every call. Ship input, output, and reasoning counts to your observability stack; an alert on a sudden reasoning-token spike catches prompts that drifted.
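The spike alert in the last pattern can be approximated with a rolling average. This is only a sketch: ReasoningSpikeDetector is a hypothetical class, and the window size and the 3x threshold are arbitrary starting points to tune against your own traffic.

```python
from collections import deque


class ReasoningSpikeDetector:
    """Flag a call whose reasoning-token count far exceeds the recent average."""

    def __init__(self, window: int = 50, factor: float = 3.0):
        self.history = deque(maxlen=window)  # most recent reasoning-token counts
        self.factor = factor

    def observe(self, reasoning_tokens: int) -> bool:
        """Record one call; return True if it looks like a spike."""
        # Compare against the average of earlier calls, before adding this one.
        baseline = sum(self.history) / len(self.history) if self.history else None
        self.history.append(reasoning_tokens)
        return (
            baseline is not None
            and baseline > 0
            and reasoning_tokens > self.factor * baseline
        )
```

Keeping one detector per prompt template, rather than one globally, avoids flagging templates that legitimately reason longer than others.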

Migrating from older DeepSeek models

The older deepseek-chat and deepseek-reasoner IDs are scheduled for deprecation on July 24, 2026. Migration is a one-line diff per call site; the request and response shapes are unchanged.

-  model="deepseek-chat"
+  model="deepseek-v4-pro"

Before flipping production, run side-by-side A/B comparisons in Apidog. The response quality jump usually rewards the switch; the deprecation deadline forces it either way.
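To find the call sites that still need the diff, a quick scan for the two old IDs is usually enough. The sketch below works on source text already read into a string; find_deprecated is a hypothetical helper, not part of any DeepSeek tooling.

```python
import re

DEPRECATED_IDS = ("deepseek-chat", "deepseek-reasoner")
_PATTERN = re.compile("|".join(re.escape(m) for m in DEPRECATED_IDS))


def find_deprecated(source: str) -> list[tuple[int, str]]:
    """Return (line_number, model_id) for each deprecated ID found in source text."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((lineno, match.group()))
    return hits
```

Running it over each file in a repository gives a checklist of lines to update before the deadline.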

FAQ

Is the DeepSeek V4 API production-ready? Yes. The API went live on April 23, 2026 alongside the weights. DeepSeek V3 and V3.2 ran on the same infrastructure at scale for over a year, so the API surface is mature.

Does V4 support the Anthropic message format? Yes. Point at https://api.deepseek.com/anthropic/v1/messages and send the Anthropic-shape payload. Both formats hit the same underlying model.

What is the context window? 1 million tokens on both V4-Pro and V4-Flash. Note that Think Max mode recommends a minimum 384K working window.

How do I count input tokens before sending? Use the standard OpenAI tokenizer for approximations; DeepSeek publishes exact counts in the usage block on every response. For production budgeting, trust the response-side count.

Can I fine-tune via the API? Not at launch. Fine-tuning currently runs through the self-hosted Base checkpoints on Hugging Face.

Is the API free to try? There is no free tier at the account level, but new sign-ups occasionally receive a trial credit.
