DeepSeek published V4 pricing the same day the models dropped, April 23, 2026, and the numbers reset the floor for frontier AI. V4-Flash runs at $0.14 per million input tokens and $0.28 per million output tokens. V4-Pro runs at $1.74 input and $3.48 output. Both carry a 1M-token context window and up to 384K output tokens. Both also carry an aggressive cache-hit discount that cuts input costs by 80% (Flash) to 92% (Pro) on repeated prompts.
This guide covers the full rate card, how context caching changes the real per-call cost, an honest comparison against GPT-5.5 and Claude Opus, and four ways to keep spend predictable inside Apidog.
For the product overview, see what is DeepSeek V4. For the developer walkthrough, see how to use the DeepSeek V4 API. For zero-cost paths, see how to use DeepSeek V4 for free.
TL;DR
- V4-Flash: $0.14 / M input (cache miss), $0.028 / M input (cache hit), $0.28 / M output.
- V4-Pro: $1.74 / M input (cache miss), $0.145 / M input (cache hit), $3.48 / M output.
- Context window: 1M tokens input, 384K tokens output, on both variants.
- Cache-hit discount: roughly 80% off Flash, 92% off Pro on repeated prefixes.
- deepseek-chat and deepseek-reasoner deprecate July 24, 2026; billing maps to V4-Flash.
- At cache-miss rates, V4-Pro is ~2.9x cheaper than GPT-5.5 on input and ~8.6x cheaper on output.
The full rate card
| Model | Input (cache miss) | Input (cache hit) | Output | Context (input / output) |
|---|---|---|---|---|
| deepseek-v4-flash | $0.14 / M | $0.028 / M | $0.28 / M | 1M / 384K |
| deepseek-v4-pro | $1.74 / M | $0.145 / M | $3.48 / M | 1M / 384K |
| deepseek-chat (deprecated 2026-07-24) | maps to V4-Flash non-thinking | — | — | — |
| deepseek-reasoner (deprecated 2026-07-24) | maps to V4-Flash thinking | — | — | — |
Three details matter more than the raw numbers.
First, prices are the same whether you are in thinking mode or non-thinking mode. The model ID sets the rate; the reasoning mode just changes how many tokens you burn at that rate.
Second, cache-hit pricing is automatic. Every request with a repeated prefix against the same account benefits; you do not need to opt in or wire anything up. Prefixes must be at least 1,024 tokens long and must match byte-for-byte.
Third, the older deepseek-chat and deepseek-reasoner IDs now bill as V4-Flash aliases. If you have not migrated, you are already getting V4-Flash quality at V4-Flash prices; the ID deprecation deadline is July 24, 2026.
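The byte-for-byte rule means prompt ordering matters: anything that varies per request should come after the static content. The sketch below is only an illustration of that point. It measures the shared prefix of two prompts in UTF-8 bytes; the real 1,024 threshold is counted in tokens, which this does not attempt to reproduce, and the prompts and the helper name `shared_prefix_bytes` are made up for the example.

```python
def shared_prefix_bytes(a: str, b: str) -> int:
    """Length in bytes of the longest common prefix of two prompts (UTF-8)."""
    x, y = a.encode("utf-8"), b.encode("utf-8")
    n = 0
    for p, q in zip(x, y):
        if p != q:
            break
        n += 1
    return n

# Stand-in for a long static system prompt (25 characters x 400 = 10,000 bytes).
SYSTEM = "You are a support agent. " * 400

# Variable content first: the two prompts diverge almost immediately.
bad_1 = "Request id: 1001\n" + SYSTEM
bad_2 = "Request id: 2002\n" + SYSTEM

# Static content first: the whole system prompt is a shared prefix.
good_1 = SYSTEM + "\nRequest id: 1001"
good_2 = SYSTEM + "\nRequest id: 2002"

print(shared_prefix_bytes(bad_1, bad_2))    # 12
print(shared_prefix_bytes(good_1, good_2))  # 10013
```

In the first pair only the 12 bytes of "Request id: " match, so nothing useful could be served from a prefix cache; in the second pair the entire system prompt is reusable.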
Context caching in plain English
Caching is the single biggest cost lever on DeepSeek V4. The pattern is simple: anything that repeats across calls, especially long system prompts, agent tool schemas, and RAG context, gets billed at a fraction of the full input rate on the second and subsequent calls.
A concrete example. You run an agent on V4-Pro with a 20,000-token system prompt that never changes, then ask 100 different user questions of 200 tokens each, and each answer runs about 500 output tokens.
Without caching:
- Input: 100 calls × 20,200 tokens × $1.74 / M = $3.51
- Output: 100 calls × 500 tokens × $3.48 / M = $0.17
- Total: $3.69
With caching (first call misses, next 99 hit):
- First-call input: 20,200 × $1.74 / M = $0.035
- Next 99 cache-hit prefixes: 99 × 20,000 × $0.145 / M = $0.287
- Next 99 cache-miss user turns: 99 × 200 × $1.74 / M = $0.034
- Output: 100 × 500 × $3.48 / M = $0.174
- Total: $0.53
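Roughly 7x cheaper on an identical workload. On V4-Flash the same workload drops from about $0.30 to about $0.08, roughly 4x: the relative saving is smaller because Flash's cache discount is 80% rather than 92%, but the absolute bill is already tiny.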
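The arithmetic above can be reproduced with a short script. This is a minimal sketch, not an official calculator: it uses the V4-Pro rates from the rate card and assumes every call after the first hits the cache on the full system prompt, with no cache expiry. The function name `run_cost` is made up for the example.

```python
# USD per million tokens, copied from the V4-Pro row of the rate card.
PRO_INPUT_MISS = 1.74
PRO_INPUT_HIT = 0.145
PRO_OUTPUT = 3.48

def run_cost(calls, prefix_tokens, user_tokens, output_tokens,
             miss_rate, hit_rate, output_rate, cached=True):
    """Return the USD cost of `calls` requests that share one static prefix."""
    per_m = 1_000_000
    output = calls * output_tokens * output_rate / per_m
    if not cached:
        return calls * (prefix_tokens + user_tokens) * miss_rate / per_m + output
    # The first call pays the cache-miss rate on everything.
    first = (prefix_tokens + user_tokens) * miss_rate / per_m
    # Later calls: prefix at the hit rate, the new user turn at the miss rate.
    later = (calls - 1) * (prefix_tokens * hit_rate + user_tokens * miss_rate) / per_m
    return first + later + output

without = run_cost(100, 20_000, 200, 500,
                   PRO_INPUT_MISS, PRO_INPUT_HIT, PRO_OUTPUT, cached=False)
with_cache = run_cost(100, 20_000, 200, 500,
                      PRO_INPUT_MISS, PRO_INPUT_HIT, PRO_OUTPUT)

print(f"without caching: ${without:.2f}")     # $3.69
print(f"with caching:    ${with_cache:.2f}")  # $0.53
print(f"ratio: {without / with_cache:.1f}x")  # 7.0x
```

Swapping in the V4-Flash rates (0.14, 0.028, 0.28) shows how the same workload behaves on the cheaper model.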
How it compares to GPT-5.5 and Claude
The comparison most teams actually care about:
| Model | Input (std) | Input (cached) | Output | Context |
|---|---|---|---|---|
| DeepSeek V4-Flash | $0.14 / M | $0.028 / M | $0.28 / M | 1M |
| DeepSeek V4-Pro | $1.74 / M | $0.145 / M | $3.48 / M | 1M |
| GPT-5.5 | $5 / M | $1.25 / M | $30 / M | 1M |
| GPT-5.5 Pro | $30 / M | — | $180 / M | 1M |
| Claude Opus 4.6 | $15 / M | $1.50 / M | $75 / M | 200K |
Three readings of this table.
- On output tokens, V4-Pro is roughly 8.6x cheaper than GPT-5.5 and about 21.6x cheaper than Claude Opus 4.6. Output is where most agent workloads spend their budget; the gap compounds.
- On cached input, V4-Pro is roughly 8.6x cheaper than GPT-5.5 cached and about 10x cheaper than Claude cached. Long system prompts, tool schemas, and repeated RAG context hit hardest here.
- On raw benchmark ratio, V4-Pro matches or beats GPT-5.5 on LiveCodeBench (93.5 vs the top tier) and Codeforces (3206 vs 3168) while costing a small fraction. That is the core of the open-weights value proposition. See what is DeepSeek V4 for the full benchmark table.
The honest caveats: Claude still beats V4-Pro on long-context retrieval benchmarks, and Gemini 3.1 Pro still leads MMLU-Pro. If your workload depends on needle-in-a-haystack retrieval across a million tokens, the per-token savings may not recover the quality gap.
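The price multiples in the readings above follow directly from the comparison table. A minimal sketch for recomputing them, with the prices copied from that table; the helper name `times_cheaper` is made up for the example:

```python
# USD per million tokens, copied from the comparison table above.
PRICES = {
    "DeepSeek V4-Pro": {"input": 1.74, "cached": 0.145, "output": 3.48},
    "GPT-5.5": {"input": 5.00, "cached": 1.25, "output": 30.00},
    "Claude Opus 4.6": {"input": 15.00, "cached": 1.50, "output": 75.00},
}

def times_cheaper(baseline: str, other: str, column: str) -> float:
    """How many times more `other` costs than `baseline` in one price column."""
    return PRICES[other][column] / PRICES[baseline][column]

for rival in ("GPT-5.5", "Claude Opus 4.6"):
    for column in ("input", "cached", "output"):
        ratio = times_cheaper("DeepSeek V4-Pro", rival, column)
        print(f"{rival:16s} {column:7s} {ratio:5.1f}x")
```

Rerunning this after any provider changes its rate card is cheaper than re-deriving the ratios by hand.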
Cost modeling for common workloads
Four workloads cover most production use cases. The figures below use V4-Pro at the cache-miss rate unless noted: workload 3 uses V4-Flash, and workload 4 includes cache hits. For the first three, cache-hit savings compound on top.
1. Agentic coding loop (50K context, 2K output, 20 calls per task)
- Input: 50,000 × 20 × $1.74 / M = $1.74
- Output: 2,000 × 20 × $3.48 / M = $0.14
- Per-task cost: ~$1.88
Compare to GPT-5.5 at roughly $6.20 per task on the same shape.
2. Long-document Q&A (500K context, 1K output)
- Input: 500,000 × $1.74 / M = $0.87
- Output: 1,000 × $3.48 / M = $0.003
- Per-call cost: ~$0.87
Compare to GPT-5.5 at roughly $2.53 per call.
3. High-volume classification (2K context, 200 output, 10,000 calls)
Use V4-Flash here; V4-Pro is overkill.
- Input: 2,000 × 10,000 × $0.14 / M = $2.80
- Output: 200 × 10,000 × $0.28 / M = $0.56
- Run cost: ~$3.36
Compare to GPT-5.5 at roughly $160 for the same run ($100 input plus $60 output at the rates in the comparison table).
4. Repeated-prompt chatbot (10K system prompt, 500 user tokens, 1K output, 1,000 sessions)
- First-call input: 10,500 × $1.74 / M = $0.018
- Cache-hit input: 999 × 10,000 × $0.145 / M = $1.45
- Cache-miss user turns: 999 × 500 × $1.74 / M = $0.87
- Output: 1,000 × 1,000 × $3.48 / M = $3.48
- Session run cost: ~$5.82
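Compare to GPT-5.5 with caching at roughly $45.04 on the same workload at the rates in the comparison table ($0.05 first call, $12.49 cached input, $2.50 uncached user turns, $30.00 output).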
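The first three workloads share one formula: calls × (input tokens × input rate + output tokens × output rate). A minimal sketch of that model follows, using the cache-miss rates from the rate card. The `Workload` class is a hypothetical helper, and the fourth workload is left out because it mixes cache-hit and cache-miss input.

```python
from dataclasses import dataclass

# USD per million tokens as (cache-miss input, output), from the rate card.
RATES = {
    "deepseek-v4-flash": (0.14, 0.28),
    "deepseek-v4-pro": (1.74, 3.48),
}

@dataclass
class Workload:
    name: str
    model: str
    input_tokens: int   # per call
    output_tokens: int  # per call
    calls: int = 1

    def cost(self) -> float:
        """USD cost with every input token billed at the cache-miss rate."""
        input_rate, output_rate = RATES[self.model]
        per_call = self.input_tokens * input_rate + self.output_tokens * output_rate
        return self.calls * per_call / 1_000_000

WORKLOADS = [
    Workload("agentic coding loop", "deepseek-v4-pro", 50_000, 2_000, calls=20),
    Workload("long-document Q&A", "deepseek-v4-pro", 500_000, 1_000),
    Workload("high-volume classification", "deepseek-v4-flash", 2_000, 200, calls=10_000),
]

for w in WORKLOADS:
    print(f"{w.name}: ${w.cost():.2f}")
```

Adding another provider is a matter of adding its rates to `RATES` and reusing the same `Workload` shapes.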
Hidden costs to watch
The sticker price is not the whole story. Four line items bite teams after the first month:
- Thinking-mode token inflation. thinking_max burns 3x to 10x more output tokens than non-thinking on the same prompt. Those reasoning tokens bill at the output rate. Gate Think Max behind a flag.
- Silent context growth. Agent loops often feed the entire conversation back into each turn. At 1M-token contexts, this balloons fast. Truncate or summarize aggressively.
- Retry storms. A buggy loop that retries on every 500 response can double your bill in an hour. Add exponential backoff and a hard per-request retry cap.
- Development churn. Iterating on a prompt through curl re-runs the full context every time. Using Apidog cuts this to near zero because variable substitution makes prompt tweaks free to retry without re-typing the full payload.
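The retry-storm item is the easiest of the four to guard against in code. Below is a minimal sketch of exponential backoff with a hard per-request retry cap. It is not tied to any particular HTTP client: `send` is any zero-argument callable that performs the request, and `TransientError` is a hypothetical exception standing in for whatever your client raises on a 500 response. The default delays are illustrative, not recommendations.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for a retryable failure, e.g. an HTTP 500."""

def call_with_backoff(send, max_retries=3, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call `send()`, retrying transient failures with exponential backoff.

    Gives up after `max_retries` retries, so one logical request can never
    trigger more than max_retries + 1 billed calls.
    """
    for attempt in range(max_retries + 1):
        try:
            return send()
        except TransientError:
            if attempt == max_retries:
                raise  # hard cap reached; surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries from many clients over time.
            sleep(random.uniform(0, delay))

# Usage: a fake request that fails twice, then succeeds.
attempts = []

def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("HTTP 500")
    return "ok"

result = call_with_backoff(flaky, sleep=lambda seconds: None)  # skip real sleeping
print(result, len(attempts))  # ok 3
```

Passing `sleep` as a parameter keeps the function testable without waiting on real delays.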
Track cost in Apidog
The workflow most teams land on once bills get real:
- Download Apidog and store DEEPSEEK_API_KEY as a secret variable per environment.
- Save a single POST request to https://api.deepseek.com/v1/chat/completions.
- In the response panel, pin usage.prompt_tokens, usage.completion_tokens, and usage.reasoning_tokens. Every call surfaces the cost math on the same screen as the output.
- Parameterize model and thinking_mode so you can A/B V4-Flash vs V4-Pro, and Non-Think vs Think Max, without duplicating requests.
- Mirror the same collection for GPT-5.5 (the matching GPT-5.5 API guide documents the setup). One window, both providers, costs visible.
That workflow catches roughly 80% of the cost surprises that show up in month-end invoices.
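The same cost math can be scripted outside Apidog. The sketch below turns one response's usage object into a dollar figure using the rate card. It rests on two assumptions this guide does not confirm: that the usage object exposes a `prompt_cache_hit_tokens` field for the cached share of the prompt, and that reasoning tokens are already counted inside `completion_tokens`. Check both against a real response before relying on the numbers.

```python
# USD per million tokens, from the rate card above.
RATES = {
    "deepseek-v4-flash": {"miss": 0.14, "hit": 0.028, "output": 0.28},
    "deepseek-v4-pro": {"miss": 1.74, "hit": 0.145, "output": 3.48},
}

def cost_from_usage(model: str, usage: dict) -> float:
    """Estimate the USD cost of one call from its `usage` object.

    Assumptions: `prompt_cache_hit_tokens` (if present) is the cached share
    of `prompt_tokens`, and `completion_tokens` already includes reasoning.
    """
    rates = RATES[model]
    hit = usage.get("prompt_cache_hit_tokens", 0)
    miss = usage["prompt_tokens"] - hit
    total = (miss * rates["miss"]
             + hit * rates["hit"]
             + usage["completion_tokens"] * rates["output"])
    return total / 1_000_000

# Hand-written sample shaped like a usage block, not real API output.
sample = {
    "prompt_tokens": 20_200,
    "prompt_cache_hit_tokens": 20_000,
    "completion_tokens": 500,
    "reasoning_tokens": 0,
}
cost = cost_from_usage("deepseek-v4-pro", sample)
print(f"${cost:.6f}")  # $0.004988
```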
Four rules that keep spend predictable
- Default to V4-Flash. Switch to V4-Pro only when you have measured a quality gap that moves revenue.
- Default to Non-Think. Escalate to Think High on hard tasks. Reserve Think Max for correctness-critical work.
- Cap max_tokens. The 384K output ceiling is a safety, not a target. Most production answers fit in 2K.
- Ship usage telemetry. Log prompt_tokens, completion_tokens, and reasoning_tokens on every call. Alert on reasoning-token spikes; they signal prompts that drifted into Think Max territory by accident.
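The last rule, alerting on reasoning-token spikes, can be sketched as a rolling-average check. Everything here is illustrative: the window size, the 3x threshold, and the class name `ReasoningSpikeMonitor` are assumptions for the example, and the usage dictionaries are hand-written rather than real API output.

```python
from collections import deque
from statistics import mean

class ReasoningSpikeMonitor:
    """Flag calls whose reasoning_tokens far exceed the recent average."""

    def __init__(self, window=50, factor=3.0, min_samples=5):
        self.history = deque(maxlen=window)  # recent reasoning-token counts
        self.factor = factor
        self.min_samples = min_samples

    def record(self, usage: dict) -> bool:
        """Store one call's usage; return True if it looks like a spike."""
        tokens = usage.get("reasoning_tokens", 0)
        spike = (
            len(self.history) >= self.min_samples
            and tokens > self.factor * max(mean(self.history), 1)
        )
        self.history.append(tokens)
        return spike

monitor = ReasoningSpikeMonitor()
for tokens in [400, 450, 380, 420, 410, 5_000]:
    usage = {"prompt_tokens": 2_000, "completion_tokens": tokens + 300,
             "reasoning_tokens": tokens}
    if monitor.record(usage):
        print(f"alert: reasoning_tokens={tokens}")  # fires once, for 5000
```

In production the `print` would be replaced by whatever alerting channel the team already uses; the check itself stays the same.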
FAQ
Is there a free tier? There is no free API tier, but new accounts occasionally receive a small trial credit. For zero-cost paths outside the API, see how to use DeepSeek V4 for free.
How does cache-hit pricing work? Prefixes of 1,024 tokens or more that repeat across requests within the same account bill at the cache-hit rate. The first call pays the cache-miss rate; subsequent identical-prefix calls pay the discounted rate. Caching is automatic.
Do thinking modes cost more? The rate per token is the same. Thinking modes consume more tokens because the model writes reasoning traces. Track reasoning_tokens in the usage object to measure the true cost.
Is pricing stable? DeepSeek changes pricing periodically. The V3.2 rates held for most of 2025; V4 pricing has no published end-date. Check the live pricing page before budgeting.
Are V4-Pro and V4-Flash billed at the same output rate? No. V4-Pro output is $3.48 / M; V4-Flash output is $0.28 / M. The 12.4x ratio is the single biggest reason to default to V4-Flash.
Does the Anthropic-format endpoint change pricing? No. https://api.deepseek.com/anthropic uses the same rates as the OpenAI-format endpoint. Format does not affect billing.



