DeepSeek V4 API Pricing

Apidog for Enterprise

On-Premises Deploy

SSO & RBAC

SOC 2 Compliant

DeepSeek published the V4 pricing on the same day the models dropped, April 23, 2026, and the numbers reset the floor for frontier AI. V4-Flash runs at $0.14 per million input tokens and $0.28 per million output tokens. V4-Pro runs at $1.74 input and $3.48 output. Both carry a 1M-token context window and up to 384K output tokens. Both also carry an aggressive cache-hit discount that cuts input costs by 80% to 90% on repeated prompts.

This guide covers the full rate card, how context caching changes the real per-call cost, an honest comparison against GPT-5.5 and Claude Opus, and four ways to keep spend predictable inside Apidog.

button

For the product overview, see what is DeepSeek V4. For the developer walkthrough, see how to use the DeepSeek V4 API. For zero-cost paths, see how to use DeepSeek V4 for free.

TL;DR

V4-Flash: $0.14 / M input (cache miss), $0.028 / M input (cache hit), $0.28 / M output.
V4-Pro: $1.74 / M input (cache miss), $0.145 / M input (cache hit), $3.48 / M output.
Context window: 1M tokens input, 384K tokens output, on both variants.
Cache-hit discount: roughly 80% off Flash, 92% off Pro on repeated prefixes.
deepseek-chat and deepseek-reasoner deprecate July 24, 2026; billing maps to V4-Flash.
At cache-miss rates, V4-Pro is ~2.9x cheaper than GPT-5.5 on input and ~8.6x cheaper on output.

The full rate card

Model	Input (cache miss)	Input (cache hit)	Output	Context
`deepseek-v4-flash`	$0.14 / M	$0.028 / M	$0.28 / M	1M / 384K
`deepseek-v4-pro`	$1.74 / M	$0.145 / M	$3.48 / M	1M / 384K
`deepseek-chat` (deprecated 2026-07-24)	maps to V4-Flash non-thinking	—	—	—
`deepseek-reasoner` (deprecated 2026-07-24)	maps to V4-Flash thinking	—	—	—

Three details matter more than the raw numbers.

First, prices are the same whether you are in thinking mode or non-thinking mode. The model ID sets the rate; the reasoning mode just changes how many tokens you burn at that rate.

Second, cache-hit pricing is automatic. Every request with a repeated prefix against the same account benefits; you do not need to opt in or wire anything up. Prefixes must be at least 1,024 tokens long and must match byte-for-byte.

Third, the older deepseek-chat and deepseek-reasoner IDs now bill as V4-Flash aliases. If you have not migrated, you are already getting V4-Flash quality at V4-Flash prices; the ID deprecation deadline is July 24, 2026.

Context caching in plain English

Caching is the single biggest cost lever on DeepSeek V4. The pattern is simple: anything that repeats across calls, especially long system prompts, agent tool schemas, and RAG context, gets billed at a fraction of the full input rate on the second and subsequent calls.

A concrete example. You run an agent with a 20,000-token system prompt that never changes, then ask 100 different user questions of 200 tokens each.

Without caching:

Input: 100 calls × 20,200 tokens × $1.74 / M = $3.52
Output: 100 calls × 500 tokens × $3.48 / M = $0.17
Total: $3.69

With caching (first call misses, next 99 hit):

First-call input: 20,200 × $1.74 / M = $0.035
Next 99 cache-hit prefixes: 99 × 20,000 × $0.145 / M = $0.287
Next 99 cache-miss user turns: 99 × 200 × $1.74 / M = $0.034
Output: 100 × 500 × $3.48 / M = $0.174
Total: $0.53

Roughly 7x cheaper on an identical workload. The caching effect is even more dramatic on V4-Flash, where the raw rate is already low.

How it compares to GPT-5.5 and Claude

The comparison most teams actually care about:

Model	Input (std)	Input (cached)	Output	Context
DeepSeek V4-Flash	$0.14 / M	$0.028 / M	$0.28 / M	1M
DeepSeek V4-Pro	$1.74 / M	$0.145 / M	$3.48 / M	1M
GPT-5.5	$5 / M	$1.25 / M	$30 / M	1M
GPT-5.5 Pro	$30 / M	—	$180 / M	1M
Claude Opus 4.6	$15 / M	$1.50 / M	$75 / M	200K

Three readings of this table.

On output tokens, V4-Pro is roughly 8.6x cheaper than GPT-5.5 and 21x cheaper than Claude Opus 4.6. Output is where most agent workloads spend their budget; the gap compounds.
On cached input, V4-Pro is roughly 10x cheaper than GPT-5.5 cached and 10x cheaper than Claude cached. Long system prompts, tool schemas, and repeated RAG context hit hardest here.
On raw benchmark ratio, V4-Pro matches or beats GPT-5.5 on LiveCodeBench (93.5 vs the top tier) and Codeforces (3206 vs 3168) while costing a small fraction. That is the core of the open-weights value proposition. See what is DeepSeek V4 for the full benchmark table.

The honest caveats: Claude still beats V4-Pro on long-context retrieval benchmarks, and Gemini 3.1 Pro still leads MMLU-Pro. If your workload depends on needle-in-a-haystack retrieval across a million tokens, the per-token savings may not recover the quality gap.

Cost modeling for common workloads

Four workloads cover most production use cases. Here is what each costs on V4-Pro (cache-miss baseline; cache-hit savings compound on top).

1. Agentic coding loop (50K context, 2K output, 20 calls per task)

Input: 50,000 × 20 × $1.74 / M = $1.74
Output: 2,000 × 20 × $3.48 / M = $0.14
Per-task cost: ~$1.88

Compare to GPT-5.5 at roughly $6.20 per task on the same shape.

2. Long-document Q&A (500K context, 1K output)

Input: 500,000 × $1.74 / M = $0.87
Output: 1,000 × $3.48 / M = $0.003
Per-call cost: ~$0.87

Compare to GPT-5.5 at roughly $2.53 per call.

3. High-volume classification (2K context, 200 output, 10,000 calls)

Use V4-Flash here; V4-Pro is overkill.

Input: 2,000 × 10,000 × $0.14 / M = $2.80
Output: 200 × 10,000 × $0.28 / M = $0.56
Run cost: ~$3.36

Compare to GPT-5.5 at roughly $110 for the same run.

4. Repeated-prompt chatbot (10K system prompt, 500 user tokens, 1K output, 1,000 sessions)

First-call input: 10,500 × $1.74 / M = $0.018
Cache-hit input: 999 × 10,000 × $0.145 / M = $1.45
Cache-miss user turns: 999 × 500 × $1.74 / M = $0.87
Output: 1,000 × 1,000 × $3.48 / M = $3.48
Session run cost: ~$5.82

Compare to GPT-5.5 with caching at roughly $26.35 on the same workload.

Hidden costs to watch

The sticker price is not the whole story. Four line items bite teams after the first month:

Thinking-mode token inflation. thinking_max burns 3x to 10x more output tokens than non-thinking on the same prompt. Those reasoning tokens bill at the output rate. Gate Think Max behind a flag.
Silent context growth. Agent loops often feed the entire conversation back into each turn. At 1M-token contexts, this balloons fast. Truncate or summarize aggressively.
Retry storms. A buggy loop that retries on every 500 response can double your bill in an hour. Add exponential backoff and a hard per-request retry cap.
Development churn. Iterating on a prompt through curl re-runs the full context every time. Using Apidog cuts this to near zero because variable substitution makes prompt tweaks free to retry without re-typing the full payload.

Track cost in Apidog

The workflow most teams land on once bills get real:

Download Apidog and store DEEPSEEK_API_KEY as a secret variable per environment.
Save a single POST request to https://api.deepseek.com/v1/chat/completions.
In the response panel, pin usage.prompt_tokens, usage.completion_tokens, and usage.reasoning_tokens. Every call surfaces the cost math on the same screen as the output.
Parameterize model and thinking_mode so you can A/B V4-Flash vs V4-Pro, and Non-Think vs Think Max, without duplicating requests.
Mirror the same collection for GPT-5.5 (the matching GPT-5.5 API guide documents the setup). One window, both providers, costs visible.

That workflow catches roughly 80% of the cost surprises that show up in month-end invoices.

Four rules that keep spend predictable

Default to V4-Flash. Switch to V4-Pro only when you have measured a quality gap that moves revenue.
Default to Non-Think. Escalate to Think High on hard tasks. Reserve Think Max for correctness-critical work.
Cap max_tokens. The 384K output ceiling is a safety, not a target. Most production answers fit in 2K.
Ship usage telemetry. Log prompt_tokens, completion_tokens, and reasoning_tokens on every call. Alert on reasoning-token spikes; they signal prompts that drifted into Think Max territory by accident.

FAQ

Is there a free tier?No usage-free API tier, but new accounts occasionally receive a small trial credit. For zero-cost paths outside the API, see how to use DeepSeek V4 for free.

How does cache-hit pricing work?Prefixes of 1,024 tokens or more that repeat across requests within the same account bill at the cache-hit rate. The first call pays the cache-miss rate; subsequent identical-prefix calls pay the discounted rate. Caching is automatic.

Do thinking modes cost more?The rate per token is the same. Thinking modes consume more tokens because the model writes reasoning traces. Track reasoning_tokens in the usage object to measure the true cost.

Is pricing stable?DeepSeek changes pricing periodically. The V3.2 rates held for most of 2025; V4 pricing has no published end-date. Check the live pricing page before budgeting.

Are V4-Pro and V4-Flash billed at the same output rate?No. V4-Pro output is $3.48 / M; V4-Flash output is $0.28 / M. The 12.4x ratio is the single biggest reason to default to V4-Flash.

Does the Anthropic-format endpoint change pricing?No. https://api.deepseek.com/anthropic uses the same rates as the OpenAI-format endpoint. Format does not affect billing.

In this article

TL;DR The full rate card Context caching in plain English How it compares to GPT-5.5 and Claude Cost modeling for common workloads 1. Agentic coding loop (50K context, 2K output, 20 calls per task)2. Long-document Q&A (500K context, 1K output)3. High-volume classification (2K context, 200 output, 10,000 calls)4. Repeated-prompt chatbot (10K system prompt, 500 user tokens, 1K output, 1,000 sessions)Hidden costs to watch Track cost in Apidog Four rules that keep spend predictable FAQ

Apidog: A Real Design-first API Development Platform

API Design

API Documentation

API Debugging

Automated Testing

API Mocking

More

Get Started for Free

Enterprise

On-Premises or SaaS or EU-hosted

SSO, RBAC & audit logs

SOC 2, GDPR, ISO 27001

Explore Apidog Enterprise

Explore more

Best AI Image Detection APIs for Developers (2026)

Compare the best AI image detection APIs for developers in 2026. Evaluate accuracy, latency, and pricing across Hive, Sightengine, AI or Not, and Reality Defender.

8 June 2026

10 Cheapest LLM API Providers in 2026

Want the cheapest LLM API? Compare 10 providers by real per-token price, discounts, and free tiers for 2026. Hypereal AI and Blackmagic AI come out on top.

4 June 2026

API Docs With Git Integration: 6 Best Tools

Compare the best API docs tools with Git integration in 2026. Docs-as-code, OpenAPI sync, and PR previews across Apidog, Mintlify, Fern, Redocly, and more.

4 June 2026