DeepSeek turned the most aggressive temporary discount in 2026 LLM pricing into the new normal. On May 22, the team announced that the 75% off DeepSeek-V4-Pro offer, originally set to expire on May 31, 2026 at 15:59 UTC, would not roll back. The promotional rate becomes the permanent list price. Input drops to $0.435 per million tokens, output to $0.87, and cache hits to $0.003625. Below, we break down what changed, what stayed the same, and what every API developer should reconsider this week.
TL;DR
- DeepSeek-V4-Pro API pricing is now permanent at 1/4 of the original list price: $0.435/MTok input, $0.87/MTok output, $0.003625/MTok cache hit.
- The 75% promo discount that was set to end May 31, 2026 is now the regular rate. No rollback. No surprise expiry.
- V4-Pro is now roughly 34x cheaper than GPT-5.5 on output while landing within ~95% of GPT-5.5 on most coding and reasoning benchmarks.
- The cache-hit price of $0.003625/MTok, a 90% cut on top of the headline cut, is the underrated detail. Long system prompts are now nearly free at the prefix.
- If you priced your AI features against GPT-5.5 or Claude Opus 4.7 last quarter, the build math shifted this week.
Why this matters now
LLM pricing usually moves in one direction: down, slowly, with footnotes. DeepSeek skipped the footnotes. The team ran an aggressive promo through May, watched developer traffic climb, and decided to lock the price in instead of letting it snap back. That’s a structural signal about where Chinese frontier-model economics are heading, not a one-time stunt.
If you’re shipping any product that calls an LLM in a hot path (autocomplete, retrieval-augmented chat, code review, agent loops) the difference between $3.48 and $0.87 per million output tokens shows up on your invoice this month. Ship 50 million output tokens a day, a realistic load for any agent with non-trivial users, and the new price cuts your monthly LLM bill from roughly $5,200 to $1,300. That’s a sales hire, or a year of GPU credits.
Building on top of DeepSeek? Apidog lets you generate, test, and monitor V4-Pro API calls in a single workspace, including streaming, tool calls, and JSON schema validation. Download Apidog and you can clone the requests in this article in under a minute.
In the rest of this post, you’ll see the full new price sheet, a head-to-head against GPT-5.5 and Claude Opus 4.7, the cache-hit math most articles miss, three real-bill scenarios, and a five-step decision framework for whether to migrate today.
What changed: the announcement decoded
DeepSeek’s official pricing notice is short, but each line moves a number. Three facts worth pulling out:
- The 75% discount is permanent. The promo running through May 31, 2026 15:59 UTC was supposed to revert to the launch list price on June 1. It won’t. The promo rate is the new list rate, retroactive to launch and forward indefinitely.
- The cut applies to V4-Pro only. DeepSeek-V4-Flash, at $0.14 / $0.28 per million tokens, was already cheap. V4-Pro, the frontier-tier model, is what dropped. See What is DeepSeek V4 for the Flash vs Pro split.
- Cache-hit pricing was cut to 1/10 of launch, effective April 26, 2026 12:15 UTC. This is a separate change from the headline 75% cut, and the two stack. The result: cache hits at $0.003625/MTok, the lowest first-party frontier-model cache price on the market in 2026.
Read together, the announcement says: DeepSeek is willing to absorb gross margin on the headline model to keep developer mindshare. The cache-hit move says: they want you building agents and long-context tools on V4-Pro specifically. Both moves point to the same playbook. Win the inference workload now, monetize the platform later.
The new permanent price sheet
Pricing per 1 million tokens, USD, effective immediately and permanent:
| Token type | Old list | New permanent | Cut |
|---|---|---|---|
| Input (cache miss) | $1.74 | $0.435 | 75% |
| Input (cache hit) | $0.0145 | $0.003625 | 75% |
| Output | $3.48 | $0.87 | 75% |
A few takeaways the table buries:
- The output drop is the one that hits your invoice hardest, because output tokens dominate any agent loop where the model reasons or writes code.
- The cache-hit row looks tiny because the absolute numbers are tiny. The ratio is where the savings live. Input miss to input hit is roughly 120:1. A well-designed system prompt that hits cache 90% of the time pays almost nothing for input, which is the unlock for any agent with a stable scaffold.
- These rates apply to the API only. DeepSeek’s web chat remains free for individuals.
For deeper historical context on V4 pricing tiers and Flash-vs-Pro tradeoffs, see our standing DeepSeek V4 API Pricing reference.
How V4-Pro now compares to GPT-5.5, Claude Opus 4.7, and Gemini 3.5 Flash
The interesting comparison isn’t with V4-Pro’s old self. It’s against the rest of the frontier shelf.
| Model | Input ($/MTok) | Output ($/MTok) | SWE-bench Pro |
|---|---|---|---|
| DeepSeek-V4-Pro (new) | $0.435 | $0.87 | 55.4% |
| GPT-5.5 | $5.00 | $30.00 | 58.6% |
| Claude Opus 4.7 | $3.00 | $15.00 | ~62% |
| Gemini 3.5 Flash | ~$1.50 | ~$9.00 | ~48% |
| DeepSeek-V4-Flash | $0.14 | $0.28 | ~42% |
Two numbers to remember. On output tokens, the line item that runs up your bill, DeepSeek-V4-Pro is 34x cheaper than GPT-5.5 and 17x cheaper than Claude Opus 4.7. On benchmarks, V4-Pro lands within 3 to 7 percentage points of GPT-5.5 on most public coding and reasoning evals, per the DataCamp comparison.
If your workload is latency-tolerant and quality-acceptable in that small band, the migration is a math problem with one answer. For workloads where the last 5 points of benchmark score matter (agent tool reliability, long-horizon planning, hard math), V4-Pro is still cheaper to use as a draft model behind a speculative-decoding or critic pattern.
For deeper head-to-head reviews, see DeepSeek V4 vs Claude Opus 4.5 for coding and GLM-5 vs DeepSeek V3 vs GPT-5: speed, cost, and practical developer comparison.
The cache-hit angle most articles miss
Everyone quotes the $0.87 output number. Few explain what the $0.003625 cache-hit input price does to system design.
DeepSeek’s prompt cache hits when the prefix of your request is byte-identical to a recent prior request, within roughly a 30-minute window. For chat agents and retrieval pipelines, the prefix is usually your system prompt plus tool definitions plus instruction scaffolding. That’s typically 4,000 to 10,000 tokens that don’t change between turns.
Concrete example. Suppose your assistant uses a 6,000-token system prompt and handles 100,000 chat turns per day, with an average user message of 200 input tokens and an average response of 800 output tokens.
- Without cache hits: 100,000 turns × 6,200 input tokens × $0.435 / 1,000,000 = $269.70 per day on input alone.
- With 90% of those system-prompt tokens hitting cache: the same 100,000 turns pay 200 × $0.435 plus 6,000 × (0.9 × $0.003625 + 0.1 × $0.435) per million tokens. That comes out to about $32 per day. An 88% reduction on input cost.
That’s not a rounding error. It’s the difference between the model being a sustainable line item and a luxury one. For more on how prefix caching works across providers, our prompt caching deep dive walks through the mechanics.
Three patterns to get cache hits in real agents:
- Pin the prefix. Keep the system prompt, tool schemas, and few-shot examples in a single block at the start of every request. Don’t interleave session-specific text into the prefix.
- Sort or hash dynamic context. If you append retrieved chunks, sort them stably or hash the request and route identical hashes to the same node. Small fingerprint shifts kill the cache.
- Run a warm-up call. On agent startup, send one request with the full prefix to seat it in the provider’s cache before user traffic hits.
What you should do this week
The migration decision isn’t binary. It depends on what kind of LLM workload you’re running. A five-step framework:
1. Measure your current output:input ratio. If you’re spending 80% of your token budget on output (any agent, code generator, or content tool), the savings from V4-Pro are large. If you’re spending 80% on input (RAG over long documents), the savings are smaller but still real once cache hits land.
2. Run a 100-sample eval on your real workload. Don’t trust public benchmarks. Pull 100 traces from your production traffic, run them against V4-Pro and your current model with identical prompts, and score with your own judge. Most teams find V4-Pro is “good enough” for 70% to 85% of their traffic.
3. Pattern-match by route. Route the 70% to 85% to V4-Pro and keep your premium model on the hard tail. This single change delivers 70%+ of the cost savings with near-zero quality regression.
4. Lock in cache prefixes. Audit your system prompts. Anything that varies per request (timestamps, user IDs, session IDs) belongs in the user message, not the system prompt. Move it.
5. Set up regression tests before you ship. This is where Apidog earns its keep. Record golden responses from your current model, then replay the same requests against V4-Pro and diff the outputs. Apidog’s JSON schema validation catches drift in tool-call shapes before they reach production. Download Apidog, import your OpenAI-compatible collection, change the base URL to https://api.deepseek.com, and you can run a side-by-side smoke test in under ten minutes.
For a hands-on walkthrough of the V4-Pro endpoint shape, see How to use the DeepSeek V4 API.
How V4-Pro stacks up against other 2026 price drops
DeepSeek isn’t the only lab cutting prices. The 2026 LLM market is in a clear margin compression phase:
- OpenAI O3 dropped 80% earlier this year. See our O3 pricing breakdown for the math.
- Kimi K2 repriced aggressively to compete with DeepSeek’s V3 tier. Kimi K2 API pricing covers the details.
- Anthropic Claude held the line on Opus pricing but introduced cheaper Haiku and Sonnet tiers. The full Claude API cost breakdown walks through where each tier fits.
V4-Pro’s cut is the most aggressive of the year because it targets the frontier capability band, not the budget tier. That’s why this announcement reset the market and the others didn’t.
The build math shifted
DeepSeek didn’t drop the price. They redrew the curve. Frontier capability at sub-dollar output pricing is now the baseline, not the outlier, and the rest of the market will respond. If you’ve been deferring an LLM feature on cost grounds, the 2026 budget you priced in last quarter probably overstates your needs by 4x.
Three next steps:
- Audit your top three LLM workloads against the framework above and pick one to migrate this week.
- Lock in your cache prefixes. That’s the cheap win regardless of which model you use.
- Wire up an Apidog regression suite so the next price cut, and there will be one, takes hours to evaluate instead of weeks.
The promo flag came off. The discount didn’t.



