DeepSeek V4-Pro 75% Price Cut Is Now Permanent: What It Means for Developers (2026)

DeepSeek V4-Pro pricing is now permanently 75% off: $0.435 input, $0.87 output, $0.003625 cache hit per 1M tokens. What it means for developers in 2026.

Ashley Innocent

Ashley Innocent

25 May 2026

DeepSeek V4-Pro 75% Price Cut Is Now Permanent: What It Means for Developers (2026)

Apidog for Enterprise

On-Premises Deploy

SSO & RBAC

SOC 2 Compliant

Explore Apidog Enterprise

DeepSeek turned the most aggressive temporary discount in 2026 LLM pricing into the new normal. On May 22, the team announced that the 75% off DeepSeek-V4-Pro offer, originally set to expire on May 31, 2026 at 15:59 UTC, would not roll back. The promotional rate becomes the permanent list price. Input drops to $0.435 per million tokens, output to $0.87, and cache hits to $0.003625. Below, we break down what changed, what stayed the same, and what every API developer should reconsider this week.

TL;DR

Why this matters now

LLM pricing usually moves in one direction: down, slowly, with footnotes. DeepSeek skipped the footnotes. The team ran an aggressive promo through May, watched developer traffic climb, and decided to lock the price in instead of letting it snap back. That’s a structural signal about where Chinese frontier-model economics are heading, not a one-time stunt.

If you’re shipping any product that calls an LLM in a hot path (autocomplete, retrieval-augmented chat, code review, agent loops) the difference between $3.48 and $0.87 per million output tokens shows up on your invoice this month. Ship 50 million output tokens a day, a realistic load for any agent with non-trivial users, and the new price cuts your monthly LLM bill from roughly $5,200 to $1,300. That’s a sales hire, or a year of GPU credits.

Building on top of DeepSeek? Apidog lets you generate, test, and monitor V4-Pro API calls in a single workspace, including streaming, tool calls, and JSON schema validation. Download Apidog and you can clone the requests in this article in under a minute.

button

In the rest of this post, you’ll see the full new price sheet, a head-to-head against GPT-5.5 and Claude Opus 4.7, the cache-hit math most articles miss, three real-bill scenarios, and a five-step decision framework for whether to migrate today.

What changed: the announcement decoded

DeepSeek’s official pricing notice is short, but each line moves a number. Three facts worth pulling out:

  1. The 75% discount is permanent. The promo running through May 31, 2026 15:59 UTC was supposed to revert to the launch list price on June 1. It won’t. The promo rate is the new list rate, retroactive to launch and forward indefinitely.
  2. The cut applies to V4-Pro only. DeepSeek-V4-Flash, at $0.14 / $0.28 per million tokens, was already cheap. V4-Pro, the frontier-tier model, is what dropped. See What is DeepSeek V4 for the Flash vs Pro split.
  3. Cache-hit pricing was cut to 1/10 of launch, effective April 26, 2026 12:15 UTC. This is a separate change from the headline 75% cut, and the two stack. The result: cache hits at $0.003625/MTok, the lowest first-party frontier-model cache price on the market in 2026.

Read together, the announcement says: DeepSeek is willing to absorb gross margin on the headline model to keep developer mindshare. The cache-hit move says: they want you building agents and long-context tools on V4-Pro specifically. Both moves point to the same playbook. Win the inference workload now, monetize the platform later.

The new permanent price sheet

Pricing per 1 million tokens, USD, effective immediately and permanent:

Token type Old list New permanent Cut
Input (cache miss) $1.74 $0.435 75%
Input (cache hit) $0.0145 $0.003625 75%
Output $3.48 $0.87 75%

A few takeaways the table buries:

For deeper historical context on V4 pricing tiers and Flash-vs-Pro tradeoffs, see our standing DeepSeek V4 API Pricing reference.

How V4-Pro now compares to GPT-5.5, Claude Opus 4.7, and Gemini 3.5 Flash

The interesting comparison isn’t with V4-Pro’s old self. It’s against the rest of the frontier shelf.

Model Input ($/MTok) Output ($/MTok) SWE-bench Pro
DeepSeek-V4-Pro (new) $0.435 $0.87 55.4%
GPT-5.5 $5.00 $30.00 58.6%
Claude Opus 4.7 $3.00 $15.00 ~62%
Gemini 3.5 Flash ~$1.50 ~$9.00 ~48%
DeepSeek-V4-Flash $0.14 $0.28 ~42%

Two numbers to remember. On output tokens, the line item that runs up your bill, DeepSeek-V4-Pro is 34x cheaper than GPT-5.5 and 17x cheaper than Claude Opus 4.7. On benchmarks, V4-Pro lands within 3 to 7 percentage points of GPT-5.5 on most public coding and reasoning evals, per the DataCamp comparison.

If your workload is latency-tolerant and quality-acceptable in that small band, the migration is a math problem with one answer. For workloads where the last 5 points of benchmark score matter (agent tool reliability, long-horizon planning, hard math), V4-Pro is still cheaper to use as a draft model behind a speculative-decoding or critic pattern.

For deeper head-to-head reviews, see DeepSeek V4 vs Claude Opus 4.5 for coding and GLM-5 vs DeepSeek V3 vs GPT-5: speed, cost, and practical developer comparison.

The cache-hit angle most articles miss

Everyone quotes the $0.87 output number. Few explain what the $0.003625 cache-hit input price does to system design.

DeepSeek’s prompt cache hits when the prefix of your request is byte-identical to a recent prior request, within roughly a 30-minute window. For chat agents and retrieval pipelines, the prefix is usually your system prompt plus tool definitions plus instruction scaffolding. That’s typically 4,000 to 10,000 tokens that don’t change between turns.

Concrete example. Suppose your assistant uses a 6,000-token system prompt and handles 100,000 chat turns per day, with an average user message of 200 input tokens and an average response of 800 output tokens.

That’s not a rounding error. It’s the difference between the model being a sustainable line item and a luxury one. For more on how prefix caching works across providers, our prompt caching deep dive walks through the mechanics.

Three patterns to get cache hits in real agents:

What you should do this week

The migration decision isn’t binary. It depends on what kind of LLM workload you’re running. A five-step framework:

1. Measure your current output:input ratio. If you’re spending 80% of your token budget on output (any agent, code generator, or content tool), the savings from V4-Pro are large. If you’re spending 80% on input (RAG over long documents), the savings are smaller but still real once cache hits land.

2. Run a 100-sample eval on your real workload. Don’t trust public benchmarks. Pull 100 traces from your production traffic, run them against V4-Pro and your current model with identical prompts, and score with your own judge. Most teams find V4-Pro is “good enough” for 70% to 85% of their traffic.

3. Pattern-match by route. Route the 70% to 85% to V4-Pro and keep your premium model on the hard tail. This single change delivers 70%+ of the cost savings with near-zero quality regression.

4. Lock in cache prefixes. Audit your system prompts. Anything that varies per request (timestamps, user IDs, session IDs) belongs in the user message, not the system prompt. Move it.

5. Set up regression tests before you ship. This is where Apidog earns its keep. Record golden responses from your current model, then replay the same requests against V4-Pro and diff the outputs. Apidog’s JSON schema validation catches drift in tool-call shapes before they reach production. Download Apidog, import your OpenAI-compatible collection, change the base URL to https://api.deepseek.com, and you can run a side-by-side smoke test in under ten minutes.

For a hands-on walkthrough of the V4-Pro endpoint shape, see How to use the DeepSeek V4 API.

How V4-Pro stacks up against other 2026 price drops

DeepSeek isn’t the only lab cutting prices. The 2026 LLM market is in a clear margin compression phase:

V4-Pro’s cut is the most aggressive of the year because it targets the frontier capability band, not the budget tier. That’s why this announcement reset the market and the others didn’t.

The build math shifted

DeepSeek didn’t drop the price. They redrew the curve. Frontier capability at sub-dollar output pricing is now the baseline, not the outlier, and the rest of the market will respond. If you’ve been deferring an LLM feature on cost grounds, the 2026 budget you priced in last quarter probably overstates your needs by 4x.

Three next steps:

The promo flag came off. The discount didn’t.

button

Explore more

Bruno for Teams: Cloud Sync Alternatives and Workarounds

Bruno for Teams: Cloud Sync Alternatives and Workarounds

Bruno has no cloud sync. Here is every team workaround, its real limits, and how Apidog's new Spec-First Git mode meets Bruno on git's home turf while adding live sync and RBAC.

9 June 2026

Why Postman Is Slow and Bloated in 2026 (And What to Use Instead)

Why Postman Is Slow and Bloated in 2026 (And What to Use Instead)

Postman's Electron architecture causes 6-9 second startup times and 500MB+ RAM usage. Technical breakdown of the bloat and how Apidog compares as a faster alternative.

9 June 2026

Postman Free Plan 2026: What the 1-User Limit Means for Small Teams

Postman Free Plan 2026: What the 1-User Limit Means for Small Teams

Postman cut its free tier to 1 user in 2026. Learn what changed, what it costs to upgrade, and how Apidog offers free collaborative workspaces for up to 3 users.

9 June 2026

Practice API Design-first in Apidog

Discover an easier way to build and use APIs

DeepSeek V4-Pro 75% Price Cut Is Now Permanent: What It Means for Developers (2026)