DeepSeek dropped V4 on April 23, 2026, and this one is not a minor point release. The Hangzhou lab released four checkpoints at once, topped by DeepSeek-V4-Pro at 1.6 trillion total parameters, an MIT license, and a 1-million-token context window. The smaller sibling, DeepSeek-V4-Flash, lands at 284 billion parameters with the same context and the same open weights. Benchmarks put the Pro variant ahead of Claude Opus 4.6 on LiveCodeBench and Codeforces, and inside arm’s reach of GPT-5.4 xHigh on MMLU-Pro.
If you are deciding whether to swap Claude, GPT-5.5, or Qwen for DeepSeek V4, this guide covers what the model is, what changed from V3.2, the architecture choices that drive the benchmark story, and where to run it today.
For the matching developer walkthroughs, we have a DeepSeek V4 API guide, a free-access guide, and a full DeepSeek V4 usage walkthrough. The request shape maps cleanly onto OpenAI’s format, so you can pre-build the collection in Apidog before a key lands in your inbox.
TL;DR
- DeepSeek V4 is a Mixture-of-Experts family released April 23, 2026 under the MIT license.
- Four checkpoints ship at launch: V4-Pro, V4-Pro-Base, V4-Flash, and V4-Flash-Base.
- V4-Pro is 1.6T total parameters with 49B active; V4-Flash is 284B total with 13B active.
- Both variants carry a 1M-token context window and three reasoning modes: Non-Think, Think High, and Think Max.
- Headline scores: LiveCodeBench 93.5, Codeforces 3206, MMLU-Pro 87.5 (Pro variant).
- The API is live at
api.deepseek.comwithdeepseek-v4-proanddeepseek-v4-flashas model IDs; weights are on Hugging Face and ModelScope.
What DeepSeek V4 actually is
DeepSeek V4 is the successor to the V3 and V3.2 lines that turned the lab into a household name last year. The architecture is still Mixture-of-Experts, but the shape of the model has changed. V4-Pro activates only 49 billion of its 1.6 trillion parameters per token, so the per-token compute bill looks closer to a 50B dense model than a trillion-parameter frontier system. Read the full technical report on the DeepSeek V4 model card.

Four checkpoints ship at launch:
- DeepSeek-V4-Pro — the flagship. 1.6T total, 49B active, 1M context. This is the one most teams will call through the API.
- DeepSeek-V4-Pro-Base — the pre-trained base without post-training. Aimed at researchers and teams building custom fine-tunes.
- DeepSeek-V4-Flash — the efficiency variant. 284B total, 13B active, same 1M context. Targets latency-sensitive workloads and local deployment on two or three H100s.
- DeepSeek-V4-Flash-Base — the matching base checkpoint for Flash.
All four drop under MIT, which is the quiet story. GPT-5.5 is closed and costs $5 per million input tokens; Claude Opus 4.6 is closed and prices closer to $15. DeepSeek V4-Pro has open weights you can download, mirror, fine-tune, and deploy on your own hardware with no license fee.
What changed from V3.2
V3 was already competitive on reasoning and code. V4 rewrites the attention stack and the training pipeline to push long context and efficiency at the same time.
| Capability | V3.2 | V4-Pro |
|---|---|---|
| Total parameters | 685B | 1.6T |
| Active parameters | 37B | 49B |
| Context window | 128K | 1M |
| Inference FLOPs (1M context) | baseline | 27% of V3.2 |
| KV cache (1M context) | baseline | 10% of V3.2 |
| Precision | FP8 | FP4 + FP8 mixed |
| License | DeepSeek License | MIT |
| Reasoning modes | single | three |
Three things drive the jump. First, a new hybrid attention stack that pairs Compressed Sparse Attention with Heavily Compressed Attention; that is where the 10% KV-cache number comes from. Second, Manifold-Constrained Hyper-Connections that stabilize gradients at the depth V4 needs. Third, a switch to the Muon optimizer for faster convergence. The training corpus also grew past 32 trillion tokens, and post-training uses a two-stage pipeline that cultivates domain-specific experts first, then consolidates them with on-policy distillation.

Benchmarks that matter
DeepSeek’s reported numbers put V4-Pro on the frontier board for coding and knowledge, with gaps on long-context retrieval.

For V4-Flash, the smaller variant, DeepSeek reports MMLU-Pro 86.2, GPQA Diamond 88.1, LiveCodeBench 91.6, Codeforces 3052, and SWE Verified 79.0. That is frontier territory for a 13B-active model, and it is the reason Flash is the interesting checkpoint for anyone deploying on their own hardware. See the DeepSeek V4-Flash card for the full table.
The honest read: V4-Pro wins on code, wins on open-ended factual recall, trails Gemini 3.1 Pro on general knowledge, and trails Claude Opus on the 1M-token retrieval benchmarks. If your workload is agentic coding or reasoning-heavy analysis, V4-Pro is in the conversation. If it is needle-in-a-haystack retrieval across a million tokens, Claude still has the edge.
Three reasoning modes
Every V4 checkpoint exposes three reasoning efforts, and picking the right one is the biggest cost lever.
- Non-Think — fast path. Single-pass generation, no chain-of-thought, no extra reasoning tokens. Use for classification, routing, short summaries, and anything where latency matters more than accuracy.
- Think High — the default for hard work. The model writes reasoning tokens before the answer, plans tool calls, and checks its output. Matches what GPT-5.5 calls “thinking mode” and what Claude calls “extended thinking.”
- Think Max — the ceiling. Longer reasoning traces, more aggressive self-critique, and a minimum 384K-token context window recommendation. This is what produces the 93.5 LiveCodeBench number; expect a matching jump in token cost.
Switch between them with a single thinking_mode parameter in the API or a flag in the local inference script. DeepSeek’s sampling recommendation is temperature=1.0, top_p=1.0 across all three.
Architecture in plain English
The V4 architecture paper is dense, but three choices explain the efficiency story.
- Hybrid attention. Most transformer layers use Compressed Sparse Attention, which keeps a small pool of high-value tokens fully attended and compresses the rest. A handful of layers use Heavily Compressed Attention, which lives closer to linear cost in sequence length. The mix is what delivers the 27% FLOPs and 10% KV-cache numbers at 1M tokens.
- Manifold-Constrained Hyper-Connections. Instead of plain residual connections, V4 wraps each layer’s residuals in a constraint that keeps activations on a stable manifold. The practical effect is that you can stack more layers without gradient chaos.
- Muon optimizer. Replaces AdamW for most of training. Muon converges faster and handles the huge gradient norms MoE models produce better than AdamW does.
None of these ideas are brand-new on their own. The V4 contribution is getting all three working together at trillion-parameter scale without blowing up training.
Availability today
DeepSeek launched all four checkpoints and the API on the same day. Here is the snapshot as of April 24, 2026.
| Surface | Access |
|---|---|
| chat.deepseek.com | Free web chat, V4-Pro default, login required |
| DeepSeek API | Live at api.deepseek.com; model IDs deepseek-v4-pro, deepseek-v4-flash |
| Hugging Face weights | V4-Pro, V4-Flash, both MIT |
| ModelScope | Mirrored weights for users in China |
| OpenRouter and aggregators | Expected within days; typical DeepSeek launch pattern |
deepseek-chat / deepseek-reasoner |
Deprecated July 24, 2026 |
The deprecation notice is worth circling. If you are still calling deepseek-chat in production, you have three months to migrate to deepseek-v4-pro or deepseek-v4-flash.
How it compares to GPT-5.5 and Claude
The three-way comparison most teams actually care about:
- Cost. V4-Pro and V4-Flash have open weights. GPT-5.5 and Claude Opus 4.6 do not. If you can self-host, V4 wins on unit economics at any serious scale.
- Coding. V4-Pro’s 93.5 on LiveCodeBench and 3206 on Codeforces beat both the GPT-5.5 benchmark line and Claude Opus on the same suites.
- Knowledge breadth. Gemini 3.1 Pro still leads MMLU-Pro at 91.0. GPT-5.5 and V4-Pro tie at 87.5. On SimpleQA-Verified, V4 beats GPT-5.5 and Claude by double digits.
- Long context retrieval. Claude Opus wins MRCR 1M by roughly 9 points. If your workload is “find the one sentence in a million tokens,” Claude is still the safer call.
- License. MIT means you can ship V4-Pro inside a product without a usage agreement. Nothing OpenAI or Anthropic offers matches that.
What to build with it
Four workloads line up cleanly with V4’s strengths:
- Agentic coding loops. The SWE Verified 79.0 and Codeforces 3206 numbers point directly at multi-file debugging, repo-aware refactors, and autonomous test fixes. Pair it with a good API client like Apidog to inspect every request and response while you tune prompts.
- Reasoning over long documents. 1M tokens is enough for most monorepos, most contracts, and most research corpora. Think High is the right mode for this.
- Self-hosted AI products. If your compliance story needs on-prem inference, V4-Flash is the first open-weights model that competes with closed frontier APIs on quality.
- Research and fine-tuning. The Base checkpoints are there specifically for custom training. Pair them with a domain dataset and you land on production-grade specialist models.
Where it is not a fit: high-volume classification, embedding retrieval, or short-prompt chat. V4-Flash is still overkill for those, and older DeepSeek checkpoints cost less.
Pricing in one line
DeepSeek had not published the final API rate card at the time of writing. V3.2 ran at roughly $0.28 per million input tokens and $0.42 per million output tokens, and the lab has a track record of keeping V-series pricing close to that floor. Expect V4-Flash in the same range and V4-Pro at a modest premium. Closed competitors price at $5 to $15 per million input tokens, so even a 3x jump from V3.2 leaves DeepSeek well below the frontier-API median. Track the live numbers on the DeepSeek pricing page.
How to test V4 today
Three paths, ranked by time-to-first-token.
- Web chat. Open chat.deepseek.com and sign in. V4-Pro is the default; toggle to Think High in the UI. Free, no card, works now.
- API. Grab a key, point your client at
https://api.deepseek.com, set"model": "deepseek-v4-pro", and go. The request shape is OpenAI-compatible, so any existing OpenAI client works with a base-URL swap. Full walkthrough in the DeepSeek V4 API guide. - Local weights. Pull from Hugging Face or ModelScope. V4-Flash runs on 2 to 4 H100s; V4-Pro needs a serious cluster. The inference code lives in the
/inferencefolder of the model repo.
For the full walkthrough including Apidog-based prompt iteration, see how to use DeepSeek V4. To keep spend at zero, see how to use DeepSeek V4 for free. Download Apidog and pre-build your collection; the OpenAI-compatible format means one request works across DeepSeek, OpenAI, and every other frontier API.
FAQ
Is DeepSeek V4 really open source?Yes. All four checkpoints carry an MIT license, which permits commercial use, modification, and redistribution without a separate usage agreement.
Do I need a GPU cluster to run V4-Flash?You need two to four H100s or H200s for V4-Flash at full precision, less if you quantize. V4-Pro needs a genuine cluster. If you want to try V4 without hardware, use the API or chat.deepseek.com.
When does V4 hit the DeepSeek API?It is already live as of April 23, 2026. The model IDs are deepseek-v4-pro and deepseek-v4-flash. The older deepseek-chat and deepseek-reasoner IDs are deprecated July 24, 2026.
How does V4 compare to Kimi and Qwen?V4-Pro posts higher LiveCodeBench and Codeforces numbers than Kimi K2 and Qwen 3 Max on DeepSeek’s reported tables. All three are open-weights MoE systems with similar deployment profiles. Pick based on the benchmark closest to your workload.
Can I fine-tune V4 on my own data?Yes. The Base checkpoints exist for that; pair them with your domain data and a standard SFT pipeline. The MIT license covers commercial redistribution of the resulting model.
Will V4 work with my existing OpenAI-compatible tooling?Yes. The API accepts both OpenAI and Anthropic message formats at https://api.deepseek.com and https://api.deepseek.com/anthropic respectively. Most existing OpenAI clients work with a single base-URL change. See the matching GPT-5.5 API walkthrough for the parallel pattern.
