Chinese labs cut LLM API prices six times in the first half of 2026, and three of those cuts were declared permanent. DeepSeek V4-Pro now costs $0.87 per million output tokens. Xiaomi MiMo V2.5 just flattened its long-context tier to $3 output. Alibaba’s Qwen3 Max ships at $3.90. Moonshot’s Kimi K2.6 holds the cache-hit floor at $0.07. Zhipu’s GLM-5 sits at $3.20 output. Below is the full pricing breakdown for the top five frontier APIs from China in May 2026, with capability notes and a buyer’s matrix at the end so you can pick the right one for your workload.
TL;DR
- Cheapest per token (output): DeepSeek V4-Pro at $0.87/MTok. Roughly 34x below GPT-5.5.
- Cheapest at 1M context: Xiaomi MiMo V2.5 Pro at $3/MTok output, flat regardless of input length.
- Best price-quality balance for general production: Alibaba Qwen3 Max at $3.90/MTok output, 262K context.
- Lowest cache-hit floor (long system prompts): Moonshot Kimi K2.6 at $0.07/MTok cached.
- Reasoning-heavy workloads: Zhipu GLM-5 at $3.20/MTok output, 200K context, strongest at structured chain-of-thought.
- All five labs are competing on price. Three (DeepSeek, MiMo, Kimi) treat their 2026 cuts as permanent.
How the 2026 Chinese LLM price war unfolded
The pattern started in Q4 2025 and accelerated in Q2 2026. A rough timeline:
- Q4 2025: DeepSeek V3.2 launches at $0.28/MTok input, undercutting US frontier prices by an order of magnitude. Kimi K2.6 follows with tiered context-aware pricing and an industry-low $0.07/MTok cache-hit rate.
- March 2026: Xiaomi unveils MiMo V2-Pro on OpenRouter at competitive but tier-based rates.
- April 2026: DeepSeek V4 launches with a 75% promotional discount scheduled to expire May 31.
- May 22, 2026: DeepSeek announces the 75% discount is permanent. V4-Pro stays at $0.435/$0.87 indefinitely. The full breakdown is here.
- May 27, 2026: Xiaomi makes MiMo V2.5 pricing permanent at $1/$3, killing the long-context multiplier. More on the MiMo cut.
The cuts aren’t random. Each lab is targeting a specific competitive gap. DeepSeek is going after raw cost-per-token. MiMo is going after long-context workloads that other models price out. Qwen and GLM are holding mid-tier prices and competing on capability instead. Kimi is competing on agent and coding workflows via the cache-hit floor.
At-a-glance: top 5 Chinese LLM APIs in May 2026
| Model | Input ($/MTok) | Output ($/MTok) | Cache hit | Context | Best at |
|---|---|---|---|---|---|
| DeepSeek V4-Pro | $0.435 | $0.87 | $0.003625 | 128K | Cheapest per token, coding |
| Xiaomi MiMo V2.5 Pro | $1.00 | $3.00 | $0.20 | 1M | Long-document RAG, repo agents |
| Alibaba Qwen3 Max | $0.78 | $3.90 | $0.156 | 262K | Production balance |
| Moonshot Kimi K2.6 | $0.16–$2.00 (tiered) | ~$2.50 | $0.07 | 128K | Long system prompts, coding agents |
| Zhipu GLM-5 | $1.00 | $3.20 | (provider-defined) | 200K | Structured reasoning |
A few details to read into the table:
- DeepSeek and MiMo are flat-rate. Every other lab in this set still uses some form of tiered pricing or context multiplier. Flat pricing makes production capacity planning predictable. Tiered pricing can surprise you on long-context months.
- Cache-hit rates vary widely. Kimi K2.6’s $0.07 and DeepSeek V4-Pro’s $0.003625 are the two outliers. For any agent with a stable system prompt, these are the rates you should be benchmarking against, not the cache-miss list price. See our prompt caching deep dive for the mechanics.
- Context windows split sharply. MiMo V2.5 alone gives you 1M tokens at the cheap tier. The next-largest in this set is Qwen3 Max at 262K. If your workload needs >300K tokens, MiMo isn’t optional.
Below: each model gets a section with pricing, capability, and the workload it wins.
DeepSeek: the cheapest per token
Models: V4-Pro ($0.435 in / $0.87 out / $0.003625 cache hit, 128K context), V4-Flash ($0.14 / $0.28).
DeepSeek’s V4-Pro is the price floor of the Chinese frontier-tier shelf. The May 22 permanent cut put output tokens at $0.87/MTok, roughly 34x below GPT-5.5 and 17x below Claude Opus 4.7. Cache-hit at $0.003625/MTok is the lowest first-party rate from any major lab. Confirmed against DeepSeek’s official pricing page.
Where V4-Pro wins:
- Output-heavy workloads (code generation, agent chains, content tools) where you spend 70%+ of your token budget on output.
- Anything with a stable 5K to 10K-token system prompt. Cache hits drive effective input cost to near zero.
- Cost-sensitive production where you can absorb 3 to 7 points of benchmark gap vs GPT-5.5.
Where it doesn’t fit:
- Long-document workloads (>128K context). MiMo V2.5 is the cheaper choice in absolute terms even at higher per-token rates because DeepSeek can’t fit the prompt.
- Latency-critical real-time chat. V4-Pro is a thinking model with 600 to 900ms time-to-first-token.
For deeper coverage: DeepSeek V4-Pro permanent price cut, What is DeepSeek V4, How to use the DeepSeek V4 API.
Xiaomi MiMo: the cheapest 1M-context option
Models: MiMo V2.5 Pro ($1.00 in / $3.00 out / $0.20 cache, 1M context), MiMo V2 Flash (~$0.10 / ~$0.40, 256K context).
Xiaomi’s May 27 permanent cut flattened MiMo V2.5 pricing across context windows. The old long-context tiers, which charged steep multipliers above 256K input tokens, are gone. The new pricing applies the same $1/$3 rate whether you send 5K or 950K tokens. The official price-update notice labels the cut “permanent.”
Where V2.5 Pro wins:
- Long-document RAG, repo-wide code analysis, multi-document summarization, any workload that fits 300K to 1M tokens of context.
- High-volume document processing where pricing predictability matters more than absolute floor.
Where it doesn’t fit:
- Short-prompt chat. V2.5 Pro is more expensive than DeepSeek V4-Pro at any context length DeepSeek can handle.
- Latency-critical workloads. Faster Chinese models exist for sub-second response budgets.
The 1M context window plus competitive cache rate gives MiMo a structurally unique place in the market. Until DeepSeek extends context beyond 128K or Alibaba flattens Qwen’s pricing, MiMo owns the cheap-and-long quadrant.
For deeper coverage: How Much Does It Cost to Use Xiaomi MiMo V2.5 in 2026, MiMo V2-Pro & Omni pricing, Xiaomi MiMo Orbit free 100T token program.
Alibaba Qwen: the production workhorse
Models: Qwen3 Max ($0.78 in / $3.90 out / $0.156 cache, 262K context). Newer Qwen 3.7 Max at $2.50/MTok input with 1M context is in early rollout. Rates verified against pricepertoken’s Qwen3 Max sheet.
Qwen3 Max is Alibaba’s flagship and the most-deployed Chinese model in international production. It sits at a competitive but not floor-level price point: 1.8x DeepSeek V4-Pro on input, 4.5x on output. The premium pays for the broadest tooling ecosystem (Anthropic-protocol drop-in, OpenAI-compat, Alibaba Cloud enterprise hosting) and a 262K context window that handles most enterprise document workloads.
Where Qwen3 Max wins:
- Multilingual production. Qwen’s training corpus skews heavily toward Mandarin and Asian languages, making it the strongest non-English performer in this set.
- Enterprise compliance scenarios. Alibaba’s enterprise SLA and cloud-region options are the most mature of any Chinese lab.
- Workloads that need 200K to 262K context but don’t justify MiMo’s premium quality band.
Where it doesn’t fit:
- Cost-sensitive output-heavy workloads. At $3.90/MTok output, you’re paying 4.5x DeepSeek’s rate. If your workload tolerates DeepSeek’s quality, switch.
For deeper coverage: Qwen 3 vs OpenAI & DeepSeek: in-depth technical comparison for API developers.
Moonshot Kimi: the coding specialist
Models: Kimi K2.6 with context-tiered input pricing ($0.16 to $2.00/MTok across 8K, 32K, 64K, and 128K bands), $0.07/MTok cache hit floor, output rates around $2.50/MTok in the middle band.
Kimi K2.6 is the cache-hit champion. The $0.07/MTok rate on hit is the lowest first-party number from any major lab. Combined with Kimi’s strong tool-calling and long-running agent support, K2.6 is the model that wins on workflows where you reuse a fat system prompt across many turns: coding agents, customer support chatbots with stable persona prompts, retrieval pipelines with stable context blocks.
Where K2.6 wins:
- Coding agents (Claude Code-style workflows). Strong tool-call format compliance and the lowest cache-hit floor make repeat-context patterns near-free.
- Long-running chat sessions where the system prompt and few-shot examples are stable.
Where it doesn’t fit:
- Bursty, varied workloads where prefixes change every request. The tiered input price means context-length surprises can spike your bill.
- Predictable budgeting. The tier transitions at 32K, 64K, and 128K input tokens mean the same query type can cost 4x more on a long day than on a short day.
For deeper coverage: Is Kimi K2 API pricing really worth the hype for developers in 2026.
Zhipu GLM: the reasoning challenger
Models: GLM-5 ($1.00 in / $3.20 out, 200K context), GLM-5.1 ($0.98 / $3.08, 200K context). Rates verified against Z.AI’s official pricing overview.
Zhipu’s GLM-5 launched with a 30% price increase over GLM-4.7 (a contrarian move in a market racing to the bottom), then released GLM-5.1 at a marginal discount. The pricing reflects Zhipu’s positioning: not the cheapest, but strongest at structured reasoning and chain-of-thought tasks.
Where GLM-5 wins:
- Math, formal reasoning, structured chain-of-thought tasks. GLM-5 holds the leaderboard on multiple GPQA-class benchmarks among Chinese frontier models.
- Workloads where the marginal cost is small relative to the cost of wrong answers (financial analysis, legal summarization, scientific reasoning).
- Multi-step agent workflows that benefit from clean reasoning traces.
Where it doesn’t fit:
- Cost-sensitive applications. GLM-5 is the most expensive option in this set on input and output combined. If raw cost is what you optimize, look elsewhere.
- Workloads that don’t reward strong reasoning. For straight content generation or summarization, the GLM premium isn’t worth it.
For deeper coverage: GLM-5 vs DeepSeek V3 vs GPT-5: speed, cost, and practical developer comparison, GLM-5.1 vs Claude, GPT, Gemini, DeepSeek.
Cheapest per workload: a buyer’s matrix
For five common production workloads, here’s which model wins:
| Workload | Winner | Why |
|---|---|---|
| Code generation (output-heavy) | DeepSeek V4-Pro | $0.87/MTok output is unbeatable |
| Long-document RAG (>300K context) | Xiaomi MiMo V2.5 Pro | Only flat-priced 1M-context option |
| Coding agent with stable system prompt | Kimi K2.6 | $0.07/MTok cache hit floor |
| Multilingual customer support | Alibaba Qwen3 Max | Strongest non-English performance |
| Math, formal reasoning, structured analysis | Zhipu GLM-5 | Best chain-of-thought quality |
Three combined patterns worth flagging:
- Two-model routing. Many production teams route 70 to 85% of traffic to DeepSeek V4-Pro and keep their secondary model on the hard tail. The savings are large and the quality hit is small for most workloads.
- Long-context segmentation. If your workload splits between short and long contexts, route short to DeepSeek and long to MiMo. The unified-billing pain is real but the cost arbitrage is too large to ignore.
- Cache prefix consolidation. Whatever model you pick, audit your system prompts. Cache hits are the cheap win that survives any model swap.
Quality and benchmark notes
A note on quality, since pricing means nothing if the model can’t do the job.
Per Artificial Analysis, the five models in this comparison cluster within 5 to 10 percentage points of each other on most public benchmarks. The interesting tail differences:
- DeepSeek V4-Pro: Strong on coding (SWE-bench Pro around 55%) and reasoning (GPQA around 90%). Slight gap to GPT-5.5 on long-horizon agent tasks.
- MiMo V2.5 Pro: Strong on long-context retrieval (>95% needle accuracy at 800K), middle-of-pack on coding.
- Qwen3 Max: Best non-English performance, strong general production quality.
- Kimi K2.6: Strongest tool-call format compliance, particularly for parallel tool calls.
- GLM-5: Best chain-of-thought reasoning quality in the set.
Run your own 100-sample eval before committing. Public benchmarks are useful directionally but the gap that matters is the one on your traffic.
Testing all five with Apidog
A multi-model production deploy needs a multi-model test harness. Apidog handles all five Chinese APIs out of one workspace because all five accept OpenAI Chat Completions request bodies, with minor compatibility quirks. The workflow:

- Create one environment per provider in Apidog:
api.deepseek.com,platform.xiaomimimo.com, Alibaba Cloud Model Studio, Moonshot’sapi.moonshot.cn, and Zhipu’sopen.bigmodel.cn. - Import the OpenAI Chat Completion schema once. Switch the base URL per environment.
- Run the same test scenario across all five with one click. Diff the responses, scores, and latencies.
- Wire JSON Schema validation against
tool_callsshapes to catch the streaming-format quirks unique to each provider.
Download Apidog, import your test cases, and you have a working five-way comparison in under fifteen minutes. Same workflow we recommend in the per-model deep-dives: DeepSeek V4-Pro permanent cut, MiMo V2.5 cost, Kimi K2 pricing.
Where the price war goes next
The pricing floor moved twice in May. Two more moves are likely before Q3 closes.
- Qwen response. Alibaba has rarely been first to cut, but consistently follows within weeks. Expect a Qwen3 Max revision or Qwen 3.8 announcement by July.
- GLM response. Zhipu’s 30% increase on GLM-5 looks increasingly contrarian. A GLM-5.2 with a structural cut is plausible.
- Kimi structural simplification. Tiered context pricing is going out of fashion. Moonshot may flatten K2.6 to match MiMo’s structure.
Build accordingly. Three next steps:
- Audit your top three workloads against the buyer’s matrix above. Pick one for a migration test this week.
- Lock in your cache prefixes. That’s the win regardless of which model you settle on.
- Wire an Apidog regression suite that points at all five providers so the next round of cuts takes hours to evaluate instead of weeks.
The price floor isn’t done falling. Position your stack for what’s next.



