For most of the last two years, the question “what’s the best coding model?” had a Western answer. You picked GPT, Claude, or Gemini, paid the per-token rate, and accepted that the weights stayed locked in someone else’s data center. That’s no longer the only path. A run of Chinese labs now ships models that match the frontier on coding while either publishing the weights or pricing the API so low it changes the math on every agent you run.
MiniMax M3 landed on June 1, 2026, and it’s the clearest signal yet. It’s open-weight, built for coding and agentic work, carries a 1,000,000-token context window, and adds native multimodality on top. It’s the third serious open-weight contender to arrive in weeks, alongside DeepSeek’s V4 family and Alibaba’s Qwen 3.7. If you want open weights, low cost, and no vendor lock-in, you now have a real shortlist instead of a single option.
The three contenders
MiniMax M3 is the new arrival. MiniMax positions it as a frontier coding model with a 1M-token context window and native multimodality, meaning it handles image and video input and can drive computer-use tasks, not text alone. It runs on a new MSA architecture. MiniMax says open weights and a technical report will follow within roughly ten days of launch, and it has not disclosed parameter counts. The full has breakdown is in what is MiniMax M3.
DeepSeek V4-Pro is the reasoning-and-coding workhorse. It’s a thinking model: it returns a reasoning_content chain of thought before its final answer, which catches multi-file dependencies that flat-completion models miss. DeepSeek has a long, documented history of publishing open weights across its R1 and V3 lines, and it pairs V4-Pro with a cheaper non-thinking V4-Flash variant. The standout is price, which we’ll get to. DeepSeek runs its official site and API at deepseek.com.
Qwen 3.7 is Alibaba’s flagship, led by Qwen3.7-Max-Preview. It’s a reasoning model with a 1M-token context window, pitched hard at long-horizon agent work. One honest caveat sits at the center of this comparison: as of its mid-May 2026 launch, the Qwen3.7-Max flagship is proprietary and closed-weight. Alibaba has a strong track record of open-sourcing the tier below its flagship, so open 3.7 weights are plausible later, but none had shipped. Full details are in what is Qwen 3.7. Alibaba’s open-source repos live at github.com/QwenLM.
Spec table
| Spec | MiniMax M3 | DeepSeek V4-Pro | Qwen3.7-Max-Preview |
|---|---|---|---|
| Vendor | MiniMax | DeepSeek | Alibaba (Qwen) |
| Released | June 1, 2026 | 2026 | May 2026 (preview) |
| Open weights | Yes (weights within ~10 days) | Yes (DeepSeek’s track record across R1/V3) | Not yet (flagship is closed-weight) |
| Context window | 1,000,000 tokens | Not stated here | 1,000,000 tokens |
| Multimodal | Yes (image + video, computer use) | No (text + reasoning) | Text-focused reasoning |
| Reasoning / thinking mode | Yes | Yes (reasoning_content) |
Yes (extended thinking) |
| Parameter count | Not disclosed | Not disclosed here | Not disclosed here |
| Architecture | MSA | Not stated here | Not stated here |
A note on that “open weights” row, because it’s the spine of this comparison. M3 commits to publishing weights and a tech report within about ten days of launch. DeepSeek has shipped open weights repeatedly. Qwen 3.7’s flagship is closed today. If open weights are a hard requirement right now, that narrows your field before you read a single benchmark.
Coding and agentic strength
Here’s where the data gets uneven, so we’ll lead with what’s verified and stay qualitative where it isn’t.
MiniMax M3 launched with a full slate of vendor-reported coding and agentic benchmarks. These are MiniMax’s own numbers, so treat them as launch-day vendor claims until third parties reproduce them:
| Benchmark (vendor-reported, MiniMax) | MiniMax M3 |
|---|---|
| SWE-Bench Pro | 59.0% |
| Terminal-Bench 2.1 | 66.0% |
| SWE-fficiency | 34.8% |
| KernelBench Hard | 28.8% |
| MCP Atlas | 74.2% |
| PostTrainBench | 0.37 |
| SVG-Bench | Reported above Opus 4.7 |
| OmniDocBench | Reported above Gemini 3.1 Pro |
| Claw-Eval | Reported highest in its set |
SWE-Bench Pro and Terminal-Bench measure real software-engineering tasks: resolving GitHub issues, working in a terminal. MCP Atlas measures tool-use and agent orchestration. Together they describe a model built to do agentic coding work, not just autocomplete. You can sanity-check the SWE-Bench field on the SWE-Bench leaderboard.
For DeepSeek V4-Pro and Qwen 3.7, the comparable agentic-coding numbers aren’t published in the same format, so a direct cell-by-cell match would be invented, and we won’t do that. What’s documented:
- DeepSeek V4-Pro lands its coding capability within a few benchmark points of GPT-5.5 according to third-party comparisons, while costing a fraction of the price. Its reasoning chain is the practical edge: on complex multi-file refactors, renames, and signature changes, the thinking pass catches dependencies in one shot that flat models need three rounds to handle. The setup details and cost math are in how to use DeepSeek V4-Pro with Cursor.
- Qwen 3.7 scored 57 on the Artificial Analysis Intelligence Index, a composite that blends reasoning, knowledge, math, and coding, reported as the #1 result on that leaderboard at launch, plus roughly 1,475 Elo on LM Arena with a top-ten placement in the coding category. Alibaba’s pitch is long-horizon agent work: sustained autonomous runs and heavy tool use across many steps.
The honest read: M3 ships with the most transparent agentic-coding evidence today because it published task-level numbers. DeepSeek’s strength is reasoning-driven code quality at a low price. Qwen’s strength is composite intelligence and endurance on long agent chains. Until DeepSeek and Qwen report on the same SWE-Bench Pro and Terminal-Bench tasks, run your own workload through all three, which we cover at the end. A wider frontier matchup for Qwen sits in Qwen 3.7 vs GPT-5.5 vs Opus 4.7.
Context window and long-context cost
Two of the three advertise a 1,000,000-token context window: MiniMax M3 and Qwen3.7-Max. DeepSeek’s V4-Pro context isn’t reproduced here, so we won’t state a number for it.
A million tokens is roughly 700,000 to 750,000 words. That’s enough to hold a mid-sized repository, a stack of long PDFs, or months of conversation in one request, with no manual chunking and no retrieval layer to maintain. For whole-repo reasoning, it removes a lot of plumbing.
Two caveats keep this honest. First, a big window is a ceiling, not a guarantee. Models often retrieve and reason less reliably as the window fills, and independent long-context testing for these brand-new releases is still thin. Second, big contexts cost money. Every token you send is billed, so a million-token prompt is an expensive prompt.
This is where M3’s MSA architecture is supposed to matter. MiniMax pitches it as built for long-context efficiency, with a standard API rate up to 512K input tokens and a separate long-context rate above that threshold. The split tells you the economic reality plainly: long context is a premium tier, on every model that has it. The practical defense is the same regardless of which model you pick. Use the full window only when the task needs it, and trim aggressively when it doesn’t. Concrete tactics for keeping agent context lean are in how to reduce agent token costs.
Price and access
Price is the reason this comparison exists. The same workload that costs real money on a Western flagship runs at a fraction here, and that gap is the engine behind the Chinese LLM price war 2026.
DeepSeek V4-Pro publishes the clearest per-token numbers of the three. Standard rates, permanent as of May 2026:
| Token type | DeepSeek V4-Pro rate per 1M tokens |
|---|---|
| Input (cache miss) | $0.435 |
| Input (cache hit) | $0.003625 |
| Output | $0.87 |
That output rate is roughly 1/34 the cost of GPT-5.5 output. The non-thinking V4-Flash variant is cheaper still at $0.14 / $0.28 per million input/output. A heavy day of coding-assistant use lands around $1. That’s the number that makes DeepSeek hard to ignore for high-volume agent traffic.
MiniMax M3 sells token plans rather than a single published per-token price: Plus at $20, Max at $50, and Ultra at $120. Its API uses a standard rate for inputs up to 512K tokens and a long-context rate above that. MiniMax has not published an exact per-token figure, so we won’t quote one. The plan structure suits teams that want predictable monthly spend over metered billing. Wiring details are in how to use the MiniMax M3 API.
Qwen 3.7 is billed per token through Alibaba Cloud, where the Max preview went live in May 2026. Alibaba has priced recent Qwen releases aggressively as part of the same price war, but a preview model’s exact rates can shift, so check Alibaba Cloud’s current model docs for the live number.
On access, the open-weight angle changes the cost ceiling entirely. M3’s published weights and DeepSeek’s open releases mean you can self-host and pay only for hardware, with no per-token meter at all. Qwen3.7-Max can’t be self-hosted today because its flagship weights aren’t published, so every route to it runs through Alibaba’s API. If avoiding vendor lock-in is the goal, that’s a real differentiator.
Which one to pick
The right model depends on what you’re optimizing for. Match your priority to the column.
| Your priority | Best fit | Why |
|---|---|---|
| Agentic coding with published benchmarks | MiniMax M3 | Transparent SWE-Bench Pro / Terminal-Bench / MCP Atlas numbers at launch (vendor-reported) |
| Multimodal input (image, video, computer use) | MiniMax M3 | Only one of the three with native multimodality |
| Lowest cost on high-volume API traffic | DeepSeek V4-Pro | ~$0.87/1M output, with a cheaper Flash variant and cache-hit pricing |
| Reasoning-driven code quality on hard refactors | DeepSeek V4-Pro | Thinking chain catches multi-file dependencies in one pass |
| Top composite-intelligence score on a public board | Qwen3.7-Max | AA Intelligence Index 57, reported #1 at launch |
| Long-horizon autonomous agent runs | Qwen3.7-Max or MiniMax M3 | Both pitch endurance and heavy tool use; M3 also publishes MCP Atlas |
| Self-hosting / no vendor lock-in today | MiniMax M3 or DeepSeek V4-Pro | Both publish open weights; Qwen’s flagship is closed |
A few plain reads. If open weights and agentic coding evidence are your top two boxes, M3 is the cleanest pick right now, with the caveat that its weights and tech report were still days out at launch and its benchmarks are vendor-reported. If you’re running heavy API volume and want the lowest bill, DeepSeek V4-Pro’s price is the headline. If you want the top public composite score and you’re fine staying on a hosted API, Qwen3.7-Max fits, as long as you don’t need self-hosting.
Test them yourself
A leaderboard tells you how a model does on someone else’s tasks. It doesn’t tell you how it does on yours. All three of these models expose an API, and the fastest way to settle the choice is to run identical prompts against each one and compare the responses side by side.
That’s a job for Apidog. Set up one Apidog project with three environments, one per model API, and import the OpenAI-compatible Chat Completion schema each of them uses. Then you can:
- Send the same prompt batch to M3, V4-Pro, and Qwen3.7-Max and diff the outputs in one place.
- Record golden responses and replay them on every prompt change to catch drift.
- Validate
tool_callsandreasoning_contentshapes with JSON Schema assertions, so a bad system-prompt edit doesn’t break your agent silently.
Download Apidog, point three environments at the three model endpoints, and you have a working comparison bench in a few minutes. The API setup specifics for the newest model are in how to use the MiniMax M3 API.
Frequently asked questions
Which is the best open-weight coding model in 2026 right now?
For verifiable agentic-coding evidence at launch, MiniMax M3 leads, because it published task-level benchmarks like SWE-Bench Pro 59.0% and Terminal-Bench 2.1 66.0% (vendor-reported). DeepSeek V4-Pro is the value pick: coding within a few points of GPT-5.5 at roughly 1/34 the output price. Qwen3.7-Max tops a composite leaderboard but isn’t open-weight yet. The honest answer is that the head-to-head coding numbers aren’t directly comparable across all three, so run your own workload before committing.
Are all three truly open-weight?
Not yet. MiniMax M3 is open-weight, with weights and a technical report due within about ten days of its June 1, 2026 launch. DeepSeek has a long record of publishing open weights across its R1 and V3 families. Qwen3.7-Max-Preview, the flagship most people mean by “Qwen 3.7,” is proprietary and closed-weight as of mid-May 2026. Alibaba may open-source a tier below it later, but treat that as plausible, not confirmed. Details are in what is Qwen 3.7.
Which has the biggest context window?
MiniMax M3 and Qwen3.7-Max both advertise a 1,000,000-token window, roughly 700,000 to 750,000 words. DeepSeek V4-Pro’s context isn’t stated here. Remember that a large window is a ceiling, not a promise of perfect recall, and every token in it is billed.
Which is the cheapest to run?
On published per-token rates, DeepSeek V4-Pro is the clear leader: about $0.87 per million output tokens, with a cheaper non-thinking V4-Flash variant at $0.14 / $0.28. MiniMax M3 sells monthly token plans ($20 / $50 / $120) rather than a published per-token price. Qwen3.7-Max bills per token on Alibaba Cloud. If you can self-host, the open-weight models drop your marginal cost to hardware alone. The broader pricing picture is in the Chinese LLM price war 2026.
Is MiniMax M3 actually better than DeepSeek V4-Pro at coding?
The benchmark numbers aren’t directly comparable yet. M3 published SWE-Bench Pro and Terminal-Bench results at launch; DeepSeek hasn’t reported on those same tasks in the same format. M3’s edge today is published evidence plus multimodality. DeepSeek’s edge is price and a reasoning chain that’s strong on multi-file refactors. All three speak an OpenAI-compatible API, so the fair test is to run identical prompts against each on your own repo before deciding.
The short version
Three open-weight contenders now reach the frontier on coding, and the choice comes down to what you’re optimizing for. Pick MiniMax M3 if you want published agentic-coding benchmarks, a 1M context, and multimodality, and you can wait the few days for its weights to land. Pick DeepSeek V4-Pro if low cost and reasoning-driven code quality matter most, since its per-token price is the lowest of the three and its weights are available. Consider Qwen3.7-Max if you want the top public composite score and you’re comfortable on a hosted API, knowing its flagship isn’t open-weight today.
The benchmark numbers will keep moving, and several of M3’s are still vendor-reported. The durable advice doesn’t change: run the same prompts against all three APIs in one Apidog project, watch the outputs and the bills, and let your own workload pick the winner.
