Three labs shipped flagship models within five weeks of each other, and the leaderboards have not stopped moving since. Alibaba’s Qwen3.7-Max-Preview, OpenAI’s GPT-5.5, and Anthropic’s Claude Opus 4.7 now trade the lead across the major benchmarks, and picking between them is harder than it looks. One headline keeps circulating: Qwen3.7-Max ranked #1 on the Artificial Analysis Intelligence Index. That claim is real, but it needs context, and it does not settle the question of which model you should actually build on.
This comparison puts all three side by side across reasoning, coding, context window, pricing, availability, and latency. Every number here is attributed to a named source, because vendor marketing and independent benchmarks tell different stories. If you want to test the differences yourself, you can run all three model APIs side by side in Apidog, comparing responses, token usage, and latency in one workspace before you commit.
TL;DR
For raw benchmark intelligence, GPT-5.5 leads with a 60 on the Artificial Analysis Intelligence Index, while Qwen3.7-Max-Preview holds the overall #1 leaderboard slot at 57 and Claude Opus 4.7 also scores 57. For human-preference quality on LM Arena, Claude Opus 4.7 leads the three. For real-world coding, the split is close: GPT-5.5 tops SWE-bench Verified, Opus 4.7 leads on the harder SWE-bench Pro. For budget, Qwen is the likely winner on price, with caveats: its pricing is not yet announced, it is preview-only, and the Max tier is proprietary rather than open. Pick GPT-5.5 for token-efficient agentic work, Opus 4.7 for large-codebase engineering and conversational quality, and Qwen3.7-Max if cost and a 1M-token window matter most.
The three models at a glance
Before the benchmarks, here’s what each model actually is. The differences in release status alone change how you should read every score.
Qwen3.7-Max-Preview
Qwen3.7-Max is Alibaba’s flagship reasoning model, previewed in mid-May 2026 and announced around the Alibaba Cloud Summit. It uses extended thinking, carries a 1.0M-token context window, and is built with agentic coding, tool use, and long-context reasoning as priorities. The important word is preview. As of late May 2026 it has no public API endpoint and no open weights; access runs through Alibaba Cloud Model Studio and Qwen Studio.

One nuance worth flagging: Alibaba has said Qwen3.7-Plus will ship as open source while Qwen3.7-Max stays proprietary. That is a shift from Qwen’s earlier all-open approach, and it matters if openness is part of your decision.
GPT-5.5
GPT-5.5 is OpenAI’s agentic-focused reasoning model, released April 23, 2026. It is a direct response to Claude Opus 4.7 and leans hard into autonomous workflows: terminal use, browser tasks, and tool calling. OpenAI ships it in several effort tiers (the public Artificial Analysis figures use the xhigh variant), with a 1M-token context window in the API and a smaller 400K window inside Codex. It is generally available through the OpenAI API today.

Claude Opus 4.7
Claude Opus 4.7 is Anthropic’s current flagship, released April 16, 2026 as a direct upgrade to Opus 4.6. Anthropic positioned it around advanced software engineering, especially the hardest tasks across large codebases. It runs adaptive reasoning, carries a 1.0M-token context window, and is generally available through the Anthropic API, Amazon Bedrock, and Google Vertex AI. Of the three, it has the longest track record in production and the most independent voting data behind its scores.

Reasoning and intelligence benchmarks
This is where the “Qwen #1” hook comes from, so it deserves a careful read.
The Artificial Analysis Intelligence Index
The Artificial Analysis Intelligence Index is a composite score built from a weighted average of ten evaluations covering reasoning, knowledge, math, and coding. Here is where the three models land, per Artificial Analysis as of late May 2026:
- Qwen3.7-Max scores 57, listed at #1 of 218 models on the overall leaderboard.
- GPT-5.5 (xhigh) scores 60, the highest of the three.
- Claude Opus 4.7 (max) scores 57, listed at #3 in its tracked class.
So both halves of the popular claim are technically true and slightly in tension. Qwen3.7-Max does hold the overall #1 leaderboard position on Artificial Analysis. But GPT-5.5 posts the higher index score at 60. The gap comes down to how the leaderboard ranks models that share a tier and how Artificial Analysis groups reasoning variants; a model can top the overall list while another posts a higher raw number in a different tracked group. The honest summary: GPT-5.5 has the highest measured intelligence score, and Qwen3.7-Max sits at the very top of the public leaderboard. Treat them as roughly co-leaders, with Opus 4.7 level with Qwen on score at 57 but placed lower on the leaderboard.
One more caveat for Qwen. Artificial Analysis notes that Qwen3.7-Max generated 97M output tokens during the evaluation, far above the roughly 26M average. It is a verbose reasoner. That verbosity inflates token costs and latency, and it is a real factor once you move from benchmarks to production.
LM Arena human-preference Elo
Benchmarks measure correctness on fixed tasks. LM Arena measures something different: which response a human prefers in a blind side-by-side. The current LM Arena text leaderboard tells a different story from the Intelligence Index:
- Claude Opus 4.7 sits around 1,492 Elo, ranked #4 overall, with 13,000-plus votes behind it.
- GPT-5.5 sits around 1,478 Elo, ranked #11.
- Qwen3.7-Max-Preview sits around 1,475 Elo, ranked #14, still marked preliminary with under 4,000 votes.
The flip is striking. The model with the highest benchmark score (GPT-5.5) does not lead on human preference, and the preview model (Qwen) has too few votes for a stable reading. Opus 4.7 leads the three here, which matches the broader pattern that Anthropic’s Opus models tend to place well on LM Arena’s text, vision, and document rankings even when they trail on academic benchmarks. If your product is conversational and quality is judged by users rather than test suites, that gap is worth weighing, even though at 14 to 17 Elo points it is small. Elo scores shift as votes accumulate, so check the live board before quoting any single number.
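To put the Elo gaps above in perspective, here is a minimal sketch that converts a rating difference into an expected head-to-head preference rate. It assumes the LM Arena scores behave like standard Elo ratings on a 400-point logistic scale, which is an assumption about the leaderboard’s scale rather than something the source states. The function name `expected_win_rate` is illustrative.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Chance that A is preferred over B under the standard Elo logistic curve.

    Assumes a 400-point scale; LM Arena's exact fitting method may differ.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Approximate ratings quoted above; live values drift as votes accumulate.
ratings = {
    "Claude Opus 4.7": 1492,
    "GPT-5.5": 1478,
    "Qwen3.7-Max-Preview": 1475,
}

opus = ratings["Claude Opus 4.7"]
for name, rating in ratings.items():
    if name != "Claude Opus 4.7":
        rate = expected_win_rate(opus, rating)
        print(f"Opus 4.7 preferred over {name}: {rate:.1%}")
```

Under that assumption, a 14-point lead works out to Opus 4.7 being preferred in roughly 52% of blind matchups against GPT-5.5. That is a real edge but a narrow one, and the preliminary vote count behind the Qwen rating makes its figure less certain still.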
Coding ability
All three labs market these models as coding tools, so the coding benchmarks carry weight.
On SWE-bench Verified, the standard test of resolving real GitHub issues, GPT-5.5 took the top spot at 88.7%, with Claude Opus 4.7 close behind at 87.6%, per SWE-bench leaderboard tracking from May 2026. That is a narrow margin and both numbers are excellent.
The picture changes on harder tests. On SWE-bench Pro, which uses tougher real-repository pull-request tasks, Claude Opus 4.7 leads at roughly 64% against GPT-5.5’s 59%. Opus 4.7 also tends to do better on tasks that need broad architectural reasoning across a large codebase. GPT-5.5, in turn, dominates unattended terminal and shell workflows, leading Terminal-Bench 2.0 by a wide margin, and it is far more token-efficient (reported around 72% fewer output tokens on equivalent tasks). Across the ten benchmarks both vendors report, independent coverage put Opus 4.7 ahead on six and GPT-5.5 ahead on four.
Qwen3.7-Max-Preview is the harder one to pin down. As of late May 2026 it has Arena Elo data but no published standardized coding benchmarks like SWE-bench. It ranks #9 in Software & IT and #10 in Coding on LM Arena’s category boards, which is strong but not a substitute for a controlled SWE-bench run. Qwen’s coder-tier models have posted SWE-bench Verified scores above 70% in the same family, so the capability is plausible; the Max-Preview number simply is not public yet. Stating a Qwen3.7-Max SWE-bench figure today would be a guess, so we are leaving it out.
Practical read for coding: GPT-5.5 for terminal-driven and cost-sensitive automation, Opus 4.7 for large-codebase engineering and the gnarliest pull requests. If you are comparing IDE-integrated coding agents specifically, our breakdown of Cursor Composer 2.5 against Opus 4.7 and GPT-5.5 goes deeper on that workflow.
Context window
Long context decides whether you can drop an entire repository, a long document set, or a multi-hour agent trace into a single call.
- Qwen3.7-Max: 1.0M tokens, per Artificial Analysis.
- Claude Opus 4.7: 1.0M tokens, per Artificial Analysis.
- GPT-5.5: 1M tokens in the API, though Artificial Analysis measured an effective window around 922K; the Codex integration caps at 400K.
This is close to a three-way tie at the headline level. All three give you roughly a million tokens, enough for about 1,500 pages of text. The practical differences are at the edges. GPT-5.5’s API window matches the others, but if you work inside Codex you get less than half of it, so check which surface you are actually calling. And a long advertised window is not the same as reliable recall deep into that window; if long-context accuracy is core to your use case, test retrieval at depth rather than trusting the headline figure.
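One way to test retrieval at depth is a simple needle-in-a-haystack probe: plant a known fact at a chosen position in a long filler document, ask the model for it, and repeat at several depths. The sketch below only builds the prompts and checks answers; the names `build_depth_probe` and `recalled` are hypothetical. Sizes are in characters for simplicity, so a real test would count tokens with each vendor’s tokenizer.

```python
def build_depth_probe(filler: str, needle: str, depth: float, total_chars: int) -> str:
    """Return about total_chars of filler with the needle inserted at a relative depth.

    depth=0.0 puts the needle at the start, 1.0 at the end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    if not filler:
        raise ValueError("filler must be non-empty")
    repeats = total_chars // len(filler) + 1
    haystack = (filler * repeats)[:total_chars]
    cut = int(len(haystack) * depth)
    return haystack[:cut] + "\n" + needle + "\n" + haystack[cut:]


def recalled(answer: str, expected: str) -> bool:
    """Loose check: did the model's answer contain the planted fact?"""
    return expected.lower() in answer.lower()


# Build probes at shallow, middle, and deep positions (tiny sizes for illustration).
needle = "The vault code is 4417."
for depth in (0.1, 0.5, 0.9):
    prompt = build_depth_probe("Lorem ipsum dolor sit amet. ", needle, depth, total_chars=2_000)
    print(f"depth={depth}: needle starts at character {prompt.index(needle)}")
```

Sending each prompt with a question such as "What is the vault code?" and scoring the replies with `recalled` gives a rough recall-by-depth curve per model. A substring match is a blunt scorer, so treat the result as a smoke test rather than a benchmark.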
Pricing
Cost is where the comparison gets uneven, because one of the three has no published price.
Per Artificial Analysis, GPT-5.5 (xhigh) runs $5.00 per million input tokens and $30.00 per million output tokens, with cached input at $0.50. Claude Opus 4.7 (max) runs $6.25 per million input and $25.00 per million output, also with $0.50 cached input. So Opus 4.7 is cheaper on output, GPT-5.5 is cheaper on input, and which wins depends entirely on your input-to-output ratio. Long-prompt, short-answer workloads favor GPT-5.5; generation-heavy workloads favor Opus 4.7.
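The input-to-output trade-off above can be made concrete with a little arithmetic. The following sketch uses only the list prices quoted in this section and ignores cached-input discounts and any difference in how many tokens each model emits for the same task; the function names are illustrative.

```python
# USD per million tokens, as quoted above from Artificial Analysis (late May 2026).
PRICES = {
    "GPT-5.5 (xhigh)": {"input": 5.00, "output": 30.00},
    "Claude Opus 4.7 (max)": {"input": 6.25, "output": 25.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """List-price cost in USD for one request, with no caching applied."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def break_even_ratio(a: str, b: str) -> float:
    """Input tokens per output token at which models a and b cost the same."""
    pa, pb = PRICES[a], PRICES[b]
    return (pa["output"] - pb["output"]) / (pb["input"] - pa["input"])


gpt, opus = "GPT-5.5 (xhigh)", "Claude Opus 4.7 (max)"
print("Break-even input:output ratio:", break_even_ratio(gpt, opus))

# A long-prompt, short-answer request versus a generation-heavy one.
for inp, out in ((100_000, 2_000), (2_000, 8_000)):
    print(inp, out, round(request_cost(gpt, inp, out), 4), round(request_cost(opus, inp, out), 4))
```

At these list prices the break-even sits at 4 input tokens per output token: above that ratio GPT-5.5 is cheaper per request, below it Opus 4.7 is. Real bills also depend on cache hit rates and on each model’s output length for the same task, so this is a first approximation only.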
Qwen3.7-Max-Preview has no announced API pricing as of late May 2026. For reference, the prior-generation Qwen3.6-Max-Preview was priced around $1.30 per million input and $7.80 per million output through Alibaba Cloud. If Qwen3.7-Max lands near that range, it would undercut both US models by a wide margin. That is a reasonable expectation, not a confirmed price, so plan around it carefully. Whatever the sticker price, remember Qwen’s verbosity: 97M tokens on a benchmark where the average is 26M means your real bill scales faster than the per-token rate suggests.
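To see how verbosity could eat into a low sticker price, here is a rough sketch. It rests on two loud assumptions: that Qwen3.7-Max launches at the Qwen3.6-Max-Preview output rate (not confirmed), and that the benchmark verbosity ratio carries over to your workload (it may not). The constant and function names are hypothetical.

```python
# Figures quoted above from Artificial Analysis.
QWEN_EVAL_OUTPUT_TOKENS = 97_000_000
AVERAGE_EVAL_OUTPUT_TOKENS = 26_000_000

# ASSUMPTION: prior-generation Qwen3.6-Max-Preview output price, USD per million tokens.
# Qwen3.7-Max pricing has not been announced.
ASSUMED_OUTPUT_PRICE = 7.80


def verbosity_multiplier(model_tokens: int, average_tokens: int) -> float:
    """How many output tokens the model emits per token an average model emits."""
    return model_tokens / average_tokens


def effective_output_price(list_price: float, multiplier: float) -> float:
    """List price scaled by verbosity: cost per million 'average-model' output tokens."""
    return list_price * multiplier


m = verbosity_multiplier(QWEN_EVAL_OUTPUT_TOKENS, AVERAGE_EVAL_OUTPUT_TOKENS)
print(f"Verbosity multiplier: {m:.2f}x")
print(f"Effective output price: ${effective_output_price(ASSUMED_OUTPUT_PRICE, m):.2f} per 1M")
```

Under those assumptions the multiplier is about 3.7x, which lifts a $7.80 list rate to roughly $29 per million average-equivalent output tokens. That figure is not directly comparable to the other models’ list prices, since they have their own verbosity, but it shows why the per-token rate alone can mislead.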
If token spend is your main constraint, the cheapest model on paper is not always the cheapest in practice. Output volume, caching, and retry behavior all move the number. Our guide on how to reduce agent token costs from the CLI covers the levers that matter more than the rate card.
Availability and openness
This category has a clear ranking, and it is the one most likely to rule a model out.
GPT-5.5 is generally available through the OpenAI API and Codex today. Proprietary, no weights, but stable and production-ready.
Claude Opus 4.7 is generally available through the Anthropic API, Amazon Bedrock, and Google Vertex AI. Also proprietary, also production-ready, with the widest cloud-platform reach of the three.
Qwen3.7-Max-Preview is preview-only. No public API endpoint, no open weights, access limited to Alibaba Cloud Model Studio and Qwen Studio. Alibaba has said the Plus tier will be open source while Max stays closed. For a production system today, preview status is a real blocker; for evaluation and roadmap planning it is fine. If you want a hands-on path, our walkthrough on how to use the Qwen 3.7 API covers current access, and there is a separate guide on how to use Qwen 3.7 for free through the Qwen chat interface while the API stabilizes.
Net: GPT-5.5 and Opus 4.7 are both ready to ship on. Qwen3.7-Max is not, yet.
Latency
Speed matters for anything user-facing or for agent loops that make many sequential calls.
Per Artificial Analysis, Claude Opus 4.7 has a time to first token around 27 seconds, and GPT-5.5 (xhigh) is slower at roughly 101 seconds. On output throughput, GPT-5.5 generates about 65.9 tokens per second against Opus 4.7’s 49.4. Two things to note. First, these are figures for the highest-effort reasoning tiers; lower-effort variants of both models respond much faster, and most production deployments do not run at max effort. Second, GPT-5.5 starts slow but streams fast once it begins, while Opus 4.7 starts faster but streams slower. For a chat UI, the faster first token usually feels better; for bulk generation, raw throughput wins.
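The slow-start versus fast-stream trade-off can be estimated with a simple model: total time is time to first token plus output tokens divided by throughput. The sketch below applies that model to the max-effort figures quoted above. It assumes throughput stays constant over the whole response, which is a simplification, and the function names are illustrative.

```python
def end_to_end_seconds(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Estimated wall-clock time: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / tokens_per_s


def crossover_tokens(ttft_quick: float, tps_quick: float,
                     ttft_slow: float, tps_slow: float) -> float:
    """Output length at which the slow-starting, faster-streaming model catches up."""
    return (ttft_slow - ttft_quick) / (1.0 / tps_quick - 1.0 / tps_slow)


# Figures quoted above (highest-effort tiers, per Artificial Analysis).
OPUS = {"ttft": 27.0, "tps": 49.4}
GPT = {"ttft": 101.0, "tps": 65.9}

n = crossover_tokens(OPUS["ttft"], OPUS["tps"], GPT["ttft"], GPT["tps"])
print(f"GPT-5.5 catches up after about {n:,.0f} output tokens")

for tokens in (1_000, 20_000):
    opus_s = end_to_end_seconds(OPUS["ttft"], OPUS["tps"], tokens)
    gpt_s = end_to_end_seconds(GPT["ttft"], GPT["tps"], tokens)
    print(f"{tokens} tokens: Opus 4.7 {opus_s:.0f}s, GPT-5.5 {gpt_s:.0f}s")
```

With these numbers the crossover lands near 14,600 output tokens, so for typical chat-length replies Opus 4.7 finishes first, and GPT-5.5’s throughput advantage only pays off on very long generations. Lower-effort tiers will shift both inputs, so rerun the arithmetic with figures measured on the tier you actually deploy.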
Qwen3.7-Max has no published speed or latency data on Artificial Analysis. Given the 97M-token verbosity figure, expect longer end-to-end times on reasoning-heavy prompts regardless of raw throughput, since the model simply produces more tokens to get to an answer.
Full comparison table
| Criterion | Qwen3.7-Max-Preview | GPT-5.5 | Claude Opus 4.7 |
|---|---|---|---|
| Vendor | Alibaba | OpenAI | Anthropic |
| Released | Preview, mid-May 2026 | April 23, 2026 | April 16, 2026 |
| AA Intelligence Index | 57 (#1 / 218 overall) | 60 (highest score) | 57 (#3 in class) |
| LM Arena text Elo | ~1,475 (#14, preliminary) | ~1,478 (#11) | ~1,492 (#4) |
| SWE-bench Verified | Not published | 88.7% | 87.6% |
| SWE-bench Pro | Not published | ~59% | ~64% |
| Context window | 1.0M tokens | 1M API / ~922K effective / 400K Codex | 1.0M tokens |
| Input price (per 1M) | Not announced (Qwen3.6-Max: ~$1.30) | $5.00 | $6.25 |
| Output price (per 1M) | Not announced (Qwen3.6-Max: ~$7.80) | $30.00 | $25.00 |
| Output speed | Not published | ~65.9 tok/s | ~49.4 tok/s |
| Time to first token | Not published | ~101 s (xhigh) | ~27 s |
| Availability | Preview only (Model Studio / Qwen Studio) | GA (OpenAI API, Codex) | GA (Anthropic API, Bedrock, Vertex) |
| Open weights | No (Max proprietary; Plus to be open) | No | No |
| Reasoning model | Yes (extended thinking) | Yes (extended thinking) | Yes (adaptive reasoning) |
Sources: Artificial Analysis model pages, the LM Arena text leaderboard, SWE-bench leaderboard tracking, and vendor announcements, all current as of late May 2026. Preview-stage Qwen figures are not finalized; benchmark and Elo numbers move, so verify against the live boards before you quote them.
Real-world use cases
Benchmarks are a starting point. Here is how the three behave across jobs people actually run.
Building an autonomous coding agent
You want a model that resolves GitHub issues, runs terminal commands, and stays inside a token budget across long agent loops. GPT-5.5 fits this best. It tops SWE-bench Verified, leads Terminal-Bench 2.0, and its reported reduction of around 72% in output tokens compounds over thousands of agent steps. Opus 4.7 is a strong alternative when the codebase is large and architectural reasoning matters more than shell throughput.
Refactoring a large legacy codebase
Here the task is reasoning across hundreds of files, holding a wide mental model, and producing PR-quality changes. Claude Opus 4.7 leads on SWE-bench Pro and on broad-codebase tasks, and its 1M-token window lets you load real context. This is its strongest single use case.
Long-document analysis and research synthesis
For lengthy contracts, research papers, or transcripts, the three are close to tied on capacity: all offer roughly 1M tokens. Opus 4.7’s higher LM Arena standing suggests cleaner summaries that humans prefer; Qwen3.7-Max matches the window and would likely undercut on cost once priced. For a production document pipeline today, choose Opus 4.7 or GPT-5.5; for a cost-sensitive internal tool where preview access is fine, Qwen is worth a pilot.
Customer-facing chat and assistants
When end users judge the output, LM Arena Elo is the most relevant signal. Opus 4.7 leads the three on human preference, which is the metric that tracks user satisfaction most directly, and it also has the faster time to first token. GPT-5.5 is a fine second choice, especially for long responses where its higher output throughput offsets the slower start.
High-volume, cost-sensitive workloads
For classification, extraction, or bulk generation where you process millions of tokens daily, price dominates. If Qwen3.7-Max ships near its predecessor’s rates, it would be the clear pick. Until the API and pricing are public, GPT-5.5 (cheaper input) or Opus 4.7 (cheaper output) wins depending on your token mix. Whichever you choose, validate the real per-request cost rather than trusting the rate card, because output volume varies a lot between these models.
Per-use-case picks
A quick decision guide:
- Best for coding agents and terminal automation: GPT-5.5. Top SWE-bench Verified score, best terminal performance, and the most token-efficient by a wide margin.
- Best for large-codebase engineering: Claude Opus 4.7. Leads SWE-bench Pro and broad architectural tasks, with a full 1M-token window.
- Best for conversational and user-facing products: Claude Opus 4.7. Highest LM Arena human-preference Elo of the three.
- Best for raw benchmark intelligence: GPT-5.5. Highest Artificial Analysis Intelligence Index score at 60.
- Best for budget and long context (with caveats): Qwen3.7-Max-Preview. A 1M-token window and likely low pricing, but it is preview-only with no production API yet.
- Best available-today all-rounder: a toss-up between GPT-5.5 and Opus 4.7; both are GA, both are excellent, and the right call depends on whether you optimize for token cost or human-preferred quality.
If a fourth contender belongs in your evaluation, Google’s model is worth a look too. We cover what Gemini 3.5 is separately, and there is a direct Gemini 3.5 vs GPT-5.5 vs Opus 4.7 comparison for that three-way matchup.
How to test all three yourself
Benchmarks generalize; your workload is specific. The fastest way to settle a model choice is to send the same prompts to each API and compare the responses, token counts, and latency directly.

Apidog makes that side-by-side test straightforward. Create one request for each model’s chat endpoint, drop them in a shared workspace, and run them against the same input. You can inspect full responses, measure response time, and track token usage in one place instead of juggling three separate consoles or scripts. Save the requests as a reusable test scenario and you can re-run the comparison every time a model updates, which, given how fast these three are iterating, will be often. Download Apidog to set up your first multi-model comparison.
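If you prefer a script to a GUI, the same idea fits in a small harness: send one prompt to several model clients and record latency and output size for each. This is a minimal sketch, not a reproduction of what Apidog does. The two `fake_model_*` functions are stand-ins so the example runs offline; in practice each would wrap a real call to a vendor SDK or HTTP endpoint, and those calls, model identifiers, and API keys are deliberately left out.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RunResult:
    model: str
    seconds: float
    output_chars: int
    text: str


def compare(prompt: str, clients: Dict[str, Callable[[str], str]]) -> List[RunResult]:
    """Send the same prompt to every client and return results, fastest first."""
    results = []
    for name, call in clients.items():
        start = time.perf_counter()
        text = call(prompt)
        elapsed = time.perf_counter() - start
        results.append(RunResult(name, elapsed, len(text), text))
    return sorted(results, key=lambda r: r.seconds)


# Hypothetical stand-ins for real API clients (no network calls are made).
def fake_model_a(prompt: str) -> str:
    return prompt.upper()


def fake_model_b(prompt: str) -> str:
    return prompt[::-1]


results = compare("Summarize this contract.", {"model-a": fake_model_a, "model-b": fake_model_b})
for r in results:
    print(f"{r.model}: {r.seconds * 1000:.3f} ms, {r.output_chars} chars")
```

Character counts are a crude proxy for token usage. When you swap in real clients, prefer the token counts each API reports in its response, and run every prompt several times, since single-shot latency is noisy.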
Conclusion
There is no single winner here, and any article that names one is oversimplifying. The honest takeaways:
- GPT-5.5 has the highest benchmark intelligence (60 on the Artificial Analysis Intelligence Index), tops SWE-bench Verified, and is the most token-efficient. Best for coding agents and cost-sensitive automation.
- Claude Opus 4.7 wins human-preference quality on LM Arena, leads the harder SWE-bench Pro, and has the widest cloud availability. Best for large-codebase engineering and user-facing products.
- Qwen3.7-Max-Preview holds the #1 spot on the Artificial Analysis leaderboard, matches the others on context window, and will likely be the cheapest once priced. But it is preview-only today, so it is a roadmap candidate, not a production choice yet.
- The “Qwen ranked #1” headline is accurate but partial: Qwen tops the overall leaderboard while GPT-5.5 posts the higher raw score. Read both.
- Benchmark numbers and Elo ratings move week to week. Verify against the live boards before you commit.
The right model is the one that wins on your actual prompts, your token mix, and your latency budget. Test all three against the same requests in Apidog before you decide; an afternoon of side-by-side testing beats a month of guessing from leaderboards.



