Claude Sonnet 5 Benchmarks: What the Numbers Actually Say

Claude Sonnet 5 benchmarks explained: SWE-bench Pro 63.2%, Terminal-Bench 80.4%, OSWorld 81.2%, and how close it gets to Opus 4.8 at a lower price.

Ashley Innocent

1 July 2026

Claude Sonnet 5 launched on June 30, 2026, and the headline claim from Anthropic is bold: agentic performance close to Opus 4.8 at a much lower price. This article walks through the benchmark scores reported at launch, explains what the pattern actually means, and shows where the numbers stop being useful. If you want the full model overview first, start with the Claude Sonnet 5 pillar guide. For the raw figures straight from the source, Anthropic published them on the official announcement page.

Here is the short version. On tasks where the model works in a tight tool loop, Sonnet 5 lands within about two points of Opus 4.8. On SWE-bench Pro, which leans harder on reasoning across large codebases, the gap widens to six points. That single pattern explains most buying decisions, and it is the thread we pull on below.

All numbers in this article are Anthropic’s launch benchmarks, corroborated across multiple launch-day writeups. Treat them as reported figures, not as our own independent testing.

The benchmark table

Three benchmarks tell the story. Here are the reported scores for Sonnet 5, its predecessor Sonnet 4.6, and the flagship Opus 4.8.

| Benchmark | What it measures | Sonnet 5 | Sonnet 4.6 | Opus 4.8 |
| --- | --- | --- | --- | --- |
| SWE-bench Pro | Agentic coding on real repos | 63.2% | 58.1% | 69.2% |
| Terminal-Bench 2.1 | Command-line task completion | 80.4% | not reported | 82.7% |
| OSWorld-Verified | Computer use, GUI tasks | 81.2% | 78.5% | 83.4% |

A few things jump out.

Sonnet 5 beats Sonnet 4.6 on every benchmark where both were reported. The SWE-bench Pro jump from 58.1% to 63.2% is over five points, which is a real generational gain for agentic coding. OSWorld-Verified moves from 78.5% to 81.2%.

Against Opus 4.8, Sonnet 5 trails by 6.0 points on SWE-bench Pro, 2.3 points on Terminal-Bench 2.1, and 2.2 points on OSWorld-Verified. The gap is smallest on the two tasks that lean hardest on tools and the terminal.
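The gaps quoted above are plain subtraction over the table. As a minimal sketch, the Python below recomputes them from the reported launch figures only; the dictionary layout and the function name are illustrative choices, not anything Anthropic publishes.

```python
# Reported launch scores in percent, copied from the table above.
# None marks a score that was not reported.
SCORES = {
    "SWE-bench Pro": {"sonnet_5": 63.2, "sonnet_4_6": 58.1, "opus_4_8": 69.2},
    "Terminal-Bench 2.1": {"sonnet_5": 80.4, "sonnet_4_6": None, "opus_4_8": 82.7},
    "OSWorld-Verified": {"sonnet_5": 81.2, "sonnet_4_6": 78.5, "opus_4_8": 83.4},
}


def gap(benchmark, model_a, model_b):
    """Return model_a minus model_b in percentage points, or None if a score is missing."""
    a = SCORES[benchmark][model_a]
    b = SCORES[benchmark][model_b]
    if a is None or b is None:
        return None
    return round(a - b, 1)


for name in SCORES:
    behind_opus = gap(name, "opus_4_8", "sonnet_5")
    over_previous = gap(name, "sonnet_5", "sonnet_4_6")
    print(f"{name}: {behind_opus} behind Opus 4.8, {over_previous} ahead of Sonnet 4.6")
```

Keeping the missing Terminal-Bench score as None, rather than zero, stops the script from inventing a generational gain that was never reported.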

The pattern that matters

Read the table again with one question in mind: how much can the model use tools to solve the problem?

On Terminal-Bench 2.1 and OSWorld-Verified, the model runs commands, reads output, and adjusts. It gets feedback from the environment on every step. Sonnet 5 sits a little over two points behind Opus 4.8 on both (2.3 and 2.2 points respectively).

SWE-bench Pro is also agentic, but it stresses deeper reasoning about large codebases, and there the gap opens to six points. When the task rewards raw reasoning over tool loops, Opus pulls ahead.

Anthropic’s own framing supports this. They call Sonnet 5 the most agentic Sonnet model yet, and they position it as close to Opus 4.8 on agentic and tool-use tasks while Opus keeps its lead on pure reasoning. The benchmarks match the marketing here, which is not always the case.

So the practical read is simple. If your workload puts tools in the loop (agents, coding assistants, computer use), Sonnet 5 gives you most of Opus 4.8’s capability. If your workload is a single hard reasoning pass with no tools to correct course, Opus earns its premium. For a full side-by-side including price and context, see Claude Sonnet 5 vs Opus 4.8.

Price changes how you read these scores

Benchmarks in isolation flatter the most expensive model. Add price and the picture shifts.

Sonnet 5 runs at introductory pricing of $2 per million input tokens and $10 per million output tokens through August 31, 2026, then moves to standard $3 / $15. Opus 4.8 is $5 / $25. So on standard rates Sonnet 5 costs 60% of Opus input and 60% of Opus output, and even less during the intro window.
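The 60% figure is easy to verify, and the same arithmetic gives the intro-window ratio. A small sketch using only the per-million-token prices quoted above; the key names are made up for this example.

```python
# USD per million tokens as quoted above: (input, output).
PRICES = {
    "sonnet_5_intro": (2.0, 10.0),      # through August 31, 2026
    "sonnet_5_standard": (3.0, 15.0),
    "opus_4_8": (5.0, 25.0),
}


def price_ratio(model, baseline="opus_4_8"):
    """Return (input_ratio, output_ratio) of a model's price to the baseline's."""
    model_in, model_out = PRICES[model]
    base_in, base_out = PRICES[baseline]
    return (model_in / base_in, model_out / base_out)


for key in ("sonnet_5_standard", "sonnet_5_intro"):
    ratio_in, ratio_out = price_ratio(key)
    print(f"{key}: {ratio_in:.0%} of Opus input price, {ratio_out:.0%} of Opus output price")
```

On these list prices the intro window works out to 40% of Opus on both input and output. This compares price per token only, not the cost of an equivalent request.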

Now reweigh the table. The Opus premium is the same on every task, but what it buys varies: 2.3 points on Terminal-Bench 2.1 versus 6 points on SWE-bench Pro. For agentic and tool-heavy work, paying the Opus premium to recover two or three points is often not worth it. That is the whole value argument for Sonnet 5, and the benchmarks are what make it credible.

One catch that pure scores hide: Sonnet 5 uses a new tokenizer that produces roughly 30% more tokens for the same input text. Per-token price is unchanged from Sonnet 4.6, but the cost of an equivalent request can rise because there are more tokens to bill. Benchmark accuracy says nothing about this. Model your real cost with token counting rather than assuming flat parity. The full breakdown lives in the Claude Sonnet 5 pricing guide.
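To see how the tokenizer change can move the bill, here is a rough sketch. The 30% figure is the approximate number quoted above, the request size is hypothetical, and applying the same inflation to output tokens is an assumption made for illustration. Measure real counts with token counting before relying on any of this.

```python
def equivalent_request_cost(base_input_tokens, base_output_tokens,
                            price_in, price_out,
                            input_inflation=1.0, output_inflation=1.0):
    """USD cost of one request; prices are per million tokens.

    base_*_tokens are counts under the old tokenizer. The inflation
    factors scale them to what the new tokenizer would produce.
    """
    input_tokens = base_input_tokens * input_inflation
    output_tokens = base_output_tokens * output_inflation
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical request: 10,000 input and 2,000 output tokens on Sonnet 4.6.
old = equivalent_request_cost(10_000, 2_000, 3.0, 15.0)
# Same text on Sonnet 5 at standard pricing, assuming 30% more tokens on both sides.
new_standard = equivalent_request_cost(10_000, 2_000, 3.0, 15.0, 1.3, 1.3)
# Same text on Sonnet 5 at intro pricing, same assumption.
new_intro = equivalent_request_cost(10_000, 2_000, 2.0, 10.0, 1.3, 1.3)

print(f"Sonnet 4.6:        ${old:.4f}")
print(f"Sonnet 5 standard: ${new_standard:.4f}")
print(f"Sonnet 5 intro:    ${new_intro:.4f}")
```

Under these assumptions the request costs about 30% more than on Sonnet 4.6 at standard rates, while the intro rate more than offsets the extra tokens. Your own ratio depends on how your text actually tokenizes.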

What benchmarks miss

Public benchmarks are useful for ranking models. They are weak at predicting how a model behaves on your specific work. Three gaps stand out.

Your workload is not SWE-bench. If you write TypeScript against a private API with in-house conventions, a repo-solving benchmark on public Python projects is a rough proxy at best. The relative ranking tends to hold, but the absolute number will not match what you see.

Cost per solved task beats raw accuracy. A model that scores two points lower but costs 40% less can solve more tasks for the same budget. When you run agents at volume, cost-per-success is the metric that pays the bills, and no leaderboard reports it for your prompts.
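A worked version of that claim, as a sketch. It borrows the Terminal-Bench 2.1 scores as stand-in success rates and assumes a hypothetical attempt that costs $1.00 on Opus 4.8 and $0.60 on Sonnet 5 (the standard-price ratio, ignoring any difference in token counts). Your real success rates and per-attempt costs will differ.

```python
def cost_per_success(cost_per_attempt, success_rate):
    """Expected spend per solved task, treating each attempt as independent."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


def tasks_solved(budget, cost_per_attempt, success_rate):
    """Expected number of solved tasks for a fixed budget."""
    return budget / cost_per_success(cost_per_attempt, success_rate)


# Illustrative inputs only: benchmark scores used as success rates,
# and a made-up per-attempt cost in USD.
opus = tasks_solved(1000, 1.00, 0.827)
sonnet = tasks_solved(1000, 0.60, 0.804)

print(f"Opus 4.8: about {opus:.0f} solved tasks per $1,000")
print(f"Sonnet 5: about {sonnet:.0f} solved tasks per $1,000")
```

With these inputs the cheaper model solves roughly 1,340 tasks against 827 for the same budget. That is the sense in which a two-point accuracy deficit can still win on cost per success.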

Latency and throughput do not appear. Benchmarks measure whether the answer is right, not how fast it arrives or how the model behaves under adaptive thinking, which is on by default in Sonnet 5. For interactive tools, a slower correct answer can lose to a faster good-enough one.

The honest conclusion is to treat these scores as a starting filter, then run your own evaluation. Benchmarking on tasks you actually care about is the only test that reflects your results.

Safety, briefly

Benchmark tables rarely include safety, but it is part of how these numbers should be read.

Anthropic reports that Sonnet 5 has a lower overall rate of undesirable behaviors than Sonnet 4.6, with less hallucination and less sycophancy. It is the first Sonnet-tier model with real-time cybersecurity safeguards. Requests touching prohibited or high-risk cyber topics may be refused, and a refusal returns as a successful HTTP 200 response with stop_reason: "refusal", not an error, so build for that case.
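Because a refusal arrives as a 200, a status-code check alone will count it as a success. One way to handle that in an eval harness is sketched below. It assumes the response JSON has already been parsed into a dict and relies only on the stop_reason field described above; the function name and the three labels are this sketch's own.

```python
def classify_response(status_code, body):
    """Sort a Messages API result into 'error', 'refusal', or 'ok'.

    status_code is the HTTP status; body is the parsed JSON response (a dict).
    """
    if status_code != 200:
        return "error"
    if body.get("stop_reason") == "refusal":
        # HTTP success, but the model declined: count it separately from real answers.
        return "refusal"
    return "ok"


# Hand-written sample payloads, trimmed to the fields this check reads.
print(classify_response(200, {"stop_reason": "refusal", "content": []}))
print(classify_response(200, {"stop_reason": "end_turn",
                              "content": [{"type": "text", "text": "Done."}]}))
print(classify_response(400, {"type": "error"}))
```

Tracking refusals as their own bucket keeps them from silently inflating or deflating your pass rate.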

Be honest about the caveats too. On Anthropic’s automated behavioral audit, Sonnet 5 showed higher misaligned-behavior rates than Opus 4.8. On cyber capability it sits below the Opus models, and neither Sonnet model could develop a working exploit at all, reported as 0.0%. Lower capability there is a feature, not a gap. Full detail is in Anthropic’s transparency hub.

Reproduce the numbers on your own tasks

The most valuable benchmark is the one that runs against your prompts. To do that reliably, you need to call the Sonnet 5 API the same way every time, save the requests, and compare responses across runs.

That is a job for an API client. Apidog lets you build a request to the Anthropic Messages API, save it in a reusable collection, store your API key as an environment variable, and run the same call repeatedly with assertions on the response. When you want to compare Sonnet 5 against Opus 4.8 or Sonnet 4.6 on your own inputs, you change one variable, the model ID, and rerun the collection.

Here is the request shape you would save. The model ID is the exact string claude-sonnet-5.

curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-5",
    "max_tokens": 2048,
    "messages": [
      {
        "role": "user",
        "content": "Refactor this function to remove the nested loop and explain the change."
      }
    ]
  }'

To A/B a benchmark prompt across models, keep the body identical and swap "model" between claude-sonnet-5, claude-opus-4-8, and claude-sonnet-4-6. In Apidog you would store the model as an environment variable so a single edit switches every request in the run. Add a test assertion to check stop_reason and response length, then run the collection in CI so your eval is repeatable. If you have never set up API tests this way, the testing without Postman guide walks through the workflow.
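If you would rather script the comparison than click through it, the same idea can be sketched in Python. The snippet only builds the request bodies and defines the assertion; sending them is left to whatever HTTP client you use. The helper names are hypothetical, and the response shape assumed here is a content list of text blocks plus a stop_reason field.

```python
import copy

MODELS = ["claude-sonnet-5", "claude-opus-4-8", "claude-sonnet-4-6"]

BASE_BODY = {
    "model": None,  # filled in per model
    "max_tokens": 2048,
    "messages": [
        {
            "role": "user",
            "content": "Refactor this function to remove the nested loop and explain the change.",
        }
    ],
}


def bodies_for(models, base=BASE_BODY):
    """Return one request body per model, identical except for "model"."""
    bodies = []
    for model_id in models:
        body = copy.deepcopy(base)  # deep copy so runs never share the messages list
        body["model"] = model_id
        bodies.append(body)
    return bodies


def check_response(body, min_chars=1):
    """Assertion helper: not a refusal, and at least min_chars of text returned."""
    text = "".join(
        block.get("text", "")
        for block in body.get("content", [])
        if block.get("type") == "text"
    )
    return body.get("stop_reason") != "refusal" and len(text) >= min_chars


for body in bodies_for(MODELS):
    print(body["model"], body["max_tokens"])
```

Building every body from one template is what makes the comparison fair: any difference in the results comes from the model, not from a drifted prompt.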

One migration note when you script comparisons: Sonnet 5 does not accept non-default temperature, top_p, or top_k, and it rejects the old thinking: {type: "enabled", budget_tokens: N} field. Both return a 400 error. Remove those parameters before you benchmark, or your run fails before it measures anything.
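A minimal way to apply that note is to scrub each request body before sending it to Sonnet 5. The function name is hypothetical, and the sketch is deliberately conservative: it drops the sampling parameters outright rather than checking whether they hold default values.

```python
UNSUPPORTED_SAMPLING = ("temperature", "top_p", "top_k")


def sanitize_for_sonnet_5(body):
    """Return a copy of a request body without the fields described above as rejected."""
    clean = {key: value for key, value in body.items() if key not in UNSUPPORTED_SAMPLING}
    thinking = clean.get("thinking")
    if isinstance(thinking, dict) and thinking.get("type") == "enabled":
        # Old-style {type: "enabled", budget_tokens: N} is rejected, so drop it.
        del clean["thinking"]
    return clean


legacy = {
    "model": "claude-sonnet-5",
    "max_tokens": 2048,
    "temperature": 0.2,
    "thinking": {"type": "enabled", "budget_tokens": 4096},
    "messages": [{"role": "user", "content": "Hello"}],
}
print(sorted(sanitize_for_sonnet_5(legacy)))
```

The original dict is left untouched, so the same legacy body can still be sent unchanged to an older model in the same run.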

Download Apidog to build the request once and reuse it across every model you want to score.

FAQ

What is Claude Sonnet 5’s SWE-bench Pro score? Anthropic’s launch figures report 63.2% for Sonnet 5, compared to 58.1% for Sonnet 4.6 and 69.2% for Opus 4.8. It is a five-point generational gain on agentic coding, and about six points behind the flagship.

Is Sonnet 5 better than Opus 4.8? Not on raw scores. Opus 4.8 leads every reported benchmark. But Sonnet 5 comes within about two points on tool-heavy tasks at 60% of the standard price, which makes it the better value for agents and coding loops. The full comparison is in Claude Sonnet 5 vs Opus 4.8.

Are these benchmark numbers from independent testing? No. They are Anthropic’s own launch benchmarks, corroborated across multiple launch-day writeups. Treat them as reported figures and validate on your own workload before you commit.

Why does Sonnet 5 do relatively better on tool tasks than reasoning tasks? When the model can run commands and read the results, it corrects its own mistakes step by step. That feedback narrows the gap to Opus. On a single reasoning pass with no tools, there is nothing to correct against, so Opus’s deeper reasoning shows up as a wider lead.

How do I benchmark Sonnet 5 on my own prompts? Call the Anthropic Messages API with the model ID claude-sonnet-5, save the request in a tool like Apidog, add assertions, and rerun it across models by swapping the model ID. Record the token usage and response time of each run, and you get cost-per-task and latency, which public leaderboards never report.
