Claude Fable 5 Benchmarks: What the Numbers Say

When Anthropic launched Claude Fable 5 on June 9, 2026, it called the model state-of-the-art on nearly every benchmark it tested. If you came here looking for hard numbers next to every eval, there’s an honest caveat up front: the text of Anthropic’s announcement leaned on benchmark placements (where Fable 5 ranks against other frontier models) more than on full numeric scoreboards, and several of the headline charts arrived as images rather than copy-pasteable tables. So this roundup focuses on what the placements actually mean, where Fable 5 sits, and how you can run your own quick eval if you want numbers you control. For a wider comparison of the current frontier, our breakdown of Opus 4.8 against GPT-5.5 and Gemini 3.5 is a useful companion.

Fable 5 ships at $10 per million input tokens and $50 per million output tokens, under the model id claude-fable-5. It sits a tier above Opus 4.8 in both capability and price, and Anthropic positions it as the strongest publicly available Claude for software engineering, knowledge work, vision, and scientific research.

TL;DR

Claude Fable 5 ranks first among frontier models on FrontierCode and FrontierBench (both from Cognition), is state-of-the-art on CursorBench, and posts the highest score on Hebbia’s Finance Benchmark. It shows clear strength on long-horizon, autonomous work. Anthropic reported these as placements, so exact public scores are limited. Treat the rankings as directional, not final.

The headline result

The single sentence that frames every Claude Fable 5 benchmark discussion: Anthropic describes the model as state-of-the-art on nearly all of the benchmarks it ran, covering software engineering, knowledge work, vision, and scientific research. It’s a broad claim, and broad claims deserve a careful read.

“State-of-the-art on nearly all benchmarks” means Fable 5 either tops the leaderboard or sits in the top tier on most evals Anthropic chose to report. It does not mean Fable 5 wins every test by a wide margin, and it does not mean independent labs have reproduced each result. What it signals is consistency: a model that is best-in-class on coding but mediocre on document reasoning would not earn that phrasing. Fable 5 appears to stay at or near the top across categories that usually trade off against each other.

That breadth matters more than any one chart. Plenty of models spike on a favorite benchmark and sag elsewhere. A model that stays near the top across coding, finance, vision, and science is harder to game, because you can’t tune for four unrelated skills at once without genuine capability underneath. If you’re deciding whether Fable 5 is worth the jump from a cheaper tier, the breadth of the placements is the part to weigh. For the full primer on the model itself, see what Claude Fable 5 is.

A second theme runs through the results: long-horizon work. Anthropic says Fable 5 “stays focused across millions of tokens in long-running tasks” and works autonomously for longer than any previous Claude. Several of the placements below are not single-shot accuracy tests. They reward a model that can hold a plan together over thousands of steps without drifting. That is where Fable 5’s reported lead is widest, and it’s also the capability that’s hardest to capture in a single number.

Coding benchmarks: FrontierCode and CursorBench

Coding is where Fable 5’s benchmark story is strongest and most concrete.

On FrontierCode, a coding eval from Cognition (the team behind the Devin coding agent), Anthropic reports that Fable 5 is the highest-scoring frontier model, and it holds that lead even at medium effort. The “effort” qualifier is worth pausing on. Many frontier models can be pushed to higher accuracy by spending more inference compute (more reasoning tokens, more attempts, higher effort settings). A model that already leads at medium effort is reaching the top without the most expensive configuration, a better signal for everyday use than a number that only appears at maximum spend.

On CursorBench, Anthropic describes Fable 5 as state-of-the-art and frames the result around scope rather than a single accuracy figure. The phrase from the announcement is that Fable 5 “opened up a class of long-horizon problems that were out of reach” for prior models. CursorBench leans toward the multi-file, multi-step engineering work that real codebases demand, so a state-of-the-art placement here speaks to agentic coding more than to isolated function-writing.

Both results point the same direction: Fable 5 is built for sustained engineering, not snippet completion. If you spend your day in a coding agent that plans, edits across files, runs tests, and iterates, these are the benchmarks that map to your workflow. A model that tops FrontierCode at medium effort and pushes CursorBench into new territory should hold up under long agent sessions rather than fraying after a few turns.

Knowledge and finance: Finance Benchmark (Hebbia)

Outside of code, the clearest knowledge-work result comes from the Finance Benchmark built by Hebbia, a company focused on AI for document-heavy financial and legal work.

Anthropic reports that Fable 5 posts the highest score of any model on this benchmark, with gains concentrated in three areas: document reasoning, charts, and tables. That combination is telling. Financial analysis is rarely a trivia question. It’s reading a long filing, tracing a number across several pages, reconciling a chart against the text that describes it, and pulling the right cell out of a dense table without misreading the column. Those are exactly the skills the Finance Benchmark stresses, and the ones that trip up models that are strong on prose but weak on structured data.

The vision angle matters here too. Charts and tables are often images or mixed layouts, so a high Finance Benchmark score is partly a vision result. It lines up with Anthropic’s broader claim that Fable 5 is strong on vision, and suggests the model handles the messy, real-world documents that knowledge workers deal with rather than clean text-only inputs.

For developers, the practical read is that Fable 5 is a candidate for document-extraction pipelines, financial analysis tooling, and any workflow where the input is a PDF full of numbers rather than a tidy JSON payload. If your product reads contracts, statements, or reports and has to be right about the figures, this is the placement to watch. Validate on your own documents before you trust a benchmark to predict your results.

Long-horizon reasoning: FrontierBench (Cognition)

The second Cognition eval, FrontierBench, is where the autonomy story turns into a benchmark placement. Anthropic reports Fable 5 as the highest-scoring model on FrontierBench and singles out long-horizon reasoning as the reason.

Long-horizon reasoning is the ability to keep a goal and a plan coherent across a long task: many steps, many tokens, many chances to lose the thread. Most benchmarks reward a correct answer to a contained question. FrontierBench, by Anthropic’s framing, rewards a model that can stay on task while the context window fills with its own intermediate work. That’s a different muscle, and the one Anthropic keeps pointing back to with phrases like “stays focused across millions of tokens.”

This is also the placement that’s hardest to verify from the outside, because long-horizon performance is itself hard to measure. A long-horizon eval has to define what “staying on task” means, how partial progress is scored, and how to stop a model from gaming the metric by stalling. So treat the FrontierBench placement as a strong directional signal that Fable 5 is built for autonomous, long-running agents, while keeping in mind that long-horizon scoring is an evolving area where methodology still varies between labs. Taken together with CursorBench, the story is consistent: Fable 5’s edge is less about answering one hard question and more about not falling apart over a long task.

Real-world performance beyond benchmarks

Benchmarks are a proxy. The two results Anthropic highlighted from real deployments are arguably more informative than any leaderboard, because they show the model doing a job rather than passing a test.

The first is a Stripe codebase migration. Anthropic reports that Fable 5 migrated a 50-million-line Ruby codebase for Stripe in a single day, work the team estimated would have taken two months or more. Read that carefully. A 50-million-line migration is not a coding puzzle. It’s a sprawling, repetitive, context-heavy slog across thousands of files where small inconsistencies compound into broken builds. The signal isn’t that Fable 5 is clever; it’s that it can sustain correct, consistent edits at enormous scale without drifting. That is the long-horizon capability the benchmarks gesture at, shown on a genuine production system.

The second is a Slay the Spire test. Slay the Spire is a deck-building roguelike, and Anthropic used it to probe memory rather than coding. With persistent file memory enabled, Fable 5 showed a 3x improvement over Opus 4.8 at the game. The mechanism is the interesting part: the gain came from letting the model write notes to files and read them back across runs, accumulating strategy the way a human player would. It points to a model that gets meaningfully better when you give it durable memory, instead of starting cold every session.

What do these tell you that benchmarks don’t? Two things. First, scale endurance: a benchmark question is small by design, and the Stripe result shows behavior at a scale no standard eval reaches. Second, memory and tool use as force multipliers: the Slay the Spire result isn’t about raw model IQ; it’s about how the model improves when wired into an environment with persistent state. Both are properties you only see when a model is embedded in a real system, which is also why they’re harder to compare across vendors. If you’re evaluating Fable 5 for an agent that runs for hours and keeps its own notes, these signals matter more than a single accuracy percentage.
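To make the “persistent file memory” idea concrete, here is a minimal Python sketch of a notes file that survives between runs. It is an illustration only, not the harness Anthropic used: the file name `agent_notes.md` and the helpers `load_notes`, `append_note`, and `build_prompt` are all hypothetical.

```python
from pathlib import Path

# Hypothetical notes file; any durable location would do.
NOTES_PATH = Path("agent_notes.md")


def load_notes(path: Path = NOTES_PATH) -> str:
    """Return notes saved by earlier runs, or an empty string on a cold start."""
    return path.read_text(encoding="utf-8") if path.exists() else ""


def append_note(note: str, path: Path = NOTES_PATH) -> None:
    """Append one lesson learned so that later runs can read it back."""
    with path.open("a", encoding="utf-8") as f:
        f.write(note.rstrip() + "\n")


def build_prompt(task: str, path: Path = NOTES_PATH) -> str:
    """Prepend any accumulated notes to the task before it is sent to the model."""
    notes = load_notes(path)
    if not notes:
        return task
    return f"Notes from earlier runs:\n{notes}\nTask:\n{task}"
```

In a real agent the model would decide what to write through a file tool; the point of the sketch is only that state written in one session is available at the start of the next.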

How to read these results

A benchmark roundup that only cheerleads isn’t useful. Here are the caveats to hold alongside the placements.

The benchmark owners are partners. FrontierCode and FrontierBench come from Cognition, and the Finance Benchmark comes from Hebbia. These are credible organizations building serious evals, and their involvement is a plus, not a red flag. But they’re also partners in the launch narrative, and a benchmark designed by one party tends to reward the capabilities that party cares about. That doesn’t make the results wrong; it means you should want independent reproduction before treating them as settled. Cross-reference with neutral comparisons like our look at MiniMax M3 versus Opus 4.7 versus GPT-5.5 to see how Anthropic’s models hold up against other framings.

“Effort” settings change the picture. The FrontierCode result was reported at medium effort, which is encouraging. But effort is a real variable across these evals. Two models compared at different effort levels aren’t being compared fairly, and a number quoted without its effort setting is incomplete. When you see a Fable 5 score online, check what effort and how many attempts produced it before you compare it to anything.

Public scores are limited. Anthropic’s announcement leaned on placements, and the detailed charts arrived as images, which is why this article stays qualitative on the specific evals. Secondary outlets have filled the gap with numbers, but those figures vary and aren’t all traceable to a primary source, so they shouldn’t anchor a buying decision yet. When Cognition and Hebbia publish their own leaderboards, prefer those.

Placement is not margin. “Highest-scoring” tells you the rank, not the gap. A model can lead by a point or by twenty, and the two mean different things for whether the upgrade is worth the $10/$50 per-million-token pricing. Without the underlying scores, treat the lead as real but unquantified.

None of this is a reason to dismiss the results. Fable 5 leading across coding, finance, vision, and long-horizon reasoning, plus the Stripe and Slay the Spire deployments, is a strong and coherent picture. The caveats are a reason to verify on your own workload before you commit, which is the right move with any new model regardless of who made it. The models overview is the place to confirm current ids, pricing, and context limits before you wire anything up.

Run your own benchmark with Apidog

The most reliable benchmark is the one that uses your prompts and your definition of “good.” You don’t need a research harness to get a useful read. Build a lightweight DIY eval by sending a fixed test prompt to the Fable 5 API and comparing the response against Opus 4.8 on three axes you can measure directly: output quality, latency, and token cost.

Here’s a simple way to do it with Apidog, an API platform for designing, testing, and documenting requests. The idea is to create one request in Apidog, point it at each model, and read the response, timing, and token usage side by side.

Set up a POST request to the Claude messages endpoint and save it as a reusable request in Apidog so you can re-run it without retyping anything.

POST https://api.anthropic.com/v1/messages
x-api-key: {{ANTHROPIC_API_KEY}}
anthropic-version: 2023-06-01
content-type: application/json

Give it a body with a fixed task. Pick a prompt that looks like your real work, not a toy. A migration-style instruction is a good stress test for a coding model:

{
  "model": "claude-fable-5",
  "max_tokens": 2048,
  "messages": [
    {
      "role": "user",
      "content": "Refactor this Ruby method to use keyword arguments and add RSpec tests. Return only the updated code:\n\ndef charge(amount, currency, customer_id, idempotency_key)\n  # ...\nend"
    }
  ]
}

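Run it once against claude-fable-5. Then duplicate the request, change the model field to claude-opus-4-8, and run the same prompt. Because the input is identical, differences in output come from the model rather than the prompt. Responses can still vary from run to run, so repeat each request a few times before drawing conclusions.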
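If you would rather script the comparison than click through it, the same request can be sent from Python using only the standard library. This is a rough sketch, not an official client: it reuses the endpoint, headers, and body shown above, assumes the key is in an `ANTHROPIC_API_KEY` environment variable, and the helper names `build_request` and `run_once` are made up for this example.

```python
import json
import os
import time
import urllib.request

API_URL = "https://api.anthropic.com/v1/messages"


def build_request(model: str, prompt: str, max_tokens: int = 2048) -> urllib.request.Request:
    """Build the POST request from the article, varying only the model id."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: the key is exported as ANTHROPIC_API_KEY.
            "x-api-key": os.environ.get("ANTHROPIC_API_KEY", ""),
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        method="POST",
    )


def run_once(model: str, prompt: str) -> dict:
    """Send one request; return the reply text, wall-clock latency, and usage block."""
    request = build_request(model, prompt)
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=300) as response:
        payload = json.load(response)
    latency_s = time.perf_counter() - start
    # Join the text blocks of the reply and skip any other block types.
    text = "".join(
        block.get("text", "")
        for block in payload.get("content", [])
        if block.get("type") == "text"
    )
    return {
        "model": model,
        "text": text,
        "latency_s": latency_s,
        "usage": payload.get("usage", {}),
    }
```

Calling `run_once("claude-fable-5", prompt)` and then `run_once("claude-opus-4-8", prompt)` with the same `prompt` string gives you two comparable records. The latency here is measured on the client, so it includes network time.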

Now read the three signals Apidog surfaces for each call:

Quality. Eyeball both responses against your own rubric. Did the test cover edge cases? Did the refactor stay correct? Score both before you look at which model produced which.
Latency. Apidog shows the response time for each request. For an interactive tool, a model that’s twice as accurate but four times slower may still be the wrong pick.
Token cost. The Claude response includes a usage block with input_tokens and output_tokens. Multiply by the published rates ($10 and $50 per million for Fable 5, $5 and $25 for Opus 4.8) to get the real cost of each answer.
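The token-cost arithmetic is easy to get wrong by a factor of a million, so here is a small Python sketch of it. The rates are the ones quoted in this article and should be checked against current pricing; the `cost_usd` helper and the example usage numbers are hypothetical.

```python
# Per-million-token rates quoted in this article (USD). Verify before relying on them.
RATES_PER_MILLION = {
    "claude-fable-5": {"input": 10.00, "output": 50.00},
    "claude-opus-4-8": {"input": 5.00, "output": 25.00},
}


def cost_usd(model: str, usage: dict) -> float:
    """Return the dollar cost of one response, given its usage block."""
    rates = RATES_PER_MILLION[model]
    total = (
        usage["input_tokens"] * rates["input"]
        + usage["output_tokens"] * rates["output"]
    )
    return total / 1_000_000


# Made-up usage numbers, for illustration only.
example_usage = {"input_tokens": 120, "output_tokens": 800}
for model in RATES_PER_MILLION:
    print(f"{model}: ${cost_usd(model, example_usage):.4f}")
```

With these illustrative token counts the same answer costs about $0.0412 on Fable 5 and $0.0206 on Opus 4.8. In practice the two models will not emit the same number of output tokens, so use each response’s own usage block.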

Repeat this across five or ten prompts that mirror your actual use, and you’ll have a small, honest benchmark that tells you what the public leaderboards can’t: whether Fable 5’s edge shows up on your tasks at a price you’re willing to pay. You can download Apidog and have this set up in a few minutes. For a deeper cost breakdown, our Fable 5 pricing guide does the math.
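Once you have five or ten runs per model, a few lines of Python can roll them up into one row per model. The sketch below assumes you have recorded a rubric score, latency, and cost for each run; the record layout and the `summarize` helper are hypothetical, and the sample values are placeholders rather than measurements of either model.

```python
from collections import defaultdict
from statistics import mean, median

# Placeholder records, one per (model, prompt) run. The numbers are invented
# for illustration and do not describe real behavior of either model.
runs = [
    {"model": "claude-fable-5", "score": 4, "latency_s": 9.0, "cost_usd": 0.04},
    {"model": "claude-fable-5", "score": 5, "latency_s": 11.0, "cost_usd": 0.05},
    {"model": "claude-opus-4-8", "score": 4, "latency_s": 6.0, "cost_usd": 0.02},
    {"model": "claude-opus-4-8", "score": 3, "latency_s": 8.0, "cost_usd": 0.03},
]


def summarize(records: list) -> dict:
    """Group runs by model and compute mean score, median latency, and total cost."""
    by_model = defaultdict(list)
    for record in records:
        by_model[record["model"]].append(record)
    summary = {}
    for model, rows in by_model.items():
        summary[model] = {
            "runs": len(rows),
            "mean_score": mean(r["score"] for r in rows),
            "median_latency_s": median(r["latency_s"] for r in rows),
            "total_cost_usd": sum(r["cost_usd"] for r in rows),
        }
    return summary


for model, stats in summarize(runs).items():
    print(model, stats)
```

Median latency is used rather than the mean so that one slow request does not dominate a small sample.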
