AI Benchmarks Explained: How to Read the Scores Before You Choose a Model

Written by Caterina Mora | Jun 16, 2026 5:48:19 PM

On June 9, Anthropic released Claude Fable 5 and called it state-of-the-art on nearly every benchmark it tested. Within hours the charts were everywhere: 80.3% on SWE-bench Pro, 59% on something called Humanity’s Last Exam, an Elo of 1932 on GDPval-AA. The next morning, a client forwarded one of those charts to us with a one-line question: “Is this good?”

It’s a fair question, and almost nobody answers it. The AI industry produces an enormous volume of benchmarking and almost no explanation of what the scores mean. Meanwhile, companies are choosing models and tools right now, often with a vendor’s chart as the only evidence on the table. So here’s a plain-language guide: what benchmarks are, what the major ones actually measure, and how the scores should (and shouldn’t) factor into choosing an LLM.

What is a Benchmark?

A benchmark is a standardized test for AI models. Researchers assemble a fixed set of tasks, such as exam questions, coding problems, or documents to analyze. Every model takes the same test, and the score is usually the share of tasks it completes correctly (a few benchmarks, such as GDPval-AA below, report a relative rating instead). Same exam, different students. That’s the whole idea.

The value comes from the “standardized” part. If Model A scores 80% and Model B scores 58% on the same software engineering test, that gap means something. What a single number can’t tell you is whether the test resembles your work. A high SAT score says very little about whether someone will be a good plumber.

Benchmarks also age quickly. MMLU, a general-knowledge test you’ll still see quoted in marketing materials, was the industry standard a few years ago. Today every frontier model scores within a few points of the top, and the test no longer separates them. When everyone gets an A, the exam stops being informative. Researchers respond by writing harder exams, which is why the names keep changing.

Benchmarks You'll See in 2026

Five tests show up in nearly every model announcement today, and each measures something different:

SWE-bench Pro (coding)

Can the model resolve real software engineering tasks in real codebases, working on its own? This is the most-watched benchmark right now because it tests sustained, multi-step work rather than one-shot answers.

GDPval-AA (knowledge work)

Models complete realistic occupational tasks, such as memos, analyses, and presentations, and their outputs are ranked against each other like chess players, producing an Elo rating. This is the closest thing we have to a benchmark for the work your team does all day.
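To get a feel for what an Elo gap means, here is a minimal sketch that converts a rating difference into an expected head-to-head win rate. It uses the standard chess Elo formula with its 400-point scale; whether GDPval-AA uses exactly that scale is our assumption, and the function name is ours, for illustration only.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected score for A against B under the standard chess Elo formula.

    Assumption: the benchmark's ratings use the usual 400-point logistic scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# GDPval-AA ratings as published in the comparison table in this article.
fable_5, opus_48, gpt_55 = 1932, 1890, 1769

print(f"Fable 5 vs Opus 4.8: {expected_win_rate(fable_5, opus_48):.1%}")
print(f"Fable 5 vs GPT-5.5:  {expected_win_rate(fable_5, gpt_55):.1%}")
```

Under that assumption, a 42-point gap works out to the higher-rated model being preferred in roughly 56 of 100 comparisons: a real edge, but far from a sweep.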

Humanity's Last Exam (frontier reasoning)

Thousands of questions written by subject-matter experts to sit at the edge of human knowledge. It was built specifically because models had saturated the older tests.

GPQA Diamond (scientific reasoning)

PhD-level science questions designed so that even PhD holders outside the relevant field score around 34%. Top models now score in the 90s, so this one is approaching saturation too.

OSWorld-Verified (computer use)

Can the model operate an actual computer, including clicking, typing, and navigating applications, to finish a task the way a person would?

How the Frontier Models Compare Today

The table below shows published scores for the current frontier models, drawn from Anthropic’s June 9, 2026, announcement of Claude Fable 5.

Benchmark	What it measures	Claude Fable 5	Claude Opus 4.8	GPT-5.5	Gemini 3.1 Pro
SWE-bench Pro	Real-world software engineering, done autonomously	80.3%	69.2%	58.6%	54.2%
GDPval-AA	Everyday professional knowledge work (Elo rating)	1932	1890	1769	1314
Humanity's Last Exam (no tools)	Expert-level reasoning across disciplines	59.0%*	49.8%	41.4%	44.4%
OSWorld-Verified	Operating a computer to complete tasks	85.0%	83.4%	78.7%	76.2%
AutomationBench	Using software tools to automate workflows	17.4%	15.5%	12.9%	9.6%
Legal Agent Benchmark	Complex, multi-step legal tasks	13.3%	10.4%	2.1%	0.0%

Source: Anthropic, Claude Fable 5 and Claude Mythos 5 announcement, June 9, 2026. *Starred scores reflect Mythos 5, a restricted-access variant of the same model; on those topics, the generally available Fable 5 routes queries to Opus 4.8 for safety reasons and performs closer to Opus 4.8’s score.

One more note on the landscape: open-weight models (MiniMax, GLM, Kimi, and others) now post coding scores that rival proprietary models on some benchmark suites. They’re measured on different test versions, so resist comparing those numbers directly against this table. That instinct to cross-compare is exactly the trap this article is about.

What the Scores Don't Tell You

Benchmarks are useful evidence. They’re also marketing, and the same chart serves both purposes. A few things to keep in mind before quoting a number in your vendor evaluation:

Rank isn’t margin. “Highest-scoring model” tells you placement, not distance. A model can lead by one point or by twenty, and those mean different things when the leader costs twice as much. Fable 5’s 11-point lead over Opus 4.8 on SWE-bench Pro is a real gap; a half-point lead on a saturated test is noise.
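A quick way to see margin rather than rank is to compute the leader's gap over the runner-up on each benchmark. The sketch below does that for two rows of the comparison table above; the helper name is hypothetical, and the scores are copied from that table.

```python
# Scores (percent) copied from two rows of the comparison table in this article.
scores = {
    "SWE-bench Pro": {
        "Claude Fable 5": 80.3, "Claude Opus 4.8": 69.2,
        "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2,
    },
    "OSWorld-Verified": {
        "Claude Fable 5": 85.0, "Claude Opus 4.8": 83.4,
        "GPT-5.5": 78.7, "Gemini 3.1 Pro": 76.2,
    },
}


def lead_over_runner_up(results: dict[str, float]) -> tuple[str, str, float]:
    """Return (leader, runner_up, gap in points) for one benchmark."""
    ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
    (leader, top), (runner_up, second) = ranked[0], ranked[1]
    return leader, runner_up, round(top - second, 1)


for benchmark, results in scores.items():
    leader, runner_up, gap = lead_over_runner_up(results)
    print(f"{benchmark}: {leader} leads {runner_up} by {gap} points")
```

The same model tops both rows, but the margins differ by a wide factor: 11.1 points on SWE-bench Pro versus 1.6 on OSWorld-Verified.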

Footnotes change the headline. The Fable 5 launch is a perfect teaching example. Several of its strongest published scores carry an asterisk: they belong to Mythos 5, a restricted sibling that most businesses can’t buy. The model you can deploy performs closer to Opus 4.8 in those areas. That detail lives in the methodology notes, not the chart.

Test authors are often launch partners. FrontierCode comes from Cognition; the Finance Benchmark comes from Hebbia. Both are credible organizations building serious evaluations, and both were partners in the launch story. That doesn’t make the results wrong. It means independent reproduction is worth waiting for before you treat them as settled.

Settings matter. Many models score higher when allowed more “effort,” meaning more reasoning time and compute. Two models compared at different effort levels aren’t being compared fairly, and a score quoted without its settings is incomplete.

How to Factor Benchmarks into Choosing an LLM

Match the Benchmark to the Job

If you’re choosing a model for contract review, SWE-bench Pro is irrelevant to you; the legal and document-reasoning results are everything. Decide on your top three use cases first, then find the two or three benchmarks that resemble them. Ignore the rest of the chart.

Read the Footnotes Before the Headline

Check which model variant produced the score, what effort setting was used, and who built the test. Five minutes in the methodology notes will save you from buying a number that doesn’t ship.

Weigh Capability Against Cost

Fable 5 costs $10 per million input tokens and $50 per million output tokens, double Opus 4.8’s price. For long, complex, autonomous work, the premium can pay for itself many times over; Anthropic reports the model compressed a two-month code migration at Stripe into a single day. For routine drafting and summarizing, the cheaper tier is often the smarter buy. The best-scoring model isn’t automatically the right model.
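To turn those prices into a budget line, a back-of-the-envelope calculation is enough. In the sketch below, the Opus 4.8 prices are derived from the "double" statement above (assuming it applies to both input and output), and the monthly token volumes are purely hypothetical; substitute your own usage figures.

```python
# Prices in USD per million tokens: (input, output).
PRICES_PER_MILLION = {
    "Claude Fable 5": (10.00, 50.00),
    # Assumption: half of Fable 5 on both sides, per "double Opus 4.8's price".
    "Claude Opus 4.8": (5.00, 25.00),
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated spend in USD for a given monthly token volume."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens / 1_000_000) * input_price \
        + (output_tokens / 1_000_000) * output_price


# Hypothetical workload: 200M input tokens and 40M output tokens per month.
for model in PRICES_PER_MILLION:
    cost = monthly_cost(model, 200_000_000, 40_000_000)
    print(f"{model}: ${cost:,.2f} per month")
```

At that hypothetical volume the gap is $2,000 a month, which is the number to weigh against whatever time the stronger model actually saves on your tasks.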

Run Your Own Ten-Prompt Test

Collect ten real tasks from your team’s actual week: the report someone drafts every Friday, the messy spreadsheet, the vendor email. Run them through the two or three models you’re considering and score the outputs blind. In our work with clients, this half-day exercise has changed the purchasing decision more often than any leaderboard.
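If you want to keep the scoring honest, a few lines of code can handle the blinding. This is a minimal sketch rather than a prescribed method: it assumes you have already collected each model's output per task, and the function names, task names, and model names are hypothetical.

```python
import random
from collections import defaultdict


def blind_sheet(outputs: dict[str, dict[str, str]], seed: int = 0):
    """Shuffle each task's outputs and hide model names behind labels.

    outputs maps task -> {model name: output text}.
    Returns (sheet, key): the sheet goes to reviewers, the key stays sealed.
    """
    rng = random.Random(seed)
    sheet, key = [], {}
    for task, by_model in outputs.items():
        models = list(by_model)
        rng.shuffle(models)  # order differs per task, so labels reveal nothing
        for position, model in enumerate(models):
            label = f"{task}-{chr(ord('A') + position)}"
            sheet.append((label, by_model[model]))
            key[label] = model
    return sheet, key


def tally(ratings: dict[str, float], key: dict[str, str]) -> dict[str, float]:
    """Unblind reviewer ratings and return the mean score per model."""
    by_model = defaultdict(list)
    for label, score in ratings.items():
        by_model[key[label]].append(score)
    return {model: sum(s) / len(s) for model, s in by_model.items()}


# Hypothetical example with one task and two candidate models.
outputs = {"friday-report": {"model-x": "Draft one...", "model-y": "Draft two..."}}
sheet, key = blind_sheet(outputs)
for label, text in sheet:
    print(label, "->", text)
```

Reviewers score each label (say, 1 to 5) without seeing the key; `tally` then maps the ratings back to models once everyone has finished.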

Re-check Quarterly, Not Weekly

The leaderboard reshuffles every few weeks; your evaluation cycle shouldn’t. Pick a model, build with it, and revisit the decision on a quarterly cadence unless a release materially changes your top use case.

Fable 5 is a meaningful release, and if your organization runs on Claude it’s worth testing against your hardest work while it’s included in subscription plans. But the bigger takeaway is durable: a benchmark is a map, not the territory. The territory is your own work, and that’s the test that should make the decision.
