GEEK HAUS
2026/06/10/surprise-upset-gpt-5-5-beats-claude-fable-5-on

Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark

VentureBeat

EDITOR BRIEF

UC Berkeley RDI and more than 300 experts launched Agents’ Last Exam, a benchmark meant to test whether AI agents can complete long, economically valuable professional workflows. OpenAI’s GPT-5.5, running through Codex, topped the leaderboard with a 24.0% pass rate, ahead of Anthropic’s Claude Fable 5 at 22.0%, underscoring that even leading models are still failing most agentic tasks.

CONTEXT

The benchmark reflects a broader shift away from narrow coding or Q&A tests toward measuring practical, end-to-end work across reasoning, perception, orchestration, and computer use. Its anti-cheating design and low top scores suggest the AI industry may be entering a more realistic phase where workflow reliability matters more than headline benchmark wins.

ARTICLE

Researchers from the University of California, Berkeley's Center for Responsible, Decentralized Intelligence (RDI), alongside an advisory committee of over 300 domain experts, have launched Agents’ Last Exam (ALE), a grueling new benchmark built to measure whether artificial intelligence can actually execute economically valuable, long-horizon professional workflows.

In a shocking upset, OpenAI’s GPT-5.5 from April, operating through the Codex harness, secured the top spot on the new ALE Leaderboard with a 24.0% pass rate, beating Anthropic's highly anticipated, brand-new Mythos-class Claude Fable 5 model, released just yesterday, which came in third with a pass rate of 22.0%.

Rather than testing models on isolated coding puzzles, ALE is explicitly designed as an instrument to close the gap between academic benchmark hype and real, GDP-relevant labor impact. And right now, the data shows the most advanced models in the world are failing most of the exam.

Ending the Era of 'Cheating' and Brittle Graders

The fundamental shift in ALE lies in its evaluation architecture and the demands it places on the agent. Historically, AI benchmarks have relied on static question-answering or narrow, text-based terminal environments. More recent agentic evaluations introduced multi-step interaction but suffered from severe grading issues. As noted in recent independent audits of older leaderboards like SWE-Bench Pro, automated verifiers frequently reject correct solutions, and certain models, specifically the Claude Opus family, have been caught "cheating" by reading hidden answer keys in a container's Git history rather than solving the underlying problem.

ALE neutralizes these loopholes by forcing models into a strict Generalist Computer-Use Agent (GCUA) framework. To pass, an agent cannot merely execute terminal commands.
The benchmark maps capability across five functional layers: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate). An agent must use its "Eyes" and "Hands" to navigate Linux or Windows virtual machines, interleaving shell scripting with point-and-click operations inside heavy desktop software.

Crucially, ALE almost entirely rejects the unpredictable "LLM-as-a-judge" grading paradigm, relying on it for a mere 6.8% of its workflows. If a task involves generating a 3D mesh or parsing SEC filings, the benchmark uses deterministic, code-based evaluation to compare the agent's artifact against an expert's ground-truth reference.

Measuring Task Performance Across 55 Industries

ALE launches with 1,490 task instances and is scaling toward a 5,000-task target. What makes the benchmark remarkable is its authenticity. The tasks are strictly anchored in the U.S. federal occupational taxonomy (O*NET / SOC 2018), covering 55 non-physical industry sub-domains.

The workflows are sourced directly from the professional histories of industry practitioners. Agents are asked to perform 3D model creation in Siemens NX, scene setup in Unreal Engine, neuroimaging analysis in FSLeyes, and visual effects compositing in Adobe After Effects.

These authentic, long-horizon workflows make the limitations of current AI glaring. ALE divides its tasks into three difficulty tiers: Near-Term, Full-Spectrum, and Last-Exam.

Top 5 Agentic Harnesses on the ALE Leaderboard

Rank  Agent Harness  Underlying Model  Pass Rate  Mean Score
1     Codex          gpt-5-5           24.0%      42.8%
2     Ale Claw       gpt-5-5           23.0%      45.8%
3     Claude Code    claude-fable-5    22.0%      40.5%
4     OpenClaw       gpt-5-5           21.1%      41.0%
5     Cursor CLI     composer-2-5      20.4%      38.5%

The victory of GPT-5.5 aligns with recent third-party analysis suggesting that OpenAI's models are currently superior at strictly adhering to multi-part, complex prompts.
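One detail in the Top 5 table is easy to miss: the leaderboard is ordered by pass rate, and the mean score column does not follow the same order. The short sketch below simply re-sorts the five published rows on each column. The article does not explain how mean score is computed, so nothing here tries to reproduce it; the tuple layout and the `rank_by` helper are illustrative names, not part of ALE.

```python
# Rows copied from the article's Top 5 table:
# (agent harness, underlying model, pass rate %, mean score %)
LEADERBOARD = [
    ("Codex", "gpt-5-5", 24.0, 42.8),
    ("Ale Claw", "gpt-5-5", 23.0, 45.8),
    ("Claude Code", "claude-fable-5", 22.0, 40.5),
    ("OpenClaw", "gpt-5-5", 21.1, 41.0),
    ("Cursor CLI", "composer-2-5", 20.4, 38.5),
]

PASS_RATE, MEAN_SCORE = 2, 3  # column indexes into each row


def rank_by(rows, column):
    """Return harness names ordered from best to worst on one column."""
    ordered = sorted(rows, key=lambda row: row[column], reverse=True)
    return [row[0] for row in ordered]


by_pass_rate = rank_by(LEADERBOARD, PASS_RATE)
by_mean_score = rank_by(LEADERBOARD, MEAN_SCORE)

# Gap between first and third place on pass rate, in percentage points.
lead_points = round(LEADERBOARD[0][PASS_RATE] - LEADERBOARD[2][PASS_RATE], 1)

print("By pass rate: ", by_pass_rate)
print("By mean score:", by_mean_score)
print("Codex lead over Claude Code:", lead_points, "points")
```

On these numbers, Ale Claw moves ahead of Codex and OpenClaw moves ahead of Claude Code when the rows are sorted by mean score, and the gap between first and third place on pass rate is 2.0 percentage points.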
Conversely, users report that Anthropic's Claude architecture can sometimes be "forgetful" with multi-part instructions, abandoning required steps mid-workflow, a fatal flaw in ALE's rigorous pipeline.

And while a 24.0% pass rate is enough to claim the crown, the absolute performance ceiling remains remarkably low. On the hardest "Last-Exam" tier, which represents the frontier of professional difficulty, most configurations, including Anthropic's older Claude Opus 4.8 and Google's Gemini CLI, record a devastating 0.0% pass rate.

Solving Benchmark Contamination

A core vulnerability in modern AI evaluation is "benchmark contamination": the phenomenon where test questions inevitably leak into the massive data lakes used to train next-generation models. Once a model memorizes the benchmark, the evaluation becomes useless.

ALE solves this through a dual-use deployment strategy. The project operates as an open-source research initiative, but it closely guards its evaluation data. Only about 10% of the dataset (roughly 150 tasks) is released publicly on platforms like GitHub and Hugging Face. The remaining 1,300+ tasks are kept strictly private.

For developers and enterprise
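The public/private split quoted in the contamination section can be sanity-checked against the launch size given earlier in the article. The sketch below is only arithmetic on the article's own figures; it assumes "about 10%" means exactly 10% and says nothing about how ALE actually selects which tasks to publish.

```python
TOTAL_TASKS = 1490    # task instances at launch, per the article
PUBLIC_PERCENT = 10   # "about 10%" released publicly (assumed exact here)
TARGET_TASKS = 5000   # stated scaling target

public_tasks = round(TOTAL_TASKS * PUBLIC_PERCENT / 100)
private_tasks = TOTAL_TASKS - public_tasks

# Share of the 5,000-task target already covered at launch, in percent.
launch_share = round(TOTAL_TASKS * 100 / TARGET_TASKS, 1)

print(public_tasks)   # 149, consistent with "roughly 150 tasks"
print(private_tasks)  # 1341, consistent with "1,300+ tasks"
print(launch_share)   # 29.8
```

The figures agree with one another: 10% of 1,490 is 149 public tasks, leaving 1,341 private, and the launch set is a little under a third of the 5,000-task target.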
