DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole
EDITOR BRIEF
Datacurve released DeepSWE, a 113-task AI coding benchmark across 91 open-source repositories and five languages, showing a much wider performance gap among frontier models than SWE-Bench Pro. OpenAI’s GPT-5.5 led with a 70% score, 16 points ahead of the nearest rival, while Datacurve says SWE-Bench Pro graders produced incorrect verdicts in about one-third of reviewed trials.
CONTEXT
The results suggest enterprise buyers may need more realistic and auditable coding evaluations before committing to AI developer tools. If the claimed benchmark error rate is validated, it could weaken confidence in leaderboard-driven marketing and push the industry toward tougher, more transparent evaluations.
ARTICLE
For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same. OpenAI's GPT-5 family, Anthropic's Claude Opus, and Google's Gemini Pro have clustered within a narrow band on Scale AI's SWE-Bench Pro leaderboard, making it nearly impossible for engineering leaders to determine which agent will actually perform best inside their codebases.

On Monday, a startup called Datacurve released a benchmark it says shatters that illusion. DeepSWE, a 113-task evaluation spanning 91 open-source repositories and five programming languages, produces a dramatically wider spread among the same frontier models and crowns OpenAI's GPT-5.5 as the clear leader at 70%, sixteen points ahead of the nearest model from a rival lab.

"On public leaderboards, top models often look relatively close in capability," wrote Datacurve co-author Serena Ge on X. "DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work."

The benchmark also delivers a pointed critique of the evaluation infrastructure the AI industry relies on to measure progress: Datacurve's audit found that SWE-Bench Pro's verifiers, the automated graders that determine whether an agent solved a task, issued incorrect pass/fail verdicts on roughly one-third of the trials it reviewed.

If that finding holds up, it has sweeping implications. Enterprise procurement teams, venture capitalists, and AI lab marketing departments all lean heavily on benchmark scores to make multimillion-dollar decisions.
A 32% error rate in the most widely cited coding benchmark suggests the industry may have been navigating by a broken compass.

Why the most popular AI coding benchmark may be grading on a curve

To understand what Datacurve is claiming, it helps to understand how coding benchmarks work and how they can go wrong.

The dominant paradigm, pioneered by the SWE-Bench family maintained by Scale AI and academic researchers, constructs tasks by mining real GitHub commits. The process extracts a bug fix or feature addition from a repository's history, rolls the code back to the pre-fix state, and then asks an AI agent to reproduce the change. The original commit's test suite serves as the verifier: if the agent's patch makes the same tests pass, it gets credit. This approach has an elegant simplicity, but Datacurve argues it introduces three systemic weaknesses.

First, contamination. Because tasks are drawn from public GitHub history, the problem statement, the discussion, and often the exact solution are already present in the training data of frontier models. "The SWE-Bench family scrapes existing GitHub issues and PRs, which creates two problems: memorization (models have already seen the solution) and triviality (most tasks are small)," Ge wrote.

Second, scope. SWE-Bench Pro tasks require, on average, just 120 lines of code added across 5 files. DeepSWE's reference solutions average 668 lines added across 7 files, roughly 5.5 times more code. Yet DeepSWE's prompts are actually shorter, averaging 2,158 characters versus SWE-Bench Pro's 4,614. In other words, DeepSWE gives the agent less instruction but expects far more output, which more closely mirrors how a human developer might actually delegate work to an AI assistant.

Third, and most damaging, verifier reliability.
Datacurve drew 30 tasks at random from both DeepSWE and SWE-Bench Pro, ran three rollouts across 10 frontier model configurations, and then deployed an LLM-based judge to independently assess whether each agent's patch actually solved the problem. SWE-Bench Pro's verifiers accepted wrong implementations 8.5% of the time and rejected correct implementations 24% of the time. DeepSWE's verifiers registered 0.3% and 1.1%, respectively. The false-negative problem is especially insidious because it punishes creative solutions. In one documented case, the gold-standard pull request for a SWE-Bench Pro task refactored a private helper function. An agent that correctly solved the task by inlining the same logic (a perfectly valid engineering choice) failed because the test suite tried to import a symbol that only existed in the original author's specific implementation.

OpenAI's GPT-5.5 dominates the new benchmark while Claude and Gemini stumble

DeepSWE's top-line results reorder the familiar hierarchy in ways that should matter to every engineering team evaluating AI coding tools. On SWE-Bench Pro, models from OpenAI, Anthropic, and Google have traded the lead within a 30-point range. DeepSWE stretches that range to 70 points.

GPT-5.5 leads at 70%, followed by GPT-5.4 at 56% and Claude Opus 4.7 at 54%. From there, the drop-off is steep: Claude Sonnet 4.6 lands at 32%, Gemini 3.5 Flash at 28%, GPT-5.4-mini and Kimi K2.6 tied at 24%, and then a long tail of models in the teens and single digits. Claude Haiku 4.5, w
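
As a back-of-the-envelope check on the verifier audit figures quoted earlier, the sketch below adds the false-accept and false-reject rates for each benchmark. It assumes both percentages are shares of all reviewed trials, which the article does not state explicitly; under that assumption the two error types are disjoint and can simply be summed. The function and variable names are illustrative and are not taken from Datacurve's tooling.

```python
def misgrade_rate(false_accept_pct: float, false_reject_pct: float) -> float:
    """Total share of trials graded incorrectly, in percent.

    Assumption (not confirmed by the article): both inputs are percentages
    of ALL reviewed trials, so the two error types do not overlap and can
    be added directly.
    """
    return round(false_accept_pct + false_reject_pct, 1)


# Figures as reported in the article.
swe_bench_pro = misgrade_rate(8.5, 24.0)  # wrong patches accepted, correct patches rejected
deepswe = misgrade_rate(0.3, 1.1)

print(f"SWE-Bench Pro: {swe_bench_pro}% of trials misgraded")
print(f"DeepSWE: {deepswe}% of trials misgraded")
```

Under that assumption the SWE-Bench Pro total comes to 32.5%, in line with the roughly one-third error rate cited above, while DeepSWE's comes to 1.4%. If the reported percentages are instead conditional rates (for example, the share of correct patches that were rejected), the overall figure would depend on how many reviewed patches were correct, which the article does not report.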