GEEK HAUS
2026/05/26/deepswe-blows-up-the-ai-coding-leaderboard-crowns

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

Source: VentureBeat

Editor's summary

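Datacurve has released DeepSWE, a benchmark of 113 tasks spanning 91 open-source repositories and five programming languages, arguing that existing AI coding benchmarks have understated the gaps between top models. In this evaluation, OpenAI's GPT-5.5 took first place at 70%, 16 percentage points ahead of its nearest competitor, and Datacurve said SWE-Bench Pro's automated graders issued incorrect pass/fail verdicts on roughly one-third of the trials it reviewed.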

Context

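These results directly call into question the reliability of the coding benchmarks that enterprise procurement teams, investors, and AI labs have relied on. In particular, task design that better reflects real development environments, together with validation of the graders themselves, is likely to emerge as a key criterion in the competition among AI coding agents.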

Full text

For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same. OpenAI's GPT-5 family, Anthropic's Claude Opus, and Google's Gemini Pro have clustered within a narrow band on Scale AI's SWE-Bench Pro leaderboard, making it nearly impossible for engineering leaders to determine which agent will actually perform best inside their codebases.

On Monday, a startup called Datacurve released a benchmark it says shatters that illusion. DeepSWE, a 113-task evaluation spanning 91 open-source repositories and five programming languages, produces a dramatically wider spread among the same frontier models, and crowns OpenAI's GPT-5.5 as the clear leader at 70%, sixteen points ahead of its nearest competitor.

"On public leaderboards, top models often look relatively close in capability," wrote Datacurve co-author Serena Ge on X. "DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work."

The benchmark also delivers a pointed critique of the evaluation infrastructure the AI industry relies on to measure progress: Datacurve's audit found that SWE-Bench Pro's verifiers (the automated graders that determine whether an agent solved a task) issued incorrect pass/fail verdicts on roughly one-third of the trials it reviewed.

If that finding holds up, it has sweeping implications. Enterprise procurement teams, venture capitalists, and AI lab marketing departments all lean heavily on benchmark scores to make multimillion-dollar decisions.
A 32% error rate in the most widely cited coding benchmark suggests the industry may have been navigating by a broken compass.

Why the most popular AI coding benchmark may be grading on a curve

To understand what Datacurve is claiming, it helps to understand how coding benchmarks work, and how they can go wrong.

The dominant paradigm, pioneered by the SWE-Bench family maintained by Scale AI and academic researchers, constructs tasks by mining real GitHub commits. The process extracts a bug fix or feature addition from a repository's history, rolls the code back to the pre-fix state, and then asks an AI agent to reproduce the change. The original commit's test suite serves as the verifier: if the agent's patch makes the same tests pass, it gets credit. This approach has an elegant simplicity, but Datacurve argues it introduces three systemic weaknesses.

First, contamination. Because tasks are drawn from public GitHub history, the problem statement, the discussion, and often the exact solution are already present in the training data of frontier models. "The SWE-Bench family scrapes existing GitHub issues and PRs, which creates two problems: memorization (models have already seen the solution) and triviality (most tasks are small)," Ge wrote.

Second, scope. SWE-Bench Pro tasks require, on average, just 120 lines of code added across 5 files. DeepSWE's reference solutions average 668 lines added across 7 files, roughly 5.5 times more code. Yet DeepSWE's prompts are actually shorter, averaging 2,158 characters versus SWE-Bench Pro's 4,614. In other words, DeepSWE gives the agent less instruction but expects far more output, which more closely mirrors how a human developer might actually delegate work to an AI assistant.

Third, and most damaging: verifier reliability.
Datacurve drew 30 tasks at random from both DeepSWE and SWE-Bench Pro, ran three rollouts across 10 frontier model configurations, and then deployed an LLM-based judge to independently assess whether each agent's patch actually solved the problem. SWE-Bench Pro's verifiers accepted wrong implementations 8.5% of the time and rejected correct implementations 24% of the time. DeepSWE's verifiers registered 0.3% and 1.1%, respectively.

The false negative problem is especially insidious because it punishes creative solutions. In one documented case, the gold-standard pull request for a SWE-Bench Pro task refactored a private helper function. An agent that correctly solved the task by inlining the same logic, a perfectly valid engineering choice, failed because the test suite tried to import a symbol that only existed in the original author's specific implementation.

OpenAI's GPT-5.5 dominates the new benchmark while Claude and Gemini stumble

DeepSWE's top-line results reorder the familiar hierarchy in ways that should matter to every engineering team evaluating AI coding tools. On SWE-Bench Pro, models from OpenAI, Anthropic, and Google have traded the lead within a 30-point range. DeepSWE stretches that range to 70 points.

GPT-5.5 leads at 70%, followed by GPT-5.4 at 56% and Claude Opus 4.7 at 54%. From there, the drop-off is steep: Claude Sonnet 4.6 lands at 32%, Gemini 3.5 Flash at 28%, GPT-5.4-mini and Kimi K2.6 tied at 24%, and then a long tail of models in the teens and single digits. Claude Haiku 4.5, w
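
For readers who want to see how the verifier audit figures quoted earlier fit together, here is a minimal Python sketch. It is not Datacurve's code. It assumes each audited trial records two verdicts (the benchmark verifier's and the independent judge's), that the judge is treated as ground truth, and that both error rates are fractions of all reviewed trials, which is how the article's combined 32% figure appears to be derived. The `Trial` class, the `audit_rates` function, and the trial counts are all hypothetical, chosen only so the toy numbers mirror the reported percentages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One audited rollout (hypothetical record layout)."""
    verifier_passed: bool  # verdict of the benchmark's automated test suite
    judge_correct: bool    # verdict of the independent judge, treated as ground truth


def audit_rates(trials):
    """Return (false_accept, false_reject, combined) as fractions of all trials."""
    if not trials:
        raise ValueError("need at least one audited trial")
    n = len(trials)
    # False accept: the verifier passed a patch the judge considers wrong.
    false_accepts = sum(t.verifier_passed and not t.judge_correct for t in trials)
    # False reject: the verifier failed a patch the judge considers correct.
    false_rejects = sum(t.judge_correct and not t.verifier_passed for t in trials)
    return false_accepts / n, false_rejects / n, (false_accepts + false_rejects) / n


# Made-up counts for illustration only, NOT Datacurve's data.
toy_trials = (
    [Trial(verifier_passed=True, judge_correct=False)] * 17    # wrong patch accepted
    + [Trial(verifier_passed=False, judge_correct=True)] * 48  # correct patch rejected
    + [Trial(verifier_passed=True, judge_correct=True)] * 80   # verdicts agree: pass
    + [Trial(verifier_passed=False, judge_correct=False)] * 55 # verdicts agree: fail
)

false_accept, false_reject, combined = audit_rates(toy_trials)
print(f"false accepts: {false_accept:.1%}")
print(f"false rejects: {false_reject:.1%}")
print(f"combined error: {combined:.1%}")
```

Under these assumptions the two error types simply add, so false accepts of 8.5% and false rejects of 24% give a combined error rate of about one-third. If the rates were instead conditional (for example, false rejects as a share of correct patches only), the combined figure would have to be computed differently.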
