GEEK HAUS
2026/06/08/researchers-trained-an-open-source-ai-search

Researchers trained an open source AI search agent, Harness-1, that outperforms GPT-5.4 on recalling relevant information

·VentureBeat

Editor's summary

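A joint research team from UIUC, UC Berkeley, and Chroma has released Harness-1, a 20B-parameter open source search agent based on OpenAI's gpt-oss-20B. Across eight complex search benchmarks, Harness-1 averaged 73% accuracy in recalling relevant information, ahead of GPT-5.4 (70.9%) and Tongyi DeepResearch 30B. The model code and weights have been released on Hugging Face under the Apache 2.0 license. Tinker, the distributed, web-based fine-tuning API from Thinking Machines, was used for training and inference.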

Context

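Harness-1's results show that the focus of competition among search agents is shifting from simply scaling up models toward designing and learning complex search procedures. In enterprise settings that handle dense material such as SEC filings, USPTO patents, and multi-hop question answering, autonomous search agents are likely to emerge as core infrastructure for internal knowledge search and research automation.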

Article

A joint research collaboration between the University of Illinois at Urbana-Champaign (UIUC), UC Berkeley, and Chroma, the open source AI-native vector database platform, has unveiled Harness-1, a 20-billion-parameter open source search agent built atop OpenAI's gpt-oss-20B open source model that redesigns how AI executes complex retrieval tasks.

Harness-1 scores a 73% average on its ability to correctly recall relevant information from a curated dataset, outperforming GPT-5.4 (70.9%) and beating the next most accurate open source search agent, Tongyi DeepResearch 30B, by 11.4 percentage points. (While GPT-5.5 has been out for more than a month, the researchers didn't test against it because it wasn't available when they were building their model.)

Crucially for developers, the model and its environment are available immediately under the highly permissive Apache 2.0 license, with model code and weights on Hugging Face.

Harness-1 also serves as a proof of efficacy for another effort: Tinker, the distributed, web-based AI model training and fine-tuning API developed by Thinking Machines. Tinker was used to train and run inference for Harness-1, highlighting how interactive infrastructure is enabling the next generation of autonomous models. So how did the researchers do it?

Benchmarks Decoded (and Why Harness-1 Could Help Enterprises Tremendously)

To put these models to the test, the researchers evaluated Harness-1 and its competitors across eight highly complex search benchmarks. Rather than asking simple trivia questions, these tests required the AI to act like a real researcher sifting through diverse, dense data sources.
The benchmarks spanned several domains, including open web searches, complex financial filings from the SEC, technical patent databases from the USPTO, and "multi-hop" question-answering tasks where the AI had to logically piece together scattered clues from multiple documents to arrive at the correct answer.

When the results came in, Harness-1 dominated the open source competition in its ability to find and curate the right facts. Even more impressively, this relatively small 20-billion-parameter model went toe-to-toe with massive, expensive proprietary AI systems. It outperformed heavyweights like GPT-5.4, Sonnet-4.6, and Kimi-K2.5, which are thought to have hundreds of billions or trillions of parameters. Only one giant frontier model, Opus-4.6, managed to narrowly edge it out in overall average performance.

Harness-1 achieves its performance gains by offloading the exhaustive "bookkeeping" of a search session out of the model's working memory and into a structured software environment. As enterprise use cases grow more sophisticated, demanding that models autonomously sift through thousands of corporate documents or financial filings, these systems frequently succumb to "search amnesia": forgetting their original queries, looping over rejected documents, or losing track of the specific claims they are trying to verify.

Until now, the prevailing solution to this amnesia has been brute force. Engineers typically force models to constantly reread an ever-expanding, append-only transcript of their own actions, piling every search, read, and thought back into a massive context window. Harness-1 moves away from this method, making the case that the bottleneck for autonomy isn't necessarily the size of the model, but how efficiently its working environment manages state.
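The difference between an append-only transcript and environment-managed state can be sketched in a few lines of Python. This is an illustration of the general idea only, not the Harness-1 implementation: the `SearchState` class, its fields and methods, the example question, and the stand-in documents are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SearchState:
    """Hypothetical session state kept by the environment, not by the model."""
    question: str
    queries: list = field(default_factory=list)
    rejected: set = field(default_factory=set)
    kept: dict = field(default_factory=dict)        # doc_id -> short note
    open_claims: set = field(default_factory=set)   # claims still unverified

    def record_query(self, query: str) -> bool:
        """Log a query; return False if it repeats an earlier one."""
        if query in self.queries:
            return False
        self.queries.append(query)
        return True

    def should_read(self, doc_id: str) -> bool:
        """Skip documents that were already kept or rejected."""
        return doc_id not in self.rejected and doc_id not in self.kept

    def reject(self, doc_id: str) -> None:
        """Remember the verdict without storing the document text."""
        self.rejected.add(doc_id)

    def keep(self, doc_id: str, note: str, verifies: Optional[str] = None) -> None:
        """Store a short note and close the claim it supports, if any."""
        self.kept[doc_id] = note
        if verifies is not None:
            self.open_claims.discard(verifies)

    def render(self) -> str:
        """Compact view given to the model each turn instead of a full transcript."""
        lines = [f"QUESTION: {self.question}"]
        lines.append("QUERIES: " + "; ".join(self.queries))
        lines += [f"KEPT {doc}: {note}" for doc, note in self.kept.items()]
        lines.append(f"REJECTED: {len(self.rejected)} documents")
        lines.append("OPEN CLAIMS: " + "; ".join(sorted(self.open_claims)))
        return "\n".join(lines)


# Made-up session: 50 dead-end documents, then one useful one.
state = SearchState(
    question="Which filing first disclosed the acquisition?",
    open_claims={"acquisition date", "acquirer name"},
)
transcript = []  # the append-only alternative, for contrast

for step in range(50):
    page = "irrelevant filing text " * 200   # stand-in for a long document
    transcript.append(page)                  # the transcript keeps everything
    state.record_query(f"query {step}")
    state.reject(f"doc-{step}")              # the state keeps only the verdict

state.keep("doc-50", "8-K names the acquirer", verifies="acquirer name")

print(len("".join(transcript)), "characters in the append-only transcript")
print(len(state.render()), "characters in the rendered state")
```

In this sketch the original question, the list of rejected documents, and the claims still open live in ordinary data structures, so the three failure modes described above (forgotten queries, re-read rejects, lost claims) become lookups rather than things the model must remember. The rendered view grows with the number of queries and kept notes, not with the full text of every page read.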
Harness-1 highlights once more, as Anthropic's Claude Code has also done, that the raw model is arguably less important than the harness, or set of conditions, through which it runs.

Technology: Doing the Paperwork in the Environment

To understand the technical leap of Harness-1, consider a real-world analogy. Imagine hiring a brilliant research assistant and placing them in an empty room without a desk, notepads, or filing cabinets. You ask them to write a comprehensive report on a highly complex topic, which requires them to read dozens of books while keeping every single quote, citation, and dead-end search perfectly memorized in their own head. Eventually, no matter how intelligent the assistant is, their cognitive load will max out, and they will start dropping facts or losing the thread of the assignment.

This is exactly how traditional search agents operate today. They are trained as policies over growing transcripts, meaning the model searches, reads, searches again, and appends everything into its own context window. As lead researcher Patrick (Pengcheng) Jiang of the University of Illinois noted on X: "At some point the model is not just 'searching' anymore. It is also being asked to be a memory system, a note taker, a verifier, and a librarian."

Harness-1 solves this by giving the AI a desk and a filing cabinet, what the research team calls a
