GEEK HAUS
2026/06/10/researchers-say-they-trained-a-foundation-model

Researchers say they trained a foundation model from scratch for about $1,500

VentureBeat

Editor's summary

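Sapient researchers say they trained a 1B-parameter model from scratch for about $1,500 using HRM-Text, which is built on a Hierarchical Recurrent Model that replaces the standard Transformer. HRM-Text trains only on instruction-response pairs rather than on next-token prediction over large-scale web text, and the researchers report performance competitive with larger open models on key benchmarks.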

Context

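If this approach proves reproducible, foundation model pretraining could move away from its dependence on massive GPU clusters and internet-scale data and extend to reasoning models tailored to individual enterprises. In particular, pairing the model with external knowledge stores reinforces a shift toward models that focus on reasoning ability and adaptation to business context rather than memorizing all information internally.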

Article

Training a foundation LLM from scratch costs millions and requires internet-scale data, which is why most enterprises don't bother. Sapient thinks it has a cheaper path.

To overcome this brute-force scaling dogma, researchers at Sapient developed HRM-Text, which replaces standard Transformers with a highly sample-efficient Hierarchical Recurrent Model (HRM), an architecture they first introduced last year.

HRM decouples computation into slow-evolving strategic and fast-evolving execution layers. Instead of brute-force autoregressive prediction on raw text, HRM-Text trains exclusively on instruction-response pairs. This is close to real-world enterprise settings, where users usually expect a targeted answer to a specific task.

The researchers were able to train a 1B-parameter HRM-Text from scratch at a fraction of the cost and tokens of normal LLMs. Their model achieved performance competitive with much larger open models on key industry benchmarks.

For real-world AI applications, this means foundational pretraining is no longer restricted to highly resourced institutions. With HRM-Text, organizations can affordably pretrain their own highly capable reasoning models from scratch and pair them with external knowledge stores.

The training bottleneck

When we train an LLM, we don't actually care if it has memorized the exact sequence of words in a random 2014 Reddit thread. What we want is for the model to develop a deep, underlying understanding of human language, logic, facts, and reasoning.

The current approach is brute force: scrape the internet, run next-token prediction trillions of times, and assume the model has developed a working internal model of the world.

Basically, this means that we waste millions of dollars of computing power forcing models to memorize everything collected from the internet, just so they can indirectly learn how to think.
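
The difference between scoring every token and scoring only the answer can be made concrete with a loss mask. The sketch below is an illustration, not Sapient's training code: the function names `response_only_mask` and `masked_nll` and the per-token probabilities are hypothetical, and masking prompt positions out of the loss is one common way to train on instruction-response pairs, assumed here rather than confirmed by the article.

```python
import math


def response_only_mask(prompt_len, response_len):
    """Return a 0/1 mask that selects only the response positions."""
    return [0] * prompt_len + [1] * response_len


def masked_nll(token_logprobs, loss_mask):
    """Mean negative log-likelihood over the positions where loss_mask is 1."""
    if len(token_logprobs) != len(loss_mask):
        raise ValueError("token_logprobs and loss_mask must have equal length")
    counted = sum(loss_mask)
    if counted == 0:
        raise ValueError("loss_mask selects no positions")
    total = sum(-lp for lp, m in zip(token_logprobs, loss_mask) if m)
    return total / counted


# Hypothetical example: a 4-token prompt followed by a 2-token response.
# The log-probabilities are made up for illustration only.
prompt_len, response_len = 4, 2
logprobs = [math.log(0.5)] * prompt_len + [math.log(0.25)] * response_len

full_mask = [1] * (prompt_len + response_len)          # loss on every token
resp_mask = response_only_mask(prompt_len, response_len)  # loss on the answer only

full_loss = masked_nll(logprobs, full_mask)
resp_loss = masked_nll(logprobs, resp_mask)
print(f"loss over all tokens:      {full_loss:.4f}")
print(f"loss over response tokens: {resp_loss:.4f}")
```

With the full mask, four of the six scored positions belong to the prompt, so most of the training signal goes to text the user will supply anyway. With the response-only mask, every scored position is part of the answer.
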
For example, standard decoder-only models spend valuable compute assigning loss to reconstruct the prompt itself, even though the user's prompt is already known and provided at inference time.

Instead of simply viewing this as a computational hurdle, the industry must recognize it as a severe business limitation. In comments provided to VentureBeat, Guan Wang, CEO of Sapient Intelligence, framed this as an issue of the "economics of iteration."

"Enterprises today face three compounding problems: training is expensive, infrastructure is heavy, and experimentation cycles are too slow," Wang said. "The industry’s scaling addiction says: 'When the model fails, make it bigger. Add more data. Add more GPUs.' That has worked, but it is reaching a point of diminishing returns. More scale often means more memorization, more latency, more infrastructure, and more vendor dependency. It does not necessarily give an enterprise a better reasoning engine."

This architectural and computational inefficiency is exactly why fine-tuning existing dense transformers isn't always the silver bullet for enterprises. Fine-tuning to preserve a model's general capabilities often requires mixing substantial general-purpose data into the process, making it computationally heavy and difficult to control.

"Imagine a hedge fund, insurer, or bank that has highly proprietary data: internal research notes, transaction logic, compliance rules, analyst memos, risk models, portfolio constraints," Wang said. "They may not want to send that data to an external frontier model, and they may not need a giant general-purpose model that memorized the internet. What they need is a compact reasoning core that can learn their task structure, reason across rules and numbers, and run in a controlled environment."

Because HRM-Text focuses its computation strictly on task completion and latent reasoning, it allows enterprises to start with a smaller, smarter model and adapt it to a proprietary domain with far less infrastructure.

Rethinking architectures with HRM-Text

HRM, which was introduced in 2025, represents a fundamental departure from traditional Transformer models. To build a more sample-efficient engine, HRM decouples computation into slow-evolving strategic and fast-evolving execution layers. The fast L-module performs local iterative refinement, while the slow H-module maintains stable semantic context across cycles. Processing consists of two high-level cycles, where each cycle executes three fast L-module updates followed by a single slow H-module update.

Standard parameter-shared recurrent architectures (like Samsung's TRM) can sometimes handle small logic puzzles, but the Sapient researchers found they become highly unstable when scaled to 1 billion parameters for language tasks. The separation between HRM's slow H-module and fast L-module is mathematically necessary, not just an aesthetic choice. As Wang said: "For logic grids, you can sometimes get away with a tiny recursive mechanism because the world is clean and bounded. Language is not like that. Language needs bo
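
The update schedule described above (two high-level cycles, each with three fast L-module updates followed by one slow H-module update) can be sketched as a nested loop. This is a toy illustration under stated assumptions, not the HRM-Text implementation: the real modules are neural networks, whereas `toy_l_update` and `toy_h_update` below are hypothetical scalar stand-ins, and the choice of which states each update reads is an assumption made for the example.

```python
def run_hrm_schedule(x, l_update, h_update, n_cycles=2, l_steps=3,
                     z_l=0.0, z_h=0.0):
    """Run the nested fast/slow schedule and return the final states and a trace.

    x        : input (a scalar here; an embedding in a real model)
    l_update : fast update, called l_steps times per cycle
    h_update : slow update, called once at the end of each cycle
    """
    trace = []
    for _cycle in range(n_cycles):
        for _step in range(l_steps):
            # Fast module refines its state while the slow state is held fixed.
            z_l = l_update(z_l, z_h, x)
            trace.append("L")
        # Slow module updates once, using the refined fast state.
        z_h = h_update(z_h, z_l)
        trace.append("H")
    return z_l, z_h, trace


# Hypothetical stand-ins for the learned modules (illustration only).
def toy_l_update(z_l, z_h, x):
    return 0.5 * z_l + 0.25 * z_h + x


def toy_h_update(z_h, z_l):
    return 0.5 * z_h + 0.5 * z_l


z_l, z_h, trace = run_hrm_schedule(1.0, toy_l_update, toy_h_update)
print("".join(trace))  # order of module updates across both cycles
```

The trace shows the point of the design: the slow state changes only twice over eight updates, so it acts as a stable context while the fast state does the step-by-step refinement.
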
