Researchers say they trained a foundation model from scratch for about $1,500
EDITOR BRIEF
Sapient researchers say their HRM-Text architecture trained a 1B-parameter foundation model from scratch using far less data and compute than standard Transformer-based LLMs. The model trains on instruction-response pairs rather than broad internet-scale next-token prediction, and reportedly performs competitively with larger open models on industry benchmarks.
CONTEXT
If validated, this suggests foundation model training could become practical for smaller enterprises instead of remaining limited to hyperscalers and top labs. The approach reflects a broader shift from brute-force scaling toward more efficient architectures and task-focused training pipelines.
ARTICLE
Training a foundation LLM from scratch costs millions and requires internet-scale data, which is why most enterprises don't bother. Sapient thinks it has a cheaper path.

To overcome this brute-force scaling dogma, researchers at Sapient developed HRM-Text, which replaces standard Transformers with a highly sample-efficient Hierarchical Reasoning Model (HRM), an architecture they first introduced last year.

HRM decouples computation into slow-evolving strategic and fast-evolving execution layers. Instead of brute-force autoregressive prediction on raw text, HRM-Text trains exclusively on instruction-response pairs. This is close to real-world enterprise settings, where users usually expect a targeted answer to a specific task.

The researchers were able to train a 1B-parameter HRM-Text from scratch at a fraction of the cost and training tokens of conventional LLMs. Their model achieved performance competitive with much larger open models on key industry benchmarks.

For real-world AI applications, this means foundational pretraining is no longer restricted to highly resourced institutions. With HRM-Text, organizations can affordably pretrain their own highly capable reasoning models from scratch and pair them with external knowledge stores.

The training bottleneck

When we train an LLM, we don't actually care if it has memorized the exact sequence of words in a random 2014 Reddit thread. What we want is for the model to develop a deep, underlying understanding of human language, logic, facts, and reasoning.

The current approach is brute force: scrape the internet, run next-token prediction trillions of times, and assume the model has developed a working internal model of the world.

Basically, this means we waste millions of dollars of compute forcing models to memorize everything collected from the internet, just so they can indirectly learn how to think.
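
The contrast between next-token prediction on raw text and training on instruction-response pairs can be made concrete with a loss mask. The sketch below is not Sapient's training code. It is a minimal illustration that assumes a plain per-token cross-entropy and assumes that pair-based training scores only the response tokens, which is one common way to do it. The function names `build_loss_mask` and `masked_mean_loss` and the probabilities are made up for illustration.

```python
import math

def build_loss_mask(prompt_len, response_len, response_only):
    """Return a 0/1 mask with one entry per token position.

    response_only=False: score every token (plain next-token prediction).
    response_only=True:  score only the response tokens.
    """
    prompt_flag = 0 if response_only else 1
    return [prompt_flag] * prompt_len + [1] * response_len

def masked_mean_loss(correct_token_probs, mask):
    """Mean cross-entropy over the positions where mask == 1."""
    scored = [-math.log(p) for p, m in zip(correct_token_probs, mask) if m == 1]
    return sum(scored) / len(scored)

# Made-up probabilities the model assigns to the correct token at each position:
# 4 prompt tokens followed by 3 response tokens.
probs = [0.1, 0.2, 0.1, 0.3, 0.6, 0.7, 0.8]

full_mask = build_loss_mask(4, 3, response_only=False)
response_mask = build_loss_mask(4, 3, response_only=True)

print("positions scored (full):    ", sum(full_mask))
print("positions scored (response):", sum(response_mask))
print("mean loss (full):     %.3f" % masked_mean_loss(probs, full_mask))
print("mean loss (response): %.3f" % masked_mean_loss(probs, response_mask))
```

With the full mask, all seven positions contribute to the training signal. With the response-only mask, only the three response positions do, so no gradient is spent on reproducing text the user supplies anyway.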
For example, standard decoder-only models spend valuable compute assigning loss to reconstruct the prompt itself, even though the user's prompt is already known and provided at inference time.

Instead of simply viewing this as a computational hurdle, the industry must recognize it as a severe business limitation. In comments provided to VentureBeat, Guan Wang, CEO of Sapient Intelligence, framed this as an issue of the "economics of iteration."

"Enterprises today face three compounding problems: training is expensive, infrastructure is heavy, and experimentation cycles are too slow," Wang said. "The industry's scaling addiction says: 'When the model fails, make it bigger. Add more data. Add more GPUs.' That has worked, but it is reaching a point of diminishing returns. More scale often means more memorization, more latency, more infrastructure, and more vendor dependency. It does not necessarily give an enterprise a better reasoning engine."

This architectural and computational inefficiency is exactly why fine-tuning existing dense Transformers isn't always the silver bullet for enterprises. Fine-tuning a model while preserving its general capabilities often requires mixing substantial general-purpose data into the process, making it computationally heavy and difficult to control.

"Imagine a hedge fund, insurer, or bank that has highly proprietary data: internal research notes, transaction logic, compliance rules, analyst memos, risk models, portfolio constraints," Wang said. "They may not want to send that data to an external frontier model, and they may not need a giant general-purpose model that memorized the internet.
What they need is a compact reasoning core that can learn their task structure, reason across rules and numbers, and run in a controlled environment."

Because HRM-Text focuses its computation strictly on task completion and latent reasoning, it allows enterprises to start with a smaller, smarter model and adapt it to a proprietary domain with far less infrastructure.

Rethinking architectures with HRM-Text

HRM, which was introduced in 2025, represents a fundamental departure from traditional Transformer models. To build a more sample-efficient engine, HRM decouples computation into slow-evolving strategic and fast-evolving execution layers. The fast L-module performs local iterative refinement, while the slow H-module maintains stable semantic context across cycles. Processing consists of two high-level cycles, where each cycle executes three fast L-module updates followed by a single slow H-module update.

Standard parameter-shared recurrent architectures (like Samsung's TRM) can sometimes handle small logic puzzles, but the Sapient researchers found they become highly unstable when scaled to 1 billion parameters for language tasks. The separation between HRM's slow H-module and fast L-module is mathematically necessary, not just an aesthetic choice. As Wang said: "For logic grids, you can sometimes get away with a tiny recursive mechanism because the world is clean and bounded. Language is not like that. Language needs bo


