MiniMax teases upcoming M3 model with new sparse attention mechanism and 15.6X long-context response speed boost
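Editor's summary
MiniMax has released a technical report covering the development of its M2, M2.5, and M2.7 language models and previewed a new sparse attention approach for its upcoming M3 model. The company says M3 delivers up to 15.6 times faster decoding at a long context of 1 million tokens, which could lower the cost of deploying AI agents.
Context
The announcement shows that Chinese AI labs are moving beyond the race for open model performance to focus on inference efficiency for long-context processing and agent use. The M2 report can serve as a practical reference on MoE efficiency and agent-oriented design, and if M3 delivers real cost savings, it could also influence enterprise AI adoption strategies.
Article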
Among the many Chinese AI companies and laboratories vying for market share and attention (no pun intended) in the global marketplace, MiniMax stands out for its commitment to providing frontier-level intelligence across a range of modalities, including text, coding, and video (through its Hailuo model series), often under permissive, enterprise-friendly, standard open source licenses.
Now, MiniMax is again raising the eyebrows of AI power users and developers around the world by releasing a new, in-depth technical report on the making of its popular M2 series of language models (M2, M2.5, and M2.7), shedding light on its numerous engineering innovations and clever approaches. The company and its leaders also teased a new sparse attention approach for the upcoming MiniMax M3 series of models, which MiniMax says yields up to 15.6 times faster decoding (or LLM response) speed at long contexts (a million tokens) by adopting a custom sub-quadratic framework. In so doing, MiniMax has designed M3 to make ultra-long-context AI agent deployment economically viable.
The M2 report is noteworthy for any enterprise working with AI models, and especially for those looking to fine-tune and train their own models in-house. After all, MiniMax's M2 series models often topped benchmarks for open source AI performance when they were released. While that title has since been claimed by several other Chinese labs, including DeepSeek and Xiaomi, MiniMax's new report offers a blueprint that enterprises around the world can use to improve AI model and agent performance.
As Adina Yakup of Hugging Face observed on X, "Beyond the benchmarks, they’ve done some really solid work on MoE efficiency and agent oriented design. Excited to see where M3 goes next!"
The attention dilemma
The core architecture of the M2 series is a sparse Mixture-of-Experts (MoE) decoder-only Transformer, a layout used by numerous other state-of-the-art LLMs.
The backbone houses 229.9 billion total parameters, yet maintains a lean operational footprint by activating just 9.8 billion parameters per token (roughly 4% of the total) across 256 fine-grained experts. To optimize routing and avoid standard load-balancing issues, MiniMax implemented sigmoid gating paired with learnable, expert-specific bias terms, heavily reducing reliance on restrictive auxiliary losses.
The most consequential engineering decision documented in the M2 paper was the strict adherence to full multi-head attention with Grouped Query Attention (GQA) across all 62 layers. In large language models, "quadratic scaling" refers to the computational cost of standard full attention mechanisms, where every token in a sequence is compared with every other token. To use a real-world analogy, it is akin to attending a networking event and being forced to have a deep conversation with every single person in the room while simultaneously monitoring all other ongoing conversations. While this approach yields thorough context, the processing power and memory required grow with the square of the input length, creating a severe hardware bottleneck as models attempt to ingest hundreds of thousands of words.
The problem with sub-quadratic scaling
"Sub-quadratic" scaling introduces architectural shortcuts designed to avoid this quadratic computational load. Instead of mapping every possible connection, sub-quadratic methods, such as Sliding Window Attention or compressed linear attention, might only analyze a localized window of nearby words or generate a compressed summary of the broader text.
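To make the difference in scaling concrete, here is a minimal sketch that counts how many query-key comparisons a single attention layer performs under causal full attention versus a causal sliding window. It is an illustration only, not MiniMax's implementation: the function names and the 4,096-token window are assumptions chosen for the example, and real costs also depend on head count, hidden size, and kernel details.

```python
def full_attention_pairs(n: int) -> int:
    """Causal full attention: token i looks at tokens 0..i, giving n(n+1)/2 pairs."""
    return n * (n + 1) // 2


def sliding_window_pairs(n: int, window: int) -> int:
    """Causal sliding window: each token looks at no more than the `window`
    most recent tokens, itself included."""
    if n <= window:
        return full_attention_pairs(n)
    # The first `window` tokens see their whole prefix; every later token
    # sees exactly `window` tokens.
    return full_attention_pairs(window) + (n - window) * window


if __name__ == "__main__":
    window = 4096  # hypothetical window size, not taken from the M2 report
    for n in (8_192, 131_072, 1_048_576):
        full = full_attention_pairs(n)
        swa = sliding_window_pairs(n, window)
        print(f"{n:>9} tokens: full={full:.2e}  window={swa:.2e}  ratio={full / swa:.0f}x")
```

Doubling the sequence length roughly quadruples the full-attention count, while the windowed count only about doubles. That gap is why windowed methods are cheap, and it is also why they cannot directly compare tokens that sit further apart than the window.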
These efficient methods drastically reduce hardware costs and allow models to process massive documents at high speeds, but they have historically introduced severe trade-offs in accuracy, often causing the AI to miss the "big picture" or lose track of distant context.
This mathematical dilemma defines the architectural evolution from MiniMax's M2 to its upcoming M3 series. During M2's development, researchers rigorously tested sub-quadratic shortcuts but found they crippled the model's "multi-hop reasoning" (its ability to connect disparate clues across a long document), forcing the team to absorb the massive computational cost of full quadratic attention to maintain frontier-level intelligence. They benchmarked efficient attention alternatives during pre-training and deliberately discarded them, after experimenting extensively with hybrid setups that interleaved full attention with sub-quadratic architectures such as Lightning Attention or Sliding Window Attention (SWA) configurations.
The empirical results were definitive: at a larger scale, linear and windowed attention variants exhibited severe reasoning deficits. On evaluations exceeding 32K context windows, SWA variants performed significantly worse than full attention, dropping from a baseline score of 90.0 to 72.0 on the RULER 128K complex word extract