MiniMax-M3 debuts, eclipsing GPT-5.5 and Gemini 3.1 Pro on key benchmark performance for just 5-10% of the cost
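Editor's Summary
Chinese AI startup MiniMax has unveiled its M3 LLM, highlighting a 1-million-token context window, native multimodality, and performance on coding and agentic tasks. MiniMax says it will release the model under an open weights license within the next 10 days, and has set limited-time API pricing at $0.30 per million input tokens and $1.20 per million output tokens.
Context
If M3's claimed performance and pricing are verified, the model could significantly lower the cost baseline for high-performance LLMs and widen enterprises' options for self-hosted deployment and tuning. In particular, open-weights models absorbing long-context and complex development workflows is a trend likely to put pressure on a frontier AI race built around closed APIs.
Body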
Big news in enterprise AI broke over the weekend: Chinese AI startup MiniMax released its highly anticipated M3 large language model on Sunday evening Eastern time. The model pairs frontier-tier coding and agentic performance with a 1-million-token context window and native multimodality at a fraction of the cost of leading proprietary models, with pricing starting at $20 per month under the company's new subscription token plans.

The company's leadership also announced plans to release the model under an open source license that includes "open weights" sometime in the next 10 days, which would let enterprises download and customize it free of charge.

For now, M3 is available through the MiniMax API at a discounted price of $0.30 per million input tokens and $1.20 per million output tokens (on fresh cache) for the next week. That beats proprietary U.S. giants like Google, OpenAI and Anthropic handily on cost, while also eclipsing the latest models from the former two on selected benchmarks. Even at its full price of $0.60/$2.40 per million input/output tokens, MiniMax-M3 costs just 8-20% as much as the leading proprietary U.S. models.

Large language model development has long forced a rigid choice on software developers: access top-tier closed-source intelligence behind restrictive APIs, or deploy nimble, cost-effective open models that falter on multi-step reasoning, dense coding tasks, and massive data sequences. MiniMax-M3 upends this trade-off. By combining frontier-level capability with the economics of an open-weights model, M3 offers a breadth of utility previously restricted to expensive, closed-source ecosystems, raising the baseline for open-weights systems while sharply reducing the compute required to run complex development loops.
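To make the cost comparison concrete, here is a minimal sketch of the arithmetic behind the percentage figures. The prices are the per-million-token list prices quoted in this article and its pricing snapshot; the function names `job_cost` and `cost_ratio` are illustrative, and the workload of 1 million input plus 1 million output tokens is an assumption chosen to match the snapshot's "Total Cost" column. Real workloads with a different input/output mix will produce different ratios.

```python
# Per-million-token list prices in USD as quoted in this article: (input, output).
PRICES = {
    "MiniMax-M3 (discount)": (0.30, 1.20),
    "MiniMax-M3 (full)": (0.60, 2.40),
    "Gemini 3.1 Pro Preview <=200K": (2.00, 12.00),
    "Claude Opus 4.8": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one job at the model's list price."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def cost_ratio(model: str, baseline: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost of `model` as a fraction of the cost of `baseline`."""
    return job_cost(model, input_tokens, output_tokens) / job_cost(
        baseline, input_tokens, output_tokens
    )


if __name__ == "__main__":
    # Assumed workload: 1M input tokens and 1M output tokens.
    tokens_in, tokens_out = 1_000_000, 1_000_000
    for baseline in ("Gemini 3.1 Pro Preview <=200K", "Claude Opus 4.8", "GPT-5.5"):
        ratio = cost_ratio("MiniMax-M3 (full)", baseline, tokens_in, tokens_out)
        print(f"MiniMax-M3 at full price vs {baseline}: {ratio:.1%}")
```

Under that assumed workload, the full-price ratios fall roughly within the 8-20% range cited above; an input-heavy workload would shift them, because the input and output price gaps differ by vendor.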
VentureBeat Frontier AI Model API Pricing Snapshot

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Total Cost | Source |
|---|---|---|---|---|
| MiMo-V2.5 Flash | $0.10 | $0.30 | $0.40 | Xiaomi MiMo |
| deepseek-v4-flash | $0.14 | $0.28 | $0.42 | DeepSeek |
| deepseek-v4-pro | $0.435 | $0.87 | $1.305 | DeepSeek |
| MiniMax-M3 | $0.30 | $1.20 | $1.50 (limited time only) | MiniMax |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $1.75 | Google |
| MiMo-V2.5 | $0.40 | $2.00 | $2.40 | Xiaomi MiMo |
| Grok 4.3 low context | $1.25 | $2.50 | $3.75 | xAI |
| GLM-5 | $1.00 | $3.20 | $4.20 | Z.ai |
| Kimi-K2.6 | $0.95 | $4.00 | $4.95 | Moonshot/Kimi |
| GLM-5.1 | $1.40 | $4.40 | $5.80 | Z.ai |
| Grok 4.3 high context | $2.50 | $5.00 | $7.50 | xAI |
| Qwen3.7-Max | $2.50 | $7.50 | $10.00 | Alibaba Cloud |
| Gemini 3.5 Flash | $1.50 | $9.00 | $10.50 | Google |
| Gemini 3.1 Pro Preview ≤200K | $2.00 | $12.00 | $14.00 | Google |
| GPT-5.4 | $2.50 | $15.00 | $17.50 | OpenAI |
| Gemini 3.1 Pro Preview >200K | $4.00 | $18.00 | $22.00 | Google |
| Claude Opus 4.8 | $5.00 | $25.00 | $30.00 | Anthropic |
| GPT-5.5 | $5.00 | $30.00 | $35.00 | OpenAI |

New MiniMax Sparse Attention (MSA) technique helps keep the model's cost low

At the core of the model's efficiency lies an architectural departure from classic Transformer networks. Standard attention mechanisms scale quadratically with sequence length ($O(N^2)$), meaning computational and financial costs explode as text inputs lengthen. To combat this "inherent flaw," the engineering team implemented MiniMax Sparse Attention (MSA), a clean, extensible sparse attention design.

To visualize the difference, think of traditional full attention as an editor rereading an entire library from scratch every time they need to verify a single sentence. MSA acts as an intelligent indexing clerk, using a pre-filtering phase to partition the Key-Value (KV) matrices into blocks and pick out only the relevant ones.

At the operator level, MSA uses a "KV outer gather Q" approach. The system treats KV blocks as the outer loop and dynamically gathers only the specific queries that hit each block. Because each data block is read exactly once and memory access remains strictly contiguous, hardware utilization rises sharply. In MiniMax's internal trials, MSA runs more than 4x faster than alternative open-source solutions like Flash-Sparse-Attention or flash-moba.
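The "KV outer gather Q" loop order described above can be illustrated with a small pure-Python sketch. This is not MiniMax's kernel, and the article does not describe how MSA's pre-filter scores blocks. The mean-key scoring in `select_blocks`, the `top_k` parameter, the absence of causal masking, and all function names are assumptions made for illustration. The sketch only shows that iterating over KV blocks in the outer loop, gathering the queries that selected each block, and accumulating with a running softmax gives the same result as a conventional query-outer loop.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def split_blocks(rows, block_size):
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def select_blocks(q, k_blocks, top_k):
    """Pre-filter (assumed heuristic): rank blocks by q . mean(keys), keep top_k."""
    scored = []
    for b, kb in enumerate(k_blocks):
        mean_key = [sum(col) / len(kb) for col in zip(*kb)]
        scored.append((dot(q, mean_key), b))
    scored.sort(reverse=True)
    return {b for _, b in scored[:top_k]}


def sparse_attention_kv_outer(queries, keys, values, block_size, top_k):
    """Block-sparse attention with KV blocks as the outer loop (requires top_k >= 1)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    k_blocks = split_blocks(keys, block_size)
    v_blocks = split_blocks(values, block_size)
    chosen = [select_blocks(q, k_blocks, top_k) for q in queries]

    # Online-softmax state per query: running max, normalizer, weighted value sum.
    run_max = [-math.inf] * len(queries)
    run_sum = [0.0] * len(queries)
    run_out = [[0.0] * len(values[0]) for _ in queries]

    for b, (kb, vb) in enumerate(zip(k_blocks, v_blocks)):  # each KV block is visited once
        hits = [i for i, blocks in enumerate(chosen) if b in blocks]  # gather matching queries
        for i in hits:
            for k, v in zip(kb, vb):
                logit = dot(queries[i], k) * scale
                new_max = max(run_max[i], logit)
                decay = math.exp(run_max[i] - new_max)  # 0.0 on a query's first visit
                weight = math.exp(logit - new_max)
                run_sum[i] = run_sum[i] * decay + weight
                run_out[i] = [o * decay + weight * x for o, x in zip(run_out[i], v)]
                run_max[i] = new_max
    return [[o / s for o in out] for out, s in zip(run_out, run_sum)]


def sparse_attention_query_outer(queries, keys, values, block_size, top_k):
    """Reference: same block selection, conventional query-outer loop."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    k_blocks = split_blocks(keys, block_size)
    v_blocks = split_blocks(values, block_size)
    outputs = []
    for q in queries:
        blocks = sorted(select_blocks(q, k_blocks, top_k))
        ks = [k for b in blocks for k in k_blocks[b]]
        vs = [v for b in blocks for v in v_blocks[b]]
        logits = [dot(q, k) * scale for k in ks]
        peak = max(logits)
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, vs)) / total for d in range(len(vs[0]))]
        )
    return outputs


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    values = [[1.0], [2.0], [3.0], [4.0]]
    print(sparse_attention_kv_outer(queries, keys, values, block_size=2, top_k=1))
```

In a real GPU kernel the benefit would come from reading each KV block from memory once and reusing it across all gathered queries; this sketch reproduces only the loop structure and the numerics, not the memory behavior or the speedups reported above.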
At the maximum context length of 1 million tokens, M3's per-token compute demand drops to just 1/20th that of the previous-generation model, translating into a 9x speedup in the prefill stage and a 15x speedup during decoding.

Rather than taking a pretrained text network and fusing it with a separate vision model, MiniMax engineered M3 as a natively multimodal system from "Step Zero." The company overhauled its data ingestion pipeline to blend naturally interleaved sequences of text, images, and visual components, scaling the total pretraining corpus beyond 100 trillion tokens. This deep data alignment enables the model to translate complex visual geometries, such as programming charts or coordinate maps, into structured code without losing contextual fidelity.

On standardized benchmarks, M3 backs up this engineering approach. The model scores 59.0% on SWE-Bench Pro, an autonomous agent benchmark, placing it ahead of closed models like GPT-5.5 and Gemini 3.1 Pro. It achieves 66.0% on Terminal Bench 2.1, 74.2% on MCP Atlas, and 83.5 on BrowseComp, outstripping Claude Opus 4.7’


