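Standard Intelligence bets on pixel-based video pre-training, rather than text and tool calls, to build general computer agents
Editor's summary
Standard Intelligence sees pixel-based video pre-training, which learns from the raw pixels of computer-use video rather than from language tokens or screenshot-based tool calls, as the core path to building general agents. The company points to an 11-million-hour computer action dataset and a video encoder that is roughly 50× more token-efficient than competing approaches, and it is training models that predict the next mouse movement, click, and keystroke.
Context
Unlike the current mainstream of layering complex agent frameworks on top of LLMs, this approach is a bitter-lesson-style strategy that, in the manner of Tesla FSD, accumulates large-scale behavioral data to draw out generality. If it succeeds, the center of gravity in the software automation race could shift from prompt engineering and tool orchestration to video data, encoder efficiency, and large-scale inference infrastructure.
Body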
Could pixels hold the keys to training useful agents?
The race to scale language models — and the agent ecosystem around them — is white-hot. Coding agents, which reason through problems and write code to solve them, have already taken us very far.
But one ambitious young team is making a different bet: that the most promising path to general computer agents may not run through language, screenshots, and tool calls, but through scaling raw video.
Standard Intelligence’s thesis is that the best way to build a general agent is through full video pre-training on computer use, because it is the only approach that can truly scale action data. Instead of predicting text tokens, the model learns to use a computer from raw screen data, predicting the next mouse movement, click, and keystroke from the pixels in front of it.
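To make that training target concrete, here is a minimal Python sketch of how a screen recording might be turned into next-action prediction examples. Everything in it is a hypothetical illustration: the Action fields, the make_examples helper, and the context length are assumptions, since the article does not describe FDM-1's actual data format or architecture.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Action:
    """One input event. The field names are illustrative assumptions."""
    kind: str       # "move", "click", or "key"
    x: int = 0      # cursor position for "move" and "click"
    y: int = 0
    key: str = ""   # key name for "key" events


def make_examples(
    frames: Sequence[bytes],
    actions: Sequence[Action],
    context: int = 4,
) -> List[Tuple[Sequence[bytes], Action]]:
    """Pair each window of `context` frames with the action that follows it.

    Assumes actions[i] is what the user did after seeing frames[i].
    """
    if len(frames) != len(actions):
        raise ValueError("expected one action per frame")
    examples = []
    for end in range(context, len(frames) + 1):
        window = frames[end - context:end]   # the pixels the model sees
        target = actions[end - 1]            # the action it should predict
        examples.append((window, target))
    return examples


# Tiny stand-in recording: six one-byte "frames" instead of real pixel buffers.
frames = [bytes([i]) for i in range(6)]
actions = [Action("move", x=i, y=i) for i in range(5)] + [Action("click", x=5, y=5)]
examples = make_examples(frames, actions, context=4)
print(len(examples), examples[-1][1].kind)
```

The point of the sketch is only the shape of the supervision: the input is pixels, the label is the next input event, and no text tokens or tool schemas appear anywhere in the loop.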
It is the Tesla FSD approach applied to knowledge work on computer screens.
That makes the bet both deeply contrarian and deeply “bitter lesson”-pilled. Rather than hand-engineering workflows or wrapping language models in increasingly elaborate harnesses, Standard Intelligence is betting on a new pre-training paradigm: feed the model the raw stream of computer use, scale it aggressively, and let the generality emerge from the data.
“We’re not video people”
Video is unwieldy. It is computationally expensive, economically expensive, and technically unforgiving. Prior attempts to scale video toward AGI have often died on the vine.
The Standard Intelligence team is emphatically “not video people.” They did not arrive with a decade of inherited assumptions about how to work with video as a medium. Instead, they have had to reason through each challenge from first principles, and have met those challenges with unusual optimism, creativity, and scrappiness.
The results are striking. An 11-million-hour computer action dataset — the largest in the industry. A video encoder that is roughly 50× more token-efficient than competing approaches, enabling nearly two hours of 30 FPS video to fit inside a 1-million-token context window. A 30-petabyte storage cluster racked in San Francisco for under $500K, roughly 20× cheaper than hyperscaler alternatives.
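A quick back-of-the-envelope check shows what those figures imply. The Python sketch below uses only the numbers quoted above and makes a few simplifying assumptions: the full 1-million-token window is spent on video, "nearly two hours" is rounded up to exactly two, "under $500K" is taken as $500K, and 1 PB is counted as 1,000 TB. The article does not say on what basis the hyperscaler comparison was made, so the last figure is only the price implied by the 20× ratio.

```python
# Figures quoted in the article.
CONTEXT_TOKENS = 1_000_000    # context window size
VIDEO_HOURS = 2               # "nearly two hours", rounded up
FPS = 30
EFFICIENCY_GAIN = 50          # "roughly 50x more token-efficient"
CLUSTER_COST_USD = 500_000    # "under $500K", taken as an upper bound
CLUSTER_PETABYTES = 30
HYPERSCALER_MULTIPLE = 20     # "roughly 20x cheaper"

# Token budget per frame if two hours of video fill the whole window.
frames = VIDEO_HOURS * 3600 * FPS
tokens_per_frame = CONTEXT_TOKENS / frames

# Tokens per frame a competing encoder would need at the quoted 50x gap.
baseline_tokens_per_frame = tokens_per_frame * EFFICIENCY_GAIN

# Storage cost per terabyte, assuming 1 PB = 1000 TB.
cost_per_tb = CLUSTER_COST_USD / (CLUSTER_PETABYTES * 1000)
implied_hyperscaler_cost_per_tb = cost_per_tb * HYPERSCALER_MULTIPLE

print(f"frames in two hours:        {frames:,}")
print(f"tokens per frame:           {tokens_per_frame:.2f}")
print(f"implied baseline per frame: {baseline_tokens_per_frame:.0f}")
print(f"cluster cost per TB:        ${cost_per_tb:.2f}")
print(f"implied hyperscaler per TB: ${implied_hyperscaler_cost_per_tb:.2f}")
```

Under these assumptions the encoder has to represent each frame in roughly four to five tokens, against a couple of hundred for an encoder 50× less efficient, and the cluster works out to under $17 per terabyte.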
FDM-1, their first foundation model trained directly on computer-use video at scale, offers an early glimpse of what this paradigm could become. It is a general model that can extrude a CAD gear in Blender, drive a car around a San Francisco block after an hour of fine-tuning, and find bugs in software by exploring its state space the way a curious human might.
Conscientious young founders
Founders Galen Mead and Devansh Pandey met as teenagers in 2022 during the Atlas Fellowship, a selective fellowship for high-school students interested in AI alignment and AGI.
Galen and Devansh are unusually serious about reaching AGI, and unusually conscientious about doing so safely. Both founders are wise beyond their years (21 and 20, respectively), and both left their undergraduate programs out of a sense of urgency to work on this problem.
Galen and Devansh stand out for their combination of taste, scrappiness, technical courage, and ambition. It shows up in the product thinking, in the research direction, and in the FDM-1 report itself.
The full team of six is small but mighty. Neel, Yudhister, Ulisse, and Ryan are each quirky and exceptional. They have chosen to turn down the conventional path (fancy degrees and offers from big token) and pursue this courageous mission together.
A new pre-training regime
Video has long been a powerful training ground for AI. DQN showed that agents could learn rich behavior directly from pixels in Atari environments. Tesla scaled video models to make self-driving cars and robots navigate the physical world.
But in the race toward general knowledge agents, video-first pre-training remains an unconventional idea.
Standard Intelligence is betting that it will not stay unconventional for long.
We are thrilled to lead Standard Intelligence’s Series A alongside Miko and Yasmin from Spark Capital.