Sequoia Capital
Standard Intelligence bets raw computer-use video, not language models, is the path to general AI agents
Summary
Standard Intelligence is training agents directly in pixel space, using video of computer activity to predict mouse movements, clicks, and keystrokes. The company argues that large-scale video pre-training can unlock more general computer agents than language-model wrappers or hand-built workflows.
Insight
The approach reflects a broader shift toward scaling raw action data rather than relying solely on text and tool orchestration. If successful, Standard Intelligence could reshape agent development by making computer-use datasets and efficient video encoders central infrastructure for AI systems.