Sequoia Capital · Sequoia Capital Perspectives · 2026.04.30

Standard Intelligence: Training General Intelligence in Pixel Space


By Sonya Huang


Could pixels hold the keys to training useful agents?

The race to scale language models — and the agent ecosystem around them — is white-hot. Coding agents, which reason through problems and write code to solve them, have already taken us very far.

But one ambitious young team is making a different bet: that the most promising path to general computer agents may not run through language, screenshots, and tool calls, but through scaling raw video.

Standard Intelligence’s thesis is that the best way to build a general agent is through full video pre-training on computer use, because it is the only approach that can truly scale action data. Instead of predicting text tokens, the model learns to use a computer from raw screen data, predicting the next mouse movement, click, and keystroke from the pixels in front of it. 

It is the Tesla FSD approach applied to knowledge work on computer screens.
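The paragraph above describes next-action prediction over a raw computer-use stream: the model sees pixels and learns to predict the mouse movement, click, or keystroke that follows. As a rough illustration only (the record types, field names, and windowing below are our own invention, not Standard Intelligence's actual pipeline), assembling training pairs for such an objective might look like this:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical record types -- a stand-in schema, not the real one.
@dataclass
class Frame:
    t: float            # capture time in seconds
    pixels: bytes       # raw screen buffer (placeholder)

@dataclass
class Action:
    t: float
    kind: str           # "move" | "click" | "key"
    x: Optional[int] = None
    y: Optional[int] = None
    key: Optional[str] = None

def make_training_pairs(frames: List[Frame],
                        actions: List[Action],
                        context: int = 4) -> List[Tuple[List[Frame], Action]]:
    """Pair each recorded action with the window of frames that
    preceded it: the model conditions on pixels, predicts the action."""
    pairs = []
    for a in actions:
        window = [f for f in frames if f.t <= a.t][-context:]
        if window:
            pairs.append((window, a))
    return pairs
```

The appeal of this framing is that every recorded computer session yields supervision for free, which is what makes action data scalable in a way hand-labeled workflows are not.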

That makes the bet both deeply contrarian and deeply “bitter lesson”-pilled. Rather than hand-engineering workflows or wrapping language models in increasingly elaborate harnesses, Standard Intelligence is betting on a new pre-training paradigm: feed the model the raw stream of computer use, scale it aggressively, and let the generality emerge from the data.

“We’re not video people”

Video is unwieldy. It is computationally expensive, economically expensive, and technically unforgiving. Prior attempts to scale video toward AGI have often died on the vine.

The Standard Intelligence team is emphatically “not video people.” They did not arrive with a decade of inherited assumptions about how to work with video as a medium. Instead, they have had to reason through each challenge from first principles, and have met those challenges with unusual optimism, creativity, and scrappiness.

The results are striking. An 11-million-hour computer action dataset — the largest in the industry. A video encoder that is roughly 50× more token-efficient than competing approaches, enabling nearly two hours of 30 FPS video to fit inside a 1-million-token context window. A 30-petabyte storage cluster racked in San Francisco for under $500K, roughly 20× cheaper than hyperscaler alternatives.
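Those context-window figures can be sanity-checked with back-of-the-envelope arithmetic (the per-frame token counts below are derived from the post's numbers, not stated in it):

```python
# Figures from the post: ~2 hours of 30 FPS video in a 1M-token context.
fps = 30
seconds = 2 * 3600
context_tokens = 1_000_000

frames = fps * seconds                      # 216,000 frames in two hours
tokens_per_frame = context_tokens / frames  # ~4.6 tokens per frame

# A tokenizer spending ~50x more would need roughly 230 tokens per
# frame, consistent with the claimed 50x efficiency gap (our
# inference, not a number from the post).
print(f"{frames} frames -> {tokens_per_frame:.1f} tokens/frame")
```

At roughly 4–5 tokens per frame, the encoder is compressing each screen image to less than the token cost of a short sentence.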

FDM-1, their first foundation model trained directly on computer-use video at scale, offers an early glimpse of what this paradigm could become. It is a general model that can extrude a CAD gear in Blender, drive a car around a San Francisco block after an hour of fine-tuning, and find bugs in software by exploring its state space the way a curious human might.

Conscientious young founders

Founders Galen Mead and Devansh Pandey met as teenagers during the Atlas Fellowship in 2022, a selective fellowship for high-school students interested in AI alignment and AGI. 

Galen and Devansh are unusually serious about reaching AGI, and unusually conscientious about doing so safely. Both founders are wise beyond their years (21 and 20 respectively), and both left their undergraduate programs out of a sense of urgency to work on this problem.

Galen and Devansh stand out for their combination of taste, scrappiness, technical courage, and ambition. It shows up in the product thinking, in the research direction, and in the FDM-1 report itself.

The full team of six is small but mighty. Neel, Yudhister, Ulisse, and Ryan are each quirky and exceptional. They have chosen to turn down the conventional path (fancy degrees and offers from big tech) and pursue this courageous mission together.

A new pre-training regime

Video has long been a powerful training ground for AI. DQN showed that agents could learn rich behavior directly from pixels in Atari environments. Tesla scaled video models to make self-driving cars and robots navigate the physical world.

But in the race toward general knowledge agents, video-first pre-training remains an unconventional idea.

Standard Intelligence is betting that it will not stay unconventional for long.

We are thrilled to lead Standard Intelligence’s Series A alongside Miko and Yasmin from Spark Capital.

