GEN-0 / Embodied Foundation Models That Scale with Physical Interaction

原始链接: https://generalistai.com/blog/nov-04-2025-GEN-0

## GEN-0: Scaling Robot Capabilities with Embodied Foundation Models

Generalist AI has released GEN-0, a new class of embodied foundation model designed to scale robot intelligence by training directly on high-quality, real-world physical interaction data. Unlike prior approaches that lean heavily on vision-language pretraining, GEN-0 aims to establish predictable scaling laws, in which more compute and data consistently improve performance, mirroring progress in domains such as large language models. Central to GEN-0 is "Harmonic Reasoning," a training approach that enables seamless, simultaneous thinking and acting, which is essential for real-time physical tasks. Experiments reveal a "phase transition" at 7B parameters: smaller models struggle with data overload ("ossification"), while larger models continue to improve. Scaling to 10B+ parameters shows fast adaptation to new tasks with minimal post-training. GEN-0 is trained on a massive and rapidly growing dataset of over 270,000 hours of real-world manipulation data and exhibits strong scaling laws relating data volume to downstream performance. The architecture is cross-embodiment by design, running on robots with different degrees of freedom. The researchers find that data quality and diversity are critical and are actively exploring optimal data mixtures to shape model characteristics. This work represents a significant step toward embodied AI whose capabilities scale predictably with real-world experience.

## GeneralistAI and Embodied Foundation Models - Hacker News Summary

A recent Hacker News discussion highlighted GeneralistAI's approach to building embodied AI (robots controlled by large models). Rather than relying on expensive and potentially flawed human demonstration videos, they scale data collection with affordable UMI grippers and exoskeletons. Users estimated that generating 270,000 hours of data would cost $4-8 million, which may be more efficient than teleoperation (5-6x slower and 2-3x more expensive). One key concern was the potential mismatch between human and robot motion, though others argued that the AI can generalize skills learned from non-robot data to robot applications. Many commenters consider GeneralistAI the current leader in robotics + AI, ahead even of Google DeepMind's progress. The discussion also touched on how the model receives instructions (likely via multimodal text and vision processing) and marveled at the robots' dexterity.

## Original Article
For years, foundation models in robotics have primarily used vision-language pretraining as the stepping stone towards scaling robotics, allowing us to transfer the benefits of semantic generalization from existing large multimodal models. But what's been missing is how to effectively scale large multimodal model training in the domain of robotics itself—to establish scaling laws that corroborate the consistent (and predictable) improvement of robot intelligence with more compute & data, as has underpinned progress in other domains e.g. LLMs. This requires an architecture, training procedure, and data engine that pushes new sensorimotor capabilities, provides behavioral generalization, and grows with the vast and ever-expanding experience generated by interacting with the real physical world.

To this end, we’re introducing GEN-0, a new class of embodied foundation models built for multimodal training directly on high-fidelity raw physical interaction. Its architecture builds on the strengths of vision and language models while also going beyond them—natively designed to capture human-level reflexes and physical commonsense. One core feature is Harmonic Reasoning, in which the models are trained to simultaneously think and act seamlessly. We’ve shared a glimpse of the capabilities of early precursors in our prior videos, and today we are sharing that not only does GEN-0 have breakthrough fundamental capabilities, but these capabilities are scaling:

  • Surpassing the Intelligence Threshold – in an unprecedented high-data regime for robotics, we observe a phase transition at 7B where smaller models exhibit ossification, while larger ones continue to improve. We’ve since scaled GEN-0 to 10B+ model sizes, and observe fast adaptation to new tasks with increasingly less post-training.
  • Scaling Laws – GEN-0 models exhibit strong scaling laws, in which more pretraining data and compute consistently (and predictably) improve downstream post-training performance of the model across many tasks.
  • Harmonic Reasoning – Although for language chatbots it is straightforward to spend more time thinking before responding, the same is not as simple for physical systems acting in the real world – physics doesn't stop. To address this problem, Harmonic Reasoning involves a fundamentally new approach to training models, and creates a "harmonic" interplay between asynchronous, continuous-time streams of sensing and acting tokens (a toy sketch of this interleaving appears after this list). This allows us to scale to very large model sizes without depending on System1-System2 architectures or inference-time guidance.
  • Cross-Embodiment – The GEN-0 architecture works on different robots by design. We have tested our models on 6DoF, 7DoF, and 16+DoF semi-humanoid robots.
  • No Longer Limited By Data – GEN-0 is pretrained on our in-house robotics dataset, which includes over 270,000 hours of real-world diverse manipulation data, growing at a rate of 10,000 hours a week and accelerating.
  • The Science of Pretraining – different mixtures of pretraining data (from various sources e.g. data foundries) yield GEN-0 models with different characteristics. We share some early notes from our empirical observations in this high-data regime, and how that traces back to specific data collection operations.
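
To make the contrast with a blocking sense-think-act loop concrete, below is a toy Python sketch of interleaved sensing and acting streams in the spirit of Harmonic Reasoning. The sensor, policy, and robot interfaces are hypothetical stand-ins, and this is not the GEN-0 implementation.

```python
# Toy illustration only: asynchronous sensing and acting streams that do not
# block each other, unlike a sequential sense-think-act loop. The sensor,
# policy, and robot objects below are hypothetical stand-ins.
import asyncio

async def sensing_stream(sensor, latest):
    """Continuously refresh the freshest observation; never waits on the actor."""
    while True:
        latest["obs"] = await sensor.read()
        await asyncio.sleep(1 / 60)          # e.g. ~60 Hz observation stream

async def acting_stream(policy, robot, latest):
    """Continuously decode and execute action chunks from the freshest observation."""
    while True:
        if "obs" in latest:
            chunk = await policy.decode(latest["obs"])   # may take many decode steps
            await robot.execute(chunk)                   # physics keeps moving meanwhile
        else:
            await asyncio.sleep(0.01)

async def run(sensor, policy, robot):
    latest = {}
    await asyncio.gather(sensing_stream(sensor, latest),
                         acting_stream(policy, robot, latest))
```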
We believe that GEN-0 marks the beginning of a new era: embodied foundation models whose capabilities predictably scale with physical interaction data – drawn not just from text, images, or simulation, but from the real world. Here are videos of GEN-0 in action on new tasks:

Build a camera kit (top view). This is a long horizon dexterous task that involves placing a cleaning cloth into a box, folding in a cardboard tray, picking up a camera and unsheathing it from a plastic bag, placing it into the box, closing the box (and inserting the tiny flap), then discarding the plastic bag. The model does not maintain any explicit notion of a subtask, and performs this all within a single stream of harmonic reasoning.

Our scaling experiments show that GEN-0 models must be large enough to absorb vast amounts of physical interaction data. We observe that smaller models exhibit a phenomenon similar to ossification under data overload, while larger ones continue to improve—demonstrating a surprising “phase transition” in the intelligence capacity of our models:
  • 1B models struggle to absorb complex and diverse sensorimotor data during pretraining – model weights become unable to absorb new information over time.
  • 6B models begin to benefit from pretraining and show strong multi-task capabilities.
  • 7B+ models are able to internalize large-scale robotic pretraining data that transfers to downstream tasks with only a few thousand steps of post-training.

Figure 1. Scaling GEN-0 model size (different colors) improves performance in terms of next-action validation prediction error (y-axis, lower is better) on a completely-withheld (i.e. zero-shot) long-horizon downstream task. 1B parameter models exhibit clear and early ossification, while 6B and 7B models absorb pretraining progressively better. The x-axis is pretraining compute, normalized so that GEN-0 7B is 1.0.


To our knowledge, this is the first time that model ossification has been observed in robotics. This might have eluded past research due to (a) the lack of a high-data regime in robotics until now, and (b) the absence of sufficiently large models trained in this regime. Ossification has previously been observed in the LLM literature in the high-data regime but with much smaller models, on the order of O(10M) parameters rather than O(1B). The observation that this phase transition occurs in robotics but at much larger model sizes echoes Moravec’s Paradox: what humans find effortless—perception and dexterity—demands far more computational complexity than abstract reasoning. Our experiments suggest that intelligence in the physical world (i.e. physical commonsense) may have a higher activation threshold in terms of compute, and we’re only beginning to explore what lies beyond.

Scaling laws are commonly measured during pretraining, as in Figure 1, which shows the relationship of model size and compute to performance on a downstream zero-shot task during pretraining. Another type of scaling law relates to the benefits of pretraining that persist into finetuning. At sufficient model scale, we also observe a strong power-law relationship (Figure 3) between pretraining data scale and downstream post-training performance. This holds for all of the tasks we've measured, including partner- and customer-inspired applications and their workflows across a wide range of industrial sectors – including apparel, manufacturing, logistics, automotive, and electronics.

More specifically, we take a variety of model checkpoints (Figure 2) that have been pretrained using our training procedure on different subsets of our pretraining dataset, and then post-train these checkpoints on multi-task language-conditioned data i.e. supervised fine-tuning simultaneously on 16 different task sets. We find that more pretraining improves downstream model performance across all tasks (Figure 2).


Figure 2. With increasingly more pretraining data (different colors), multi-task model performance during post-training improves in terms of validation loss (top) as well as next action prediction error (bottom 4x4 grid) across all 16 task sets. These tasks include ones that evaluate dexterity (e.g. build Lego), industry-specific workflows (e.g. fast food packing), and generalization (e.g. “_ anything” tasks).


Model performance is predictable with a power-law relationship (Figure 3), with which we can answer questions like “how much pretraining data do we need to reach a specific next-action prediction error?” or “how much post-training data (for a specific task) can we buy with more pretraining data?” Given a fixed data and finetuning budget on a downstream task, and given a pretraining dataset of varying size \(D\), the validation error \(L(\cdot)\) on the downstream task can be predicted via a power-law of the form: $$ L(D) = (D_c / D)^{\alpha_{D}} \ . $$ For example, in the case of Clothes Handling (which involves sorting, unscrambling, buttoning, and hanging clothes in a real workplace), we can predict model performance given 1 billion action trajectories. These predictions guide conversations about partner-facing tasks and indicate how much more data is needed to reach specific levels of performance.
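
For illustration, here is a minimal sketch of fitting this power law in log-log space and extrapolating it to a larger pretraining set; the data points below are made-up placeholders, not our measurements.

```python
# Minimal sketch: fit L(D) = (D_c / D)^alpha_D by linear regression in log-log space.
# The (dataset size, validation error) pairs below are hypothetical placeholders.
import numpy as np

def fit_power_law(dataset_sizes, val_errors):
    """Return (D_c, alpha_D) from log L = alpha_D * log D_c - alpha_D * log D."""
    slope, intercept = np.polyfit(np.log(dataset_sizes), np.log(val_errors), deg=1)
    alpha_d = -slope
    d_c = np.exp(intercept / alpha_d)
    return d_c, alpha_d

def predict_error(D, d_c, alpha_d):
    """Predicted downstream validation error for a pretraining set of size D."""
    return (d_c / D) ** alpha_d

# Hypothetical measurements: pretraining size (action trajectories) vs. validation error.
D = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
L = np.array([0.42, 0.33, 0.26, 0.21, 0.17])
d_c, alpha_d = fit_power_law(D, L)
print(predict_error(1e9, d_c, alpha_d))   # e.g. extrapolate to 1B trajectories
```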

Figure 3. Our scaling laws provide a good description for asymptotic next action prediction error on a post-trained model for a given task set as a function of pretraining dataset size (in terms of number of action trajectories). Together with model size scaling laws, we can use these results to predict optimal allocation of pretraining compute and data for any downstream post-training task.

Our foundation models are trained on an unprecedented corpus of 270,000 hours of real-world manipulation trajectories collected across diverse activities in 1,000s of homes, warehouses, and workplaces worldwide. Today, our robot data operations provide over 10,000 new hours per week and are accelerating. This is all powered by a global network of hardware and 1,000s of data collection devices and robots.


Figure 4. GEN-0 is trained on orders of magnitude more real-world manipulation data than some of the largest robotics datasets that exist to date (as of Nov 2025).

Mapping the Universe of Manipulation

To scale GEN-0 capabilities, we are constructing the largest and most diverse real-world manipulation dataset ever built, including every manipulation task humans can think of – from peeling potatoes, to threading bolts – spanning homes, bakeries, laundromats, warehouses, factories, and more. Here is an example internal search tool we have built to explore this universe:
Figure 5. An example of searching through this universe of manipulation data with our internal search tool.

Infrastructure for Internet-Scale Robot Data

Building the operations and ML infrastructure to support this is no easy feat. For robot models and data at this scale, we built custom hardware, dataloaders, and network infrastructure (including laying new dedicated Internet lines) to support the uplink bandwidth from a diverse set of data collection sites all around the world. We’ve negotiated multi-cloud contracts, built custom upload machines, scaled to O(10K) cores for continual multimodal data processing, and compressed dozens of petabytes of data, using dataloading techniques behind frontier video foundation models; the resulting pipeline is capable of absorbing 6.85 years of real-world manipulation experience per day of training.
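
As a rough sanity check on that throughput figure, the conversion to hours per training day (using the 10,000 new hours per week quoted above) is:

```python
# Back-of-the-envelope conversion of the quoted ingestion rate; nothing here is
# a new measurement, just unit arithmetic on the figures stated in the text.
HOURS_PER_YEAR = 365.25 * 24
absorbed_per_training_day = 6.85 * HOURS_PER_YEAR   # ~60,000 hours of experience per day
collected_per_day = 10_000 / 7                      # ~1,430 new hours collected per day
print(f"absorbed: {absorbed_per_training_day:,.0f} h/day, "
      f"collected: {collected_per_day:,.0f} h/day")
```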

Science of Pretraining

From large-scale ablations, we find that data quality and diversity matters more than sheer volume, and that carefully constructed data mixtures can lead to different pretrained model characteristics. For example, Table 1 shows the performance metrics of different models trained on 8 different pretraining datasets, and their downstream impact when finetuned on 10 long-horizon task sets, organized into 3 groups that evaluate different dimensions: dexterity, real-world applications, and generalization.

Performance is measured in terms of validation prediction M.S.E. \(\text{MSE}_{\text{val}} = ||\mathbf{a}^{\star} - \hat{\mathbf{a}}||_2^2 \) and reverse Kullback–Leibler divergence (reverse KL), which better measures mode-seeking behavior. To estimate reverse KL, we use a Monte-Carlo estimator where the policy induces an empirical density \(q(\mathbf{a}) =\frac{1}{M}\sum_{m=1}^{M}\mathcal{N}\left(\mathbf{a}; \hat{\mathbf{a}}_{m},\mathbf{I}\right)\) via a unit-variance mixture of Gaussians centered at \(M\) policy samples \(\{\hat{\mathbf{a}}_m\}_{m=1}^{M}\), and the data/ground-truth induces a unit-variance Gaussian \(p(\mathbf{a})=\mathcal{N}\left(\mathbf{a}; \mathbf{a}^\star,\mathbf{I}\right) \) centered at \(\mathbf{a}^\star\). We approximate the expectation with policy samples: $$ \widehat{D}_{\mathrm{KL}}(q||p) \approx \frac{1}{M}\sum_{m=1}^{M}\Big[\log q(\hat{\mathbf{a}}_{m})-\log p(\hat{\mathbf{a}}_{m})\Big] \ . $$ Experiments show that models with both low prediction errors and low reverse KL tend to perform better with supervised finetuning (SFT) for post-training, while models with high prediction errors and low reverse KL tend to be more distributionally multimodal, which can help post-training reinforcement learning. Having multiple data collection strategies at scale allows us to continually A/B test which data improves pretraining the most.
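
For concreteness, here is a minimal NumPy sketch of both metrics as defined above; the array shapes and the toy usage at the end are illustrative assumptions rather than our evaluation code.

```python
# Minimal sketch of the evaluation metrics defined above (not our actual code):
# MSE_val and a Monte-Carlo estimate of reverse KL between the policy's sample
# mixture q and the unit-variance Gaussian p centered at the ground-truth action.
import numpy as np
from scipy.special import logsumexp

def log_unit_gaussian(a, mu):
    """log N(a; mu, I) for an isotropic unit-variance Gaussian."""
    d = a.shape[-1]
    return -0.5 * (d * np.log(2.0 * np.pi) + np.sum((a - mu) ** 2, axis=-1))

def action_mse(a_star, a_hat):
    """Validation prediction error ||a* - a_hat||_2^2."""
    return float(np.sum((a_star - a_hat) ** 2))

def reverse_kl(policy_samples, a_star):
    """(1/M) sum_m [log q(a_m) - log p(a_m)] with q a unit-variance mixture."""
    M = policy_samples.shape[0]
    # pairwise[m, j] = log N(a_m; a_j, I); logsumexp over components j gives log q(a_m)
    pairwise = log_unit_gaussian(policy_samples[:, None, :], policy_samples[None, :, :])
    log_q = logsumexp(pairwise, axis=1) - np.log(M)
    log_p = log_unit_gaussian(policy_samples, a_star)
    return float(np.mean(log_q - log_p))

# Toy usage: 32 sampled 14-dim action vectors versus one ground-truth action.
rng = np.random.default_rng(0)
a_star = rng.normal(size=14)
samples = a_star + 0.3 * rng.normal(size=(32, 14))
print(action_mse(a_star, samples.mean(axis=0)), reverse_kl(samples, a_star))
```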

Table 1. These experiments compare different pretraining datasets, collected together with multiple data foundry partners, split across different classifications (i.e. modes) of data collection. Class 1 involves data on specific tasks, Class 3 involves do-anything type data, and Class 2 is everything in between. Different partners also have different operations, and we use these experiments to compare partners, iterate, and provide feedback on what data to collect, how to collect it, and which methods improve models the most.


More on these learnings in future posts. Please cite this work as

Generalist AI Team, "GEN-0: Embodied Foundation Models That Scale with Physical Interaction", Generalist AI Blog, Nov 2025.

Or use the BibTeX citation:

@article{generalist2025gen0,
  author  = {Generalist AI Team},
  title   = {GEN-0: Embodied Foundation Models That Scale with Physical Interaction},
  journal = {Generalist AI Blog},
  year    = {2025},
  note    = {https://generalistai.com/blog/preview-uqlxvb-bb.html},
}
