Orca: The World is in Your Mind

Boan Zhu; Chunlei Men; Congsheng Xu; Euan Liu; Guocai Yao; Hang Zhao; Hongyang Li; Huaihai Lyu; Jiaming Li; Jianlan Luo

arxiv: 2606.30534 · v1 · pith:ZJL35LI2new · submitted 2026-06-29 · 💻 cs.CV

Yihao Wang, Yuheng Ji, Mingyu Cao, Yanqing Shen, Runze Xiao, Huaihai Lyu, Senwei Xie, Euan Liu, Klara Tian, Tianfeng Long, Yichi Zhang, Zhengliang Cai, Ruike Chen, Jifan Zhao, Ruochuan Shi, Zihan Tang, Jing Lyu, Wenxing Tan, Ningbo Zhang, Yangtao Hu, Yuming Gao, Xiansheng Chen, Junkai Zhao, Congsheng Xu, Boan Zhu, Ziqi Wang, Yupu Feng, Qiongqiong Zhang, Yingli Zhao, Yulong Ao, Shaoxuan Xie, You Liu, Guocai Yao, Leiduo Zhang, Xiaodan Liu, Yunyan Zhang, Yance Jiao, Xinyan Yang, Jiaxing Wei, Xu Liu, Tengfei Pan, Shaokai Nie, Chunlei Men, Sen Cui, Xiaojie Jin, Hongyang Li, Jianlan Luo, Yao Mu, Yunchao Wei, Jun Yan, Hang Zhao, Xiaolong Zheng, Jiaming Li, Yonghua Lin, Tiejun Huang, Zhongyuan Wang, Pengwei Wang

Pith reviewed 2026-06-30 06:18 UTC · model grok-4.3

classification 💻 cs.CV

keywords world foundation model · next-state prediction · unified latent space · multimodal pretraining · video understanding · embodied action generation · frozen backbone readout

The pith

Orca learns a unified world latent space via next-state prediction from multimodal signals, enabling stronger readouts on text, image and action tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Orca as a general world foundation model that builds one shared latent representation of the world from both video and language data. Training centers on next-state prediction instead of separate next-token, next-frame or next-action objectives. Pre-training combines 125K hours of video for dense unconscious transitions with 160M language-described events for conscious sparse transitions. After pre-training the backbone is frozen and only lightweight decoders are trained, allowing the same latent to drive text generation, image prediction and embodied action generation. The results are offered as evidence that a stronger unified latent produces better performance across these readouts than similarly sized models trained for each task alone.

Core claim

Orca learns a unified world latent space from multimodal world signals by centering on Next-State-Prediction modeling. Unconscious learning captures dense natural state transitions from continuous videos, while conscious learning models sparse meaningful state transitions using language-described events and VQA supervision. The resulting latent, exposed through lightweight modality-specific decoders with a frozen backbone, supports stronger readouts on text generation, image prediction, and embodied action generation than similar-sized specialized baselines.
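To make the next-state objective concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a stand-in frozen encoder (`encode`), a one-parameter transition model, and mean squared error as the latent loss; all names and the choice of loss are illustrative assumptions.

```python
def encode(frame):
    """Stand-in for a frozen encoder: maps a frame (list of floats) to a latent."""
    return [2.0 * x for x in frame]

def predict_next_latent(latent, gain):
    """Toy transition model: one scalar gain applied to the current latent."""
    return [gain * z for z in latent]

def latent_loss(pred, target):
    """Mean squared error between latents (an assumed form of the latent loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def train_transition(pairs, steps=300, lr=0.05):
    """Fit the gain by gradient descent on the next-state latent loss."""
    gain = 0.0
    for _ in range(steps):
        grad = 0.0
        for frame_t, frame_next in pairs:
            z_t, z_next = encode(frame_t), encode(frame_next)
            # d/dgain of mean((gain * z_t - z_next)^2)
            grad += sum(2 * (gain * a - b) * a for a, b in zip(z_t, z_next)) / len(z_t)
        gain -= lr * grad / len(pairs)
    return gain

# Toy "video": each next frame is 1.5 times the current frame.
frames = [[0.1 * i, -0.2 * i, 0.3, 0.05 * i] for i in range(1, 9)]
pairs = [(f, [1.5 * x for x in f]) for f in frames]
learned_gain = train_transition(pairs)
```

The point of the sketch is that the prediction target lives in latent space: the model is scored on the encoded next state rather than on pixels, tokens, or actions.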

What carries the argument

Next-State-Prediction modeling that unifies state-transition learning across video and language into a single latent space.

If this is right

A stronger pre-trained world latent directly improves downstream performance on text generation, image prediction, and embodied action generation.
Freezing the backbone after large-scale pre-training and training only lightweight decoders is sufficient to expose the latent's utility across modalities.
The combination of unconscious video-based and conscious language-based pre-training produces a latent that scales to multiple readout tasks.
Outperformance over similar-sized specialized baselines holds when the same frozen backbone serves all three domains.
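The frozen-backbone protocol in these points can be illustrated with a small sketch. It is a toy under stated assumptions: `FROZEN` is a hypothetical set of pre-trained weights, the "decoder" is a linear head, and only the head is updated during training.

```python
def backbone(x, frozen_weights):
    """Frozen feature extractor: weights are read but never updated."""
    return [w * x for w in frozen_weights]

def decoder(features, head):
    """Lightweight readout: a linear head over backbone features."""
    return sum(h * f for h, f in zip(head, features))

def mse(data, frozen_weights, head):
    """Mean squared error of the readout on (input, target) pairs."""
    return sum((decoder(backbone(x, frozen_weights), head) - y) ** 2 for x, y in data) / len(data)

def train_decoder(data, frozen_weights, steps=500, lr=0.05):
    """Gradient descent on the head only; the backbone receives no updates."""
    head = [0.0] * len(frozen_weights)
    for _ in range(steps):
        grads = [0.0] * len(head)
        for x, y in data:
            feats = backbone(x, frozen_weights)
            err = decoder(feats, head) - y
            for i, f in enumerate(feats):
                grads[i] += 2 * err * f / len(data)
        head = [h - lr * g for h, g in zip(head, grads)]
    return head

FROZEN = (1.0, -0.5)  # hypothetical pre-trained weights; a tuple, so immutable
DATA = [(x / 4, 3.0 * x / 4) for x in range(-4, 5)]  # toy target y = 3x
initial_error = mse(DATA, FROZEN, [0.0, 0.0])
head = train_decoder(DATA, FROZEN)
final_error = mse(DATA, FROZEN, head)
```

Because only the head changes, any improvement in `final_error` has to come from information already present in the frozen features, which is the reasoning the readout evaluation relies on.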

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the latent encodes general state transitions, adding a new decoder could enable readouts for modalities such as audio without retraining the core model.
Direct comparison of internal representations between Orca and task-specific models could isolate which world-state features arise only from the unified objective.
Extending the pre-training inventory to additional sensor streams would test whether the next-state objective remains effective beyond video and text.

Load-bearing premise

That performance gains on the readout tasks are driven by the quality of the latent space itself rather than by decoder architecture choices or overlap between the pre-training inventory and the evaluation data.

What would settle it

Train identical lightweight decoders on a randomly initialized backbone, or on a backbone pre-trained without the next-state-prediction objective, and check whether Orca's performance advantage over those controls disappears on the three readout tasks.
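A toy version of this control is sketched below. The two backbones, the 1-D least-squares probe, and the data are all hypothetical stand-ins; the only thing the sketch is meant to show is the shape of the comparison, where the decoder and its training data are held fixed and only the backbone varies.

```python
import random

def fit_linear_probe(features, targets):
    """Closed-form 1-D least squares (slope, intercept): the same decoder for every backbone."""
    n = len(features)
    mf, mt = sum(features) / n, sum(targets) / n
    var = sum((f - mf) ** 2 for f in features)
    slope = sum((f - mf) * (t - mt) for f, t in zip(features, targets)) / var if var else 0.0
    return slope, mt - slope * mf

def probe_error(backbone, train, test):
    """Fit the probe on frozen features from `backbone`; return test mean squared error."""
    slope, intercept = fit_linear_probe([backbone(x) for x, _ in train], [y for _, y in train])
    return sum((slope * backbone(x) + intercept - y) ** 2 for x, y in test) / len(test)

rng = random.Random(0)
noise_table = {}

def random_backbone(x):
    """Hypothetical random-init control: features unrelated to the input state."""
    return noise_table.setdefault(x, rng.uniform(-1, 1))

def pretrained_backbone(x):
    """Hypothetical pre-trained backbone: features that track the underlying state."""
    return 0.5 * x

train = [(x, 2.0 * x + 1.0) for x in range(0, 20)]
test = [(x, 2.0 * x + 1.0) for x in range(20, 30)]
results = {
    "pretrained": probe_error(pretrained_backbone, train, test),
    "random_init": probe_error(random_backbone, train, test),
}
advantage = results["random_init"] - results["pretrained"]
```

If `advantage` stayed near zero in a real run of this protocol, the gains would have to be attributed to the decoders or the data rather than to the latent.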

Figures

Figures reproduced from arXiv: 2606.30534 by Boan Zhu, Chunlei Men, Congsheng Xu, Euan Liu, Guocai Yao, Hang Zhao, Hongyang Li, Huaihai Lyu, Jiaming Li, Jianlan Luo, Jiaxing Wei, Jifan Zhao, Jing Lyu, Junkai Zhao, Jun Yan, Klara Tian, Leiduo Zhang, Mingyu Cao, Ningbo Zhang, Pengwei Wang, Qiongqiong Zhang, Ruike Chen, Runze Xiao, Ruochuan Shi, Sen Cui, Senwei Xie, Shaokai Nie, Shaoxuan Xie, Tengfei Pan, Tianfeng Long, Tiejun Huang, Wenxing Tan, Xiansheng Chen, Xiaodan Liu, Xiaojie Jin, Xiaolong Zheng, Xinyan Yang, Xu Liu, Yance Jiao, Yangtao Hu, Yanqing Shen, Yao Mu, Yichi Zhang, Yihao Wang, Yingli Zhao, Yonghua Lin, You Liu, Yuheng Ji, Yulong Ao, Yuming Gao, Yunchao Wei, Yunyan Zhang, Yupu Feng, Zhengliang Cai, Zhongyuan Wang, Zihan Tang, Ziqi Wang.

**Figure 1.** The Orca’s overall framework. Orca follows an Encoder-Decoder architecture. Given multimodal world signals, the Encoder learns a world latent through two complementary paradigms: unconscious learning and conscious learning. Unconscious learning captures dense natural state transitions, while conscious learning captures sparse meaningful state transitions. To prove that the learned latent is effective, th…

**Figure 2.** (caption not extracted)

**Figure 3.** Overview of pre-training data. Orca’s pre-training data includes video, event, and VQA data. A. Video Data supports 1) Observation-only state transition, A. Video Data and B. Event Data support 2) Event-conditioned state transition, and C. VQA Data supports 3) VQA response generation. A. Video Data is built from visual signals and covers four types of real-world observations: egocentric interaction, exo-c…

**Figure 4.** Downstream readout architectures. To language reuses the LM head for text readout. To vision only trains an MLP adaptor and LoRA on top of a frozen SD3.5 to readout images. To action trains an MLP adaptor and a DiT-based Action Expert from scratch. Action Expert receives the latent, robot proprioception state, and noisy action to generate action chunks. The specific settings are shown in Appendix C.2. …

**Figure 5.** Loss of model and data scaling. To answer Question 1.1, we first performed experiments with model sizes and data scaling, and the loss curves are shown in …

**Figure 6.** Scaling behavior on downstream readouts performance. To answer Question 1.2, we performed probe experiments on Orca-0.8B and Orca-4B. We select some checkpoints from the pretraining process and apply them to downstream tasks to see if a strong world latent can lead to strong downstream readout performance. The readout performance curves are shown in …

**Figure 7.** Visual comparison of image prediction in the real world. 1) Orca’s learned world latent transfers effectively to image readout. Compared with recent image generation baselines, Orca achieves the best average performance on PRICE and remains competitive across different real-world interaction sources. This indicates that the learned world latent contains predictive information about future visual states und…

**Figure 8.** Recovery after repeated grasp failures. Orca recovers from early spoon-grasp failures and eventually makes progress, while 𝜋0.5 remains unstable with repeated failed attempts.

Original abstract

We introduce Orca, an initial instantiation of a general world foundation model. Orca learns a unified world latent space from multimodal world signals and exposes it through multimodal readout interfaces. Rather than optimizing isolated next-token, next-frame, or next-action prediction, we are centered on Next-State-Prediction modeling, offering a unified state-transition modeling route toward understanding, predicting, and acting upon the world. Orca learns through two complementary paradigms: unconscious learning captures dense natural state transitions from continuous videos, and conscious learning models sparse meaningful state transitions by language-described events and VQA supervision. For pre-training, we construct a large-scale world-learning inventory data, including 125K hours of video data and 160M event annotations. After pre-training, Orca learns a unified world latent space. To examine whether the learned latent supports downstream, we evaluate it by three representative downstream readouts: text generation, image prediction, and embodied action generation. Orca's backbone is frozen, and only the lightweight modality-specific decoders are trainable. Experiments show the scalability of the proposed paradigm and verify that stronger world latent enables stronger downstream readouts. Orca outperforms similar-sized specialized baselines. These results show that Orca, as a general world foundation model, presents a promising approach to understanding, predicting, and acting upon the world. Finally, we discuss the current limitations, aiming to provide useful insights and inspiration for the community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Orca frames a world latent via unconscious video and conscious language next-state prediction but the abstract shows no numbers or ablations to back the performance claims.

The paper's main contribution is a world model that learns a unified latent from next-state prediction in two modes: dense video transitions called unconscious learning and language-described events with VQA called conscious learning. It pretrains on 125K hours of video and 160M annotations, freezes the backbone, and trains lightweight decoders for text generation, image prediction, and embodied action. The abstract says this outperforms similar-sized specialized baselines, but gives no numbers or details.

What is new is the explicit split into unconscious and conscious paradigms for the same latent space, rather than standard next-token or next-frame training. The scale of the inventory and the direct readout tests on three modalities are concrete steps toward a general world model.

The paper does well at outlining the two complementary learning routes and at stating the goal of a shared state-transition model. The final section on limitations is direct.

The soft spots are clear. No quantitative results appear in the abstract, so the outperformance claim cannot be checked. The freezing protocol does not isolate the latent quality from data scale or decoder choices, and no ablation is described that would test this. The stress-test note is accurate on this point. Without those elements, the central claim that the stronger world latent drives the results stays under-supported.

This work is for researchers focused on multimodal pretraining and world models in vision and robotics. A reader interested in new framings for unified latents will get value from the high-level design. A reader who needs reproducible evidence or tight experimental controls will find the current version thin. It deserves a serious referee because the idea is coherent and the scale ambitious, though the experiments require more detail and controls.

I recommend sending it to peer review and asking for the quantitative results, baseline descriptions, and ablations on the latent contribution.

Referee Report

2 major / 2 minor

Summary. The paper introduces Orca as an initial general world foundation model that learns a unified world latent space from multimodal signals (video, language, VQA) via Next-State-Prediction rather than modality-specific next-token/frame/action prediction. Pre-training uses a 125K-hour video inventory with 160M event annotations through unconscious (dense video transitions) and conscious (sparse language-described events) paradigms. After pre-training the backbone is frozen and only lightweight modality-specific decoders are trained for three downstream readouts: text generation, image prediction, and embodied action generation. The manuscript claims this protocol demonstrates scalability, that stronger world latents yield stronger readouts, and that Orca outperforms similar-sized specialized baselines.

Significance. If the attribution to the unified latent holds and the outperformance is reproducible with proper controls, the work would be significant for the field of generalist multimodal models by offering a unified state-transition modeling route that could reduce the need for separate specialized models. The scale of the pre-training inventory (125K hours + 160M annotations) and the explicit separation of unconscious/conscious learning are notable strengths, as is the attempt to evaluate a single frozen backbone across text, vision, and action readouts.

major comments (2)

[abstract / evaluation protocol] The central claim that 'stronger world latent enables stronger downstream readouts' (abstract) rests on the evaluation protocol of freezing the backbone after pre-training on the 125K-hour inventory and training only lightweight decoders. No ablation is described that holds decoder capacity and data fixed while varying only the latent (e.g., random-init backbone or controlled data overlap), leaving open that gains could arise from data volume/diversity or decoder design rather than latent geometry. This is load-bearing for the attribution to the unified Next-State-Prediction latent.
[abstract] The abstract asserts that 'experiments show the scalability of the proposed paradigm and verify that stronger world latent enables stronger downstream readouts' and that 'Orca outperforms similar-sized specialized baselines,' yet provides no quantitative results, baseline details, error bars, data-split information, or statistical tests. Without these, the soundness of the outperformance claim cannot be assessed from the manuscript text.
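One way to supply the error bars requested in the second major comment is a percentile bootstrap on the difference in mean score between two models. The sketch below is a generic illustration; the per-episode outcomes are hypothetical and are not results from the paper.

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(scores_a) - mean(scores_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resample_a = [rng.choice(scores_a) for _ in scores_a]
        resample_b = [rng.choice(scores_b) for _ in scores_b]
        diffs.append(sum(resample_a) / len(resample_a) - sum(resample_b) / len(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-episode success indicators (1 = success); not data from the paper.
model_a = [1] * 18 + [0] * 2
model_b = [1] * 8 + [0] * 12
ci_low, ci_high = bootstrap_diff_ci(model_a, model_b)
```

An interval that excludes zero supports a claim of outperformance on that task; an interval that straddles zero means the reported gap could be noise from the number of trials.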

minor comments (2)

[abstract] The abstract refers to 'three representative downstream readouts' but does not specify the exact tasks, metrics, or datasets used for text generation, image prediction, and embodied action generation.
[abstract] Notation for the 'unified world latent space' and 'Next-State-Prediction' is introduced without an accompanying equation or diagram in the abstract; a formal definition would aid clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and indicate planned revisions to strengthen the paper.

Referee: [abstract / evaluation protocol] The central claim that 'stronger world latent enables stronger downstream readouts' (abstract) rests on the evaluation protocol of freezing the backbone after pre-training on the 125K-hour inventory and training only lightweight decoders. No ablation is described that holds decoder capacity and data fixed while varying only the latent (e.g., random-init backbone or controlled data overlap), leaving open that gains could arise from data volume/diversity or decoder design rather than latent geometry. This is load-bearing for the attribution to the unified Next-State-Prediction latent.

Authors: We agree that the attribution to the unified latent would be strengthened by an ablation that isolates latent quality while holding decoder capacity and training data fixed. In the revised manuscript we will add a controlled comparison of the pre-trained backbone against a randomly-initialized backbone (identical decoder training protocol and data) and will report results on a data-overlap-controlled subset. These additions directly address the load-bearing concern. revision: yes
Referee: [abstract] The abstract asserts that 'experiments show the scalability of the proposed paradigm and verify that stronger world latent enables stronger downstream readouts' and that 'Orca outperforms similar-sized specialized baselines,' yet provides no quantitative results, baseline details, error bars, data-split information, or statistical tests. Without these, the soundness of the outperformance claim cannot be assessed from the manuscript text.

Authors: The abstract is intentionally concise; the full quantitative results, baseline specifications, data splits, error bars, and statistical details appear in Sections 4–5 and the associated tables/figures. To improve standalone readability we will expand the abstract with one or two key performance deltas and will ensure the abstract explicitly references the main-text tables containing the complete experimental protocol. revision: partial
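The data-overlap-controlled subset promised in the first response could be built by fingerprinting samples and dropping any evaluation item that also appears in the pre-training inventory. This is a minimal sketch under assumptions: the identifiers and the normalization rule are hypothetical, and a real audit would fingerprint content (for example a source video ID and time span) rather than names.

```python
import hashlib

def fingerprint(sample_id: str) -> str:
    """Stable fingerprint of a normalized sample identifier."""
    return hashlib.sha256(sample_id.strip().lower().encode("utf-8")).hexdigest()

def split_by_overlap(eval_ids, pretrain_ids):
    """Partition evaluation items into (clean, overlapping) relative to the pre-training inventory."""
    seen = {fingerprint(s) for s in pretrain_ids}
    clean = [s for s in eval_ids if fingerprint(s) not in seen]
    overlapping = [s for s in eval_ids if fingerprint(s) in seen]
    return clean, overlapping

# Hypothetical identifiers, for illustration only.
pretrain_inventory = ["video_0001", "video_0002", "Video_0003 "]
evaluation_set = ["video_0003", "video_0100", "video_0101"]
clean, overlapping = split_by_overlap(evaluation_set, pretrain_inventory)
```

Reporting scores separately on `clean` and `overlapping` shows whether the readout advantage survives once shared sources are excluded.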

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes an empirical pre-training and evaluation protocol for a multimodal world model without presenting any mathematical derivations, equations, or fitting procedures. Claims rest on experimental comparisons after freezing the backbone and training lightweight decoders, which is a standard non-circular evaluation approach for foundation models. No self-citations, ansatzes, or renamings are invoked as load-bearing steps in the provided text, and the central assertion that stronger latents yield stronger readouts is supported (or not) by the reported results rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated in the provided text, so the ledger records the minimal structural assumptions visible from the summary.

axioms (1)

domain assumption: Next-State-Prediction provides a unified route to understanding, predicting, and acting upon the world
Stated as the modeling center in the abstract.

invented entities (1)

unified world latent space (no independent evidence)
purpose: Single representation learned from multimodal signals that supports multiple readout interfaces
Core postulated object of the model; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 6005 in / 1360 out tokens · 26986 ms · 2026-06-30T06:18:46.137597+00:00 · methodology

Reference graph

Works this paper leans on

47 extracted references · 1 canonical work pages · 1 internal anchor

[1]

Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems

AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. In IROS, 2025. Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Y...

2025
[2]

General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling

Google Deepmind. Gemini 3.1 pro best for complex tasks and bringing creative concepts to life, February 2026a. URL https://deepmind.google/models/gemini/pro/. Google Deepmind. Gemma 4: Byte for byte, the most capable open models, April 2026b. URL https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/. ...

2026
[3]

Jepa-vla: Video predictive embedding is needed for vla models

URL https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Shangchen Miao, Ningya Feng, Jialong Wu, Ye Lin, Xu He, Dong Li, and Mingsheng Long. Jepa-vla: Video predictive embedding is needed for vla models. ArXiv, 2026. MiniMax Team. Minimax m2.7: Early echoes of self-evolution, March 2026. URL https://www.minimax.io/news/minimax-m27-en. Mistra...

2026
[4]

Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models. In NeurIPS, 2025. Yuxuan Tian, Yuheng Ji, Xiaolong Zheng, Ziheng Qin, Yipu Wang, Xinyi Zheng, Yuyang Liu, Shuanghao Bai, Zhe Li, Liang Wang, et al. Spatial int...

2025
[5]

Llama 2: Open foundation and fine-tuned chat models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. ArXiv, 2023. Wan Team, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wa...

2023
[7]

next token prediction

strengthen agentic language modeling. The former emphasizes tool-oriented intelligence, and the latter focuses on long-horizon agentic engineering. MiniMax-M2.7 (MiniMax Team, 2026) investigates self-evolving for real-world productivity. Phi-4-reasoning (Abdin et al., 2025) shows the effectiveness of high-quality reasoning supervision in dense models. The...

2026
[8]

The loss is: $\mathcal{L}_{\text{obs}} = \mathbb{E}\left[\ell_{\text{lat}}\left(\hat{v}^{l}_{t+1},\, v^{l}_{t+1}\right)\right]$

Observation-only state transition. $v^{l}_{t+1}$ is the latent of the next frame. The loss is:

$$\mathcal{L}_{\text{obs}} = \mathbb{E}\left[\ell_{\text{lat}}\left(\hat{v}^{l}_{t+1},\, v^{l}_{t+1}\right)\right] \tag{C-2}$$
[9]

Event-conditioned state transition. The language specifies whether the current state should be mapped toward an adjacent (earlier or later) event state. Accordingly, Orca predicts the visual latent in the previous event selected by the previous-event condition and the visual latent in the next event selected by the next-event condition. The event-conditioned ...
[10]

This term is denoted as $\mathcal{L}_{\text{vqa}}$

VQA response generation. Orca uses the language modeling head to predict the target answer with the standard next-token prediction loss. This term is denoted as $\mathcal{L}_{\text{vqa}}$. The final pre-training objective of Orca is $\mathcal{L}_{\text{pre}} = 0.1\,\mathcal{L}_{\text{obs}} + 0.5\,\mathcal{L}_{\text{evt}} + 0.4\,\mathcal{L}_{\text{vqa}}$. At the data-sampling level, Orca mixes state transition samples and VQA samples with an approximate ratio of 5 :...
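The weighted pre-training objective quoted in entry [10] is a fixed linear combination of the three loss terms. A minimal sketch follows; the weights 0.1, 0.5 and 0.4 come from the excerpt, while the function name and the example loss values are illustrative assumptions.

```python
# Weights as quoted in the excerpt: observation-only, event-conditioned, VQA.
PRETRAIN_WEIGHTS = {"obs": 0.1, "evt": 0.5, "vqa": 0.4}

def pretrain_loss(loss_obs, loss_evt, loss_vqa, weights=PRETRAIN_WEIGHTS):
    """Weighted sum of the three pre-training loss terms."""
    return (
        weights["obs"] * loss_obs
        + weights["evt"] * loss_evt
        + weights["vqa"] * loss_vqa
    )

# Hypothetical per-batch loss values, for illustration only.
example = pretrain_loss(loss_obs=2.0, loss_evt=1.0, loss_vqa=0.5)
```

With these weights the event-conditioned and VQA terms together account for 90% of the objective, so the dense observation-only term acts as the smallest contribution per sample.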
[11]

The last-layer hidden state of $q_1$ is passed to the visual transition head (two-layer MLP), and the ground truth latent $v^{l}_{t+1}$ is obtained by the frozen vision encoder of VLM backbone

Observation-only state transition. Given the current observation $v_t$ and <Query 1> $q_1$, Orca predicts the latent $\hat{v}^{l}_{t+1}$ of a temporally next frame. The last-layer hidden state of $q_1$ is passed to the visual transition head (two-layer MLP), and the ground truth latent $v^{l}_{t+1}$ is obtained by the frozen vision encoder of VLM backbone
[12]

The previous-event $\mathcal{L}_{\text{prev}}$ and next-event directions $\mathcal{L}_{\text{next}}$, which are calculated $\mathcal{L}_{\text{evt}}$ in Equation C-3

Event-conditioned State Transition. Given $v_t$, $q_1$, an instruction $e_{t+\Delta}$, and the <Query 2> $q_2$, Orca predicts the latent $\hat{v}^{l}_{t+\Delta}$ of random frame in the instruction-specified target event. $e_{t+\Delta}$ specifies the transition direction and target event, while $q_2$ reads out the corresponding instruction-conditioned predictive state. The previous-event $\mathcal{L}_{\text{prev}}$ and next-eve...

2024
[13]

Latent $q_1$: predictive query states from Orca, providing latent for future state evolution
[14]

Noisy action with time embedding: Actions with Gaussian noise, and time embedding added
[15]

Proprioception: robot proprioceptive state, including joint and end-effector related information. Settings. The Action Expert is trained with the flow-matching loss to obtain the action chunks. The ground-truth action chunk is perturbed with Gaussian noise, and the Action Expert predicts the corresponding velocity. The architecture and training settings of t...

2026
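The flow-matching setup described in entry [15] (perturb the ground-truth action chunk with Gaussian noise, then predict a velocity) can be sketched as follows. The linear interpolation path and the sign convention of the velocity are assumptions, since conventions differ between papers and the excerpt does not specify them; the action chunk is flattened to a single list for brevity.

```python
import random

def flow_matching_pair(action_chunk, t, rng):
    """Return (noisy_action, target_velocity) for one training example."""
    noise = [rng.gauss(0.0, 1.0) for _ in action_chunk]
    # Assumed linear path: t = 0 is pure noise, t = 1 is the clean action chunk.
    noisy_action = [(1.0 - t) * n + t * a for n, a in zip(noise, action_chunk)]
    # Along that path the velocity is constant: clean action minus noise.
    target_velocity = [a - n for n, a in zip(noise, action_chunk)]
    return noisy_action, target_velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target_velocity)) / len(target_velocity)

rng = random.Random(0)
chunk = [0.1, -0.3, 0.25, 0.0]  # hypothetical flattened action chunk
noisy, velocity = flow_matching_pair(chunk, t=0.25, rng=rng)
# Integrating the true velocity over the remaining time recovers the clean chunk.
recovered = [x + (1.0 - 0.25) * v for x, v in zip(noisy, velocity)]
```

At inference time a trained velocity predictor would be integrated from noise toward an action chunk in a few steps; the `recovered` line shows the single-step case with the exact velocity.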
Book (task name taken from the Table E3 column header)
[16] The robot arm moves toward the book. (10 points)
[17] The gripper contacts the book. (10 points)
[18] The book is pushed to the edge, with more than 2 cm beyond the edge, without falling. (20 points)
[19] The book is successfully grasped. (30 points)
[20] The book is moved toward the bookshelf while being grasped. (20 points)
[21] The book is successfully placed on the bookshelf. (10 points)

Stacked Bowls
[22] The hand moves toward Bowl 1. (10 points)
[23] Bowl 1 is grasped. (20 points)
[24] Bowl 1 is placed stably. (10 points)
[25] The hand moves toward Bowl 2. (10 points)
[26] Bowl 2 is grasped. (10 points)
[27] Bowl 2 is stably stacked into Bowl 1. (10 points)
[28] The hand moves toward Bowl 3. (10 points)
[29] Bowl 3 is grasped. (10 points)
[30] Bowl 3 is stably stacked into Bowl 2. (10 points)

Pull Out Tissue
[31] Arm A moves toward the tissue box. (10 points)
[32] Arm A holds the tissue box. (20 points)
[33] Arm B moves toward the tissue. (20 points)
[34] Arm B successfully grasps the yellow tissue. (40 points)
[35] The tissue is placed on the table. (10 points)
Note: The two arms are scored separately.

Stamp
[36] The robot arm moves toward the stamp. (10 points)
[37] The stamp is successfully grasped and lifted. (30 points)
[38] The stamp is moved above the document. (10 points)
[39] The document is stamped by pressing the stamp. (20 points)
[40] The stamp is moved above the ink pad. (10 points)
[41] The stamp is placed stably without toppling. (20 points)
Note: If the stamp topples, scoring stops.

Scoop Sugar
[42] The hand moves toward the spoon. (10 points)
[43] The spoon is successfully grasped. (20 points)
[44] Sugar is scooped with the spoon. (20 points)
[45] The spoon is moved to the mug; the spoon must be held, but sugar is not strictly required. (10 points)
[46] The sugar is poured into the mug; the spoon must be held, but sugar is not required. (20 points)
[47] The spoon is placed back on the right side of the table. (20 points)

Table E3. Detailed rule-based results under real-robot OOD settings (rule-based score).

| Settings | Model | Book | Bowls | Tissue | Stamp | Sugar | Average |
|---|---|---|---|---|---|---|---|
| Environment OOD | 𝜋0.5 | 27 | 44 | 32 | 9 | 26 | 27.6 |
| Environment OOD | V-JEPA 2.1 | 24 | 15 | 28 | 6 | 3 | 15.2 |
| Environment OOD | Qwen3.5-0.8B | 1 | 28 | 0 | 0 | 10 | 7.8 |
| Environment OOD | Qwen3.5-4B | 19 | 27 | 0 | 6 | 10 | 12.4 |
| Environment OOD | Orca-0.8B | 23 | 44 | 28 | 27 | 15 | 27.4 |
| | Or... | | | | | | |
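As a quick consistency check on the rubric and the visible rows of Table E3, the sketch below recomputes each task's maximum score and each model's average. The numbers are copied from the excerpt above; the dictionary names are illustrative, and the label "Book" for the first task is taken from the Table E3 column header because the task heading itself is not in the excerpt.

```python
# Points per rubric stage, in the order listed in entries [16] to [47].
RUBRIC_POINTS = {
    "Book": [10, 10, 20, 30, 20, 10],
    "Bowls": [10, 20, 10, 10, 10, 10, 10, 10, 10],
    "Tissue": [10, 20, 20, 40, 10],
    "Stamp": [10, 30, 10, 20, 10, 20],
    "Sugar": [10, 20, 20, 10, 20, 20],
}

# Environment-OOD rows of Table E3: per-task scores (Book, Bowls, Tissue, Stamp, Sugar)
# and the average reported in the table.
TABLE_E3 = {
    "pi0.5": ([27, 44, 32, 9, 26], 27.6),
    "V-JEPA 2.1": ([24, 15, 28, 6, 3], 15.2),
    "Qwen3.5-0.8B": ([1, 28, 0, 0, 10], 7.8),
    "Qwen3.5-4B": ([19, 27, 0, 6, 10], 12.4),
    "Orca-0.8B": ([23, 44, 28, 27, 15], 27.4),
}

task_totals = {task: sum(points) for task, points in RUBRIC_POINTS.items()}
recomputed = {
    model: round(sum(scores) / len(scores), 1)
    for model, (scores, _reported) in TABLE_E3.items()
}
mismatches = [m for m, (_scores, reported) in TABLE_E3.items() if recomputed[m] != reported]
```

Every task's stages sum to 100 points and every recomputed average matches the reported one, so the visible rows are internally consistent. On these rows Orca-0.8B (27.4) is slightly below 𝜋0.5 (27.6) in the Environment OOD setting.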
