Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

Lu Qiu; Xihui Liu; Yi Chen; Yixiao Ge; Yizhuo Li; Yuying Ge

arxiv: 2606.12217 · v1 · pith:WMAVM4PKnew · submitted 2026-06-10 · 💻 cs.CV · cs.AI · cs.RO

Pith reviewed 2026-06-27 09:39 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO

keywords world action models · representation alignment · video diffusion · robot manipulation · action grounding · affordance understanding · out-of-distribution generalization

The pith

Aligning diffusion features to semantic representations makes world action model states useful for control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

World action models generate future video frames to support robot manipulation but often extract poor actions from those frames. Analysis shows the action decoder ignores task-relevant regions and reacts to irrelevant changes because its hidden states are tuned only for visual reconstruction. The paper introduces an alignment objective that pulls intermediate diffusion features toward spatially coherent outputs from a foundation visual encoder. This reorganization improves the decoder's focus, localization, and robustness without harming the video prediction itself.

Core claim

The central claim is that a representation mismatch exists between visual reconstruction objectives and action control needs, and that an Action-Grounded Representation Alignment objective resolves it by regularizing the interface between the world model and the action decoder, producing more reliable actions on real manipulation tasks.

What carries the argument

AGRA, the Action-Grounded Representation Alignment objective that matches intermediate video diffusion features to spatially coherent semantic representations from a foundation visual encoder.
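To make the idea of matching diffusion features to encoder features concrete, here is a minimal sketch of one possible alignment term. The page does not give the paper's exact loss, so everything below is an assumption: the patch-wise negative cosine similarity (a common choice in representation-alignment work), the linear `project` helper, the `weight` matrix, and the weighting factor `lam` are all hypothetical names chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def project(feature, weight):
    """Hypothetical linear projector: maps one diffusion feature
    into the encoder's feature dimension (weight is rows x cols)."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]


def alignment_loss(diffusion_feats, encoder_feats, weight):
    """Assumed alignment term: negative mean patch-wise cosine similarity
    between projected diffusion features and frozen encoder features."""
    sims = [
        cosine(project(h, weight), z)
        for h, z in zip(diffusion_feats, encoder_feats)
    ]
    return -sum(sims) / len(sims)


def total_loss(base_loss, align_loss, lam=0.5):
    """Assumed combination: existing training loss plus a weighted alignment term."""
    return base_loss + lam * align_loss
```

In this sketch the loss approaches -1 when every projected patch feature points in the same direction as its encoder counterpart, and moves toward 0 when they are orthogonal. Which layers are aligned and how `lam` is set are left open here, as they are on this page.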

If this is right

The action decoder focuses attention on correct interaction regions instead of task-irrelevant areas.
Object localization accuracy and affordance understanding both increase.
The resulting policy becomes more robust to perturbations outside the task-relevant zones.
Both in-distribution task success and out-of-distribution generalization improve over the unaligned baseline.
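The robustness consequence above can be probed with a simple patch-level intervention, in the spirit of the causal interventions the paper describes. The sketch below is an assumption rather than the authors' procedure: `policy` is any callable from a list of patch values to an action vector, patches are replaced by a constant `fill`, and the two toy policies are hypothetical stand-ins for a grounded and an ungrounded action decoder.

```python
import math


def action_drift(policy, patches, index, fill=0.0):
    """L2 change in the predicted action when one patch is replaced by `fill`."""
    base = policy(patches)
    perturbed = list(patches)
    perturbed[index] = fill
    moved = policy(perturbed)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(base, moved)))


def intervention_heatmap(policy, patches, fill=0.0):
    """Per-patch action drift; larger values mean the action depends more on that patch."""
    return [action_drift(policy, patches, i, fill) for i in range(len(patches))]


def irrelevant_sensitivity(heatmap, relevant_mask):
    """Share of total drift caused by patches outside the task-relevant mask."""
    total = sum(heatmap)
    if total == 0:
        return 0.0
    outside = sum(d for d, keep in zip(heatmap, relevant_mask) if not keep)
    return outside / total


# Hypothetical toy policies: patch 1 is the interaction region.
def grounded_policy(patches):
    return [patches[1]]


def distracted_policy(patches):
    return [patches[1] + patches[3]]


patches = [0.2, 1.0, 0.4, 2.0]
relevant = [False, True, False, False]
grounded_share = irrelevant_sensitivity(
    intervention_heatmap(grounded_policy, patches), relevant
)
distracted_share = irrelevant_sensitivity(
    intervention_heatmap(distracted_policy, patches), relevant
)
```

A lower share for the aligned model than for the baseline would be consistent with the claimed robustness; the masking scheme and the drift metric are design choices that a real evaluation would need to justify.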

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same alignment principle could be tested on other video-prediction controllers that currently separate generation from control.
If the foundation encoder's semantics prove too coarse for fine manipulation, the method would need a more action-specific reference representation.
The approach implies that pure next-frame prediction is insufficient scaffolding for control and that an explicit grounding step is required.

Load-bearing premise

Aligning the diffusion model's intermediate features to a foundation visual encoder's representations will reorganize those features into a form that supports accurate low-level action decoding.

What would settle it

The claim would be undermined if the action decoder's attention maps remain unchanged, or task performance fails to rise, after the alignment is added while visual prediction quality stays the same.
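One way to turn the attention-map test into a number is to measure how much attention weight lands inside a task-relevant region before and after alignment. This is a hedged sketch, not the paper's metric: the function name, the grid layout, and the toy attention maps are assumptions made for illustration.

```python
def attention_mass_in_region(attn, mask):
    """Fraction of non-negative attention weight that falls on masked cells.

    attn: 2D list of attention weights over spatial patches.
    mask: 2D list of the same shape, truthy where the patch is task-relevant.
    """
    total = sum(sum(row) for row in attn)
    if total == 0:
        return 0.0
    inside = sum(
        a
        for attn_row, mask_row in zip(attn, mask)
        for a, m in zip(attn_row, mask_row)
        if m
    )
    return inside / total


# Hypothetical 2x2 maps: the bottom row is the hand-object interaction region.
region = [[0, 0], [1, 1]]
baseline_attn = [[0.4, 0.4], [0.1, 0.1]]
aligned_attn = [[0.1, 0.1], [0.4, 0.4]]

baseline_mass = attention_mass_in_region(baseline_attn, region)
aligned_mass = attention_mass_in_region(aligned_attn, region)
```

If this fraction does not increase after the alignment objective is added, that would count against the paper's mechanism even if task success happens to improve.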

Figures

Figures reproduced from arXiv: 2606.12217 by Lu Qiu, Xihui Liu, Yi Chen, Yixiao Ge, Yizhuo Li, Yuying Ge.

**Figure 2.** Diagnosis of the action-grounding gap in the baseline WAM. (Left) Cross-attention maps show the action head ignoring the critical hand-object interaction region, despite generating a plausible future video. (Right) Causal intervention heatmaps (where brighter regions indicate areas that cause the robot’s action to drift when disrupted) reveal that the baseline model’s decisions are heavily influenced by ta…

**Figure 3.** PCA visualization of DINOv2 and Cosmos-Predict-2.5 representations. DINOv2 or…

**Figure 4.** Main results in real-world evaluation. (a) Main results for in-distribution (ID) scenarios. …

**Figure 5.** Qualitative analysis of action grounding. Compared with baseline WAM, AGRA directs…

**Figure 6.** Real-world execution cases…

**Figure 7.** Architecture of baseline World Action Model and the proposed Action-Grounded Repre…

**Figure 8.** Evaluation results of AGRA on the RoboCasa GR1 benchmark. Videos illustrate repre…

**Figure 9.** RoboCasa-GR1 tabletop tasks evaluation results with full training data.

**Figure 10.** Layer selection for AGRA. The figure visualizes the downstream manipulation performance when applying AGRA to different layers of Cosmos. [Plot: task success rate (0.0 to 0.8) versus layer (0, 6, 13, 20, 27); series: Single-layer Bridge, Multi-layer Bridge.]

**Figure 12.** Additional real-world execution cases of AGRA. The trajectories show representative…

**Figure 13.** Real-world comparison between WAM and AGRA. Given the instructions “put the ball…

**Figure 14.** Text-to-video cross-attention visualization across Cosmos layers. We visualize the cross…

**Figure 15.** Action-head cross-attention visualization. We visualize where the action head attends…

**Figure 16.** Visualization of different semantic feature types. We compare PCA visualizations of…

Original abstract

World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal a phenomenon: generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. In this paper, we propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that AGRA makes world model representations more action-grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper diagnoses a mismatch between video prediction and action extraction in WAMs, then adds a simple alignment objective (AGRA) that reportedly improves focus and robustness on manipulation tasks.

The main takeaway is that these world action models can generate decent future videos yet still produce weak actions because the hidden states are tuned for reconstruction rather than control. The authors trace this with attention maps and interventions, then introduce AGRA to pull intermediate diffusion features toward the spatially coherent outputs of a foundation visual encoder.

That diagnosis step is the clearest part of the work. Showing that the action decoder ignores task-relevant regions and stays brittle to irrelevant changes gives a concrete reason for the performance gap. Repurposing representation alignment for this interface is a direct, low-overhead move that fits the setting without inventing new architectures.

The soft spot is the evidence base. The abstract states consistent ID and OOD gains but supplies no numbers, no baseline tables, no dataset sizes, and no checks on whether video quality or training stability suffered. Without those, it is difficult to judge whether the alignment actually reorganizes states usefully or just adds a regularizer that happens to help on the tested tasks. The full paper may contain the missing details; if it does not, the central claim stays hard to evaluate.

This is aimed at the small group building video-based controllers for real-robot manipulation. Readers already working on diffusion world models or action grounding will see the most immediate use. It is not a broad methodological advance, so most people outside that niche can skip it.

I would send it for peer review. The idea is coherent and the diagnosis is useful, but the experimental claims need proper scrutiny on numbers, controls, and side effects before anyone treats AGRA as a reliable fix.

Referee Report

0 major / 3 minor

Summary. The paper diagnoses a representation mismatch in World Action Models (WAMs): video generation models produce plausible futures but the action decoder often fails to attend to task-relevant regions, as shown by attention analysis and causal interventions. To address this, the authors introduce AGRA, an auxiliary training objective that aligns intermediate features from a video diffusion world model with spatially coherent semantic representations from a foundation visual encoder. On real-world robot manipulation tasks, AGRA is reported to improve action grounding, object localization, affordance understanding, and both in-distribution and out-of-distribution policy performance relative to the baseline WAM.

Significance. If the reported gains are reproducible and statistically supported, the work provides a concrete, low-overhead method for making generative world models more useful for low-level control. The diagnostic experiments (attention maps and interventions) supply a clear mechanistic motivation that is often missing from alignment papers. The approach is general enough to apply to other diffusion-based action models and could influence training practices in robot learning.

minor comments (3)

[Abstract] The abstract states that AGRA 'consistently improves' performance but supplies no numerical deltas, baseline names, dataset sizes, or statistical significance values. These details must appear in §4 (Experiments) with tables and error bars so readers can judge effect size.
[§3 (Method)] The description of the alignment loss (AGRA) is given at a high level; the precise form of the feature extractor, the layers chosen for alignment, the distance metric, and the weighting hyper-parameter should be stated explicitly, ideally with an equation in §3.
[§4 (Experiments)] The claim that alignment does not degrade visual prediction quality is important for the central thesis; a quantitative comparison of reconstruction or future-frame metrics (e.g., PSNR, LPIPS) between baseline and AGRA models should be reported.
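The third comment asks for a future-frame metric such as PSNR. As a reference point, the sketch below computes PSNR from its standard definition, 10 · log10(MAX² / MSE), on frames given as 2D lists of pixel values. The function names and the 255 peak value are assumptions for illustration; LPIPS is not sketched because it requires a learned network.

```python
import math


def mse(pred, target):
    """Mean squared error between two frames of the same shape."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    return sum((p - t) ** 2 for p, t in zip(flat_pred, flat_target)) / len(flat_pred)


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    err = mse(pred, target)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)
```

Reporting this value for the same held-out clips under both the baseline and the AGRA-trained model, with matching sampling settings, would let readers check the claim that alignment leaves prediction quality intact.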

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work, the recognition of the diagnostic experiments, and the recommendation for minor revision. The referee's description accurately reflects the core motivation (representation mismatch in WAMs), the proposed AGRA objective, and the reported improvements in action grounding and robustness. No specific major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper's derivation chain consists of an empirical diagnosis (attention analysis and causal interventions on action-head behavior) followed by the introduction of an auxiliary training objective AGRA that aligns diffusion features to an external foundation encoder. No equations, fitted parameters, or self-citations are presented that reduce the claimed improvements to the inputs by construction. The central claims rest on reported performance gains on real-world robot tasks, which are evaluated externally and do not loop back to the proposal itself. This is the most common honest finding for an applied methods paper whose value is demonstrated by experiment rather than internal derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the named method itself; the central claim rests on the unstated premise that foundation-encoder semantics are already action-relevant.

invented entities (1)

AGRA (no independent evidence)
purpose: Action-Grounded Representation Alignment objective to regularize the world-action interface
New training objective introduced in the paper; no independent evidence supplied in the abstract.

pith-pipeline@v0.9.1-grok · 5768 in / 1240 out tokens · 18533 ms · 2026-06-27T09:39:27.876747+00:00 · methodology

Reference graph

Works this paper leans on

66 extracted references · 34 linked inside Pith


[1] [1]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

Pith/arXiv arXiv 2022

[2] [2]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

2023

[3] [3]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

2024

[4] [4]

J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

Pith/arXiv arXiv 2025

[5] [5]

Zheng, J

R. Zheng, J. Wang, S. Reed, J. Bjorck, Y . Fang, F. Hu, J. Jang, K. Kundalia, Z. Lin, L. Magne, et al. Flare: Robot learning with implicit world modeling.arXiv preprint arXiv:2505.15659, 2025

Pith/arXiv arXiv 2025

[6] [6]

H. Bi, H. Tan, S. Xie, Z. Wang, S. Huang, H. Liu, R. Zhao, Y . Feng, C. Xiang, Y . Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

Pith/arXiv arXiv 2025

[7] [7]

M. Team, C. Xiang, F. Bao, H. Liu, H. Tan, H. Bi, J. Li, J. Liu, J. Pang, K. Jing, et al. Mo- tubrain: An advanced world action model for robot control.arXiv preprint arXiv:2604.27792, 2026

Pith/arXiv arXiv 2026

[8] [8]

J. Won, K. Lee, H. Jang, D. Kim, and J. Shin. Dual-stream diffusion for world-model aug- mented vision-language-action model.arXiv preprint arXiv:2510.27607, 2025

Pith/arXiv arXiv 2025

[9] [9]

H. Wu, Y . Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong. Unleashing large- scale video generative pre-training for visual robot manipulation. InInternational Conference on Learning Representations, volume 2024, pages 10641–10662, 2024

2024

[10] [10]

S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

Pith/arXiv arXiv 2026

[11] [11]

L. Li, Q. Zhang, Y . Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Pith/arXiv arXiv 2026

[12] [12]

M. J. Kim, Y . Gao, T.-Y . Lin, Y .-C. Lin, Y . Ge, G. Lam, P. Liang, S. Song, M.-Y . Liu, C. Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026

Pith/arXiv arXiv 2026

[13] J. Pai, L. Achenbach, V. Montesinos, B. Forrai, O. Mees, and E. Nava. mimic-video: Video-action models for generalizable robot control beyond vlas. arXiv preprint arXiv:2512.15692, 2025.

[14] T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.

[15] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[16] N. Agarwal, A. Ali, M. Bala, Y. Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y. Chen, Y. Cui, Y. Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025.

[17] Y. Hu, Y. Guo, P. Wang, X. Chen, Y.-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024.

[18] S. Yu, S. Kwak, H. Jang, J. Jeong, J. Huang, J. Shin, and S. Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024.

[19] B. Zheng, N. Ma, S. Tong, and S. Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.

[20] S. Jha, A. Zholus, S. Chandar, et al. Reconstruction or semantics? what makes a latent space useful for robotic world models. arXiv preprint arXiv:2605.06388, 2026.

[21] W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.

[22] W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.

[23] A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023.

[24] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

[25] Z. Zhu, X. Wang, W. Zhao, C. Min, B. Li, N. Deng, M. Dou, Y. Wang, B. Shi, K. Wang, et al. Is sora a world simulator? a comprehensive survey on general world models and beyond. arXiv preprint arXiv:2405.03520, 2024.

[26] J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.

[27] R. Feng, H. Zhang, Z. Shu, Z. Yang, L. Tang, Z. Wang, A. Zheng, J. Xiao, Z. Liu, R. Chu, et al. The matrix: Infinite-horizon world generation with real-time moving control. Advances in Neural Information Processing Systems, 38:87318–87344, 2026.

[28] J. Yu, J. Bai, Y. Qin, Q. Liu, X. Wang, P. Wan, D. Zhang, and X. Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025.

[29] S. Huang, J. Wu, Q. Zhou, S. Miao, and M. Long. Vid2world: Crafting video diffusion models to interactive world models. arXiv preprint arXiv:2505.14357, 2025.

[30] X. Chi, P. Jia, C.-K. Fan, X. Ju, W. Mi, K. Zhang, Z. Qin, W. Tian, K. Ge, H. Li, et al. Wow: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642, 2025.

[31] G. Team, A. Ye, B. Wang, C. Ni, G. Huang, G. Zhao, H. Li, J. Zhu, K. Li, M. Xu, et al. Gigaworld-0: World models as data engine to empower embodied ai. arXiv preprint arXiv:2511.19861, 2025.

[32] S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y. Dong, K. Mo, C.-H. Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos. arXiv preprint arXiv:2602.06949, 2026.

[33] Y. Shang, X. Zhang, Y. Tang, L. Jin, C. Gao, W. Wu, and Y. Li. Roboscape: Physics-informed embodied world model. Advances in Neural Information Processing Systems, 38:63674–63698, 2026.

[34] J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y. Fang, F. Hu, S. Huang, K. Kundalia, Y.-C. Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models. arXiv preprint arXiv:2505.12705, 2025.

[35] Y. Li, Y. Zhu, J. Wen, C. Shen, and Y. Xu. Worldeval: World model as real-world robot policies evaluator. arXiv preprint arXiv:2505.19017, 2025.

[36] J. Quevedo, A. K. Sharma, Y. Sun, V. Suryavanshi, P. Liang, and S. Yang. Worldgym: World model as an environment for policy evaluation. arXiv preprint arXiv:2506.00613, 2025.

[37] Y. Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-world: A controllable generative world model for robot manipulation. arXiv preprint arXiv:2510.10125, 2025.

[38] C. Finn and S. Levine. Deep visual foresight for planning robot motion. In 2017 IEEE international conference on robotics and automation (ICRA), pages 2786–2793. IEEE, 2017.

[39] Y. Du, S. Yang, B. Dai, H. Dai, O. Nachum, J. Tenenbaum, D. Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation. Advances in neural information processing systems, 36:9156–9172, 2023.

[40] Y. Feng, H. Tan, X. Mao, C. Xiang, G. Liu, S. Huang, H. Su, and J. Zhu. Vidar: Embodied video diffusion model for generalist manipulation. arXiv preprint arXiv:2507.12898, 2025.

[41] J. Liang, P. Tokmakov, R. Liu, S. Sudhakar, P. Shah, R. Ambrus, and C. Vondrick. Video generators are robot policies. arXiv preprint arXiv:2508.00795, 2025.

[42] Y. Liao, P. Zhou, S. Huang, D. Yang, S. Chen, Y. Jiang, Y. Hu, J. Cai, S. Liu, J. Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635, 2025.

[43] T. Ma, J. Zheng, Z. Wang, C. Jiang, A. Cui, J. Liang, and S. Yang. Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control. arXiv preprint arXiv:2603.10448, 2026.

[44] T. Yuan, Z. Dong, Y. Liu, and H. Zhao. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666, 2026.

[45] W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.

[46] C. Wei, K. Mangalam, P.-Y. Huang, Y. Li, H. Fan, H. Xu, H. Wang, C. Xie, A. Yuille, and C. Feichtenhofer. Diffusion models as masked autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16284–16294, 2023.

[47] X. Chen, Z. Liu, S. Xie, and K. He. Deconstructing denoising diffusion models for self-supervised learning. In International Conference on Learning Representations, volume 2025, pages 55458–55472, 2025.

[48] W. Xiang, H. Yang, D. Huang, and Y. Wang. Denoising diffusion autoencoders are unified self-supervised learners. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15802–15812, 2023.

[49] X. Yang and X. Wang. Diffusion model as representation learner. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18938–18949, 2023.

[50] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.

[51] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[52] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023.

[53] J. Singh, X. Leng, Z. Wu, L. Zheng, R. Zhang, E. Shechtman, and S. Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025.

[54] X. Leng, J. Singh, Y. Hou, Z. Xing, S. Xie, and L. Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025.

[55] S. Hwang, H. Jang, K. Kim, M. Park, and J. Choo. Cross-frame representation alignment for fine-tuning video diffusion models. arXiv preprint arXiv:2506.09229, 2025.

[56] X. Zhang, J. Liao, S. Zhang, F. Meng, X. Wan, J. Yan, and Y. Cheng. Videorepa: Learning physics for video generation through relational alignment with foundation models. Advances in Neural Information Processing Systems, 38:122647–122676, 2026.

[57] M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.

[58] G. Zhou, H. Pan, Y. LeCun, and L. Pinto. Dino-wm: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983, 2024.

[59] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

[60] H. Abdi and L. J. Williams. Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010.

[61] R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video. arXiv preprint arXiv:2505.11709, 2025.

[62] Y. Chen, Y. Ge, H. Zhou, M. Ding, Y. Ge, and X. Liu. Dial: Decoupling intent and action via latent world modeling for end-to-end vla. arXiv preprint arXiv:2603.29844, 2026.

[63] S. Li, Y. Gao, D. Sadigh, and S. Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025.

[64] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.

[65] J. Lyu, K. Liu, X. Zhang, H. Liao, Y. Feng, W. Zhu, T. Shen, J. Chen, J. Zhang, Y. Dong, et al. Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion. arXiv preprint arXiv:2602.12215, 2026.

[66] NVIDIA GEAR Team, A. Azzolini, J. Bjorck, V. Blukis, et al. Gr00t n1.6: An improved open foundation model for generalist humanoid robots. https://research.nvidia.com/labs/gear/gr00t-n1_6/, December 2025.