WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

Andrea Bajcsy; Arnav Kumar Jain; Gokul Swamy; Jesse Farebrother; Yilin Wu

arxiv: 2606.13672 · v2 · pith:KBVS24ANnew · submitted 2026-06-11 · 💻 cs.RO

Pith reviewed 2026-06-27 06:21 UTC · model grok-4.3

keywords: world models, robotic manipulation, flow matching, policy evaluation, policy improvement, test-time planning, multi-view prediction, latent forecasting

The pith

WEAVER uses a multi-view flow-matching world model to simulate robotic manipulation with high fidelity, long-term consistency, and speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build a world model that meets three simultaneous requirements for robotics use: simulated trajectories must match real outcomes closely enough to be trusted, must stay coherent over many time steps, and must generate predictions fast enough for repeated use. WEAVER meets these by training on multiple camera views to forecast future latent states and reward signals with a flow-matching objective. The resulting model supports policy evaluation that correlates strongly with real success, lets policies improve through simulated rollouts, and enables faster test-time planning, all while showing gains on tasks that previously challenged world models.

Core claim

WEAVER is a multi-view world model trained to predict future latents and reward values via a flow-matching loss. It simultaneously satisfies three requirements: fidelity (ρ = 0.870 correlation with real-world success rates), consistency over long horizons for dynamic manipulation, and efficiency (5-10× speedup over prior world models). On robotic hardware this yields a 38 percent real-world success improvement for policy improvement and a 14 percent improvement for test-time planning, with better out-of-distribution results than earlier approaches.

What carries the argument

Multi-view architecture trained with flow-matching loss to predict future latents and rewards, allowing joint satisfaction of fidelity, consistency, and efficiency.
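To make the training objective concrete, here is a minimal sketch of a conditional flow-matching loss on latent vectors. It assumes the common straight-line (rectified-flow) path between a noise sample and the target latent; the paper's exact path, conditioning, and network are not given in this summary, so `flow_matching_loss`, `zero_velocity`, and all shapes are illustrative assumptions rather than WEAVER's implementation.

```python
import numpy as np


def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Monte Carlo estimate of a conditional flow-matching loss.

    velocity_fn: callable (x_t, t, cond) -> predicted velocity, same shape as x_t
    x1:          target future latents, shape (batch, dim)
    cond:        conditioning features (e.g. history, actions), shape (batch, cdim)
    rng:         numpy random Generator
    """
    x0 = rng.standard_normal(x1.shape)       # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))   # one interpolation time per example
    x_t = (1.0 - t) * x0 + t * x1            # point on the straight-line path
    target = x1 - x0                         # velocity of that path (constant in t)
    pred = velocity_fn(x_t, t, cond)
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))


def zero_velocity(x_t, t, cond):
    """Hypothetical stand-in for a trained network: always predicts zero."""
    return np.zeros_like(x_t)


rng = np.random.default_rng(0)
latents = rng.standard_normal((8, 4))   # made-up "future latents"
context = rng.standard_normal((8, 3))   # made-up conditioning features
loss = flow_matching_loss(zero_velocity, latents, context, rng)
print(f"loss with a zero predictor: {loss:.3f}")
```

In a real model `velocity_fn` would be a neural network, and a reward head could be trained alongside it; how WEAVER combines the two is not specified here.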

If this is right

Policy evaluation can rely on simulated rollouts with 0.870 correlation to real outcomes instead of extensive hardware trials.
Policy improvement on top of a foundation model reaches 38 percent higher real success rates through simulated data.
Test-time planning gains 14 percent success with 5-10 times faster simulation than earlier world models.
Performance remains higher than prior world models when tested on out-of-distribution manipulation scenarios.
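The policy-evaluation claim compares success rates measured inside the world model with success rates measured on hardware. The sketch below shows how such a correlation could be computed, assuming ρ denotes the Pearson coefficient (this summary does not state which coefficient is used). The success rates are made-up placeholders, not numbers from the paper.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.ndim != 1:
        raise ValueError("x and y must be 1-D and the same length")
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    if denom == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return float(np.sum(xc * yc) / denom)


# Placeholder data: one entry per (policy, task) pair.
simulated_success = [0.20, 0.55, 0.90, 0.40, 0.75]   # measured in the world model
real_success = [0.25, 0.50, 0.85, 0.45, 0.60]        # measured on hardware
print(f"rho = {pearson_r(simulated_success, real_success):.3f}")
```

A rank-based coefficient such as Spearman's would be a reasonable alternative when only the ordering of policies matters.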

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-view flow-matching design could reduce real-world data needs for other robot learning settings that require long-horizon prediction.
Replacing standard prediction losses with flow-matching might improve coherence in world models used for non-manipulation tasks.
Embedding this style of world model into additional robot foundation models could further cut the amount of physical interaction required for training.

Load-bearing premise

The measured correlation between simulated trajectories and real success rates carries over to policy improvement and planning experiments on the tested hardware and tasks without further adjustments.

What would settle it

Replicating the policy improvement experiments on new hardware or tasks and measuring no statistically significant real-world success gains would show the simulation-to-reality transfer does not hold as claimed.

Figures

Figures reproduced from arXiv: 2606.13672 by Andrea Bajcsy, Arnav Kumar Jain, Gokul Swamy, Jesse Farebrother, Yilin Wu.

**Figure 1.** We present WEAVER, a WM that satisfies three desiderata: (i) high fidelity, (ii) long-horizon consistency, and (iii) efficient generation. With these, we unlock the potential for downstream policy evaluation (middle), policy improvement (top right), and test-time planning (bottom right). Despite rapid progress, no existing robot WM satisfies all three desiderata in tandem. For example, video generation model…

**Figure 2.** WEAVER Architecture. Left: The world model encodes memory, history, and action sequences to imagine future rollouts in latent space. Middle: The latent verifier, equipped with reward and critic heads, selects samples with high advantage to steer the policy distribution. Right: Decoded generation corresponding to different outcomes of action sequences.

**Figure 3.** We report FID at various horizon lengths and find that WEAVER is consistently better at long-horizon rollouts. Datasets & Tasks. To align the world model with the data distribution of the base policy, we first pre-train the WEAVER world model on the DROID dataset and then fine-tune it on our real-world setup. We collect data to fine-tune the world model $D_{\text{FT}}^{\text{real}}$ by running π0.5 for five real-world manipu…

**Figure 4.** Reward Prediction & Test-time Planning with Advantage Filtering. (Left) Predicted rewards from WEAVER match the Robometer reward over the trajectory. (Right) The highlighted action sample is the one with the best advantage value and the best outcome in WEAVER’s imagination.

**Figure 5.** We present FVD vs inference time (in seconds) for … (Plot panels include DROID: Exterior and DROID: Wrist; axes are FVD against Inference Time (s).)

**Figure 6.** Policy Evaluation. We compare performance across different policies and world models. (Left) For PnP Towel, only WEAVER and WEAVER-FT accurately imagine the towel inside the basket. For Pour Beans, only WEAVER-FT captures the beans scattering across the table. (Right) Evaluation inside WEAVER-FT attains an impressively high correlation of success rate with the real world.

**Figure 7.** (Left) Policy Improvement with Finetuning. We finetune π0.5 with multiple data sources and see that combining real and synthetic (Syn) data obtained with WEAVER outperforms other variants. (Right) Data Scaling for Policy Improvement. We ablate the number of segments in synthetic data for finetuning and report the success rate across 20 trials for the Pour Beans task.

**Figure 8.** Policy Improvement Results. We present real rollouts from the base policy and the policy finetuned with synthetic data. Finetuning on synthetic data generated by WEAVER leads to improved policy performance and more successful task execution compared to the base policy. … 4% average performance gap. This indicates that our synthetic data is of such high quality that it unlocks similar policy improvement to …

**Figure 9.** We demonstrate that test-time steering with WEAVER outperforms the base policy π0.5 by 14% when averaged across all five tasks. Setup. We use π0.5 as the base policy and sample a batch of action chunks. For each chunk, WEAVER imagines latents of future states and evaluates the advantage using the reward and critic heads. This reduces the cost of decoding predicted observations and querying external VLM judges.…

**Figure 10.** Hardware setup and tasks. Left: the robot setup with cameras. Right: the five tasks, with the top row showing the initial state and the bottom row showing one of the goal configurations. A1.2 Action Space. The π0.5 base policy on the DROID setup outputs joint-velocity commands for control. To match this action representation, we define the action space of our world model in joint space, avoiding potent…

**Figure 11.** We compare FID vs inference time for WEAVER and Ctrl-World and find that WEAVER outperforms the baseline with up to 16× more inference time.

**Figure 12.** Policy Evaluation Results. We show policy evaluation results for all five tasks across three world models. We provide the full policy evaluation rollouts in …

**Figure 13.** Policy Improvement Results. We show rollouts on five tasks for the base policy and the policy finetuned with synthetic data. With WEAVER-generated synthetic data, policy finetuning improves on all tasks. A5.1 Partial Observability. Our world model relies primarily on visual observations, which provide only partial access to the underlying physical state. During manipulation, task-relevant informa…

**Figure 14.** We compare the rollouts on task obtained from …

**Figure 15.** We compare the rollouts on task obtained from …

**Figure 16.** We compare the rollouts on task obtained from …

**Figure 17.** We compare the rollouts on task obtained from …

**Figure 18.** We compare the rollouts on task obtained from …

**Figure 19.** We compare the rollouts on Pour Beans task obtained from …

**Figure 20.** We compare the rollouts on Pour Beans task obtained from …
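The Figure 9 caption describes test-time steering as sampling a batch of action chunks, imagining future latents for each, and scoring them with the reward and critic heads. A minimal sketch of that selection loop follows. The advantage is taken here to be the discounted sum of predicted rewards plus a bootstrapped value at the final latent, minus the value at the current latent; this formula, the discount `gamma`, and the toy `toy_rollout`, `toy_reward`, and `toy_value` functions are assumptions for illustration, not the paper's definitions.

```python
import numpy as np


def select_action_chunk(z0, action_chunks, rollout_fn, reward_fn, value_fn, gamma=0.99):
    """Score each candidate chunk in latent space and return the best index.

    rollout_fn(z0, chunk) -> list of predicted latents, one per action
    reward_fn(z)          -> scalar predicted reward for a latent
    value_fn(z)           -> scalar critic estimate for a latent
    """
    v0 = value_fn(z0)
    advantages = []
    for chunk in action_chunks:
        latents = rollout_fn(z0, chunk)
        # Discounted predicted rewards along the imagined rollout.
        ret = sum(gamma ** k * reward_fn(z) for k, z in enumerate(latents))
        # Bootstrap from the critic at the last imagined latent.
        ret += gamma ** len(latents) * value_fn(latents[-1])
        advantages.append(ret - v0)
    advantages = np.asarray(advantages, dtype=float)
    return int(np.argmax(advantages)), advantages


# Toy stand-ins (hypothetical): a 1-D latent where actions are displacements.
GOAL = 1.0


def toy_rollout(z0, chunk):
    return list(z0 + np.cumsum(chunk))


def toy_reward(z):
    return -abs(z - GOAL)


def toy_value(z):
    return 0.0


chunks = [np.array([0.1, 0.1]), np.array([0.5, 0.5]), np.array([-0.2, -0.2])]
best, adv = select_action_chunk(0.0, chunks, toy_rollout, toy_reward, toy_value)
print("selected chunk:", best)
print("advantages:", adv)
```

Because scoring happens on latents, no frames are decoded inside the loop, which matches the cost saving the caption mentions.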

Abstract

The potential impacts of world models (WMs, i.e., learned simulators) on robotics are far-reaching -- policy evaluation, policy improvement, and test-time planning -- all with limited real-world interaction. To unlock these downstream capabilities, a WM needs to jointly satisfy three desiderata: $\textit{(i)}$ fidelity (i.e., producing simulated trajectories that correlate with reality), $\textit{(ii)}$ consistency (i.e., producing simulated trajectories that are coherent over long horizons), and $\textit{(iii)}$ efficiency (i.e., producing simulated trajectories quickly). We propose WEAVER (World Estimation Across Views for Embodied Reasoning): a WM architecture that simultaneously achieves all three desiderata, providing state-of-the-art results on robotic manipulation tasks. WEAVER is a multi-view WM trained to predict future latents and reward values via a flow-matching loss. We distill the key design decisions across model architecture, memory, and prediction objectives required to unlock the kinds of long-horizon dynamic manipulation tasks that have confounded prior world modeling approaches. We apply WEAVER in robotic hardware, demonstrating its effectiveness at policy evaluation ($\rho$=0.870 correlation with real-world success rate), policy improvement (real-world success rate improvement of $38\%$ on top of the $\pi_{0.5}$ robot foundation model), and test-time planning (real-world success rate improvement of $14\%$ with a $5-10\times$ speedup over prior WMs). WEAVER also demonstrates better performance than prior WMs when evaluated on out-of-distribution scenarios. Code, models, and videos at: https://arnavkj1995.github.io/WEAVER/ .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WEAVER shows usable hardware gains from a multi-view flow-matching world model on long-horizon manipulation, with the main open question being how much the architecture versus the distilled choices actually drive the reported numbers.

The paper introduces WEAVER, a multi-view latent world model trained with flow-matching on both states and rewards. It reports a 0.87 correlation between simulated trajectories and real success rates, a 38% lift in real-world policy success over a foundation model baseline, and a 14% planning improvement at 5-10x speedup. The authors also claim better out-of-distribution behavior than prior world models.

What the work does well is close the loop on real hardware for all three claimed uses: policy evaluation, improvement, and test-time planning. Distilling concrete choices around architecture, memory, and objectives for dynamic long-horizon tasks is a practical step that earlier models apparently did not get right. The downstream metrics are tied directly to robot performance rather than purely simulated benchmarks.

The soft spots are mostly in the level of detail available. The abstract gives point estimates without error bars or clear statements on trial counts and task selection for the correlation metric. The transfer from the policy-evaluation correlation to the improvement and planning results rests on an assumption that needs the full methods section to evaluate. No internal contradictions or circular definitions appear in the description, but the relative contribution of the multi-view structure versus the flow-matching loss versus the distilled heuristics is not broken out here.

This paper is for people working on world models that must run fast enough for planning on real robots. A reader who cares about reducing real-world rollouts for manipulation would get concrete numbers to compare against. The empirical grounding is strong enough that it deserves a serious referee even if some ablations end up looking thin.

Referee Report

2 major / 2 minor

Summary. The paper proposes WEAVER, a multi-view world model for robotic manipulation trained via flow-matching to predict future latents and rewards. It claims to jointly achieve fidelity (ρ=0.870 correlation with real-world success rates), consistency over long horizons, and efficiency, enabling state-of-the-art results in policy evaluation, 38% real-world success rate improvement for policy improvement on top of π_{0.5}, and 14% improvement with 5-10× speedup in test-time planning, plus better out-of-distribution performance.

Significance. If the empirical claims hold with proper statistical support, WEAVER would represent a meaningful step forward for world models in robotics by distilling architectural and objective choices that address long-horizon manipulation challenges, directly linking the three desiderata to measurable gains in policy improvement and planning with limited real-world data. The public release of code, models, and videos strengthens reproducibility.

major comments (2)

[Experiments] Experiments section: the fidelity claim rests on ρ=0.870 correlation with real-world success rate, yet no error bars, number of independent trials, data selection criteria, or verification procedure for the metric are provided; this directly affects assessment of whether the correlation supports the reported 38% and 14% downstream gains.
[Policy Improvement and Test-time Planning] Policy improvement and test-time planning results: the manuscript does not include an explicit analysis or ablation demonstrating that the measured correlation transfers to the observed success-rate improvements without unstated task- or hardware-specific adjustments, which is load-bearing for the central claim that fidelity + consistency + efficiency jointly enable the gains.
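One way to supply the error bars requested in the first major comment is a percentile bootstrap over the (simulated, real) success-rate pairs. The sketch below is an illustration under stated assumptions: the pairing unit, the number of resamples, and the data are all placeholders, and nothing here reflects how the authors computed or plan to report ρ.

```python
import numpy as np


def bootstrap_corr_ci(sim, real, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap interval for Pearson correlation."""
    sim = np.asarray(sim, dtype=float)
    real = np.asarray(real, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(sim)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample pairs with replacement
        s, r = sim[idx], real[idx]
        if s.max() == s.min() or r.max() == r.min():
            continue                       # correlation undefined for constant data
        stats.append(np.corrcoef(s, r)[0, 1])
    if not stats:
        raise ValueError("no valid bootstrap resamples")
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    point = np.corrcoef(sim, real)[0, 1]
    return float(point), float(lo), float(hi)


# Placeholder success rates, one pair per evaluated (policy, task) combination.
sim_rates = [0.2, 0.5, 0.9, 0.4, 0.7, 0.1]
real_rates = [0.3, 0.4, 0.8, 0.5, 0.6, 0.2]
point, lo, hi = bootstrap_corr_ci(sim_rates, real_rates)
print(f"rho = {point:.3f}, 95% interval = [{lo:.3f}, {hi:.3f}]")
```

With only a handful of pairs the interval will be wide, which is itself useful information when judging how much weight a single point estimate can bear.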

minor comments (2)

[Method] Notation for the flow-matching loss and latent prediction objectives could be clarified with an explicit equation reference to distinguish from prior flow-matching formulations.
[Figures] Figure captions for qualitative trajectory comparisons should explicitly state the horizon length and number of rollouts shown to aid assessment of consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating where revisions will strengthen the manuscript.

Referee: [Experiments] Experiments section: the fidelity claim rests on ρ=0.870 correlation with real-world success rate, yet no error bars, number of independent trials, data selection criteria, or verification procedure for the metric are provided; this directly affects assessment of whether the correlation supports the reported 38% and 14% downstream gains.

Authors: We agree that these statistical details are required for rigorous assessment. In the revised manuscript we will add error bars computed across independent trials, report the exact number of trials and seeds, specify data selection criteria for the correlation computation, and describe the verification procedure. These additions will clarify the reliability of ρ=0.870 and its relation to the reported downstream gains.

Revision: yes

Referee: [Policy Improvement and Test-time Planning] Policy improvement and test-time planning results: the manuscript does not include an explicit analysis or ablation demonstrating that the measured correlation transfers to the observed success-rate improvements without unstated task- or hardware-specific adjustments, which is load-bearing for the central claim that fidelity + consistency + efficiency jointly enable the gains.

Authors: The experiments separately establish the correlation via policy evaluation and demonstrate the success-rate gains via dedicated policy-improvement and test-time-planning sections on the same hardware and tasks. We acknowledge that an explicit linking analysis would strengthen the argument. In revision we will add a dedicated discussion that examines how the fidelity metric aligns with the observed improvements and notes any task- or hardware-specific factors; we will also include any feasible ablations using existing data.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity

The paper presents WEAVER as an empirical architecture for world models in robotics, trained via flow-matching on multi-view latents and rewards, with performance validated through direct hardware experiments measuring correlation (ρ=0.870), policy improvement (+38% real-world success), and test-time planning (+14% success at a 5-10× speedup). No equations, derivations, or self-referential definitions appear that would reduce any claimed prediction or result to its own fitted inputs by construction. The three desiderata (fidelity, consistency, efficiency) are tied to downstream metrics via external benchmarks rather than internal renaming or self-citation chains. The central claims rest on reproducible hardware comparisons outside any fitted parameter loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no visibility into explicit free parameters, axioms, or invented entities; the central claims rest on empirical demonstration rather than new theoretical constructs.


Reference graph

Works this paper leans on

54 extracted references · 20 canonical work pages · 12 internal anchors

[1]

Thinking fast and slow with deep learning and tree search

Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. InNeural Information Processing Systems (NeurIPS), 2017

2017
[2]

Self-supervised learning from images with a joint- embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint- embedding predictive architecture. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, (CVPR), 2023

2023
[3]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.CoRR, abs/2506.09985, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.CoRR, abs/2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Robust noise attenuation via adaptive pooling of transformer outputs

Greyson Brothers. Robust noise attenuation via adaptive pooling of transformer outputs. In International Conference on Learning Representations (ICLR), 2025

2025
[6]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InInternational Conference on Machine Learning (ICML), 2024

2024
[7]

Diffusion forcing: Next-token prediction meets full-sequence diffusion

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. InNeural Information Processing Systems (NeurIPS), 2024

2024
[8]

arXiv:2510.02387 (2025)

Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, et al. Cwm: An open-weights llm for research on code generation with world models.CoRR, abs/2510.02387, 2025

work page arXiv 2025
[9]

Scaling rectified flow trans- formers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InInternational Conference on Machine Learning (ICML), 2024

2024
[10]

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.CoRR, abs/2602.06949, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

Vlaw: Iterative co-improvement of vision-language-action policy and world model.CoRR, abs/2602.12063, 2026

Yanjiang Guo, Tony Lee, Lucy Xiaoyang Shi, Jianyu Chen, Percy Liang, and Chelsea Finn. Vlaw: Iterative co-improvement of vision-language-action policy and world model.CoRR, abs/2602.12063, 2026

work page arXiv 2026
[12]

Ctrl-world: A controllable generative world model for robot manipulation

Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation. InInternational Conference on Learning Representations (ICLR), 2026

2026
[13]

World Models

David Ha and Jürgen Schmidhuber. World models.CoRR, abs/1803.10122, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

Dream to control: Learning behaviors by latent imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. InInternational Conference on Learning Representa- tions (ICLR), 2020

2020
[15]

Mastering atari with discrete world models

Danijar Hafner, Timothy P Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. InInternational Conference on Learning Representations (ICLR), 2021

2021
[16]

Training Agents Inside of Scalable World Models

Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models.CoRR, abs/2509.24527, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Temporal difference learning for model predictive control

Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. InInternational Conference on Machine Learning (ICML), 2022. 11

2022
[18]

Matrix-game 2.0: An open-source real-time and streaming interactive world model

Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model.CoRR, abs/2508.13009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Query-key normalization for transformers

Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2020

2020
[20]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InNeural Information Processing Systems (NeurIPS), 2017

2017
[21]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi0.5: a vision- language-action model with open-world generalization.CoRR, abs/2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Learning robust dynamics through variational sparse gating

Arnav Kumar Jain, Shiva Kanth Sujit, Shruti Joshi, Vincent Michalski, Danijar Hafner, and Samira Ebrahimi Kahou. Learning robust dynamics through variational sparse gating. InNeural Information Processing Systems (NeurIPS), 2022

2022
[23]

A smooth sea never made a skilled SAILOR: Robust imitation via learning to search

Arnav Kumar Jain, Vibhakar Mohta, Subin Kim, Atiksh Bhardwaj, Juntao Ren, Yunhai Feng, Sanjiban Choudhury, and Gokul Swamy. A smooth sea never made a skilled SAILOR: Robust imitation via learning to search. InNeural Information Processing Systems (NeurIPS), 2026

2026
[24]

EnerVerse-AC: Envisioning embodied environments with action condition.arXiv preprint arXiv:2505.09723, 2025

Yuxin Jiang, Shengcong Chen, Siyuan Huang, Liliang Chen, Pengfei Zhou, Yue Liao, Xindong He, Chiming Liu, Hongsheng Li, Maoqing Yao, et al. Enerverse-ac: Envisioning embodied environments with action condition.CoRR, abs/2505.09723, 2025

work page arXiv 2025
[25]

Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, and Jesse Zhang

Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, and Jesse Zhang. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons. InRobotics: Science and Systems 2...

2026
[26]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations (ICLR), 2024

2024
[27]

Flow matching for generative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023

2023
[28]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInternational Conference on Learning Representations (ICLR), 2023

2023
[29]

Video generation models in robotics-applications, research challenges, future directions.CoRR, abs/2601.07823, 2026

Zhiting Mei, Tenny Yin, Ola Shorinwa, Apurva Badithela, Zhonghe Zheng, Joseph Bruno, Madison Bland, Lihan Zha, Asher Hancock, Jaime Fernández Fisac, et al. Video generation models in robotics-applications, research challenges, future directions.CoRR, abs/2601.07823, 2026

work page arXiv 2026
[30]

Sprint: Sparse-dense residual fusion for efficient diffusion transformers

Dogyun Park, Moayed Haji-Ali, Yanyu Li, Willi Menapace, Sergey Tulyakov, Hyunwoo J Kim, Aliaksandr Siarohin, and Anil Kag. Sprint: Sparse-dense residual fusion for efficient diffusion transformers. InInternational Conference on Learning Representations (ICLR), 2026

2026
[31]

Notes on the history of correlation.Biometrika, 13(1):25–45, 1920

Karl Pearson. Notes on the history of correlation.Biometrika, 13(1):25–45, 1920

1920
[32]

Inference-time enhancement of generative robot policies via predictive world modeling.IEEE Robotics and Automation Letters, 11(5):5534–5541, 2026

Han Qi, Haocheng Yin, Aris Zhu, Yilun Du, and Heng Yang. Inference-time enhancement of generative robot policies via predictive world modeling.IEEE Robotics and Automation Letters, 11(5):5534–5541, 2026

2026
[33]

Worldgym: World model as an environment for policy evaluation

Julian Quevedo, Ansh Kumar Sharma, Yixiang Sun, Varad Suryavanshi, Percy Liang, and Sherry Yang. Worldgym: World model as an environment for policy evaluation. InInternational Conference on Learning Representations (ICLR), 2026. 12

2026
[34]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), 2021

2021
[35]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

2022
[36]

GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving (2025).CoRR, abs/2503.20523, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

World-gymnast: Training robots with reinforcement learning in a world model.CoRR, abs/2602.02454, 2026

Ansh Kumar Sharma, Yixiang Sun, Ninghao Lu, Yunzhe Zhang, Jiarao Liu, and Sherry Yang. World-gymnast: Training robots with reinforcement learning in a world model.CoRR, abs/2602.02454, 2026

work page arXiv 2026
[38]

GLU Variants Improve Transformer

Noam Shazeer. Glu variants improve transformer.CoRR, abs/2002.05202, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2002
[39]

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

[40]

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.

[41]

DROID Team. DROID: A large-scale in-the-wild robot manipulation dataset. In Robotics: Science and Systems (RSS), 2024.

[42]

Gemini Robotics Team, Krzysztof Choromanski, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Abhishek Jindal, Thomas Kipf, Sean Kirmani, Isabel Leal, et al. Evaluating Gemini Robotics policies in a Veo world simulator. CoRR, abs/2512.10675, 2025.

[43]

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. CoRR, abs/1812.01717, 2018.

[44]

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

[45]

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. DriveDreamer: Towards real-world-drive world models for autonomous driving. In European Conference on Computer Vision (ECCV), 2024.

[46]

Yixuan Wang, Rhythm Syed, Fangyu Wu, Mengchao Zhang, Aykut Onol, Jose Barreiros, Hooshang Nayyeri, Tony Dear, Huan Zhang, and Yunzhu Li. Interactive world simulator for robot policy training and evaluation. CoRR, abs/2603.08546, 2026.

[47]

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. CoRR, abs/2509.20328, 2025.

[48]

Clark Wissler. The Spearman correlation formula. Science, 22(558):309–311, 1905.

[49]

Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. DayDreamer: World models for physical robot learning. In Conference on Robot Learning (CoRL), 2023.

[50]

Yilin Wu, Ran Tian, Gokul Swamy, and Andrea Bajcsy. From foresight to forethought: VLM-in-the-loop policy steering via latent alignment. In Robotics: Science and Systems (RSS), 2025.

[51]

Tenny Yin, Zhiting Mei, Zhonghe Zheng, Miyu Yamane, David Wang, Jade Sceats, Samuel M Bateman, Lihan Zha, Apurva Badithela, Ola Shorinwa, et al. PlayWorld: Learning robot world models from autonomous play. CoRR, abs/2603.09030, 2026.

[52]

Biao Zhang and Rico Sennrich. Root mean square layer normalization. In Neural Information Processing Systems (NeurIPS), 2019.

[53]

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[54]

Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. In International Conference on Machine Learning (ICML), 2025.
