Recognition: 3 Lean theorem links
World Simulation with Video Foundation Models for Physical AI
Pith reviewed 2026-05-12 22:56 UTC · model grok-4.3
The pith
Cosmos-Predict2.5 unifies text, image, and video inputs into controllable world generation for robotics simulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cosmos-Predict2.5 is a flow-based model that unifies Text2World, Image2World, and Video2World generation while integrating Cosmos-Reason1 for richer text grounding and finer control. Trained on 200M curated video clips and refined with RL post-training, it produces substantial gains in video quality and instruction alignment over Cosmos-Predict1 at 2B and 14B scales. Paired with Cosmos-Transfer2.5, a world-translation model 3.5 times smaller than Cosmos-Transfer1, the family is presented as a set of open tools that enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for embodied intelligence.
What carries the argument
The flow-based architecture that unifies Text2World, Image2World, and Video2World generation, augmented by integration with Cosmos-Reason1 for text grounding and control.
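To make the unification concrete, here is a minimal sketch (in PyTorch, not the released Cosmos-Predict2.5 code) of the kind of shared objective a flow-based world model can train on: a rectified-flow regression in which Text2World, Image2World, and Video2World differ only in which latent positions are clamped to conditioning frames. All module names, shapes, and the conditioning scheme are illustrative assumptions.

import torch
import torch.nn as nn

class TinyWorldFlow(nn.Module):
    """Toy velocity-field predictor over flattened video latents."""
    def __init__(self, latent_dim=64, text_dim=128):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x_t, t, text_emb, cond_mask, cond_frames):
        # Conditioning frames overwrite the matching latent positions;
        # an all-zero mask reduces the model to pure Text2World.
        x_t = torch.where(cond_mask.bool(), cond_frames, x_t)
        ctx = self.text_proj(text_emb)   # text / Reason1-style grounding
        h = torch.cat([x_t, ctx, t.unsqueeze(-1)], dim=-1)
        return self.net(h)

def flow_matching_loss(model, x1, text_emb, cond_mask, cond_frames):
    """Rectified-flow objective: regress the straight-line velocity x1 - x0."""
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # per-sample time
    x_t = (1 - t[:, None]) * x0 + t[:, None] * x1  # linear interpolation
    v_pred = model(x_t, t, text_emb, cond_mask, cond_frames)
    keep = 1.0 - cond_mask                         # ignore clamped positions
    return (((v_pred - (x1 - x0)) ** 2) * keep).mean()

if __name__ == "__main__":
    B, D, T = 4, 64, 128
    model = TinyWorldFlow(latent_dim=D, text_dim=T)
    x1 = torch.randn(B, D)                         # "clean" video latents
    text = torch.randn(B, T)                       # text-conditioning embedding
    mask = torch.zeros(B, D)
    mask[2:, :8] = 1.0                             # rows 2-3: Image2World / Video2World
    cond = torch.randn(B, D) * mask
    loss = flow_matching_loss(model, x1, text, mask, cond)
    loss.backward()
    print(float(loss))

The same loss serves all three modes; only the conditioning mask changes, which is the sense in which a single checkpoint can handle text-, image-, and video-conditioned generation.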
If this is right
- More reliable synthetic data can be generated for training physical AI systems.
- Policy evaluation becomes feasible inside longer, higher-fidelity simulated episodes.
- Closed-loop simulation supports iterative testing of robotics and autonomous driving agents.
- A smaller control-net style model delivers robust Sim2Real and Real2Real video translation.
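For the last bullet, a minimal sketch of the generic control-net recipe (assuming the standard ControlNet pattern; the actual Cosmos-Transfer2.5 architecture may differ): a frozen base generator plus a trainable control branch whose zero-initialized residuals steer generation toward a source video such as a simulator rendering.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BaseBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, h, residual=None):
        h = F.silu(self.layer(h))
        return h if residual is None else h + residual

class ControlBranch(nn.Module):
    """Encodes the control video and emits one residual per base block."""
    def __init__(self, dim=64, n_blocks=3):
        super().__init__()
        self.encode = nn.Linear(dim, dim)
        # Zero-initialized projections: at step 0 the translator reproduces
        # the frozen base model exactly, then learns to follow the control.
        self.zero_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_blocks)])
        for z in self.zero_proj:
            nn.init.zeros_(z.weight)
            nn.init.zeros_(z.bias)

    def forward(self, control):
        c = F.silu(self.encode(control))
        return [z(c) for z in self.zero_proj]

class ControlledGenerator(nn.Module):
    def __init__(self, dim=64, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList([BaseBlock(dim) for _ in range(n_blocks)])
        self.control = ControlBranch(dim, n_blocks)
        for p in self.blocks.parameters():          # base generator stays frozen
            p.requires_grad_(False)

    def forward(self, noisy_latent, control_video):
        residuals = self.control(control_video)
        h = noisy_latent
        for block, r in zip(self.blocks, residuals):
            h = block(h, r)
        return h

if __name__ == "__main__":
    gen = ControlledGenerator()
    sim = torch.randn(2, 64)      # latent of a simulated (Sim2Real) or real (Real2Real) clip
    noisy = torch.randn(2, 64)
    print(gen(noisy, sim).shape)  # torch.Size([2, 64])

Because only the control branch is trained, such a translator can be kept small relative to the generator it steers.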
Where Pith is reading between the lines
- Open release of the models and benchmarks could let independent groups build hybrid training pipelines that mix simulated and real data more aggressively.
- If the simulated worlds hold up under long-horizon prediction, they may reduce the volume of real-world robot trials needed during development.
- The same generation stack might later support multi-agent or multi-view scenarios once the training distribution expands.
Load-bearing premise
That gains in generated video quality and instruction following will produce simulations accurate enough for downstream robotics tasks such as policy evaluation.
What would settle it
A controlled test in which robot policies trained or evaluated inside Cosmos-Predict2.5 simulations show no measurable improvement in real-world success rate compared with policies trained inside prior simulators or with real data.
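A hedged sketch of how such a settling experiment could be scored, assuming matched real-robot rollouts per condition and a standard two-proportion z-test; the trial counts below are placeholders, not reported results.

import math

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: equal real-world success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

if __name__ == "__main__":
    # Hypothetical counts: 100 real-world rollouts per condition.
    z, p = two_proportion_ztest(success_a=74, n_a=100,   # policies evaluated in Cosmos-Predict2.5
                                success_b=61, n_b=100)   # policies evaluated in a prior simulator
    print(f"z = {z:.2f}, p = {p:.3f}")

"No measurable improvement" would then mean a small effect size with p above a pre-registered threshold across matched tasks and embodiments.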
read the original abstract
We introduce [Cosmos-Predict2.5], the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, [Cosmos-Predict2.5] unifies Text2World, Image2World, and Video2World generation in a single model and leverages [Cosmos-Reason1], a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, [Cosmos-Predict2.5] achieves substantial improvements over [Cosmos-Predict1] in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with [Cosmos-Transfer2.5], a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5× smaller than [Cosmos-Transfer1], it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish [Cosmos-Predict2.5] and [Cosmos-Transfer2.5] as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Cosmos-Predict2.5, a flow-based world foundation model for Physical AI that unifies Text2World, Image2World, and Video2World generation in a single architecture. It incorporates Cosmos-Reason1 for richer text grounding and control, trains on 200M curated video clips with reinforcement learning post-training, and releases 2B and 14B parameter models. The authors claim substantial improvements over Cosmos-Predict1 in video quality and instruction alignment, enabling more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics. They also present Cosmos-Transfer2.5, a control-net style model for Sim2Real and Real2Real translation that is 3.5x smaller than Cosmos-Transfer1 yet delivers higher fidelity, and release source code, pretrained checkpoints, and benchmarks under an open license.
Significance. If the claimed gains in controllability and physical fidelity are substantiated, the work could meaningfully advance embodied AI by supplying scalable open-source world simulators for robotics research. The explicit release of code, checkpoints, and curated benchmarks under the NVIDIA Open Model License is a concrete strength that supports reproducibility and community adoption.
major comments (2)
- [Abstract] The central claim that Cosmos-Predict2.5 'achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment' is presented without any quantitative metrics, baselines, ablation studies, or evaluation details. This directly undermines assessment of the downstream assertion that the models enable 'more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.'
- [Training and post-training description] No ablations isolate the contributions of the flow-based unification, 200M-clip curation, RL post-training, or Cosmos-Reason1 integration to any performance metric. Without such evidence, the weakest assumption—that these elements produce simulations sufficiently accurate and controllable for policy evaluation—remains untested.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We agree that the abstract and training sections can be strengthened with more explicit quantitative support and component analysis. We will revise the manuscript accordingly while preserving the core contributions.
read point-by-point responses
-
Referee: [Abstract] The central claim that Cosmos-Predict2.5 'achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment' is presented without any quantitative metrics, baselines, ablation studies, or evaluation details. This directly undermines assessment of the downstream assertion that the models enable 'more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.'
Authors: We agree that the abstract should include concrete quantitative metrics to support the claims of improvement. In the revised version, we will expand the abstract to report key metrics from the experimental evaluation, including specific gains in video quality (e.g., perceptual and temporal consistency scores) and instruction alignment (e.g., text-video matching accuracy) relative to Cosmos-Predict1, along with brief mention of the evaluation protocols used. These details already appear in the body of the paper and will now be summarized upfront to allow readers to better assess the downstream utility for synthetic data generation and policy evaluation. Revision: yes.
-
Referee: [Training and post-training description] No ablations isolate the contributions of the flow-based unification, 200M-clip curation, RL post-training, or Cosmos-Reason1 integration to any performance metric. Without such evidence, the weakest assumption—that these elements produce simulations sufficiently accurate and controllable for policy evaluation—remains untested.
Authors: We acknowledge that isolating the individual contributions of the flow-based unification, data curation scale, RL post-training, and Cosmos-Reason1 integration would provide stronger evidence. The current manuscript demonstrates overall gains via direct comparisons to Cosmos-Predict1, but we agree targeted ablations would be valuable. In the revision, we will add a dedicated ablation subsection (or supplementary material) that quantifies the incremental impact of each component on metrics such as video fidelity and controllability. This will directly address the concern about untested assumptions for policy evaluation use cases. Revision: partial.
Circularity Check
No circularity: purely empirical model description and release
full rationale
The paper introduces Cosmos-Predict2.5 and Cosmos-Transfer2.5 as trained video foundation models, describing their flow-based architecture, training on 200M clips, RL post-training, integration with Cosmos-Reason1, and empirical improvements in quality/alignment. No mathematical derivation chain, predictive equations, uniqueness theorems, or fitted parameters are presented that could reduce to inputs by construction. Claims rest on training procedures and released checkpoints rather than any self-referential logic. This matches the default expectation of a non-circular empirical release paper.
Axiom & Free-Parameter Ledger
free parameters (2)
- model scales = 2B, 14B
- training dataset size = 200M curated video clips
axioms (2)
- domain assumption: Flow-based generative models can capture physical dynamics in video sufficiently well for robotics simulation
- domain assumption: Curated video data plus RL post-training yields controllable physical behavior
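The second assumption leans on RL-style post-training. A minimal sketch of one common recipe, assuming reward-weighted fine-tuning with a reward model (e.g. a vision-language judge of instruction alignment) scoring sampled rollouts; every module here is a toy stand-in, not the paper's actual post-training pipeline.

import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, prompt_emb):
        return torch.tanh(self.net(prompt_emb))     # stands in for a sampled video

def toy_reward(videos, prompts):
    # Placeholder reward: prompt-video similarity, standing in for a
    # VLM-based alignment or physical-plausibility score.
    return torch.cosine_similarity(videos, prompts, dim=-1)

def post_training_step(model, optimizer, prompts):
    videos = model(prompts)
    with torch.no_grad():
        rewards = toy_reward(videos, prompts)
        # Softmax weighting keeps the update scale stable across batches.
        weights = torch.softmax(rewards, dim=0) * rewards.numel()
    # Reward-weighted regression; in a real pipeline this slot would hold the
    # model's usual flow-matching objective on the sampled rollouts.
    per_sample = ((model(prompts) - prompts) ** 2).mean(dim=-1)
    loss = (weights * per_sample).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyGenerator()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    prompts = torch.randn(8, 32)
    for step in range(3):
        loss, r = post_training_step(model, opt, prompts)
        print(f"step {step}: loss={loss:.4f}, mean reward={r:.3f}")

Whether such post-training actually yields controllable physical behavior is exactly the ablation the referee asks for; the sketch only illustrates the mechanism being assumed.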
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear
Unclear relation between the paper passage and the cited Recognition theorem.
Built on a flow-based architecture, Cosmos-Predict2.5 unifies Text2World, Image2World, and Video2World generation... Trained on 200M curated video clips and refined with reinforcement learning-based post-training... enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics
-
IndisputableMonolith.Foundation.DimensionForcing · alexander_duality_circle_linking · unclear
Unclear relation between the paper passage and the cited Recognition theorem.
We further extend the family with Cosmos-Transfer2.5, a control-net style framework for Sim2Real and Real2Real world translation
-
IndisputableMonolith.Foundation.PhiForcing · phi_equation · unclear
Unclear relation between the paper passage and the cited Recognition theorem.
These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 32 Pith papers
-
Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.
-
Coding Agent Is Good As World Simulator
A multi-agent framework generates and refines executable physics simulation code from prompts to create world models that enforce physical constraints, claiming superior accuracy and fidelity over video-based alternatives.
-
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.
-
Mask World Model: Predicting What Matters for Robust Robot Policy Learning
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...
-
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
-
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
-
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models
MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.
-
Action Images: End-to-End Policy Learning via Multiview Video Generation
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
-
Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action
Pelican-Unified 1.0 trains a single VLM plus Unified Future Generator to jointly optimize understanding, reasoning, future video prediction, and action generation, reporting top-tier scores on VLM, WorldArena, and Rob...
-
Reinforcing VLAs in Task-Agnostic World Models
RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.
-
Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.
-
Learning physically grounded traffic accident reconstruction from public accident reports
A multimodal learning model with a new dataset of 6,217 cases reconstructs lane-consistent pre-impact motion and collision interactions from public accident reports, outperforming baselines in accuracy and consistency.
-
Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training
Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
-
Human Cognition in Machines: A Unified Perspective of World Models
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
-
From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation
Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.
-
ShapeGen: Robotic Data Generation for Category-Level Manipulation
ShapeGen generates shape-diverse 3D robotic manipulation demonstrations without simulators by curating a functional shape library and applying a minimal-annotation pipeline for novel, physically plausible data.
-
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
-
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...
-
Lifting Unlabeled Internet-level Data for 3D Scene Understanding
Unlabeled web videos processed by designed data engines generate effective training data that yields strong zero-shot and finetuned performance on 3D detection, segmentation, VQA, and navigation.
-
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
-
Di-BiLPS: Denoising induced Bidirectional Latent-PDE-Solver under Sparse Observations
Di-BiLPS combines a variational autoencoder, latent diffusion, and contrastive learning to achieve state-of-the-art accuracy on PDE problems with as little as 3% observations while supporting zero-shot super-resolutio...
-
Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization
SyncFix improves 3D reconstructions by synchronizing multi-view latent representations in a diffusion refinement process, generalizing from pair-wise training to arbitrary view counts at inference.
-
RLDX-1 Technical Report
RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.
-
RLDX-1 Technical Report
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
-
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
-
World Model for Robot Learning: A Comprehensive Survey
A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...
Reference graph
Works this paper leans on
-
[1]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025. 35
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge, Siddharth Gururani, Jacob Huffman, Ronald Isaac, et al. Edify image: High-quality image generation with pixel space laplacian diffusion models.arXiv preprint arXiv:2411.07126, 2024. 8
-
[3]
Recammaster: Camera-controlled generative rendering from a single video
Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In ICCV, 2025. 31
work page 2025
-
[4]
Syncammaster: Synchronizing multi-camera video generation from diverse viewpoints
Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Xiao Fu, Zuozhu Liu, Haoji Hu, Pengfei Wan, and Di Zhang. Syncammaster: Synchronizing multi-camera video generation from diverse viewpoints. In ICLR, 2025. 31
work page 2025
-
[5]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 4, 5, 6, 7, 32
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Genie 3: A new frontier for world models, 2025
Philip J Ball, J Bauer, F Belletti, et al. Genie 3: A new frontier for world models, 2025. 35
work page 2025
- [7]
-
[8]
Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025. 36
-
[9]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 6, 36
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
bloc97. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/, 2023. Reddit post, r/LocalLLaMA. 9
work page 2023
-
[11]
Intphys 2: Benchmarking intuitive physics understanding in complex synthetic environments, 2025
Florian Bordes, Quentin Garrido, Justine T Kao, Adina Williams, Michael Rabbat, and Emmanuel Dupoux. Intphys 2: Benchmarking intuitive physics understanding in complex synthetic environments. arXiv preprint arXiv:2506.09849, 2025. 36
-
[12]
Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. In IROS, 2025. 6, 31
work page 2025
-
[13]
Planning with reasoning using vision language world model.arXiv preprint arXiv:2509.02722, 2025
Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, and Pascale Fung. Planning with reasoning using vision language world model.arXiv preprint arXiv:2509.02722, 2025. 35
-
[14]
Video depth anything: Consistent depth estimation for super-long videos
Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. In CVPR, 2025. 19
work page 2025
-
[15]
On the importance of noise scheduling for diffusion models
Ting Chen. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.
-
[16]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin CM Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In RSS, 2023. 22
work page 2023
-
[17]
Delta lake: Open-source storage framework that enables building lakehouses.https: //delta.io/, 2019
Databricks. Delta lake: Open-source storage framework that enables building lakehouses.https: //delta.io/, 2019. Open-source project, Delta Lake. 6
work page 2019
-
[18]
Google DeepMind. Veo 3, 5 2025. URLhttps://deepmind.google/technologies/veo/veo-3/. 35
work page 2025
-
[19]
Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation.arXiv preprint arXiv:2504.00983, 2025. 36
-
[20]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024. 8, 11
work page 2024
-
[21]
LLM-based Realistic Safety-Critical Driving Video Generation
Yongjie Fu, Ruijian Zha, Pei Tian, and Xuan Di. Llm-based realistic safety-critical driving video generation. arXiv preprint arXiv:2507.01264, 2025. 36
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
Diffusion models and gaussian flow matching: Two sides of the same coin
Ruiqi Gao, Emiel Hoogeboom, Jonathan Heek, Valentin De Bortoli, Kevin Patrick Murphy, and Tim Salimans. Diffusion models and gaussian flow matching: Two sides of the same coin. In The Fourth Blogpost Track at ICLR 2025, 2025. 8
work page 2025
-
[23]
Seedance 1.0: Exploring the Boundaries of Video Generation Models
Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025. 35
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
YOLOX: Exceeding YOLO Series in 2021
Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021. 7
work page internal anchor Pith review arXiv 2021
-
[25]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 13
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, and Jiale Zhao. T2vphysbench: A first-principles benchmark for physical consistency in text-to-video generation.arXiv preprint arXiv:2505.00337, 2025. 36
-
[27]
David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018. 35
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[28]
LTX-Video: Realtime Video Latent Diffusion
Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024. 35
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
Dream to Control: Learning Behaviors by Latent Imagination
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019. 35
work page internal anchor Pith review Pith/arXiv arXiv 1912
-
[30]
Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, et al. Generalized neighborhood attention: Multi- dimensional sparse attention at the speed of light.arXiv preprint arXiv:2504.16922, 2025. 14
-
[31]
Kai He, Ruofan Liang, Jacob Munkberg, Jon Hasselgren, Nandita Vijaykumar, Alexander Keller, Sanja Fidler, Igor Gilitschenski, Zan Gojcic, and Zian Wang. Unirelight: Learning joint decomposition and synthesis for video relighting. arXiv preprint arXiv:2506.15673, 2025. 36
-
[32]
simple diffusion: End-to-end diffusion for high resolution images
Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In ICML, 2023. 8
work page 2023
-
[33]
arXiv preprint arXiv:2508.10934 (2025)
Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception.arXiv preprint arXiv:2508.10934, 2025. 31
-
[34]
Let-3d-ap: Longitudinal error tolerant 3d average precision for camera-only 3d detection, 2024
Wei-Chih Hung, Vincent Casser, Henrik Kretzschmar, Jyh-Jing Hwang, and Dragomir Anguelov. Let-3d-ap: Longitudinal error tolerant 3d average precision for camera-only 3d detection, 2024. URL https://arxiv.org/abs/2206.07705. 25
-
[35]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
work page internal anchor Pith review Pith/arXiv arXiv
-
[36]
arXiv preprint arXiv:2505.12705 (2025)
Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025. 3, 32, 36
-
[37]
arXiv preprint arXiv:2303.07399 (2023)
Tao Jiang, Peng Lu, Li Zhang, Ningsheng Ma, Rui Han, Chengqi Lyu, Yining Li, and Kai Chen. Rtmpose: Real-time multi-person pose estimation based on mmpose.arXiv preprint arXiv:2303.07399, 2023. 7
-
[38]
Elucidating the design space of diffusion-based generative models.NeurIPS, 2022
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models.NeurIPS, 2022. 8
work page 2022
-
[39]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 32, 35
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [41]
-
[42]
Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers, 2022. URL https://arxiv.org/abs/2203.17270. 28
-
[43]
Zizhang Li, Hong-Xing Yu, Wei Liu, Yin Yang, Charles Herrmann, Gordon Wetzstein, and Jiajun Wu. Wonderplay: Dynamic 3d scene generation from a single image and actions. arXiv preprint arXiv:2505.18151, 2025.
-
[44]
Torchtitan: One-stop pytorch native solution for production ready LLM pretraining
Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, and Stratos Idreos. Torchtitan: One-stop pytorch native solution for production ready LLM pretraining. In ICLR, 2025. 14
work page 2025
-
[45]
Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025. 36
-
[46]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 8
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[47]
Improving Video Generation with Human Feedback
Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025. 13
work page internal anchor Pith review arXiv 2025
-
[48]
Dynamicscaler: Seamless and scalable video generation for panoramic scenes
Jinxiu Liu, Shaoheng Lin, Yinxiao Li, and Ming-Hsuan Yang. Dynamicscaler: Seamless and scalable video generation for panoramic scenes. In CVPR, 2025. 35
work page 2025
-
[49]
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023. 22
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[50]
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024. 14
work page internal anchor Pith review arXiv 2024
-
[51]
Latr: 3d lane detection from monocular images with transformer, 2023
Yueru Luo, Chaoda Zheng, Xu Yan, Tang Kun, Chao Zheng, Shuguang Cui, and Zhen Li. Latr: 3d lane detection from monocular images with transformer, 2023. URL https://arxiv.org/abs/2308.04583. 28
- [52]
-
[53]
Do generative video models understand physical principles?arXiv preprint arXiv:2501.09038, 2025
Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models understand physical principles?arXiv preprint arXiv:2501.09038, 2025. 36
-
[54]
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024. 35
work page internal anchor Pith review arXiv 2024
-
[55]
NVIDIA. Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025. 3, 9, 35
-
[56]
Cosmos-transfer1: Conditional world generation with adaptive multimodal control, 2025
NVIDIA. Cosmos-transfer1: Conditional world generation with adaptive multimodal control.arXiv preprint arXiv:2503.14492, 2025. 3, 18, 19, 28, 36
-
[57]
Cosmos World Foundation Model Platform for Physical AI
NVIDIA. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025.
work page internal anchor Pith review Pith/arXiv arXiv
-
[58]
3, 4, 8, 9, 31, 35, 36
- [59]
-
[60]
Training language models to follow instructions with human feedback.NeurIPS, 2022
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.NeurIPS, 2022. 13
work page 2022
-
[61]
YaRN: Efficient Context Window Extension of Large Language Models
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models.arXiv preprint arXiv:2309.00071, 2023. 9
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[62]
Movie Gen: A Cast of Media Foundation Models
Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024. 35
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[63]
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In KDD, 2020. 14
work page 2020
-
[64]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. 19, 22
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[65]
Allen Z. Ren, Justin Lidard, Lars Lien Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. In ICLR,
-
[66]
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks.arXiv preprint arXiv:2401.14159, 2024. 22
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[67]
Available: https://arxiv.org/abs/2506.09042
Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042, 2025. 3, 25, 28, 36
-
[68]
Gen3c: 3d-informed world-consistent video generation with precise camera control
Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In CVPR, 2025. 36
work page 2025
-
[69]
Runway. Gen 3, 2024. URLhttps://runwayml.com/research/introducing-gen-3-alpha. 35
work page 2024
-
[70]
Paul D Sampson. Fitting conic sections to “very scattered” data: An iterative refinement of the bookstein algorithm. Computer Graphics and Image Processing, 1982. ISSN 0146-664X. doi: https://doi.org/10.1016/0146-664X(82)90101-0. URL https://www.sciencedirect.com/science/article/pii/0146664X82901010. 25
-
[71]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017. 13
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[72]
Text-to-4D Dynamic Scene Generation
Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4d dynamic scene generation.arXiv preprint arXiv:2301.11280, 2023. 35
-
[73]
Light field networks: Neural scene representations with single-evaluation rendering.NeurIPS, 2021
Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering.NeurIPS, 2021. 31
work page 2021
-
[74]
Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2, 2021
Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elhoseiny. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2, 2021. 25
work page 2021
-
[75]
Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, Nathan Ratliff, and Dieter Fox. cuRobo: Parallelized collision-free minimum-jerk robot motion generation.arXiv preprint arXiv:2310.17274, 2023. 21
-
[76]
1x technologies | safe humanoids for the home, 2025
1X Technologies. 1x technologies | safe humanoids for the home, 2025. URLhttps://www.1x.tech/. 6
work page 2025
-
[77]
Open x-embodiment: Robotic learning datasets and rt-x models
Quan Vuong, Sergey Levine, Homer Rich Walke, Karl Pertsch, Anikait Singh, Ria Doshi, Charles Xu, Jianlan Luo, Liam Tan, Dhruv Shah, et al. Open x-embodiment: Robotic learning datasets and rt-x models. In Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL 2023,
-
[78]
Bridgedata v2: A dataset for robot learning at scale
Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In CoRL, 2023. 6, 33
work page 2023
-
[79]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 9, 32, 35
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[80]
A comprehensive study of decoder-only llms for text-to-image generation
Andrew Z Wang, Songwei Ge, Tero Karras, Ming-Yu Liu, and Yogesh Balaji. A comprehensive study of decoder-only llms for text-to-image generation. In CVPR, 2025. 9
work page 2025
discussion (0)