Learning World Models for Interactive Video Generation
Pith reviewed 2026-05-19 12:29 UTC · model grok-4.3
The pith
Video retrieval augmented generation with explicit global state conditioning reduces compounding errors and improves consistency in interactive video world models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Foundational world models for interactive video must address compounding errors, which are inherently irreducible in autoregressive setups, and insufficient memory mechanisms that cause incoherence. Enhancing image-to-video models with action conditioning and autoregressive generation reveals these limits, while video retrieval augmented generation (VRAG) paired with explicit global state conditioning significantly reduces long-term errors and boosts spatiotemporal consistency.
What carries the argument
Video retrieval augmented generation (VRAG) with explicit global state conditioning, which augments the generation process by retrieving past clips and maintaining a global state to preserve coherence over time.
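The retrieval step can be pictured with a minimal sketch. A passage quoted further down this page describes the global state as 3D position plus orientation angles. The sketch assumes past clips are stored with that state and looked up by nearness to the current state. The `Clip` class, the `retrieve` function, the single yaw angle, and the `yaw_weight` scoring are hypothetical names and choices made for illustration; the paper's actual retrieval key and scoring rule are not given on this page.

```python
import math
from dataclasses import dataclass

@dataclass
class Clip:
    label: str        # stand-in for stored frame data (e.g. latents); hypothetical
    position: tuple   # (x, y, z) global position when the clip was generated
    yaw_deg: float    # orientation angle in degrees (a single angle, as a simplification)

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def state_distance(clip, position, yaw_deg, yaw_weight=0.05):
    """Assumed score: Euclidean position distance plus a weighted angle gap."""
    return math.dist(clip.position, position) + yaw_weight * angle_diff(clip.yaw_deg, yaw_deg)

def retrieve(memory, position, yaw_deg, k=2):
    """Return the k stored clips whose global state is closest to the query state."""
    return sorted(memory, key=lambda c: state_distance(c, position, yaw_deg))[:k]

memory = [
    Clip("A", (0.0, 0.0, 0.0), 0.0),
    Clip("B", (10.0, 0.0, 0.0), 0.0),
    Clip("C", (0.0, 0.0, 1.0), 180.0),
]
nearest = retrieve(memory, position=(0.0, 0.0, 0.5), yaw_deg=10.0, k=2)
print([c.label for c in nearest])
```

In a full system the retrieved clips would be passed to the generator as extra conditioning. That step depends on the model architecture and is left out here.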
If this is right
- Interactive video generation becomes feasible for longer sequences without rapid loss of consistency.
- World models can better support future planning with action choices in simulated environments.
- Limited in-context learning in current video models is sidestepped by conditioning on an explicit global state alongside retrieval, rather than relying on longer context windows or retrieval alone.
- Naive extensions like longer contexts or basic retrieval prove less effective, highlighting the need for structured augmentation.
Where Pith is reading between the lines
- Similar retrieval and state mechanisms could improve other autoregressive generative models in domains like text or audio.
- Implementing VRAG might allow incremental improvements to existing video models without complete retraining from scratch.
- This approach could be tested in real-world robotics or game environments to measure planning accuracy gains.
Load-bearing premise
That the main problems in video world models stem from insufficient memory and that retrieving past clips with global state can fix incoherence without creating new inconsistencies or needing full model retraining.
What would settle it
A direct comparison experiment showing whether videos generated with VRAG maintain object positions and scene coherence over many more frames than standard autoregressive methods, or if errors still accumulate similarly.
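One way such a comparison could be scored is sketched below: compute a per-frame error against reference frames, then record how many frames each method lasts before the error crosses a threshold. The per-frame MSE metric, the threshold value, the toy data, and the names `error_curve` and `coherence_horizon` are all assumptions for illustration; the paper's own evaluation protocol is not described on this page.

```python
def mse(a, b):
    """Mean squared error between two equally sized frames (flat lists of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def error_curve(generated, reference):
    """Per-frame error of a generated rollout against reference frames."""
    return [mse(g, r) for g, r in zip(generated, reference)]

def coherence_horizon(errors, threshold):
    """Index of the first frame whose error exceeds the threshold.

    Returns len(errors) if the rollout never crosses it.
    """
    for t, e in enumerate(errors):
        if e > threshold:
            return t
    return len(errors)

# Toy data: a rollout that drifts steadily versus one with a small constant offset.
reference = [[0.0, 0.0] for _ in range(5)]
drifting = [[0.1 * t, 0.1 * t] for t in range(5)]
stable = [[0.05, 0.05] for _ in range(5)]

drift_horizon = coherence_horizon(error_curve(drifting, reference), threshold=0.05)
stable_horizon = coherence_horizon(error_curve(stable, reference), threshold=0.05)
print(drift_horizon, stable_horizon)
```

A longer horizon for the retrieval-augmented rollout than for the plain autoregressive one, at the same threshold, would count in favor of the claim. Similar horizons would count against it.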
Original abstract
Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanism leads to incoherence of world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models. In contrast, naive autoregressive generation with extended context windows and retrieval-augmented generation prove less effective for video generation, primarily due to the limited in-context learning capabilities of current video models. Our work illuminates the fundamental challenges in video world models and establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies compounding errors and insufficient memory as core limitations in autoregressive video generation for world models. It augments image-to-video models with action conditioning, asserts that compounding error is inherently irreducible under autoregressive generation, and proposes video retrieval augmented generation (VRAG) with explicit global state conditioning to reduce long-term errors and improve spatiotemporal consistency. It further claims that naive extended-context autoregressive generation and standard retrieval-augmented generation are less effective due to limited in-context learning in current video models, while positioning the work as establishing a benchmark for internal world modeling capabilities.
Significance. If the claimed reductions in compounding error and gains in consistency are demonstrated, the introduction of VRAG with global state conditioning would address a practically important bottleneck in long-horizon interactive video generation, offering a concrete direction for memory-augmented world models beyond simple context extension.
major comments (2)
- [Abstract] The claim that 'compounding error is inherently irreducible in autoregressive video generation' is presented as a foundational revelation motivating VRAG, yet the manuscript supplies neither a formal argument, mathematical characterization, nor any empirical measurement of this irreducibility.
- [Abstract] The assertion that VRAG 'significantly reduces long-term compounding errors and increases spatiotemporal consistency' is the central empirical claim, but the text contains no experimental protocol, quantitative metrics, baselines, or results that would allow verification of these improvements.
minor comments (1)
- [Abstract] The phrase 'establishes a comprehensive benchmark' is used without any description of the benchmark's tasks, metrics, or evaluation protocol.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address the major points below and will revise the manuscript to better support the claims presented in the abstract.
Point-by-point responses
- Referee: [Abstract] The claim that 'compounding error is inherently irreducible in autoregressive video generation' is presented as a foundational revelation motivating VRAG, yet the manuscript supplies neither a formal argument, mathematical characterization, nor any empirical measurement of this irreducibility.
  Authors: We acknowledge that the abstract presents this claim concisely without a formal argument, mathematical characterization, or empirical measurement. The abstract is a high-level summary. We will revise the manuscript to include a dedicated discussion with a simple mathematical model of error propagation in autoregressive frame prediction and empirical measurements from long-horizon experiments showing persistent compounding even under extended context.
  Revision: yes
- Referee: [Abstract] The assertion that VRAG 'significantly reduces long-term compounding errors and increases spatiotemporal consistency' is the central empirical claim, but the text contains no experimental protocol, quantitative metrics, baselines, or results that would allow verification of these improvements.
  Authors: We agree that the abstract states the empirical claim without including the experimental protocol, quantitative metrics, baselines, or results. These elements appear in the experimental sections of the full manuscript. To address the concern, we will revise the abstract to briefly note the evaluation metrics (such as spatiotemporal consistency scores) and the main baselines (naive autoregressive and standard RAG) so that the improvements can be more readily understood and verified.
  Revision: yes
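The first response refers to a simple mathematical model of error propagation. One textbook form such a model often takes is given below as an illustration; it is not the authors' derivation. All symbols are assumptions introduced here: the true dynamics f^*, the learned model f, a one-step error bound ε, and a Lipschitz constant L for f in its frame input.

```latex
% Illustrative assumptions (not taken from the paper):
%   \|f(x,a) - f^*(x,a)\| \le \varepsilon      one-step model error
%   \|f(x,a) - f(x',a)\| \le L \|x - x'\|      Lipschitz in the frame input
\begin{align}
e_t &= \|\hat{x}_t - x_t\|, \qquad
  \hat{x}_{t+1} = f(\hat{x}_t, a_t), \quad
  x_{t+1} = f^*(x_t, a_t), \quad e_0 = 0 \\
e_{t+1} &\le \|f(\hat{x}_t, a_t) - f(x_t, a_t)\|
  + \|f(x_t, a_t) - f^*(x_t, a_t)\|
  \le L\, e_t + \varepsilon \\
e_T &\le \varepsilon \sum_{k=0}^{T-1} L^k =
\begin{cases}
  \varepsilon\, \dfrac{L^T - 1}{L - 1}, & L \ne 1 \\[6pt]
  \varepsilon\, T, & L = 1
\end{cases}
\end{align}
```

This is only an upper bound. It shows how a fixed per-step error can grow linearly (L = 1) or geometrically (L > 1) over a rollout, but it does not by itself establish irreducibility, which would need a matching lower bound or the empirical measurements the authors promise.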
Circularity Check
No significant circularity detected in available text
Full rationale
The provided abstract states observations on limitations of current video generation models (compounding errors and insufficient memory) and proposes VRAG with explicit global state conditioning as an enhancement. No equations, detailed derivation steps, fitted parameters, or self-citations appear in the text. Claims such as the inherent irreducibility of compounding errors in autoregressive setups are presented as revelations without any shown reduction to inputs by construction, self-definitional loops, or renaming of known results. The central proposal remains a high-level method suggestion rather than a closed loop equivalent to its own premises, making the argument self-contained at the level of the abstract.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Compounding error is inherently irreducible in autoregressive video generation
- domain assumption: Current video models have limited in-context learning capabilities
invented entities (1)
- VRAG (video retrieval augmented generation): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models."
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanism leads to incoherence of world models"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "global state vector s ∈ R^S consists of two key components: s_pos representing 3D position coordinates and s_ori capturing orientation angles"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.