X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling

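Baolu Li; Jingyu Qian; Rui Guo; Yilun Chen; Hanpeng Liu; Yuan Lin; Junhong Zhou; Ruixin Liu; Willow Yang; Yutong Zheng; Zhenli Zhang; Sean Li; Chaoda Zheng; Boyang Wang; Tenglong (Victor) Gu; Zhuangzhuang Ding; Pengkun Zheng; Yu Zhang; Xianming Liu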

arxiv: 2605.24892 · v3 · pith:RCCT6GDKnew · submitted 2026-05-24 · 💻 cs.CV

Pith reviewed 2026-06-30 12:03 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language-action · world modeling · video prediction · causal forecasting · autonomous planning · chunk-wise prediction · temporal importance sampling

The pith

X-Foresight integrates chunk-wise video forecasting into VLA models to learn physical dynamics and long-term causality for improved planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that VLA models can gain useful world knowledge by predicting future video inside their architecture rather than treating prediction as a separate task. It identifies that standard next-frame prediction on video tends to produce repetitive outputs because frames share too much information, and that dense short-term prediction cannot scale to long causal chains. The proposed fix is a chunk-wise auto-regressive method that skips ahead to distant chunks, keeps dense frames inside each chunk for immediate motion, and uses sparse links between chunks for extended causality. Curriculum lengthening of the horizon and importance sampling on safety-critical segments further stabilize training. The authors report that the resulting joint model produces higher-quality plans than plain VLA baselines while still generating coherent video.

Core claim

X-Foresight is a predictive world model placed inside a VLA network that uses a long-horizon chunk-wise auto-regressive strategy to forecast future video, thereby internalizing physical dynamics and causality; dense frames within chunks handle instantaneous motion while sparse transitions between chunks capture longer causal structure, and the same representations support real-time action control.

What carries the argument

The long-horizon chunk-wise auto-regressive strategy that predicts semantically distant chunks instead of adjacent frames, keeping dense intra-chunk frames for short-term dynamics and sparse inter-chunk transitions for long-term causality.
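To make the difference between frame-wise and chunk-wise targets concrete, here is a minimal sketch of how the predicted frame indices might be laid out. The function name `chunk_targets` and the convention that the stride is measured between the first frames of successive chunks are assumptions made for illustration; the abstract does not specify either.

```python
from typing import List


def chunk_targets(t0: int, num_chunks: int, chunk_len: int, stride: int) -> List[List[int]]:
    """Frame indices predicted at each auto-regressive step.

    Assumption (not taken from the paper): `stride` is the gap between the
    first frames of successive chunks, and the first chunk starts `stride`
    frames after the last observed frame `t0`.
    """
    if chunk_len < 1 or stride < chunk_len:
        raise ValueError("need chunk_len >= 1 and stride >= chunk_len")
    chunks = []
    for i in range(1, num_chunks + 1):
        start = t0 + i * stride
        # Dense frames inside the chunk cover short-term motion.
        chunks.append(list(range(start, start + chunk_len)))
    return chunks


# Frame-wise prediction is the special case chunk_len = 1, stride = 1.
frame_wise = chunk_targets(t0=0, num_chunks=3, chunk_len=1, stride=1)
# Chunk-wise long-horizon prediction: 4 dense frames, then a jump of 10 frames.
chunk_wise = chunk_targets(t0=0, num_chunks=3, chunk_len=4, stride=10)

print(frame_wise)  # [[1], [2], [3]]
print(chunk_wise)  # [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
```

With the same three prediction steps, the frame-wise layout reaches frame 3 while the chunk-wise layout reaches frame 33, which is the horizon gain the strategy relies on.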

If this is right

VLA models gain internal knowledge of physical dynamics without separate pre-training stages.
Planning success rates rise while video generation quality stays high.
Curriculum lengthening of prediction horizons stabilizes training for longer sequences.
Focusing supervision on ego-motion and behavior signals improves attention to safety-critical future segments.
A separate diffusion renderer can be attached to produce photorealistic output without changing the forecasting core.
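One way the importance-sampling idea in the list above could work is sketched below. Everything here is hypothetical: the scoring rule (mean absolute acceleration plus yaw rate per chunk), the `floor` constant, and the function names are illustrative assumptions, since the abstract only says that safety-critical chunks are identified from ego-motion and behavioral signals.

```python
import random
from typing import List, Sequence


def chunk_importance(accel: Sequence[float], yaw_rate: Sequence[float], chunk_len: int) -> List[float]:
    """Score each chunk by mean |accel| + |yaw_rate| (an assumed proxy for criticality)."""
    if len(accel) != len(yaw_rate):
        raise ValueError("signals must have the same length")
    n_chunks = len(accel) // chunk_len  # trailing partial chunk is ignored
    scores = []
    for c in range(n_chunks):
        lo, hi = c * chunk_len, (c + 1) * chunk_len
        total = sum(abs(accel[t]) + abs(yaw_rate[t]) for t in range(lo, hi))
        scores.append(total / chunk_len)
    return scores


def sampling_probs(scores: Sequence[float], floor: float = 0.05) -> List[float]:
    """Turn scores into probabilities; `floor` keeps uneventful chunks reachable."""
    shifted = [s + floor for s in scores]
    total = sum(shifted)
    return [s / total for s in shifted]


def sample_chunks(probs: Sequence[float], k: int, rng: random.Random) -> List[int]:
    """Draw k chunk indices (with replacement) according to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=k)


# Toy signals: a calm first chunk followed by hard braking in the second.
accel = [0.0, 0.0, 0.0, 0.0, 3.0, 3.0, 3.0, 3.0]
yaw = [0.0] * 8
probs = sampling_probs(chunk_importance(accel, yaw, chunk_len=4))
picks = sample_chunks(probs, k=10, rng=random.Random(0))
print(probs, picks)
```

The floor is a design choice rather than something the paper states: without it, a chunk with zero ego-motion would never receive supervision.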

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The chunk separation idea could be tested on non-driving video domains such as manipulation or navigation in unstructured environments.
The same architecture might reduce reliance on real-world interaction data by allowing more efficient use of passive video for pre-training.
If the importance sampling heuristic proves general, similar signals could be derived from other sensor streams to guide long-horizon supervision.

Load-bearing premise

Predicting semantically distant chunks rather than adjacent frames will avoid trivial repetition while still letting the model learn both immediate motion and extended causal relationships through the chunk structure.

What would settle it

A controlled comparison on the same planning benchmarks in which a standard next-frame VLA baseline achieves equal or higher success rates than the chunk-wise version would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.24892 by Baolu Li, Boyang Wang, Chaoda Zheng, Hanpeng Liu, Jingyu Qian, Junhong Zhou, Pengkun Zheng, Rui Guo, Ruixin Liu, Sean Li, Tenglong (Victor) Gu, Willow Yang, Xianming Liu, Yilun Chen, Yuan Lin, Yutong Zheng, Yu Zhang, Zhenli Zhang, Zhuangzhuang Ding.

**Figure 2.** Overview of the large-scale multi-camera driving dataset. The dataset contained approxi…

**Figure 3.** Training scenario distribution of X-Foresight. Fine-grained auto-tags were grouped into…

**Figure 4.** The training pipelines of the proposed X-Foresight. X-Foresight consisted of two main…

**Figure 5.** Prompt formulations for future frame prediction. (a) Frame-wise foresight predicted one frame at each step. (b) Frame-wise longer foresight increased the temporal stride s. (c) Chunk-wise foresight predicted a chunk of K consecutive frames in parallel. (d) Chunk-wise longer foresight combined chunk length K with stride s for longer-horizon prediction. (e) Chunk-wise temporal importance sampling.

**Figure 6.** Semi-causal block-sparse attention mask for long-sequence training. Each colored pixel denoted one token block where attention was allowed. The mask preserved bidirectional attention within each temporal chunk, allowed access to the global system prompt and previous prompt tokens, and prohibited attention between query tokens across different chunks. The two panels showed complementary sparse patterns ass…

**Figure 7.** Qualitative comparisons. The baseline failed on events lying ahead in space (top, a far-exit…

**Figure 8.** Visualization of the Vision Renderer conditioned on camera tokens. The horizontal axis…
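The attention rules stated in the Figure 6 caption can be written down as a small boolean mask. The sketch below works at token level rather than block level, and the token layout (system tokens first, then prompt and query tokens per chunk), the rule that nothing attends to later chunks, and all names are assumptions made for illustration; only the three caption rules come from the source.

```python
from typing import List, Tuple

# (kind, chunk_id); kind is "sys", "prompt" or "query". System tokens use chunk_id -1.
Token = Tuple[str, int]


def build_layout(n_sys: int, n_chunks: int, n_prompt: int, n_query: int) -> List[Token]:
    """Assumed sequence layout: system prompt, then prompt + query tokens per chunk."""
    layout: List[Token] = [("sys", -1)] * n_sys
    for c in range(n_chunks):
        layout += [("prompt", c)] * n_prompt + [("query", c)] * n_query
    return layout


def allowed(q: Token, k: Token) -> bool:
    """True if the token `q` may attend to the token `k`."""
    q_kind, q_chunk = q
    k_kind, k_chunk = k
    if k_kind == "sys":
        return True  # caption: global system prompt is visible to every token
    if q_kind == "sys":
        return False  # assumption: system tokens attend only to themselves
    if q_chunk == k_chunk:
        return True  # caption: bidirectional attention inside a chunk
    if k_kind == "prompt" and k_chunk < q_chunk:
        return True  # caption: access to previous prompt tokens
    return False  # cross-chunk query-to-query, and (assumed) any future chunk


def build_mask(layout: List[Token]) -> List[List[bool]]:
    return [[allowed(q, k) for k in layout] for q in layout]


layout = build_layout(n_sys=1, n_chunks=2, n_prompt=1, n_query=2)
mask = build_mask(layout)
for row in mask:
    print("".join("#" if cell else "." for cell in row))
```

Each printed row is one attending token and each `#` marks a position it may attend to, so the cross-chunk query-to-query cells appear as `.`.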

Original abstract

Physical world knowledge resides mainly in videos. Equipping Vision-Language-Action (VLA) models with such knowledge is fundamental for safe and generalizable planning. Predictive world modeling enables VLA to internalize physical dynamics and long-term causality by predicting future video from past observations. However, naive next-frame prediction faces two challenges: 1) unlike semantically distinct text tokens, video tokens are low-entropy and redundant, causing prediction to degenerate into trivial extrapolation. 2) world modeling poses a temporal dilemma: dense prediction captures instantaneous dynamics, but cannot efficiently model long-horizon causality. To learn world knowledge effectively, we introduce X-Foresight, a predictive world model integrated directly into the VLA architecture to jointly learn world modeling and real-time action control. At its core lies a long-horizon chunk-wise auto-regressive strategy that addresses both challenges: by predicting semantically distant chunks rather than adjacent frames, it escapes trivial extrapolation, while preserving dense intra-chunk frames for instantaneous dynamics and sparse inter-chunk transitions for long-term causality. A curriculum learning schedule progressively extends prediction horizons and stabilizes long-horizon training. To capture long-term causality effectively, we present temporal importance sampling, which concentrates supervision on safety-critical chunks identified by ego-motion and behavioral signals. We further delegate photorealistic synthesis to a diffusion-based multi-view renderer, improving photorealistic appearance. Comprehensive experiments demonstrate that X-Foresight significantly outperforms VLA baselines in planning performance while maintaining strong generative fidelity, establishing a robust paradigm for world-knowledge-driven autonomous systems.
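The abstract says the curriculum "progressively extends prediction horizons" but gives no schedule. As a purely illustrative sketch, a linear ramp followed by a hold might look like the following; the function `horizon_at`, its defaults, and the linear shape are all assumptions, not the authors' method.

```python
def horizon_at(step: int, total_steps: int, min_chunks: int = 1,
               max_chunks: int = 8, ramp_frac: float = 0.5) -> int:
    """Number of future chunks supervised at a given training step.

    Hypothetical schedule: grow linearly from `min_chunks` to `max_chunks`
    over the first `ramp_frac` of training, then hold at `max_chunks`.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    ramp_steps = max(1, int(total_steps * ramp_frac))
    progress = min(1.0, step / ramp_steps)
    return min_chunks + int(progress * (max_chunks - min_chunks))


for step in (0, 250, 500, 1000):
    print(step, horizon_at(step, total_steps=1000))
# 0 1
# 250 4
# 500 8
# 1000 8
```

Starting with a single chunk keeps early training close to short-horizon prediction, and the hold phase spends the remaining steps at the full horizon.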

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

X-Foresight's chunk-wise autoregressive approach to VLA world modeling is a direct attempt at the video redundancy and temporal horizon problems, but the abstract gives no evidence that the claimed planning gains are real.

The main point is that this paper takes the standard next-frame prediction issues in video-based world models and tries to fix them with a chunk-wise autoregressive setup: predict distant chunks to avoid trivial extrapolation, keep dense frames inside chunks for short-term dynamics, and use sparse transitions between chunks for longer causality. They add curriculum learning to grow the horizon gradually and temporal importance sampling that weights safety-critical chunks based on ego-motion and behavior signals, plus a diffusion renderer to handle appearance separately.

That framing of the temporal dilemma is clear and the chunk strategy is a logical response to it. Separating the modeling from photorealistic synthesis via diffusion also makes sense practically for VLA integration.

The soft spot is the complete absence of any results. The abstract asserts significant outperformance over VLA baselines in planning performance with maintained generative fidelity, yet supplies no metrics, no baseline list, no ablations, and no description of how the joint vision-action training actually works or what datasets were used. Without those, the central assumption that distant chunks capture causality better than adjacent frames stays untested, and it's impossible to tell whether any gains come from the proposed components or from other training details.

This is for the embodied AI and robotics crowd working on predictive models inside VLA architectures. A reader looking for new ideas on handling long-horizon video prediction might pick up the chunk and sampling approach, but the lack of evidence limits its immediate value. It deserves a serious referee if the full paper contains reproducible experiments and ablations that actually support the claims; otherwise the idea stays speculative.

I'd send it to review only after seeing the results section and checking whether the gains survive proper controls.

Referee Report

1 major / 0 minor

Summary. The paper proposes X-Foresight, a predictive world model integrated into Vision-Language-Action (VLA) architectures. It introduces a long-horizon chunk-wise auto-regressive prediction strategy that predicts semantically distant chunks (with dense intra-chunk frames and sparse inter-chunk transitions), a curriculum learning schedule to extend horizons, temporal importance sampling focused on safety-critical chunks via ego-motion and behavioral signals, and delegation of photorealistic synthesis to a diffusion-based multi-view renderer. The central claim is that this joint vision-action causal forecasting approach enables effective internalization of physical dynamics and long-term causality, yielding significantly better planning performance than VLA baselines while preserving generative fidelity.

Significance. If the experimental claims hold, the work could advance world-knowledge-driven autonomous systems by providing a concrete mechanism to overcome low-entropy video prediction degeneracy and the dense-vs-long-horizon tradeoff in VLA models.

major comments (1)

[Abstract] The claim that X-Foresight 'significantly outperforms VLA baselines in planning performance' is presented without any quantitative results, baselines, metrics, ablation studies, or experimental setup. This prevents any evaluation of whether the chunk-wise strategy, curriculum, or importance sampling produces the asserted gains, or whether those gains reduce to self-referential signals.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater specificity in the abstract. The manuscript's experimental section provides the requested quantitative support, but we agree the abstract can be strengthened by incorporating key results to better contextualize the claims.

Referee: [Abstract] The claim that X-Foresight 'significantly outperforms VLA baselines in planning performance' is presented without any quantitative results, baselines, metrics, ablation studies, or experimental setup. This prevents any evaluation of whether the chunk-wise strategy, curriculum, or importance sampling produces the asserted gains, or whether those gains reduce to self-referential signals.

Authors: The abstract is intentionally concise and summarizes the core findings. Full quantitative details are provided in Sections 4 (Experiments) and 5 (Ablations): specific planning metrics (e.g., success rates on long-horizon tasks), baseline comparisons (standard VLA models), ablation studies isolating the chunk-wise autoregressive prediction, curriculum schedule, and temporal importance sampling, and the experimental setups. These results demonstrate that the gains arise from the proposed mechanisms rather than self-referential signals, as ablations show performance drops when components are removed. To address the concern directly, we will revise the abstract to include 1-2 key quantitative highlights (e.g., relative improvement percentages) while preserving its brevity.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained in abstract

The provided abstract describes a chunk-wise autoregressive strategy, curriculum learning, and temporal importance sampling as architectural choices for addressing next-frame prediction issues, but contains no equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems. No load-bearing step reduces by construction to its inputs; the central claim of outperformance is presented as an empirical result rather than a definitional equivalence. Without equations or self-referential derivations in the given text, the derivation chain remains independent of the inputs it claims to predict.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be extracted or audited from the paper.

pith-pipeline@v0.9.1-grok · 5877 in / 1140 out tokens · 37615 ms · 2026-06-30T12:03:41.735899+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Situation Perception: A Necessary Primitive to Artificial Superintelligence
cs.CY 2026-06 unverdicted novelty 3.0

Situation perception is proposed as a necessary primitive for artificial superintelligence, requiring abstract prediction, long-term compressed memory, and objective-guided active learning.

Reference graph

Works this paper leans on

23 extracted references · 9 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15619–15629, 2023.

[2] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.

[3] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In International Conference on Machine Learning (ICML), 2024.

[4] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations, volume 2024, pages 35549–35562, 2024.

[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929, 2020.

[6] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv:2303.03378, 2023.

[7] Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation. arXiv:2512.07829, 2025.

[8] Junxian Guo, Haotian Tang, Shang Yang, Zhekai Zhang, Zhijian Liu, and Song Han. Block Sparse Attention. https://github.com/mit-han-lab/Block-Sparse-Attention, 2024.

[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.

[10] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. Advances in Neural Information Processing Systems (NeurIPS), 38:167283–167308, 2026.

[11] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv:2406.09246, 2024.

[12] World Labs. Marble: World Labs Spatial Intelligence. https://www.worldlabs.ai/, 2026.

[13] Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, et al. Radial attention: O(n log n) sparse attention with energy decay for long video generation. In Advances in Neural Information Processing Systems (NeurIPS), 2025.

[14] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023.

[15] William Peebles and Saining Xie. Scalable diffusion models with transformers. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023.

[16] Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. SVG: Latent diffusion model without variational autoencoder. arXiv:2510.15301, 2025.

[17] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv:1812.01717, 2018.

[18] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv:2503.20314, 2025.

[19] XPeng Inc. XPeng VLA 2.0. https://www.xpeng.com, 2026.

[20] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22963–22974, 2025.

[21] Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, and Li Yuan. Helios: Real real-time long video generation model. arXiv:2603.04379, 2026.

[22] Chaoda Zheng, Sean Li, Jinhao Deng, Zhennan Wang, Shijia Chen, Liqiang Xiao, Ziheng Chi, Hongbin Lin, Kangjie Chen, Boyang Wang, et al. X-world: Controllable ego-centric multi-camera world models for scalable end-to-end driving. arXiv:2603.19979, 2026.

[23] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning (CoRL), pages 2165–2183. PMLR, 2023.
