pith. machine review for the scientific record.

arxiv: 2604.18486 · v3 · submitted 2026-04-20 · 💻 cs.CV · cs.CL · cs.RO

Recognition: 3 Lean theorem links

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:50 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.RO
keywords latent chain-of-thought · vision-language-action · world model supervision · autonomous driving · trajectory prediction · one-step inference · future frame prediction · visual reasoning

The pith

Supervising latent reasoning tokens with future-frame visual predictions lets latent Chain-of-Thought exceed explicit token-by-token reasoning in driving tasks while running at answer-only speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard latent reasoning in vision-language-action models stays too abstract because it only reconstructs text. Adding a second decoder that predicts future video frames from the same hidden states forces the latents to encode actual road layout, motion, and scene change. A staged training process aligns the tokens to trajectory, language, and visual goals in sequence. At test time the extra decoders are dropped and all reasoning happens in one parallel step. On four driving benchmarks this produces higher accuracy than full explicit reasoning while keeping the latency of a direct answer.

Core claim

OneVL routes driving reasoning through compact latent tokens supervised by both a language decoder that reconstructs text Chain-of-Thought and a visual world-model decoder that predicts future-frame tokens. The visual supervision pushes the latent space to represent causal dynamics of geometry, agent movement, and scene evolution rather than linguistic abstractions alone. After three-stage progressive alignment of trajectory, language, and visual objectives, inference discards the auxiliary decoders and prefills every latent token in a single parallel pass, matching answer-only latency while surpassing explicit CoT accuracy across four benchmarks.
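
The abstract does not state the training objective explicitly. A minimal sketch of one plausible form, assuming the trajectory, language-reconstruction, and future-frame terms are simply weighted and summed over the same latent tokens z (all symbols here are illustrative, not the paper's notation):

```latex
\mathcal{L}(\theta) =
    \lambda_{\mathrm{traj}}\,\mathcal{L}_{\mathrm{traj}}\!\big(f_{\mathrm{plan}}(z),\ \tau^{\star}\big)
  + \lambda_{\mathrm{lang}}\,\mathcal{L}_{\mathrm{CE}}\!\big(D_{\mathrm{lang}}(z),\ y_{\mathrm{CoT}}\big)
  + \lambda_{\mathrm{vis}}\,\mathcal{L}_{\mathrm{CE}}\!\big(D_{\mathrm{vis}}(z),\ x_{t+1:t+K}\big)
```

Here z denotes the compact latent tokens, D_lang and D_vis the auxiliary decoders, τ* the ground-truth trajectory, and x_{t+1:t+K} the future-frame tokens. The three-stage pipeline would presumably vary the λ weights per stage, and at inference only the planning path f_plan(z) is evaluated.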

What carries the argument

Dual auxiliary decoders (language reconstruction plus visual future-frame prediction) that supervise the same compact latent tokens during training so the hidden states internalize physical dynamics.
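
A minimal code sketch of that dual supervision, reconstructed from the abstract alone: all module names, dimensions, and loss weights below are assumptions for illustration, not the released OneVL implementation.

```python
# Hedged sketch only: backbone, decoder, and head names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualSupervisedLatents(nn.Module):
    """Shared latent reasoning tokens supervised by two auxiliary decoders (train only)."""

    def __init__(self, backbone, d_model=1024, n_latent=16,
                 text_vocab=32000, frame_vocab=8192, n_heads=8):
        super().__init__()
        self.backbone = backbone                    # assumed VLM: (images, prompt) -> [B, T, d_model]
        self.latent_queries = nn.Parameter(torch.randn(n_latent, d_model))
        self.mixer = nn.TransformerEncoder(         # fills the latent tokens in one parallel pass
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        self.lang_decoder = nn.TransformerDecoder(  # reconstructs text CoT from the latents (training only)
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        self.frame_decoder = nn.TransformerDecoder( # predicts future-frame tokens from the latents (training only)
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        self.lang_head = nn.Linear(d_model, text_vocab)
        self.frame_head = nn.Linear(d_model, frame_vocab)
        self.traj_head = nn.Linear(d_model, 2)      # one (x, y) waypoint per latent token

    def encode_latents(self, images, prompt):
        ctx = self.backbone(images, prompt)                               # [B, T, d_model]
        q = self.latent_queries.unsqueeze(0).expand(ctx.size(0), -1, -1)  # [B, n_latent, d_model]
        mixed = self.mixer(torch.cat([ctx, q], dim=1))                    # single parallel pass
        return mixed[:, -q.size(1):]                                      # compact latent reasoning tokens

    def forward(self, images, prompt, cot_embed, frame_embed):
        z = self.encode_latents(images, prompt)
        traj = self.traj_head(z)                                           # planned waypoints
        lang_logits = self.lang_head(self.lang_decoder(cot_embed, z))      # text CoT reconstruction
        frame_logits = self.frame_head(self.frame_decoder(frame_embed, z)) # future-frame prediction
        return traj, lang_logits, frame_logits


def joint_loss(traj, lang_logits, frame_logits, traj_gt, cot_ids, frame_ids,
               weights=(1.0, 0.5, 0.5)):
    """Weighted sum of trajectory, language, and visual terms; a staged schedule
    would vary the weights across the three training stages."""
    l_traj = F.l1_loss(traj, traj_gt)
    l_lang = F.cross_entropy(lang_logits.flatten(0, 1), cot_ids.flatten())
    l_vis = F.cross_entropy(frame_logits.flatten(0, 1), frame_ids.flatten())
    return weights[0] * l_traj + weights[1] * l_lang + weights[2] * l_vis
```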

If this is right

  • Latent CoT can now deliver higher accuracy than explicit CoT in VLA driving models.
  • Reasoning no longer adds autoregressive latency at deployment time.
  • Representations learned with world-model supervision generalize better than those learned from text alone.
  • The same latent tokens can be aligned to trajectory, language, and visual objectives in one stable pipeline.
  • Auxiliary decoders are needed only for training and can be removed without losing the performance gain. A minimal sketch of the resulting one-pass inference follows this list.
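
Continuing the training sketch above, the claimed deployment path reduces to one forward pass over the latent tokens plus the planning head; the function name and interface below are assumptions, not the paper's API.

```python
@torch.no_grad()
def plan_one_step(model, images, prompt):
    """Answer-only latency: latents are prefilled in a single parallel pass and the
    auxiliary language / future-frame decoders are never called."""
    z = model.encode_latents(images, prompt)   # no token-by-token CoT generation
    return model.traj_head(z)                  # predicted waypoints
```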

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same visual-supervision trick could be tested in other embodied tasks such as robotic manipulation where physical dynamics matter more than language.
  • If the visual decoder is the key, then models without any explicit reasoning tokens might still improve simply by adding future-frame prediction as an auxiliary loss.
  • Safety-critical driving systems could benefit from latents that have been forced to track geometry and motion rather than only reciting explanations.
  • A direct comparison on non-driving VLA benchmarks would show whether the gain is specific to road scenes or holds more broadly.

Load-bearing premise

Adding a decoder that predicts future video frames will make the model's hidden states encode real physical changes in the scene instead of remaining just patterns of words.

What would settle it

Remove the future-frame decoder during training and check whether accuracy on the four benchmarks drops below that of explicit CoT while latency stays the same.
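
A minimal sketch of that settling experiment, assuming hypothetical train() and evaluate() helpers (neither is named in the paper) that expose a flag for the future-frame decoder and report per-benchmark accuracy and per-query latency:

```python
def settle_visual_decoder_question(benchmarks, seeds=(0, 1, 2)):
    """Compare matched runs trained with and without the future-frame decoder."""
    results = {}
    for use_visual in (True, False):
        accuracies, latencies = [], []
        for seed in seeds:
            model = train(use_visual_decoder=use_visual, seed=seed)  # hypothetical trainer
            for bench in benchmarks:
                acc, latency_ms = evaluate(model, bench)             # hypothetical evaluator
                accuracies.append(acc)
                latencies.append(latency_ms)
        results[use_visual] = (sum(accuracies) / len(accuracies),
                               sum(latencies) / len(latencies))
    # The load-bearing premise fails if accuracy without the visual decoder falls
    # below the explicit-CoT baseline while latency stays essentially unchanged.
    return results
```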

Original abstract

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. In inference, the auxiliary decoders are discarded, and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering superior accuracy at answer-only latency. These results show that with world model supervision, latent CoT produces more generalizable representations than verbose token-by-token reasoning. Code has been open-sourced to the community. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces OneVL, a unified VLA and world-model framework for autonomous driving that performs one-step latent CoT reasoning and planning. Compact latent tokens are supervised during training by dual auxiliary decoders (a language decoder reconstructing text CoT and a visual world-model decoder predicting future-frame tokens) via a three-stage pipeline that aligns latents to trajectory, language, and visual objectives. At inference the decoders are dropped and all latents are prefilled in one parallel pass, yielding explicit-CoT accuracy at answer-only latency. The central claim is that this is the first latent-CoT method to surpass explicit CoT across four benchmarks because world-model supervision forces the latent space to internalize causal dynamics of road geometry, agent motion, and environmental change rather than remaining a linguistic abstraction.

Significance. If the reported gains are reproducible and causally attributable to the visual decoder, the work would be significant for real-time VLA deployment: it would demonstrate that latent CoT can outperform verbose explicit reasoning when properly grounded in visual dynamics, while preserving answer-only speed. Open-sourcing the code is a concrete strength that supports verification and follow-up work.

major comments (2)
  1. [§3 and abstract] The claim that the visual world-model decoder 'forces the latent space to internalize the causal dynamics' is load-bearing for the superiority explanation, yet the manuscript contains no ablation, probing, or visualization that isolates this mechanism from the language decoder, trajectory alignment, or simple capacity increases. Without such controls, benchmark gains could arise from unrelated optimization effects.
  2. [§4] The paper reports that OneVL surpasses explicit CoT on four benchmarks but supplies neither full baseline details, ablation tables removing the visual decoder, nor statistical reporting (standard deviations, multiple seeds). This absence directly undermines confidence in the central causal-dynamics attribution and the 'first latent CoT to surpass explicit CoT' claim.
minor comments (2)
  1. [Figure 1 and §3.1] The diagram and text description of how latent tokens are routed to the two decoders would benefit from an explicit equation showing the joint loss and the exact prefill procedure at inference.
  2. [§3.3] The three-stage training schedule is described at a high level; a table listing the loss weights and data schedules per stage would improve reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The concerns about isolating the visual decoder's contribution and providing fuller experimental details are valid and will be addressed through targeted revisions to strengthen the manuscript's claims.

Point-by-point responses
  1. Referee: [§3 and abstract] The claim that the visual world-model decoder 'forces the latent space to internalize the causal dynamics' is load-bearing for the superiority explanation, yet the manuscript contains no ablation, probing, or visualization that isolates this mechanism from the language decoder, trajectory alignment, or simple capacity increases. Without such controls, benchmark gains could arise from unrelated optimization effects.

    Authors: We agree that the current version lacks direct evidence isolating the visual decoder's role. The three-stage pipeline and dual-decoder design are intended to ground latents in visual dynamics rather than linguistic abstractions, but without controls this remains an unverified attribution. In revision we will add: an ablation removing only the visual decoder (keeping the language decoder and trajectory alignment fixed), linear probing of latents for dynamic properties such as agent motion and road geometry (a minimal probe sketch follows these point-by-point responses), and qualitative visualizations of predicted future frames from the latents. These will be placed in §3 and §4. revision: yes

  2. Referee: [§4] The paper reports that OneVL surpasses explicit CoT on four benchmarks but supplies neither full baseline details, ablation tables removing the visual decoder, nor statistical reporting (standard deviations, multiple seeds). This absence directly undermines confidence in the central causal-dynamics attribution and the 'first latent CoT to surpass explicit CoT' claim.

    Authors: We accept that the experimental section requires expansion for reproducibility and statistical rigor. The revised §4 will include: complete implementation details and hyperparameters for all baselines, a new ablation table that isolates removal of the visual decoder, and all main results reported as mean ± std over three random seeds with seed values stated. This will directly support evaluation of the reported gains. revision: yes
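
The probing step promised in the first response could be as simple as the sketch below: fit a linear map from frozen, pooled latents to a dynamic target such as lead-agent speed, where low held-out error would suggest the latents encode that property. The pooling choice and the targets are assumptions, not the paper's protocol.

```python
import torch
import torch.nn as nn


def fit_linear_probe(latents, targets, epochs=200, lr=1e-2):
    """latents: [N, n_latent, d] frozen features; targets: [N, k] dynamic quantities
    (e.g. agent speed, heading change). Returns the final mean-squared probe error."""
    pooled = latents.detach().mean(dim=1)                 # [N, d]; no gradients reach the model
    probe = nn.Linear(pooled.size(-1), targets.size(-1))  # the probe is deliberately linear
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(probe(pooled), targets)
        loss.backward()
        opt.step()
    return loss.item()
```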

Circularity Check

0 steps flagged

No circularity; claims rest on external benchmarks and explicit training objectives

Full rationale

The paper describes a three-stage training pipeline with dual decoders (language and visual world model) whose objectives are stated directly as supervision signals. At inference the decoders are discarded and latents are prefilled in one pass. Superior benchmark performance is reported against external baselines rather than any quantity defined inside the paper's own fitted parameters or equations. No self-citation chain, uniqueness theorem, or ansatz is invoked to force the central result. The assumption that visual-frame prediction embeds causal dynamics is presented as a modeling hypothesis, not derived by construction from the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only the abstract is available, so the ledger is necessarily incomplete. The central claim rests on the domain assumption that visual future-frame prediction supplies causal dynamics missing from language-only latents.

axioms (1)
  • domain assumption: A visual world-model decoder predicting future-frame tokens will force latent representations to capture the causal dynamics of driving scenes.
    This is the explicit justification given for why the new supervision improves over prior linguistic latent CoT.
invented entities (1)
  • compact latent tokens supervised by dual auxiliary decoders (no independent evidence)
    purpose: To compress reasoning into a single parallel pass while internalizing both linguistic CoT and visual causal dynamics.
    New unified component introduced to solve the latency-accuracy tradeoff in VLA driving models.

pith-pipeline@v0.9.0 · 5776 in / 1403 out tokens · 30683 ms · 2026-05-12T00:50:33.475946+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Is Your Driving World Model an All-Around Player?

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    The WorldLens benchmark reveals that no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity; the work contributes a 26K human-annotated dataset and a distilled vision-language evaluator.

  2. OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation

    cs.CV · 2026-05 · conditional · novelty 6.0

    A unified text-conditioned diffusion model generates high-fidelity LiDAR scans across eight domains spanning weather, sensor, and platform shifts using cross-domain training and feature modeling.

Reference graph

Works this paper leans on

121 extracted references · 121 canonical work pages · cited by 2 Pith papers · 22 internal anchors
