Recognition: 3 theorem links
Xiaomi · OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
Pith reviewed 2026-05-12 00:50 UTC · model grok-4.3
The pith
Supervising latent reasoning tokens with future-frame visual predictions lets latent Chain-of-Thought exceed explicit token-by-token reasoning in driving tasks while running at answer-only speed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OneVL routes driving reasoning through compact latent tokens supervised by both a language decoder that reconstructs text Chain-of-Thought and a visual world-model decoder that predicts future-frame tokens. The visual supervision pushes the latent space to represent causal dynamics of geometry, agent movement, and scene evolution rather than linguistic abstractions alone. After three-stage progressive alignment of trajectory, language, and visual objectives, inference discards the auxiliary decoders and prefills every latent token in a single parallel pass, matching answer-only latency while surpassing explicit CoT accuracy across four benchmarks.
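The training/inference split described above can be sketched in a few lines. Everything here is illustrative: the module names, dimensions, linear decoders, and mean-squared losses are assumptions standing in for the paper's actual architecture, not its implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(d_in, d_out):
    # Hypothetical stand-in for a learned decoder/head: a single linear map.
    return rng.standard_normal((d_in, d_out)) * 0.01

D, K = 64, 8                      # latent dim, number of latent reasoning tokens
W_lang = linear(D, 32)            # language decoder: reconstructs text-CoT embeddings
W_vis  = linear(D, 48)            # visual world-model decoder: predicts future-frame tokens
W_traj = linear(D, 2)             # trajectory head: predicts (x, y) waypoints

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def training_loss(latents, cot_emb, future_tokens, waypoints, w=(1.0, 1.0, 1.0)):
    # Dual auxiliary supervision applied to the SAME latent tokens,
    # alongside the trajectory objective.
    l_lang = mse(latents @ W_lang, cot_emb)
    l_vis  = mse(latents @ W_vis, future_tokens)
    l_traj = mse(latents @ W_traj, waypoints)
    return w[0] * l_traj + w[1] * l_lang + w[2] * l_vis

def inference(latents):
    # Auxiliary decoders are discarded; all K latents are consumed in one
    # parallel pass, producing the answer with no autoregressive CoT.
    return latents @ W_traj

z = rng.standard_normal((K, D))
loss = training_loss(z, rng.standard_normal((K, 32)),
                     rng.standard_normal((K, 48)), rng.standard_normal((K, 2)))
plan = inference(z)               # shape (K, 2): waypoints from one pass
```

The point of the sketch is the asymmetry: `W_lang` and `W_vis` appear only inside `training_loss`, so deleting them after training changes nothing about `inference`.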
What carries the argument
Dual auxiliary decoders (language reconstruction plus visual future-frame prediction) that supervise the same compact latent tokens during training so the hidden states internalize physical dynamics.
If this is right
- Latent CoT can now deliver higher accuracy than explicit CoT in VLA driving models.
- Reasoning no longer adds autoregressive latency at deployment time.
- Representations learned with world-model supervision generalize better than those learned from text alone.
- The same latent tokens can be aligned to trajectory, language, and visual objectives in one stable pipeline.
- Auxiliary decoders are needed only for training and can be removed without losing the performance gain.
Where Pith is reading between the lines
- The same visual-supervision trick could be tested in other embodied tasks such as robotic manipulation where physical dynamics matter more than language.
- If the visual decoder is the key, then models without any explicit reasoning tokens might still improve simply by adding future-frame prediction as an auxiliary loss.
- Safety-critical driving systems could benefit from latents that have been forced to track geometry and motion rather than only reciting explanations.
- A direct comparison on non-driving VLA benchmarks would show whether the gain is specific to road scenes or holds more broadly.
Load-bearing premise
Adding a decoder that predicts future video frames will make the model's hidden states encode real physical changes in the scene instead of remaining just patterns of words.
What would settle it
Remove the future-frame decoder during training and check whether accuracy on the four benchmarks drops below that of explicit CoT while latency stays the same.
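The latency half of this check is mechanical rather than empirical: explicit CoT pays one decode step per reasoning token, while the prefill scheme pays a constant number of passes. A toy tally (forward-pass counts only, no real model) makes the comparison concrete:

```python
def explicit_cot_calls(n_reasoning_tokens: int) -> int:
    # Autoregressive CoT: each reasoning token requires its own decode step,
    # then one more step for the final answer.
    return n_reasoning_tokens + 1

def latent_prefill_calls(n_latent_tokens: int) -> int:
    # One-step latent reasoning: all latent tokens are prefilled in a single
    # parallel pass, so the call count is independent of n_latent_tokens.
    return 1

for k in (8, 32, 128):
    print(k, explicit_cot_calls(k), latent_prefill_calls(k))
```

So the proposed ablation only needs to measure accuracy; latency parity follows from the call structure as long as the prefill pass itself fits the same budget as answer-only prediction.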
Original abstract
Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. In inference, the auxiliary decoders are discarded, and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering superior accuracy at answer-only latency. These results show that with world model supervision, latent CoT produces more generalizable representations than verbose token-by-token reasoning. Code has been open-sourced to the community. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OneVL, a unified VLA and world-model framework for autonomous driving that performs one-step latent CoT reasoning and planning. Compact latent tokens are supervised during training by dual auxiliary decoders (a language decoder reconstructing text CoT and a visual world-model decoder predicting future-frame tokens) via a three-stage pipeline that aligns latents to trajectory, language, and visual objectives. At inference the decoders are dropped and all latents are prefilled in one parallel pass, yielding explicit-CoT accuracy at answer-only latency. The central claim is that this is the first latent-CoT method to surpass explicit CoT across four benchmarks because world-model supervision forces the latent space to internalize causal dynamics of road geometry, agent motion, and environmental change rather than remaining a linguistic abstraction.
Significance. If the reported gains are reproducible and causally attributable to the visual decoder, the work would be significant for real-time VLA deployment: it would demonstrate that latent CoT can outperform verbose explicit reasoning when properly grounded in visual dynamics, while preserving answer-only speed. Open-sourcing the code is a concrete strength that supports verification and follow-up work.
Major comments (2)
- [§3 (Method), abstract] The claim that the visual world-model decoder 'forces the latent space to internalize the causal dynamics' is load-bearing for the superiority explanation, yet the manuscript contains no ablation, probing, or visualization that isolates this mechanism from the language decoder, trajectory alignment, or simple capacity increases. Without such controls, benchmark gains could arise from unrelated optimization effects.
- [§4 (Experiments)] The paper reports that OneVL surpasses explicit CoT on four benchmarks but supplies no full baseline details, no ablation table removing the visual decoder, and no statistical reporting (standard deviations, multiple seeds). This absence directly undermines confidence in the central causal-dynamics attribution and the 'first latent CoT to surpass explicit CoT' claim.
Minor comments (2)
- [Figure 1, §3.1] The diagram and text description of how latent tokens are routed to the two decoders would benefit from an explicit equation showing the joint loss and the exact prefill procedure at inference.
- [§3.3] The three-stage training schedule is described at a high level; a table listing the loss weights and data schedules per stage would improve reproducibility.
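One plausible form of the joint-loss equation the referee asks for, inferred from the three objectives named in the abstract. The weights and decoder symbols below are assumptions, not the paper's notation:

```latex
\mathcal{L} \;=\; \lambda_{\mathrm{traj}}\,\mathcal{L}_{\mathrm{traj}}\!\left(f_{\theta}(\mathbf{z})\right)
\;+\; \lambda_{\mathrm{lang}}\,\mathcal{L}_{\mathrm{lang}}\!\left(g_{\phi}(\mathbf{z}),\, \mathrm{CoT}_{\mathrm{text}}\right)
\;+\; \lambda_{\mathrm{vis}}\,\mathcal{L}_{\mathrm{vis}}\!\left(h_{\psi}(\mathbf{z}),\, \mathbf{v}_{t+1:t+H}\right)
```

Here \(\mathbf{z}\) denotes the compact latent tokens, \(f_{\theta}\) the trajectory head, \(g_{\phi}\) the language decoder, \(h_{\psi}\) the visual world-model decoder, and \(\mathbf{v}_{t+1:t+H}\) future-frame tokens over horizon \(H\). At inference only \(f_{\theta}\) survives; \(g_{\phi}\) and \(h_{\psi}\) are discarded.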
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The concerns about isolating the visual decoder's contribution and providing fuller experimental details are valid and will be addressed through targeted revisions to strengthen the manuscript's claims.
Point-by-point responses
Referee: [§3 (Method), abstract] The claim that the visual world-model decoder 'forces the latent space to internalize the causal dynamics' is load-bearing for the superiority explanation, yet the manuscript contains no ablation, probing, or visualization that isolates this mechanism from the language decoder, trajectory alignment, or simple capacity increases. Without such controls, benchmark gains could arise from unrelated optimization effects.
Authors: We agree that the current version lacks direct evidence isolating the visual decoder's role. The three-stage pipeline and dual-decoder design are intended to ground latents in visual dynamics rather than linguistic abstractions, but without controls this remains an unverified attribution. In revision we will add: an ablation removing only the visual decoder (keeping the language decoder and trajectory alignment fixed), linear probing of latents for dynamic properties such as agent motion and road geometry, and qualitative visualizations of future frames predicted from the latents. These will be placed in §3 and §4. Revision: yes.
Referee: [§4 (Experiments)] The paper reports that OneVL surpasses explicit CoT on four benchmarks but supplies no full baseline details, no ablation table removing the visual decoder, and no statistical reporting (standard deviations, multiple seeds). This absence directly undermines confidence in the central causal-dynamics attribution and the 'first latent CoT to surpass explicit CoT' claim.
Authors: We accept that the experimental section requires expansion for reproducibility and statistical rigor. The revised §4 will include: complete implementation details and hyperparameters for all baselines, a new ablation table that isolates removal of the visual decoder, and all main results reported as mean ± std over three random seeds with seed values stated. This will directly support evaluation of the reported gains. Revision: yes.
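The linear probe the rebuttal promises can be as simple as a least-squares readout from frozen latents to a dynamic target, scored by R². The sketch below uses synthetic latents and synthetic velocity labels; every name is illustrative, and a real probe would use held-out data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen latents from N scenes (stand-ins for the model's latent reasoning
# tokens, mean-pooled per scene) and a dynamic target such as lead-agent
# velocity. Here the target is synthesized linearly from the latents plus noise.
N, D = 500, 64
latents = rng.standard_normal((N, D))
true_readout = rng.standard_normal((D, 2))
velocity = latents @ true_readout + 0.1 * rng.standard_normal((N, 2))

# Linear probe: ordinary least squares from latents to the target.
W, *_ = np.linalg.lstsq(latents, velocity, rcond=None)
pred = latents @ W

# R^2: fraction of target variance the probe explains. A high R^2 on real
# latents would support the claim that they encode the dynamic property.
ss_res = np.sum((velocity - pred) ** 2)
ss_tot = np.sum((velocity - velocity.mean(axis=0)) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(round(r2, 3))
```

Run per probed property (agent motion, road geometry) and compare against latents trained without the visual decoder; the gap between the two R² values is the isolating evidence the referee asks for.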
Circularity Check
No circularity; claims rest on external benchmarks and explicit training objectives
Full rationale
The paper describes a three-stage training pipeline with dual decoders (language and visual world model) whose objectives are stated directly as supervision signals. At inference the decoders are discarded and latents are prefilled in one pass. Superior benchmark performance is reported against external baselines rather than any quantity defined inside the paper's own fitted parameters or equations. No self-citation chain, uniqueness theorem, or ansatz is invoked to force the central result. The assumption that visual-frame prediction embeds causal dynamics is presented as a modeling hypothesis, not derived by construction from the inputs.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: a visual world-model decoder predicting future-frame tokens will force latent representations to capture the causal dynamics of driving scenes.
Invented entities (1)
- Compact latent tokens supervised by dual auxiliary decoders (no independent evidence).
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Matched passage: "compression drives generalization... tighter compression forces the model to retain only the causal structure"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Is Your Driving World Model an All-Around Player?
  The WorldLens benchmark reveals that no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, contributing a 26K human-annotated dataset and a distilled vision-language evaluator.
- OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation
  A unified text-conditioned diffusion model generates high-fidelity LiDAR scans across eight domains spanning weather, sensor, and platform shifts using cross-domain training and feature modeling.
Reference graph
Works this paper leans on
-
[1]
Claude 3.7 Sonnet and Claude Code.https://www.anthropic.com/news/claude-3-7-sonnet, 2025
Anthropic. Claude 3.7 Sonnet and Claude Code.https://www.anthropic.com/news/claude-3-7-sonnet, 2025
work page 2025
-
[2]
Self-supervised learning from images with a joint-embedding predictive architecture
Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023
work page 2023
-
[3]
Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025
-
[4]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Hengwei Bian, Lingdong Kong, Haozhe Xie, Liang Pan, Yu Qiao, and Ziwei Liu. Dynamiccity: Large-scale 4d occupancy generation from dynamic scenes.arXiv preprint arXiv:2410.18084, 2024
-
[8]
nuscenes: A multimodal dataset for autonomous driving
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020
work page 2020
-
[9]
Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding
Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21819–21830, 2024
work page 2024
-
[10]
arXiv preprint arXiv:2510.25122 (2025)
Jiahong Chen, Jing Wang, Long Chen, Chuwei Cai, and Jinghui Lu. Nanovla: Routing decoupled vision-language understanding for nano-sized generalist robotic policies.arXiv preprint arXiv:2510.25122, 2025
-
[11]
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-Embedding: Multi- linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.arXiv preprint arXiv:2402.03216, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
Automated evaluation of large vision-language models on self-driving corner cases
Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. Automated evaluation of large vision-language models on self-driving corner cases. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 7817–7826. IEEE, 2025
work page 2025
-
[13]
Driving with llms: Fusing object-level vector modality for explainable autonomous driving
Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. Driving with llms: Fusing object-level vector modality for explainable autonomous driving. In2024 IEEE InternationalConference on Robotics and Automation (ICRA), pages 14093–14100. IEEE, 2024. 23
work page 2024
-
[14]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[15]
Qimao Chen, Fang Li, Shaoqing Xu, Zhiyi Lai, Zixun Xie, Yuechen Luo, Shengyin Jiang, Hanbing Li, Long Chen, Bing Wang, et al. Vilta: A vlm-in-the-loop adversary for enhancing driving policy robustness.arXiv preprint arXiv:2601.12672, 2026
-
[16]
Jeffrey Cheng and Benjamin Van Durme. Compressed chain of thought: Efficient reasoning through dense representations. arXiv preprint arXiv:2412.13171, 2024
-
[17]
Impromptu VLA: Open weights and open data for driving vision-language-action models
Haohan Chi, Huan ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, Leichen Wang, Xingtao Hu, Hao Sun, Hang Zhao, and Hao Zhao. Impromptu VLA: Open weights and open data for driving vision-language-action models. InAdvancesin Neural Information Processing Systems (Datasets and Benchmarks Track), volume...
work page 2025
- [18]
-
[19]
Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advancesin Neural Information Processing Systems, 37:28706–28719, 2024
work page 2024
-
[20]
Language Modeling Is Compression , year =
Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, et al. Language modeling is compression. arXiv preprint arXiv:2309.10668, 2023
-
[21]
From explicit cot to implicit cot: Learning to internalize cot step by step
Yuntian Deng, Yejin Choi, and Stuart Shieber. From explicit cot to implicit cot: Learning to internalize cot step by step.arXiv preprint arXiv:2405.14838, 2024
-
[22]
Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models
Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024
work page 2024
-
[23]
Yifei Dong, Fengyi Wu, Guangyu Chen, Lingdong Kong, Xu Zhu, Qiyu Hu, Yuxuan Zhou, Jingdong Sun, Jun-Yan He, Qi Dai, Alexander G. Hauptmann, and Zhi-Qi Cheng. Towards unified world models for visual navigation via memory-augmented planning and foresight.arXiv preprint arXiv:2510.08713, 2025
-
[24]
Language-conditioned world modeling for visual navigation.arXiv preprint arXiv:2603.26741, 2026
Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, et al. Language-conditioned world modeling for visual navigation.arXiv preprint arXiv:2603.26741, 2026
-
[25]
Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset
Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R Qi, Yin Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. InProceedings of the IEEE/CVF international conference on computer vision, pages 9710–9719, 2021
work page 2021
-
[26]
Advancing sequential numerical prediction in autoregressive models
Xiang Fei, Jinghui Lu, Qi Sun, Hao Feng, Yanjie Wang, Wei Shi, An-Lan Wang, Jingqun Tang, and Can Huang. Advancing sequential numerical prediction in autoregressive models. InAnnual Meeting of the Association for Computational Linguistics, pages 562–574, 2025
work page 2025
-
[27]
Dolphin: Document image parsing via heterogeneous anchor prompting
Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting. InAnnual Meeting of the Association for Computational Linguistics, pages 21919–21936, 2025
work page 2025
-
[28]
A survey of world models for autonomous driving.arXiv preprint arXiv:2501.11260, 2025
Tuo Feng, Wenguan Wang, and Yi Yang. A survey of world models for autonomous driving.arXiv preprint arXiv:2501.11260, 2025
-
[29]
Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. InIEEE/CVF InternationalConference on Computer Vision, pages 24823–24834, 2025. 24
work page 2025
-
[30]
Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning,
Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie, Bing Wang, Guang Chen, Dingkang Liang, andXiangBai. MindDrive: Avision-language-actionmodelforautonomousdrivingviaonlinereinforcement learning. arXiv preprint arXiv:2512.13636, 2025
-
[31]
Vision meets robotics: The KITTI dataset
Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research, 32(11):1231–1237, 2013
work page 2013
-
[32]
Jiaheng Geng, Jiatong Du, Xinyu Zhang, Ye Li, Panqu Wang, and Yanjun Huang. Driving in corner case: A real-world adversarial closed-loop evaluation platform for end-to-end autonomous driving.arXiv preprint arXiv:2512.16055, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Anurag Ghosh, Shen Zheng, Robert Tamburo, Khiem Vuong, Juan Alvarez-Padilla, Hailiang Zhu, Michael Cardei, Nicholas Dunn, Christoph Mertz, and Srinivasa G. Narasimhan. ROADWork: A dataset and benchmark for learning to recognize, observe, analyze and drive through work zones. InIEEE/CVF International Conference on Computer Vision, pages 6132–6142, 2025
work page 2025
-
[34]
Google. Gemini 2.5 Pro preview: even better coding performance.https://developers.googleblog.com/en/ gemini-2-5-pro-io-improved-coding-performance, 2025
work page 2025
-
[35]
Yanchen Guan, Haicheng Liao, Zhenning Li, Jia Hu, Runze Yuan, Guohui Zhang, and Chengzhong Xu. World models for autonomous driving: An initial survey.IEEE Transactionson Intelligent Vehicles, pages 1–17, 2024
work page 2024
-
[36]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[39]
Dream to Control: Learning Behaviors by Latent Imagination
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019
work page internal anchor Pith review arXiv 1912
-
[40]
Training Large Language Models to Reason in a Continuous Latent Space
Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
MiMo-Embodied: X-Embodied Foundation Model Technical Report
Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shuhuai Ren, Xianhui Meng, Yuchen Zhang, Jing Wu, Jinghui Lu, Chenxu Dang, Jiayi Guan, Jianhua Wu, Zhiyi Hou, Hanbing Li, Shumeng Xia, Mingliang Zhou, Yinan Zheng, Zihao Yue, Shuhao Gu, Hao Tian, Yuannan Shen, Jianwei Cui, Wen Zhang, Shaoqing Xu, Bing Wang...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
Drivemrp: Enhancing vision-language models with synthetic motion data for motion risk prediction
Zhiyi Hou, Enhui Ma, Fang Li, Zhiyi Lai, Kalok Ho, Zhanqian Wu, Lijun Zhou, Long Chen, Chitian Sun, Haiyang Sun, et al. Drivemrp: Enhancing vision-language models with synthetic motion data for motion risk prediction. arXiv preprint arXiv:2507.02948, 2025
-
[43]
GAIA-1: A Generative World Model for Autonomous Driving
Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023
work page internal anchor Pith review arXiv 2023
-
[44]
Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, Xiaoshuai Hao, Linfeng Li, Hang Song, Xiangtai Li, Jun Ma, Shaojie Shen, Jianke Zhu, Dacheng Tao, Ziwei Liu, and Junwei Liang. Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16...
-
[45]
NavThinker: Action-conditioned world models for coupled prediction and planning in social navigation
Tianshuai Hu, Zeying Gong, Lingdong Kong, Xiaodong Mei, Yiyi Ding, Qi Zeng, Ao Liang, Rong Li, Yangyi Zhong, and Junwei Liang. NavThinker: Action-conditioned world models for coupled prediction and planning in social navigation. arXiv preprint arXiv:2603.15359, 2026
-
[46]
Fuller: Unified multi-modality multi-task 3D perception via multi-level gradient calibration
Zhijian Huang, Sihao Lin, Guiyu Liu, Mukun Luo, Chaoqiang Ye, Hang Xu, Xiaojun Chang, and Xiaodan Liang. Fuller: Unified multi-modality multi-task 3D perception via multi-level gradient calibration. InIEEE/CVF International Conference on Computer Vision, pages 3502–3511, 2023. 25
work page 2023
-
[47]
Making large language models better planners with reasoning-decision alignment
Zhijian Huang, Tao Tang, Shaoxiang Chen, Sihao Lin, Zequn Jie, Lin Ma, Guangrun Wang, and Xiaodan Liang. Making large language models better planners with reasoning-decision alignment. InEuropean Conference on Computer Vision, pages 73–90. Springer, 2024
work page 2024
-
[48]
RoboTron-Drive: All-in-one large multimodal model for autonomous driving
Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, and Lin Ma. RoboTron-Drive: All-in-one large multimodal model for autonomous driving. InIEEE/CVF International Conference on Computer Vision, pages 8011–8021, 2025
work page 2025
-
[49]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card.arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[50]
Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, Ivan Laptev, Rao Muhammad Anwer, and Salman Khan. DriveLMM-o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding. arXiv preprint arXiv:2503.10621, 2025
-
[51]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey ...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[52]
Meml-grpo: Heterogeneous multi-expert mutual learning for rlvr advancement
Weitao Jia, Jinghui Lu, Haiyang Yu, Siqi Wang, Guozhi Tang, An-Lan Wang, Weijie Yin, Dingkang Yang, Yuxiang Nie, Bin Shan, et al. Meml-grpo: Heterogeneous multi-expert mutual learning for rlvr advancement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 31283–31291, 2026
work page 2026
-
[53]
Towards learning- based planning: The nuPlan benchmark for real-world autonomous driving
Napat Karnchanachari, Dimitris Geromichalos, Kok Seang Tan, Nanxiang Li, Christopher Eriksen, Shakiba Yaghoubi, Noushin Mehdipour, Gianmarco Bernasconi, Whye Kit Fong, Yiluan Guo, et al. Towards learning- based planning: The nuPlan benchmark for real-world autonomous driving. InIEEE International Conference on Robotics and Automation, pages 629–636, 2024
work page 2024
-
[54]
Lingdong Kong, Shaoyuan Xie, Hanjiang Hu, Yaru Niu, Wei Tsang Ooi, Benoit R. Cottereau, Lai Xing Ng, Yuexin Ma, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu, Weichao Qiu, Wei Zhang, Xu Cao, Hao Lu, Ying-Cong Chen, Caixin Kang, Xinning Zhou, Chengyang Ying, Wentao Shang, Xingxing Wei, Yinpeng Dong, Bo Yang, Shengyin Jiang, Zeliang Ma, Dengyi Ji, Haiwen Li,...
-
[55]
Lingdong Kong, Xiang Xu, Jiawei Ren, Wenwei Zhang, Liang Pan, Kai Chen, Wei Tsang Ooi, and Ziwei Liu. Multi-modal data-efficient 3D scene understanding for autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3748–3765, 2025
work page 2025
-
[56]
Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, Junyuan Deng, Kaiwen Zhang, Yang Wu, Tianyi Yan, Shenyuan Gao, Song Wang, Linfeng Li, Liang Pan, Yong Liu, Jianke Zhu, Wei Tsang Ooi, Steven C. H. Hoi, and Ziwei Liu. 3D and 4D world modeling: A survey.arXiv preprint arXiv:2509.07996, 2025
-
[57]
Lingdong Kong, Xiang Xu, Youquan Liu, Jun Cen, Runnan Chen, Wenwei Zhang, Liang Pan, Kai Chen, and Ziwei Liu. LargeAD: Large-scale cross-sensor data pretraining for autonomous driving.IEEE Transactionson Pattern Analysis and Machine Intelligence, 48(2):1291–1308, 2026
work page 2026
-
[58]
Universal intelligence: A definition of machine intelligence.Minds and Machines, 17(4):391–444, 2007
Shane Legg and Marcus Hutter. Universal intelligence: A definition of machine intelligence.Minds and Machines, 17(4):391–444, 2007
work page 2007
-
[59]
Enhancing end- to-end autonomous driving with latent world model
Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end- to-end autonomous driving with latent world model. InInternational Conference on Learning Representations, 2025
work page 2025
-
[60]
End-to-end driving with online trajectory evaluation via BEV world model
Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via BEV world model. InIEEE/CVF International Conference on Computer Vision, pages 27137–27146, 2025. 26
work page 2025
-
[61]
Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, Lu Hou, Lue Fan, and Zhaoxiang Zhang. DriveVLA-W0: World models amplify data scaling law in autonomous driving. In International Conference on Learning Representations, 2026.
[62]
Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, Kun Ma, Guang Chen, Hangjun Ye, Wenyu Liu, and Xinggang Wang. ReCogDrive: A reinforced cognitive framework for end-to-end autonomous driving. arXiv preprint arXiv:2506.08052, 2025.
[63]
Ao Liang, Youquan Liu, Yu Yang, Dongyue Lu, Linfeng Li, Lingdong Kong, Huaici Zhao, and Wei Tsang Ooi. LiDARCrafter: Dynamic 4D world modeling from LiDAR sequences. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22):18406–18414, Mar. 2026. doi: 10.1609/aaai.v40i22.38905. URL https://ojs.aaai.org/index.php/AAAI/article/view/38905.
[64]
Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit R. Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, and Ziwei Liu. WorldLens: Full-spectrum evaluations of driving world models in real wor...
[65]
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations, 2024.
[66]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, pages 34892–34916, 2023.
[67]
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
[68]
Lin Liu, Caiyan Jia, Guanyi Yu, Ziying Song, Junqiao Li, Feiyang Jia, Peiliang Wu, Xiaoshuai Hao, and Yadan Luo. GuideFlow: Constraint-guided flow matching for planning in end-to-end autonomous driving. arXiv preprint arXiv:2511.18729, 2025.
[69]
Lin Liu, Ziying Song, Caiyan Jia, Hangjun Ye, Xiaoshuai Hao, and Long Chen. DriveWorld-VLA: Unified latent-space world modeling with vision-language-action for autonomous driving. arXiv preprint arXiv:2602.06521, 2026.
[70]
Xueyi Liu, Zuodong Zhong, Junli Wang, Yuxin Guo, Zhiguo Su, Qichao Zhang, Yinfeng Gao, Yupeng Zheng, Dongbin Zhao, et al. ReasonPlan: Unified scene prediction and decision reasoning for closed-loop autonomous driving. In Conference on Robot Learning, pages 3051–3068. PMLR, 2025.
[71]
Jinghui Lu, Linyi Yang, Brian Mac Namee, and Yue Zhang. A rationale-centric framework for human-in-the-loop machine learning. In Annual Meeting of the Association for Computational Linguistics, pages 6986–6996, 2022.
[72]
Jinghui Lu, Rui Zhao, Brian Mac Namee, and Fei Tan. PUnifiedNER: A prompting-based unified NER system for diverse datasets. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intellig...
[73]
Jinghui Lu, Dongsheng Zhu, Weidong Han, Rui Zhao, Brian Mac Namee, and Fei Tan. What makes pre-trained language models better zero-shot learners? In Annual Meeting of the Association for Computational Linguistics, pages 2288–2303, 2023.
[74]
Jinghui Lu, Ziwei Yang, Yanjie Wang, Xuejing Liu, Brian Mac Namee, and Can Huang. PaDeLLM-NER: Parallel decoding in large language models for named entity recognition. In Advances in Neural Information Processing Systems, volume 37, pages 117853–117880, 2024.
[75]
Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, Hao Liu, and Can Huang. A bounding box is worth one token - interleaving layout and text in a large language model for document understanding. In Annual Meeting of the Association for Computational Linguistics, pages 7252–7273, 2025.
[76]
Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems, volume 35, pages 2507–2521, 2022.
[77]
Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, Hangjun Ye, Zhi-Xin Yang, and Fuxi Wen. LaST-VLA: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving. arXiv preprint arXiv:2603.01928, 2025.
[78]
Yuechen Luo, Fang Li, Shaoqing Xu, Zhiyi Lai, Lei Yang, Qimao Chen, Ziang Luo, Zixun Xie, Shengyin Jiang, Jiaxin Liu, et al. AdaThinkDrive: Adaptive thinking via reinforcement learning for autonomous driving. arXiv preprint arXiv:2509.13769, 2025.
[79]
Yuechen Luo, Qimao Chen, Fang Li, Shaoqing Xu, Jiaxin Liu, Ziying Song, Zhi-Xin Yang, and Fuxi Wen. Unleashing VLA potentials in autonomous driving via explicit learning from failures. arXiv preprint arXiv:2603.01063, 2026.
[80]
Ziang Luo, Kangan Qian, Jiahua Wang, Yuechen Luo, Jinyu Miao, Zheng Fu, Yunlong Wang, Sicong Jiang, Zilin Huang, Yifei Hu, Yuhao Yang, Hao Ye, Mengmeng Yang, Xiaojian Dong, Kun Jiang, and Diange Yang. MTRDrive: Memory-tool synergistic reasoning for robust autonomous driving in corner cases. arXiv preprint arXiv:2509.20843, 2025.