
arxiv: 2605.21061 · v1 · pith:RB3E4MGTnew · submitted 2026-05-20 · 💻 cs.CV · cs.AI · cs.RO

Grounding Driving VLA via Inverse Kinematics

Pith reviewed 2026-05-21 05:29 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords driving VLA · inverse kinematics · visual grounding · trajectory planning · NAVSIM · nuScenes · vision language models · future state prediction

The pith

A 0.5B driving VLA matches 7B-8B models by treating trajectory prediction as inverse kinematics that requires future visual states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing driving vision-language-action models largely ignore visual tokens because the task of generating trajectories from only current visuals, ego status, and text commands is structurally ill-posed and invites shortcuts. It fixes this by adding a next visual state prediction objective that supplies dense visual supervision and by introducing a separate Inverse Kinematics Network that receives only the current and future visual states to decode the trajectory. With these changes alone, the 0.5B model recovers the ability to exploit visual features and reaches closed-loop trajectory planning performance on NAVSIM-v2 and nuScenes that is comparable to models more than ten times larger. The gains appear strongest in dynamic situations such as turning, indicating that proper visual boundary conditions matter for grounding.

Core claim

Trajectory recovery in driving VLAs requires both a current and a future visual state as boundary conditions; supplying only the current state encourages reliance on ego status and text. By adding a next visual state prediction objective and a cross-attention-based conditional diffusion Inverse Kinematics Network that takes solely the current and future visual states, the model suppresses shortcut paths during decoding. This redesign allows a 0.5B-scale model to recover visual grounding and match the trajectory planning performance of 7B-8B VLAs on NAVSIM-v2 and nuScenes, with the largest improvements in dynamic driving scenarios.
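The contrast between the two formulations can be written compactly. The symbols below are notation chosen for this sketch, not taken from the paper: $V_t$ is the current visual state, $\hat{V}_{t+\Delta}$ the predicted future visual state, $e_t$ the ego status, $c$ the text command, and $\tau$ the trajectory; $f_\theta$, $g_\phi$, and $h_\psi$ are placeholder names for the respective networks.

```latex
% Existing driving VLAs: the trajectory is decoded from the current visual
% state together with ego status and text, so e_t and c offer a shortcut.
\tau = f_\theta\left(V_t,\; e_t,\; c\right)

% Inverse-kinematics style: the LLM first predicts the future visual state,
% then a separate network decodes the trajectory from the two visual
% boundary conditions only.
\hat{V}_{t+\Delta} = g_\phi\left(V_t,\; e_t,\; c\right), \qquad
\tau = h_\psi\left(V_t,\; \hat{V}_{t+\Delta}\right)
```

In this reading, $e_t$ and $c$ can influence $\tau$ only through $\hat{V}_{t+\Delta}$, which is the structural constraint the core claim relies on.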

What carries the argument

The Inverse Kinematics Network, a cross-attention-based conditional diffusion model that decodes trajectories from current and future visual states alone while ignoring ego status and text.
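To make the visual-only conditioning concrete, here is a minimal sketch of the cross-attention pathway, assuming single-head scaled dot-product attention with no learned projections. The function names, token shapes, and the use of noisy waypoints as queries are illustrative assumptions; this is not the paper's network and it omits the diffusion process entirely.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product cross-attention (keys = values = context)."""
    dim = len(context[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in context])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return outputs


def condition_on_visual_states(
    noisy_waypoints: List[Vector],
    current_visual: List[Vector],
    future_visual: List[Vector],
) -> List[Vector]:
    """Hypothetical conditioning step: waypoint queries attend to visual tokens.

    Ego status and text are deliberately absent from the signature, so they
    cannot act as a shortcut at this stage.
    """
    return cross_attend(noisy_waypoints, current_visual + future_visual)


# Toy usage with random tokens (assumed sizes: 3 tokens per frame, 4 dims).
rng = random.Random(0)
DIM = 4
current = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
future = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
waypoints = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
conditioned = condition_on_visual_states(waypoints, current, future)
```

Each output row is a convex combination of the visual tokens, so whatever the decoder produces downstream can depend only on the two visual states.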

If this is right

  • The model regains the ability to exploit visual features instead of bypassing them.
  • Trajectory planning performance becomes comparable to much larger models without increasing parameter count.
  • Gains concentrate in dynamic situations such as turning where visual grounding is most needed.
  • The structural change reduces dependence on ego status and textual commands during decoding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same boundary-condition approach could be tested in other vision-language-action domains where future observations provide natural constraints.
  • Correcting task formulation this way may reduce the need for ever-larger models in grounded planning tasks.
  • Closed-loop tests on additional benchmarks or real vehicles would check whether the recovered visual grounding transfers beyond the reported datasets.

Load-bearing premise

That providing a predicted future visual state together with a visual-only IK network is sufficient to remove the model's reliance on ego-status and text shortcuts.

What would settle it

Measure whether trajectory accuracy collapses and shortcut usage rises when the future visual state input is replaced by random noise or mismatched frames while keeping ego status and text available.
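The test described above can be sketched as a small evaluation harness. Everything here is a toy stand-in: `toy_planner` is a hypothetical planner that interpolates between two 2-D states, and the "states" are positions rather than visual features. A real check would pass the trained model, with ego status and text still supplied, in place of `toy_planner`.

```python
import math
import random
from typing import Callable, List, Sequence

State = Sequence[float]
Trajectory = List[List[float]]
Planner = Callable[[State, State], Trajectory]


def toy_planner(current: State, future: State, steps: int = 4) -> Trajectory:
    """Hypothetical planner: linear interpolation between the two states."""
    return [
        [c + (f - c) * k / steps for c, f in zip(current, future)]
        for k in range(1, steps + 1)
    ]


def mean_l2(pred: Trajectory, target: Trajectory) -> float:
    """Average Euclidean distance between matching waypoints."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(target)


def evaluate(
    planner: Planner,
    currents: List[State],
    futures: List[State],
    targets: List[Trajectory],
) -> float:
    errors = [
        mean_l2(planner(c, f), t) for c, f, t in zip(currents, futures, targets)
    ]
    return sum(errors) / len(errors)


rng = random.Random(0)
currents = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(8)]
futures = [[c[0] + rng.uniform(2, 4), c[1] + rng.uniform(-1, 1)] for c in currents]
targets = [toy_planner(c, f) for c, f in zip(currents, futures)]

# Two corruptions of the future-state input: frames from other samples, and noise.
mismatched = futures[1:] + futures[:1]
noise = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in futures]

err_true = evaluate(toy_planner, currents, futures, targets)
err_mismatched = evaluate(toy_planner, currents, mismatched, targets)
err_noise = evaluate(toy_planner, currents, noise, targets)
```

Because the toy planner depends on the future state by construction, both corruptions raise its error. For a trained model, the informative quantity is how large that rise is: a negligible gap would suggest the future state is being bypassed.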

Figures

Figures reproduced from arXiv: 2605.21061 by Hyunjung Shim, Junsung Park.

Figure 1: Counterfactual obstacle stitching on nuScenes val. Left columns: original front-camera ...
Figure 2: Existing Driving VLA (a) vs. our model (b). Existing models decode the trajectory directly ...
Figure 3: Overview of our model. The vision encoder extracts visual tokens ...
Figure 4: Qualitative comparison of GradCAM saliency on nuScenes ...
Figure 5: Qualitative examples of the position–size stitching variants on nuScenes ...
Original abstract

Existing Driving VLAs predict trajectories while largely ignoring their visual tokens, a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is designed to suppress reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and reaches trajectory planning performance comparable to 7B–8B VLAs more than an order of magnitude larger, on both the closed-loop NAVSIM-v2 and the nuScenes benchmarks. Extensive analysis further shows that this improvement stems from a recovered ability to exploit visual features, with the effect being most pronounced in dynamic driving situations such as turning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that Driving VLAs under-exploit visual tokens because the task is structurally ill-posed: trajectory decoding receives only the current visual state and can therefore shortcut through ego status and text commands. The authors reframe the problem as inverse kinematics by adding (1) a next-visual-state prediction objective that supplies dense visual supervision to the LLM and (2) a separate Inverse Kinematics Network (cross-attention conditional diffusion model) that receives only the current and predicted future visual states and outputs the trajectory. They report that this change alone enables a 0.5B model to reach closed-loop performance on NAVSIM-v2 and nuScenes comparable to 7–8B VLAs, with the largest gains in dynamic maneuvers, and attribute the improvement to recovered visual grounding.

Significance. If the central mechanism is verified, the work offers a lightweight structural remedy for visual grounding in driving VLAs that does not require scaling model size. The inverse-kinematics framing is conceptually clean and the reported efficiency gain (0.5B vs. 7–8B) would be practically relevant for deployment. The approach also supplies a concrete testbed for studying shortcut behavior in multimodal planners.

major comments (2)
  1. [Abstract / analysis section] The claim that the next-visual-state prediction 'suppresses shortcut paths' and that 'extensive analysis' demonstrates recovered visual exploitation is load-bearing for the central thesis. However, because the LLM still receives ego status and text when predicting the future scene, it remains possible that the prediction head satisfies its loss primarily via non-visual signals; the downstream IK network would then inherit any such bias. A control that isolates whether the visual-prediction head actually conditions on current visual tokens (e.g., an ablation that masks visual inputs during prediction and measures degradation) is required to substantiate the structural fix.
  2. [Experiments section] The manuscript states that the 0.5B model reaches performance 'comparable' to 7–8B VLAs on both NAVSIM-v2 and nuScenes. To support this cross-scale claim, the results must be accompanied by full ablation tables, error analysis, and statistical comparisons (including variance across seeds or runs). Without these details the attribution of gains specifically to visual grounding versus other implementation choices cannot be verified.
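The masking control requested in the first major comment can be illustrated with two toy prediction heads. Both heads, the scalar "tokens", and the zero-masking convention are hypothetical choices for this sketch; the paper's prediction head and masking procedure are not described on this page.

```python
from typing import Callable, List

Tokens = List[float]
# A head maps (visual_tokens, ego_status) to predicted future tokens.
Head = Callable[[Tokens, Tokens], Tokens]


def grounded_head(visual: Tokens, ego: Tokens) -> Tokens:
    """Hypothetical head that uses the visuals: shifts each token by ego speed."""
    return [v + ego[0] for v in visual]


def shortcut_head(visual: Tokens, ego: Tokens) -> Tokens:
    """Hypothetical head that ignores the visuals and relies on ego status alone."""
    return [ego[0] for _ in visual]


def mse(pred: Tokens, target: Tokens) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def masking_degradation(
    head: Head, visual: Tokens, ego: Tokens, target: Tokens
) -> float:
    """Increase in error when the visual tokens are replaced by zeros."""
    clean = mse(head(visual, ego), target)
    masked = mse(head([0.0] * len(visual), ego), target)
    return masked - clean


visual = [1.0, 2.0, 3.0]
ego = [0.5]
target = [1.5, 2.5, 3.5]

grounded_drop = masking_degradation(grounded_head, visual, ego, target)
shortcut_drop = masking_degradation(shortcut_head, visual, ego, target)
```

A head that conditions on the visual tokens degrades when they are masked, whereas a head that shortcuts through ego status is unaffected, which is the signal the referee asks the authors to report.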
minor comments (2)
  1. [Method section] The description of the Inverse Kinematics Network should include the precise conditioning variables, diffusion schedule, and training losses so that the visual-only constraint can be reproduced exactly.
  2. [Figure 1] Figure 1 (or equivalent architecture diagram) would benefit from explicit arrows or annotations distinguishing the information flow into the LLM versus the IK network.
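As an example of the level of detail the first minor comment asks for, the sketch below specifies one common choice of diffusion schedule: the linear beta schedule and closed-form forward noising from denoising diffusion probabilistic models. Whether the paper uses this schedule, these endpoint values, or this number of steps is not stated on this page; all three are assumptions here.

```python
import math
import random
from typing import List


def linear_beta_schedule(
    num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02
) -> List[float]:
    """Linearly spaced noise variances (assumed DDPM-style defaults)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(
    x0: List[float], t: int, alpha_bars: List[float], rng: random.Random
) -> List[float]:
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    a = alpha_bars[t]
    return [
        math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0
    ]


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

# Noise a flattened toy trajectory (four hypothetical coordinates) halfway through.
rng = random.Random(0)
clean_trajectory = [0.0, 1.0, 2.0, 3.0]
noisy_trajectory = add_noise(clean_trajectory, 500, alpha_bars, rng)
```

Stating the schedule, the conditioning inputs, and the loss at this granularity is what would let a reader reproduce the visual-only constraint.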

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where additional evidence would strengthen the central claims regarding visual grounding recovery. We address each major point below and commit to revisions that directly respond to the concerns.

Point-by-point responses
  1. Referee: [Abstract / analysis section] The claim that the next-visual-state prediction 'suppresses shortcut paths' and that 'extensive analysis' demonstrates recovered visual exploitation is load-bearing for the central thesis. However, because the LLM still receives ego status and text when predicting the future scene, it remains possible that the prediction head satisfies its loss primarily via non-visual signals; the downstream IK network would then inherit any such bias. A control that isolates whether the visual-prediction head actually conditions on current visual tokens (e.g., an ablation that masks visual inputs during prediction and measures degradation) is required to substantiate the structural fix.

    Authors: We appreciate the referee's identification of this potential ambiguity. Our existing analysis shows that the visual prediction objective leads to higher attention weights on visual tokens and disproportionately larger gains in dynamic scenarios that require visual reasoning. Nevertheless, we agree that a direct isolation experiment is needed. In the revised manuscript we will add an ablation that masks current visual inputs to the prediction head (while retaining ego status and text) and report the resulting degradation in both next-state prediction accuracy and downstream closed-loop planning performance. This will substantiate that the head conditions on visual tokens rather than non-visual shortcuts.

    Revision: yes

  2. Referee: [Experiments section] The manuscript states that the 0.5B model reaches performance 'comparable' to 7–8B VLAs on both NAVSIM-v2 and nuScenes. To support this cross-scale claim, the results must be accompanied by full ablation tables, error analysis, and statistical comparisons (including variance across seeds or runs). Without these details the attribution of gains specifically to visual grounding versus other implementation choices cannot be verified.

    Authors: We concur that expanded experimental reporting is required to support the cross-scale comparison and to isolate the source of the gains. The revised experiments section will include: (i) complete ablation tables showing the incremental contribution of the next-visual-state prediction objective and the separate Inverse Kinematics Network, (ii) error analysis stratified by scenario type (e.g., turns, lane changes, and straight driving), and (iii) statistical comparisons reporting mean performance with standard deviation across at least three independent runs with different random seeds. These additions will allow readers to verify attribution to recovered visual grounding.

    Revision: yes
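The per-scenario reporting promised in items (ii) and (iii) could be tabulated along the following lines. The scenario names follow the response above; the scores are placeholder values chosen for easy arithmetic and are not results from the paper, and `summarize` is a hypothetical helper.

```python
import statistics
from typing import Dict, List


def summarize(scores_by_seed: List[float]) -> Dict[str, float]:
    """Mean and sample standard deviation over independent runs."""
    return {
        "mean": statistics.mean(scores_by_seed),
        "std": statistics.stdev(scores_by_seed),  # needs at least two runs
        "n": float(len(scores_by_seed)),
    }


# Placeholder scores for three seeds per scenario type (NOT real results).
placeholder_scores: Dict[str, List[float]] = {
    "turn": [1.0, 2.0, 3.0],
    "lane_change": [2.0, 2.0, 2.0],
    "straight": [4.0, 5.0, 6.0],
}

report = {name: summarize(runs) for name, runs in placeholder_scores.items()}

for name, stats in report.items():
    print(f"{name}: {stats['mean']:.2f} +/- {stats['std']:.2f} (n={int(stats['n'])})")
```

With only three seeds the standard deviation is itself a noisy estimate, so reporting the per-seed scores alongside the summary is a reasonable complement.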

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

Full rationale

The paper's central argument identifies an ill-posed task formulation in existing Driving VLAs and addresses it by introducing two new components: a next-visual-state prediction objective for the LLM and a separate visual-only Inverse Kinematics Network implemented as a conditional diffusion model. These are explicit architectural additions and training objectives rather than parameters fitted to a data subset and then relabeled as predictions. Performance gains are reported via direct empirical comparison on external benchmarks (NAVSIM-v2 and nuScenes), and the claim of recovered visual grounding rests on the design of the new components plus post-hoc analysis, none of which reduce by construction to quantities defined inside the same loop or to self-citations. The derivation therefore remains self-contained with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the applicability of inverse kinematics boundary conditions to visual driving scenes and on the ability of the added prediction objective to provide effective supervision without introducing new fitting artifacts.

axioms (1)
  • [domain assumption] Trajectory recovery in driving requires both current and future visual states as boundary conditions for the inverse kinematics formulation.
    Invoked in the abstract to diagnose why existing VLAs ignore visual tokens.
invented entities (1)
  • Inverse Kinematics Network (cross-attention conditional diffusion model) [no independent evidence]
    Purpose: decode the trajectory from current and future visual states only, suppressing ego-status and text shortcuts.
    New component introduced to enforce visual grounding.

pith-pipeline@v0.9.0 · 5758 in / 1388 out tokens · 25581 ms · 2026-05-21T05:29:56.490489+00:00 · methodology


