Grounding Driving VLA via Inverse Kinematics
Pith reviewed 2026-05-21 05:29 UTC · model grok-4.3
The pith
A 0.5B driving VLA matches 7B-8B models by treating trajectory prediction as inverse kinematics that requires future visual states.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Trajectory recovery in driving VLAs requires both a current and a future visual state as boundary conditions; supplying only the current state encourages reliance on ego status and text. By adding a next visual state prediction objective and a cross-attention-based conditional diffusion Inverse Kinematics Network that takes solely the current and future visual states, the model suppresses shortcut paths during decoding. This redesign allows a 0.5B-scale model to recover visual grounding and match the trajectory planning performance of 7B-8B VLAs on NAVSIM-v2 and nuScenes, with the largest improvements in dynamic driving scenarios.
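The boundary-condition argument can be illustrated with a toy kinematic model. The sketch below is not the paper's method: it assumes a unicycle moving forward at constant speed and constant yaw rate over one step (a hypothetical stand-in for vehicle motion, with all function names invented for illustration). Under those assumptions the control (v, omega) has a closed-form solution once both the start pose and the end pose are known, while the start pose alone leaves it undetermined.

```python
import math

def forward(pose, v, omega, dt):
    """Integrate a constant-(v, omega) unicycle exactly over dt seconds."""
    x, y, th = pose
    if abs(omega) < 1e-9:  # straight-line limit
        return (x + v * dt * math.cos(th), y + v * dt * math.sin(th), th)
    th2 = th + omega * dt
    r = v / omega  # signed turning radius
    return (x + r * (math.sin(th2) - math.sin(th)),
            y - r * (math.cos(th2) - math.cos(th)),
            th2)

def inverse(start, end, dt):
    """Recover (v, omega) from the two boundary poses.

    Assumes forward motion (v >= 0) and a heading change below pi in magnitude.
    """
    x0, y0, th0 = start
    x1, y1, th1 = end
    # Wrap the heading difference into (-pi, pi].
    dth = math.atan2(math.sin(th1 - th0), math.cos(th1 - th0))
    omega = dth / dt
    chord = math.hypot(x1 - x0, y1 - y0)
    if abs(dth) < 1e-9:
        return chord / dt, 0.0
    # For a circular arc: chord = 2|r| sin(|dth|/2) and arc length = |r| |dth|.
    arc = chord * abs(dth) / (2.0 * math.sin(abs(dth) / 2.0))
    return arc / dt, omega

start = (0.0, 0.0, 0.0)
# Two different controls leave the same start pose but reach different end poses,
# so the start pose alone cannot say which control was applied.
end_turn = forward(start, 5.0, 0.3, 2.0)
end_straight = forward(start, 5.0, 0.0, 2.0)
print(inverse(start, end_turn, 2.0))
print(inverse(start, end_straight, 2.0))
```

In the paper's framing the two boundary conditions are visual states rather than poses, so this closed form only motivates why a second boundary condition is needed; it says nothing about how the learned network solves the problem.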
What carries the argument
The Inverse Kinematics Network, a cross-attention-based conditional diffusion model that decodes trajectories from current and future visual states alone while ignoring ego status and text.
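A minimal sketch of the conditioning pattern described above, assuming single-head scaled dot-product cross-attention with no learned projections. The function names and the token layout are hypothetical; the reviewed text does not give the network's actual layers. The point of the sketch is the interface: trajectory tokens act as queries over the concatenated current and future visual tokens, and no ego-status or text argument exists.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Single-head scaled dot-product cross-attention (identity projections)."""
    d = len(context[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context]
        w = softmax(scores)
        # Each output is the attention-weighted average of the context tokens.
        outputs.append([sum(wj * k[i] for wj, k in zip(w, context))
                        for i in range(d)])
        weights.append(w)
    return outputs, weights

def ik_condition(noisy_traj_tokens, current_visual, future_visual):
    """Condition trajectory tokens on visual states only.

    Hypothetical interface: ego status and text are deliberately not parameters,
    so they cannot be used as shortcuts at this stage.
    """
    return cross_attend(noisy_traj_tokens, current_visual + future_visual)

out, w = ik_condition([[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]])
print(out, w)
```

In a conditional diffusion decoder, a block like this would sit inside the denoiser and be applied at every denoising step; the learned query, key, and value projections are omitted here for brevity.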
If this is right
- The model regains the ability to exploit visual features instead of bypassing them.
- Trajectory planning performance becomes comparable to much larger models without increasing parameter count.
- Gains concentrate in dynamic situations such as turning where visual grounding is most needed.
- The structural change reduces dependence on ego status and textual commands during decoding.
Where Pith is reading between the lines
- The same boundary-condition approach could be tested in other vision-language-action domains where future observations provide natural constraints.
- Correcting task formulation this way may reduce the need for ever-larger models in grounded planning tasks.
- Closed-loop tests on additional benchmarks or real vehicles would check whether the recovered visual grounding transfers beyond the reported datasets.
Load-bearing premise
That providing a predicted future visual state together with a visual-only IK network is sufficient to remove the model's reliance on ego-status and text shortcuts.
What would settle it
Measure whether trajectory accuracy collapses and shortcut usage rises when the future visual state input is replaced by random noise or mismatched frames while keeping ego status and text available.
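The test described above could be organized as in the following sketch. Everything here is an assumption made for illustration: the `planner(current, future, ego, text)` call signature, the dictionary layout of a sample, and the use of average displacement error (ADE) as the accuracy measure. The harness leaves ego status and text untouched and corrupts only the future visual state.

```python
import math
import random

def ade(pred, target):
    """Average displacement error between equal-length lists of (x, y) waypoints."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(target)

def corrupt_future(samples, mode, rng):
    """Return copies of the samples with only the 'future' field altered."""
    futures = [s["future"] for s in samples]
    if mode == "noise":
        futures = [[rng.gauss(0.0, 1.0) for _ in f] for f in futures]
    elif mode == "mismatch":
        # Rotate by one so each sample receives another sample's future state.
        futures = futures[1:] + futures[:1]
    elif mode != "clean":
        raise ValueError(f"unknown mode: {mode}")
    return [dict(s, future=f) for s, f in zip(samples, futures)]

def evaluate(planner, samples):
    """Mean ADE of a planner over samples; ego status and text stay available."""
    errors = [ade(planner(s["current"], s["future"], s["ego"], s["text"]),
                  s["target"]) for s in samples]
    return sum(errors) / len(errors)

def settle_test(planner, samples, seed=0):
    """Compare accuracy with clean, noisy, and mismatched future states."""
    rng = random.Random(seed)
    return {mode: evaluate(planner, corrupt_future(samples, mode, rng))
            for mode in ("clean", "noise", "mismatch")}
```

A planner that truly depends on the future visual state should show a much larger error under `noise` and `mismatch` than under `clean`; a planner that relies on ego status and text should be nearly unaffected.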
Original abstract
Existing Driving VLAs predict trajectories while largely ignoring their visual tokens, a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is designed to suppress reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and reaches trajectory planning performance comparable to 7B-8B VLAs more than an order of magnitude larger, on both the closed-loop NAVSIM-v2 and the nuScenes benchmarks. Extensive analysis further shows that this improvement stems from a recovered ability to exploit visual features, with the effect being most pronounced in dynamic driving situations such as turning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that Driving VLAs under-exploit visual tokens because the task is structurally ill-posed: trajectory decoding receives only the current visual state and can therefore shortcut through ego status and text commands. The authors reframe the problem as inverse kinematics by adding (1) a next-visual-state prediction objective that supplies dense visual supervision to the LLM and (2) a separate Inverse Kinematics Network (a cross-attention conditional diffusion model) that receives only the current and predicted future visual states and outputs the trajectory. They report that this change alone enables a 0.5B model to reach trajectory planning performance comparable to 7B-8B VLAs on the closed-loop NAVSIM-v2 and the nuScenes benchmarks, with the largest gains in dynamic maneuvers, and they attribute the improvement to recovered visual grounding.
Significance. If the central mechanism is verified, the work offers a lightweight structural remedy for weak visual grounding in driving VLAs that does not require scaling model size. The inverse-kinematics framing is conceptually clean, and the reported efficiency gain (0.5B vs. 7B-8B) would be practically relevant for deployment. The approach also supplies a concrete testbed for studying shortcut behavior in multimodal planners.
major comments (2)
- [Abstract / analysis section] The claim that the next-visual-state prediction 'suppresses shortcut paths' and that 'extensive analysis' demonstrates recovered visual exploitation is load-bearing for the central thesis. However, because the LLM still receives ego status and text when predicting the future scene, it remains possible that the prediction head satisfies its loss primarily via non-visual signals; the downstream IK network would then inherit any such bias. A control that isolates whether the visual-prediction head actually conditions on current visual tokens (e.g., an ablation that masks visual inputs during prediction and measures degradation) is required to substantiate the structural fix.
- [Experiments section] The manuscript states that the 0.5B model reaches performance 'comparable' to 7B-8B VLAs on both NAVSIM-v2 and nuScenes. To support this cross-scale claim, the results must be accompanied by full ablation tables, error analysis, and statistical comparisons (including variance across seeds or runs). Without these details the attribution of gains specifically to visual grounding versus other implementation choices cannot be verified.
minor comments (2)
- [Method section] The description of the Inverse Kinematics Network should include the precise conditioning variables, diffusion schedule, and training losses so that the visual-only constraint can be reproduced exactly.
- [Figure 1] Figure 1 (or equivalent architecture diagram) would benefit from explicit arrows or annotations distinguishing the information flow into the LLM versus the IK network.
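To make concrete what "diffusion schedule and training losses" would need to specify, here is a sketch of the standard DDPM setup from Ho et al.: a linear beta schedule, closed-form forward noising, and a noise-prediction MSE loss. This is an assumption used for illustration only; the reviewed text does not state which schedule, step count, or loss the Inverse Kinematics Network uses, and the defaults below are the common DDPM values, not the paper's.

```python
import math
import random

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (DDPM defaults; the paper's values are unknown)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), one value per step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Forward noising: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, s = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    return [a * x + s * e for x, e in zip(x0, eps)], eps

def noise_mse(eps, eps_pred):
    """Noise-prediction loss: mean squared error between true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)

abar = alpha_bars(linear_betas())
# A flattened toy trajectory (x0, y0, x1, y1) stands in for real waypoints.
x_t, eps = q_sample([0.0, 0.0, 1.0, 0.5], t=500, abar=abar, rng=random.Random(0))
```

A reproducible method description would pin down each of these choices, plus what the denoiser is conditioned on at every step.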
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where additional evidence would strengthen the central claims regarding visual grounding recovery. We address each major point below and commit to revisions that directly respond to the concerns.
- Referee: [Abstract / analysis section] The claim that the next-visual-state prediction 'suppresses shortcut paths' and that 'extensive analysis' demonstrates recovered visual exploitation is load-bearing for the central thesis. However, because the LLM still receives ego status and text when predicting the future scene, it remains possible that the prediction head satisfies its loss primarily via non-visual signals; the downstream IK network would then inherit any such bias. A control that isolates whether the visual-prediction head actually conditions on current visual tokens (e.g., an ablation that masks visual inputs during prediction and measures degradation) is required to substantiate the structural fix.
  Authors: We appreciate the referee's identification of this potential ambiguity. Our existing analysis shows that the visual prediction objective leads to higher attention weights on visual tokens and disproportionately larger gains in dynamic scenarios that require visual reasoning. Nevertheless, we agree that a direct isolation experiment is needed. In the revised manuscript we will add an ablation that masks current visual inputs to the prediction head (while retaining ego status and text) and report the resulting degradation in both next-state prediction accuracy and downstream closed-loop planning performance. This will substantiate that the head conditions on visual tokens rather than non-visual shortcuts.
  Revision: yes
- Referee: [Experiments section] The manuscript states that the 0.5B model reaches performance 'comparable' to 7B-8B VLAs on both NAVSIM-v2 and nuScenes. To support this cross-scale claim, the results must be accompanied by full ablation tables, error analysis, and statistical comparisons (including variance across seeds or runs). Without these details the attribution of gains specifically to visual grounding versus other implementation choices cannot be verified.
  Authors: We concur that expanded experimental reporting is required to support the cross-scale comparison and to isolate the source of the gains. The revised experiments section will include: (i) complete ablation tables showing the incremental contribution of the next-visual-state prediction objective and the separate Inverse Kinematics Network, (ii) error analysis stratified by scenario type (e.g., turns, lane changes, and straight driving), and (iii) statistical comparisons reporting mean performance with standard deviation across at least three independent runs with different random seeds. These additions will allow readers to verify attribution to recovered visual grounding.
  Revision: yes
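The seed-level reporting promised in item (iii) amounts to a mean and sample standard deviation per configuration. A minimal sketch using Python's standard library follows; the function name and the three-run minimum check are illustrative choices, and the scores are placeholders, not results from the paper.

```python
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) over independent runs."""
    if len(scores) < 3:
        raise ValueError("expected at least three runs with different seeds")
    return statistics.mean(scores), statistics.stdev(scores)

# Placeholder values for illustration only; NOT numbers reported by the paper.
placeholder_runs = [1.0, 2.0, 3.0]
mean, std = summarize(placeholder_runs)
print(f"{mean:.2f} +/- {std:.2f} over {len(placeholder_runs)} seeds")
```

`statistics.stdev` uses the n-1 denominator, which is the usual choice when the runs are treated as a sample of possible seeds.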
Circularity Check
No significant circularity detected in derivation chain
The paper's central argument identifies an ill-posed task formulation in existing Driving VLAs and addresses it by introducing two new components: a next-visual-state prediction objective for the LLM and a separate visual-only Inverse Kinematics Network implemented as a conditional diffusion model. These are explicit architectural additions and training objectives rather than parameters fitted to a data subset and then relabeled as predictions. Performance gains are reported via direct empirical comparison on external benchmarks (NAVSIM-v2 and nuScenes), and the claim of recovered visual grounding rests on the design of the new components plus post-hoc analysis, none of which reduce by construction to quantities defined inside the same loop or to self-citations. The derivation therefore remains self-contained with independent content.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Trajectory recovery in driving requires both current and future visual states as boundary conditions for the inverse kinematics formulation.
invented entities (1)
- Inverse Kinematics Network (cross-attention conditional diffusion model): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions"
- IndisputableMonolith/Foundation/BranchSelection.lean: branch_selection (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "the IK Network ... takes only the current and future visual states as input ... suppressing reliance on ego status and textual shortcuts"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.