Grounding Driving VLA via Inverse Kinematics

Hyunjung Shim; Junsung Park

arxiv: 2605.21061 · v1 · pith:RB3E4MGTnew · submitted 2026-05-20 · 💻 cs.CV · cs.AI · cs.RO

Pith reviewed 2026-05-21 05:29 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO

keywords driving VLA · inverse kinematics · visual grounding · trajectory planning · NAVSIM · nuScenes · vision language models · future state prediction

The pith

A 0.5B driving VLA matches 7B-8B models by treating trajectory prediction as inverse kinematics that requires future visual states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing driving vision-language-action models largely ignore visual tokens because the task of generating trajectories from only current visuals, ego status, and text commands is structurally ill-posed and invites shortcuts. It fixes this by adding a next visual state prediction objective that supplies dense visual supervision and by introducing a separate Inverse Kinematics Network that receives only the current and future visual states to decode the trajectory. With these changes alone, the 0.5B model recovers the ability to exploit visual features and reaches trajectory planning performance on the closed-loop NAVSIM-v2 benchmark and on nuScenes that is comparable to models more than ten times larger. The gains appear strongest in dynamic situations such as turning, indicating that proper visual boundary conditions matter for grounding.

Core claim

Trajectory recovery in driving VLAs requires both a current and a future visual state as boundary conditions; supplying only the current state encourages reliance on ego status and text. By adding a next visual state prediction objective and a cross-attention-based conditional diffusion Inverse Kinematics Network that takes solely the current and future visual states, the model suppresses shortcut paths during decoding. This redesign allows a 0.5B-scale model to recover visual grounding and match the trajectory planning performance of 7B-8B VLAs on NAVSIM-v2 and nuScenes, with the largest improvements in dynamic driving scenarios.

What carries the argument

The Inverse Kinematics Network, a cross-attention-based conditional diffusion model that decodes trajectories from current and future visual states alone while ignoring ego status and text.
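
The contrast between the two formulations can be written compactly. What follows is a sketch in illustrative notation, not the paper's own equations: $V_t$ is the current visual state, $\hat{V}_{t+\Delta}$ the predicted future visual state, $e_t$ the ego status, $c$ the text command, and $\tau$ the trajectory. Only the symbol $\hat{V}_{t+\Delta}$ appears in the source material; the remaining symbols and the factorization itself are assumptions made for exposition.

```latex
% Existing driving VLA: a single visual boundary condition,
% with a shortcut available through (e_t, c)
\tau \sim p_{\theta}\left(\tau \mid V_t,\; e_t,\; c\right)

% Inverse-kinematics style: first predict the second boundary condition,
% then decode the trajectory from the two visual states only
\hat{V}_{t+\Delta} = f_{\mathrm{LLM}}\left(V_t,\; e_t,\; c\right),
\qquad
\tau \sim p_{\mathrm{IK}}\left(\tau \mid V_t,\; \hat{V}_{t+\Delta}\right)
```

Written this way, the decoder $p_{\mathrm{IK}}$ has no direct argument for $e_t$ or $c$; any influence they have must pass through $\hat{V}_{t+\Delta}$.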

If this is right

The model regains the ability to exploit visual features instead of bypassing them.
Trajectory planning performance becomes comparable to much larger models without increasing parameter count.
Gains concentrate in dynamic situations such as turning where visual grounding is most needed.
The structural change reduces dependence on ego status and textual commands during decoding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same boundary-condition approach could be tested in other vision-language-action domains where future observations provide natural constraints.
Correcting task formulation this way may reduce the need for ever-larger models in grounded planning tasks.
Closed-loop tests on additional benchmarks or real vehicles would check whether the recovered visual grounding transfers beyond the reported datasets.

Load-bearing premise

That providing a predicted future visual state together with a visual-only IK network is sufficient to remove the model's reliance on ego-status and text shortcuts.

What would settle it

Measure whether trajectory accuracy collapses and shortcut usage rises when the future visual state input is replaced by random noise or mismatched frames while keeping ego status and text available.
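
A sketch of how such a test could be scripted, under toy assumptions. Here a "visual state" is a list of static landmark positions in the ego frame and a "trajectory" is a single ego displacement. The two planners, the sample layout, and every function name are hypothetical stand-ins, not the paper's model or evaluation code. A real harness would pass the actual ego status and text command through unchanged in all three conditions.

```python
import math
import random


def grounded_planner(current, future, ego_velocity):
    """Toy visual planner: displacement is the mean apparent shift of static landmarks."""
    n = len(current)
    return (
        sum(c[0] - f[0] for c, f in zip(current, future)) / n,
        sum(c[1] - f[1] for c, f in zip(current, future)) / n,
    )


def shortcut_planner(current, future, ego_velocity):
    """Toy shortcut planner: extrapolates ego velocity for one step and ignores visuals."""
    return ego_velocity


def mean_error(planner, samples, futures):
    """Mean endpoint error when each sample is paired with the given future state."""
    errors = [
        math.dist(planner(s["current"], f, s["ego_velocity"]), s["displacement"])
        for s, f in zip(samples, futures)
    ]
    return sum(errors) / len(errors)


def future_state_ablation(planner, samples, seed=0):
    """Compare intact, mismatched, and pure-noise future states for one planner."""
    rng = random.Random(seed)
    intact = [s["future"] for s in samples]
    mismatched = intact[1:] + intact[:1]  # each sample gets another sample's future
    noise = [[(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in f] for f in intact]
    return {
        "intact": mean_error(planner, samples, intact),
        "mismatched": mean_error(planner, samples, mismatched),
        "noise": mean_error(planner, samples, noise),
    }


def make_sample(rng, displacement):
    """Static landmarks in the ego frame before and after a pure translation."""
    current = [(rng.uniform(-10, 10), rng.uniform(0, 30)) for _ in range(8)]
    future = [(x - displacement[0], y - displacement[1]) for x, y in current]
    # Toy assumption: the per-step ego velocity equals the true displacement.
    return {"current": current, "future": future,
            "ego_velocity": displacement, "displacement": displacement}


rng = random.Random(42)
samples = [make_sample(rng, (0.5 * i, 2.0 + i)) for i in range(5)]
grounded = future_state_ablation(grounded_planner, samples)
shortcut = future_state_ablation(shortcut_planner, samples)
print("grounded:", grounded)
print("shortcut:", shortcut)
```

In this toy, the grounded planner's error is near zero with intact futures and rises once they are swapped or replaced by noise, while the shortcut planner's error does not move. An error that stays flat under corrupted futures would be the signature that the future visual state is being ignored.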

Figures

Figures reproduced from arXiv: 2605.21061 by Hyunjung Shim, Junsung Park.

**Figure 2.** Existing Driving VLA (a) vs. our model (b). Existing models decode the trajectory directly …

**Figure 3.** Overview of our model. The vision encoder extracts visual tokens …

**Figure 4.** Qualitative comparison of GradCAM saliency on nuScenes …

**Figure 5.** Qualitative examples of the position–size stitching variants on nuScenes …

Original abstract

Existing Driving VLAs predict trajectories while largely ignoring their visual tokens, a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is designed to suppress reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and reaches trajectory planning performance comparable to 7B-8B VLAs more than an order of magnitude larger, on both the closed-loop NAVSIM-v2 and the nuScenes benchmarks. Extensive analysis further shows that this improvement stems from a recovered ability to exploit visual features, with the effect being most pronounced in dynamic driving situations such as turning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The inverse kinematics framing plus future visual prediction and a visual-only diffusion decoder is a clean structural change that lets a 0.5B model reach 7-8B performance on NAVSIM-v2 and nuScenes, but the shortcut-elimination story still needs tighter controls.

The main thing to know is that the authors treat trajectory output as an inverse kinematics problem that needs both a current and a future visual state. They add a next-visual-state prediction loss on the LLM and then decode trajectories with a separate cross-attention conditional diffusion network that receives only those two visual states. This is the concrete change that lets their 0.5B model close the gap to much larger VLAs on NAVSIM-v2 and nuScenes, with the biggest lift in turning maneuvers where visuals matter most. The idea is straightforward and the reported numbers are the part worth checking first.

What the paper does well is name a structural reason why current VLAs can ignore their visual tokens and then supply a fix that is easy to implement on top of existing architectures. The separation of the IK network is a clear way to block direct ego and text shortcuts at decode time, and the emphasis on dynamic scenes shows they looked at where the gains actually appear.

The soft spots are around the mechanism. The LLM still sees ego status and text when it predicts the future visual state, so it could meet the prediction loss without using current visual tokens at all. The IK network would then inherit that bias. The abstract mentions extensive analysis that recovers visual exploitation, but without seeing the specific ablations that isolate the visual-prediction head from text and ego routes, the causal story remains plausible rather than locked down. Adding the extra network also brings some complexity, even if the scale reduction is the headline benefit.

This paper is for groups working on efficient driving VLAs who want a practical lever for visual grounding without just scaling parameters. A reader who cares about real-time deployment and benchmark numbers will find the results useful to test. I would send it to peer review because the structural proposal is well-defined and the performance claims are falsifiable on public benchmarks.

Referee Report

2 major / 2 minor

Summary. The paper argues that Driving VLAs under-exploit visual tokens because the task is structurally ill-posed: trajectory decoding receives only the current visual state and can therefore shortcut through ego status and text commands. The authors reframe the problem as inverse kinematics by adding (1) a next-visual-state prediction objective that supplies dense visual supervision to the LLM and (2) a separate Inverse Kinematics Network (cross-attention conditional diffusion model) that receives only the current and predicted future visual states and outputs the trajectory. They report that this change alone enables a 0.5 B model to reach performance on the closed-loop NAVSIM-v2 benchmark and on nuScenes comparable to 7–8 B VLAs, with the largest gains in dynamic maneuvers, and attribute the improvement to recovered visual grounding.

Significance. If the central mechanism is verified, the work offers a lightweight structural remedy for visual grounding in driving VLAs that does not require scaling model size. The inverse-kinematics framing is conceptually clean and the reported efficiency gain (0.5 B vs. 7–8 B) would be practically relevant for deployment. The approach also supplies a concrete testbed for studying shortcut behavior in multimodal planners.

major comments (2)

[Abstract / analysis section] The claim that the next-visual-state prediction 'suppresses shortcut paths' and that 'extensive analysis' demonstrates recovered visual exploitation is load-bearing for the central thesis. However, because the LLM still receives ego status and text when predicting the future scene, it remains possible that the prediction head satisfies its loss primarily via non-visual signals; the downstream IK network would then inherit any such bias. A control that isolates whether the visual-prediction head actually conditions on current visual tokens (e.g., an ablation that masks visual inputs during prediction and measures degradation) is required to substantiate the structural fix.
[Experiments section] The manuscript states that the 0.5 B model reaches performance 'comparable' to 7–8 B VLAs on both NAVSIM-v2 and nuScenes. To support this cross-scale claim, the results must be accompanied by full ablation tables, error analysis, and statistical comparisons (including variance across seeds or runs). Without these details the attribution of gains specifically to visual grounding versus other implementation choices cannot be verified.

minor comments (2)

[Method section] The description of the Inverse Kinematics Network should include the precise conditioning variables, diffusion schedule, and training losses so that the visual-only constraint can be reproduced exactly.
[Figure 1] Figure 1 (or equivalent architecture diagram) would benefit from explicit arrows or annotations distinguishing the information flow into the LLM versus the IK network.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where additional evidence would strengthen the central claims regarding visual grounding recovery. We address each major point below and commit to revisions that directly respond to the concerns.

Referee: [Abstract / analysis section] The claim that the next-visual-state prediction 'suppresses shortcut paths' and that 'extensive analysis' demonstrates recovered visual exploitation is load-bearing for the central thesis. However, because the LLM still receives ego status and text when predicting the future scene, it remains possible that the prediction head satisfies its loss primarily via non-visual signals; the downstream IK network would then inherit any such bias. A control that isolates whether the visual-prediction head actually conditions on current visual tokens (e.g., an ablation that masks visual inputs during prediction and measures degradation) is required to substantiate the structural fix.

Authors: We appreciate the referee's identification of this potential ambiguity. Our existing analysis shows that the visual prediction objective leads to higher attention weights on visual tokens and disproportionately larger gains in dynamic scenarios that require visual reasoning. Nevertheless, we agree that a direct isolation experiment is needed. In the revised manuscript we will add an ablation that masks current visual inputs to the prediction head (while retaining ego status and text) and report the resulting degradation in both next-state prediction accuracy and downstream closed-loop planning performance. This will substantiate that the head conditions on visual tokens rather than non-visual shortcuts.

revision: yes

Referee: [Experiments section] The manuscript states that the 0.5 B model reaches performance 'comparable' to 7–8 B VLAs on both NAVSIM-v2 and nuScenes. To support this cross-scale claim, the results must be accompanied by full ablation tables, error analysis, and statistical comparisons (including variance across seeds or runs). Without these details the attribution of gains specifically to visual grounding versus other implementation choices cannot be verified.

Authors: We concur that expanded experimental reporting is required to support the cross-scale comparison and to isolate the source of the gains. The revised experiments section will include: (i) complete ablation tables showing the incremental contribution of the next-visual-state prediction objective and the separate Inverse Kinematics Network, (ii) error analysis stratified by scenario type (e.g., turns, lane changes, and straight driving), and (iii) statistical comparisons reporting mean performance with standard deviation across at least three independent runs with different random seeds. These additions will allow readers to verify attribution to recovered visual grounding.

revision: yes
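
The per-seed reporting promised in (iii) amounts to a mean and a sample standard deviation per benchmark. A minimal sketch using only Python's standard `statistics` module follows. The function name `summarize_runs` and the scores are placeholders chosen for illustration; they are not values from the paper.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for per-seed benchmark scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


# Dummy numbers for illustration only; these are NOT results from the paper.
placeholder_scores = {"seed_0": 1.0, "seed_1": 2.0, "seed_2": 3.0}
mean_score, std_score = summarize_runs(list(placeholder_scores.values()))
print(f"{mean_score:.1f} ± {std_score:.1f} over {len(placeholder_scores)} seeds")
```

`statistics.stdev` uses the n-1 denominator, which is the usual choice when a handful of seeds is treated as a sample of possible training runs.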

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper's central argument identifies an ill-posed task formulation in existing Driving VLAs and addresses it by introducing two new components: a next-visual-state prediction objective for the LLM and a separate visual-only Inverse Kinematics Network implemented as a conditional diffusion model. These are explicit architectural additions and training objectives rather than parameters fitted to a data subset and then relabeled as predictions. Performance gains are reported via direct empirical comparison on external benchmarks (NAVSIM-v2 and nuScenes), and the claim of recovered visual grounding rests on the design of the new components plus post-hoc analysis, none of which reduce by construction to quantities defined inside the same loop or to self-citations. The derivation therefore remains self-contained with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the applicability of inverse kinematics boundary conditions to visual driving scenes and on the ability of the added prediction objective to provide effective supervision without introducing new fitting artifacts.

axioms (1)

domain assumption: Trajectory recovery in driving requires both current and future visual states as boundary conditions for the inverse kinematics formulation.
Invoked in the abstract to diagnose why existing VLAs ignore visual tokens.

invented entities (1)

Inverse Kinematics Network (cross-attention conditional diffusion model) · no independent evidence
purpose: Decode trajectory from current and future visual states only, suppressing ego-status and text shortcuts.
New component introduced to enforce visual grounding.
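
How visual-only conditioning might look in code: a minimal sketch, assuming noisy trajectory tokens act as attention queries and the concatenated current and future visual tokens act as keys and values. This page does not give the paper's actual query design, projections, or diffusion schedule, so `cross_attention`, `ik_condition`, and the tiny 2-D tokens below are hypothetical, and learned projection matrices are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention without learned projections."""
    d = len(context[0])
    out = []
    for q in queries:
        # Similarity of this query to every context token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        # Output is the attention-weighted average of the context tokens.
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return out


def ik_condition(noisy_trajectory_tokens, current_visual, future_visual):
    """Condition trajectory tokens on the two visual states only (hypothetical interface)."""
    return cross_attention(noisy_trajectory_tokens, current_visual + future_visual)


# One trajectory token attending over one current and one future visual token.
out = ik_condition([[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]])
print(out)
```

The point of the sketch is the signature: `ik_condition` has no parameter through which ego status or text could enter directly, which is the structural constraint this ledger entry describes.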

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: "the IK Network ... takes only the current and future visual states as input ... suppressing reliance on ego status and textual shortcuts"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 16 internal anchors

[1]

Caesar, V

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

work page 2020
[2]

Chi, H.-a

H. Chi, H.-a. Gao, Z. Liu, J. Liu, C. Liu, J. Li, K. Yang, Y . Yu, Z. Wang, W. Li, et al. Impromptu vla: Open weights and open data for driving vision-language-action models.arXiv preprint arXiv:2505.23757, 2025

work page arXiv 2025
[3]

Chitta, A

K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

work page 2022
[4]

C. Dang, S. Ang, Y . Li, H. Tian, J. Wang, G. Li, H. Ye, J. Ma, L. Chen, and Y . Wang. Drivefine: Refining-augmented masked diffusion vla for precise and robust driving.arXiv preprint arXiv:2602.14577, 2026

work page arXiv 2026
[5]

Dauner, M

D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, A. Geiger, and K. Chitta. NA VSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[6]

K. Ding, B. Chen, Y . Su, H.-a. Gao, B. Jin, C. Sima, W. Zhang, X. Li, P. Barsch, H. Li, et al. Hint-ad: Holistically aligned interpretability in end-to-end autonomous driving.arXiv preprint arXiv:2409.06702, 2024

work page arXiv 2024
[7]

R. Feng, N. Xi, D. Chu, R. Wang, Z. Deng, A. Wang, L. Lu, J. Wang, and Y . Huang. Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving. IEEE Robotics and Automation Letters, 11(1):226–233, 2025

work page 2025
[8]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

W. Han, D. Guo, C.-Z. Xu, and J. Shen. Dme-driver: Integrating human decision logic and 3d scene perception in autonomous driving. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3347–3355, 2025

work page 2025
[10]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020
[11]

A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

P. Hu, A. Huang, J. Dolan, D. Held, and D. Ramanan. Safe local motion planning with self- supervised freespace forecasting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12732–12741, 2021

work page 2021
[13]

S. Hu, L. Chen, P. Wu, H. Li, J. Yan, and D. Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. InEuropean Conference on Computer Vision, pages 533–549. Springer, 2022

work page 2022
[14]

T. Hu, X. Liu, S. Wang, Y . Zhu, A. Liang, L. Kong, G. Zhao, Z. Gong, J. Cen, Z. Huang, et al. Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

work page arXiv 2025
[15]

Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, et al. Planning- oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

work page 2023
[16]

Huang, T

Z. Huang, T. Tang, S. Chen, S. Lin, Z. Jie, L. Ma, G. Wang, and X. Liang. Making large language models better planners with reasoning-decision alignment. InEuropean Conference on Computer Vision, pages 73–90. Springer, 2024. 10

work page 2024
[17]

EMMA: End-to-End Multimodal Model for Autonomous Driving

J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

X. Jia, Z. Yang, Q. Li, Z. Zhang, and J. Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024

work page 2024
[19]

Jiang, S

B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

work page 2023
[20]

Khurana, P

T. Khurana, P. Hu, A. Dave, J. Ziglar, D. Held, and D. Ramanan. Differentiable raycasting for self-supervised occupancy forecasting. InEuropean Conference on Computer Vision, pages 353–369. Springer, 2022

work page 2022
[21]

T. Li, H. Wang, X. Li, W. Liao, T. He, and P. Peng. Generative planning with 3d-vision language pre-training for end-to-end autonomous driving. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 4950–4958, 2025

work page 2025
[22]

Y . Li, L. Fan, J. He, Y . Wang, Y . Chen, Z. Zhang, and T. Tan. Enhancing end-to-end autonomous driving with latent world model.arXiv preprint arXiv:2406.08481, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Y . Li, S. Shang, W. Liu, B. Zhan, H. Wang, Y . Wang, Y . Chen, X. Wang, Y . An, C. Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Y . Li, Y . Wang, Y . Liu, J. He, L. Fan, and Z. Zhang. End-to-end driving with online trajectory evaluation via bev world model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27137–27146, 2025

work page 2025
[25]

Y . Li, K. Xiong, X. Guo, F. Li, S. Yan, G. Xu, L. Zhou, L. Chen, H. Sun, B. Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Z. Li, K. Li, S. Wang, S. Lan, Z. Yu, Y . Ji, Z. Li, Z. Zhu, J. Kautz, Z. Wu, et al. Hydra- mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Z. Li, Z. Yu, S. Lan, J. Li, J. Kautz, T. Lu, and J. M. Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

work page 2024
[28]

B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y . Zhang, Q. Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

work page 2025
[29]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations (ICLR), 2019

work page 2019
[30]

Y . Luo, F. Li, S. Xu, Z. Lai, L. Yang, Q. Chen, Z. Luo, Z. Xie, S. Jiang, J. Liu, et al. Ada- thinkdrive: Adaptive thinking via reinforcement learning for autonomous driving.arXiv preprint arXiv:2509.13769, 2025

work page arXiv 2025
[31]

J. Mao, Y . Qian, J. Ye, H. Zhao, and Y . Wang. Gpt-driver: Learning to drive with gpt.arXiv preprint arXiv:2310.01415, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. InInterna- tional conference on machine learning, pages 8162–8171. PMLR, 2021

work page 2021
[33]

T. Qian, J. Chen, L. Zhuo, Y . Jiao, and Y .-G. Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4542–4550, 2024. 11

work page 2024
[34]

GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

L. Russell, A. Hu, L. Bertoni, G. Fedoseev, J. Shotton, E. Arani, and G. Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving.arXiv preprint arXiv:2503.20523, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

W. Ryu, S. Yu, S. Moon, H. Choi, J. Park, J. Kim, and H. Shim. SUPER-AD: Semantic uncertainty-aware planning for end-to-end robust autonomous driving, 2025. URL https: //arxiv.org/abs/2511.22865

work page arXiv 2025
[36]

R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: visual explanations from deep networks via gradient-based localization.International journal of computer vision, 128(2):336–359, 2020

work page 2020
[37]

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . K. Li, Y . Wu, and D. Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024

work page 2024
[39]

DINOv3

O. Siméoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

R. Song, X. Guo, Y . Peng, Q. Wei, H. Wu, and L. Chen. Insightdrive: Insight scene representation for end-to-end autonomous driving.arXiv preprint arXiv:2503.13047, 2025

work page arXiv 2025
[41]

X. Tian, J. Gu, B. Li, Y . Liu, Y . Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

S. Wang, Z. Yu, X. Jiang, S. Lan, M. Shi, N. Chang, J. Kautz, Y . Li, and J. M. Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22442–22452, 2025

work page 2025
[43]

Y . Wang, J. He, L. Fan, H. Li, Y . Chen, and Z. Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14749–14759, 2024

work page 2024
[44]

Y . Wang, W. Luo, J. Bai, Y . Cao, T. Che, K. Chen, Y . Chen, J. Diamond, Y . Ding, W. Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

S. Xing, C. Qian, Y . Wang, H. Hua, K. Tian, Y . Zhou, and Z. Tu. Openemma: Open-source multimodal model for end-to-end autonomous driving. InProceedings of the Winter Conference on Applications of Computer Vision, pages 1001–1009, 2025

work page 2025
[46]

Z. Xing, X. Zhang, Y . Hu, B. Jiang, T. He, Q. Zhang, X. Long, and W. Yin. Goalflow: Goal- driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025

work page 2025
[47]

Z. Xu, Y . Zhang, E. Xie, Z. Zhao, Y . Guo, K.-Y . K. Wong, Z. Li, and H. Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model.IEEE Robotics and Automation Letters, 2024

work page 2024
[48]

Qwen2.5 Technical Report

A. Yang et al. Qwen2.5 technical report, 2024. URL https://arxiv.org/abs/2412.15115

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

S. Yang, J. Liu, R. Zhang, M. Pan, Z. Guo, X. Li, Z. Chen, P. Gao, H. Li, Y . Guo, et al. Lidar-llm: Exploring the potential of large language models for 3d lidar understanding. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 9247–9255, 2025. 12

work page 2025
[50]

W. Yao, Z. Li, S. Lan, Z. Wang, X. Sun, J. M. Alvarez, and Z. Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11910–11918, 2026

work page 2026
[51]

J.-T. Zhai, Z. Feng, J. Du, Y . Mao, J.-J. Liu, Z. Tan, Y . Zhang, X. Ye, and J. Wang. Rethink- ing the open-loop evaluation of end-to-end autonomous driving in nuscenes.arXiv preprint arXiv:2305.10430, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

X. Zhou, X. Han, F. Yang, Y . Ma, V . Tresp, and A. Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model.arXiv preprint arXiv:2503.23463, 2025

work page arXiv 2025
[53]

Z. Zhou, T. Cai, S. Z. Zhao, Y . Zhang, Z. Huang, B. Zhou, and J. Ma. Autovla: A vision-language- action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.arXiv preprint arXiv:2506.13757, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

re- gions important for driving

J. Zou, S. Chen, B. Liao, Z. Zheng, Y . Song, L. Zhang, Q. Zhang, W. Liu, and X. Wang. Diffusiondrivev2: Reinforcement learning-constrained truncated diffusion modeling in end-to- end autonomous driving.arXiv preprint arXiv:2512.07745, 2025. A Related Works A.1 End-to-end Planning End-to-end planning models take vehicle information such as the input image...

Predict the NEXT BEV state (one step ahead). To predict it, you MUST internally infer a feasible ego trajectory; DO NOT output the trajectory -- it is handled by a separate inverse-kinematics model.

- Velocity (vx,vy): (·,·)
- Heading Angular Velocity (v_yaw): (·)
- Acceleration (ax,ay): (·,·)
- Can Bus: (·,·)
- Heading Speed: (·)
- Steering: (·)

If a QA question is provided, output the QA answer.

Output format (STRICT). Output ONLY the blocks below.
Block 1 (next BEV state): <bev_start> <BEV_STATE> ... <BEV_STATE> <bev_end>
Block 2 (optional QA): <answer_start>SHORT_ANSWER <answer_end>

The number of <BEV_STATE> placeholders matches the spatial token count of $\hat{V}_{t+\Delta}$, and the LLM hidden states at th...
