pith. machine review for the scientific record.

arxiv: 2604.22851 · v1 · submitted 2026-04-22 · 💻 cs.CV · cs.CL · cs.RO

Recognition: unknown

EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 00:06 UTC · model grok-4.3

classification 💻 cs.CV cs.CL cs.RO
keywords ego-motion · physical reasoning · vision-language models · autonomous driving · foundation models · perception bottleneck · benchmark evaluation

The pith

Vision-centric foundation models fail to align physical ego-motion concepts with visual observations in autonomous driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EgoDyn-Bench to diagnose how well vision-language models understand the physics of their own movement from camera views. It shows that these models hold coherent ideas about motion but cannot reliably connect them to what they see, scoring below even simple non-learned geometric methods. The shortfall persists regardless of model size and of whether the model was trained on driving scenes. Adding explicit language descriptions of the vehicle's path greatly improves results, pointing to a split in which reasoning about movement lives in the text side of the model while vision contributes almost nothing.

Core claim

By mapping continuous vehicle kinematics to discrete motion concepts with a deterministic oracle, EgoDyn-Bench reveals a perception bottleneck: models exhibit logical physical concepts but fail to align them accurately with visual observations, underperforming classical baselines. Providing explicit trajectory encodings restores physical consistency, demonstrating that ego-motion logic derives almost exclusively from the language modality while visual observations contribute negligible signal.

What carries the argument

The EgoDyn-Bench diagnostic benchmark, which uses a deterministic oracle to decouple a model's internal physical logic from its visual perception by mapping kinematics to discrete concepts.
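
To make the oracle concrete, below is a minimal sketch of such a kinematics-to-concept mapping. The thresholds and label vocabulary are hypothetical stand-ins, not the paper's calibration; what matters is that the mapping is deterministic, so the ground-truth labels carry no model-dependent noise and any disagreement with a model's answer is attributable to the model.

```python
# A minimal sketch of a deterministic kinematics-to-concept oracle.
# All thresholds and the label vocabulary are hypothetical stand-ins;
# the paper's actual calibration is not reproduced here.
V_STOP = 0.5        # m/s: below this the ego vehicle counts as stationary
A_BRAKE = -1.0      # m/s^2: decelerations stronger than this count as braking
KAPPA_TURN = 0.02   # 1/m: path curvature beyond this counts as turning

def oracle_label(speed, accel, curvature):
    """Map one window of continuous kinematics to discrete motion concepts."""
    labels = set()
    if speed < V_STOP:
        labels.add("stationary")
        return labels
    labels.add("moving")
    if accel < A_BRAKE:
        labels.add("braking")
    if abs(curvature) > KAPPA_TURN:
        labels.add("turning_left" if curvature > 0 else "turning_right")
    else:
        labels.add("going_straight")
    return labels
```

Because the same kinematic state always yields the same labels, the benchmark's questions can probe whether a model recovers these concepts from video alone.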

If this is right

  • Models require improved coupling between visual perception and physical reasoning to achieve reliable embodied behavior.
  • Explicit trajectory information can serve as a practical bridge to enhance consistency in existing architectures.
  • The observed disentanglement suggests that language provides the primary pathway for physical logic in current designs.
  • This benchmark offers a standardized way to measure progress toward physically grounded vision-language models for driving.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If vision truly adds no signal, incorporating more direct visual-to-kinematics training pairs could force better integration.
  • Similar benchmarks might expose parallel issues in other physical reasoning domains like object interaction or navigation.
  • Architectures that embed physics simulators directly into the vision encoder could bypass the language-only pathway.
  • Real-world deployment in autonomous vehicles may need auxiliary sensors or explicit state estimation to compensate for this visual deficit.

Load-bearing premise

The deterministic oracle provides an accurate and unbiased mapping from continuous vehicle kinematics to discrete motion concepts, and the benchmark tasks successfully isolate the perception component without confounding factors from other model abilities.

What would settle it

Demonstrating a vision-only model that achieves higher accuracy on EgoDyn-Bench tasks than the same model provided with explicit trajectory encodings, or that surpasses classical geometric baselines without additional inputs.

Figures

Figures reproduced from arXiv: 2604.22851 by Dingrui Wang, Finn Rasmus Schäfer, Johannes Betz, Mattia Piccinini, Sebastian Schmidt, Stephan Günnemann, Thomas Stauner, Yuan Gao.

Figure 1
Figure 1: EgoDyn-Bench Overview. Continuous kinematic states S are mapped to semantic labels via a deterministic oracle to define a VideoQA task over visual observations O. Models are evaluated on their ability to infer motion dynamics through semantic, temporal, and physical consistency (WPCR) metrics. view at source ↗
Figure 2
Figure 2: Effect of Dataset Augmentation. (a) Spatial coverage of nuScenes (orange) vs. CARLA-derived scenarios (blue). CARLA expands the state-space to include complex maneuvers required for robust benchmarking. (b) Positive label fractions for representative questions. EgoDyn-Bench corrects the low-dynamic bias of nuScenes by injecting dynamically augmented synthetic sequences. view at source ↗
Figure 3
Figure 3: Global performance and ranking stability under threshold perturbation (α ∈ [0.5, 1.5]). While raw and balanced accuracy exhibit minor scaling effects, Kendall's τ demonstrates that the relative ranking of models remains highly stable (τ > 0.9) across almost all perturbation levels. This confirms that the observed perception bottleneck is robust to the specific kinematic calibration. view at source ↗
Figure 4
Figure 4: Stability of the deterministic oracle's physics-grounded consistency rules. The Weighted Physics Consistency Rate (WPCR) remains stable across the perturbation sweep, indicating that the Boolean implication logic is invariant to the specific scalar boundaries defining the maneuvers. view at source ↗
Figure 5
Figure 5: Clip Viewer Web Interface. The dashboard provides a holistic view of each benchmark sample, merging multi-modal video playback (top row), dynamic physical state tracking (middle row), and linguistic QA pairs (bottom row) into a single, synchronized timeline for human-in-the-loop verification. view at source ↗
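
Figures 1 and 4 reference the Weighted Physics Consistency Rate (WPCR) and its Boolean implication logic. The sketch below shows one way such a metric could be computed; the rules and weights are hypothetical examples, not the paper's rule set.

```python
# Hypothetical implication rules: if a model asserts the premise concept,
# physical consistency requires it to also assert the consequent.
RULES = [
    ("braking", "moving", 1.0),        # only a moving vehicle can brake
    ("turning_left", "moving", 1.0),
    ("sharp_turn", "turning", 0.5),    # a sharp turn is still a turn
]

def wpcr(answers):
    """Weighted fraction of triggered implications a model satisfies.

    answers: dict mapping concept -> bool (the model's yes/no responses).
    Only rules whose premise is asserted count toward the denominator,
    so a model that asserts nothing is vacuously consistent.
    """
    triggered = satisfied = 0.0
    for premise, consequent, weight in RULES:
        if answers.get(premise, False):
            triggered += weight
            if answers.get(consequent, False):
                satisfied += weight
    return satisfied / triggered if triggered else 1.0
```

Because the rules compare the model's answers only to each other, a metric of this shape is insensitive to where the oracle's scalar boundaries sit, which is consistent with the stability Figure 4 reports.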
read the original abstract

While Vision-Language Models (VLMs) have advanced high-level reasoning in autonomous driving, their ability to ground this reasoning in the underlying physics of ego-motion remains poorly understood. We introduce EgoDyn-Bench, a diagnostic benchmark for evaluating the semantic ego-motion understanding of vision-centric foundation models. By mapping continuous vehicle kinematics to discrete motion concepts via a deterministic oracle, we decouple a model's internal physical logic from its visual perception. Our large-scale empirical audit spanning 20+ models, including closed-source MLLMs, open-source VLMs across multiple scales, and specialized VLAs, identifies a significant Perception Bottleneck: while models exhibit logical physical concepts, they consistently fail to accurately align them with visual observations, frequently underperforming classical non-learned geometric baselines. This failure persists across model scales and domain-specific training, indicating a structural deficit in how current architectures couple visual perception with physical reasoning. We demonstrate that providing explicit trajectory encodings substantially restores physical consistency across all evaluated models, revealing a functional disentanglement between vision and language: ego-motion logic is derived almost exclusively from the language modality, while visual observations contribute negligible additional signal. This structural finding provides a standardized diagnostic framework and a practical pathway toward physically aligned embodied AI. Keywords: Ego-motion · Physical Reasoning · Foundation Models

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces EgoDyn-Bench, a diagnostic benchmark that maps continuous vehicle kinematics to discrete motion concepts using a deterministic oracle. Through a large-scale evaluation of over 20 vision-centric foundation models (including MLLMs, VLMs, and VLAs), it identifies a Perception Bottleneck: models possess logical physical concepts but fail to ground them in visual observations, often underperforming non-learned geometric baselines. The failure holds across scales and domain-specific training; providing explicit trajectory encodings restores consistency, which the authors interpret as evidence of functional disentanglement where ego-motion logic derives almost exclusively from the language modality while visual inputs add negligible signal.

Significance. If the central claims hold after addressing the oracle and evaluation details, the work supplies a standardized diagnostic for embodied physical reasoning in VLMs and a concrete intervention (trajectory encoding) that improves consistency. The scale of the audit and direct comparison to geometric baselines are strengths that could guide future architecture design for autonomous driving.

major comments (3)
  1. [§3.2] §3.2 (Oracle Definition): The Perception Bottleneck and disentanglement claims rest on the oracle supplying unbiased discrete labels that cleanly isolate visual perception from reasoning. No sensitivity analysis, human validation, or justification is provided for the velocity/curvature thresholds and temporal aggregation windows; if these boundaries do not align with cues recoverable from image sequences (perspective foreshortening, occlusion, lighting), the observed model failures and the restoration via text encodings could be artifacts of label mismatch rather than an architectural property.
  2. [§4.2] §4.2 and Table 3 (Baseline Comparisons): The claim that models underperform classical geometric baselines is load-bearing for the structural-deficit conclusion, yet the manuscript provides no statistical significance tests, confidence intervals, or variance across data splits for the reported accuracy gaps. Without these, it is unclear whether the differences are robust or driven by particular motion-concept categories.
  3. [§4.4] §4.4 (Disentanglement via Trajectory Encoding): The assertion that visual observations contribute negligible additional signal is supported only by the performance lift when explicit trajectory text is added. No control experiments (e.g., equivalent-length neutral text, shuffled trajectories, or vision-only ablations) are reported, leaving open the possibility that the gain stems from prompt engineering rather than modality disentanglement.
minor comments (3)
  1. [§1] The abstract and §1 use the term 'structural deficit' without a precise definition; a short paragraph clarifying what architectural property is hypothesized to produce the bottleneck would improve clarity.
  2. [Figure 2] Figure 2 (benchmark pipeline) would benefit from explicit annotation of the oracle's input/output interfaces and the exact motion-concept vocabulary.
  3. [§4.1] Model selection criteria and data-split details (train/test overlap with pre-training corpora) are mentioned only at high level in §4.1; expanding this subsection would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of robustness and experimental controls that we address below. We have prepared revisions to incorporate additional analyses and clarifications.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Oracle Definition): The Perception Bottleneck and disentanglement claims rest on the oracle supplying unbiased discrete labels that cleanly isolate visual perception from reasoning. No sensitivity analysis, human validation, or justification is provided for the velocity/curvature thresholds and temporal aggregation windows; if these boundaries do not align with cues recoverable from image sequences (perspective foreshortening, occlusion, lighting), the observed model failures and the restoration via text encodings could be artifacts of label mismatch rather than an architectural property.

    Authors: We selected the velocity and curvature thresholds based on established discretizations in autonomous driving literature to produce semantically distinct motion concepts (e.g., straight-line vs. gentle vs. sharp turns). The temporal windows follow the natural frame rates of the source datasets. Nevertheless, we acknowledge the absence of explicit sensitivity checks and human alignment validation. In the revised manuscript we will add (i) a sensitivity study varying thresholds by ±10% and ±20% with resulting accuracy tables, and (ii) a human validation study on 200 randomly sampled clips where annotators judge whether the oracle label matches the visible ego-motion. These results will be reported in §3.2 and the appendix (a sketch of such a threshold sweep follows this list). revision: yes

  2. Referee: [§4.2] §4.2 and Table 3 (Baseline Comparisons): The claim that models underperform classical geometric baselines is load-bearing for the structural-deficit conclusion, yet the manuscript provides no statistical significance tests, confidence intervals, or variance across data splits for the reported accuracy gaps. Without these, it is unclear whether the differences are robust or driven by particular motion-concept categories.

    Authors: The referee correctly notes that statistical support is required to substantiate the performance gaps. In the revision we will augment Table 3 with (i) bootstrap 95% confidence intervals computed over 1,000 resamples of the test set, (ii) paired t-test p-values comparing each model against the strongest geometric baseline within each motion category, and (iii) standard deviation across three random 80/20 splits of the benchmark. These additions will appear in §4.2 and the supplementary material (a sketch of the bootstrap interval follows this list). revision: yes

  3. Referee: [§4.4] §4.4 (Disentanglement via Trajectory Encoding): The assertion that visual observations contribute negligible additional signal is supported only by the performance lift when explicit trajectory text is added. No control experiments (e.g., equivalent-length neutral text, shuffled trajectories, or vision-only ablations) are reported, leaving open the possibility that the gain stems from prompt engineering rather than modality disentanglement.

    Authors: We agree that stronger controls are needed to isolate the contribution of trajectory semantics from generic prompt effects. In the revised version we will report two new control conditions: (a) neutral text prompts of matched token length that contain no motion information, and (b) shuffled trajectory encodings that preserve length and format but destroy temporal order. Results for both controls will be added to §4.4 and Table 4 (a sketch of both controls follows this list). The original vision-only results already serve as the vision-only ablation; we will explicitly label them as such for clarity. revision: partial
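
A minimal sketch of the threshold sensitivity sweep promised in response 1, assuming a hypothetical score_fn hook that re-scores a model after the oracle labels are rebuilt under rescaled thresholds; Kendall's τ near 1 means the model ranking is stable, mirroring the α-sweep in Figure 3.

```python
from scipy.stats import kendalltau

def ranking_stability(score_fn, models, base_thresholds,
                      alphas=(0.8, 0.9, 1.1, 1.2)):
    """Rank stability under oracle-threshold perturbation.

    score_fn(model, thresholds) -> accuracy with oracle labels rebuilt from
    `thresholds` -- a hypothetical harness hook, not the paper's API.
    Returns Kendall's tau between the baseline and each perturbed ranking.
    """
    base = [score_fn(m, base_thresholds) for m in models]
    taus = {}
    for a in alphas:  # +/-10% and +/-20% threshold scalings
        scaled = {name: a * v for name, v in base_thresholds.items()}
        perturbed = [score_fn(m, scaled) for m in models]
        taus[a] = kendalltau(base, perturbed)[0]  # ~1.0 => ranking unchanged
    return taus
```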
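
A minimal sketch of the percentile bootstrap promised for Table 3 in response 2, assuming per-sample correctness flags are available for each model.

```python
import numpy as np

def bootstrap_ci(correct, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for accuracy.

    correct: iterable of 0/1 per-sample correctness flags for one model.
    Resamples the test set with replacement n_boot times and returns
    (point_estimate, (lo, hi)).
    """
    correct = np.asarray(correct, dtype=float)
    rng = np.random.default_rng(seed)
    draws = np.array([rng.choice(correct, size=correct.size).mean()
                      for _ in range(n_boot)])
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(draws, [tail, 1.0 - tail])
    return correct.mean(), (lo, hi)
```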
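
A minimal sketch of the two control prompts from response 3; the token format is a hypothetical placeholder, since the paper's trajectory encoding is not reproduced here.

```python
import random

def make_controls(traj_tokens, seed=0):
    """Build the two control prompts described in the rebuttal.

    traj_tokens: per-timestep strings, e.g. ["x=0.0,y=0.0", "x=0.4,y=0.0"].
    Returns (shuffled, neutral): both match the original prompt in length
    and format, but the shuffled prompt destroys temporal order and the
    neutral prompt carries no motion information at all.
    """
    rng = random.Random(seed)
    shuffled = traj_tokens[:]
    rng.shuffle(shuffled)
    neutral = ["t=%d:n/a" % i for i in range(len(traj_tokens))]
    return " ".join(shuffled), " ".join(neutral)
```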

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with independent geometric baselines

full rationale

The paper introduces EgoDyn-Bench via a deterministic oracle that maps continuous kinematics to discrete motion concepts, then reports direct empirical comparisons of 20+ models against non-learned geometric baselines. No equations, fitted parameters, derivations, or self-citations are presented as load-bearing steps that reduce any claim to its own inputs by construction. The perception-bottleneck finding and modality-disentanglement observation rest on observable performance gaps rather than any self-referential logic or renamed ansatz.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central empirical claim rests on the assumption that the deterministic oracle faithfully converts kinematics to semantic concepts without bias and that the evaluation isolates the intended perception-reasoning gap.

axioms (1)
  • domain assumption The deterministic oracle provides an accurate and unbiased mapping from continuous vehicle kinematics to discrete motion concepts.
    Invoked to decouple internal physical logic from visual perception as stated in the abstract.

pith-pipeline@v0.9.0 · 5561 in / 1339 out tokens · 47625 ms · 2026-05-10T00:06:08.396622+00:00 · methodology

discussion (0)

