Recognition: 2 theorem links
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
Pith reviewed 2026-05-12 04:08 UTC · model grok-4.3
The pith
A driving world model predicts latent semantic features in parallel across future BEV frames to support long-horizon planning while adding adaptive text reasoning for rare scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents a driving world model that predicts latent semantic features for consecutive future frames in parallel in bird's-eye-view (BEV) space, enabling long-horizon modeling of future world states. It pairs this with an efficient, adaptive text reasoning mechanism that draws on additional social knowledge to improve driving performance in challenging long-tail scenarios, and reports state-of-the-art results on the closed-loop Bench2Drive benchmark.
What carries the argument
Parallel prediction of latent semantic features for consecutive future frames in BEV space, which generates a sequence of future world-state representations at once, augmented by an adaptive text reasoning module that incorporates social knowledge to refine decisions.
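The architectural contrast at the heart of this claim can be caricatured in a few lines. This is a minimal sketch, not the paper's implementation: the scalar "latents", the `step` dynamics, and the per-horizon `heads` are all invented for illustration. The point is only structural: an autoregressive rollout feeds its own predictions back in, while parallel heads all condition on the same current state.

```python
# Toy contrast between autoregressive and parallel latent prediction.
# All names, shapes, and dynamics here are illustrative, not from DeepSight.

def autoregressive_rollout(step, z0, horizon):
    """Predict future latents one at a time, feeding each output back in."""
    latents, z = [], z0
    for _ in range(horizon):
        z = step(z)          # each prediction conditions on the previous one
        latents.append(z)
    return latents

def parallel_rollout(heads, z0):
    """Predict every future latent directly from the current latent z0.

    Each head maps z0 -> z_t for its own offset t, so no predicted
    latent is ever re-fed as input.
    """
    return [head(z0) for head in heads]

# Minimal demo with scalar "latents" and linear maps.
step = lambda z: 0.9 * z
heads = [lambda z, t=t: (0.9 ** t) * z for t in range(1, 4)]
print(autoregressive_rollout(step, 1.0, 3))
print(parallel_rollout(heads, 1.0))
```

With a perfect one-step model the two rollouts agree; the schemes diverge only once the model is imperfect, which is what the next section's bullets are about.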
If this is right
- The system can model future world states over longer time horizons without sequential error accumulation.
- Driving decisions gain robustness in long-tail scenarios through the addition of social-knowledge text reasoning.
- End-to-end models can achieve state-of-the-art closed-loop results on the Bench2Drive benchmark.
- Vision and language components can be optimized separately yet still cooperate within a single driving policy.
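The no-error-accumulation point above can be made concrete with a toy linear world. Everything here is fabricated for illustration: the true decay `0.95` and the 1% model bias are arbitrary. The mechanism it shows is real, though: an autoregressive rollout compounds a per-step bias geometrically, while a direct per-horizon head incurs it once.

```python
# Why parallel prediction sidesteps compounding error: a toy linear world.
# The dynamics and the 1% model bias are invented for illustration.

TRUE_DECAY = 0.95
BIAS = 1.01  # the learned one-step model is 1% off

def true_state(z0, t):
    return (TRUE_DECAY ** t) * z0

def autoregressive_pred(z0, t):
    # The biased step is applied t times, so the 1% error compounds as BIAS**t.
    z = z0
    for _ in range(t):
        z = TRUE_DECAY * BIAS * z
    return z

def parallel_pred(z0, t):
    # A direct head for horizon t carries the 1% error only once.
    return (TRUE_DECAY ** t) * BIAS * z0

horizon = 20
ar_err = abs(autoregressive_pred(1.0, horizon) - true_state(1.0, horizon))
par_err = abs(parallel_pred(1.0, horizon) - true_state(1.0, horizon))
print(ar_err, par_err)
```

At horizon 20 the autoregressive error is over an order of magnitude larger in this toy, which is the qualitative behavior the parallel design is meant to avoid; whether the paper's latent heads actually achieve this is exactly what the ablation below would test.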
Where Pith is reading between the lines
- The separation of latent visual prediction from text-based reasoning may allow independent scaling of each module for other sequential decision tasks such as robotic manipulation.
- Testing the model on simulated long-tail event sets with controlled rarity would quantify whether the social-knowledge layer actually reduces collision rates in rare conditions.
- Extending the parallel prediction horizon further could reveal whether the latent representation remains stable or begins to lose fidelity beyond the lengths tested here.
- The approach implies that world-modeling and reasoning can be decoupled in autonomous systems, opening the possibility of swapping in stronger language models without retraining the visual backbone.
Load-bearing premise
That predicting latent semantic features in parallel within BEV space produces accurate long-horizon future states, and that layering adaptive text reasoning with social knowledge delivers clear gains precisely in long-tail driving situations.
What would settle it
Run an ablation on the closed-loop Bench2Drive benchmark that disables parallel latent prediction and measures both long-horizon prediction error and overall driving success rate. If the full model shows no meaningful improvement over the ablated version, the central mechanism is not carrying the claimed benefit.
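The proposed ablation readout could be sketched as a paired comparison over per-route outcomes with a bootstrap confidence interval. The outcome lists below are fabricated placeholders, not Bench2Drive data, and `bootstrap_diff_ci` is a hypothetical helper, not part of any benchmark tooling.

```python
import random

# Sketch of the proposed ablation readout: paired per-route outcomes for the
# full model vs. the variant with parallel latent prediction disabled.
# The outcome lists are fabricated placeholders, not Bench2Drive results.

def success_rate(outcomes):
    return sum(outcomes) / len(outcomes)

def bootstrap_diff_ci(full, ablated, n_boot=2000, seed=0):
    """95% bootstrap CI for the success-rate difference (full - ablated)."""
    rng = random.Random(seed)
    n = len(full)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample routes
        diffs.append(success_rate([full[i] for i in idx])
                     - success_rate([ablated[i] for i in idx]))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

# Placeholder outcomes: 1 = route completed, 0 = failed.
full_model = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
no_parallel = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
lo, hi = bootstrap_diff_ci(full_model, no_parallel)
# If the interval excludes zero, the parallel module carries measurable benefit.
print(f"diff = {success_rate(full_model) - success_rate(no_parallel):.2f}, "
      f"95% CI = [{lo:.2f}, {hi:.2f}]")
```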
Original abstract
End-to-end autonomous driving systems are increasingly integrating Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird's-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. We present a novel, efficient, and effective approach that achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark. Codes are available at: https://github.com/hotdogcheesewhite/DeepSight.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DeepSight, an end-to-end autonomous driving system that introduces a driving world model performing parallel prediction of latent semantic features for consecutive future frames in BEV space to enable long-horizon modeling of future world states. It further adds an efficient adaptive text reasoning mechanism incorporating social knowledge to improve robustness in long-tail scenarios. The central claim is that this yields state-of-the-art results on the closed-loop Bench2Drive benchmark, with code released at a public GitHub repository.
Significance. If the performance claims are substantiated, the combination of parallel latent BEV prediction with tailored text reasoning could offer a practical route to longer-horizon world modeling and better handling of rare events in autonomous driving. The explicit release of code is a clear strength that supports reproducibility and follow-on work.
major comments (1)
- [Abstract] The manuscript asserts SOTA results on Bench2Drive yet supplies no quantitative metrics, baseline comparisons, ablation studies, or error analysis to support the claim. This absence makes the central empirical assertion, which is load-bearing for the paper's contribution, impossible to evaluate from the provided text.
minor comments (1)
- [Abstract] The abstract states that the text reasoning mechanism 'utilizes additional social knowledge' but does not define what social knowledge is sourced from or how it is integrated at the architectural level.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the abstract. We address it point-by-point below and will incorporate the suggested changes in the revised manuscript.
Point-by-point responses
Referee: [Abstract] The manuscript asserts SOTA results on Bench2Drive yet supplies no quantitative metrics, baseline comparisons, ablation studies, or error analysis to support the claim. This absence makes the central empirical assertion, which is load-bearing for the paper's contribution, impossible to evaluate from the provided text.
Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript does contain detailed experimental results in the Experiments section, including closed-loop success rates on Bench2Drive, comparisons against baselines such as DriveGPT4 and other end-to-end methods, ablation studies on the latent prediction and text reasoning modules, and error analysis, but these specifics are not summarized in the abstract. In the revision we will update the abstract to state the main performance numbers explicitly (e.g., the absolute improvement over the prior SOTA on the closed-loop metric) and reference the supporting tables. This directly addresses the evaluability concern without altering the technical claims.
Revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The abstract and available description present architectural proposals (parallel latent BEV semantic feature prediction for long-horizon modeling, plus adaptive text reasoning with social knowledge) and an empirical SOTA claim on Bench2drive. No equations, parameter-fitting steps, self-definitional loops, or load-bearing self-citations are described. The central claims rest on empirical benchmark results rather than any derivation that reduces to its own inputs by construction. The derivation chain is therefore self-contained with no identifiable circular steps.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "parallel prediction of latent semantic features for consecutive future frames in the bird's-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "adaptive Chain-of-Thought (CoT) mechanism... social knowledge and reasoning capabilities"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.