Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
Pith reviewed 2026-05-15 15:19 UTC · model grok-4.3
The pith
Senna uses a large vision-language model for natural language driving plans that an end-to-end model converts into precise trajectories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Senna decouples high-level planning from low-level trajectory prediction: its vision-language component generates natural language planning decisions using multi-image encoding and multi-view prompts, while the end-to-end component predicts precise trajectories, supported by planning-oriented question-answering data and a three-stage training strategy. This yields state-of-the-art performance, including a 27.12% reduction in average planning error and a 33.33% reduction in collision rate over the no-pretraining baseline, after pre-training on DriveX and fine-tuning on nuScenes.
What carries the argument
The hybrid architecture that separates natural language planning decisions from trajectory prediction, with the LVLM handling scene understanding and reasoning.
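The decoupled design can be illustrated with a minimal sketch. Everything here is hypothetical: the meta-action vocabulary, function names, and decoding rules are illustrative stand-ins, not the paper's actual interface, which conditions a learned E2E model on the VLM's language output.

```python
# Hedged sketch of the Senna-style split: a VLM-like component emits a discrete
# high-level decision; a separate trajectory head turns it into numeric waypoints.
# All names and rules below are illustrative assumptions, not the paper's API.

def vlm_decide(scene_description: str) -> str:
    """Stand-in for Senna-VLM: map a scene description to a meta-action (toy rules)."""
    if "obstacle ahead" in scene_description:
        return "stop"
    if "slow vehicle ahead" in scene_description:
        return "change_left"
    return "keep_lane"

def e2e_trajectory(meta_action: str, speed: float, horizon: int = 4):
    """Stand-in for Senna-E2E: decode the meta-action into (x, y) waypoints,
    one per 1 s step; x is forward distance, y is lateral offset in meters."""
    lateral = {"keep_lane": 0.0, "change_left": 3.5,
               "change_right": -3.5, "stop": 0.0}[meta_action]
    points = []
    for t in range(1, horizon + 1):
        v = 0.0 if meta_action == "stop" else speed
        points.append((v * t, lateral * t / horizon))
    return points
```

The point of the split is visible even in the toy: the language side never emits numbers, so the numerical precision the abstract says LVLMs lack is delegated entirely to the trajectory side.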
If this is right
- Senna achieves state-of-the-art planning performance on standard benchmarks.
- Pre-training on large-scale data like DriveX followed by fine-tuning significantly improves results over no pre-training.
- The system demonstrates improved handling of complex and rare scenarios through commonsense from the language model.
- Cross-scenario generalization and transferability support progress toward fully autonomous driving.
Where Pith is reading between the lines
- This bridging strategy might apply to other domains needing both high-level reasoning and low-level precision, such as robotics manipulation.
- Further work could test whether the language-to-trajectory translation holds in real-time, safety-critical driving without simulation artifacts.
- Exploring richer prompt engineering or additional sensor inputs could enhance the multi-view scene understanding.
Load-bearing premise
Natural language planning outputs from the LVLM can be translated into low-level trajectories by the E2E model without introducing critical errors or losing necessary detail.
What would settle it
A high rate of collisions in rare scenarios where the generated language plan is accurate but the resulting trajectory deviates would show the translation step fails.
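The settling metric above is straightforward to operationalize. A hedged sketch, assuming each evaluation case can be labeled with whether the language plan was judged correct and whether the executed trajectory collided (the function name and data shape are illustrative):

```python
def translation_failure_rate(cases):
    """cases: iterable of (plan_correct: bool, collided: bool) pairs.
    Returns the collision rate among cases where the language plan was correct.
    A high value would indicate the language-to-trajectory translation step,
    not the planner, is the weak link."""
    correct_plan = [collided for plan_correct, collided in cases if plan_correct]
    if not correct_plan:
        return 0.0
    return sum(correct_plan) / len(correct_plan)
```

Conditioning on plan correctness is what isolates the translation step: unconditional collision rate conflates planning errors with conversion errors.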
Original abstract
End-to-end autonomous driving demonstrates strong planning capabilities with large-scale data but still struggles in complex, rare scenarios due to limited commonsense. In contrast, Large Vision-Language Models (LVLMs) excel in scene understanding and reasoning. The path forward lies in merging the strengths of both approaches. Previous methods using LVLMs to predict trajectories or control signals yield suboptimal results, as LVLMs are not well-suited for precise numerical predictions. This paper presents Senna, an autonomous driving system combining an LVLM (Senna-VLM) with an end-to-end model (Senna-E2E). Senna decouples high-level planning from low-level trajectory prediction. Senna-VLM generates planning decisions in natural language, while Senna-E2E predicts precise trajectories. Senna-VLM utilizes a multi-image encoding approach and multi-view prompts for efficient scene understanding. Besides, we introduce planning-oriented QAs alongside a three-stage training strategy, which enhances Senna-VLM's planning performance while preserving commonsense. Extensive experiments on two datasets show that Senna achieves state-of-the-art planning performance. Notably, with pre-training on a large-scale dataset DriveX and fine-tuning on nuScenes, Senna significantly reduces average planning error by 27.12% and collision rate by 33.33% over model without pre-training. We believe Senna's cross-scenario generalization and transferability are essential for achieving fully autonomous driving. Code and models will be released at https://github.com/hustvl/Senna.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Senna, which decouples high-level planning from low-level control in autonomous driving by pairing an LVLM (Senna-VLM) that outputs natural-language planning decisions with an E2E model (Senna-E2E) that predicts precise trajectories. Senna-VLM uses multi-image encoding, multi-view prompts, and planning-oriented QAs trained in three stages; experiments on DriveX pre-training followed by nuScenes fine-tuning report SOTA planning results, including a 27.12% drop in average planning error and 33.33% reduction in collision rate relative to the no-pretrain baseline.
Significance. If the empirical gains hold under rigorous controls, the work offers a practical route to inject commonsense reasoning into E2E driving pipelines, addressing their documented weakness in rare or complex scenes. The public release of code and models is a concrete strength that supports reproducibility and follow-on research.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): the reported 27.12% planning-error and 33.33% collision-rate reductions are presented without error bars, without an ablation isolating the language-to-trajectory translation step, and without a full table of competing baselines; these omissions make it impossible to attribute the gains specifically to the LVLM component rather than to scale or training schedule.
- [§3.2] §3.2 (Senna-VLM to Senna-E2E interface): the central architectural claim rests on the assumption that natural-language plans can be mapped to E2E trajectory inputs without critical loss of spatial or temporal detail; no quantitative metric (e.g., conversion accuracy on held-out ambiguous prompts or rare-scenario subsets) is provided to validate this interface, which directly undermines the attribution of performance improvements to the LVLM.
minor comments (2)
- [Abstract] The abstract states that LVLMs are ill-suited for numerical outputs yet does not quantify how the chosen conditioning mechanism (tokenization of planning QAs into E2E inputs) mitigates this limitation.
- [§4] Figure captions and §4 should explicitly list the exact metrics (e.g., L2 error, collision rate) and the precise definition of the “model without pre-training” baseline for each reported number.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects for strengthening the experimental validation and interface analysis. We address each point below and commit to revisions that improve clarity and rigor without altering the core contributions.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the reported 27.12% planning-error and 33.33% collision-rate reductions are presented without error bars, without an ablation isolating the language-to-trajectory translation step, and without a full table of competing baselines; these omissions make it impossible to attribute the gains specifically to the LVLM component rather than to scale or training schedule.
Authors: We agree that error bars and expanded baselines would improve attribution. In the revised version we will add standard deviations computed over three independent runs for the reported metrics on nuScenes. We will also expand the main results table to include all relevant competing methods with their full metrics. For isolating the language-to-trajectory step, we will insert a new controlled ablation that feeds ground-truth parsed plans versus Senna-VLM-generated plans into the identical Senna-E2E backbone; this directly quantifies the contribution of the LVLM component beyond scale or schedule. These additions will appear in the updated §4 and supplementary material. revision: yes
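The proposed ablation reduces to a simple comparison: run the identical E2E backbone twice, once conditioned on oracle (ground-truth parsed) plans and once on VLM-generated plans, and report the error gap. A minimal sketch, assuming an `e2e(scene, plan) -> waypoints` callable and ground-truth trajectories (all names are illustrative):

```python
import math

def avg_l2(pred, gt):
    """Average L2 error between predicted and ground-truth waypoints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def ablation_gap(e2e, scenes, oracle_plans, vlm_plans, gt_trajs):
    """Run the same E2E backbone with oracle vs. generated plans.
    The resulting error gap attributes performance to plan quality alone,
    holding scale and training schedule fixed."""
    n = len(scenes)
    err_oracle = sum(avg_l2(e2e(s, p), t)
                     for s, p, t in zip(scenes, oracle_plans, gt_trajs)) / n
    err_vlm = sum(avg_l2(e2e(s, p), t)
                  for s, p, t in zip(scenes, vlm_plans, gt_trajs)) / n
    return err_vlm - err_oracle
```

A gap near zero would mean the VLM's plans are as useful as oracle plans; a large gap would localize the remaining error in the language-planning component.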
-
Referee: [§3.2] §3.2 (Senna-VLM to Senna-E2E interface): the central architectural claim rests on the assumption that natural-language plans can be mapped to E2E trajectory inputs without critical loss of spatial or temporal detail; no quantitative metric (e.g., conversion accuracy on held-out ambiguous prompts or rare-scenario subsets) is provided to validate this interface, which directly undermines the attribution of performance improvements to the LVLM.
Authors: The interface uses a lightweight deterministic parser that extracts structured fields (target velocity, lane-change flag, stop/go decision, and approximate waypoint offsets) from the generated natural-language sentence; these fields are tokenized and concatenated to the E2E visual features. While the current manuscript provides qualitative examples, we acknowledge the absence of a quantitative fidelity metric. In revision we will add a parser-accuracy evaluation on a held-out set of 500 Senna-VLM outputs (including 100 rare-scenario cases), reporting exact-match accuracy for each extracted field against human annotations. This metric will be reported in an expanded §3.2 and will support the claim that critical spatial-temporal information is preserved. revision: yes
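A lightweight deterministic parser of the kind the rebuttal describes can be sketched with regular expressions. This is an assumption-laden illustration: the field names, sentence patterns, and units below are invented for the sketch and are not the paper's actual parser.

```python
import re

def parse_plan(text: str) -> dict:
    """Hypothetical parser: extract structured driving fields from a
    natural-language plan sentence. Field names and patterns are
    illustrative, not the paper's interface."""
    fields = {
        "stop": bool(re.search(r"\bstop\b", text, re.I)),
        "lane_change": None,        # "left" / "right" / None
        "target_speed_mps": None,   # float, meters per second
    }
    m = re.search(r"change\s+lanes?\s+(left|right)", text, re.I)
    if m:
        fields["lane_change"] = m.group(1).lower()
    m = re.search(r"(\d+(?:\.\d+)?)\s*m/s", text)
    if m:
        fields["target_speed_mps"] = float(m.group(1))
    return fields
```

Exact-match accuracy of such extracted fields against human annotations, measured on held-out and rare-scenario outputs, is precisely the fidelity metric the rebuttal commits to reporting.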
Circularity Check
Empirical benchmark results with no self-referential derivations
full rationale
The paper reports measured planning error and collision reductions (27.12% and 33.33%) from pre-training on DriveX followed by fine-tuning on nuScenes, compared against an internal no-pretrain baseline. These are direct empirical outcomes on external datasets rather than any mathematical derivation, fitted parameter, or prediction that reduces to a quantity defined inside the paper. No equations, uniqueness theorems, or ansatzes are invoked in a load-bearing way; the architecture (multi-image encoding, planning-oriented QAs, three-stage training) is described independently of the final metrics, which remain falsifiable on the cited benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LVLMs excel in scene understanding and reasoning, while end-to-end models struggle with commonsense in rare scenarios.
Forward citations
Cited by 22 Pith papers
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
-
VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching 88.91 driving score on Bench2Drive.
-
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.
-
MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving
MAPLE performs closed-loop multi-agent training of VLA driving models entirely in latent space using supervised fine-tuning followed by RL with safety, progress, and diversity rewards, reaching SOTA on Bench2Drive.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
-
GuardAD: Safeguarding Autonomous Driving MLLMs via Markovian Safety Logic
GuardAD reduces accident rates by 32% in autonomous driving MLLMs by using n-th order Markovian logic to infer latent hazards and revise actions.
-
DriveFuture: Future-Aware Latent World Models for Autonomous Driving
DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.
-
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.
-
ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving
ST-Prune is a training-free spatio-temporal token pruning framework for VLMs in autonomous driving that achieves near-lossless results at 90% token reduction by exploiting motion volatility, temporal recency, and mult...
-
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
-
FeaXDrive: Feasibility-aware Trajectory-Centric Diffusion Planning for End-to-End Autonomous Driving
FeaXDrive improves end-to-end autonomous driving by shifting diffusion planning to a trajectory-centric formulation with curvature-constrained training, drivable-area guidance, and GRPO post-training, yielding stronge...
-
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
-
CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
CausalVAD applies sparse causal intervention to remove spurious correlations from end-to-end autonomous driving models, reporting state-of-the-art planning accuracy and robustness on nuScenes.
-
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...
-
C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving
C-CoT applies VLMs to autonomous driving via five-stage reasoning with a meta-action tree for counterfactuals, yielding 81.9% risk recall, 3.52% collision rate, and 1.98 m L2 error on a new dataset.
-
Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation
Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
-
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
VLADriver-RAG achieves state-of-the-art performance on Bench2Drive by grounding VLA planning in structure-aware retrieved priors via spatiotemporal semantic graphs and Graph-DTW alignment.
-
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
-
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.
-
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
Reference graph
Works this paper leans on
-
[1]
Detr3d: 3d object detection from multi-view images via 3d-to-2d queries,
Y. Wang, V. C. Guizilini, T. Zhang, Y. Wang, H. Zhao, and J. Solomon, “Detr3d: 3d object detection from multi-view images via 3d-to-2d queries,” in CoRL, 2022
work page 2022
-
[2]
Planning-oriented autonomous driving,
Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang et al., “Planning-oriented autonomous driving,” in CVPR, 2023
work page 2023
-
[3]
Vad: Vectorized scene representation for efficient autonomous driving,
B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang, “Vad: Vectorized scene representation for efficient autonomous driving,” in ICCV, 2023
work page 2023
-
[4]
Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d,
J. Philion and S. Fidler, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d,” in ECCV, 2020
work page 2020
-
[5]
Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Q. Yu, and J. Dai, “Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers,” arXiv preprint arXiv:2203.17270, 2022
-
[6]
Maptr: Structured modeling and learning for online vectorized hd map construction,
B. Liao, S. Chen, X. Wang, T. Cheng, Q. Zhang, W. Liu, and C. Huang, “Maptr: Structured modeling and learning for online vectorized hd map construction,” arXiv preprint arXiv:2208.14437 , 2022
-
[7]
Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction,
Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, “Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction,” arXiv preprint arXiv:1910.05449, 2019
-
[8]
Vip3d: End-to-end visual trajectory prediction via 3d agent queries,
J. Gu, C. Hu, T. Zhang, X. Chen, Y. Wang, Y. Wang, and H. Zhao, “Vip3d: End-to-end visual trajectory prediction via 3d agent queries,” arXiv preprint arXiv:2208.01582, 2022
-
[9]
Perceive, interact, predict: Learning dynamic and static clues for end-to-end motion prediction,
B. Jiang, S. Chen, X. Wang, B. Liao, T. Cheng, J. Chen, H. Zhou, Q. Zhang, W. Liu, and C. Huang, “Perceive, interact, predict: Learning dynamic and static clues for end-to-end motion prediction,” arXiv preprint arXiv:2212.02181, 2022
-
[10]
End-to-end model-free reinforcement learning for urban driving using implicit affordances,
M. Toromanoff, E. Wirbel, and F. Moutarde, “End-to-end model-free reinforcement learning for urban driving using implicit affordances,” in CVPR, 2020
work page 2020
-
[11]
Multi-modal fusion transformer for end-to-end autonomous driving,
A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer for end-to-end autonomous driving,” in CVPR, 2021
work page 2021
-
[12]
St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning,
S. Hu, L. Chen, P. Wu, H. Li, J. Yan, and D. Tao, “St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning,” in ECCV, 2022
work page 2022
-
[13]
H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in NeurIPS, 2024
work page 2024
-
[14]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,” arXiv preprint arXiv:2308.12966 , 2023
work page 2023
-
[15]
CogVLM: Visual Expert for Pretrained Language Models
W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song et al., “Cogvlm: Visual expert for pretrained language models,” arXiv preprint arXiv:2311.03079, 2023
work page 2023
-
[16]
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,
Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu et al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” in CVPR, 2024
work page 2024
-
[17]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al. , “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774 , 2023
work page 2023
-
[18]
Hierarchical models of behavior and prefrontal function,
M. M. Botvinick, “Hierarchical models of behavior and prefrontal function,” Trends in cognitive sciences , 2008
work page 2008
-
[19]
The architecture of cognitive control in the human prefrontal cortex,
E. Koechlin, C. Ody, and F. Kouneiher, “The architecture of cognitive control in the human prefrontal cortex,” Science, 2003
work page 2003
-
[20]
Cognitive control, hierarchy, and the rostro–caudal organization of the frontal lobes,
D. Badre, “Cognitive control, hierarchy, and the rostro–caudal organization of the frontal lobes,” Trends in cognitive sciences, 2008
work page 2008
-
[21]
Drivegpt4: Interpretable end-to-end autonomous driving via large language model,
Z. Xu, Y. Zhang, E. Xie, Z. Zhao, Y. Guo, K. K. Wong, Z. Li, and H. Zhao, “Drivegpt4: Interpretable end-to-end autonomous driving via large language model,” arXiv preprint arXiv:2310.01412, 2023
-
[22]
Driving with llms: Fusing object-level vector modality for explainable autonomous driving,
L. Chen, O. Sinavski, J. Hünermann, A. Karnsund, A. J. Willmott, D. Birch, D. Maund, and J. Shotton, “Driving with llms: Fusing object-level vector modality for explainable autonomous driving,” arXiv preprint arXiv:2310.01957, 2023
-
[23]
W. Wang, J. Xie, C. Hu, H. Zou, J. Fan, W. Tong, Y. Wen, S. Wu, H. Deng, Z. Li et al., “Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving,” arXiv preprint arXiv:2312.09245, 2023
- [24]
-
[25]
Mathematical capabilities of chatgpt,
S. Frieder, L. Pinchetti, R.-R. Griffiths, T. Salvatori, T. Lukasiewicz, P. Petersen, and J. Berner, “Mathematical capabilities of chatgpt,” in NeurIPS, 2024
work page 2024
-
[26]
Measuring Mathematical Problem Solving With the MATH Dataset
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, “Measuring mathematical problem solving with the math dataset,” arXiv preprint arXiv:2103.03874 , 2021
work page 2021
-
[27]
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
X. Tian, J. Gu, B. Li, Y. Liu, C. Hu, Y. Wang, K. Zhan, P. Jia, X. Lang, and H. Zhao, “Drivevlm: The convergence of autonomous driving and large vision-language models,” arXiv preprint arXiv:2402.12289, 2024
work page 2024
-
[28]
Vila: On pre-training for visual language models,
J. Lin, H. Yin, W. Ping, P. Molchanov, M. Shoeybi, and S. Han, “Vila: On pre-training for visual language models,” in CVPR, 2024
work page 2024
-
[29]
Embodied understanding of driving scenarios,
Y. Zhou, L. Huang, Q. Bu, J. Zeng, T. Li, H. Qiu, H. Zhu, M. Guo, Y. Qiao, and H. Li, “Embodied understanding of driving scenarios,” arXiv preprint arXiv:2403.04593, 2024
-
[30]
Improved baselines with visual instruction tuning,
H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in CVPR, 2024
work page 2024
-
[31]
Drivelm: Driving with graph visual question answering,
C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” arXiv preprint arXiv:2312.14150 , 2023
-
[32]
Language prompt for autonomous driving,
D. Wu, W. Han, T. Wang, Y . Liu, X. Zhang, and J. Shen, “Language prompt for autonomous driving,” arXiv preprint arXiv:2309.04379, 2023
-
[33]
Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario,
T. Qian, J. Chen, L. Zhuo, Y . Jiao, and Y .-G. Jiang, “Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario,” in AAAI, 2024
work page 2024
-
[34]
nuscenes: A multimodal dataset for autonomous driving,
H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in CVPR, 2020
work page 2020
-
[35]
A survey of motion planning and control techniques for self-driving urban vehicles,
B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, “A survey of motion planning and control techniques for self-driving urban vehicles,” IEEE Transactions on intelligent vehicles, 2016
work page 2016
-
[36]
Stanley: The robot that won the darpa grand challenge,
S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann et al., “Stanley: The robot that won the darpa grand challenge,” Journal of field Robotics , 2006
work page 2006
-
[37]
Autonomous driving in urban environments: Boss and the urban challenge,
C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. Clark, J. Dolan, D. Duggins, T. Galatali, C. Geyer et al., “Autonomous driving in urban environments: Boss and the urban challenge,” Journal of field Robotics , 2008
work page 2008
-
[38]
Exploring the limitations of behavior cloning for autonomous driving,
F. Codevilla, E. Santana, A. M. López, and A. Gaidon, “Exploring the limitations of behavior cloning for autonomous driving,” in ICCV, 2019
work page 2019
-
[39]
Alvinn: An autonomous land vehicle in a neural network,
D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural network,” in NeurIPS, 1988
work page 1988
-
[40]
End-to-end urban driving by imitating a reinforcement learning coach,
Z. Zhang, A. Liniger, D. Dai, F. Yu, and L. Van Gool, “End-to-end urban driving by imitating a reinforcement learning coach,” in ICCV, 2021
work page 2021
-
[41]
X. Jia, Y. Gao, L. Chen, J. Yan, P. L. Liu, and H. Li, “Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving,” in ICCV, 2023
work page 2023
-
[42]
Maptrv2: An end-to-end framework for online vectorized hd map construction,
B. Liao, S. Chen, Y. Zhang, B. Jiang, Q. Zhang, W. Liu, C. Huang, and X. Wang, “Maptrv2: An end-to-end framework for online vectorized hd map construction,” arXiv preprint arXiv:2308.05736, 2023
-
[43]
VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
S. Chen, B. Jiang, H. Gao, B. Liao, Q. Xu, Q. Zhang, C. Huang, W. Liu, and X. Wang, “Vadv2: End-to-end vectorized autonomous driving via probabilistic planning,” arXiv preprint arXiv:2402.13243 , 2024
work page 2024
-
[44]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017
work page 2017
-
[45]
Language models are few-shot learners,
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” in NeurIPS, 2020
work page 2020
-
[46]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971 , 2023
work page 2023
-
[47]
R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical report,” arXiv preprint arXiv:2305.10403 , 2023
work page 2023
-
[48]
A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang et al. , “Qwen2 technical report,” arXiv preprint arXiv:2407.10671, 2024
work page 2024
-
[49]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592 , 2023
work page 2023
-
[50]
Instructblip: Towards general-purpose vision-language models with instruction tuning,
W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-language models with instruction tuning,” 2023
work page 2023
-
[51]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021
work page 2021
-
[52]
Eva: Exploring the limits of masked visual representation learning at scale,
Y. Fang, W. Wang, B. Xie, Q. Sun, L. Wu, X. Wang, T. Huang, X. Wang, and Y. Cao, “Eva: Exploring the limits of masked visual representation learning at scale,” in CVPR, 2023
work page 2023
-
[53]
J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML, 2022
work page 2022
-
[54]
J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in ICML, 2023
work page 2023
-
[55]
Scaling up visual and vision-language representation learning with noisy text supervision,
C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in ICML, 2021
work page 2021
-
[56]
Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,
J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” in NeurIPS, 2019
work page 2019
-
[57]
Lxmert: Learning cross-modality encoder representations from transformers,
H. Tan and M. Bansal, “Lxmert: Learning cross-modality encoder representations from transformers,” arXiv preprint arXiv:1908.07490 , 2019
-
[58]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge et al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191 , 2024
work page 2024
-
[59]
Languagempc: Large language models as decision makers for autonomous driving,
H. Sha, Y. Mu, Y. Jiang, L. Chen, C. Xu, P. Luo, S. E. Li, M. Tomizuka, W. Zhan, and M. Ding, “Languagempc: Large language models as decision makers for autonomous driving,” arXiv preprint arXiv:2310.03026, 2023
-
[60]
Carla: An open urban driving simulator,
A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V . Koltun, “Carla: An open urban driving simulator,” in CoRL, 2017
work page 2017
-
[61]
Referring multi-object tracking,
D. Wu, W. Han, T. Wang, X. Dong, X. Zhang, and J. Shen, “Referring multi-object tracking,” in CVPR, 2023
work page 2023
-
[62]
Talk2car: Taking control of your self-driving car,
T. Deruyttere, S. Vandenhende, D. Grujicic, L. Van Gool, and M.-F. Moens, “Talk2car: Taking control of your self-driving car,” arXiv preprint arXiv:1909.10838, 2019
-
[63]
Textual explanations for self-driving vehicles,
J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata, “Textual explanations for self-driving vehicles,” in Proceedings of the European conference on computer vision (ECCV) , 2018, pp. 563–578
work page 2018
-
[64]
Judging llm-as-a-judge with mt-bench and chatbot arena,
L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” in NeurIPS, 2023
work page 2023
-
[65]
Gaussian Error Linear Units (GELUs)
D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016
work page 2016
-
[66]
N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich, “Maximum margin planning,” in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 729–736
work page 2006
-
[67]
End-to-end interpretable neural motion planner,
W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun, “End-to-end interpretable neural motion planner,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2019, pp. 8660–8669
work page 2019
-
[68]
Safe local motion planning with self-supervised freespace forecasting,
P. Hu, A. Huang, J. Dolan, D. Held, and D. Ramanan, “Safe local motion planning with self-supervised freespace forecasting,” in CVPR, 2021
work page 2021
-
[69]
Differentiable raycasting for self-supervised occupancy forecasting,
T. Khurana, P. Hu, A. Dave, J. Ziglar, D. Held, and D. Ramanan, “Differentiable raycasting for self-supervised occupancy forecasting,” in ECCV, 2022
work page 2022
-
[70]
Bleu: a method for automatic evaluation of machine translation,
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in ACL, 2002
work page 2002
-
[71]
Cider: Consensus- based image description evaluation,
R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus- based image description evaluation,” in CVPR, 2015
work page 2015
-
[72]
Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,
S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization , 2005
work page 2005
-
[73]
Is ego status all you need for open-loop end-to-end autonomous driving?
Z. Li, Z. Yu, S. Lan, J. Li, J. Kautz, T. Lu, and J. M. Alvarez, “Is ego status all you need for open-loop end-to-end autonomous driving?” in CVPR, 2024
work page 2024