Recognition: 2 theorem links · Lean Theorem
EMMA: End-to-End Multimodal Model for Autonomous Driving
Pith reviewed 2026-05-15 05:03 UTC · model grok-4.3
The pith
EMMA turns raw camera images into driving trajectories, object detections, and road graphs by encoding all outputs as natural language text inside a multimodal LLM.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By representing trajectories, 3D locations, and road elements as natural language text, EMMA lets a multimodal LLM jointly process raw camera images and generate accurate outputs for motion planning, object detection, and road graph prediction, reaching state-of-the-art planning results on nuScenes and competitive results on Waymo datasets.
What carries the argument
The unified language space that encodes all non-sensor inputs and all spatial outputs (trajectories, 3D locations, road graphs) as natural language text, allowing one model and task-specific prompts to handle multiple driving tasks together.
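A minimal sketch of what encoding a spatial output as text could look like. The coordinate format, the two-decimal precision, and the function names `trajectory_to_text` and `text_to_trajectory` are assumptions made for illustration; the review does not state the paper's exact serialization.

```python
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (x, y) in meters, ego frame (assumed convention)


def trajectory_to_text(waypoints: List[Waypoint], decimals: int = 2) -> str:
    """Serialize waypoints as plain text, e.g. "(1.50, 0.02), (3.10, 0.05)"."""
    return ", ".join(f"({x:.{decimals}f}, {y:.{decimals}f})" for x, y in waypoints)


def text_to_trajectory(text: str) -> List[Waypoint]:
    """Parse text produced by trajectory_to_text back into float pairs."""
    if not text.strip():
        return []
    waypoints = []
    # Drop the outermost parentheses, then split between consecutive pairs.
    for chunk in text.strip().strip("()").split("), ("):
        x_str, y_str = chunk.split(",")
        waypoints.append((float(x_str), float(y_str)))
    return waypoints


encoded = trajectory_to_text([(1.5, 0.02), (3.1, 0.05)])
print(encoded)                      # (1.50, 0.02), (3.10, 0.05)
print(text_to_trajectory(encoded))  # [(1.5, 0.02), (3.1, 0.05)]
```

The number of decimals fixes the finest spatial resolution the text can carry, which is why the precision choice matters for the fidelity questions raised later in the review.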
If this is right
- Joint training on planner trajectories, object detection, and road graphs produces gains in all three tasks simultaneously.
- A single set of model weights can generate outputs for different driving subtasks simply by changing the prompt text.
- The same architecture scales across multiple public driving benchmarks without task-specific heads or loss functions.
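The prompt-switching idea in the list above can be sketched as follows. The template wording, the task names, and the `model` callable are hypothetical stand-ins; the paper's real prompts are not reproduced in this review.

```python
from typing import Callable, Dict

# Hypothetical prompt templates, one per driving task.
TASK_PROMPTS: Dict[str, str] = {
    "planning": "Given the camera images and the command '{command}', "
                "output the future ego waypoints.",
    "detection": "List the 3D positions of all objects visible in the camera images.",
    "road_graph": "Output the polylines of the drivable lanes in the camera images.",
}


def build_prompt(task: str, **fields: str) -> str:
    """Fill in the template for the requested task."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task}")
    return TASK_PROMPTS[task].format(**fields)


def run_task(model: Callable[[bytes, str], str], images: bytes,
             task: str, **fields: str) -> str:
    """Call one shared model; only the prompt text changes between tasks."""
    return model(images, build_prompt(task, **fields))


# Stand-in model that echoes its prompt, used only to show the call pattern.
echo_model = lambda images, prompt: prompt
print(run_task(echo_model, b"", "planning", command="turn left"))
print(run_task(echo_model, b"", "detection"))
```

Because every task goes through the same `model` callable, there is no task-specific head to select; the dispatch happens entirely in the text of the prompt.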
Where Pith is reading between the lines
- If text encoding works for these tasks, the same pattern could absorb additional inputs such as HD maps or V2X messages without new model components.
- Real-world deployment would still require verification that text decoding never drops critical spatial constraints that numeric planners enforce directly.
- The approach suggests future driving stacks could treat perception, prediction, and planning as prompt variations inside one model rather than separate modules.
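One way to picture the verification concern above is a guard that rejects decoded text before it reaches a controller. This is an illustrative sketch, not something the paper describes: the `(x, y)` text format, the function name `parse_and_check`, and both thresholds are assumptions.

```python
import math
import re
from typing import List, Optional, Tuple

# Matches "(x, y)" pairs with an optional sign and optional decimals.
_PAIR = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def parse_and_check(text: str, expected_points: int,
                    max_step_m: float) -> Optional[List[Tuple[float, float]]]:
    """Return the waypoints if the text passes basic checks, else None."""
    points = [(float(x), float(y)) for x, y in _PAIR.findall(text)]
    if len(points) != expected_points:
        return None  # the model dropped, added, or garbled waypoints
    previous = (0.0, 0.0)  # ego origin, assumed to be the start of the plan
    for point in points:
        if math.dist(previous, point) > max_step_m:
            return None  # implausible jump between consecutive waypoints
        previous = point
    return points


print(parse_and_check("(1.0, 0.0), (2.0, 0.1)", expected_points=2, max_step_m=1.5))
print(parse_and_check("(1.0, 0.0), (9.0, 0.0)", expected_points=2, max_step_m=1.5))
```

A check like this only catches structural failures (missing points, wild jumps); it says nothing about collisions or lane boundaries, which would need scene geometry.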
Load-bearing premise
Converting precise geometric quantities such as trajectories and 3D object positions into natural language text preserves every detail needed for safe control without loss of spatial accuracy.
What would settle it
Demonstration that EMMA produces colliding or off-road trajectories in dense urban scenarios where centimeter-level geometry matters, while a conventional geometric planner succeeds on the same inputs.
Original abstract
We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built upon a multi-modal large language model foundation like Gemini, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models, by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. We hope that our results will inspire research to further evolve the state of the art in autonomous driving model architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EMMA, an end-to-end multimodal model for autonomous driving built on a pre-trained LLM foundation (Gemini). It directly maps raw camera inputs to driving outputs including planner trajectories, 3D object detections, and road-graph elements by encoding all non-sensor inputs/outputs as natural-language text, enabling joint processing via task-specific prompts. The work reports state-of-the-art motion-planning results on nuScenes, competitive performance on WOMD and camera-primary 3D detection on WOD, and performance gains from co-training the three tasks.
Significance. If the results hold under rigorous validation, the work is significant for showing that a multimodal LLM backbone plus text-based unification can deliver competitive or superior driving performance while leveraging pre-trained world knowledge and enabling multi-task synergies. The co-training improvements across perception, planning, and mapping tasks provide evidence for the value of a generalist text interface in autonomous driving.
major comments (2)
- [§3.2] §3.2 (Output Representation): The central modeling choice of encoding continuous trajectories, 3D object centers, and road-graph polylines as tokenized natural-language strings is load-bearing for all reported metrics, yet the manuscript supplies no reconstruction-error quantification (e.g., mean L2 deviation between original coordinates and text-decoded outputs on the validation split). Without this measurement, it is impossible to determine whether the SOTA nuScenes planning numbers reflect faithful geometry preservation or metric tolerance of discretization artifacts.
- [§4.3] §4.3 (Ablation and Co-training Results): The claim that co-training yields improvements across all three domains rests on comparisons that lack an ablation isolating the text-discretization head versus a continuous regression head; the reported gains could therefore be confounded by the choice of output representation rather than task synergy.
minor comments (2)
- [Figure 3] Figure 3 and Table 1: axis labels and coordinate units for the visualized trajectories are not explicitly stated, making it difficult to verify geometric fidelity at a glance.
- [§4.1] The training-details paragraph in §4.1 omits the exact prompt templates and tokenization scheme for coordinate strings; providing these in an appendix would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for explicit validation of our output encoding and for sharpening the interpretation of our co-training ablations. We address each comment below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [§3.2] The central modeling choice of encoding continuous trajectories, 3D object centers, and road-graph polylines as tokenized natural-language strings is load-bearing for all reported metrics, yet the manuscript supplies no reconstruction-error quantification (e.g., mean L2 deviation between original coordinates and text-decoded outputs on the validation split). Without this measurement, it is impossible to determine whether the SOTA nuScenes planning numbers reflect faithful geometry preservation or metric tolerance of discretization artifacts.
Authors: We agree that an explicit quantification of discretization error is necessary to substantiate the geometric fidelity of the text-based outputs. In the revised manuscript we will add a dedicated paragraph to §3.2 that reports the mean L2 reconstruction error (in meters) for planner trajectories, 3D object centers, and road-graph polylines on the nuScenes validation split, computed by decoding the generated text tokens back to coordinates and comparing against the original ground-truth values. This addition will allow readers to verify that discretization artifacts remain negligible relative to the reported planning metrics.
Revision: yes
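The measurement discussed in this exchange can be sketched as a text round trip. This example isolates the error introduced by writing coordinates at a fixed decimal precision; it does not involve a model, and the space-separated format and the names `round_trip` and `mean_l2_reconstruction_error` are assumptions for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in meters


def round_trip(point: Point, decimals: int) -> Point:
    """Write a point as fixed-precision text, then parse it back."""
    text = f"{point[0]:.{decimals}f} {point[1]:.{decimals}f}"
    x_str, y_str = text.split()
    return float(x_str), float(y_str)


def mean_l2_reconstruction_error(points: List[Point], decimals: int) -> float:
    """Mean Euclidean distance (meters) between original and text-decoded points."""
    if not points:
        raise ValueError("need at least one point")
    errors = [math.dist(p, round_trip(p, decimals)) for p in points]
    return sum(errors) / len(errors)


ground_truth = [(1.234567, -0.987654), (12.5, 3.14159)]
for decimals in (0, 1, 2):
    error = mean_l2_reconstruction_error(ground_truth, decimals)
    print(f"{decimals} decimals: mean L2 error = {error:.4f} m")
```

With `d` decimals, each coordinate is off by at most half a unit in the last place, so the per-point L2 error in 2D is bounded by roughly `sqrt(2) * 0.5 * 10**(-d)` meters. Comparing that bound against the reported planning L2 metrics is the kind of check the referee asks for.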
-
Referee: [§4.3] The claim that co-training yields improvements across all three domains rests on comparisons that lack an ablation isolating the text-discretization head versus a continuous regression head; the reported gains could therefore be confounded by the choice of output representation rather than task synergy.
Authors: We respectfully note that the co-training ablations already control for output representation: every single-task and multi-task variant of EMMA uses the identical text-discretization head. Consequently, performance differences between these variants can be attributed to the benefits of joint optimization in the shared language space rather than to the representation itself. Because the text interface is a foundational design choice that enables the pre-trained LLM to process all tasks uniformly, a continuous-regression ablation would require an entirely different architecture outside the scope of the present work. We will revise §4.3 to explicitly articulate this controlled comparison and to clarify that the observed synergies arise from multi-task training within the unified text framework.
Revision: partial
Circularity Check
No circularity; empirical model with public-benchmark results
full rationale
The paper describes an end-to-end trained multimodal LLM (Gemini backbone plus task prompts) that outputs trajectories and detections as tokenized text. All reported numbers are standard benchmark metrics on nuScenes, WOMD, and WOD; no equations, fitted parameters, or self-referential predictions are present. The text-representation choice is an architectural decision whose information-loss consequences are not measured in the provided text, but that absence does not create a circular derivation. No self-citation chains or uniqueness theorems are invoked to justify the core claims.
Axiom & Free-Parameter Ledger
free parameters (1)
- task-specific prompt templates
Lean theorems connected to this paper
-
Cost.FunctionalEquation · washburn_uniqueness_aczel · contradicts?
CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.
representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text
-
Foundation.LedgerForcing · conservation_from_balance · unclear?
UNCLEAR: relation between the paper passage and the cited Recognition theorem.
This approach allows EMMA to jointly process various driving tasks in a unified language space
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 25 Pith papers
-
V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views
V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baselin...
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
-
Learning Vision-Language-Action World Models for Autonomous Driving
VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
-
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.
-
MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving
MAPLE performs closed-loop multi-agent training of VLA driving models entirely in latent space using supervised fine-tuning followed by RL with safety, progress, and diversity rewards, reaching SOTA on Bench2Drive.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
-
Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction
Smaller end-to-end autonomous driving models achieve optimal 3-second trajectory prediction accuracy at lower or intermediate temporal sampling frequencies, whereas larger VLA-style models perform best at the highest ...
-
Beyond Thinking: Imagining in 360° for Humanoid Visual Search
Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
-
EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting
EggHand unifies VLA action decoding with viewpoint-aware video-text encoding to forecast egocentric hand poses, achieving SOTA accuracy on EgoExo4D while remaining robust to ego-motion and controllable via language prompts.
-
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.
-
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
-
Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.
-
ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.
-
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
-
CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
CausalVAD applies sparse causal intervention to remove spurious correlations from end-to-end autonomous driving models, reporting state-of-the-art planning accuracy and robustness on nuScenes.
-
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...
-
Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling
CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and Success Rate 71.81 on Bench2Drive plus PDMS 91.1 on NAVSIM.
-
C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving
C-CoT applies VLMs to autonomous driving via five-stage reasoning with a meta-action tree for counterfactuals, yielding 81.9% risk recall, 3.52% collision rate, and 1.98 m L2 error on a new dataset.
-
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
VLADriver-RAG achieves state-of-the-art performance on Bench2Drive by grounding VLA planning in structure-aware retrieved priors via spatiotemporal semantic graphs and Graph-DTW alignment.
-
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
-
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2Drive benchmark.
-
Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.
-
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...
Reference graph
Works this paper leans on
-
[2]
Video-language critic: Transferable reward functions for language-conditioned robotics
Minttu Alakuijala, Reginald McLean, Isaac Woungang, Nariman Farsad, Samuel Kaski, Pekka Marttinen, and Kai Yuan. Video-language critic: Transferable reward functions for language-conditioned robotics. Transactions on Machine Learning Research, 2024
work page 2024
-
[3]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022
work page 2022
-
[5]
Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst
Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. RSS, 2019
work page 2019
-
[6]
Look, remember and reason: Grounded reasoning in videos with language models
Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, and Roland Memisevic. Look, remember and reason: Grounded reasoning in videos with language models. In ICRA, 2023
work page 2023
-
[8]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023
work page 2023
-
[9]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020
work page 2020
-
[10]
nuscenes: A multimodal dataset for autonomous driving
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020
work page 2020
-
[11]
End-to-end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020
work page 2020
-
[12]
Gri: General reinforced imitation and its application to vision-based autonomous driving
Raphael Chekroun, Marin Toromanoff, Sascha Hornauer, and Fabien Moutarde. Gri: General reinforced imitation and its application to vision-based autonomous driving. Robotics, 2023
work page 2023
- [13]
- [14]
-
[15]
Womd-lidar: Raw sensor dataset benchmark for motion forecasting
Kan Chen, Runzhou Ge, Hang Qiu, Rami Ai-Rfou, Charles R Qi, Xuanyu Zhou, Zoey Yang, Scott Ettinger, Pei Sun, Zhaoqi Leng, et al. Womd-lidar: Raw sensor dataset benchmark for motion forecasting. In ICRA, 2024 a
work page 2024
-
[16]
Driving with llms: Fusing object-level vector modality for explainable autonomous driving
Long Chen, Oleg Sinavski, Jan H \"u nermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. Driving with llms: Fusing object-level vector modality for explainable autonomous driving. In ICRA, 2024 b
work page 2024
-
[18]
Pix2seq: A language modeling framework for object detection
Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. In ICLR, 2022 a
work page 2022
-
[19]
A unified sequence interface for vision tasks
Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J Fleet, and Geoffrey E Hinton. A unified sequence interface for vision tasks. In NeurIPS, 2022 b
work page 2022
-
[20]
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carl...
work page 2023
-
[21]
Pali-x: On scaling up a multilingual vision and language model
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. In CVPR, 2024 d
work page 2024
-
[22]
Transfuser: Imitation with transformer-based sensor fusion for autonomous driving
Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. PAMI, 2022
work page 2022
-
[23]
Unifying vision-and-language tasks via text generation
Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In ICML, 2021
work page 2021
-
[24]
Palm: Scaling language modeling with pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. JMLR, 2023
work page 2023
-
[25]
End-to-end driving via conditional imitation learning
Felipe Codevilla, Matthias M \"u ller, Antonio L \'o pez, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In ICRA, 2018
work page 2018
-
[26]
Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking
Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. NeurIPS, 2024
work page 2024
-
[27]
Bert: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019
work page 2019
-
[28]
Pivotnet: Vectorized pivot learning for end-to-end hd map construction
Wenjie Ding, Limeng Qiao, Xi Qiu, and Chi Zhang. Pivotnet: Vectorized pivot learning for end-to-end hd map construction. In ICCV, 2023
work page 2023
-
[29]
Long-term recurrent convolutional networks for visual recognition and description
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015
work page 2015
-
[32]
Open-vocabulary object detection via vision and language knowledge distillation
Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022
work page 2022
-
[33]
Training compute-optimal large language models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In NeurIPS, 2022
work page 2022
-
[34]
Planning-oriented autonomous driving
Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In CVPR, 2023
work page 2023
-
[35]
Language is not all you need: Aligning perception with language models
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, et al. Language is not all you need: Aligning perception with language models. In NeurIPS, 2023
work page 2023
-
[36]
Let-3d-ap: Longitudinal error tolerant 3d average precision for camera-only 3d detection
Wei-Chih Hung, Vincent Casser, Henrik Kretzschmar, Jyh-Jing Hwang, and Dragomir Anguelov. Let-3d-ap: Longitudinal error tolerant 3d average precision for camera-only 3d detection. In ICRA, 2024
work page 2024
-
[37]
Cramnet: Camera-radar fusion with ray-constrained cross-attention for robust 3d object detection
Jyh-Jing Hwang, Henrik Kretzschmar, Joshua Manela, Sean Rafferty, Nicholas Armstrong-Crews, Tiffany Chen, and Dragomir Anguelov. Cramnet: Camera-radar fusion with ray-constrained cross-attention for robust 3d object detection. In ECCV, 2022
work page 2022
-
[38]
Vad: Vectorized scene representation for efficient autonomous driving
Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In ICCV, 2023
work page 2023
-
[39]
Alex Kendall, Jeffrey Hawke, David Janz, Przemyslaw Mazur, Daniele Reda, John-Mark Allen, Vinh-Dieu Lam, Alex Bewley, and Amar Shah. Learning to drive in a day. In ICRA, 2019
work page 2019
-
[40]
Pointpillars: Fast encoders for object detection from point clouds
Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 2019
work page 2019
-
[41]
Sara-rt: Scaling up robotics transformers with self-adaptive robust attention
Isabel Leal, Krzysztof Choromanski, Deepali Jain, Avinava Dubey, Jake Varley, Michael Ryoo, Yao Lu, Frederick Liu, Vikas Sindhwani, Quan Vuong, et al. Sara-rt: Scaling up robotics transformers with self-adaptive robust attention. In ICRA, 2024
work page 2024
-
[42]
Hdmapnet: An online hd map construction and evaluation framework
Qi Li, Yue Wang, Yilun Wang, and Hang Zhao. Hdmapnet: An online hd map construction and evaluation framework. In ICRA, 2022 a
work page 2022
-
[43]
Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022 b
work page 2022
-
[44]
Is ego status all you need for open-loop end-to-end autonomous driving? In CVPR, 2024
Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In CVPR, 2024
work page 2024
-
[45]
Cirl: Controllable imitative reinforcement learning for vision-based self-driving
Xiaodan Liang, Tairui Wang, Luona Yang, and Eric Xing. Cirl: Controllable imitative reinforcement learning for vision-based self-driving. In ECCV, 2018
work page 2018
-
[46]
Maptr: Structured modeling and learning for online vectorized hd map construction
Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: Structured modeling and learning for online vectorized hd map construction. In ICLR, 2023
work page 2023
-
[47]
Lane graph as path: Continuity-preserving path-wise modeling for online lane graph construction
Bencheng Liao, Shaoyu Chen, Bo Jiang, Tianheng Cheng, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Lane graph as path: Continuity-preserving path-wise modeling for online lane graph construction. In ECCV, 2024 a
work page 2024
-
[48]
Maptrv2: An end-to-end framework for online vectorized hd map construction
Bencheng Liao, Shaoyu Chen, Yunchi Zhang, Bo Jiang, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Maptrv2: An end-to-end framework for online vectorized hd map construction. IJCV, 2024 b
work page 2024
-
[49]
Titrated: Learned human driving behavior without infractions via amortized inference
Vasileios Lioutas, Adam Scibior, and Frank Wood. Titrated: Learned human driving behavior without infractions via amortized inference. Transactions on Machine Learning Research, 2022
work page 2022
-
[50]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2024 a
work page 2024
-
[51]
Grounding dino: Marrying dino with grounded pre-training for open-set object detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In ECCV, 2024 b
work page 2024
-
[52]
Vectormapnet: End-to-end vectorized hd map learning
Yicheng Liu, Tianyuan Yuan, Yue Wang, Yilun Wang, and Hang Zhao. Vectormapnet: End-to-end vectorized hd map learning. In ICML, 2023
work page 2023
-
[53]
Unified-io: A unified model for vision, language, and multi-modal tasks
Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. In ICLR, 2022
work page 2022
-
[54]
Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action
Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In CVPR, 2024
work page 2024
-
[55]
Wayformer: Motion forecasting via simple & efficient attention networks
Nigamaa Nayakanti, Rami Al-Rfou, Aurick Zhou, Kratarth Goel, Khaled S Refaat, and Benjamin Sapp. Wayformer: Motion forecasting via simple & efficient attention networks. In ICRA, 2023
work page 2023
-
[56]
Vlp: Vision language planning for autonomous driving
Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, and Liu Ren. Vlp: Vision language planning for autonomous driving. In CVPR, 2024
work page 2024
-
[57]
Kosmos-2: Grounding multimodal large language models to the world
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. In ICLR, 2024
[58]
Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In NeurIPS, 1988
[59]
Aditya Prakash, Kashyap Chitta, and Andreas Geiger. Multi-modal fusion transformer for end-to-end autonomous driving. In CVPR, 2021
[60]
Limeng Qiao, Wenjie Ding, Xi Qiu, and Chi Zhang. End-to-end vectorized hd-map construction with piecewise bezier curve. In CVPR, 2023
[61]
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. OpenAI blog, 2018
[62]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019
[63]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020
[65]
Ari Seff, Brian Cera, Dian Chen, Mason Ng, Aurick Zhou, Nigamaa Nayakanti, Khaled S Refaat, Rami Al-Rfou, and Benjamin Sapp. Motionlm: Multi-agent motion forecasting as language modeling. In ICCV, 2023
[66]
Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. In CVPR, 2024
[67]
Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Mtr++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying. PAMI, 2024
[68]
Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In ECCV, 2024
[69]
Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020
[70]
Pei Sun, Mingxing Tan, Weiyue Wang, Chenxi Liu, Fei Xia, Zhaoqi Leng, and Dragomir Anguelov. Swformer: Sparse window transformer for 3d object detection in point clouds. In ECCV, 2022
[71]
Xingpeng Sun, Haoming Meng, Souradip Chakraborty, Amrit Singh Bedi, and Aniket Bera. Beyond text: Utilizing vocal cues to improve decision making in llms for robot navigation tasks. Transactions on Machine Learning Research, 2024
[72]
Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis. In CVPR, 2022
[73]
Siyu Teng, Xuemin Hu, Peng Deng, Bai Li, Yuchen Li, Yunfeng Ai, Dongsheng Yang, Lingxi Li, Zhe Xuanyuan, Fenghua Zhu, et al. Motion planning for autonomous driving: The state of the art and future perspectives. T-IV, 2023
[74]
Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. In CoRL, 2024
[75]
Marin Toromanoff, Emilie Wirbel, and Fabien Moutarde. End-to-end model-free reinforcement learning for urban driving using implicit affordances. In CVPR, 2020
[78]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017
[79]
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015
[81]
Mianchu Wang, Rui Yang, Xi Chen, Hao Sun, Meng Fang, and Giovanni Montana. Goplan: Goal-conditioned offline reinforcement learning by planning with learned models. Transactions on Machine Learning Research, 2023
[82]
Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, 2022
[84]
Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In ICCV, 2021
[86]
Tsun-Hsuan Wang, Alaa Maalouf, Wei Xiao, Yutong Ban, Alexander Amini, Guy Rosman, Sertac Karaman, and Daniela Rus. Drive anywhere: Generalizable end-to-end autonomous driving with multi-modal foundation models. In ICRA, 2024c
[87]
Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. In NeurIPS, 2024d
[88]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022
[89]
Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. In CVPR, 2024
[90]
Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. In NeurIPS, 2022
[91]
Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. RA-L, 2024
[93]
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. TMLR, 2022