arXiv: 2410.23262 · v3 · submitted 2024-10-30 · cs.CV · cs.AI · cs.CL · cs.LG · cs.RO

EMMA: End-to-End Multimodal Model for Autonomous Driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, Yin Zhou, James Guo, Dragomir Anguelov, Mingxing Tan

Pith reviewed 2026-05-15 05:03 UTC · model grok-4.3

keywords: autonomous driving, end-to-end model, multimodal language model, motion planning, 3D object detection, road graph prediction, nuScenes, Waymo

The pith

EMMA turns raw camera images into driving trajectories, object detections, and road graphs by encoding all outputs as natural language text inside a multimodal LLM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EMMA as an end-to-end model that takes camera sensor data plus navigation instructions and vehicle status, then produces planner trajectories, 3D object detections, and road graph elements. It does so by converting every non-sensor input and every output into plain text, so a single pre-trained multimodal language model can handle planning, perception, and mapping together through task-specific prompts. This unified text space lets the model draw on broad world knowledge while jointly optimizing across tasks. On public benchmarks the approach reaches state-of-the-art motion planning on nuScenes, competitive planning results on WOMD, and competitive camera-primary 3D detection on WOD. Co-training the three tasks improves performance in each domain rather than forcing a trade-off.

Core claim

By representing trajectories, 3D locations, and road elements as natural language text, EMMA lets a multimodal LLM jointly process raw camera images and generate accurate outputs for motion planning, object detection, and road graph prediction, reaching state-of-the-art planning results on nuScenes and competitive results on Waymo datasets.

What carries the argument

The unified language space that encodes all non-sensor inputs and all spatial outputs (trajectories, 3D locations, road graphs) as natural language text, allowing one model and task-specific prompts to handle multiple driving tasks together.
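To make the idea of a text-encoded trajectory concrete, the following is a minimal sketch of serializing waypoints to a string and parsing them back. The `(x, y)` pair format, the two-decimal precision, and the function names `encode_trajectory` and `decode_trajectory` are illustrative assumptions, not the format the paper uses.

```python
import re
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (x, y) in meters; hypothetical ego-frame convention

# Matches "(x, y)" pairs with optional sign and decimals (assumed format).
_PAIR = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def encode_trajectory(waypoints: List[Waypoint], decimals: int = 2) -> str:
    """Serialize waypoints as plain text, rounding to a fixed number of decimals."""
    return " ".join(f"({x:.{decimals}f}, {y:.{decimals}f})" for x, y in waypoints)


def decode_trajectory(text: str) -> List[Waypoint]:
    """Parse every '(x, y)' pair found in the text back into floats."""
    return [(float(x), float(y)) for x, y in _PAIR.findall(text)]


if __name__ == "__main__":
    trajectory = [(0.83, 0.01), (1.65, 0.03), (2.48, 0.05)]
    text = encode_trajectory(trajectory)
    print(text)
    print(decode_trajectory(text))
```

Anything finer than the chosen precision is lost at encoding time, which is why the choice of number format matters for the premise discussed further down.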

If this is right

Joint training on planner trajectories, object detection, and road graphs produces gains in all three tasks simultaneously.
A single set of model weights can generate outputs for different driving subtasks simply by changing the prompt text.
The same architecture scales across multiple public driving benchmarks without task-specific heads or loss functions.
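The second point above, one set of weights with the task selected by prompt text, can be sketched as a simple prompt builder. The template wording, the task keys, and the `build_prompt` helper are hypothetical; the page notes that the paper's exact prompt templates are not given.

```python
# Hypothetical task instructions; the paper's real prompt templates are not shown here.
TASK_PROMPTS = {
    "planning": "Predict the ego vehicle's future waypoints.",
    "detection": "List the 3D locations of objects in the scene.",
    "road_graph": "Describe the road graph elements in the scene.",
}


def build_prompt(task: str, navigation: str, ego_status: str) -> str:
    """Combine shared text context with a task-specific instruction.

    Only the final line changes between tasks; the model and the
    non-sensor context stay the same.
    """
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    return (
        f"Navigation: {navigation}\n"
        f"Ego status: {ego_status}\n"
        f"Task: {TASK_PROMPTS[task]}"
    )


if __name__ == "__main__":
    for task in TASK_PROMPTS:
        print(build_prompt(task, "turn left at the next intersection", "speed 5.0 m/s"))
        print("---")
```

In this sketch the camera images would be passed to the model alongside the prompt; only the text portion is shown.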

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If text encoding works for these tasks, the same pattern could absorb additional inputs such as HD maps or V2X messages without new model components.
Real-world deployment would still require verification that text decoding never drops critical spatial constraints that numeric planners enforce directly.
The approach suggests future driving stacks could treat perception, prediction, and planning as prompt variations inside one model rather than separate modules.

Load-bearing premise

Converting precise geometric quantities such as trajectories and 3D object positions into natural language text preserves every detail needed for safe control without loss of spatial accuracy.
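A short worked bound shows how much error the serialization step alone can add. It assumes, purely for illustration, that each coordinate is written as a decimal string rounded to d places; the paper's actual number format is not stated on this page. For a point p with n coordinates and its text-decoded counterpart \hat{p}:

```latex
\[
  |x_i - \hat{x}_i| \le \tfrac{1}{2}\cdot 10^{-d}
  \quad\Longrightarrow\quad
  \lVert p - \hat{p} \rVert_2 \le \frac{\sqrt{n}}{2}\cdot 10^{-d}.
\]
% Example: a 2D waypoint (n = 2) written with d = 2 decimals, in meters:
\[
  \lVert p - \hat{p} \rVert_2 \le \frac{\sqrt{2}}{2}\cdot 10^{-2}\,\text{m} \approx 0.0071\,\text{m}.
\]
```

This bounds only the rounding introduced by writing numbers as text. It says nothing about how accurately the model predicts those numbers, which is the part the premise leaves open.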

What would settle it

Demonstration that EMMA produces colliding or off-road trajectories in dense urban scenarios where centimeter-level geometry matters, while a conventional geometric planner succeeds on the same inputs.

Original abstract

We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built upon a multi-modal large language model foundation like Gemini, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models, by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. We hope that our results will inspire research to further evolve the state of the art in autonomous driving model architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EMMA shows that a Gemini-style multimodal LLM can handle planning, detection, and road graphs by turning all non-sensor inputs and all outputs into text, with gains from co-training, but the geometric cost of that text format is not measured.

EMMA's core move is to run planning, perception, and road-graph prediction inside one multimodal LLM: camera images go in directly, while navigation commands, ego state, and all outputs are encoded as natural language text. It reports state-of-the-art motion planning on nuScenes and competitive numbers on the Waymo Open Motion Dataset and on Waymo Open Dataset detection. Co-training across trajectories, objects, and road graphs improves every task, which is the clearest empirical signal in the work: it shows that the shared LLM representation can transfer knowledge without obvious negative interference. Prompt-based output generation keeps the architecture simple and reuses the pre-trained backbone directly.

The main limitation is the text encoding of continuous geometry. Trajectories and 3D points are serialized as coordinate strings, so tokenization, rounding, and decoding can each introduce error, and that error is never quantified with a reconstruction metric or compared against a continuous regression head. The benchmark scores look fine, but without that check it is impossible to separate faithful geometry from metric tolerance.

The paper is aimed at groups already experimenting with foundation models for driving stacks. Anyone building production systems will want to see the missing precision analysis before adopting the approach. The paper is coherent enough, and its results concrete enough, to justify sending it to referees who can examine the full training details and any unreported ablations.

Referee Report

2 major / 2 minor

Summary. The paper introduces EMMA, an end-to-end multimodal model for autonomous driving built on a pre-trained LLM foundation (Gemini). It directly maps raw camera inputs to driving outputs including planner trajectories, 3D object detections, and road-graph elements by encoding all non-sensor inputs/outputs as natural-language text, enabling joint processing via task-specific prompts. The work reports state-of-the-art motion-planning results on nuScenes, competitive performance on WOMD and camera-primary 3D detection on WOD, and performance gains from co-training the three tasks.

Significance. If the results hold under rigorous validation, the work is significant for showing that a pre-trained LLM backbone plus text-based unification can deliver competitive or superior driving performance while leveraging pre-trained world knowledge and enabling multi-task synergies. The co-training improvements across perception, planning, and mapping tasks provide evidence for the value of a generalist text interface in autonomous driving.

major comments (2)

[§3.2] (Output Representation): The central modeling choice of encoding continuous trajectories, 3D object centers, and road-graph polylines as tokenized natural-language strings is load-bearing for all reported metrics, yet the manuscript supplies no reconstruction-error quantification (e.g., mean L2 deviation between original coordinates and text-decoded outputs on the validation split). Without this measurement, it is impossible to determine whether the SOTA nuScenes planning numbers reflect faithful geometry preservation or metric tolerance of discretization artifacts.
[§4.3] (Ablation and Co-training Results): The claim that co-training yields improvements across all three domains rests on comparisons that lack an ablation isolating the text-discretization head versus a continuous regression head; the reported gains could therefore be confounded by the choice of output representation rather than task synergy.
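The measurement requested in the first major comment could look roughly like the sketch below, which computes the mean L2 deviation between original coordinates and their text-decoded counterparts at several precisions. The comma-separated decimal format, the synthetic points, and the helper names `round_trip` and `mean_l2_error` are assumptions for illustration; a real check would use the paper's own serialization and the benchmark's validation split.

```python
import math
import random
from typing import List, Sequence, Tuple


def round_trip(point: Sequence[float], decimals: int) -> Tuple[float, ...]:
    """Write a point as decimal text, then parse it back (assumed format)."""
    text = ", ".join(f"{c:.{decimals}f}" for c in point)
    return tuple(float(tok) for tok in text.split(", "))


def mean_l2_error(points: List[Sequence[float]], decimals: int) -> float:
    """Mean Euclidean distance between each point and its text-decoded copy."""
    total = sum(math.dist(p, round_trip(p, decimals)) for p in points)
    return total / len(points)


# Synthetic 3D object centers in meters; stand-ins for real validation data.
rng = random.Random(0)
points = [
    (rng.uniform(-50.0, 50.0), rng.uniform(-50.0, 50.0), rng.uniform(-2.0, 4.0))
    for _ in range(1000)
]

if __name__ == "__main__":
    for decimals in (0, 1, 2, 3):
        print(f"decimals={decimals}: mean L2 error = {mean_l2_error(points, decimals):.6f} m")
```

This isolates the error from serialization only. Comparing it with the reported planning or detection error would indicate whether discretization is negligible or a meaningful share of the total.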

minor comments (2)

[Figure 3, Table 1] Axis labels and coordinate units for the visualized trajectories are not explicitly stated, making it difficult to verify geometric fidelity at a glance.
[§4.1] The training-details paragraph omits the exact prompt templates and tokenization scheme for coordinate strings; providing these in an appendix would improve reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for explicit validation of our output encoding and for sharpening the interpretation of our co-training ablations. We address each comment below and will revise the manuscript accordingly.

Referee: [§3.2] The central modeling choice of encoding continuous trajectories, 3D object centers, and road-graph polylines as tokenized natural-language strings is load-bearing for all reported metrics, yet the manuscript supplies no reconstruction-error quantification (e.g., mean L2 deviation between original coordinates and text-decoded outputs on the validation split). Without this measurement, it is impossible to determine whether the SOTA nuScenes planning numbers reflect faithful geometry preservation or metric tolerance of discretization artifacts.

Authors: We agree that an explicit quantification of discretization error is necessary to substantiate the geometric fidelity of the text-based outputs. In the revised manuscript we will add a dedicated paragraph to §3.2 that reports the mean L2 reconstruction error (in meters) for planner trajectories, 3D object centers, and road-graph polylines on the nuScenes validation split, computed by decoding the generated text tokens back to coordinates and comparing against the original ground-truth values. This addition will allow readers to verify that discretization artifacts remain negligible relative to the reported planning metrics.
Revision: yes
Referee: [§4.3] The claim that co-training yields improvements across all three domains rests on comparisons that lack an ablation isolating the text-discretization head versus a continuous regression head; the reported gains could therefore be confounded by the choice of output representation rather than task synergy.

Authors: We respectfully note that the co-training ablations already control for output representation: every single-task and multi-task variant of EMMA uses the identical text-discretization head. Consequently, performance differences between these variants can be attributed to the benefits of joint optimization in the shared language space rather than to the representation itself. Because the text interface is a foundational design choice that enables the pre-trained LLM to process all tasks uniformly, a continuous-regression ablation would require an entirely different architecture outside the scope of the present work. We will revise §4.3 to explicitly articulate this controlled comparison and to clarify that the observed synergies arise from multi-task training within the unified text framework.
Revision: partial

Circularity Check

0 steps flagged

No circularity; empirical model with public-benchmark results

The paper describes an end-to-end trained multimodal LLM (Gemini backbone plus task prompts) that outputs trajectories and detections as tokenized text. All reported numbers are standard benchmark metrics on nuScenes, WOMD, and WOD; no equations, fitted parameters, or self-referential predictions are present. The text-representation choice is an architectural decision whose information-loss consequences are not measured in the provided text, but that absence does not create a circular derivation. No self-citation chains or uniqueness theorems are invoked to justify the core claims.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The approach rests on the pre-trained weights of an existing multimodal LLM and on standard supervised fine-tuning; no new physical axioms or invented entities are introduced.

free parameters (1)

task-specific prompt templates
Prompt wording is chosen by the authors and directly affects output quality.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation · washburn_uniqueness_aczel · contradicts
Paper passage: "representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text"

Foundation.LedgerForcing · conservation_from_balance · unclear
Paper passage: "This approach allows EMMA to jointly process various driving tasks in a unified language space"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views
cs.RO 2026-04 conditional novelty 8.0

V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baselin...
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO 2026-05 unverdicted novelty 7.0

MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
Learning Vision-Language-Action World Models for Autonomous Driving
cs.CV 2026-04 unverdicted novelty 7.0

VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
cs.CV 2025-06 unverdicted novelty 7.0

ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.
MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving
cs.RO 2026-05 unverdicted novelty 6.0

MAPLE performs closed-loop multi-agent training of VLA driving models entirely in latent space using supervised fine-tuning followed by RL with safety, progress, and diversity rewards, reaching SOTA on Bench2Drive.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO 2026-05 unverdicted novelty 6.0

MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction
cs.CV 2026-05 unverdicted novelty 6.0

Smaller end-to-end autonomous driving models achieve optimal 3-second trajectory prediction accuracy at lower or intermediate temporal sampling frequencies, whereas larger VLA-style models perform best at the highest ...
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
cs.CV 2026-05 unverdicted novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting
cs.CV 2026-05 unverdicted novelty 6.0

EggHand unifies VLA action decoding with viewpoint-aware video-text encoding to forecast egocentric hand poses, achieving SOTA accuracy on EgoExo4D while remaining robust to ego-motion and controllable via language prompts.
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
cs.CV 2026-04 unverdicted novelty 6.0

OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
cs.CV 2026-04 unverdicted novelty 6.0

Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.
ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
cs.CV 2026-04 unverdicted novelty 6.0

ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
cs.CV 2026-04 unverdicted novelty 6.0

DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
cs.CV 2026-03 unverdicted novelty 6.0

CausalVAD applies sparse causal intervention to remove spurious correlations from end-to-end autonomous driving models, reporting state-of-the-art planning accuracy and robustness on nuScenes.
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
cs.CV 2025-06 unverdicted novelty 6.0

AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...
Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling
cs.RO 2026-05 unverdicted novelty 5.0

CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and Success Rate 71.81 on Bench2Drive plus PDMS 91.1 on NAVSIM.
C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving
cs.CV 2026-05 unverdicted novelty 5.0

C-CoT applies VLMs to autonomous driving via five-stage reasoning with a meta-action tree for counterfactuals, yielding 81.9% risk recall, 3.52% collision rate, and 1.98 m L2 error on a new dataset.
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 5.0

VLADriver-RAG achieves state-of-the-art performance on Bench2Drive by grounding VLA planning in structure-aware retrieved priors via spatiotemporal semantic graphs and Graph-DTW alignment.
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
cs.CV 2026-04 unverdicted novelty 5.0

SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
cs.CV 2026-05 unverdicted novelty 4.0

DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
cs.RO 2026-04 accept novelty 4.0

A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
cs.CV 2026-04 unverdicted novelty 4.0

XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...

Reference graph

Works this paper leans on

205 extracted references · 205 canonical work pages · cited by 22 Pith papers · 10 internal anchors

[2]

Video-language critic: Transferable reward functions for language-conditioned robotics

Minttu Alakuijala, Reginald McLean, Isaac Woungang, Nariman Farsad, Samuel Kaski, Pekka Marttinen, and Kai Yuan. Video-language critic: Transferable reward functions for language-conditioned robotics. Transactions on Machine Learning Research, 2024

work page 2024
[3]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022

work page 2022
[5]

Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst

Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. RSS, 2019

work page 2019
[6]

Look, remember and reason: Grounded reasoning in videos with language models

Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, and Roland Memisevic. Look, remember and reason: Grounded reasoning in videos with language models. In ICRA, 2023

work page 2023
[8]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023

work page 2023
[9]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020

work page 2020
[10]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020

work page 2020
[11]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020

work page 2020
[12]

Gri: General reinforced imitation and its application to vision-based autonomous driving

Raphael Chekroun, Marin Toromanoff, Sascha Hornauer, and Fabien Moutarde. Gri: General reinforced imitation and its application to vision-based autonomous driving. Robotics, 2023

work page 2023
[13]

a henb \

Dian Chen, Brady Zhou, Vladlen Koltun, and Philipp Kr \"a henb \"u hl. Learning by cheating. In CoRL, 2020

work page 2020
[14]

a henb \

Dian Chen, Vladlen Koltun, and Philipp Kr \"a henb \"u hl. Learning to drive from a world on rails. In ICCV, 2021

work page 2021
[15]

Womd-lidar: Raw sensor dataset benchmark for motion forecasting

Kan Chen, Runzhou Ge, Hang Qiu, Rami Ai-Rfou, Charles R Qi, Xuanyu Zhou, Zoey Yang, Scott Ettinger, Pei Sun, Zhaoqi Leng, et al. Womd-lidar: Raw sensor dataset benchmark for motion forecasting. In ICRA, 2024 a

work page 2024
[16]

Driving with llms: Fusing object-level vector modality for explainable autonomous driving

Long Chen, Oleg Sinavski, Jan H \"u nermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. Driving with llms: Fusing object-level vector modality for explainable autonomous driving. In ICRA, 2024 b

work page 2024
[18]

Pix2seq: A language modeling framework for object detection

Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. In ICLR, 2022 a

work page 2022
[19]

A unified sequence interface for vision tasks

Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J Fleet, and Geoffrey E Hinton. A unified sequence interface for vision tasks. In NeurIPS, 2022 b

work page 2022
[20]

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carl...

work page 2023
[21]

Pali-x: On scaling up a multilingual vision and language model

Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. In CVPR, 2024 d

work page 2024
[22]

Transfuser: Imitation with transformer-based sensor fusion for autonomous driving

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. PAMI, 2022

work page 2022
[23]

Unifying vision-and-language tasks via text generation

Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In ICML, 2021

work page 2021
[24]

Palm: Scaling language modeling with pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. JMLR, 2023

work page 2023
[25]

End-to-end driving via conditional imitation learning

Felipe Codevilla, Matthias M \"u ller, Antonio L \'o pez, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In ICRA, 2018

work page 2018
[26]

Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. NeurIPS, 2024

work page 2024
[27]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019

work page 2019
[28]

Pivotnet: Vectorized pivot learning for end-to-end hd map construction

Wenjie Ding, Limeng Qiao, Xi Qiu, and Chi Zhang. Pivotnet: Vectorized pivot learning for end-to-end hd map construction. In ICCV, 2023

work page 2023
[29]

Long-term recurrent convolutional networks for visual recognition and description

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015

work page 2015
[32]

Open-vocabulary object detection via vision and language knowledge distillation

Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022

work page 2022
[33]

Training compute-optimal large language models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In NeurIPS, 2022

work page 2022
[34]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In CVPR, 2023

work page 2023
[35]

Language is not all you need: Aligning perception with language models

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, et al. Language is not all you need: Aligning perception with language models. In NeurIPS, 2023

work page 2023
[36]

Let-3d-ap: Longitudinal error tolerant 3d average precision for camera-only 3d detection

Wei-Chih Hung, Vincent Casser, Henrik Kretzschmar, Jyh-Jing Hwang, and Dragomir Anguelov. Let-3d-ap: Longitudinal error tolerant 3d average precision for camera-only 3d detection. In ICRA, 2024

work page 2024
[37]

Cramnet: Camera-radar fusion with ray-constrained cross-attention for robust 3d object detection

Jyh-Jing Hwang, Henrik Kretzschmar, Joshua Manela, Sean Rafferty, Nicholas Armstrong-Crews, Tiffany Chen, and Dragomir Anguelov. Cramnet: Camera-radar fusion with ray-constrained cross-attention for robust 3d object detection. In ECCV, 2022

[38]

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In ICCV, 2023

[39]

Alex Kendall, Jeffrey Hawke, David Janz, Przemyslaw Mazur, Daniele Reda, John-Mark Allen, Vinh-Dieu Lam, Alex Bewley, and Amar Shah. Learning to drive in a day. In ICRA, 2019

[40]

Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 2019

[41]

Isabel Leal, Krzysztof Choromanski, Deepali Jain, Avinava Dubey, Jake Varley, Michael Ryoo, Yao Lu, Frederick Liu, Vikas Sindhwani, Quan Vuong, et al. Sara-rt: Scaling up robotics transformers with self-adaptive robust attention. In ICRA, 2024

[42]

Qi Li, Yue Wang, Yilun Wang, and Hang Zhao. Hdmapnet: An online hd map construction and evaluation framework. In ICRA, 2022a

[43]

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022b

[44]

Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In CVPR, 2024

[45]

Xiaodan Liang, Tairui Wang, Luona Yang, and Eric Xing. Cirl: Controllable imitative reinforcement learning for vision-based self-driving. In ECCV, 2018

[46]

Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: Structured modeling and learning for online vectorized hd map construction. In ICLR, 2023

[47]

Bencheng Liao, Shaoyu Chen, Bo Jiang, Tianheng Cheng, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Lane graph as path: Continuity-preserving path-wise modeling for online lane graph construction. In ECCV, 2024a

[48]

Bencheng Liao, Shaoyu Chen, Yunchi Zhang, Bo Jiang, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Maptrv2: An end-to-end framework for online vectorized hd map construction. IJCV, 2024b

[49]

Vasileios Lioutas, Adam Scibior, and Frank Wood. Titrated: Learned human driving behavior without infractions via amortized inference. Transactions on Machine Learning Research, 2022

[50]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2024a

[51]

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In ECCV, 2024b

[52]

Yicheng Liu, Tianyuan Yuan, Yue Wang, Yilun Wang, and Hang Zhao. Vectormapnet: End-to-end vectorized hd map learning. In ICML, 2023

[53]

Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. In ICLR, 2022

[54]

Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In CVPR, 2024

[55]

Nigamaa Nayakanti, Rami Al-Rfou, Aurick Zhou, Kratarth Goel, Khaled S Refaat, and Benjamin Sapp. Wayformer: Motion forecasting via simple & efficient attention networks. In ICRA, 2023

[56]

Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, and Liu Ren. Vlp: Vision language planning for autonomous driving. In CVPR, 2024

[57]

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. In ICLR, 2024

[58]

Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In NeurIPS, 1988

[59]

Aditya Prakash, Kashyap Chitta, and Andreas Geiger. Multi-modal fusion transformer for end-to-end autonomous driving. In CVPR, 2021

[60]

Limeng Qiao, Wenjie Ding, Xi Qiu, and Chi Zhang. End-to-end vectorized hd-map construction with piecewise bezier curve. In CVPR, 2023

[61]

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. OpenAI blog, 2018

[62]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019

[63]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020

[65]

Ari Seff, Brian Cera, Dian Chen, Mason Ng, Aurick Zhou, Nigamaa Nayakanti, Khaled S Refaat, Rami Al-Rfou, and Benjamin Sapp. Motionlm: Multi-agent motion forecasting as language modeling. In ICCV, 2023

[66]

Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. In CVPR, 2024

[67]

Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Mtr++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying. PAMI, 2024

[68]

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In ECCV, 2024

[69]

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020

[70]

Pei Sun, Mingxing Tan, Weiyue Wang, Chenxi Liu, Fei Xia, Zhaoqi Leng, and Dragomir Anguelov. Swformer: Sparse window transformer for 3d object detection in point clouds. In ECCV, 2022

[71]

Xingpeng Sun, Haoming Meng, Souradip Chakraborty, Amrit Singh Bedi, and Aniket Bera. Beyond text: Utilizing vocal cues to improve decision making in llms for robot navigation tasks. Transactions on Machine Learning Research, 2024

[72]

Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis. In CVPR, 2022

[73]

Siyu Teng, Xuemin Hu, Peng Deng, Bai Li, Yuchen Li, Yunfeng Ai, Dongsheng Yang, Lingxi Li, Zhe Xuanyuan, Fenghua Zhu, et al. Motion planning for autonomous driving: The state of the art and future perspectives. T-IV, 2023

[74]

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. In CoRL, 2024

[75]

Marin Toromanoff, Emilie Wirbel, and Fabien Moutarde. End-to-end model-free reinforcement learning for urban driving using implicit affordances. In CVPR, 2020

[78]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017

[79]

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015

[81]

Mianchu Wang, Rui Yang, Xi Chen, Hao Sun, Meng Fang, and Giovanni Montana. Goplan: Goal-conditioned offline reinforcement learning by planning with learned models. Transactions on Machine Learning Research, 2023

[82]

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, 2022

[84]

Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In ICCV, 2021

[86]

Tsun-Hsuan Wang, Alaa Maalouf, Wei Xiao, Yutong Ban, Alexander Amini, Guy Rosman, Sertac Karaman, and Daniela Rus. Drive anywhere: Generalizable end-to-end autonomous driving with multi-modal foundation models. In ICRA, 2024c

[87]

Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. In NeurIPS, 2024d

[88]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022

[89]

Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. In CVPR, 2024

[90]

Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. In NeurIPS, 2022

[91]

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. RA-L, 2024

work page 2024
[93]

Coca: Contrastive captioners are image-text foundation models

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. TMLR, 2022
