pith. machine review for the scientific record.

arxiv: 2410.23262 · v3 · submitted 2024-10-30 · 💻 cs.CV · cs.AI · cs.CL · cs.LG · cs.RO

EMMA: End-to-End Multimodal Model for Autonomous Driving

Pith reviewed 2026-05-15 05:03 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG · cs.RO
keywords autonomous driving, end-to-end model, multimodal language model, motion planning, 3D object detection, road graph prediction, nuScenes, Waymo

The pith

EMMA turns raw camera images into driving trajectories, object detections, and road graphs by encoding all outputs as natural language text inside a multimodal LLM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EMMA as an end-to-end model that takes camera sensor data plus navigation and vehicle status, then produces planner trajectories, 3D object detections, and road graph elements. It does so by converting every input and output into plain text so a single pre-trained language model can handle planning, perception, and mapping tasks together through task-specific prompts. This unified text space lets the model draw on broad world knowledge while jointly optimizing across tasks. On benchmarks the approach reaches state-of-the-art motion planning on nuScenes and competitive scores for planning on WOMD and camera-based 3D detection on WOD. Co-training the three tasks together improves performance in each domain rather than trading off.

Core claim

By representing trajectories, 3D locations, and road elements as natural language text, EMMA lets a multimodal LLM jointly process raw camera images and generate accurate outputs for motion planning, object detection, and road graph prediction, reaching state-of-the-art planning results on nuScenes and competitive results on Waymo datasets.

What carries the argument

The unified language space that encodes all non-sensor inputs and all spatial outputs (trajectories, 3D locations, road graphs) as natural language text, allowing one model and task-specific prompts to handle multiple driving tasks together.
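
To make the text encoding concrete, here is a minimal sketch of how a planner trajectory could be written out as plain text and parsed back. The string layout, the two-decimal precision, and the helper names `encode_waypoints` and `decode_waypoints` are assumptions for illustration; this page does not give the paper's actual coordinate format or tokenization scheme.

```python
import re

def encode_waypoints(waypoints, decimals=2):
    """Serialize (x, y) waypoints in meters as plain text (hypothetical format)."""
    return " ".join(f"({x:.{decimals}f}, {y:.{decimals}f})" for x, y in waypoints)

# Matches "(x, y)" pairs of signed decimal numbers.
_PAIR = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")

def decode_waypoints(text):
    """Parse the text produced by encode_waypoints back into float pairs."""
    return [(float(x), float(y)) for x, y in _PAIR.findall(text)]

trajectory = [(0.0, 0.0), (1.234, -0.051), (2.5, -0.2)]
text = encode_waypoints(trajectory)
decoded = decode_waypoints(text)
print(text)     # (0.00, 0.00) (1.23, -0.05) (2.50, -0.20)
print(decoded)  # [(0.0, 0.0), (1.23, -0.05), (2.5, -0.2)]
```

The round trip is lossy by construction: anything finer than the chosen number of decimals is dropped, which is the concern raised under "Load-bearing premise".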

If this is right

  • Joint training on planner trajectories, object detection, and road graphs produces gains in all three tasks simultaneously.
  • A single set of model weights can generate outputs for different driving subtasks simply by changing the prompt text.
  • The same architecture scales across multiple public driving benchmarks without task-specific heads or loss functions.
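
The second point, switching tasks by changing only the prompt, can be sketched as a lookup over templates. The template wording, the field names, and the `build_prompt` helper below are hypothetical; the referee report notes that the manuscript does not supply its exact prompt templates, so none of this should be read as the authors' wording.

```python
# Hypothetical task prompts; the paper's real templates are not given on this page.
PROMPT_TEMPLATES = {
    "planning": (
        "Navigation command: {command}. Past ego positions: {history}. "
        "Predict the future ego trajectory as a list of (x, y) waypoints in meters."
    ),
    "detection": "List every 3D object as (class, x, y, z, length, width, height, heading).",
    "road_graph": "Describe the drivable lanes as polylines of (x, y) points in meters.",
}

def build_prompt(task, **fields):
    """Select the template for a task and fill in its text fields."""
    if task not in PROMPT_TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    return PROMPT_TEMPLATES[task].format(**fields)

planning_prompt = build_prompt(
    "planning", command="turn left", history="(0.00, 0.00) (-1.20, 0.00)"
)
detection_prompt = build_prompt("detection")
print(planning_prompt)
print(detection_prompt)
```

In this sketch the same model would receive the camera images plus one of these strings; only the string changes between tasks.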

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If text encoding works for these tasks, the same pattern could absorb additional inputs such as HD maps or V2X messages without new model components.
  • Real-world deployment would still require verification that text decoding never drops critical spatial constraints that numeric planners enforce directly.
  • The approach suggests future driving stacks could treat perception, prediction, and planning as prompt variations inside one model rather than separate modules.

Load-bearing premise

Converting precise geometric quantities such as trajectories and 3D object positions into natural language text preserves every detail needed for safe control without loss of spatial accuracy.

What would settle it

Demonstration that EMMA produces colliding or off-road trajectories in dense urban scenarios where centimeter-level geometry matters, while a conventional geometric planner succeeds on the same inputs.
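
A coarse version of such a check can be sketched as follows. It treats the ego vehicle and each obstacle as discs, which is an assumption made for brevity; a real evaluation would use oriented bounding boxes and map geometry. The function name `first_collision` and the example numbers are illustrative only.

```python
import math

def first_collision(waypoints, obstacles, ego_radius=1.0):
    """Return the index of the first waypoint overlapping any obstacle disc, else None.

    waypoints: list of (x, y) in meters.
    obstacles: list of (x, y, radius) in meters (disc approximation, an assumption).
    """
    for i, (x, y) in enumerate(waypoints):
        for ox, oy, r in obstacles:
            if math.hypot(x - ox, y - oy) < ego_radius + r:
                return i
    return None

plan = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.1), (6.0, 0.3)]
obstacles = [(6.0, 1.5, 1.0)]
print(first_collision(plan, obstacles))  # 3
print(first_collision(plan, []))         # None
```

Running the same check on trajectories from a text-decoding model and from a geometric planner, over identical scenes, is the kind of comparison the paragraph above describes.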

original abstract

We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built upon a multi-modal large language model foundation like Gemini, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models, by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. We hope that our results will inspire research to further evolve the state of the art in autonomous driving model architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EMMA, an end-to-end multimodal model for autonomous driving built on a pre-trained LLM foundation (Gemini). It directly maps raw camera inputs to driving outputs including planner trajectories, 3D object detections, and road-graph elements by encoding all non-sensor inputs/outputs as natural-language text, enabling joint processing via task-specific prompts. The work reports state-of-the-art motion-planning results on nuScenes, competitive performance on WOMD and camera-primary 3D detection on WOD, and performance gains from co-training the three tasks.

Significance. If the results hold under rigorous validation, the work is significant for showing that a fine-tuned, pre-trained multimodal LLM backbone plus text-based unification can deliver competitive or superior driving performance while leveraging pre-trained world knowledge and enabling multi-task synergies. The co-training improvements across perception, planning, and mapping tasks provide evidence for the value of a generalist text interface in autonomous driving.

major comments (2)
  1. [§3.2] §3.2 (Output Representation): The central modeling choice of encoding continuous trajectories, 3D object centers, and road-graph polylines as tokenized natural-language strings is load-bearing for all reported metrics, yet the manuscript supplies no reconstruction-error quantification (e.g., mean L2 deviation between original coordinates and text-decoded outputs on the validation split). Without this measurement, it is impossible to determine whether the SOTA nuScenes planning numbers reflect faithful geometry preservation or metric tolerance of discretization artifacts.
  2. [§4.3] §4.3 (Ablation and Co-training Results): The claim that co-training yields improvements across all three domains rests on comparisons that lack an ablation isolating the text-discretization head versus a continuous regression head; the reported gains could therefore be confounded by the choice of output representation rather than task synergy.
minor comments (2)
  1. [Figure 3] Figure 3 and Table 1: axis labels and coordinate units for the visualized trajectories are not explicitly stated, making it difficult to verify geometric fidelity at a glance.
  2. [§4.1] The training-details paragraph in §4.1 omits the exact prompt templates and tokenization scheme for coordinate strings; providing these in an appendix would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for explicit validation of our output encoding and for sharpening the interpretation of our co-training ablations. We address each comment below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [§3.2] The central modeling choice of encoding continuous trajectories, 3D object centers, and road-graph polylines as tokenized natural-language strings is load-bearing for all reported metrics, yet the manuscript supplies no reconstruction-error quantification (e.g., mean L2 deviation between original coordinates and text-decoded outputs on the validation split). Without this measurement, it is impossible to determine whether the SOTA nuScenes planning numbers reflect faithful geometry preservation or metric tolerance of discretization artifacts.

    Authors: We agree that an explicit quantification of discretization error is necessary to substantiate the geometric fidelity of the text-based outputs. In the revised manuscript we will add a dedicated paragraph to §3.2 that reports the mean L2 reconstruction error (in meters) for planner trajectories, 3D object centers, and road-graph polylines on the nuScenes validation split, computed by decoding the generated text tokens back to coordinates and comparing against the original ground-truth values. This addition will allow readers to verify that discretization artifacts remain negligible relative to the reported planning metrics. revision: yes

  2. Referee: [§4.3] The claim that co-training yields improvements across all three domains rests on comparisons that lack an ablation isolating the text-discretization head versus a continuous regression head; the reported gains could therefore be confounded by the choice of output representation rather than task synergy.

    Authors: We respectfully note that the co-training ablations already control for output representation: every single-task and multi-task variant of EMMA uses the identical text-discretization head. Consequently, performance differences between these variants can be attributed to the benefits of joint optimization in the shared language space rather than to the representation itself. Because the text interface is a foundational design choice that enables the pre-trained LLM to process all tasks uniformly, a continuous-regression ablation would require an entirely different architecture outside the scope of the present work. We will revise §4.3 to explicitly articulate this controlled comparison and to clarify that the observed synergies arise from multi-task training within the unified text framework. revision: partial
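
The measurement requested in the first comment can be sketched in a few lines. This sketch isolates only the serialization error, meaning the error from writing coordinates with a fixed number of decimals and reading them back; it says nothing about model prediction error. The uniform sample of points, the coordinate range, and the function names are assumptions, not values from the paper.

```python
import math
import random

def roundtrip(point, decimals):
    """Write each coordinate as fixed-precision text and parse it back."""
    return tuple(float(f"{c:.{decimals}f}") for c in point)

def mean_l2_error(points, decimals):
    """Mean Euclidean distance between original and text-decoded points."""
    total = sum(math.dist(p, roundtrip(p, decimals)) for p in points)
    return total / len(points)

random.seed(0)
# Synthetic 2D positions in meters; a real study would use the validation split.
points = [(random.uniform(-50, 50), random.uniform(-50, 50)) for _ in range(10_000)]

for d in (0, 1, 2, 3):
    # Rounding changes each axis by at most 0.5 * 10**-d, so the 2D error
    # is at most sqrt(2) / 2 * 10**-d.
    bound = math.sqrt(2) / 2 * 10 ** (-d)
    print(f"decimals={d}  mean L2={mean_l2_error(points, d):.5f} m  bound={bound:.5f} m")
```

With two decimals the worst-case 2D error is about 0.0071 m, which gives a scale for judging whether discretization could plausibly affect planning metrics reported in meters.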

Circularity Check

0 steps flagged

No circularity; empirical model with public-benchmark results

full rationale

The paper describes an end-to-end trained multimodal LLM (Gemini backbone plus task prompts) that outputs trajectories and detections as tokenized text. All reported numbers are standard benchmark metrics on nuScenes, WOMD, and WOD; no equations, fitted parameters, or self-referential predictions are present. The text-representation choice is an architectural decision whose information-loss consequences are not measured in the provided text, but that absence does not create a circular derivation. No self-citation chains or uniqueness theorems are invoked to justify the core claims.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The approach rests on the pre-trained weights of an existing multimodal LLM and on standard supervised fine-tuning; no new physical axioms or invented entities are introduced.

free parameters (1)
  • task-specific prompt templates
    Prompt wording is chosen by the authors and directly affects output quality.

pith-pipeline@v0.9.0 · 5586 in / 1140 out tokens · 42901 ms · 2026-05-15T05:03:42.087168+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · contradicts

    CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

    representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text

  • Foundation.LedgerForcing conservation_from_balance · unclear

    Relation between the paper passage and the cited Recognition theorem.

    This approach allows EMMA to jointly process various driving tasks in a unified language space

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

    cs.RO 2026-04 conditional novelty 8.0

    V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baselin...

  2. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  3. Learning Vision-Language-Action World Models for Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 7.0

    VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

  4. ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    cs.CV 2025-06 unverdicted novelty 7.0

    ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.

  5. MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MAPLE performs closed-loop multi-agent training of VLA driving models entirely in latent space using supervised fine-tuning followed by RL with safety, progress, and diversity rewards, reaching SOTA on Bench2Drive.

  6. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

  7. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  8. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  9. Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction

    cs.CV 2026-05 unverdicted novelty 6.0

    Smaller end-to-end autonomous driving models achieve optimal 3-second trajectory prediction accuracy at lower or intermediate temporal sampling frequencies, whereas larger VLA-style models perform best at the highest ...

  10. Beyond Thinking: Imagining in 360° for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  11. EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting

    cs.CV 2026-05 unverdicted novelty 6.0

    EggHand unifies VLA action decoding with viewpoint-aware video-text encoding to forecast egocentric hand poses, achieving SOTA accuracy on EgoExo4D while remaining robust to ego-motion and controllable via language prompts.

  12. VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.

  13. OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models

    cs.CV 2026-04 unverdicted novelty 6.0

    OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.

  14. Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.

  15. ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.

  16. DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...

  17. CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

    cs.CV 2026-03 unverdicted novelty 6.0

    CausalVAD applies sparse causal intervention to remove spurious correlations from end-to-end autonomous driving models, reporting state-of-the-art planning accuracy and robustness on nuScenes.

  18. AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

    cs.CV 2025-06 unverdicted novelty 6.0

    AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...

  19. Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling

    cs.RO 2026-05 unverdicted novelty 5.0

    CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and Success Rate 71.81 on Bench2Drive plus PDMS 91.1 on NAVSIM.

  20. C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 5.0

    C-CoT applies VLMs to autonomous driving via five-stage reasoning with a meta-action tree for counterfactuals, yielding 81.9% risk recall, 3.52% collision rate, and 1.98 m L2 error on a new dataset.

  21. VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 5.0

    VLADriver-RAG achieves state-of-the-art performance on Bench2Drive by grounding VLA planning in structure-aware retrieved priors via spatiotemporal semantic graphs and Graph-DTW alignment.

  22. SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

    cs.CV 2026-04 unverdicted novelty 5.0

    SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.

  23. DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 4.0

    DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2Drive benchmark.

  24. Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms

    cs.RO 2026-04 accept novelty 4.0

    A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.

  25. XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

    cs.CV 2026-04 unverdicted novelty 4.0

    XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...

Reference graph

Works this paper leans on

205 extracted references · 205 canonical work pages · cited by 22 Pith papers · 10 internal anchors

  1. [2]

    Video-language critic: Transferable reward functions for language-conditioned robotics

    Minttu Alakuijala, Reginald McLean, Isaac Woungang, Nariman Farsad, Samuel Kaski, Pekka Marttinen, and Kai Yuan. Video-language critic: Transferable reward functions for language-conditioned robotics. Transactions on Machine Learning Research, 2024

  2. [3]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022

  3. [5]

    Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst

    Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. RSS, 2019

  4. [6]

    Look, remember and reason: Grounded reasoning in videos with language models

    Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, and Roland Memisevic. Look, remember and reason: Grounded reasoning in videos with language models. In ICRA, 2023

  5. [8]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023

  6. [9]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020

  7. [10]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020

  8. [11]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020

  9. [12]

    Gri: General reinforced imitation and its application to vision-based autonomous driving

    Raphael Chekroun, Marin Toromanoff, Sascha Hornauer, and Fabien Moutarde. Gri: General reinforced imitation and its application to vision-based autonomous driving. Robotics, 2023

  10. [13]

    Learning by cheating

    Dian Chen, Brady Zhou, Vladlen Koltun, and Philipp Krähenbühl. Learning by cheating. In CoRL, 2020

  11. [14]

    Learning to drive from a world on rails

    Dian Chen, Vladlen Koltun, and Philipp Krähenbühl. Learning to drive from a world on rails. In ICCV, 2021

  12. [15]

    Womd-lidar: Raw sensor dataset benchmark for motion forecasting

    Kan Chen, Runzhou Ge, Hang Qiu, Rami Al-Rfou, Charles R Qi, Xuanyu Zhou, Zoey Yang, Scott Ettinger, Pei Sun, Zhaoqi Leng, et al. Womd-lidar: Raw sensor dataset benchmark for motion forecasting. In ICRA, 2024a

  13. [16]

    Driving with llms: Fusing object-level vector modality for explainable autonomous driving

    Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. Driving with llms: Fusing object-level vector modality for explainable autonomous driving. In ICRA, 2024b

  14. [18]

    Pix2seq: A language modeling framework for object detection

    Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. In ICLR, 2022a

  15. [19]

    A unified sequence interface for vision tasks

    Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J Fleet, and Geoffrey E Hinton. A unified sequence interface for vision tasks. In NeurIPS, 2022b

  16. [20]

    PaLI: A Jointly-Scaled Multilingual Language-Image Model

    Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carl...

  17. [21]

    Pali-x: On scaling up a multilingual vision and language model

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. In CVPR, 2024d

  18. [22]

    Transfuser: Imitation with transformer-based sensor fusion for autonomous driving

    Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. PAMI, 2022

  19. [23]

    Unifying vision-and-language tasks via text generation

    Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In ICML, 2021

  20. [24]

    Palm: Scaling language modeling with pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. JMLR, 2023

  21. [25]

    End-to-end driving via conditional imitation learning

    Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In ICRA, 2018

  22. [26]

    Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking

    Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. NeurIPS, 2024

  23. [27]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019

  24. [28]

    Pivotnet: Vectorized pivot learning for end-to-end hd map construction

    Wenjie Ding, Limeng Qiao, Xi Qiu, and Chi Zhang. Pivotnet: Vectorized pivot learning for end-to-end hd map construction. In ICCV, 2023

  25. [29]

    Long-term recurrent convolutional networks for visual recognition and description

    Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015

  26. [32]

    Open-vocabulary object detection via vision and language knowledge distillation

    Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022

  27. [33]

    Training compute-optimal large language models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In NeurIPS, 2022

  28. [34]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In CVPR, 2023

  29. [35]

    Language is not all you need: Aligning perception with language models

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, et al. Language is not all you need: Aligning perception with language models. In NeurIPS, 2023

  30. [36]

    Let-3d-ap: Longitudinal error tolerant 3d average precision for camera-only 3d detection

    Wei-Chih Hung, Vincent Casser, Henrik Kretzschmar, Jyh-Jing Hwang, and Dragomir Anguelov. Let-3d-ap: Longitudinal error tolerant 3d average precision for camera-only 3d detection. In ICRA, 2024

  31. [37]

    Cramnet: Camera-radar fusion with ray-constrained cross-attention for robust 3d object detection

    Jyh-Jing Hwang, Henrik Kretzschmar, Joshua Manela, Sean Rafferty, Nicholas Armstrong-Crews, Tiffany Chen, and Dragomir Anguelov. Cramnet: Camera-radar fusion with ray-constrained cross-attention for robust 3d object detection. In ECCV, 2022

  32. [38]

    Vad: Vectorized scene representation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In ICCV, 2023

  33. [39]

    Learning to drive in a day

    Alex Kendall, Jeffrey Hawke, David Janz, Przemyslaw Mazur, Daniele Reda, John-Mark Allen, Vinh-Dieu Lam, Alex Bewley, and Amar Shah. Learning to drive in a day. In ICRA, 2019

  34. [40]

    Pointpillars: Fast encoders for object detection from point clouds

    Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 2019

  35. [41]

    Sara-rt: Scaling up robotics transformers with self-adaptive robust attention

    Isabel Leal, Krzysztof Choromanski, Deepali Jain, Avinava Dubey, Jake Varley, Michael Ryoo, Yao Lu, Frederick Liu, Vikas Sindhwani, Quan Vuong, et al. Sara-rt: Scaling up robotics transformers with self-adaptive robust attention. In ICRA, 2024

  36. [42]

    Hdmapnet: An online hd map construction and evaluation framework

    Qi Li, Yue Wang, Yilun Wang, and Hang Zhao. Hdmapnet: An online hd map construction and evaluation framework. In ICRA, 2022a

  37. [43]

    Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022b

  38. [44]

    Is ego status all you need for open-loop end-to-end autonomous driving?

    Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In CVPR, 2024

  39. [45]

    Cirl: Controllable imitative reinforcement learning for vision-based self-driving

    Xiaodan Liang, Tairui Wang, Luona Yang, and Eric Xing. Cirl: Controllable imitative reinforcement learning for vision-based self-driving. In ECCV, 2018

  40. [46]

    Maptr: Structured modeling and learning for online vectorized hd map construction

    Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: Structured modeling and learning for online vectorized hd map construction. In ICLR, 2023

  41. [47]

    Lane graph as path: Continuity-preserving path-wise modeling for online lane graph construction

    Bencheng Liao, Shaoyu Chen, Bo Jiang, Tianheng Cheng, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Lane graph as path: Continuity-preserving path-wise modeling for online lane graph construction. In ECCV, 2024a

  42. [48]

    Maptrv2: An end-to-end framework for online vectorized hd map construction

    Bencheng Liao, Shaoyu Chen, Yunchi Zhang, Bo Jiang, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Maptrv2: An end-to-end framework for online vectorized hd map construction. IJCV, 2024b

  43. [49]

    Titrated: Learned human driving behavior without infractions via amortized inference

    Vasileios Lioutas, Adam Scibior, and Frank Wood. Titrated: Learned human driving behavior without infractions via amortized inference. Transactions on Machine Learning Research, 2022

  44. [50]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2024a

  45. [51]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In ECCV, 2024b

  46. [52]

    Vectormapnet: End-to-end vectorized hd map learning

    Yicheng Liu, Tianyuan Yuan, Yue Wang, Yilun Wang, and Hang Zhao. Vectormapnet: End-to-end vectorized hd map learning. In ICML, 2023

  47. [53]

    Unified-io: A unified model for vision, language, and multi-modal tasks

    Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. In ICLR, 2022

  48. [54]

    Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action

    Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In CVPR, 2024

  49. [55]

    Wayformer: Motion forecasting via simple & efficient attention networks

    Nigamaa Nayakanti, Rami Al-Rfou, Aurick Zhou, Kratarth Goel, Khaled S Refaat, and Benjamin Sapp. Wayformer: Motion forecasting via simple & efficient attention networks. In ICRA, 2023

  50. [56]

    Vlp: Vision language planning for autonomous driving

    Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, and Liu Ren. Vlp: Vision language planning for autonomous driving. In CVPR, 2024

  51. [57]

    Kosmos-2: Grounding multimodal large language models to the world

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. In ICLR, 2024

  52. [58]

    Alvinn: An autonomous land vehicle in a neural network

    Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In NeurIPS, 1988

  53. [59]

    Multi-modal fusion transformer for end-to-end autonomous driving

    Aditya Prakash, Kashyap Chitta, and Andreas Geiger. Multi-modal fusion transformer for end-to-end autonomous driving. In CVPR, 2021

  54. [60]

    End-to-end vectorized hd-map construction with piecewise bezier curve

    Limeng Qiao, Wenjie Ding, Xi Qiu, and Chi Zhang. End-to-end vectorized hd-map construction with piecewise bezier curve. In CVPR, 2023

  55. [61]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. OpenAI blog, 2018

  56. [62]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019

  57. [63]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020

  58. [65]

    Motionlm: Multi-agent motion forecasting as language modeling

    Ari Seff, Brian Cera, Dian Chen, Mason Ng, Aurick Zhou, Nigamaa Nayakanti, Khaled S Refaat, Rami Al-Rfou, and Benjamin Sapp. Motionlm: Multi-agent motion forecasting as language modeling. In ICCV, 2023

  59. [66]

    Lmdrive: Closed-loop end-to-end driving with large language models

    Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. In CVPR, 2024

  60. [67]

    Mtr++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying

    Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Mtr++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying. PAMI, 2024

  61. [68]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In ECCV, 2024

  62. [69]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020

  63. [70]

    Swformer: Sparse window transformer for 3d object detection in point clouds

    Pei Sun, Mingxing Tan, Weiyue Wang, Chenxi Liu, Fei Xia, Zhaoqi Leng, and Dragomir Anguelov. Swformer: Sparse window transformer for 3d object detection in point clouds. In ECCV, 2022

  64. [71]

    Beyond text: Utilizing vocal cues to improve decision making in llms for robot navigation tasks

    Xingpeng Sun, Haoming Meng, Souradip Chakraborty, Amrit Singh Bedi, and Aniket Bera. Beyond text: Utilizing vocal cues to improve decision making in llms for robot navigation tasks. Transactions on Machine Learning Research, 2024

  65. [72]

    Block-nerf: Scalable large scene neural view synthesis

    Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis. In CVPR, 2022

  66. [73]

    Motion planning for autonomous driving: The state of the art and future perspectives

    Siyu Teng, Xuemin Hu, Peng Deng, Bai Li, Yuchen Li, Yunfeng Ai, Dongsheng Yang, Lingxi Li, Zhe Xuanyuan, Fenghua Zhu, et al. Motion planning for autonomous driving: The state of the art and future perspectives. T-IV, 2023

  67. [74]

    Drivevlm: The convergence of autonomous driving and large vision-language models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. In CoRL, 2024

  68. [75]

    End-to-end model-free reinforcement learning for urban driving using implicit affordances

    Marin Toromanoff, Emilie Wirbel, and Fabien Moutarde. End-to-end model-free reinforcement learning for urban driving using implicit affordances. In CVPR, 2020

  69. [78]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017

  70. [79]

    Show and tell: A neural image caption generator

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015

  71. [81]

    Goplan: Goal-conditioned offline reinforcement learning by planning with learned models

    Mianchu Wang, Rui Yang, Xi Chen, Hao Sun, Meng Fang, and Giovanni Montana. Goplan: Goal-conditioned offline reinforcement learning by planning with learned models. Transactions on Machine Learning Research, 2023

  72. [82]

    Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

    Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, 2022

  73. [84]

    Fcos3d: Fully convolutional one-stage monocular 3d object detection

    Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In ICCV, 2021

  74. [86]

    Drive anywhere: Generalizable end-to-end autonomous driving with multi-modal foundation models

    Tsun-Hsuan Wang, Alaa Maalouf, Wei Xiao, Yutong Ban, Alexander Amini, Guy Rosman, Sertac Karaman, and Daniela Rus. Drive anywhere: Generalizable end-to-end autonomous driving with multi-modal foundation models. In ICRA, 2024c

  75. [87]

    Visionllm: Large language model is also an open-ended decoder for vision-centric tasks

    Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. In NeurIPS, 2024d

  76. [88]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022

  77. [89]

    Para-drive: Parallelized architecture for real-time autonomous driving

    Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. In CVPR, 2024

  78. [90]

    Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline

    Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. In NeurIPS, 2022

  79. [91]

    Drivegpt4: Interpretable end-to-end autonomous driving via large language model

    Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. RA-L, 2024

  80. [93]

    Coca: Contrastive captioners are image-text foundation models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. TMLR, 2022

Showing first 80 references.