DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
Pith reviewed 2026-05-12 19:18 UTC · model grok-4.3
The pith
Vision-language models integrated with traditional pipelines enable autonomous vehicles to handle complex urban scenarios more effectively.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DriveVLM uses Vision-Language Models to perform scene description, scene analysis, and hierarchical planning, creating an end-to-end reasoning chain for autonomous driving. DriveVLM-Dual merges this VLM component with the standard pipeline to compensate for spatial reasoning gaps and latency. Experiments on nuScenes and SUP-AD datasets plus on-vehicle deployment confirm that the combined system manages challenging and unpredictable driving situations more reliably than either approach alone.
What carries the argument
DriveVLM, consisting of VLM-based modules for scene description, analysis, and hierarchical planning, together with the hybrid DriveVLM-Dual that fuses VLM outputs into a conventional autonomous driving pipeline.
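The data flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every class and function name is hypothetical, the VLM calls are replaced by canned placeholders so the sketch runs, and linear interpolation stands in for the conventional planner, whose actual refinement method is not described on this page.

```python
from dataclasses import dataclass
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (x, y) in metres; the frame convention is assumed


@dataclass
class SceneDescription:
    environment: str             # e.g. weather, road type
    critical_objects: List[str]  # objects the planner should care about


@dataclass
class SceneAnalysis:
    notes: List[str]             # one note per critical object


@dataclass
class CoarsePlan:
    decision: str                # high-level decision in words
    waypoints: List[Waypoint]    # sparse trajectory


def describe_scene(frames: List[str]) -> SceneDescription:
    # Placeholder for a VLM call; returns canned output so the sketch runs.
    return SceneDescription("urban, daytime", ["cyclist ahead"])


def analyze_scene(desc: SceneDescription) -> SceneAnalysis:
    # Placeholder: a real system would reason about each object's influence.
    return SceneAnalysis([f"{obj}: may affect ego path" for obj in desc.critical_objects])


def plan_hierarchically(analysis: SceneAnalysis) -> CoarsePlan:
    # Placeholder: a decision in words first, then sparse waypoints.
    return CoarsePlan("slow down and keep lane", [(0.0, 0.0), (4.0, 0.0), (8.0, 0.0)])


def refine(plan: CoarsePlan, steps_per_segment: int = 2) -> List[Waypoint]:
    # Stand-in for the conventional pipeline: linear interpolation between
    # the sparse waypoints produced by the language-based planner.
    dense: List[Waypoint] = []
    for (x0, y0), (x1, y1) in zip(plan.waypoints, plan.waypoints[1:]):
        for k in range(steps_per_segment):
            t = k / steps_per_segment
            dense.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    dense.append(plan.waypoints[-1])
    return dense


def drive_step(frames: List[str]) -> List[Waypoint]:
    # Description -> analysis -> hierarchical plan -> classical refinement.
    return refine(plan_hierarchically(analyze_scene(describe_scene(frames))))
```

Keeping each stage behind its own function mirrors the modular structure the review describes, so the language component could in principle be swapped without touching the refinement step.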
If this is right
- Traditional autonomous driving stacks gain access to language-based reasoning for edge cases without discarding their geometric strengths.
- Hierarchical planning generated from scene descriptions can produce more adaptable trajectories in urban environments.
- Real-world deployment on production hardware shows the hybrid approach meets latency and safety requirements for actual roads.
- The combination reduces reliance on hand-crafted rules for every possible rare scenario.
Where Pith is reading between the lines
- Future autonomous systems could use the same hybrid pattern to incorporate newer vision-language models as they improve, updating only the language component rather than retraining the entire stack.
- If the VLM modules prove reliable for description, they might support more natural human-machine interfaces, such as explaining why the vehicle chose a particular action.
- The approach could extend to other embodied AI tasks like robotics where spatial precision and high-level reasoning must coexist.
Load-bearing premise
Adding VLM reasoning for description and planning will meaningfully improve results on rare or complex driving cases without introducing delays or spatial mistakes that the hybrid architecture cannot correct.
What would settle it
A controlled comparison on the same nuScenes or SUP-AD test cases showing no reduction in collision or planning failure rates for DriveVLM-Dual versus the baseline traditional pipeline, or a real-vehicle test where the hybrid system produces unsafe maneuvers in long-tail scenes.
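One way such a controlled comparison could be scored is a paired test on per-scenario failure flags. The sketch below is an assumption about how a reader might run the check, not something the paper reports: the function name and inputs are hypothetical, and the exact McNemar test is a choice made here because both systems are evaluated on the same test cases.

```python
from math import comb
from typing import List, Tuple


def paired_failure_comparison(
    baseline_failed: List[bool], hybrid_failed: List[bool]
) -> Tuple[float, float, float]:
    """Return (baseline failure rate, hybrid failure rate, two-sided exact McNemar p-value).

    Each list holds one flag per test scenario, in the same order for both systems.
    """
    if len(baseline_failed) != len(hybrid_failed) or not baseline_failed:
        raise ValueError("need two non-empty lists of equal length")
    n = len(baseline_failed)
    pairs = list(zip(baseline_failed, hybrid_failed))
    only_baseline = sum(1 for b, h in pairs if b and not h)  # baseline fails, hybrid succeeds
    only_hybrid = sum(1 for b, h in pairs if h and not b)    # hybrid fails, baseline succeeds
    discordant = only_baseline + only_hybrid
    if discordant == 0:
        p_value = 1.0  # the systems never disagree, so there is no evidence of a difference
    else:
        k = min(only_baseline, only_hybrid)
        tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
        p_value = min(1.0, 2 * tail)
    return sum(baseline_failed) / n, sum(hybrid_failed) / n, p_value
```

A result in which the two failure rates are close and the p-value is large would correspond to the "no reduction" outcome described above.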
Original abstract
A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DriveVLM, a system using Vision-Language Models for scene description, analysis, and hierarchical planning in autonomous driving to handle complex long-tail urban scenarios. It proposes DriveVLM-Dual as a hybrid that combines this with traditional pipelines to address VLM weaknesses in spatial reasoning and compute. Experiments are reported on nuScenes and the authors' SUP-AD dataset, with a final real-vehicle deployment of DriveVLM-Dual claimed to verify real-world effectiveness.
Significance. If the integration benefits can be rigorously quantified, the hybrid VLM-traditional approach could meaningfully extend autonomous driving robustness to long-tail cases while respecting safety and latency constraints. The real-vehicle deployment is a notable strength for a vision-language model paper. However, the absence of detailed metrics and controls currently limits the ability to assess whether the claimed convergence delivers net gains over existing pipelines.
Major comments (3)
- §4 (Experiments): No quantitative metrics (e.g., planning success rate, collision rate, latency, or long-tail subset performance) or baseline comparisons are provided for DriveVLM or DriveVLM-Dual on nuScenes or SUP-AD. The central efficacy claim therefore rests on qualitative assertions rather than verifiable results.
- §4.2 (DriveVLM-Dual description and evaluation): The hybrid system retains the traditional pipeline; without ablations that disable the VLM scene-description and planning modules while keeping the rest fixed, it is impossible to attribute any robustness gains to the VLM components rather than the retained stack. This directly undermines the claim that the integration improves handling of complex conditions.
- §5 (Real-world deployment): The production-vehicle deployment is presented without reported metrics on spatial-reasoning errors, latency, or mitigation strategies for VLM limitations, nor any comparison to the non-VLM baseline under the same conditions. This leaves the verification of real-world effectiveness unquantified.
Minor comments (2)
- Abstract and §3: The SUP-AD dataset is referenced without a description of its size, collection protocol, or annotation process; this should be added for reproducibility.
- Figure captions and §4: Several figures appear to show qualitative examples; quantitative error bars or statistical significance tests on any reported trends would strengthen the presentation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where additional quantitative evidence and controls would strengthen the claims regarding the benefits of integrating VLMs into autonomous driving pipelines. We have revised the manuscript to incorporate these elements.
Point-by-point responses
- Referee: §4 (Experiments): No quantitative metrics (e.g., planning success rate, collision rate, latency, or long-tail subset performance) or baseline comparisons are provided for DriveVLM or DriveVLM-Dual on nuScenes or SUP-AD. The central efficacy claim therefore rests on qualitative assertions rather than verifiable results.
  Authors: We agree that the original presentation relied heavily on qualitative case studies. In the revised manuscript we have added a new quantitative evaluation section that reports planning success rates, collision rates, average latency, and performance on long-tail scenario subsets for both DriveVLM and DriveVLM-Dual. Direct comparisons against the traditional autonomous driving pipeline and other VLM-based planners are now included on both the nuScenes and SUP-AD datasets.
  Revision: yes
- Referee: §4.2 (DriveVLM-Dual description and evaluation): The hybrid system retains the traditional pipeline; without ablations that disable the VLM scene-description and planning modules while keeping the rest fixed, it is impossible to attribute any robustness gains to the VLM components rather than the retained stack. This directly undermines the claim that the integration improves handling of complex conditions.
  Authors: This observation is accurate. We have added ablation experiments in the revised version that systematically disable the VLM scene-description and hierarchical-planning modules while keeping the traditional perception and control stack unchanged. The results quantify the incremental improvement in robustness for complex urban scenarios attributable to the VLM components.
  Revision: yes
- Referee: §5 (Real-world deployment): The production-vehicle deployment is presented without reported metrics on spatial-reasoning errors, latency, or mitigation strategies for VLM limitations, nor any comparison to the non-VLM baseline under the same conditions. This leaves the verification of real-world effectiveness unquantified.
  Authors: We acknowledge the limitation in the original deployment description. The revised manuscript now reports quantitative metrics collected during the production-vehicle tests, including measured latency, observed spatial-reasoning error rates, and explicit mitigation strategies (e.g., fallback logic to the traditional pipeline). Where feasible, side-by-side comparisons with the non-VLM baseline under matched conditions are also provided.
  Revision: yes
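The "fallback logic to the traditional pipeline" mentioned in the last response is not specified on this page. A minimal sketch of what such a gate might look like is given below; the function name, the latency budget, and the deviation threshold are all hypothetical values chosen for illustration, not figures from the paper.

```python
from math import hypot
from typing import List, Optional, Tuple

Waypoint = Tuple[float, float]  # (x, y) in metres; frame convention assumed


def select_trajectory(
    vlm_traj: Optional[List[Waypoint]],
    vlm_latency_s: float,
    classical_traj: List[Waypoint],
    max_latency_s: float = 0.5,    # hypothetical latency budget
    max_deviation_m: float = 3.0,  # hypothetical spatial sanity bound
) -> Tuple[str, List[Waypoint]]:
    """Pick the VLM trajectory only if it is timely and spatially plausible."""
    # Missing or late VLM output: use the traditional planner.
    if vlm_traj is None or vlm_latency_s > max_latency_s:
        return "classical", classical_traj
    # Trajectories that cannot be compared point by point are rejected.
    if len(vlm_traj) != len(classical_traj):
        return "classical", classical_traj
    # Largest point-wise gap between the two trajectories.
    worst = max(
        (hypot(vx - cx, vy - cy) for (vx, vy), (cx, cy) in zip(vlm_traj, classical_traj)),
        default=0.0,
    )
    if worst > max_deviation_m:
        return "classical", classical_traj
    return "vlm", vlm_traj
```

Logging which branch is taken in each cycle would also yield the fallback frequency, one of the deployment metrics the referee asked for.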
Circularity Check
No circularity: empirical claims rest on dataset evaluations and deployment, not self-referential definitions or fitted inputs
Full rationale
The paper presents DriveVLM and DriveVLM-Dual as architectural integrations of VLMs with planning modules, evaluated via experiments on nuScenes and SUP-AD plus real-vehicle deployment. No derivation chain, equations, or first-principles results are claimed; performance assertions are supported by external benchmarks rather than reducing to fitted parameters renamed as predictions or self-citations that bear the central load. The hybrid design is motivated by stated VLM limitations (spatial reasoning, compute) without smuggling ansatzes or uniqueness theorems from prior self-work. This is a standard empirical systems paper whose claims can be falsified by the reported metrics.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: VLMs possess sufficient spatial reasoning when augmented with traditional pipelines
Invented entities (2)
- DriveVLM: no independent evidence
- DriveVLM-Dual: no independent evidence
Forward citations
Cited by 36 Pith papers
- Membership Inference Attacks on Vision-Language-Action Models
  Vision-language-action models are highly vulnerable to membership inference attacks, including practical black-box versions that exploit generated actions and motion trajectories.
- Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
  MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.
- V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views
  V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baselin...
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
- VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
  VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching 88.91 driving score on Bench2Drive.
- Hyperbolic Concept Bottleneck Models
  HypCBM reformulates concept activations as geometric containment in hyperbolic space to produce sparse, hierarchy-aware signals that match Euclidean models trained on 20 times more data.
- Hyperbolic Concept Bottleneck Models
  Hyperbolic Concept Bottleneck Models reformulate concept activations as test-time geometric containment in hyperbolic entailment cones to produce sparse, hierarchy-aware signals without extra supervision.
- Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning
  A language refinement framework with geometry-aware preference optimization lets VLMs generate more traversable 3D trajectories for off-road vehicles, yielding modest gains in error, traversability compliance, and ele...
- Learning Vision-Language-Action World Models for Autonomous Driving
  VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
- Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
  Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
- RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation
  RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.
- ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
  ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.
- MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving
  MAPLE performs closed-loop multi-agent training of VLA driving models entirely in latent space using supervised fine-tuning followed by RL with safety, progress, and diversity rewards, reaching SOTA on Bench2Drive.
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
- SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
  SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
- Quantifying the human visual exposome with vision language models
  Vision language models applied to daily-life photos quantify visual environmental features that correlate with momentary affect and chronic stress, establishing a paradigm for visual exposomics.
- Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
  MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
- VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
  VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.
- VLM-VPI: A Vision-Language Reasoning Framework for Improving Automated Vehicle-Pedestrian Interactions
  VLM-VPI uses Qwen3-VL and GPT-OSS models for pedestrian intent and age reasoning plus a tiered safety controller, reporting 92.3% intent accuracy in CARLA and reduced conflicts versus rule-based and supervised baselines.
- EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving
  EgoDyn-Bench reveals a perception bottleneck in vision-centric foundation models: ego-motion logic derives from language while visual input adds negligible signal, with explicit trajectories restoring consistency.
- If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems
  LVLM-based agents exhibit trust boundary confusion with visual injections and a multi-agent defense separating perception from decision-making reduces misleading responses while preserving correct ones.
- OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
  OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
- LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving
  LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.
- Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
  Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.
- ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving
  ICR-Drive reveals substantial performance drops in end-to-end language-driven driving models when instructions are paraphrased, made ambiguous, noised, or misleading.
- AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
  AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...
- EponaV2: Driving World Model with Comprehensive Future Reasoning
  EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
- Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation
  Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
- VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
  VLADriver-RAG achieves state-of-the-art performance on Bench2Drive by grounding VLA planning in structure-aware retrieved priors via spatiotemporal semantic graphs and Graph-DTW alignment.
- SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
  SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
- DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
  DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
- Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
  A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.
- XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
  XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...
- Seed1.5-VL Technical Report
  Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
Reference graph
Works this paper leans on
-
[1]
I. Barabas, A. Todorut ¸, N. Cordos ¸, and A. Molea. Current challenges in autonomous driving. In IOP conference series: materials science and engineering , volume 252, page 012096. IOP Publishing, 2017
work page 2017
-
[2]
C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017
work page 2017
-
[3]
A. H. Lang, S. V ora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom. Pointpillars: Fast en- coders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 12697–12705, 2019
work page 2019
-
[4]
Y . Wang, V . C. Guizilini, T. Zhang, Y . Wang, H. Zhao, and J. Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning , pages 180–191. PMLR, 2022
work page 2022
-
[5]
Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y . Qiao, and J. Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In European conference on computer vision, pages 1–18. Springer, 2022
work page 2022
-
[6]
J. Gao, C. Sun, H. Zhao, Y . Shen, D. Anguelov, C. Li, and C. Schmid. Vectornet: Encoding hd maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , June 2020
work page 2020
-
[7]
H. Zhao, J. Gao, T. Lan, C. Sun, B. Sapp, B. Varadarajan, Y . Shen, Y . Shen, Y . Chai, C. Schmid, C. Li, and D. Anguelov. Tnt: Target-driven trajectory prediction. In J. Kober, F. Ramos, and C. Tomlin, editors, Proceedings of the 2020 Conference on Robot Learning , volume 155 of Proceedings of Machine Learning Research, pages 895–904. PMLR, 16–18 Nov 2021
work page 2020
-
[8]
Y . Liu, J. Zhang, L. Fang, Q. Jiang, and B. Zhou. Multimodal motion prediction with stacked transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7577–7586, 2021
work page 2021
-
[9]
J. Gu, C. Sun, and H. Zhao. Densetnt: End-to-end trajectory prediction from dense goal sets. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 15303– 15312, 2021
work page 2021
-
[10]
N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp. Wayformer: Motion forecasting via simple & efficient attention networks. In 2023 IEEE International Conference on Robotics and Automation (ICRA) , pages 2980–2987. IEEE, 2023. 12
work page 2023
-
[11]
D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Advances in neural information processing systems, 1, 1988
work page 1988
-
[12]
ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst
M. Bansal, A. Krizhevsky, and A. Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018
work page Pith review arXiv 2018
- [13]
- [14]
- [15]
-
[16]
D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning, 2023
work page 2023
-
[18]
J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[19]
W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023
work page 2023
-
[20]
P. Zhang, X. Dong, B. Wang, Y . Cao, C. Xu, L. Ouyang, Z. Zhao, S. Ding, S. Zhang, H. Duan, W. Zhang, H. Yan, X. Zhang, W. Li, J. Li, K. Chen, C. He, X. Zhang, Y . Qiao, D. Lin, and J. Wang. Internlm-xcomposer: A vision-language large model for advanced text-image com- prehension and composition, 2023
work page 2023
- [21]
-
[22]
D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y . Chebotar, P. Sermanet, D. Duckworth, S. Levine, V . Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. Palm-e: An embodied multimodal language model, 2023
work page 2023
-
[23]
RT-1: Robotics Transformer for Real-World Control at Scale
A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[24]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y . Lu, H. Michalewski, I. Mordatch, K. Pe...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
W. Huang, C. Wang, R. Zhang, Y . Li, J. Wu, and L. Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023. 13
work page internal anchor Pith review arXiv 2023
-
[26]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan, et al. Open x-embodiment: Robotic learning datasets and rt-x mod- els. arXiv preprint arXiv:2310.08864, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
R. Chekroun, M. Toromanoff, S. Hornauer, and F. Moutarde. Gri: General reinforced imitation and its application to vision-based autonomous driving. Robotics, 12(5):127, 2023
work page 2023
-
[28]
D. Chen, V . Koltun, and P. Kr¨ahenb¨uhl. Learning to drive from a world on rails. InProceedings of the IEEE/CVF International Conference on Computer Vision , pages 15590–15599, 2021
work page 2021
-
[29]
M. Toromanoff, E. Wirbel, and F. Moutarde. End-to-end model-free reinforcement learning for urban driving using implicit affordances. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 7153–7162, 2020
work page 2020
-
[30]
W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun. End-to-end interpretable neural motion planner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8660–8669, 2019
work page 2019
-
[31]
B. Wei, M. Ren, W. Zeng, M. Liang, B. Yang, and R. Urtasun. Perceive, attend, and drive: Learning spatial attention for safe self-driving. In 2021 IEEE International Conference on Robotics and Automation (ICRA) , pages 4875–4881. IEEE, 2021
work page 2021
-
[32]
P. Hu, A. Huang, J. Dolan, D. Held, and D. Ramanan. Safe local motion planning with self- supervised freespace forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12732–12741, 2021
work page 2021
- [33]
-
[34]
Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 17853–17862, 2023
work page 2023
- [35]
- [36]
-
[37]
A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [38]
-
[39]
D. Wu, W. Han, T. Wang, X. Dong, X. Zhang, and J. Shen. Referring multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14633–14642, 2023
work page 2023
- [40]
-
[41]
Talk2car: Taking control of your self-driving car.arXiv preprint arXiv:1909.10838,
T. Deruyttere, S. Vandenhende, D. Grujicic, L. Van Gool, and M.-F. Moens. Talk2car: Taking control of your self-driving car. arXiv preprint arXiv:1909.10838, 2019
- [42]
- [43]
-
[44]
H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 11621–11631, 2020
work page 2020
-
[45]
J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata. Textual explanations for self-driving vehicles. In Proceedings of the European conference on computer vision (ECCV) , pages 563– 578, 2018
work page 2018
-
[46]
Y . Xu, X. Yang, L. Gong, H.-C. Lin, T.-Y . Wu, Y . Li, and N. Vasconcelos. Explainable object- induced action decision for autonomous vehicles. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9523–9532, 2020
work page 2020
-
[47]
E. Sachdeva, N. Agarwal, S. Chundi, S. Roelofs, J. Li, B. Dariush, C. Choi, and M. Kochen- derfer. Rank2tell: A multimodal driving dataset for joint importance ranking and reasoning. arXiv preprint arXiv:2309.06597, 2023
- [48]
-
[49]
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models.Advances in Neural Information Processing Systems, 35:24824–24837, 2022
work page 2022
-
[50]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De- hghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transform- ers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
- [51]
- [52]
-
[53]
T. Khurana, P. Hu, A. Dave, J. Ziglar, D. Held, and D. Ramanan. Differentiable raycasting for self-supervised occupancy forecasting. In European Conference on Computer Vision , pages 353–369. Springer, 2022
work page 2022
-
[54]
S. Hu, L. Chen, P. Wu, H. Li, J. Yan, and D. Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision , pages 533–549. Springer, 2022
work page 2022
-
[55]
L. Xu, H. Huang, and J. Liu. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9878–9888, 2021
work page 2021
- [56]
-
[57]
X. Yue, Y . Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y . Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024. 15
work page 2024
-
[58]
B. Li, R. Wang, G. Wang, Y . Ge, Y . Ge, and Y . Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023
-
[59]
L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016
- [60]
-
[61]
X.ai. Grok-1.5 vision preview. https://x.ai/blog/grok-1.5v, 2024
- [62]
-
[63]
Q. Team. Introducing qwen1.5, February 2024. URL https://qwenlm.github.io/blog/qwen1.5/
-
[64]
G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024
-
[65]
S. Hu, Y. Tu, X. Han, C. He, G. Cui, X. Long, Z. Zheng, Y. Fang, Y. Huang, W. Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024
-
[66]
M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja, A. Awadallah, H. Awadalla, N. Bach, A. Bahree, A. Bakhtiari, H. Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024
-
[67]
H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024
-
[68]
X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023
-
[69]
J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022
- [70]
-
[71]
J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018
-
[72]
Y. Li, F. Wei, C. Zhang, and H. Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077, 2024
-
[73]
T. Cai, Y. Li, Z. Geng, H. Peng, and T. Dao. Medusa: Simple framework for accelerating llm generation with multiple decoding heads, 2023
A SUP-AD Dataset
A.1 Meta-actions
Meta-action statistics. We use the meta-action sequence to formally represent the driving strategy. Meta-actions are classified into 17 categories. We show the distribution of each ...
-
[74]
Speed-control actions. Discerned from acceleration and braking signals within the ego state data, these actions include speed up, slow down, slow down rapidly, go straight slowly, go straight at a constant speed, stop, wait, and reverse
-
[75]
Turning actions. Deduced from steering wheel signals, these actions consist of turn left, turn right, and turn around
-
[76]
Lane-control actions. Encompassing lane selection decisions, these actions are derived from a combination of steering wheel signals and either map or perception data. They involve change lane to the left, change lane to the right, shift slightly to the left, and shift slightly to the right.
A.2 Scenario Categories
The SUP-AD dataset is comprised of 1,...
-
[77]
Weather: Sunny (environmental conditions)
-
[78]
Time: Day (environmental conditions)
-
[79]
Road Environment: Urban (environmental conditions)
-
[80]
Lane Options: Left Lane, Own Lane, Right Lane (environmental conditions)
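The meta-actions from A.1 and the environmental-condition fields listed above could be held together in a small record type. The following is a minimal sketch, not the dataset's actual schema: the names `MetaAction`, `action_group`, and `SceneRecord` are hypothetical, and the enum contains only the 15 actions named in this excerpt, although the text states that there are 17 categories.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class MetaAction(Enum):
    """Meta-actions named in A.1 (hypothetical enum; may be incomplete)."""
    # Speed-control actions (from acceleration and braking signals)
    SPEED_UP = "speed up"
    SLOW_DOWN = "slow down"
    SLOW_DOWN_RAPIDLY = "slow down rapidly"
    GO_STRAIGHT_SLOWLY = "go straight slowly"
    GO_STRAIGHT_CONSTANT_SPEED = "go straight at a constant speed"
    STOP = "stop"
    WAIT = "wait"
    REVERSE = "reverse"
    # Turning actions (from steering wheel signals)
    TURN_LEFT = "turn left"
    TURN_RIGHT = "turn right"
    TURN_AROUND = "turn around"
    # Lane-control actions (steering plus map or perception data)
    CHANGE_LANE_LEFT = "change lane to the left"
    CHANGE_LANE_RIGHT = "change lane to the right"
    SHIFT_SLIGHTLY_LEFT = "shift slightly to the left"
    SHIFT_SLIGHTLY_RIGHT = "shift slightly to the right"


TURNING = {MetaAction.TURN_LEFT, MetaAction.TURN_RIGHT, MetaAction.TURN_AROUND}
LANE_CONTROL = {
    MetaAction.CHANGE_LANE_LEFT,
    MetaAction.CHANGE_LANE_RIGHT,
    MetaAction.SHIFT_SLIGHTLY_LEFT,
    MetaAction.SHIFT_SLIGHTLY_RIGHT,
}


def action_group(action: MetaAction) -> str:
    """Return the A.1 group an action belongs to."""
    if action in TURNING:
        return "turning"
    if action in LANE_CONTROL:
        return "lane-control"
    return "speed-control"


@dataclass
class SceneRecord:
    """Hypothetical container for one annotated scene."""
    weather: str
    time: str
    road_environment: str
    lane_options: List[str]
    meta_actions: List[MetaAction] = field(default_factory=list)


# Field values are taken from the list above; the action sequence is made up.
example = SceneRecord(
    weather="Sunny",
    time="Day",
    road_environment="Urban",
    lane_options=["Left Lane", "Own Lane", "Right Lane"],
    meta_actions=[MetaAction.SLOW_DOWN, MetaAction.CHANGE_LANE_LEFT],
)
groups = [action_group(a) for a in example.meta_actions]
```

Storing the driving strategy as an ordered list of enum members keeps the sequence machine-checkable, while the `.value` strings preserve the wording used in the text.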