LingoQA: Visual Question Answering for Autonomous Driving
We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios, and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 59.6% of the questions compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient to human evaluations, surpassing existing techniques like METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving.
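The 0.95 figure above is a Spearman rank correlation between an automatic metric's scores and human judgments. As a rough illustration of what that statistic measures, here is a minimal sketch in plain Python. The function names (`average_ranks`, `spearman`) and the sample scores are hypothetical and made up for this example. They are not taken from the paper, and this is not the authors' evaluation code.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up example: one automatic score and one human rating per answer.
metric_scores = [0.91, 0.15, 0.78, 0.40, 0.66]
human_ratings = [1.0, 0.0, 0.8, 0.6, 0.4]

rho = spearman(metric_scores, human_ratings)
print(f"Spearman correlation: {rho:.2f}")  # prints "Spearman correlation: 0.90"
```

Because only the rank order matters, a metric can score highly on this measure even when its raw values sit on a different scale from the human ratings.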
Forward citations
Cited by 5 Pith papers
- CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs
  CCTVBench exposes a large gap between standard QA accuracy and contrastive consistency in traffic video reasoning for multimodal LLMs and introduces C-TCD to narrow that gap.
- GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving
  GeoDrive-Bench is a new multimodal benchmark and distillation method for testing and improving VLMs on region-specific traffic-rule reasoning in autonomous driving across six countries.
- Visual Adversarial Attack on Vision-Language Models for Autonomous Driving
  ADvLM is the first visual adversarial attack framework for VLMs in autonomous driving, using semantic-invariant induction via LLM-generated prompt libraries and scenario-associated attention-based enhancement to achie...
- CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving
  CogDriver-Agent with sparse temporal memory and spatiotemporal distillation on CogDriver-Data achieves 22% higher closed-loop Driving Score on Bench2Drive and 21% lower mean L2 error on nuScenes.
- XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
  XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...