LingoQA: Visual Question Answering for Autonomous Driving

Alex Kendall; Alice Karnsund; Ana-Maria Marcu; Benoit Hanotte; Elahe Arani; Jamie Shotton; Jan Hünermann; Long Chen; Oleg Sinavski; Prajwal Chidananda

arXiv: 2312.14115 · v4 · pith:SNNNO3I5new · submitted 2023-12-21 · cs.RO · cs.AI · cs.CV

Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, Elahe Arani, Oleg Sinavski

keywords: autonomous, benchmark, dataset, driving, vision-language, answering, evaluation, human

We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios, and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 59.6% of the questions compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient to human evaluations, surpassing existing techniques like METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs
cs.CV 2026-04 unverdicted novelty 8.0

CCTVBench exposes a large gap between standard QA accuracy and contrastive consistency in traffic video reasoning for multimodal LLMs and introduces C-TCD to narrow that gap.

GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving
cs.CV 2026-06 unverdicted novelty 7.0

GeoDrive-Bench is a new multimodal benchmark and distillation method for testing and improving VLMs on region-specific traffic-rule reasoning in autonomous driving across six countries.

Visual Adversarial Attack on Vision-Language Models for Autonomous Driving
cs.CV 2024-11 unverdicted novelty 7.0

ADvLM is the first visual adversarial attack framework for VLMs in autonomous driving, using semantic-invariant induction via LLM-generated prompt libraries and scenario-associated attention-based enhancement to achie...

CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving
cs.CV 2025-08 unverdicted novelty 6.0

CogDriver-Agent with sparse temporal memory and spatiotemporal distillation on CogDriver-Data achieves 22% higher closed-loop Driving Score on Bench2Drive and 21% lower mean L2 error on nuScenes.

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
cs.CV 2026-04 unverdicted novelty 4.0

XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...