arxiv: 2312.14115 · v4 · pith:SNNNO3I5new · submitted 2023-12-21 · cs.RO · cs.AI · cs.CV

LingoQA: Visual Question Answering for Autonomous Driving

classification cs.RO cs.AI cs.CV
keywords autonomous, benchmark, dataset, driving, vision-language, answering, evaluation, human

We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 59.6% of the questions compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient to human evaluations, surpassing existing techniques like METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving.
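The abstract reports agreement between an automatic metric and human evaluation as a Spearman correlation coefficient. The following is a minimal sketch of that statistic only, not the authors' evaluation code: it ranks both score lists (ties get the average rank) and takes the Pearson correlation of the ranks. The function names and the toy scores are hypothetical and chosen for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: one metric score and one human rating per model.
metric_scores = [0.31, 0.42, 0.58, 0.77]
human_ratings = [2.0, 2.5, 3.9, 4.6]
rho = spearman(metric_scores, human_ratings)
```

Because the statistic depends only on rank order, a metric that orders models exactly as the human ratings do reaches a coefficient of 1 regardless of the scale of its raw scores.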

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs

    cs.CV 2026-04 unverdicted novelty 8.0

    CCTVBench exposes a large gap between standard QA accuracy and contrastive consistency in traffic video reasoning for multimodal LLMs and introduces C-TCD to narrow that gap.

  2. GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving

    cs.CV 2026-06 unverdicted novelty 7.0

    GeoDrive-Bench is a new multimodal benchmark and distillation method for testing and improving VLMs on region-specific traffic-rule reasoning in autonomous driving across six countries.

  3. Visual Adversarial Attack on Vision-Language Models for Autonomous Driving

    cs.CV 2024-11 unverdicted novelty 7.0

    ADvLM is the first visual adversarial attack framework for VLMs in autonomous driving, using semantic-invariant induction via LLM-generated prompt libraries and scenario-associated attention-based enhancement to achie...

  4. CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving

    cs.CV 2025-08 unverdicted novelty 6.0

    CogDriver-Agent with sparse temporal memory and spatiotemporal distillation on CogDriver-Data achieves 22% higher closed-loop Driving Score on Bench2Drive and 21% lower mean L2 error on nuScenes.

  5. XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

    cs.CV 2026-04 unverdicted novelty 4.0

    XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...