Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)

Anthony Scanlan; Ciaran Eising; Fiachra Collins; Ganesh Sistu; Nikos Theodoridis; Reenu Mohandas; Tim Brophy

arxiv: 2511.13397 · v2 · submitted 2025-11-17 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-17 21:40 UTC · model grok-4.3


keywords: VQA benchmark · Vision-Language Models · Traffic perception · Automated driving · Distance annotation · Perception evaluation · Synthetic and real data


The pith

The DTPQA benchmark evaluates vision-language models on basic perception in traffic scenes, using trivial questions paired with distance labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Distance-Annotated Traffic Perception Question Answering (DTPQA) as a VQA benchmark to test the perception capabilities of VLMs in traffic scenarios. It uses trivial questions about driving-relevant objects to separate perception from reasoning or world knowledge. The benchmark combines a synthetic part generated in a simulator with a real-world part drawn from existing traffic images. Each sample carries a distance annotation for the object in question, so performance can be tracked as distance grows from close range to 30 meters and beyond. The authors release the dataset and the Python scripts that created it so others can extend the collection.

Core claim

DTPQA is a Visual Question Answering benchmark designed to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of a synthetic benchmark created using a simulator and a real-world benchmark built on existing images of real traffic scenes. Each sample includes an image, a question, the ground truth answer, and the distance of the object in question from the camera, enabling analysis of how VLM performance degrades with increasing object distance.
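To make the four-part sample concrete, here is a minimal Python sketch of one record. The class name, field names, file path, and example question are all hypothetical and chosen for illustration; they are not the schema of the released dataset, which should be checked against the authors' repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DTPQASample:
    """Illustrative record; field names are assumptions, not the released schema."""
    image_path: str    # path to the traffic-scene image
    question: str      # trivial perception question about one object
    answer: str        # ground-truth answer
    distance_m: float  # distance of the queried object from the camera, in meters


def is_long_range(sample: DTPQASample, threshold_m: float = 30.0) -> bool:
    """True when the queried object is at or beyond the long-range threshold."""
    return sample.distance_m >= threshold_m


# Hypothetical sample, invented for illustration only.
example = DTPQASample(
    image_path="images/example_0001.png",
    question="Is there a pedestrian crossing the road?",
    answer="Yes",
    distance_m=35.0,
)
```

Storing the distance next to each question is what lets later analysis slice results by range without going back to the image.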

What carries the argument

The DTPQA dataset, built from synthetic and real traffic images paired with simple perception questions and explicit distance annotations for each queried object.

If this is right

VLMs can be assessed specifically for their ability to perceive objects at long range in traffic scenes.
Accuracy can be compared directly across short, medium, and long distances within the same set of questions.
Synthetic scenes provide repeatable conditions while real scenes add ecological validity for driving tasks.
The released scripts allow researchers to generate additional samples with the same structure.
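The distance comparison described in this list can be sketched as a simple binning step. The code below is an illustration under stated assumptions: the bin edges follow the abstract's "up to 20 meters" and "30+ meters" ranges, the middle 20 to 30 m bin is an assumption added here, and the input format (pairs of distance and correctness) is hypothetical rather than the benchmark's own evaluation code.

```python
import bisect
from collections import defaultdict


def bin_label(distance_m, edges):
    """Return a label such as '0-20 m' or '30+ m' for the bin containing distance_m."""
    i = bisect.bisect_right(edges, distance_m) - 1
    if i < 0:
        raise ValueError(f"distance {distance_m} is below the first bin edge")
    if i + 1 < len(edges):
        return f"{edges[i]:g}-{edges[i + 1]:g} m"
    return f"{edges[i]:g}+ m"  # last bin is open-ended


def accuracy_by_distance(records, edges=(0.0, 20.0, 30.0)):
    """records: iterable of (distance_m, is_correct). Returns {bin label: accuracy}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for distance_m, is_correct in records:
        label = bin_label(distance_m, edges)
        totals[label] += 1
        hits[label] += int(is_correct)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up outcomes, purely to show the shape of the result.
demo = [(5, True), (10, False), (25, True), (35, False), (40, False)]
per_bin = accuracy_by_distance(demo)
```

Because every question carries its own distance, the same question set can be re-binned with different edges without re-running the model.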

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Widespread use could screen VLMs for basic perception failures before any vehicle integration.
The same distance-label approach might be applied to other perception-heavy domains such as aerial surveillance.
Results could guide targeted data collection for improving long-range recognition in future model training.

Load-bearing premise

The chosen questions are sufficiently trivial to isolate pure perception from any reasoning or world knowledge, and the distance annotations and scene selection accurately capture the distribution of real traffic objects at long range.

What would settle it

Applying current VLMs to the DTPQA samples and finding no measurable drop in accuracy for objects at 30+ meters compared with objects under 20 meters.
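One way to decide whether a drop is "measurable" is a one-sided two-proportion z-test on near-range versus long-range accuracy. This is a generic statistical sketch, not a procedure taken from the paper; the function name and the counts in the example are hypothetical.

```python
import math


def two_proportion_z(hits_near, n_near, hits_far, n_far):
    """One-sided test of whether near-range accuracy exceeds long-range accuracy.

    Returns (z, p_value), where p_value = P(Z >= z) under the null hypothesis
    that both ranges share the same underlying accuracy.
    """
    p_near = hits_near / n_near
    p_far = hits_far / n_far
    pooled = (hits_near + hits_far) / (n_near + n_far)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_near + 1.0 / n_far))
    if se == 0.0:
        raise ValueError("standard error is zero; accuracies are all 0 or all 1")
    z = (p_near - p_far) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability
    return z, p_value


# Hypothetical counts: 90/100 correct under 20 m, 60/100 correct at 30+ m.
z, p = two_proportion_z(90, 100, 60, 100)
```

A large p-value across many models would be the "no measurable drop" outcome described above; a small one supports the premise that distance matters.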

Figures

Figures reproduced from arXiv: 2511.13397 by Anthony Scanlan, Ciaran Eising, Fiachra Collins, Ganesh Sistu, Nikos Theodoridis, Reenu Mohandas, Tim Brophy.

**Figure 2.** [PITH_FULL_IMAGE:figures/full_fig_p005_2.png]

**Figure 3.** [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

**Figure 4.** Chance-corrected accuracy of nine SOTA small VLMs (and one large VLM) compared with human accuracy across DTPQA tasks. The substantial gap indicates that DTPQA challenges current models and is a useful benchmark for evaluating the perception capabilities of VLMs in traffic scenarios.

**Figure 5.** [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]
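The Figure 4 caption refers to chance-corrected accuracy. The exact definition the authors use is not reproduced on this page, so the sketch below assumes one common correction, (accuracy − chance) / (1 − chance), with chance = 1/k for a question with k equally likely answer options. The function name is hypothetical.

```python
def chance_corrected_accuracy(accuracy: float, num_options: int) -> float:
    """Rescale raw accuracy so that random guessing maps to 0 and perfect to 1.

    Assumes a uniform guess over num_options answers; this is one common
    convention and may differ from the definition used in the paper.
    """
    if num_options < 2:
        raise ValueError("need at least two answer options")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    chance = 1.0 / num_options
    return (accuracy - chance) / (1.0 - chance)
```

Under this convention a model scoring 75% on yes/no questions gets a corrected score of 0.5, and a model at exactly 50% gets 0, which makes tasks with different numbers of answer options comparable on one axis.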

Original abstract

The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Distance-Annotated Traffic Perception Question Answering (DTPQA), a VQA benchmark for evaluating VLMs on traffic scene perception. It consists of DTP-Synthetic (generated via simulator) and DTP-Real (built on real traffic images), with each sample containing an image, question, ground-truth answer, and distance annotation to the queried object. The goal is to assess perception in isolation using trivial yet crucial driving-relevant questions and to enable analysis of performance degradation at increasing distances (including 30+ m). The authors release the dataset along with Python scripts for its creation and extension.

Significance. If the questions can be shown to isolate pure perception without invoking reasoning or world knowledge and if distance annotations prove reliable at long range, DTPQA would address a practical gap in safety-critical VLM evaluation for automated driving. The dual synthetic/real construction and open release of creation scripts constitute a reproducible artifact that the community can extend, which is a concrete strength of the work.

major comments (2)

[Abstract] The central claim that questions are 'trivial yet crucial' and evaluate 'perception capabilities ... in isolation from other skills like reasoning or advanced world knowledge' lacks any supporting detail. No question examples, generation rules, design criteria for triviality, or validation steps (e.g., inter-annotator agreement or comparison against reasoning-heavy VQA items) are supplied. This directly affects whether the benchmark can fulfill its stated purpose.
[DTP-Real] The source and methodology for obtaining distance annotations on real images are unspecified. Because the paper highlights performance analysis at long ranges (30+ m), the precision and provenance of these annotations are load-bearing for the intended degradation study.

minor comments (2)

Consider adding a table or figure that lists representative questions, images, and distance values from both DTP-Synthetic and DTP-Real to make the data characteristics concrete.
The scripts are mentioned but not described in the text; a short usage example or repository link with documentation would improve accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the specific revisions planned for the next version.

Point-by-point responses

Referee: [Abstract] The central claim that questions are 'trivial yet crucial' and evaluate 'perception capabilities ... in isolation from other skills like reasoning or advanced world knowledge' lacks any supporting detail. No question examples, generation rules, design criteria for triviality, or validation steps (e.g., inter-annotator agreement or comparison against reasoning-heavy VQA items) are supplied. This directly affects whether the benchmark can fulfill its stated purpose.

Authors: We agree that the abstract would be strengthened by concrete supporting details. The submitted manuscript describes the overall benchmark purpose but does not include explicit question examples or validation steps in the abstract. In the revision we will add one representative question example to the abstract, briefly state the design criteria (direct queries on visible attributes such as object presence, color, or type with no multi-step inference required), and expand the main text with generation rules plus a short comparison to reasoning-heavy VQA items. These additions will be incorporated into the revised submission.

revision: yes

Referee: [DTP-Real] The source and methodology for obtaining distance annotations on real images are unspecified. Because the paper highlights performance analysis at long ranges (30+ m), the precision and provenance of these annotations are load-bearing for the intended degradation study.

Authors: We acknowledge the omission. The current manuscript states only that DTP-Real is built on existing real traffic images without specifying the source dataset or annotation procedure. In the revised version we will add a dedicated paragraph detailing the image source and the exact methodology used to obtain or estimate distance annotations, including any calibration steps or reliability considerations for ranges beyond 30 m. This will directly support the degradation analysis.

revision: yes

Circularity Check

0 steps flagged

No circularity; paper introduces dataset artifact without derivations or self-referential reductions

full rationale

The manuscript presents DTPQA as a new benchmark dataset (synthetic and real-world components with distance annotations) and accompanying creation scripts. No mathematical derivations, equations, fitted parameters, or predictions are claimed. The central description—that the questions are 'trivial yet crucial' for isolating perception—is a design statement rather than a result derived from prior steps within the paper. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way that reduces the contribution to its own inputs by construction. The work is self-contained as an artifact contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the domain assumption that perception can be isolated via carefully worded questions and that simulator and real-image data together represent relevant traffic conditions. No free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Trivial questions can isolate perception capabilities from reasoning or world knowledge in traffic scenes.
Invoked when the abstract describes the questions as 'trivial yet crucial' for evaluating perception in isolation.



Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1]

Lingoqa: Visual question answering for autonomous driving,

A.-M. Marcu, L. Chen, J. Hünermann, A. Karnsund, B. Hanotte, P. Chidananda, S. Nair, V. Badrinarayanan, A. Kendall, J. Shotton, E. Arani, and O. Sinavski, “Lingoqa: Visual question answering for autonomous driving,” in European Conference on Computer Vision, 12 2024, pp. 252–269. [Online]. Available: http://arxiv.org/abs/2312.14115

work page 2024
[2]

Are vlms ready for autonomous driving? an empirical study from the reliability, data, and metric perspectives,

S. Xie, L. Kong, Y. Dong, C. Sima, W. Zhang, Q. A. Chen, Z. Liu, and L. Pan, “Are vlms ready for autonomous driving? an empirical study from the reliability, data, and metric perspectives,” 1 2025. [Online]. Available: http://arxiv.org/abs/2501.04003

work page arXiv 2025
[3]

Reading between the lanes: Text videoqa on the road,

G. Tom, M. Mathew, S. Garcia, D. Karatzas, and C. V. Jawahar, “Reading between the lanes: Text videoqa on the road,” in International Conference on Document Analysis and Recognition, 7 2023, pp. 137–154. [Online]. Available: http://arxiv.org/abs/2307.03948

work page arXiv 2023
[4]

Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models,

X. Guo, R. Zhang, Y. Duan, Y. He, D. Nie, W. Huang, C. Zhang, S. Liu, H. Zhao, and L. Chen, “Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models,” arXiv preprint arXiv:2411.13112, 2024

work page arXiv 2024
[5]

Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models,

X. Ding, J. Han, H. Xu, X. Liang, W. Zhang, and X. Li, “Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13668–13677. [Online]. Available: https://github

work page 2024
[6]

Tb-bench: Training and testing multi-modal ai for understanding spatio-temporal traffic behaviors from dashcam images/videos,

K. Charoenpitaks, V.-Q. Nguyen, M. Suganuma, K. Arai, S. Totsuka, H. Ino, and T. Okatani, “Tb-bench: Training and testing multi-modal ai for understanding spatio-temporal traffic behaviors from dashcam images/videos,” 1 2025. [Online]. Available: http://arxiv.org/abs/2501.05733

work page 2025
[7]

Stride-qa: Visual question answering dataset for spatiotemporal reasoning in urban driving scenes,

K. Ishihara, K. Sasaki, T. Takahashi, D. Shiono, and Y. Yamaguchi, “Stride-qa: Visual question answering dataset for spatiotemporal reasoning in urban driving scenes,” arXiv preprint arXiv:2508.10427, 2025

work page arXiv 2025
[8]

Tumtraffic-videoqa: A benchmark for unified spatio-temporal video understanding in traffic scenes,

X. Zhou, K. Larintzakis, H. Guo, W. Zimmer, M. Liu, H. Cao, J. Zhang, V. Lakshminarasimhan, L. Strand, and A. C. Knoll, “Tumtraffic-videoqa: A benchmark for unified spatio-temporal video understanding in traffic scenes,” arXiv preprint arXiv:2502.02449, 2025

work page arXiv 2025
[9]

Carla: An open urban driving simulator,

A. Dosovitskiy, G. Ros, F. Codevilla, A. López, and V. Koltun, “Carla: An open urban driving simulator,” in Conference on robot learning, 2017, pp. 1–16

work page 2017
[10]

nuScenes: A Multimodal Dataset for Autonomous Driving,

H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 6 2020, pp. 11618–11628. [Online]. Available: https://ieeexplore.ieee.org/document/9156412/

work page arXiv 2020
[11]

Evaluating small vision-language models on distance-dependent traffic perception,

N. Theodoridis, T. Brophy, R. Mohandas, G. Sistu, F. Collins, A. Scanlan, and C. Eising, “Evaluating small vision-language models on distance-dependent traffic perception,” IEEE Open Journal of Vehicular Technology, pp. 1–22, 2025

work page 2025
[12]

A survey on simulators for testing self-driving cars,

P. Kaur, S. Taghavi, Z. Tian, and W. Shi, “A survey on simulators for testing self-driving cars,” in 2021 Fourth International Conference on Connected and Autonomous Driving (MetroCAD). IEEE, 2021, pp. 62–70

work page 2021
[13]

A survey on autonomous driving datasets: Statistics, annotation quality, and a future outlook,

M. Liu, E. Yurtsever, J. Fossaert, X. Zhou, W. Zimmer, Y. Cui, B. L. Zagar, and A. C. Knoll, “A survey on autonomous driving datasets: Statistics, annotation quality, and a future outlook,” IEEE Transactions on Intelligent Vehicles, 2024

work page 2024
[14]

Dtpqa,

D2ICE-Automotive-Research, “Dtpqa,” August 2025. [Online]. Available: https://github.com/D2ICE-Automotive-Research/DTPQA

work page 2025
