Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)
Pith reviewed 2026-05-17 21:40 UTC · model grok-4.3
The pith
DTPQA benchmark evaluates vision-language models on basic perception in traffic scenes using trivial questions and distance labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DTPQA is a Visual Question Answering benchmark designed to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of a synthetic benchmark created using a simulator and a real-world benchmark built on existing images of real traffic scenes. Each sample includes an image, a question, the ground truth answer, and the distance of the object in question from the camera, enabling analysis of how VLM performance degrades with increasing object distance.
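The four-part sample described above can be pictured as a small record type. This is a minimal sketch, not the released data format: the class name `DTPQASample`, its field names, the optional `prediction` field, the example question, and the file path are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DTPQASample:
    """Hypothetical container for one sample; field names are assumptions."""
    image_path: str        # (a) the image
    question: str          # (b) the question
    answer: str            # (c) the ground-truth answer
    distance_m: float      # (d) distance of the queried object from the camera
    prediction: Optional[str] = None  # model output, filled in at evaluation time

    def is_correct(self) -> bool:
        """Case-insensitive exact match between prediction and ground truth."""
        if self.prediction is None:
            raise ValueError("no prediction attached to this sample")
        return self.prediction.strip().lower() == self.answer.strip().lower()


# Invented example values, for illustration only.
example = DTPQASample(
    image_path="images/placeholder.png",
    question="Is there a pedestrian in the image?",
    answer="Yes",
    distance_m=42.5,
    prediction="yes",
)
print(example.is_correct())
```

Keeping the distance on the same record as the answer is what lets correctness be grouped by range later without a separate lookup.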
What carries the argument
The DTPQA dataset, built from synthetic and real traffic images paired with simple perception questions and explicit distance annotations for each queried object.
If this is right
- VLMs can be assessed specifically for their ability to perceive objects at long range in traffic scenes.
- Accuracy can be compared directly across short, medium, and long distances within the same set of questions.
- Synthetic scenes provide repeatable conditions while real scenes add ecological validity for driving tasks.
- The released scripts allow researchers to generate additional samples with the same structure.
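The per-distance comparison in the list above could be computed along these lines. This is a sketch under stated assumptions: the abstract gives only "up to 20 meters" for close range and "30+ meters" for long range, so treating 20 to 30 m as a "medium" bin is an assumption, and the function names and the sample records are invented for illustration.

```python
from collections import defaultdict


def distance_bin(distance_m: float) -> str:
    """Map a distance to a bin; the 20 m and 30 m cut-offs follow the abstract,
    the 'medium' bin in between is an assumption."""
    if distance_m <= 20.0:
        return "short"
    if distance_m < 30.0:
        return "medium"
    return "long"


def accuracy_by_bin(records):
    """records: iterable of (distance_m, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for distance_m, is_correct in records:
        name = distance_bin(distance_m)
        totals[name] += 1
        hits[name] += int(is_correct)
    return {name: hits[name] / totals[name] for name in totals}


# Invented results, for illustration only.
records = [
    (5.0, True), (12.0, True),
    (25.0, True), (25.0, False),
    (40.0, False), (50.0, True),
]
acc = accuracy_by_bin(records)
print(acc)
```

Bins with no samples are simply absent from the result, which avoids a division by zero and makes gaps in distance coverage visible.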
Where Pith is reading between the lines
- Widespread use could screen VLMs for basic perception failures before any vehicle integration.
- The same distance-label approach might be applied to other perception-heavy domains such as aerial surveillance.
- Results could guide targeted data collection for improving long-range recognition in future model training.
Load-bearing premise
The chosen questions are sufficiently trivial to isolate pure perception from any reasoning or world knowledge, and the distance annotations and scene selection accurately capture the distribution of real traffic objects at long range.
What would settle it
Applying current VLMs to the DTPQA samples and finding no measurable drop in accuracy for objects at 30+ meters compared with objects under 20 meters.
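One way to make "no measurable drop" concrete is a one-sided two-proportion z-test on near versus far accuracy. The paper does not prescribe this test; it is a standard technique swapped in here as an illustration, and the function name and the counts in the usage lines are made up.

```python
import math


def near_vs_far_z_test(near_correct, near_total, far_correct, far_total):
    """One-sided two-proportion z-test.

    Null hypothesis: accuracy under 20 m equals accuracy at 30+ m.
    Alternative: accuracy under 20 m is higher.
    Returns (z, p_value).
    """
    p_near = near_correct / near_total
    p_far = far_correct / far_total
    pooled = (near_correct + far_correct) / (near_total + far_total)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / near_total + 1.0 / far_total))
    if se == 0.0:
        raise ValueError("standard error is zero; all answers identical")
    z = (p_near - p_far) / se
    # Upper-tail probability of the standard normal: P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


# Made-up counts: 90/100 correct under 20 m, 60/100 correct at 30+ m.
z, p = near_vs_far_z_test(90, 100, 60, 100)
print(f"z = {z:.2f}, one-sided p = {p:.2g}")
```

A large p-value on a sufficiently large sample would be the kind of evidence that counts against the degradation claim; a small one is consistent with it.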
Original abstract
The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Distance-Annotated Traffic Perception Question Answering (DTPQA), a VQA benchmark for evaluating VLMs on traffic scene perception. It consists of DTP-Synthetic (generated via simulator) and DTP-Real (built on real traffic images), with each sample containing an image, question, ground-truth answer, and distance annotation to the queried object. The goal is to assess perception in isolation using trivial yet crucial driving-relevant questions and to enable analysis of performance degradation at increasing distances (including 30+ m). The authors release the dataset along with Python scripts for its creation and extension.
Significance. If the questions can be shown to isolate pure perception without invoking reasoning or world knowledge and if distance annotations prove reliable at long range, DTPQA would address a practical gap in safety-critical VLM evaluation for automated driving. The dual synthetic/real construction and open release of creation scripts constitute a reproducible artifact that the community can extend, which is a concrete strength of the work.
major comments (2)
- [Abstract] Abstract: The central claim that questions are 'trivial yet crucial' and evaluate 'perception capabilities ... in isolation from other skills like reasoning or advanced world knowledge' lacks any supporting detail. No question examples, generation rules, design criteria for triviality, or validation steps (e.g., inter-annotator agreement or comparison against reasoning-heavy VQA items) are supplied. This directly affects whether the benchmark can fulfill its stated purpose.
- [DTP-Real] DTP-Real: The source and methodology for obtaining distance annotations on real images are unspecified. Because the paper highlights performance analysis at long ranges (30+ m), the precision and provenance of these annotations are load-bearing for the intended degradation study.
minor comments (2)
- Consider adding a table or figure that lists representative questions, images, and distance values from both DTP-Synthetic and DTP-Real to make the data characteristics concrete.
- The scripts are mentioned but not described in the text; a short usage example or repository link with documentation would improve accessibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the specific revisions planned for the next version.
- Referee: [Abstract] Abstract: The central claim that questions are 'trivial yet crucial' and evaluate 'perception capabilities ... in isolation from other skills like reasoning or advanced world knowledge' lacks any supporting detail. No question examples, generation rules, design criteria for triviality, or validation steps (e.g., inter-annotator agreement or comparison against reasoning-heavy VQA items) are supplied. This directly affects whether the benchmark can fulfill its stated purpose.
  Authors: We agree that the abstract would be strengthened by concrete supporting details. The submitted manuscript describes the overall benchmark purpose but does not include explicit question examples or validation steps in the abstract. In the revision we will add one representative question example to the abstract, briefly state the design criteria (direct queries on visible attributes such as object presence, color, or type with no multi-step inference required), and expand the main text with generation rules plus a short comparison to reasoning-heavy VQA items. These additions will be incorporated into the revised submission.
  Revision: yes
- Referee: [DTP-Real] DTP-Real: The source and methodology for obtaining distance annotations on real images are unspecified. Because the paper highlights performance analysis at long ranges (30+ m), the precision and provenance of these annotations are load-bearing for the intended degradation study.
  Authors: We acknowledge the omission. The current manuscript states only that DTP-Real is built on existing real traffic images without specifying the source dataset or annotation procedure. In the revised version we will add a dedicated paragraph detailing the image source and the exact methodology used to obtain or estimate distance annotations, including any calibration steps or reliability considerations for ranges beyond 30 m. This will directly support the degradation analysis.
  Revision: yes
Circularity Check
No circularity; paper introduces dataset artifact without derivations or self-referential reductions
The manuscript presents DTPQA as a new benchmark dataset (synthetic and real-world components with distance annotations) and accompanying creation scripts. No mathematical derivations, equations, fitted parameters, or predictions are claimed. The central description—that the questions are 'trivial yet crucial' for isolating perception—is a design statement rather than a result derived from prior steps within the paper. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way that reduces the contribution to its own inputs by construction. The work is self-contained as an artifact contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: trivial questions can isolate perception capabilities from reasoning or world knowledge in traffic scenes.