D2-V2X: Depth-Driven Cooperative V2X Reasoning for Autonomous Driving

Alphin Varghese; Colin Pham; David Oh; Kevin Richard; Srijan Das

arxiv: 2605.24098 · v1 · pith:AGRM5S2Cnew · submitted 2026-05-22 · 💻 cs.CV

D2-V2X: Depth-Driven Cooperative V2X Reasoning for Autonomous Driving

Kevin Richard , Alphin Varghese , Colin Pham , David Oh , Srijan Das This is my paper

Pith reviewed 2026-06-30 15:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords V2XVision-Language ModelsAutonomous DrivingCooperative PerceptionLiDAROcclusion HandlingBenchmarkChain-of-Thought

0 comments

The pith

Aligning cooperative 3D LiDAR with vision-language models raises occluded hazard recall to 24.4 percent and cuts spatial error by 77 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that single-vehicle vision-language models cannot resolve sensor occlusions on their own. It creates a new benchmark of 8500 question-rationale-answer triplets drawn from vehicle and roadside sensors to test cooperative reasoning. The baseline projects 3D LiDAR features into the model latent space and requires natural-language step-by-step reasoning before any JSON answer. This forces explicit description of spatial relations that zero-shot models ignore. The resulting system reaches the reported recall and error reductions while exposing a persistent 3D-to-2D projection limit in current architectures.

Core claim

Grounding VLMs in cooperative LiDAR through latent-space alignment, combined with enforced natural-language Chain-of-Thought rationales before structured outputs, produces 24.4 percent recall on occluded hazards where zero-shot models achieve near zero, reduces spatial estimation error on visible objects by 77 percent, and yields a 53.5 F1 decision-making score, while identifying 3D-to-2D projection as the remaining architectural bottleneck.

What carries the argument

The D2-V2X Question-Rationale-Answer benchmark together with the 3D LiDAR feature alignment method that inserts explicit spatial reasoning via natural-language Chain-of-Thought steps.

If this is right

V2X sensor fusion can supply the missing spatial context that single-vehicle VLMs lack.
Decision-making performance reaches a functional F1-score of 53.5 under the aligned setup.
3D-to-2D projection remains the primary limit on further gains in current VLM designs.
The 8500-triplet benchmark supplies a concrete testbed for measuring future cooperative-reasoning improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same alignment technique could be tested on other sensor types such as radar if equivalent feature projections are defined.
Communication latency in real V2X networks may reduce the reported gains unless the model is made robust to delayed or partial LiDAR streams.
The benchmark could serve as an evaluation suite for non-VLM multimodal architectures that also process 3D point clouds.

Load-bearing premise

Requiring natural language Chain-of-Thought rationales before JSON outputs forces the model to articulate spatial relations explicitly rather than relying on dataset patterns.

What would settle it

Train an otherwise identical model that receives the same LiDAR alignment but skips the Chain-of-Thought step and measure whether occluded-hazard recall stays near 24.4 percent or drops to near zero.

Figures

Figures reproduced from arXiv: 2605.24098 by Alphin Varghese, Colin Pham, David Oh, Kevin Richard, Srijan Das.

**Figure 1.** Figure 1: V2X Cooperative Perception. While Ego-only multimodal models (left) fail to detect occluded hazards, our D2-V2X baseline (right) leverages infrastructure LiDAR to localize a hidden van (red box, 27m), enabling a safe decision in a complex scenario. tionale that explains the relations of occluded objects using multi-sensor data before committing to a final driving action. By grounding these rationales in… view at source ↗

**Figure 2.** Figure 2: Sample D2-V2X QRA triplet The explicitly generated rationale forces the model to resolve 3D spatial geometry before outputting the structured JSON response. DAR features with an ego-vehicle VLM. Through Supervised Fine-Tuning (SFT) on our 8,500 triplet dataset, we demonstrate that while articulating spatial relations significantly improves occlusion awareness and driving decisionmaking, the translation … view at source ↗

**Figure 3.** Figure 3: D2-V2X Architecture. A pre-trained CenterPoint backbone extracts 3D features from fused V2X LiDAR, which are projected into 1D tokens. These are concatenated with multi-view image and text embeddings. The QLoRA-tuned Qwen3-VL processes this sequence to generate grounded reasoning and a structured decision. loop evaluation was conducted on a random 1% subset of the data to maintain semantic and stylistic co… view at source ↗

read the original abstract

Single-vehicle Vision-Language Models (VLMs) are fundamentally constrained by sensor occlusions. While Vehicle-to-Everything (V2X) systems mitigate this, current benchmarks lack the cooperative reasoning required for resolving ambiguities in complex environments. We introduce D2-V2X, a spatially-aware Question-Rationale-Answer (QRA) benchmark featuring 8,500 triplets derived from multimodal vehicle and infrastructure sensors. We additionally establish a baseline that aligns 3D LiDAR features with the VLM's latent space. By enforcing natural language Chain-of-Thought rationales prior to structured JSON outputs, our model is forced to explicitly articulate spatial relations. Our experiments demonstrate that grounding VLMs in cooperative LiDAR achieves 24.4% recall in identifying occluded hazards compared to near-zero in zero-shot models and reduces spatial estimation error for visible objects by 77% compared to the zero-shot baseline. While the model achieves a functional decision-making F1-score of 53.5, we identify 3D-to-2D projection as a fundamental bottleneck in current VLM architectures, establishing a new baseline for future innovation. Data, code, and trained models available at https://github.com/KevinRichard1/D2-V2X

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

New QRA benchmark with LiDAR-VLM alignment delivers concrete gains on occluded hazards, but the CoT rationale step lacks an ablation to show it drives the spatial gains rather than just supervised training on the new data.

read the letter

The paper's core offering is the D2-V2X benchmark of 8,500 multimodal QRA triplets drawn from vehicle and infrastructure sensors, plus a baseline that projects 3D LiDAR features into VLM space and requires natural-language Chain-of-Thought rationales before JSON outputs. This produces 24.4% recall on occluded hazards versus near-zero for zero-shot models and a 77% drop in spatial error for visible objects, along with a 53.5 F1 for decision making. They also release the data, code, and models.

The benchmark construction itself is new and fills a gap that prior V2X datasets did not target. The public release makes the numbers checkable, and the callout of 3D-to-2D projection as a persistent bottleneck is a useful flag for the field.

The soft spot is the missing control for the CoT requirement. The zero-shot baseline receives none of the new triplets, none of the alignment, and none of the output format, so the large deltas mix several changes. Without an ablation that keeps the alignment and supervised objective but drops the explicit rationale step, it is not clear whether the enforced natural-language reasoning is what forces better spatial articulation or whether any fine-tuning on this dataset would produce similar results.

This is for researchers working on V2X, cooperative perception, and VLMs for driving. The benchmark and released artifacts are concrete enough to merit referee time even if the method section needs more detail on training and controls.

Referee Report

2 major / 1 minor

Summary. The paper introduces the D2-V2X QRA benchmark consisting of 8,500 triplets from multimodal V2X sensors and proposes a baseline that aligns 3D LiDAR features into a VLM latent space while enforcing natural-language Chain-of-Thought rationales before structured JSON outputs. It reports that this grounding yields 24.4% recall on occluded hazards (vs. near-zero for zero-shot VLMs) and a 77% reduction in spatial estimation error for visible objects, with an overall decision-making F1 of 53.5, while identifying 3D-to-2D projection as a remaining bottleneck. Data, code, and models are released.

Significance. If the performance deltas hold after proper controls, the work would provide a useful new benchmark and open-source baseline for cooperative V2X reasoning with VLMs, directly addressing sensor occlusion in autonomous driving. The public release of the 8,500 triplets, code, and trained models is a clear strength that enables reproducibility and follow-on research.

major comments (2)

[Abstract] Abstract: The central claim that 'enforcing natural language Chain-of-Thought rationales prior to structured JSON outputs' forces explicit articulation of spatial relations and drives the reported gains (24.4% occluded recall, 77% error reduction) is not supported by any ablation that holds the LiDAR alignment and supervised objective fixed while removing only the rationale step.
[Abstract] Abstract: The comparison to the zero-shot baseline confounds the effects of (i) LiDAR feature alignment, (ii) supervised fine-tuning on the 8,500 QRA examples, and (iii) the CoT requirement, because the zero-shot model receives neither the benchmark triplets nor the output format; without an ablation that applies the same supervised objective without CoT, the data cannot isolate which component produces the deltas.

minor comments (1)

[Abstract] Abstract: Training details (loss formulation, optimizer, number of epochs, learning-rate schedule) and statistical significance tests for the 24.4% and 77% figures are not reported, making it impossible to assess variance or reproducibility from the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for more controlled experiments to isolate the individual contributions of LiDAR alignment, supervised fine-tuning, and the Chain-of-Thought (CoT) rationale step. We agree that the current zero-shot comparison does not fully disentangle these factors and that an ablation removing only the rationale requirement (while holding alignment and supervision fixed) is absent. We address each comment below and will incorporate the requested controls in the revision.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that 'enforcing natural language Chain-of-Thought rationales prior to structured JSON outputs' forces explicit articulation of spatial relations and drives the reported gains (24.4% occluded recall, 77% error reduction) is not supported by any ablation that holds the LiDAR alignment and supervised objective fixed while removing only the rationale step.

Authors: We acknowledge the validity of this observation. The manuscript does not currently contain an ablation that keeps the LiDAR feature alignment and supervised training objective fixed while removing only the natural-language rationale generation step. In the revised manuscript we will add exactly this control: a model variant trained on the same 8,500 QRA triplets but optimized to emit structured JSON directly, without an intermediate rationale. The resulting comparison will quantify the incremental benefit attributable to the CoT requirement. revision: yes
Referee: [Abstract] Abstract: The comparison to the zero-shot baseline confounds the effects of (i) LiDAR feature alignment, (ii) supervised fine-tuning on the 8,500 QRA examples, and (iii) the CoT requirement, because the zero-shot model receives neither the benchmark triplets nor the output format; without an ablation that applies the same supervised objective without CoT, the data cannot isolate which component produces the deltas.

Authors: We agree that the zero-shot baseline conflates all three factors. The revised version will therefore include a supervised fine-tuning baseline that uses the identical LiDAR-aligned features and training objective but omits the CoT rationale step (i.e., direct JSON prediction). This additional condition, together with the ablation described above, will allow readers to attribute performance changes more precisely. We note that because the QRA benchmark was constructed with rationales, any supervised model is trained on rationale-containing targets; the new ablation will nevertheless remove the explicit rationale output requirement at both training and inference time. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results stand as direct measurements

full rationale

The paper introduces the QRA benchmark (8,500 triplets) and a baseline aligning LiDAR features into VLM latent space, then reports measured recall (24.4%) and error reduction (77%) against an external zero-shot baseline. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The CoT enforcement is a prompting/training choice whose effect is measured empirically rather than derived by construction from the inputs. Results are self-contained against the external baseline with no reduction to definitions or prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work relies on standard assumptions in multimodal learning and VLM prompting without introducing new free parameters, axioms beyond domain norms, or invented entities.

axioms (1)

domain assumption Vision-language models can be aligned with 3D LiDAR features in latent space to improve spatial reasoning
Invoked when establishing the baseline alignment method in the abstract.

pith-pipeline@v0.9.1-grok · 5760 in / 1335 out tokens · 64150 ms · 2026-06-30T15:37:23.510331+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 5 canonical work pages · 3 internal anchors

[1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Bev-llm: Leveraging multimodal bev maps for scene captioning in autonomous driving

Felix Brandst ¨atter, Erik Sch ¨utz, Katharina Winter, and Fabian B Flohr. Bev-llm: Leveraging multimodal bev maps for scene captioning in autonomous driving. In2025 IEEE Intelligent V ehicles Symposium (IV), pages 345–350. IEEE,
[3]

Rehg, and Chao Zheng

Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M. Rehg, and Chao Zheng. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. InCVPR, pages 21819–21830, 2024. 1, 2

2024
[4]

V2v- llm: Vehicle-to-vehicle cooperative autonomous driving with multi-modal large language models.arXiv preprint arXiv:2502.09980, 2025

Hsu-kuang Chiu, Ryo Hachiuma, Chien-Yi Wang, Stephen F Smith, Yu-Chiang Frank Wang, and Min-Hung Chen. V2v- llm: Vehicle-to-vehicle cooperative autonomous driving with multi-modal large language models.arXiv preprint arXiv:2502.09980, 2025. 2

work page arXiv 2025
[5]

Qlora: Efficient finetuning of quantized llms

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. NeurIPS, 36:10088–10115, 2023. 3

2023
[6]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Stride-qa: visual question an- swering dataset for spatiotemporal reasoning in urban driv- ing scenes

Keishi Ishihara, Kento Sasaki, Tsubasa Takahashi, Daiki Sh- iono, and Yu Yamaguchi. Stride-qa: visual question an- swering dataset for spatiotemporal reasoning in urban driv- ing scenes. InAAAI, pages 5257–5266, 2026. 1, 2

2026
[8]

V3lma: Visual 3d-enhanced language model for autonomous driving

Jannik L ¨ubberstedt, Esteban Rivera Guerrero, Nico Uhle- mann, and Markus Lienkamp. V3lma: Visual 3d-enhanced language model for autonomous driving. InCVPR, pages 4769–4778, 2025. 2

2025
[9]

Vlaad: Vision and language assistant for autonomous driv- ing

SungYeon Park, MinJae Lee, JiHyuk Kang, Hahyeon Choi, Yoonah Park, Juhwan Cho, Adam Lee, and DongKyu Kim. Vlaad: Vision and language assistant for autonomous driv- ing. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 980–987, 2024. 2

2024
[10]

Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In AAAI, pages 4542–4550, 2024. 1, 2

2024
[11]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InECCV, pages 256–274. Springer, 2024. 2

2024
[12]

V2x-llm: Enhancing v2x integration and un- derstanding in connected vehicle corridors.arXiv preprint arXiv:2503.02239, 2025

Keshu Wu, Pei Li, Yang Zhou, Rui Gan, Junwei You, Yang Cheng, Jingwen Zhu, Steven T Parker, Bin Ran, David A Noyce, et al. V2x-llm: Enhancing v2x integration and un- derstanding in connected vehicle corridors.arXiv preprint arXiv:2503.02239, 2025. 2

work page arXiv 2025
[13]

V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer

Runsheng Xu, Hao Xiang, Zhengzhong Tu, Xin Xia, Ming- Hsuan Yang, and Jiaqi Ma. V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer. InECCV,
[14]

Center- based 3d object detection and tracking

Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center- based 3d object detection and tracking. InCVPR, pages 11784–11793, 2021. 3

2021
[15]

Junwei You, Pei Li, Zhuoyu Jiang, Weizhe Tang, Zilin Huang, Rui Gan, Jiaxi Liu, Yan Zhao, Sikai Chen, and Bin Ran. V2x-qa: A comprehensive reasoning dataset and benchmark for multimodal large language models in au- tonomous driving across ego, infrastructure, and cooperative views.arXiv preprint arXiv:2604.02710, 2026. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2026
[16]

Weinberger, and Yoav Artzi

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text gen- eration with bert. InICLR, 2020. 4

2020
[17]

Tumtraf videoqa: Dataset and benchmark for uni- fied spatio-temporal video understanding in traffic scenes

Xingcheng Zhou, Konstantinos Larintzakis, Hao Guo, Wal- ter Zimmer, Mingyu Liu, Hu Cao, Jiajie Zhang, Venkat- narayanan Lakshminarasimhan, Leah Strand, and Alois Knoll. Tumtraf videoqa: Dataset and benchmark for uni- fied spatio-temporal video understanding in traffic scenes. In F orty-second International Conference on Machine Learn- ing, 2025. 2

2025
[18]

Tumtraf v2x cooperative perception dataset

Walter Zimmer, Gerhard Arya Wardana, Suren Sritharan, Xingcheng Zhou, Rui Song, and Alois C Knoll. Tumtraf v2x cooperative perception dataset. InCVPR, pages 22668– 22677, 2024. 2, 3

2024

[1] [1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Bev-llm: Leveraging multimodal bev maps for scene captioning in autonomous driving

Felix Brandst ¨atter, Erik Sch ¨utz, Katharina Winter, and Fabian B Flohr. Bev-llm: Leveraging multimodal bev maps for scene captioning in autonomous driving. In2025 IEEE Intelligent V ehicles Symposium (IV), pages 345–350. IEEE,

[3] [3]

Rehg, and Chao Zheng

Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M. Rehg, and Chao Zheng. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. InCVPR, pages 21819–21830, 2024. 1, 2

2024

[4] [4]

V2v- llm: Vehicle-to-vehicle cooperative autonomous driving with multi-modal large language models.arXiv preprint arXiv:2502.09980, 2025

Hsu-kuang Chiu, Ryo Hachiuma, Chien-Yi Wang, Stephen F Smith, Yu-Chiang Frank Wang, and Min-Hung Chen. V2v- llm: Vehicle-to-vehicle cooperative autonomous driving with multi-modal large language models.arXiv preprint arXiv:2502.09980, 2025. 2

work page arXiv 2025

[5] [5]

Qlora: Efficient finetuning of quantized llms

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. NeurIPS, 36:10088–10115, 2023. 3

2023

[6] [6]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Stride-qa: visual question an- swering dataset for spatiotemporal reasoning in urban driv- ing scenes

Keishi Ishihara, Kento Sasaki, Tsubasa Takahashi, Daiki Sh- iono, and Yu Yamaguchi. Stride-qa: visual question an- swering dataset for spatiotemporal reasoning in urban driv- ing scenes. InAAAI, pages 5257–5266, 2026. 1, 2

2026

[8] [8]

V3lma: Visual 3d-enhanced language model for autonomous driving

Jannik L ¨ubberstedt, Esteban Rivera Guerrero, Nico Uhle- mann, and Markus Lienkamp. V3lma: Visual 3d-enhanced language model for autonomous driving. InCVPR, pages 4769–4778, 2025. 2

2025

[9] [9]

Vlaad: Vision and language assistant for autonomous driv- ing

SungYeon Park, MinJae Lee, JiHyuk Kang, Hahyeon Choi, Yoonah Park, Juhwan Cho, Adam Lee, and DongKyu Kim. Vlaad: Vision and language assistant for autonomous driv- ing. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 980–987, 2024. 2

2024

[10] [10]

Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In AAAI, pages 4542–4550, 2024. 1, 2

2024

[11] [11]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InECCV, pages 256–274. Springer, 2024. 2

2024

[12] [12]

V2x-llm: Enhancing v2x integration and un- derstanding in connected vehicle corridors.arXiv preprint arXiv:2503.02239, 2025

Keshu Wu, Pei Li, Yang Zhou, Rui Gan, Junwei You, Yang Cheng, Jingwen Zhu, Steven T Parker, Bin Ran, David A Noyce, et al. V2x-llm: Enhancing v2x integration and un- derstanding in connected vehicle corridors.arXiv preprint arXiv:2503.02239, 2025. 2

work page arXiv 2025

[13] [13]

V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer

Runsheng Xu, Hao Xiang, Zhengzhong Tu, Xin Xia, Ming- Hsuan Yang, and Jiaqi Ma. V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer. InECCV,

[14] [14]

Center- based 3d object detection and tracking

Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center- based 3d object detection and tracking. InCVPR, pages 11784–11793, 2021. 3

2021

[15] [15]

Junwei You, Pei Li, Zhuoyu Jiang, Weizhe Tang, Zilin Huang, Rui Gan, Jiaxi Liu, Yan Zhao, Sikai Chen, and Bin Ran. V2x-qa: A comprehensive reasoning dataset and benchmark for multimodal large language models in au- tonomous driving across ego, infrastructure, and cooperative views.arXiv preprint arXiv:2604.02710, 2026. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2026

[16] [16]

Weinberger, and Yoav Artzi

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text gen- eration with bert. InICLR, 2020. 4

2020

[17] [17]

Tumtraf videoqa: Dataset and benchmark for uni- fied spatio-temporal video understanding in traffic scenes

Xingcheng Zhou, Konstantinos Larintzakis, Hao Guo, Wal- ter Zimmer, Mingyu Liu, Hu Cao, Jiajie Zhang, Venkat- narayanan Lakshminarasimhan, Leah Strand, and Alois Knoll. Tumtraf videoqa: Dataset and benchmark for uni- fied spatio-temporal video understanding in traffic scenes. In F orty-second International Conference on Machine Learn- ing, 2025. 2

2025

[18] [18]

Tumtraf v2x cooperative perception dataset

Walter Zimmer, Gerhard Arya Wardana, Suren Sritharan, Xingcheng Zhou, Rui Song, and Alois C Knoll. Tumtraf v2x cooperative perception dataset. InCVPR, pages 22668– 22677, 2024. 2, 3

2024