DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models

Despoina Paschalidou; Jenny Schmalfuss; Jose M. Alvarez; Kashyap Chitta; Kevin Xie; Sanja Fidler; Xinglong Sun; Xiuming Zhang

arxiv: 2606.17362 · v1 · pith:K26ZSG5Enew · submitted 2026-06-15 · 💻 cs.CV · cs.AI· cs.LG· cs.RO

DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models

Xinglong Sun , Kevin Xie , Jenny Schmalfuss , Despoina Paschalidou , Xiuming Zhang , Sanja Fidler , Kashyap Chitta , Jose M. Alvarez This is my paper

Pith reviewed 2026-06-27 02:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LGcs.RO

keywords autonomous drivingevaluation metricsvision-language modelsdriving quality classificationtrajectory preference selectioncontext-aware evaluationrule-grounded assessment

0 comments

The pith

DriveJudge uses a vision-language model to interpret driving context before selectively applying deterministic physical rules for evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to create an evaluation method for autonomous driving policies that is both aware of the surrounding scenario and grounded in physical rules. It does this by training a system called DriveJudge on 33,577 human-annotated samples where evaluators judged whether a given trajectory was reasonable. DriveJudge first reasons over images and text with a VLM to understand the scene, then calls specific rule functions only when appropriate, avoiding the context-blindness of pure rule metrics like EPDMS and the weak grounding of pure VLM approaches. This matters for anyone building or testing end-to-end driving models because evaluation directly shapes what policies get deployed. A sympathetic reader would see the two new benchmarks as a concrete way to measure progress toward human-aligned driving assessment.

Core claim

DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically-grounded deterministic rule functions after interpreting the environmental context. Trained and tested on a curated dataset of 33,577 challenging driving samples with human annotations, it addresses driving metric evaluation through two tasks: Driving Quality Classification and Trajectory Preference Selection, where it outperforms EPDMS by 21.23 AUC and the recent VLM-based DriveCritic by 6.5%.

What carries the argument

DriveJudge, the agent that uses VLM reasoning to interpret context and then selectively calls deterministic rule functions for physically grounded scores.

If this is right

Establishes two new human-aligned benchmark tasks for measuring driving evaluation quality.
Demonstrates that hybrid VLM-plus-rule systems can exceed both pure rule-based and pure VLM-based methods on the same data.
Provides an interpretable evaluation signal that could be used to train or fine-tune end-to-end driving policies.
Shows that context interpretation followed by rule invocation improves both classification accuracy and preference ranking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hybrid pattern could be tested on other embodied tasks where behavior must be judged against both context and hard constraints, such as robotic manipulation.
If the rule set is expanded, DriveJudge might serve as an online monitor that flags unsafe actions in real vehicles before they occur.
Future work could measure how much the performance gain depends on the particular VLM backbone versus the rule-invocation logic.
The dataset itself could be used to study where human raters disagree, revealing ambiguous driving situations that current rules do not cover.

Load-bearing premise

Human annotations on the 33,577 samples give an accurate and unbiased ground truth for reasonable driving behavior, and the VLM can reliably read the context to pick the correct rules without adding new mistakes.

What would settle it

Collecting a fresh test set of driving scenarios and showing that a panel of new human raters agrees more with EPDMS or DriveCritic classifications than with DriveJudge.

Figures

Figures reproduced from arXiv: 2606.17362 by Despoina Paschalidou, Jenny Schmalfuss, Jose M. Alvarez, Kashyap Chitta, Kevin Xie, Sanja Fidler, Xinglong Sun, Xiuming Zhang.

**Figure 1.** Figure 1: Comparison of evaluation paradigms. We address limitations of existing classical and VLM-based metrics with DriveJudge which is a VLM-guided, rule-grounded evaluation paradigm. It selectively invokes rules based on scene context, enabling explicit, expert-free, context-aware, and interpretable driving assessment. In the example, DriveJudge ignores lane keeping (LK) to nudge. Abstract Autonomous driving has… view at source ↗

**Figure 2.** Figure 2: DriveJudge Framework. DriveJudge leverages a VLM to determine metric relevance under scene context, selectively invokes the relevant rules, and aggregates their outputs into a contextaware driving quality score [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: DriveJudge Dataset. The DriveJudge dataset contains diverse challenging driving scenarios, including reasonable behaviors that may deviate from conventional driving rules, as well as finegrained driving failure cases. ( [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative model comparisons on two representative clips from the test set [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 6.** Figure 6: Validation accuracy of Zero-shot+RL and SFT+RL using DriveJudge-2B. 0 5000 10000 15000 20000 25000 30000 SFT Training Data Size (# samples) 55 60 65 70 75 Validation AUC 1/20 1/15 1/10 1/5 1/3 1/2 1 Driving Quality Classification SFT AUC vs. Data Size 0 500 1000 1500 2000 2500 3000 3500 RL Training Data Size (# samples) 72 74 76 78 80 Validation Accuracy (%) 1/20 1/15 1/10 1/5 1/3 1/2 1 Trajectory Preferen… view at source ↗

**Figure 8.** Figure 8: B.3 Filtering the DriveCritic [45] Trajectory Selection Dataset The original DriveCritic [45] trajectory preference dataset contains a substantial number of samples where the preference signal is inherently ambiguous, as shown in [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 8.** Figure 8: Representative examples from our collected annotation labels. [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

**Figure 9.** Figure 9: Ambiguous preference samples removed from DriveCritic [45]. A common pattern in the original DriveCritic dataset involves trajectory pairs without a clear human preference. In this example, both trajectories are safe and valid, differing only in minor progress behavior (e.g., slowly advancing versus remaining stationary), making the preference inherently ambiguous. We remove such samples to construct a cle… view at source ↗

**Figure 10.** Figure 10: RL training reward comparison from different initialization. Starting from an SFTinitialized model leads to steady reward improvement, while zero-shot initialization quickly plateaus with minimal gain. This highlights the importance of SFT as a critical warm start. Concretely, the full driving score slightly differs from the simplified form shown in Eqn. 8. We present the complete formulation in Eqn. 14.… view at source ↗

**Figure 11.** Figure 11: SFT training dynamics. Left: training loss over optimization steps. Right: validation perfect-match rate measuring exact alignment between predicted rule invocation traces and groundtruth traces. shows the validation perfect-match rate, which measures the percentage of validation samples where the predicted rule invocation trace exactly matches the ground-truth rule invocation trace across all evaluation… view at source ↗

read the original abstract

Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge as driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLMbased evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically-grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples with human annotations on whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of driving metric evaluation, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS for driving quality classification by 21.23 AUC, and the recent VLM-based DriveCritic for trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DriveJudge pairs VLM context reading with selective deterministic rules on a new 33k human-labeled dataset, but the abstract gives almost no architecture, rule, or annotation details to judge whether the 21 AUC and 6.5% gains are robust.

read the letter

The paper's main move is to build an agent that first lets a VLM interpret the scene and then hands off to fixed physical rules only when the context matches. They back this with a fresh dataset of 33,577 samples labeled by humans for reasonableness and two new tasks: quality classification and trajectory preference. That combination is not directly in the cited prior work.

The hybrid idea itself is sensible. Pure rule metrics like EPDMS miss context; pure VLM outputs can be loose on physics. Letting the VLM decide when to apply rules could tighten both sides, and the reported lifts over EPDMS and DriveCritic are large enough to notice.

The soft spots are mostly in what is missing. The abstract supplies no model name or size, no description of the rule functions, no dataset collection protocol, no inter-annotator agreement numbers, and no significance tests. Human labels on “reasonable” driving are the load-bearing ground truth, yet reasonableness is subjective on edge cases; without agreement stats the gains could partly track annotation noise rather than better evaluation. The stress-test concern lands because the abstract offers nothing to rebut it.

This is for people working on AV metrics and benchmarks. A reader who needs a new dataset or task definition could extract value once the implementation details appear. The work shows clear engagement with the literature and the core tradeoff, so it is coherent on its own terms.

I would send it to peer review so the missing pieces can be checked and the experiments can be stress-tested properly.

Referee Report

3 major / 1 minor

Summary. The paper introduces DriveJudge, a VLM-based driving evaluation agent that interprets environmental context via vision-language reasoning and selectively invokes physically-grounded deterministic rule functions. It curates a dataset of 33,577 human-annotated samples for reasonableness of driving behavior and defines two benchmark tasks (driving quality classification and trajectory preference selection), reporting a 21.23 AUC improvement over EPDMS and a 6.5% gain over DriveCritic.

Significance. If the evaluation setup is robust, the work could advance interpretable, context-aware metrics for end-to-end autonomous driving policies by addressing the context-dependence limitation of pure rule-based metrics and the ambiguity of standalone VLM outputs. The human-aligned benchmarks represent a constructive contribution to the field.

major comments (3)

[Dataset / Evaluation setup] Dataset construction (implied in abstract and methods): The central claims rest on human annotations of 33,577 samples serving as reliable ground truth for whether driving behavior is 'reasonable' in context. No inter-annotator agreement statistics, annotation guidelines, or validation against objective criteria are described; given the inherent subjectivity of reasonableness (e.g., aggressive yet safe maneuvers), this undermines assessment of whether the reported AUC and percentage gains reflect genuine improvement rather than annotation noise or bias.
[Methods / DriveJudge description] Methods (abstract and § on DriveJudge architecture): The performance numbers are presented without any description of the VLM model, training procedure, prompting strategy for context interpretation, or the precise deterministic rule functions and invocation logic. These omissions make it impossible to determine whether the 21.23 AUC and 6.5% gains arise from the proposed hybrid approach or from unspecified implementation choices.
[Results / Experiments] Results (abstract): The reported improvements lack any mention of statistical significance testing, confidence intervals, ablation studies, or error analysis. Without these, the robustness of the gains over EPDMS and DriveCritic cannot be evaluated, especially on a benchmark whose ground truth reliability is itself unquantified.

minor comments (1)

[Abstract] Abstract contains minor typographical issues (e.g., 'VLMbased' should be 'VLM-based').

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive criticism. We address each of the major comments below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Dataset / Evaluation setup] Dataset construction (implied in abstract and methods): The central claims rest on human annotations of 33,577 samples serving as reliable ground truth for whether driving behavior is 'reasonable' in context. No inter-annotator agreement statistics, annotation guidelines, or validation against objective criteria are described; given the inherent subjectivity of reasonableness (e.g., aggressive yet safe maneuvers), this undermines assessment of whether the reported AUC and percentage gains reflect genuine improvement rather than annotation noise or bias.

Authors: We agree that details on the annotation process are essential for validating the ground truth. The current manuscript provides limited information on this. In the revision, we will include a dedicated subsection on dataset construction that details the annotation guidelines, the number of annotators per sample if multiple, inter-annotator agreement metrics (such as Cohen's kappa if computed), and any steps taken to mitigate subjectivity. If certain statistics were not collected during the original annotation, we will explicitly state this as a limitation and discuss its implications. revision: yes
Referee: [Methods / DriveJudge description] Methods (abstract and § on DriveJudge architecture): The performance numbers are presented without any description of the VLM model, training procedure, prompting strategy for context interpretation, or the precise deterministic rule functions and invocation logic. These omissions make it impossible to determine whether the 21.23 AUC and 6.5% gains arise from the proposed hybrid approach or from unspecified implementation choices.

Authors: We acknowledge the lack of implementation details in the submitted manuscript. We will revise the methods section to provide a comprehensive description of the VLM model employed (including version and parameters), the prompting strategies used for context interpretation, the training procedure if any fine-tuning was performed, and the exact deterministic rule functions along with the logic for their selective invocation. This will allow readers to better understand and potentially reproduce the hybrid approach. revision: yes
Referee: [Results / Experiments] Results (abstract): The reported improvements lack any mention of statistical significance testing, confidence intervals, ablation studies, or error analysis. Without these, the robustness of the gains over EPDMS and DriveCritic cannot be evaluated, especially on a benchmark whose ground truth reliability is itself unquantified.

Authors: We will enhance the results section by adding statistical significance testing (e.g., paired t-tests or McNemar's test where appropriate), confidence intervals for the reported metrics, ablation studies on the components of DriveJudge (such as the contribution of VLM reasoning versus rule functions), and an error analysis to identify cases where the model underperforms. These additions will help demonstrate the robustness of the improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical ML system (DriveJudge) that combines VLM reasoning with deterministic rules, evaluated on a curated human-annotated dataset of 33,577 samples for two benchmark tasks. No equations, derivations, fitted parameters presented as predictions, or first-principles claims appear in the provided text. Performance numbers (21.23 AUC, 6.5%) are direct empirical comparisons against external baselines and human labels on held-out data. No self-citation chains, ansatzes, or renamings reduce any central result to its own inputs by construction. This is a standard empirical evaluation paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach implicitly treats human annotations as ground truth and assumes VLM context interpretation is sufficiently reliable to gate rule invocation.

pith-pipeline@v0.9.1-grok · 5780 in / 1059 out tokens · 46374 ms · 2026-06-27T02:59:55.020094+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 33 canonical work pages · 15 internal anchors

[1]

Marius Zoellner

Ahmed Abouelazm, Jonas Michel, and J. Marius Zoellner. A review of reward functions for reinforcement learning in the context of autonomous driving, 2026

2026
[2]

Evaluating vision-language models as evaluators in path planning

Mohamed Aghzal, Xiang Yue, Erion Plaku, and Ziyu Yao. Evaluating vision-language models as evaluators in path planning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6886–6897, 2025

2025
[3]

Cosmos-transfer1: Conditional world generation with adaptive multimodal control.arXiv preprint arXiv:2503.14492, 2025

Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, et al. Cosmos-transfer1: Conditional world generation with adaptive multimodal control.arXiv preprint arXiv:2503.14492, 2025

work page arXiv 2025
[4]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InCVPR, pages 11621–11631, 2020. 10

2020
[5]

Pseudo-simulation for autonomous driving

Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Pseudo-simulation for autonomous driving. InConference on Robot Learning (CoRL), 2025

2025
[6]

Criteria: a new benchmarking paradigm for evaluating trajectory prediction models for autonomous driving

Changhe Chen, Mozhgan Pourkeshavarz, and Amir Rasouli. Criteria: a new benchmarking paradigm for evaluating trajectory prediction models for autonomous driving. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 8265–8271. IEEE, 2024

2024
[7]

Eccv 2024 w-coda: 1st workshop on multimodal perception and comprehension of corner cases in autonomous driving.arXiv preprint arXiv:2507.01735, 2025

Kai Chen, Ruiyuan Gao, Lanqing Hong, Hang Xu, Xu Jia, Holger Caesar, Dengxin Dai, Bingbing Liu, Dzmitry Tsishkou, Songcen Xu, et al. Eccv 2024 w-coda: 1st workshop on multimodal perception and comprehension of corner cases in autonomous driving.arXiv preprint arXiv:2507.01735, 2025

work page arXiv 2024
[8]

VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning.arXiv preprint arXiv:2402.13243, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Scaling instruction-finetuned language models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024

2024
[10]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. InNeurIPS, volume 37, pages 28706–28719, 2024

2024
[12]

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Carla: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. pages 1–16, 2017

2017
[14]

Task success is not enough: Investigating the use of video-language models as behavior critics for catching undesirable agent behaviors.arXiv preprint arXiv:2402.04210, 2024

Lin Guan, Yifan Zhou, Denis Liu, Yantian Zha, Heni Ben Amor, and Subbarao Kambhampati. Task success is not enough: Investigating the use of video-language models as behavior critics for catching undesirable agent behaviors.arXiv preprint arXiv:2402.04210, 2024

work page arXiv 2024
[15]

Styledrive: Towards driving-style aware benchmarking of end-to-end autonomous driving

Ruiyang Hao, Bowen Jing, Haibao Yu, and Zaiqing Nie. Styledrive: Towards driving-style aware benchmarking of end-to-end autonomous driving. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 4627–4635, 2026. 11

2026
[16]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InCVPR, pages 17853–17862, 2023

2023
[17]

3d-r1: Enhancing reasoning in 3d vlms for unified scene understanding.arXiv preprint arXiv:2507.23478, 2025

Ting Huang, Zeyu Zhang, and Hao Tang. 3d-r1: Enhancing reasoning in 3d vlms for unified scene understanding.arXiv preprint arXiv:2507.23478, 2025

work page arXiv 2025
[18]

Bench2drive: Towards multi- ability benchmarking of closed-loop end-to-end autonomous driving.NeurIPS, 37:819–844, 2024

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi- ability benchmarking of closed-loop end-to-end autonomous driving.NeurIPS, 37:819–844, 2024

2024
[19]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[20]

Rewarding dino: Predicting dense rewards with vision foundation models.arXiv preprint arXiv:2603.16978, 2026

Pierre Krack, Tobias Jülg, Wolfram Burgard, and Florian Walter. Rewarding dino: Predicting dense rewards with vision foundation models.arXiv preprint arXiv:2603.16978, 2026

work page arXiv 2026
[21]

Explaining human preferences via metrics for structured 3d reconstruction

Jack Langerman, Denys Rozumnyi, Yuzhong Huang, and Dmytro Mishkin. Explaining human preferences via metrics for structured 3d reconstruction. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 26944–26953, 2025

2025
[22]

Roboreward: General-purpose vision-language reward models for robotics.arXiv preprint arXiv:2601.00675, 2026

Tony Lee, Andrew Wagenmaker, Karl Pertsch, Percy Liang, Sergey Levine, and Chelsea Finn. Roboreward: General-purpose vision-language reward models for robotics.arXiv preprint arXiv:2601.00675, 2026

work page arXiv 2026
[23]

Coda: A real-world road corner case dataset for object detection in autonomous driving

Kaican Li, Kai Chen, Haoyu Wang, Lanqing Hong, Chaoqiang Ye, Jianhua Han, Yukuai Chen, Wei Zhang, Chunjing Xu, Dit-Yan Yeung, et al. Coda: A real-world road corner case dataset for object detection in autonomous driving. InEuropean Conference on Computer Vision, pages 406–423. Springer, 2022

2022
[24]

Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking

Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. 2026

2026
[25]

SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, and Andreas Zell. Spacedrive: Infusing spatial awareness into vlm-based autonomous driving.arXiv preprint arXiv:2512.10719, 2, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning.IEEE transactions on pattern analysis and machine intelligence, 45(3):3461–3475, 2022

Quanyi Li, Zhenghao Peng, Lan Feng, Qihang Zhang, Zhenghai Xue, and Bolei Zhou. Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning.IEEE transactions on pattern analysis and machine intelligence, 45(3):3461–3475, 2022

2022
[27]

Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, et al. Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Zhenxin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Zuxuan Wu, and Jose M. Alvarez. Hydra-next: Robust closed-loop driving with open-loop training. InICCV, pages 27305–27314, October 2025

2025
[29]

Ztrs: Zero-imitation end-to-end autonomous driving with trajectory scoring.arXiv preprint arXiv:2510.24108, 2025

Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Jingde Chen, Nadine Chang, Maying Shen, Jingyu Song, Zuxuan Wu, Shiyi Lan, et al. Ztrs: Zero-imitation end-to-end autonomous driving with trajectory scoring.arXiv preprint arXiv:2510.24108, 2025

work page arXiv 2025
[30]

Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Joshua Chen, Nadine Chang, Maying Shen, Zuxuan Wu, Shiyi Lan, and Jose M. Alvarez. Generalized trajectory scoring for end-to-end multimodal planning. arXiv preprint arXiv:2506.06664, 2025

work page arXiv 2025
[31]

Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S Huang, Luke Zettlemoyer, Dieter Fox, et al. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons.arXiv preprint arXiv:2603.02115, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[32]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, and Xinggang Wang. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InCVPR, June 2025

2025
[33]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

2023
[34]

Dapo: Improving multi-step reasoning abilities of large language models with direct advantage-based policy optimization

Jiacai Liu, Chaojie Wang, Chris Yuhao Liu, Liang Zeng, Rui Yan, Yiwen Sun, and Yang Liu. Dapo: Improving multi-step reasoning abilities of large language models with direct advantage-based policy optimization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 12

2025
[35]

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, et al. Gdpo: Group reward-decoupled normalization policy optimization for multi-reward rl optimization.arXiv preprint arXiv:2601.05242, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[36]

Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, et al. Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

work page arXiv 2026
[37]

Alpasim: A modular, lightweight, and data-driven research simulator for autonomous driving, October 2025

NVIDIA, Yulong Cao, Riccardo de Lutio, Sanja Fidler, Guillermo Garcia Cobo, Zan Gojcic, Maximilian Igl, Boris Ivanovic, Peter Karkus, Janick Martinez Esturo, Marco Pavone, Aaron Smith, Ellie Tanimura, Michal Tyszkiewicz, Michael Watson, Qi Wu, and Le Zhang. Alpasim: A modular, lightweight, and data-driven research simulator for autonomous driving, October 2025

2025
[38]

PhysicalAI-Autonomous-Vehicles, October 2025

NVIDIA Corporation. PhysicalAI-Autonomous-Vehicles, October 2025. Dataset hosted on Hugging Face. Accessed 2026-05-05

2025
[39]

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022
[41]

Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042, 2025

Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042, 2025. 13

work page arXiv 2025
[42]

PARC: A quantitative framework uncovering the symmetries within vision language models

Jenny Schmalfuss, Nadine Chang, Vibashan VS, Maying Shen, Andres Bruhn, and Jose M Alvarez. PARC: A quantitative framework uncovering the symmetries within vision language models. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 25081–25091, 2025

2025
[43]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[44]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Drivecritic: Towards context-aware, human-aligned evaluation for autonomous driving with vision-language models.ICRA, 2026

Jingyu Song, Zhenxin Li, Shiyi Lan, Xinglong Sun, Nadine Chang, Maying Shen, Joshua Chen, Kather- ine A Skinner, and Jose M Alvarez. Drivecritic: Towards context-aware, human-aligned evaluation for autonomous driving with vision-language models.ICRA, 2026

2026
[46]

Vggdrive: Empowering vision-language models with cross-view geometric grounding for autonomous driving.arXiv preprint arXiv:2602.20794, 2026

Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, and Long Chen. Vggdrive: Empowering vision-language models with cross-view geometric grounding for autonomous driving.arXiv preprint arXiv:2602.20794, 2026

work page arXiv 2026
[47]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[48]

Large reward models: Gen- eralizable online robot reward generation with vision-language models.arXiv preprint arXiv:2603.16065, 2026

Yanru Wu, Weiduo Yuan, Ang Qi, Vitor Guizilini, Jiageng Mao, and Yue Wang. Large reward models: Gen- eralizable online robot reward generation with vision-language models.arXiv preprint arXiv:2603.16065, 2026

work page arXiv 2026
[49]

Text2reward: Reward shaping with language models for reinforcement learning.arXiv preprint arXiv:2309.11489, 2023

Tianbao Xie, Siheng Zhao, Chen Henry Wu, Yitao Liu, Qian Luo, Victor Zhong, Yanchao Yang, and Tao Yu. Text2reward: Reward shaping with language models for reinforcement learning.arXiv preprint arXiv:2309.11489, 2023

work page arXiv 2023
[50]

Phycritic: Multimodal critic models for physical ai.arXiv preprint arXiv:2602.11124, 2026

Tianyi Xiong, Shihao Wang, Guilin Liu, Yi Dong, Ming Li, Heng Huang, Jan Kautz, and Zhiding Yu. Phycritic: Multimodal critic models for physical ai.arXiv preprint arXiv:2602.11124, 2026

work page arXiv 2026
[51]

Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025

Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Ekaterina Tolstaya, Sarah Tang, Brandyn White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025

work page arXiv 2025
[52]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[53]

Alvarez, and Zuxuan Wu

Wenhao Yao, Zhenxin Li, Shiyi Lan, Zi Wang, Xinglong Sun, Jose M. Alvarez, and Zuxuan Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning.arXiv preprint arXiv:2506.06659, 2025

work page arXiv 2025
[54]

HAD: Combining Hierarchical Diffusion with Metric-Decoupled RL for End-to-End Driving

Wenhao Yao, Xinglong Sun, Zhenxin Li, Shiyi Lan, Zi Wang, Jose M Alvarez, and Zuxuan Wu. Had: Combining hierarchical diffusion with metric-decoupled rl for end-to-end driving.arXiv preprint arXiv:2604.03581, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[55]

Critic in the loop: A tri-system vla framework for robust long-horizon manipulation.arXiv preprint arXiv:2603.05185, 2026

Pengfei Yi, Yingjie Ma, Wenjiang Xu, Yanan Hao, Shuai Gan, Wanting Li, and Shanlin Zhong. Critic in the loop: A tri-system vla framework for robust long-horizon manipulation.arXiv preprint arXiv:2603.05185, 2026

work page arXiv 2026
[56]

Critic-v: Vlm critics help catch vlm errors in multimodal reasoning

Di Zhang, Jingdi Lei, Junxian Li, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, Weida Wang, Suorong Yang, Jianbo Wu, et al. Critic-v: Vlm critics help catch vlm errors in multimodal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9050–9061, 2025

2025
[57]

Vq-insight: Teaching vlms for ai-generated video quality understanding via progressive visual reinforcement learning

Xuanyu Zhang, Weiqi Li, Shijie Zhao, Junlin Li, Li Zhang, and Jian Zhang. Vq-insight: Teaching vlms for ai-generated video quality understanding via progressive visual reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 12870–12878, 2026. 14

2026
[58]

Fine-Tuning Language Models from Human Preferences

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences.arXiv preprint arXiv:1909.08593, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[59]

too strict

Jialv Zou, Shaoyu Chen, Bencheng Liao, Zhiyu Zheng, Yuehao Song, Lefei Zhang, Qian Zhang, Wenyu Liu, and Xinggang Wang. Diffusiondrivev2: Reinforcement learning-constrained truncated diffusion modeling in end-to-end autonomous driving.arXiv preprint arXiv:2512.07745, 2025. 15 A Model Details A.1 Prompts System : You are an expert AI driving i n s t r u c ...

work page arXiv 2025

[1] [1]

Marius Zoellner

Ahmed Abouelazm, Jonas Michel, and J. Marius Zoellner. A review of reward functions for reinforcement learning in the context of autonomous driving, 2026

2026

[2] [2]

Evaluating vision-language models as evaluators in path planning

Mohamed Aghzal, Xiang Yue, Erion Plaku, and Ziyu Yao. Evaluating vision-language models as evaluators in path planning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6886–6897, 2025

2025

[3] [3]

Cosmos-transfer1: Conditional world generation with adaptive multimodal control.arXiv preprint arXiv:2503.14492, 2025

Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, et al. Cosmos-transfer1: Conditional world generation with adaptive multimodal control.arXiv preprint arXiv:2503.14492, 2025

work page arXiv 2025

[4] [4]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InCVPR, pages 11621–11631, 2020. 10

2020

[5] [5]

Pseudo-simulation for autonomous driving

Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Pseudo-simulation for autonomous driving. InConference on Robot Learning (CoRL), 2025

2025

[6] [6]

Criteria: a new benchmarking paradigm for evaluating trajectory prediction models for autonomous driving

Changhe Chen, Mozhgan Pourkeshavarz, and Amir Rasouli. Criteria: a new benchmarking paradigm for evaluating trajectory prediction models for autonomous driving. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 8265–8271. IEEE, 2024

2024

[7] [7]

Eccv 2024 w-coda: 1st workshop on multimodal perception and comprehension of corner cases in autonomous driving.arXiv preprint arXiv:2507.01735, 2025

Kai Chen, Ruiyuan Gao, Lanqing Hong, Hang Xu, Xu Jia, Holger Caesar, Dengxin Dai, Bingbing Liu, Dzmitry Tsishkou, Songcen Xu, et al. Eccv 2024 w-coda: 1st workshop on multimodal perception and comprehension of corner cases in autonomous driving.arXiv preprint arXiv:2507.01735, 2025

work page arXiv 2024

[8] [8]

VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning.arXiv preprint arXiv:2402.13243, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Scaling instruction-finetuned language models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024

2024

[10] [10]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. InNeurIPS, volume 37, pages 28706–28719, 2024

2024

[12] [12]

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Carla: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. pages 1–16, 2017

2017

[14] [14]

Task success is not enough: Investigating the use of video-language models as behavior critics for catching undesirable agent behaviors.arXiv preprint arXiv:2402.04210, 2024

Lin Guan, Yifan Zhou, Denis Liu, Yantian Zha, Heni Ben Amor, and Subbarao Kambhampati. Task success is not enough: Investigating the use of video-language models as behavior critics for catching undesirable agent behaviors.arXiv preprint arXiv:2402.04210, 2024

work page arXiv 2024

[15] [15]

Styledrive: Towards driving-style aware benchmarking of end-to-end autonomous driving

Ruiyang Hao, Bowen Jing, Haibao Yu, and Zaiqing Nie. Styledrive: Towards driving-style aware benchmarking of end-to-end autonomous driving. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 4627–4635, 2026. 11

2026

[16] [16]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InCVPR, pages 17853–17862, 2023

2023

[17] [17]

3d-r1: Enhancing reasoning in 3d vlms for unified scene understanding.arXiv preprint arXiv:2507.23478, 2025

Ting Huang, Zeyu Zhang, and Hao Tang. 3d-r1: Enhancing reasoning in 3d vlms for unified scene understanding.arXiv preprint arXiv:2507.23478, 2025

work page arXiv 2025

[18] [18]

Bench2drive: Towards multi- ability benchmarking of closed-loop end-to-end autonomous driving.NeurIPS, 37:819–844, 2024

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi- ability benchmarking of closed-loop end-to-end autonomous driving.NeurIPS, 37:819–844, 2024

2024

[19] [19]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[20] [20]

Rewarding dino: Predicting dense rewards with vision foundation models.arXiv preprint arXiv:2603.16978, 2026

Pierre Krack, Tobias Jülg, Wolfram Burgard, and Florian Walter. Rewarding dino: Predicting dense rewards with vision foundation models.arXiv preprint arXiv:2603.16978, 2026

work page arXiv 2026

[21] [21]

Explaining human preferences via metrics for structured 3d reconstruction

Jack Langerman, Denys Rozumnyi, Yuzhong Huang, and Dmytro Mishkin. Explaining human preferences via metrics for structured 3d reconstruction. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 26944–26953, 2025

2025

[22] [22]

Roboreward: General-purpose vision-language reward models for robotics.arXiv preprint arXiv:2601.00675, 2026

Tony Lee, Andrew Wagenmaker, Karl Pertsch, Percy Liang, Sergey Levine, and Chelsea Finn. Roboreward: General-purpose vision-language reward models for robotics.arXiv preprint arXiv:2601.00675, 2026

work page arXiv 2026

[23] [23]

Coda: A real-world road corner case dataset for object detection in autonomous driving

Kaican Li, Kai Chen, Haoyu Wang, Lanqing Hong, Chaoqiang Ye, Jianhua Han, Yukuai Chen, Wei Zhang, Chunjing Xu, Dit-Yan Yeung, et al. Coda: A real-world road corner case dataset for object detection in autonomous driving. InEuropean Conference on Computer Vision, pages 406–423. Springer, 2022

2022

[24] [24]

Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking

Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. 2026

2026

[25] [25]

SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, and Andreas Zell. Spacedrive: Infusing spatial awareness into vlm-based autonomous driving.arXiv preprint arXiv:2512.10719, 2, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning.IEEE transactions on pattern analysis and machine intelligence, 45(3):3461–3475, 2022

Quanyi Li, Zhenghao Peng, Lan Feng, Qihang Zhang, Zhenghai Xue, and Bolei Zhou. Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning.IEEE transactions on pattern analysis and machine intelligence, 45(3):3461–3475, 2022

2022

[27] [27]

Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, et al. Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

Zhenxin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Zuxuan Wu, and Jose M. Alvarez. Hydra-next: Robust closed-loop driving with open-loop training. InICCV, pages 27305–27314, October 2025

2025

[29] [29]

Ztrs: Zero-imitation end-to-end autonomous driving with trajectory scoring.arXiv preprint arXiv:2510.24108, 2025

Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Jingde Chen, Nadine Chang, Maying Shen, Jingyu Song, Zuxuan Wu, Shiyi Lan, et al. Ztrs: Zero-imitation end-to-end autonomous driving with trajectory scoring.arXiv preprint arXiv:2510.24108, 2025

work page arXiv 2025

[30] [30]

Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Joshua Chen, Nadine Chang, Maying Shen, Zuxuan Wu, Shiyi Lan, and Jose M. Alvarez. Generalized trajectory scoring for end-to-end multimodal planning. arXiv preprint arXiv:2506.06664, 2025

work page arXiv 2025

[31] [31]

Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S Huang, Luke Zettlemoyer, Dieter Fox, et al. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons.arXiv preprint arXiv:2603.02115, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[32] [32]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, and Xinggang Wang. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InCVPR, June 2025

2025

[33] [33]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

2023

[34] [34]

Dapo: Improving multi-step reasoning abilities of large language models with direct advantage-based policy optimization

Jiacai Liu, Chaojie Wang, Chris Yuhao Liu, Liang Zeng, Rui Yan, Yiwen Sun, and Yang Liu. Dapo: Improving multi-step reasoning abilities of large language models with direct advantage-based policy optimization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 12

2025

[35] [35]

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, et al. Gdpo: Group reward-decoupled normalization policy optimization for multi-reward rl optimization.arXiv preprint arXiv:2601.05242, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[36] [36]

Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, et al. Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

work page arXiv 2026

[37] [37]

Alpasim: A modular, lightweight, and data-driven research simulator for autonomous driving, October 2025

NVIDIA, Yulong Cao, Riccardo de Lutio, Sanja Fidler, Guillermo Garcia Cobo, Zan Gojcic, Maximilian Igl, Boris Ivanovic, Peter Karkus, Janick Martinez Esturo, Marco Pavone, Aaron Smith, Ellie Tanimura, Michal Tyszkiewicz, Michael Watson, Qi Wu, and Le Zhang. Alpasim: A modular, lightweight, and data-driven research simulator for autonomous driving, October 2025

2025

[38] [38]

PhysicalAI-Autonomous-Vehicles, October 2025

NVIDIA Corporation. PhysicalAI-Autonomous-Vehicles, October 2025. Dataset hosted on Hugging Face. Accessed 2026-05-05

2025

[39] [39]

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[40] [40]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022

[41] [41]

Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042, 2025

Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042, 2025. 13

work page arXiv 2025

[42] [42]

PARC: A quantitative framework uncovering the symmetries within vision language models

Jenny Schmalfuss, Nadine Chang, Vibashan VS, Maying Shen, Andres Bruhn, and Jose M Alvarez. PARC: A quantitative framework uncovering the symmetries within vision language models. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 25081–25091, 2025

2025

[43] [43]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[44] [44]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[45] [45]

Drivecritic: Towards context-aware, human-aligned evaluation for autonomous driving with vision-language models.ICRA, 2026

Jingyu Song, Zhenxin Li, Shiyi Lan, Xinglong Sun, Nadine Chang, Maying Shen, Joshua Chen, Kather- ine A Skinner, and Jose M Alvarez. Drivecritic: Towards context-aware, human-aligned evaluation for autonomous driving with vision-language models.ICRA, 2026

2026

[46] [46]

Vggdrive: Empowering vision-language models with cross-view geometric grounding for autonomous driving.arXiv preprint arXiv:2602.20794, 2026

Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, and Long Chen. Vggdrive: Empowering vision-language models with cross-view geometric grounding for autonomous driving.arXiv preprint arXiv:2602.20794, 2026

work page arXiv 2026

[47] [47]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[48] [48]

Large reward models: Gen- eralizable online robot reward generation with vision-language models.arXiv preprint arXiv:2603.16065, 2026

Yanru Wu, Weiduo Yuan, Ang Qi, Vitor Guizilini, Jiageng Mao, and Yue Wang. Large reward models: Gen- eralizable online robot reward generation with vision-language models.arXiv preprint arXiv:2603.16065, 2026

work page arXiv 2026

[49] [49]

Text2reward: Reward shaping with language models for reinforcement learning.arXiv preprint arXiv:2309.11489, 2023

Tianbao Xie, Siheng Zhao, Chen Henry Wu, Yitao Liu, Qian Luo, Victor Zhong, Yanchao Yang, and Tao Yu. Text2reward: Reward shaping with language models for reinforcement learning.arXiv preprint arXiv:2309.11489, 2023

work page arXiv 2023

[50] [50]

Phycritic: Multimodal critic models for physical ai.arXiv preprint arXiv:2602.11124, 2026

Tianyi Xiong, Shihao Wang, Guilin Liu, Yi Dong, Ming Li, Heng Huang, Jan Kautz, and Zhiding Yu. Phycritic: Multimodal critic models for physical ai.arXiv preprint arXiv:2602.11124, 2026

work page arXiv 2026

[51] [51]

Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025

Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Ekaterina Tolstaya, Sarah Tang, Brandyn White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025

work page arXiv 2025

[52] [52]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[53] [53]

Alvarez, and Zuxuan Wu

Wenhao Yao, Zhenxin Li, Shiyi Lan, Zi Wang, Xinglong Sun, Jose M. Alvarez, and Zuxuan Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning.arXiv preprint arXiv:2506.06659, 2025

work page arXiv 2025

[54] [54]

HAD: Combining Hierarchical Diffusion with Metric-Decoupled RL for End-to-End Driving

Wenhao Yao, Xinglong Sun, Zhenxin Li, Shiyi Lan, Zi Wang, Jose M Alvarez, and Zuxuan Wu. Had: Combining hierarchical diffusion with metric-decoupled rl for end-to-end driving.arXiv preprint arXiv:2604.03581, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[55] [55]

Critic in the loop: A tri-system vla framework for robust long-horizon manipulation.arXiv preprint arXiv:2603.05185, 2026

Pengfei Yi, Yingjie Ma, Wenjiang Xu, Yanan Hao, Shuai Gan, Wanting Li, and Shanlin Zhong. Critic in the loop: A tri-system vla framework for robust long-horizon manipulation.arXiv preprint arXiv:2603.05185, 2026

work page arXiv 2026

[56] [56]

Critic-v: Vlm critics help catch vlm errors in multimodal reasoning

Di Zhang, Jingdi Lei, Junxian Li, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, Weida Wang, Suorong Yang, Jianbo Wu, et al. Critic-v: Vlm critics help catch vlm errors in multimodal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9050–9061, 2025

2025

[57] [57]

Vq-insight: Teaching vlms for ai-generated video quality understanding via progressive visual reinforcement learning

Xuanyu Zhang, Weiqi Li, Shijie Zhao, Junlin Li, Li Zhang, and Jian Zhang. Vq-insight: Teaching vlms for ai-generated video quality understanding via progressive visual reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 12870–12878, 2026. 14

2026

[58] [58]

Fine-Tuning Language Models from Human Preferences

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences.arXiv preprint arXiv:1909.08593, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909

[59] [59]

too strict

Jialv Zou, Shaoyu Chen, Bencheng Liao, Zhiyu Zheng, Yuehao Song, Lefei Zhang, Qian Zhang, Wenyu Liu, and Xinggang Wang. Diffusiondrivev2: Reinforcement learning-constrained truncated diffusion modeling in end-to-end autonomous driving.arXiv preprint arXiv:2512.07745, 2025. 15 A Model Details A.1 Prompts System : You are an expert AI driving i n s t r u c ...

work page arXiv 2025