VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving

arxiv: 2605.20082 · v1 · pith:LH3UPC5M · submitted 2026-05-19 · 💻 cs.CV · cs.AI

Zhefan Xu, Ghassen Jerfel, Marina Haliem, Qi Zhao, Jeonhyung Kang, Khaled S. Refaat

Pith reviewed 2026-05-20 05:34 UTC · model grok-4.3

keywords: vision-language models · autonomous driving · preference optimization · motion forecasting · direct preference optimization · Waymo dataset

The pith

Vision-language models generate preference data to finetune autonomous driving forecasts for human alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes using a vision-language model to automatically label preferred driving trajectories from a pretrained model's outputs. These labels form preference pairs for finetuning via Direct Preference Optimization. This matters for autonomous driving because it captures nuanced human preferences that standard imitation learning might overlook, leading to forecasts that better match what people would choose. The approach is tested on the Waymo dataset and shows gains in human-aligned metrics.

Core claim

By treating the vision-language model as a zero-shot reasoner, VL-DPO creates preference pairs from rollouts and applies DPO to produce a model with 11.94% higher rater feedback score and 10.01% lower average displacement error than the pretrained version, while confirming the VLM selections match human annotations.

What carries the argument

The VL-DPO pipeline, in which a vision-language model selects between trajectory options to build preference pairs for Direct Preference Optimization of motion forecasting models.
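
As a rough illustration of the objective this pipeline optimizes, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. The function name `dpo_loss`, the `beta` value, and the example log-probabilities are assumptions made for illustration; this summary does not give the paper's trajectory tokenization or hyperparameters.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    Each argument is the total log-probability of a trajectory under the
    finetuned policy or the frozen pretrained reference model.
    beta=0.1 is an illustrative value, not taken from the paper.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); pick the numerically stable branch
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: the loss equals log(2)
start = dpo_loss(-12.0, -12.5, -12.0, -12.5)
# Policy shifts probability toward the chosen trajectory: the loss drops
improved = dpo_loss(-11.0, -13.5, -12.0, -12.5)
```

The loss depends only on how far the policy has moved relative to the reference on each trajectory, which is why no separate reward model is needed.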

If this is right

The finetuned model produces trajectories that humans rate more highly.
Average displacement from preferred paths decreases.
The method avoids the need for extensive new human labeling by leveraging VLM reasoning.
It demonstrates that preference optimization can enhance pretrained models in complex real-world tasks like driving.
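
For readers unfamiliar with the displacement metric, here is a minimal sketch of average displacement error (ADE) and of a signed percent change relative to a baseline. The helper names and the toy waypoints are hypothetical; the paper's exact ADE protocol (reference trajectory, horizon, sampling rate) is not specified in this summary.

```python
import math


def average_displacement_error(pred, ref):
    """Mean Euclidean distance between matched (x, y) waypoints."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("trajectories must be non-empty and equal length")
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)


def relative_change_pct(baseline, finetuned):
    """Signed percent change of a metric relative to the baseline."""
    return 100.0 * (finetuned - baseline) / baseline


# Toy waypoints, made up for illustration
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pretrained = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
finetuned = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]

ade_pre = average_displacement_error(pretrained, reference)  # (0 + 1 + 2) / 3
ade_ft = average_displacement_error(finetuned, reference)    # (0 + 0.5 + 1) / 3
change = relative_change_pct(ade_pre, ade_ft)                # negative means lower error
```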

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This framework might extend to other domains where VLMs can proxy preferences, such as robotics or game AI.
Stronger VLMs could lead to even better alignment results in future iterations.
It highlights a path to make AI systems more intuitively aligned without massive supervised preference datasets.

Load-bearing premise

The vision-language model's trajectory selection accurately reflects human preferences.

What would settle it

Collecting new human preference labels on the same rollouts and finding that they disagree with the VLM's choices on a majority of cases would disprove the proxy quality.
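
One way such a check could be scored is with a raw agreement rate plus Cohen's kappa, which corrects for chance agreement. The sketch below assumes each scene yields one chosen trajectory index from the VLM and one from a human rater; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def agreement_rate(vlm, human):
    """Fraction of scenes where the VLM and the human picked the same index."""
    return sum(a == b for a, b in zip(vlm, human)) / len(vlm)


def cohens_kappa(vlm, human):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(vlm)
    p_o = agreement_rate(vlm, human)
    vlm_counts, human_counts = Counter(vlm), Counter(human)
    p_e = sum(vlm_counts[c] * human_counts[c] for c in vlm_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters always chose the same single label
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels: index of the preferred trajectory per scene (made up)
vlm_choice = [0, 1, 1, 0, 1, 0, 0, 1]
human_choice = [0, 1, 0, 0, 1, 0, 1, 1]

rate = agreement_rate(vlm_choice, human_choice)
kappa = cohens_kappa(vlm_choice, human_choice)
```

A kappa near zero would indicate the VLM does no better than chance against the human labels, even if the raw agreement rate looks respectable.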

Figures

Figures reproduced from arXiv: 2605.20082 by Ghassen Jerfel, Jeonhyung Kang, Khaled S. Refaat, Marina Haliem, Qi Zhao, Zhefan Xu.

**Figure 2.** The architecture of MotionLM [2]. It adopts an encoder–decoder …

**Figure 3.** Illustration of the VLM’s Chain-of-Thought (CoT) reasoning process. The VLM takes as input the sequence of image history, top-down view …

**Figure 4.** Example of VLM Chain-of-Thought reasoning. For clarity, the full …

**Figure 5.** Comparison of RFS, avgRFS, mlRFS across the finetuning methods.

**Figure 6.** Comparison of central-mode trajectory prediction plots in top-down images from MotionLM [2], the imitation learning–finetuned model, and our …

Original abstract

The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VL-DPO uses a VLM to auto-generate DPO pairs for driving model finetuning and reports an 11.94% RFS gain and a 10.01% ADE reduction against held-out human annotations, but the claim stands or falls on the proxy validation.

VL-DPO shows how to use a VLM to auto-generate DPO pairs for finetuning driving models, with measurable gains on human preference metrics on the Waymo Open End-to-End Driving Dataset. The new part is treating the VLM as a zero-shot reasoner that picks preferred and rejected trajectories from the pretrained model's rollouts. This avoids manual labeling of the preference data and applies the DPO objective directly to the motion forecasting task. After finetuning they get an 11.94% lift in RFS and a 10.01% drop in ADE, evaluated on held-out human annotations.

The paper does a decent job of grounding the evaluation in external human ratings rather than just model metrics. The claim that the VLM acts as a high-quality proxy is backed by an alignment check in the experiments, which the story needs in order to hold.

The soft spot is still that proxy step. If VLM-human agreement is not particularly strong, or if it correlates with other factors such as trajectory smoothness, the observed improvements might not reflect better preference alignment specifically. DPO can regularize models in various ways, so isolating the effect matters. I'd look for ablations that compare against random pairs or other selection methods.

This paper is for people working on end-to-end autonomous driving and preference-based alignment techniques. A reader in that area could pick up the method and test it on their own setups. I would recommend sending it for peer review. The application is timely and the results are reported clearly enough to get useful comments.
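
To make the random-pair ablation concrete, the sketch below builds preference pairs two ways: from a VLM-selected index and from a random control with no preference signal. Pairing the preferred rollout against every other rollout is an assumption for illustration; the paper's actual pair construction is not described in this summary, and all names here are hypothetical.

```python
import random


def vlm_pairs(rollouts, preferred_idx):
    """Pair the VLM-preferred rollout against every other rollout."""
    chosen = rollouts[preferred_idx]
    return [(chosen, r) for i, r in enumerate(rollouts) if i != preferred_idx]


def random_pairs(rollouts, n_pairs, seed=0):
    """Control: (chosen, rejected) pairs drawn without any preference signal."""
    rng = random.Random(seed)
    return [tuple(rng.sample(rollouts, 2)) for _ in range(n_pairs)]


# Hypothetical rollout identifiers for one scene
scene_rollouts = ["traj_a", "traj_b", "traj_c", "traj_d"]
pairs = vlm_pairs(scene_rollouts, preferred_idx=2)
control = random_pairs(scene_rollouts, n_pairs=len(pairs))
```

Finetuning once on `pairs` and once on `control`, with everything else held fixed, would help separate the effect of the VLM's choices from generic regularization by DPO.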

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VL-DPO, a framework that uses a vision-language model (VLM) as a zero-shot reasoner to generate preference pairs (preferred/rejected trajectories) from rollouts of a pretrained motion forecasting model on the Waymo Open End-to-End Driving Dataset (WOD-E2E). These pairs are then used to fine-tune the model via Direct Preference Optimization (DPO). The authors evaluate on held-out human preference annotations and report that VL-DPO achieves an 11.94% increase in rater feedback score (RFS) and a 10.01% reduction in average displacement error (ADE) relative to the pretrained baseline, while claiming that experiments confirm the VLM selections serve as a high-quality proxy for human preferences.

Significance. If the VLM proxy validation holds with strong quantitative support, the work could provide a scalable method for aligning autonomous driving motion models with nuanced human preferences using existing VLMs, reducing reliance on large-scale manual annotations. The approach builds on DPO and VLM reasoning capabilities in a concrete application domain, and the reported metric gains on held-out annotations would be a useful empirical result if reproducible.

major comments (2)

Experiments section: The central claim that 'the VLM's trajectory selection is a high-quality proxy for human preference' is load-bearing for interpreting the 11.94% RFS and 10.01% ADE gains as genuine preference alignment rather than regularization or dataset effects. The manuscript must provide concrete quantitative evidence here, such as exact selection accuracy against held-out human annotations, inter-rater agreement (e.g., Cohen's kappa), or a statistical test comparing VLM choices to human raters; without these numbers and controls for VLM biases on driving scenes, the proxy quality cannot be assessed as 'high-quality.'
§3 (Method) and §4 (Experiments): The DPO objective is applied to VLM-generated pairs, but it is unclear how trajectory rollouts are sampled, how the VLM prompt elicits preferences, and whether any post-hoc filtering or multiple VLM queries are used. If these choices are not fixed before seeing the held-out human annotations, the reported gains risk being influenced by evaluation setup rather than the method itself.

minor comments (2)

Abstract and §4: Clarify the exact definition and computation of RFS (rater feedback score) and whether it is normalized; also report standard deviations or confidence intervals on the 11.94% and 10.01% figures.
Related work: Add citations to recent VLM-based preference alignment works outside driving (e.g., in robotics or general RLHF) to better contextualize the novelty of applying DPO with VLMs to motion forecasting.
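
The request for confidence intervals could be met with a paired percentile bootstrap over scenes. The following is a minimal sketch under that assumption; the per-scene scores are made up, the score scale is not the paper's RFS scale, and `bootstrap_ci` is a hypothetical helper.

```python
import random


def bootstrap_ci(baseline, finetuned, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the percent change in the mean score.

    Scenes are resampled with replacement as paired units, so the baseline
    and finetuned scores for a scene always stay together.
    """
    rng = random.Random(seed)
    n = len(baseline)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        base_mean = sum(baseline[i] for i in idx) / n
        fine_mean = sum(finetuned[i] for i in idx) / n
        stats.append(100.0 * (fine_mean - base_mean) / base_mean)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Made-up per-scene scores, for illustration only
base_scores = [6.0, 7.0, 5.5, 8.0, 6.5, 7.5, 6.0, 7.0]
fine_scores = [7.0, 7.5, 6.5, 8.5, 7.0, 8.5, 6.5, 8.0]
low, high = bootstrap_ci(base_scores, fine_scores)
```

An interval that excludes zero would support the claim that the reported percentage gain is not a sampling artifact.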

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the presentation of our results. We address the major comments point by point below. Where the manuscript lacks sufficient detail or direct evidence, we will revise accordingly while preserving the original experimental outcomes.

Referee: Experiments section: The central claim that 'the VLM's trajectory selection is a high-quality proxy for human preference' is load-bearing for interpreting the 11.94% RFS and 10.01% ADE gains as genuine preference alignment rather than regularization or dataset effects. The manuscript must provide concrete quantitative evidence here, such as exact selection accuracy against held-out human annotations, inter-rater agreement (e.g., Cohen's kappa), or a statistical test comparing VLM choices to human raters; without these numbers and controls for VLM biases on driving scenes, the proxy quality cannot be assessed as 'high-quality.'

Authors: We agree that direct quantitative metrics would make the proxy claim more robust. The current manuscript relies on downstream gains on held-out human annotations as supporting evidence. In the revision we will add a dedicated subsection and table reporting (1) the exact match rate between VLM-selected preferred trajectories and human rater choices on the held-out set, (2) Cohen's kappa for inter-rater agreement between VLM and humans, and (3) a chi-squared test against a random baseline. We will also include a short discussion of observed VLM biases in driving scenes (e.g., over-caution in certain merge scenarios) with qualitative examples. (Revision: yes)
Referee: §3 (Method) and §4 (Experiments): The DPO objective is applied to VLM-generated pairs, but it is unclear how trajectory rollouts are sampled, how the VLM prompt elicits preferences, and whether any post-hoc filtering or multiple VLM queries are used. If these choices are not fixed before seeing the held-out human annotations, the reported gains risk being influenced by evaluation setup rather than the method itself.

Authors: We appreciate the call for greater methodological transparency. In the revised §3 we will specify: rollouts are obtained by sampling 8 trajectories per scene from the pretrained model using fixed top-k sampling (k=5, temperature=0.7); the VLM prompt is a single, fixed zero-shot template that instructs the model to reason about safety, comfort, and rule compliance before outputting the preferred index; no post-hoc filtering or multiple queries per pair are performed. All sampling parameters, prompt wording, and pair construction decisions were locked prior to any inspection of the held-out human annotations, as documented in the supplementary material. (Revision: yes)
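
The sampling scheme named in the simulated response (8 rollouts, top-k with k=5, temperature 0.7) can be sketched as follows. These values come from the simulated rebuttal on this page, not from a confirmed reading of the paper, and the toy logits, function names, and step-independent decoding are simplifying assumptions.

```python
import math
import random


def sample_top_k(logits, k=5, temperature=0.7, rng=random):
    """Sample one token index from the k highest logits after temperature scaling."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


def sample_rollouts(step_logits, n_rollouts=8, seed=0):
    """Draw n_rollouts token sequences from per-step logits.

    Simplification: the logits here do not depend on earlier tokens,
    unlike a real autoregressive decoder.
    """
    rng = random.Random(seed)
    return [[sample_top_k(logits, rng=rng) for logits in step_logits]
            for _ in range(n_rollouts)]


# Hypothetical logits over 6 motion tokens for 3 decoding steps
toy_logits = [[2.0, 1.0, 0.5, 0.1, -1.0, -5.0]] * 3
rollouts = sample_rollouts(toy_logits)
```

With k=5, the lowest-scoring token (index 5 here) can never be sampled, and a temperature below 1 concentrates mass on the top-ranked tokens.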

Circularity Check

0 steps flagged

No significant circularity; claims rest on external held-out human annotations

The paper generates preference pairs via zero-shot VLM reasoning on pretrained rollouts, applies DPO finetuning, and reports gains on RFS and ADE measured against held-out human preference annotations. No equations, fitted parameters renamed as predictions, or self-citation chains reduce the reported 11.94% RFS lift or 10.01% ADE reduction to quantities defined by the method's own inputs. The VLM-proxy validation is presented as an experimental confirmation against external human data rather than a definitional or self-referential step, leaving the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard machine-learning assumptions plus one domain-specific premise that the VLM can serve as a reliable proxy for human preference without additional training.

axioms (2)

domain assumption: Direct Preference Optimization can be applied to motion forecasting models to improve alignment with external preference signals.
Invoked when the authors apply DPO to the pretrained driving model using VLM-generated pairs.
domain assumption: Held-out human preference annotations provide an unbiased measure of model quality.
Used to compute RFS and to validate that VLM selections match human judgments.

pith-pipeline@v0.9.0 · 5759 in / 1363 out tokens · 31291 ms · 2026-05-20T05:34:34.876349+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

Paper passage: "Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO)."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear

Paper passage: "VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

[1] X. Huang, E. M. Wolff, P. Vernaza, T. Phan-Minh, H. Chen, D. S. Hayden, M. Edmonds, B. Pierce, X. Chen, P. E. Jacob, X. Chen, C. Tairbekov, P. Agarwal, T. Gao, Y. Chai, and S. Srinivasa, “DriveGPT: Scaling autoregressive behavior models for driving,” in Forty-second International Conference on Machine Learning, 2025.

[2] A. Seff, B. Cera, D. Chen, M. Ng, A. Zhou, N. Nayakanti, K. S. Refaat, R. Al-Rfou, and B. Sapp, “Motionlm: Multi-agent motion forecasting as language modeling,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8579–8590.

[3] N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp, “Wayformer: Motion forecasting via simple & efficient attention networks,” in ICRA, 2023.

[4] B. Varadarajan, A. Hefny, A. Srivastava, K. S. Refaat, N. Nayakanti, A. Cornman, K. Chen, B. Douillard, C. Lam, D. Anguelov, and B. Sapp, “Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction,” in ICRA, 2022.

[5] M. Baniodeh, K. Goel, S. Ettinger, C. Fuertes, A. Seff, T. Shen, C. Gulino, C. Yang, G. Jerfel, D. Choe et al., “Scaling laws of motion forecasting and planning–a technical report,” arXiv preprint arXiv:2506.08228, 2025.

[6] C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” in European conference on computer vision. Springer, 2024.

[7] X. Tian, J. Gu, B. Li, Y. Liu, Y. Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, “DriveVLM: The convergence of autonomous driving and large vision-language models,” in 8th Annual Conference on Robot Learning, 2024.

[8] S. Zhang, W. Huang, Z. Gao, H. Chen, and C. Lv, “Wisead: Knowledge augmented end-to-end autonomous driving with vision-language model,” arXiv preprint arXiv:2412.09951, 2024.

[9] J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, Y. Zhou, J. Guo, D. Anguelov, and M. Tan, “EMMA: End-to-end multimodal model for autonomous driving,” Transactions on Machine Learning Research, 2025.

[10] X. Zhou, X. Han, F. Yang, Y. Ma, and A. C. Knoll, “Opendrivevla: Towards end-to-end autonomous driving with large vision language action model,” arXiv preprint arXiv:2503.23463, 2025.

[11] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning, 2023.

[12] S. Yang, H. Li, Y. Chen, B. Wang, Y. Tian, T. Wang, H. Wang, F. Zhao, Y. Liao, and J. Pang, “Instructvla: Vision-language-action instruction tuning from understanding to manipulation,” arXiv preprint arXiv:2507.17520, 2025.

[13] Z. Zhou, Y. Zhu, M. Zhu, J. Wen, N. Liu, Z. Xu, W. Meng, R. Cheng, Y. Peng, C. Shen et al., “Chatvla: Unified multimodal understanding and robot control with vision-language-action model,” arXiv preprint arXiv:2502.14420, 2025.

[14] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[15] H. Cui, V. Radosavljevic, F.-C. Chou, T.-H. Lin, T. Nguyen, T.-K. Huang, J. Schneider, and N. Djuric, “Multimodal trajectory predictions for autonomous driving using deep convolutional networks,” in IEEE International Conference on Robotics and Automation (ICRA), 2019.

[16] J. Hong, B. Sapp, and J. Philbin, “Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions,” in CVPR, 2019.

[17] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, “Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction,” in Conference on Robot Learning, 2020, pp. 86–99.

[18] S. Casas, C. Gulino, R. Liao, and R. Urtasun, “Spagnn: Spatially-aware graph neural networks for relational behavior forecasting from sensor data,” in ICRA, 2020.

[19] M. Liang, B. Yang, R. Hu, Y. Chen, R. Liao, S. Feng, and R. Urtasun, “Learning lane graph representations for motion forecasting,” in European Conference on Computer Vision, 2020.

[20] S. Shi, L. Jiang, D. Dai, and B. Schiele, “Motion transformer with global intention localization and local movement refinement,” Advances in Neural Information Processing Systems, 2022.

[21] X. Jia, P. Wu, L. Chen, H. Li, Y. Liu, and J. Yan, “Hdgt: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding,” CoRL, 2022.

[22] A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer for end-to-end autonomous driving,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[23] B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang, “Vad: Vectorized scene representation for efficient autonomous driving,” in ICCV, 2023, pp. 8306–8316.

[24] Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang et al., “Planning-oriented autonomous driving,” in CVPR, 2023.

[25] Z. Guo, K. Gubernatorov, S. Asfaw, Z. Yagudin, and D. Tsetserukou, “Vdt-auto: End-to-end autonomous driving with vlm-guided diffusion transformers,” arXiv preprint arXiv:2502.20108, 2025.

[26] D. Hegde, R. Yasarla, H. Cai, S. Han, A. Bhattacharyya, S. Mahajan, L. Liu, R. Garrepalli, V. M. Patel, and F. Porikli, “Distilling multi-modal large language models for autonomous driving,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

[27] Y. Wang, R. Jiao, S. S. Zhan, C. Lang, C. Huang, Z. Wang, Z. Yang, and Q. Zhu, “Empowering autonomous driving with large language models: A safety perspective,” in ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024.

[28] Y. Yang, Q. Zhang, K. Ikemura, N. Batool, and J. Folkesson, “Hard cases detection in motion prediction by vision-language foundation models,” in 2024 IEEE Intelligent Vehicles Symposium (IV), 2024.

[29] Y. Xu, Y. Hu, Z. Zhang, G. P. Meyer, S. K. Mustikovela, S. Srinivasa, E. M. Wolff, and X. Huang, “Vlm-ad: End-to-end autonomous driving through vision-language model supervision,” arXiv preprint arXiv:2412.14446, 2024.

[30] C. Pan, B. Yaman, T. Nesti, A. Mallik, A. G. Allievi, S. Velipasalar, and L. Ren, “Vlp: Vision language planning for autonomous driving,” in CVPR, 2024.

[31] S. Wang, Z. Yu, X. Jiang, S. Lan, M. Shi, N. Chang, J. Kautz, Y. Li, and J. M. Alvarez, “Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

[32] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, 2022.

[33] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Advances in neural information processing systems, vol. 30, 2017.

[34] T. Tian and K. Goel, “Direct post-training preference alignment for multi-agent motion generation model using implicit feedback from pre-training demonstrations,” in The Thirteenth International Conference on Learning Representations, 2025.

[35] C. Zhang, P. Hao, X. Cao, X. Hao, S. Cui, and S. Wang, “Vtla: Vision-tactile-language-action model with preference learning for insertion manipulation,” arXiv preprint arXiv:2505.09577, 2025.

[36] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025.

[1] [1]

DriveGPT: Scaling autoregressive behavior models for driving,

X. Huang, E. M. Wolff, P. Vernaza, T. Phan-Minh, H. Chen, D. S. Hay- den, M. Edmonds, B. Pierce, X. Chen, P. E. Jacob, X. Chen, C. Tair- bekov, P. Agarwal, T. Gao, Y . Chai, and S. Srinivasa, “DriveGPT: Scaling autoregressive behavior models for driving,” inForty-second International Conference on Machine Learning, 2025

work page 2025

[2] [2]

Motionlm: Multi-agent motion forecasting as language modeling,

A. Seff, B. Cera, D. Chen, M. Ng, A. Zhou, N. Nayakanti, K. S. Refaat, R. Al-Rfou, and B. Sapp, “Motionlm: Multi-agent motion forecasting as language modeling,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8579–8590

work page 2023

[3] [3]

Wayformer: Motion forecasting via simple & efficient attention networks,

N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp, “Wayformer: Motion forecasting via simple & efficient attention networks,” inICRA, 2023

work page 2023

[4] [4]

Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction,

B. Varadarajan, A. Hefny, A. Srivastava, K. S. Refaat, N. Nayakanti, A. Cornman, K. Chen, B. Douillard, C. Lam, D. Anguelov, and B. Sapp, “Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction,” inICRA, 2022

work page 2022

[5] [5]

Scaling laws of motion forecasting and planning–a technical report,

M. Baniodeh, K. Goel, S. Ettinger, C. Fuertes, A. Seff, T. Shen, C. Gulino, C. Yang, G. Jerfel, D. Choeet al., “Scaling laws of motion forecasting and planning–a technical report,”arXiv preprint arXiv:2506.08228, 2025

work page arXiv 2025

[6] [6]

Drivelm: Driving with graph visual question answering,

C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” inEuropean conference on computer vision. Springer, 2024

work page 2024

[7] [7]

DriveVLM: The convergence of autonomous driving and large vision-language models,

X. Tian, J. Gu, B. Li, Y . Liu, Y . Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, “DriveVLM: The convergence of autonomous driving and large vision-language models,” in8th Annual Conference on Robot Learning, 2024

work page 2024

[8] [8]

Wisead: Knowl- edge augmented end-to-end autonomous driving with vision-language model,

S. Zhang, W. Huang, Z. Gao, H. Chen, and C. Lv, “Wisead: Knowl- edge augmented end-to-end autonomous driving with vision-language model,”arXiv preprint arXiv:2412.09951, 2024

work page arXiv 2024

[9] [9]

EMMA: End-to-end multimodal model for autonomous driving,

J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, Y . Zhou, J. Guo, D. Anguelov, and M. Tan, “EMMA: End-to-end multimodal model for autonomous driving,”Transactions on Machine Learning Research, 2025

work page 2025

[10] [10]

Opendrivevla: Towards end-to-end autonomous driving with large vision language action model,

X. Zhou, X. Han, F. Yang, Y . Ma, and A. C. Knoll, “Opendrivevla: Towards end-to-end autonomous driving with large vision language action model,”arXiv preprint arXiv:2503.23463, 2025

work page arXiv 2025

[11] [11]

Rt-2: Vision-language-action models transfer web knowledge to robotic control,

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahidet al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” inConference on Robot Learning, 2023

work page 2023

[12] [12]

Instructvla: Vision-language-action instruction tuning from understanding to manipulation,

S. Yang, H. Li, Y . Chen, B. Wang, Y . Tian, T. Wang, H. Wang, F. Zhao, Y . Liao, and J. Pang, “Instructvla: Vision-language-action instruction tuning from understanding to manipulation,”arXiv preprint arXiv:2507.17520, 2025

work page arXiv 2025

[13] Z. Zhou, Y. Zhu, M. Zhu, J. Wen, N. Liu, Z. Xu, W. Meng, R. Cheng, Y. Peng, C. Shen et al., “Chatvla: Unified multimodal understanding and robot control with vision-language-action model,” arXiv preprint arXiv:2502.14420, 2025.

[14] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[15] H. Cui, V. Radosavljevic, F.-C. Chou, T.-H. Lin, T. Nguyen, T.-K. Huang, J. Schneider, and N. Djuric, “Multimodal trajectory predictions for autonomous driving using deep convolutional networks,” in IEEE International Conference on Robotics and Automation (ICRA), 2019.

[16] J. Hong, B. Sapp, and J. Philbin, “Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions,” in CVPR, 2019.

[17] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, “Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction,” in Conference on Robot Learning, 2020, pp. 86–99.

[18] S. Casas, C. Gulino, R. Liao, and R. Urtasun, “Spagnn: Spatially-aware graph neural networks for relational behavior forecasting from sensor data,” in ICRA, 2020.

[19] M. Liang, B. Yang, R. Hu, Y. Chen, R. Liao, S. Feng, and R. Urtasun, “Learning lane graph representations for motion forecasting,” in European Conference on Computer Vision, 2020.

[20] S. Shi, L. Jiang, D. Dai, and B. Schiele, “Motion transformer with global intention localization and local movement refinement,” Advances in Neural Information Processing Systems, 2022.

[21] X. Jia, P. Wu, L. Chen, H. Li, Y. Liu, and J. Yan, “Hdgt: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding,” CoRL, 2022.

[22] A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer for end-to-end autonomous driving,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[23] B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang, “Vad: Vectorized scene representation for efficient autonomous driving,” in ICCV, 2023, pp. 8306–8316.

[24] Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang et al., “Planning-oriented autonomous driving,” in CVPR, 2023.

[25] Z. Guo, K. Gubernatorov, S. Asfaw, Z. Yagudin, and D. Tsetserukou, “Vdt-auto: End-to-end autonomous driving with vlm-guided diffusion transformers,” arXiv preprint arXiv:2502.20108, 2025.

[26] D. Hegde, R. Yasarla, H. Cai, S. Han, A. Bhattacharyya, S. Mahajan, L. Liu, R. Garrepalli, V. M. Patel, and F. Porikli, “Distilling multi-modal large language models for autonomous driving,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

[27] Y. Wang, R. Jiao, S. S. Zhan, C. Lang, C. Huang, Z. Wang, Z. Yang, and Q. Zhu, “Empowering autonomous driving with large language models: A safety perspective,” in ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024.

[28] Y. Yang, Q. Zhang, K. Ikemura, N. Batool, and J. Folkesson, “Hard cases detection in motion prediction by vision-language foundation models,” in 2024 IEEE Intelligent Vehicles Symposium (IV), 2024.

[29] Y. Xu, Y. Hu, Z. Zhang, G. P. Meyer, S. K. Mustikovela, S. Srinivasa, E. M. Wolff, and X. Huang, “Vlm-ad: End-to-end autonomous driving through vision-language model supervision,” arXiv preprint arXiv:2412.14446, 2024.

[30] C. Pan, B. Yaman, T. Nesti, A. Mallik, A. G. Allievi, S. Velipasalar, and L. Ren, “Vlp: Vision language planning for autonomous driving,” in CVPR, 2024.

[31] S. Wang, Z. Yu, X. Jiang, S. Lan, M. Shi, N. Chang, J. Kautz, Y. Li, and J. M. Alvarez, “Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

[32] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, 2022.

[33] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Advances in Neural Information Processing Systems, vol. 30, 2017.

[34] T. Tian and K. Goel, “Direct post-training preference alignment for multi-agent motion generation model using implicit feedback from pre-training demonstrations,” in The Thirteenth International Conference on Learning Representations, 2025.

[35] C. Zhang, P. Hao, X. Cao, X. Hao, S. Cui, and S. Wang, “Vtla: Vision-tactile-language-action model with preference learning for insertion manipulation,” arXiv preprint arXiv:2505.09577, 2025.

[36] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025.