arxiv: 2605.20082 · v1 · pith:LH3UPC5Mnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI

VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving

Pith reviewed 2026-05-20 05:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language models · autonomous driving · preference optimization · motion forecasting · direct preference optimization · Waymo dataset

The pith

Vision-language models generate preference data to finetune autonomous driving forecasts for human alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes using a vision-language model to automatically label preferred driving trajectories among a pretrained model's rollouts. These labels form preference pairs for finetuning via Direct Preference Optimization. This matters for autonomous driving because the standard imitation objective may not fully capture nuanced human driving preferences, so preference finetuning can yield forecasts that better match what people would choose. The approach is evaluated on the Waymo Open End-to-End Driving Dataset (WOD-E2E), where it reports an 11.94% increase in rater feedback score and a 10.01% reduction in average displacement error over the pretrained model.

Core claim

Treating the vision-language model as a zero-shot reasoner, VL-DPO builds preference pairs from the pretrained model's rollouts and applies DPO, producing a model with an 11.94% higher rater feedback score (RFS) and a 10.01% lower average displacement error (ADE) than the pretrained version. The authors also report that the VLM's trajectory selections are a high-quality proxy for human preference.
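For reference, the per-pair DPO objective from Rafailov et al. [14] can be sketched as below. This is a minimal illustration, not the authors' training code: the function names, the use of summed trajectory log-probabilities as inputs, and the default `beta=0.1` are assumptions made here, since the page does not state the paper's hyperparameters.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Per-pair DPO loss.

    logp_* are log-probabilities of the chosen/rejected trajectory under the
    model being finetuned; ref_logp_* are the same quantities under the frozen
    reference (pretrained) model.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))
```

When the finetuned model and the reference agree exactly, the loss equals log 2; it drops as the finetuned model shifts probability toward the chosen trajectory relative to the reference.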

What carries the argument

The VL-DPO pipeline, in which a vision-language model selects between trajectory options to build preference pairs for Direct Preference Optimization of motion forecasting models.
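The pair-building step can be pictured with a small sketch. The page does not say how many pairs are formed per scene, so pairing the VLM-selected rollout against every other rollout is an assumption made here for illustration; `PreferencePair` and `build_pairs` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Trajectory = List[Tuple[float, float]]  # (x, y) waypoints


@dataclass
class PreferencePair:
    scene_id: str
    chosen: Trajectory
    rejected: Trajectory


def build_pairs(
    scene_id: str,
    rollouts: Sequence[Trajectory],
    preferred_index: int,
) -> List[PreferencePair]:
    """Pair the VLM-preferred rollout with each remaining rollout.

    Assumption: one pair per non-preferred rollout. The paper may instead
    use a single rejected sample or a different pairing rule.
    """
    if not 0 <= preferred_index < len(rollouts):
        raise IndexError("preferred_index is outside the rollout list")
    chosen = rollouts[preferred_index]
    return [
        PreferencePair(scene_id, chosen, rejected)
        for i, rejected in enumerate(rollouts)
        if i != preferred_index
    ]
```

Under this rule, a scene with N rollouts yields N - 1 pairs that all share the same chosen trajectory.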

If this is right

  • The finetuned model produces trajectories that humans rate more highly.
  • Average displacement from preferred paths decreases.
  • The method avoids the need for extensive new human labeling by leveraging VLM reasoning.
  • It demonstrates that preference optimization can enhance pretrained models in complex real-world tasks like driving.
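As a reminder of what the displacement metric measures, average displacement error is commonly computed as the mean Euclidean distance between predicted and reference waypoints at matching timesteps. The sketch below uses that common definition; whether the paper measures against the highest-rated trajectory or another reference is not stated on this page, and the function name is chosen here for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def average_displacement_error(
    predicted: Sequence[Point], reference: Sequence[Point]
) -> float:
    """Mean L2 distance between waypoints at matching timesteps."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    total = sum(
        math.hypot(px - rx, py - ry)
        for (px, py), (rx, ry) in zip(predicted, reference)
    )
    return total / len(predicted)


# Two waypoints, displaced by 0 and 5 units respectively.
ade = average_displacement_error([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
print(ade)  # 2.5
```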

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This framework might extend to other domains where VLMs can proxy preferences, such as robotics or game AI.
  • Stronger VLMs could lead to even better alignment results in future iterations.
  • It highlights a path to make AI systems more intuitively aligned without massive supervised preference datasets.

Load-bearing premise

The vision-language model's trajectory selection accurately reflects human preferences.

What would settle it

Collecting new human preference labels on the same rollouts and finding that they disagree with the VLM's choices in a majority of cases would undermine the claim that the VLM is a high-quality proxy.
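Such a check could be scored with a raw match rate plus a chance-corrected statistic such as Cohen's kappa. The sketch below assumes each scene has one VLM-selected index and one human-selected index; the function names are illustrative, and no agreement numbers from the paper are implied.

```python
from collections import Counter
from typing import Hashable, Sequence


def match_rate(vlm: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Fraction of scenes where the VLM and the human picked the same option."""
    if len(vlm) != len(human) or not vlm:
        raise ValueError("label lists must be non-empty and equally long")
    return sum(v == h for v, h in zip(vlm, human)) / len(vlm)


def cohens_kappa(vlm: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(vlm)
    p_observed = match_rate(vlm, human)
    vlm_counts, human_counts = Counter(vlm), Counter(human)
    labels = set(vlm_counts) | set(human_counts)
    # Expected agreement if both raters chose independently at their own rates.
    p_expected = sum(vlm_counts[l] * human_counts[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (p_observed - p_expected) / (1.0 - p_expected)
```

A match rate below 0.5 on binary choices, or a kappa near zero, would correspond to the disconfirming outcome described above.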

Figures

Figures reproduced from arXiv: 2605.20082 by Ghassen Jerfel, Jeonhyung Kang, Khaled S. Refaat, Marina Haliem, Qi Zhao, Zhefan Xu.

Figure 1: Illustration of the proposed training stages of the motion forecasting ...
Figure 2: The architecture of MotionLM [2]. It adopts an encoder–decoder ...
Figure 3: Illustration of the VLM’s Chain-of-Thought (CoT) reasoning process. The VLM takes as input the sequence of image history, top-down view ...
Figure 4: Example of VLM Chain-of-Thought reasoning. For clarity, the full ...
Figure 5: Comparison of RFS, avgRFS, mlRFS across the finetuning methods.
Figure 6: Comparison of central-mode trajectory prediction plots in top-down images from MotionLM [2], the imitation learning–finetuned model, and our ...
Original abstract

The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VL-DPO, a framework that uses a vision-language model (VLM) as a zero-shot reasoner to generate preference pairs (preferred/rejected trajectories) from rollouts of a pretrained motion forecasting model on the Waymo Open End-to-End Driving Dataset (WOD-E2E). These pairs are then used to fine-tune the model via Direct Preference Optimization (DPO). The authors evaluate on held-out human preference annotations and report that VL-DPO achieves an 11.94% increase in rater feedback score (RFS) and a 10.01% reduction in average displacement error (ADE) relative to the pretrained baseline, while claiming that experiments confirm the VLM selections serve as a high-quality proxy for human preferences.

Significance. If the VLM proxy validation holds with strong quantitative support, the work could provide a scalable method for aligning autonomous driving motion models with nuanced human preferences using existing VLMs, reducing reliance on large-scale manual annotations. The approach builds on DPO and VLM reasoning capabilities in a concrete application domain, and the reported metric gains on held-out annotations would be a useful empirical result if reproducible.

major comments (2)
  1. Experiments section: The central claim that 'the VLM's trajectory selection is a high-quality proxy for human preference' is load-bearing for interpreting the 11.94% RFS and 10.01% ADE gains as genuine preference alignment rather than regularization or dataset effects. The manuscript must provide concrete quantitative evidence here, such as exact selection accuracy against held-out human annotations, inter-rater agreement (e.g., Cohen's kappa), or a statistical test comparing VLM choices to human raters; without these numbers and controls for VLM biases on driving scenes, the proxy quality cannot be assessed as 'high-quality.'
  2. §3 (Method) and §4 (Experiments): The DPO objective is applied to VLM-generated pairs, but it is unclear how trajectory rollouts are sampled, how the VLM prompt elicits preferences, and whether any post-hoc filtering or multiple VLM queries are used. If these choices are not fixed before seeing the held-out human annotations, the reported gains risk being influenced by evaluation setup rather than the method itself.
minor comments (2)
  1. Abstract and §4: Clarify the exact definition and computation of RFS (rater feedback score) and whether it is normalized; also report standard deviations or confidence intervals on the 11.94% and 10.01% figures.
  2. Related work: Add citations to recent VLM-based preference alignment works outside driving (e.g., in robotics or general RLHF) to better contextualize the novelty of applying DPO with VLMs to motion forecasting.
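Minor comment 1 asks for confidence intervals on the reported relative changes. One common way to obtain them is a paired percentile bootstrap over scenes, sketched below. The per-scene score lists are placeholders supplied by the caller, the function names are hypothetical, and nothing here reproduces the paper's data or its evaluation protocol.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def relative_change(baseline: Sequence[float], finetuned: Sequence[float]) -> float:
    """(mean finetuned - mean baseline) / mean baseline."""
    base = mean(baseline)
    return (mean(finetuned) - base) / base


def bootstrap_ci(
    baseline: Sequence[float],
    finetuned: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the relative change.

    Scenes are resampled with replacement, keeping each scene's baseline and
    finetuned scores paired.
    """
    if len(baseline) != len(finetuned) or not baseline:
        raise ValueError("score lists must be non-empty and equally long")
    rng = random.Random(seed)
    n = len(baseline)
    stats: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            relative_change([baseline[i] for i in idx], [finetuned[i] for i in idx])
        )
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Pairing by scene keeps the comparison within the same driving situations, which usually gives a tighter interval than resampling the two models independently.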

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the presentation of our results. We address the major comments point by point below. Where the manuscript lacks sufficient detail or direct evidence, we will revise accordingly while preserving the original experimental outcomes.

point-by-point responses
  1. Referee: Experiments section: The central claim that 'the VLM's trajectory selection is a high-quality proxy for human preference' is load-bearing for interpreting the 11.94% RFS and 10.01% ADE gains as genuine preference alignment rather than regularization or dataset effects. The manuscript must provide concrete quantitative evidence here, such as exact selection accuracy against held-out human annotations, inter-rater agreement (e.g., Cohen's kappa), or a statistical test comparing VLM choices to human raters; without these numbers and controls for VLM biases on driving scenes, the proxy quality cannot be assessed as 'high-quality.'

    Authors: We agree that direct quantitative metrics would make the proxy claim more robust. The current manuscript relies on downstream gains on held-out human annotations as supporting evidence. In the revision we will add a dedicated subsection and table reporting (1) the exact match rate between VLM-selected preferred trajectories and human rater choices on the held-out set, (2) Cohen's kappa for inter-rater agreement between VLM and humans, and (3) a chi-squared test against a random baseline. We will also include a short discussion of observed VLM biases in driving scenes (e.g., over-caution in certain merge scenarios) with qualitative examples. revision: yes

  2. Referee: §3 (Method) and §4 (Experiments): The DPO objective is applied to VLM-generated pairs, but it is unclear how trajectory rollouts are sampled, how the VLM prompt elicits preferences, and whether any post-hoc filtering or multiple VLM queries are used. If these choices are not fixed before seeing the held-out human annotations, the reported gains risk being influenced by evaluation setup rather than the method itself.

    Authors: We appreciate the call for greater methodological transparency. In the revised §3 we will specify: rollouts are obtained by sampling 8 trajectories per scene from the pretrained model using fixed top-k sampling (k=5, temperature=0.7); the VLM prompt is a single, fixed zero-shot template that instructs the model to reason about safety, comfort, and rule compliance before outputting the preferred index; no post-hoc filtering or multiple queries per pair are performed. All sampling parameters, prompt wording, and pair construction decisions were locked prior to any inspection of the held-out human annotations, as documented in the supplementary material. revision: yes
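The sampling setup described in the second response (8 rollouts per scene, top-k with k=5, temperature 0.7) can be illustrated with a small sketch over discrete motion tokens. The `next_logits` callable is a hypothetical stand-in for the pretrained model, and the toy logits exist only so the example runs; none of this is the authors' code.

```python
import math
import random
from typing import Callable, List, Sequence


def sample_top_k(
    logits: Sequence[float], k: int, temperature: float, rng: random.Random
) -> int:
    """Sample one token index from the k highest-logit tokens."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max before exponentiating for stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


def sample_rollouts(
    next_logits: Callable[[List[int]], Sequence[float]],
    horizon: int,
    num_rollouts: int = 8,
    k: int = 5,
    temperature: float = 0.7,
    seed: int = 0,
) -> List[List[int]]:
    """Draw num_rollouts token sequences autoregressively."""
    rng = random.Random(seed)
    rollouts = []
    for _ in range(num_rollouts):
        tokens: List[int] = []
        for _ in range(horizon):
            tokens.append(sample_top_k(next_logits(tokens), k, temperature, rng))
        rollouts.append(tokens)
    return rollouts


def toy_next_logits(prefix: List[int]) -> List[float]:
    """Toy model (assumption): fixed logits over a 10-token vocabulary."""
    return [float(i) for i in range(10)]


rollouts = sample_rollouts(toy_next_logits, horizon=4)
```

A temperature below 1 sharpens the distribution within the top-k set, so the 8 rollouts stay close to the model's most likely modes while still differing from one another.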

Circularity Check

0 steps flagged

No significant circularity; claims rest on external held-out human annotations

full rationale

The paper generates preference pairs via zero-shot VLM reasoning on pretrained rollouts, applies DPO finetuning, and reports gains on RFS and ADE measured against held-out human preference annotations. No equations, fitted parameters renamed as predictions, or self-citation chains reduce the reported 11.94% RFS lift or 10.01% ADE reduction to quantities defined by the method's own inputs. The VLM-proxy validation is presented as an experimental confirmation against external human data rather than a definitional or self-referential step, leaving the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard machine-learning assumptions plus one domain-specific premise that the VLM can serve as a reliable proxy for human preference without additional training.

axioms (2)
  • domain assumption Direct Preference Optimization can be applied to motion forecasting models to improve alignment with external preference signals.
    Invoked when the authors apply DPO to the pretrained driving model using VLM-generated pairs.
  • domain assumption Held-out human preference annotations provide an unbiased measure of model quality.
    Used to compute RFS and to validate that VLM selections match human judgments.

pith-pipeline@v0.9.0 · 5759 in / 1363 out tokens · 31291 ms · 2026-05-20T05:34:34.876349+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

  1. [1]

    DriveGPT: Scaling autoregressive behavior models for driving,

    X. Huang, E. M. Wolff, P. Vernaza, T. Phan-Minh, H. Chen, D. S. Hayden, M. Edmonds, B. Pierce, X. Chen, P. E. Jacob, X. Chen, C. Tairbekov, P. Agarwal, T. Gao, Y. Chai, and S. Srinivasa, “DriveGPT: Scaling autoregressive behavior models for driving,” in Forty-second International Conference on Machine Learning, 2025

  2. [2]

    Motionlm: Multi-agent motion forecasting as language modeling,

    A. Seff, B. Cera, D. Chen, M. Ng, A. Zhou, N. Nayakanti, K. S. Refaat, R. Al-Rfou, and B. Sapp, “Motionlm: Multi-agent motion forecasting as language modeling,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8579–8590

  3. [3]

    Wayformer: Motion forecasting via simple & efficient attention networks,

    N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp, “Wayformer: Motion forecasting via simple & efficient attention networks,” in ICRA, 2023

  4. [4]

    Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction,

    B. Varadarajan, A. Hefny, A. Srivastava, K. S. Refaat, N. Nayakanti, A. Cornman, K. Chen, B. Douillard, C. Lam, D. Anguelov, and B. Sapp, “Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction,” in ICRA, 2022

  5. [5]

    Scaling laws of motion forecasting and planning–a technical report,

    M. Baniodeh, K. Goel, S. Ettinger, C. Fuertes, A. Seff, T. Shen, C. Gulino, C. Yang, G. Jerfel, D. Choe et al., “Scaling laws of motion forecasting and planning–a technical report,” arXiv preprint arXiv:2506.08228, 2025

  6. [6]

    Drivelm: Driving with graph visual question answering,

    C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” in European Conference on Computer Vision. Springer, 2024

  7. [7]

    DriveVLM: The convergence of autonomous driving and large vision-language models,

    X. Tian, J. Gu, B. Li, Y. Liu, Y. Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, “DriveVLM: The convergence of autonomous driving and large vision-language models,” in 8th Annual Conference on Robot Learning, 2024

  8. [8]

    Wisead: Knowledge augmented end-to-end autonomous driving with vision-language model,

    S. Zhang, W. Huang, Z. Gao, H. Chen, and C. Lv, “Wisead: Knowledge augmented end-to-end autonomous driving with vision-language model,” arXiv preprint arXiv:2412.09951, 2024

  9. [9]

    EMMA: End-to-end multimodal model for autonomous driving,

    J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, Y. Zhou, J. Guo, D. Anguelov, and M. Tan, “EMMA: End-to-end multimodal model for autonomous driving,” Transactions on Machine Learning Research, 2025

  10. [10]

    Opendrivevla: Towards end-to-end autonomous driving with large vision language action model,

    X. Zhou, X. Han, F. Yang, Y. Ma, and A. C. Knoll, “Opendrivevla: Towards end-to-end autonomous driving with large vision language action model,” arXiv preprint arXiv:2503.23463, 2025

  11. [11]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control,

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning, 2023

  12. [12]

    Instructvla: Vision-language-action instruction tuning from understanding to manipulation,

    S. Yang, H. Li, Y. Chen, B. Wang, Y. Tian, T. Wang, H. Wang, F. Zhao, Y. Liao, and J. Pang, “Instructvla: Vision-language-action instruction tuning from understanding to manipulation,” arXiv preprint arXiv:2507.17520, 2025

  13. [13]

    Chatvla: Unified multimodal understanding and robot control with vision-language-action model,

    Z. Zhou, Y. Zhu, M. Zhu, J. Wen, N. Liu, Z. Xu, W. Meng, R. Cheng, Y. Peng, C. Shen et al., “Chatvla: Unified multimodal understanding and robot control with vision-language-action model,” arXiv preprint arXiv:2502.14420, 2025

  14. [14]

    Direct preference optimization: Your language model is secretly a reward model,

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023

  15. [15]

    Multimodal trajectory predictions for autonomous driving using deep convolutional networks,

    H. Cui, V. Radosavljevic, F.-C. Chou, T.-H. Lin, T. Nguyen, T.-K. Huang, J. Schneider, and N. Djuric, “Multimodal trajectory predictions for autonomous driving using deep convolutional networks,” in IEEE International Conference on Robotics and Automation (ICRA), 2019

  16. [16]

    Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions,

    J. Hong, B. Sapp, and J. Philbin, “Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions,” in CVPR, 2019

  17. [17]

    Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction,

    Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, “Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction,” in Conference on Robot Learning, 2020, pp. 86–99

  18. [18]

    Spagnn: Spatially-aware graph neural networks for relational behavior forecasting from sensor data,

    S. Casas, C. Gulino, R. Liao, and R. Urtasun, “Spagnn: Spatially-aware graph neural networks for relational behavior forecasting from sensor data,” in ICRA, 2020

  19. [19]

    Learning lane graph representations for motion forecasting,

    M. Liang, B. Yang, R. Hu, Y. Chen, R. Liao, S. Feng, and R. Urtasun, “Learning lane graph representations for motion forecasting,” in European Conference on Computer Vision, 2020

  20. [20]

    Motion transformer with global intention localization and local movement refinement,

    S. Shi, L. Jiang, D. Dai, and B. Schiele, “Motion transformer with global intention localization and local movement refinement,” Advances in Neural Information Processing Systems, 2022

  21. [21]

    Hdgt: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding,

    X. Jia, P. Wu, L. Chen, H. Li, Y. Liu, and J. Yan, “Hdgt: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding,” CoRL, 2022

  22. [22]

    Multi-modal fusion transformer for end-to-end autonomous driving,

    A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer for end-to-end autonomous driving,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  23. [23]

    Vad: Vectorized scene representation for efficient autonomous driving,

    B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang, “Vad: Vectorized scene representation for efficient autonomous driving,” in ICCV, 2023, pp. 8306–8316

  24. [24]

    Planning-oriented autonomous driving,

    Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang et al., “Planning-oriented autonomous driving,” in CVPR, 2023

  25. [25]

    Vdt-auto: End-to-end autonomous driving with vlm-guided diffusion transformers,

    Z. Guo, K. Gubernatorov, S. Asfaw, Z. Yagudin, and D. Tsetserukou, “Vdt-auto: End-to-end autonomous driving with vlm-guided diffusion transformers,” arXiv preprint arXiv:2502.20108, 2025

  26. [26]

    Distilling multi-modal large language models for autonomous driving,

    D. Hegde, R. Yasarla, H. Cai, S. Han, A. Bhattacharyya, S. Mahajan, L. Liu, R. Garrepalli, V. M. Patel, and F. Porikli, “Distilling multi-modal large language models for autonomous driving,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025

  27. [27]

    Empowering autonomous driving with large language models: A safety perspective,

    Y. Wang, R. Jiao, S. S. Zhan, C. Lang, C. Huang, Z. Wang, Z. Yang, and Q. Zhu, “Empowering autonomous driving with large language models: A safety perspective,” in ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024

  28. [28]

    Hard cases detection in motion prediction by vision-language foundation models,

    Y. Yang, Q. Zhang, K. Ikemura, N. Batool, and J. Folkesson, “Hard cases detection in motion prediction by vision-language foundation models,” in 2024 IEEE Intelligent Vehicles Symposium (IV), 2024

  29. [29]

    Vlm-ad: End-to-end autonomous driving through vision-language model supervision,

    Y. Xu, Y. Hu, Z. Zhang, G. P. Meyer, S. K. Mustikovela, S. Srinivasa, E. M. Wolff, and X. Huang, “Vlm-ad: End-to-end autonomous driving through vision-language model supervision,” arXiv preprint arXiv:2412.14446, 2024

  30. [30]

    Vlp: Vision language planning for autonomous driving,

    C. Pan, B. Yaman, T. Nesti, A. Mallik, A. G. Allievi, S. Velipasalar, and L. Ren, “Vlp: Vision language planning for autonomous driving,” in CVPR, 2024

  31. [31]

    Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning,

    S. Wang, Z. Yu, X. Jiang, S. Lan, M. Shi, N. Chang, J. Kautz, Y. Li, and J. M. Alvarez, “Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025

  32. [32]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, 2022

  33. [33]

    Deep reinforcement learning from human preferences,

    P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Advances in neural information processing systems, vol. 30, 2017

  34. [34]

    Direct post-training preference alignment for multi-agent motion generation model using implicit feedback from pre-training demonstrations,

    T. Tian and K. Goel, “Direct post-training preference alignment for multi-agent motion generation model using implicit feedback from pre-training demonstrations,” in The Thirteenth International Conference on Learning Representations, 2025

  35. [35]

    Vtla: Vision-tactile-language-action model with preference learning for insertion manipulation,

    C. Zhang, P. Hao, X. Cao, X. Hao, S. Cui, and S. Wang, “Vtla: Vision-tactile-language-action model with preference learning for insertion manipulation,” arXiv preprint arXiv:2505.09577, 2025

  36. [36]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025