pith. sign in

arxiv: 2606.20641 · v1 · pith:6ARW5XOYnew · submitted 2026-06-02 · 💻 cs.RO · cs.AI· cs.LG

MAGNIFIED: RL Fine-tuning of Multimodal Large Language Models for Motion Planning

Pith reviewed 2026-06-28 09:30 UTC · model grok-4.3

classification 💻 cs.RO cs.AIcs.LG
keywords multimodal large language modelsreinforcement learning fine-tuningmotion planningautonomous drivingWaymo Open Motion Datasettrajectory predictionplanning rewards
0
0 comments X

The pith

Reinforcement learning fine-tuning of multimodal language models produces driving plans with lower overlap and off-road rates than supervised fine-tuning alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that next-token prediction in pre-training and supervised fine-tuning of multimodal LLMs encourages only per-token text imitation and misses multi-step planning needs such as spacing for other vehicles. To fix this, the authors add a reinforcement learning stage that converts the model's output tokens into vehicle trajectories and scores them directly with planning rewards. This shifts the training signal from text accuracy to actual driving metrics. On the Waymo Open Motion Dataset the RL-tuned models show measurable drops in overlap and off-road errors relative to the supervised baseline. A reader would care because the result suggests a practical way to make language-model planners respect safety and comfort objectives instead of copying demonstration text.

Core claim

The authors establish that an initial supervised fine-tuning phase produces a baseline capable of emitting plan trajectories as text sequences of X-Y coordinates, after which reinforcement learning fine-tuning that maps those sequences to trajectories and optimizes token-level planning rewards yields over 10.5 percent lower overlap rate and 38.9 percent lower off-road rate on the Waymo dataset.

What carries the argument

MAGNIFIED, the reinforcement learning fine-tuning procedure that converts sequences of predicted tokens into vehicle trajectories and applies planning rewards to optimize the model beyond token-level imitation.

Load-bearing premise

That scoring predicted trajectories with planning metrics supplies a training signal that genuinely improves multi-step driving behavior rather than overfitting to the reward rules or dataset patterns.

What would settle it

Evaluating the RL-fine-tuned model on a held-out collection of driving scenarios and observing no reduction, or an increase, in overlap or off-road rates compared with the supervised fine-tuning baseline.

Figures

Figures reproduced from arXiv: 2606.20641 by Ben Sapp, Drago Anguelov, Justin Fu, Jyh-Jing Hwang, Letian Chen, Runsheng Xu, Yichen Xie, Yiren Lu.

Figure 1
Figure 1. Figure 1: This diagram illustrates our framework for AV planning. Textual and visual inputs are provided to a MLLM, which [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Example of rasterized image. The roadgraph is shown [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: This figure illustrates the comparison between token-level reward and sequence-level reward. In this example, the [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Percentage change in ADE at 3, 5, and 8 seconds [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effect of the off-road weight wo on planning metrics. As wo increases, overlap and off-road trade off, confirming MAGNIFIED’s ability to balance multiple objectives [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: This figure shows the impact of KL regularization [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
read the original abstract

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in semantic understanding and common sense reasoning, making them promising candidates for solving planning problems in autonomous driving. However, the next-token text prediction objectives traditionally used in pre-training and supervised fine-tuning (SFT) of MLLMs may fall short of fulfilling the planning objectives for autonomous vehicles. The next-token prediction objective merely encourages per-token imitation in text, often irrespective of multi-step consequences and the alignment with crucial planning considerations such as giving space to other road actors. To overcome these limitations, we propose a reinforcement learning fine-tuning (RLFT) approach, MAGNIFIED, that aligns the MLLM-based driving agent with planning objectives by learning from token-level rewards. By mapping a sequence of predicted tokens to corresponding vehicle trajectories and learning from planning rewards, MAGNIFIED optimizes for the true planning objectives rather than focusing solely on token prediction accuracy, enabling the model to refine its understanding of the planning task beyond simple imitation. We validate our approach on the Waymo Open Motion Dataset with a novel setup incorporating rasterized birds-eye views and tokenized trajectories as inputs and planning-oriented outputs. An initial SFT phase establishes a strong baseline in outputting plan trajectories as sequences of X-Y coordinates in text, while subsequent RL fine-tuning substantially enhances planning performance relative to the SFT baseline (demonstrating over a 10.5% reduction in overlap rate and a 38.9% reduction in off-road rate), underscoring the potential of RLFT on MLLMs to achieve vehicle planning that is better aligned with compliant, comfortable, and efficient driving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes MAGNIFIED, an RL fine-tuning approach for multimodal LLMs in autonomous driving motion planning. It argues that next-token prediction in SFT fails to capture multi-step planning objectives such as collision avoidance and road compliance. The method maps token sequences to trajectories, applies planning rewards (overlap, off-road), and performs RLFT after an initial SFT phase on rasterized BEV inputs and tokenized outputs. On the Waymo Open Motion Dataset, it reports >10.5% reduction in overlap rate and 38.9% reduction in off-road rate relative to the SFT baseline.

Significance. If the empirical gains prove robust, the work would indicate that token-level RL rewards derived from trajectory mapping can improve alignment of MLLM planners with safety and efficiency criteria beyond pure imitation, with potential implications for integrating language models into robotics planning pipelines.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): The headline performance deltas (10.5% overlap, 38.9% off-road) are stated without error bars, number of random seeds, statistical significance tests, or training curves, preventing assessment of whether the improvements are stable or sensitive to hyperparameter choices.
  2. [§3] §3 (Method): The description of how predicted token sequences are mapped to vehicle trajectories and how planning rewards are computed at the token level provides no explicit formulation or pseudocode; without this, it is impossible to verify whether the RL signal targets genuine multi-step consequences or merely fits the specific reward formulation on the Waymo dataset.
  3. [§4] §4 (Experiments): No ablation studies on individual reward components (overlap vs. off-road), alternative metrics, or out-of-distribution evaluation are reported, leaving open the possibility that gains arise from overfitting to dataset artifacts rather than improved planning behavior.
minor comments (1)
  1. [§3] The novel setup incorporating rasterized birds-eye views and tokenized trajectories is mentioned but lacks a figure or table illustrating the input-output tokenization scheme.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects for improving the clarity and robustness of our work. We address each major comment point-by-point below and will revise the manuscript to incorporate the suggested additions.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The headline performance deltas (10.5% overlap, 38.9% off-road) are stated without error bars, number of random seeds, statistical significance tests, or training curves, preventing assessment of whether the improvements are stable or sensitive to hyperparameter choices.

    Authors: We agree that reporting variability is essential for assessing stability. The current manuscript reports point estimates from a single run. In the revision, we will rerun experiments with multiple random seeds (at least 3), include error bars (standard deviation), perform statistical significance tests (e.g., paired t-tests), and add training curves showing reward and metric progression during RLFT. revision: yes

  2. Referee: [§3] §3 (Method): The description of how predicted token sequences are mapped to vehicle trajectories and how planning rewards are computed at the token level provides no explicit formulation or pseudocode; without this, it is impossible to verify whether the RL signal targets genuine multi-step consequences or merely fits the specific reward formulation on the Waymo dataset.

    Authors: We acknowledge the need for greater precision in the method description. The manuscript currently describes the mapping and rewards at a high level. In the revision, we will add explicit mathematical formulations for the token-to-trajectory mapping function and the per-token reward computation (including how overlap and off-road penalties are assigned across the sequence), along with pseudocode for the full RLFT pipeline to demonstrate that the signal accounts for multi-step trajectory consequences. revision: yes

  3. Referee: [§4] §4 (Experiments): No ablation studies on individual reward components (overlap vs. off-road), alternative metrics, or out-of-distribution evaluation are reported, leaving open the possibility that gains arise from overfitting to dataset artifacts rather than improved planning behavior.

    Authors: We recognize that ablations would strengthen claims of genuine planning improvement. The current experiments focus on the combined reward. In the revision, we will add ablations isolating overlap versus off-road rewards, report additional metrics (e.g., comfort-related), and include out-of-distribution evaluation on held-out Waymo scenarios with different traffic densities or unseen map layouts to address potential overfitting concerns. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL gains measured on held-out data

full rationale

The paper reports an empirical result: after SFT on tokenized trajectories from Waymo Open Motion, RL fine-tuning with planning rewards (overlap, off-road) yields measured reductions (10.5% overlap, 38.9% off-road) on held-out evaluation. No derivation chain, equations, or 'predictions' reduce to inputs by construction. No self-citations justify theorems or ansatzes. The performance delta is a standard held-out metric comparison, not a fitted quantity renamed as prediction. This matches the default case of a self-contained empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach implicitly assumes that planning rewards can be defined from trajectory geometry without additional unstated modeling choices.

pith-pipeline@v0.9.1-grok · 5854 in / 1176 out tokens · 19293 ms · 2026-06-28T09:30:06.063175+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 12 canonical work pages · 4 internal anchors

  1. [1]

    Vision-language transformer for interpretable pathology visual question answering,

    U. Naseem, M. Khushi, and J. Kim, “Vision-language transformer for interpretable pathology visual question answering,”IEEE Journal of Biomedical and Health Informatics, vol. 27, no. 4, pp. 1681–1690, 2022

  2. [2]

    Vid2seq: Large-scale pretraining of a visual language model for dense video captioning,

    A. Yang, A. Nagrani, P. H. Seo, A. Miech, J. Pont-Tuset, I. Laptev, J. Sivic, and C. Schmid, “Vid2seq: Large-scale pretraining of a visual language model for dense video captioning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 714–10 726

  3. [3]

    Vision– language model for visual question answering in medical imagery,

    Y . Bazi, M. M. A. Rahhal, L. Bashmal, and M. Zuair, “Vision– language model for visual question answering in medical imagery,” Bioengineering, vol. 10, no. 3, p. 380, 2023

  4. [4]

    Physically grounded vision-language models for robotic manipulation,

    J. Gao, B. Sarkar, F. Xia, T. Xiao, J. Wu, B. Ichter, A. Majumdar, and D. Sadigh, “Physically grounded vision-language models for robotic manipulation,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 12 462–12 469

  5. [5]

    Vision and language navigation in the real world via online visual language mapping,

    C. Xu, H. T. Nguyen, C. Amato, and L. L. Wong, “Vision and language navigation in the real world via online visual language mapping,”arXiv preprint arXiv:2310.10822, 2023

  6. [6]

    Vision language models in autonomous driving: A survey and outlook,

    X. Zhou, M. Liu, E. Yurtsever, B. L. Zagar, W. Zimmer, H. Cao, and A. C. Knoll, “Vision language models in autonomous driving: A survey and outlook,”IEEE Transactions on Intelligent V ehicles, 2024

  7. [7]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp,et al., “Emma: End-to-end multimodal model for autonomous driving,”arXiv preprint arXiv:2410.23262, 2024

  8. [8]

    Stepcoder: Improve code gen- eration with reinforcement learning from compiler feedback,

    S. Dou, Y . Liu, H. Jia, L. Xiong, E. Zhou, W. Shen, J. Shan, C. Huang, X. Wang, X. Fan,et al., “Stepcoder: Improve code gen- eration with reinforcement learning from compiler feedback,”arXiv preprint arXiv:2402.01391, 2024

  9. [9]

    Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios,

    Y . Lu, J. Fu, G. Tucker, X. Pan, E. Bronstein, R. Roelofs, B. Sapp, B. White, A. Faust, S. Whiteson,et al., “Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios,” in2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023, pp. 7553–7560

  10. [10]

    Im- proving agent behaviors with rl fine-tuning for autonomous driving,

    Z. Peng, W. Luo, Y . Lu, T. Shen, C. Gulino, A. Seff, and J. Fu, “Im- proving agent behaviors with rl fine-tuning for autonomous driving,” inEuropean Conference on Computer Vision. Springer, 2025, pp. 165–181

  11. [11]

    Drivegpt4: Interpretable end-to-end autonomous driving via large language model,

    Z. Xu, Y . Zhang, E. Xie, Z. Zhao, Y . Guo, K.-Y . K. Wong, Z. Li, and H. Zhao, “Drivegpt4: Interpretable end-to-end autonomous driving via large language model,”RA-L, 2024

  12. [12]

    Lmdrive: Closed-loop end-to-end driving with large language models,

    H. Shao, Y . Hu, L. Wang, G. Song, S. L. Waslander, Y . Liu, and H. Li, “Lmdrive: Closed-loop end-to-end driving with large language models,” inCVPR, 2024

  13. [13]

    Drive anywhere: Generalizable end-to-end autonomous driving with multi-modal foundation models,

    T.-H. Wang, A. Maalouf, W. Xiao, Y . Ban, A. Amini, G. Rosman, S. Karaman, and D. Rus, “Drive anywhere: Generalizable end-to-end autonomous driving with multi-modal foundation models,” inICRA, 2024

  14. [14]

    S4-driver: Scalable self-supervised driving multimodal large language model with spatio-temporal visual representation,

    Y . Xie, R. Xu, T. He, J.-J. Hwang, K. Luo, J. Ji, H. Lin, L. Chen, Y . Lu, Z. Leng,et al., “S4-driver: Scalable self-supervised driving multimodal large language model with spatio-temporal visual representation,” in Proceedings of the Computer Vision and Pattern Recognition Confer- ence, 2025, pp. 1622–1632

  15. [15]

    Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning.arXiv preprint arXiv:2405.01533, 2024

    S. Wang, Z. Yu, X. Jiang, S. Lan, M. Shi, N. Chang, J. Kautz, Y . Li, and J. M. Alvarez, “Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning,” arXiv preprint arXiv:2405.01533, 2024

  16. [16]

    Vlp: Vision language planning for autonomous driving,

    C. Pan, B. Yaman, T. Nesti, A. Mallik, A. G. Allievi, S. Velipasalar, and L. Ren, “Vlp: Vision language planning for autonomous driving,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 760–14 769

  17. [17]

    Drivevlm: The convergence of autonomous driving and large vision-language models,

    X. Tian, J. Gu, B. Li, Y . Liu, C. Hu, Y . Wang, K. Zhan, P. Jia, X. Lang, and H. Zhao, “Drivevlm: The convergence of autonomous driving and large vision-language models,” inCoRL, 2024

  18. [18]

    Drivecot: Integrating chain-of-thought reasoning with end-to-end driving,

    T. Wang, E. Xie, R. Chu, Z. Li, and P. Luo, “Drivecot: Integrating chain-of-thought reasoning with end-to-end driving,”arXiv preprint arXiv:2403.16996, 2024

  19. [19]

    Look, remember and reason: Grounded reasoning in videos with language models,

    A. Bhattacharyya, S. Panchal, M. Lee, R. Pourreza, P. Madan, and R. Memisevic, “Look, remember and reason: Grounded reasoning in videos with language models,” inICRA, 2023

  20. [20]

    Toward fully autonomous driving: Ai, challenges, opportunities, and needs,

    L. Ullrich, M. Buchholz, K. Dietmayer, and K. Graichen, “Toward fully autonomous driving: Ai, challenges, opportunities, and needs,” arXiv preprint arXiv:2601.22927, 2026

  21. [21]

    Deep reinforcement learning for autonomous driving: A survey,

    B. R. Kiran, I. Sobh, V . Talpaert, P. Mannion, A. A. Al Sallab, S. Yo- gamani, and P. P ´erez, “Deep reinforcement learning for autonomous driving: A survey,”IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp. 4909–4926, 2021

  22. [22]

    Efficient reinforcement learning for autonomous driving with parameterized skills and priors,

    L. Wang, J. Liu, H. Shao, W. Wang, R. Chen, Y . Liu, and S. L. Waslan- der, “Efficient reinforcement learning for autonomous driving with parameterized skills and priors,”arXiv preprint arXiv:2305.04412, 2023

  23. [23]

    Efficient deep reinforcement learning with imitative expert priors for autonomous driving,

    Z. Huang, J. Wu, and C. Lv, “Efficient deep reinforcement learning with imitative expert priors for autonomous driving,”IEEE Transac- tions on Neural Networks and Learning Systems, vol. 34, no. 10, pp. 7391–7403, 2022

  24. [24]

    Fear-neuro-inspired reinforcement learning for safe au- tonomous driving,

    X. He, J. Wu, Z. Huang, Z. Hu, J. Wang, A. Sangiovanni-Vincentelli, and C. Lv, “Fear-neuro-inspired reinforcement learning for safe au- tonomous driving,”IEEE transactions on pattern analysis and machine intelligence, 2023

  25. [25]

    Learning to summarize with human feedback,

    N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. V oss, A. Radford, D. Amodei, and P. F. Christiano, “Learning to summarize with human feedback,”Advances in Neural Information Processing Systems, vol. 33, pp. 3008–3021, 2020

  26. [26]

    Aligning language models with human preferences via a bayesian approach,

    J. Wang, H. Wang, S. Sun, and W. Li, “Aligning language models with human preferences via a bayesian approach,”Advances in Neural Information Processing Systems, vol. 36, 2024

  27. [27]

    Fine-Tuning Language Models from Human Preferences

    D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving, “Fine-tuning language models from human preferences,”arXiv preprint arXiv:1909.08593, 2019

  28. [28]

    Fine-tuning large vision-language models as decision-making agents via reinforcement learning,

    S. Zhai, H. Bai, Z. Lin, J. Pan, P. Tong, Y . Zhou, A. Suhr, S. Xie, Y . LeCun, Y . Ma,et al., “Fine-tuning large vision-language models as decision-making agents via reinforcement learning,”Advances in neural information processing systems, vol. 37, pp. 110 935–110 971, 2024

  29. [29]

    Rl-vlm-f: Reinforcement learning from vision language foundation model feedback,

    Y . Wang, Z. Sun, J. Zhang, Z. Xian, E. Biyik, D. Held, and Z. Erickson, “Rl-vlm-f: Reinforcement learning from vision language foundation model feedback,”arXiv preprint arXiv:2402.03681, 2024

  30. [30]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning,

    R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,”Machine learning, vol. 8, pp. 229–256, 1992

  31. [31]

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    A. Ahmadian, C. Cremer, M. Gall ´e, M. Fadaee, J. Kreutzer, O. Pietquin, A. ¨Ust¨un, and S. Hooker, “Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms,”arXiv preprint arXiv:2402.14740, 2024

  32. [32]

    PaLI-X: On Scaling up a Multilingual Vision and Language Model

    X. Chen, J. Djolonga, P. Padlewski, B. Mustafa, S. Changpinyo, J. Wu, C. R. Ruiz, S. Goodman, X. Wang, Y . Tay,et al., “Pali-x: On scaling up a multilingual vision and language model,”arXiv preprint arXiv:2305.18565, 2023

  33. [33]

    Pali- 3 vision language models: Smaller, faster, stronger,

    X. Chen, X. Wang, L. Beyer, A. Kolesnikov, J. Wu, P. V oigtlaender, B. Mustafa, S. Goodman, I. Alabdulmohsin, P. Padlewski,et al., “Pali- 3 vision language models: Smaller, faster, stronger,”arXiv preprint arXiv:2310.09199, 2023

  34. [34]

    Sigmoid loss for language image pre-training,

    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11 975– 11 986

  35. [35]

    Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset,

    S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y . Chai, B. Sapp, C. R. Qi, Y . Zhou,et al., “Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9710–9719

  36. [36]

    Waymax: An accelerated, data- driven simulator for large-scale autonomous driving research,

    C. Gulino, J. Fu, W. Luo, G. Tucker, E. Bronstein, Y . Lu, J. Harb, X. Pan, Y . Wang, X. Chen,et al., “Waymax: An accelerated, data- driven simulator for large-scale autonomous driving research,”Ad- vances in Neural Information Processing Systems, vol. 36, 2024

  37. [37]

    Motionlm: Multi-agent motion forecasting as language modeling,

    A. Seff, B. Cera, D. Chen, M. Ng, A. Zhou, N. Nayakanti, K. S. Refaat, R. Al-Rfou, and B. Sapp, “Motionlm: Multi-agent motion forecasting as language modeling,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8579–8590

  38. [38]

    Wayformer: Motion forecasting via simple & efficient atten- tion networks,

    N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp, “Wayformer: Motion forecasting via simple & efficient atten- tion networks,” in2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 2980–2987