RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

Behzad Khamidehi; Dongfeng Bai; Ehsan Ahmadi; Fazel Arasteh; Hunter Schofield; Jinjun Shan; Kasra Rezaee; Lili Mou

REVIEW · 2 major objections · 2 minor · cited by 1

Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.


T0 review · grok-4.3

Reinforcement learning fine-tuning aligns traffic simulator rollouts with real-world data distributions for greater realism and goal-based controllability.

2026-05-20 09:15 UTC pith:SXF7B6M6

Load-bearing objection: RL fine-tuning improves traffic-simulation realism and controllability over open-loop methods, but the reward needs scrutiny for multi-agent stability.

arXiv 2605.19033 v1 · pith:SXF7B6M6 · submitted 2026-05-18 · cs.RO, cs.AI, cs.CV, cs.LG, cs.MA

Ehsan Ahmadi, Hunter Schofield, Behzad Khamidehi, Fazel Arasteh, Jinjun Shan, Lili Mou, Dongfeng Bai, Kasra Rezaee

classification: cs.RO, cs.AI, cs.CV, cs.LG, cs.MA

keywords: multi-agent traffic simulation, reinforcement learning fine-tuning, scenario realism, goal-conditioned controllability, Waymo Open Motion Dataset, autonomous driving

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Supervised training of traffic simulators often misses the back-and-forth interactions among multiple vehicles in real driving scenes. This paper shows that starting from a pre-trained model and then fine-tuning it with reinforcement learning can shift the generated scenarios to better match the statistical patterns found in actual road data. The same process also embeds the ability to steer entire simulations toward specific goals. Readers care because more faithful simulators let engineers test self-driving systems on rare or risky situations without putting real cars on the road.

Core claim

RLFTSim shows that reinforcement-learning fine-tuning of a pre-trained traffic simulation model, guided by a reward that trades off fidelity to real data against controllability, produces rollouts whose statistics align more closely with real-world distributions. The method reaches state-of-the-art realism scores on the Waymo Open Motion Dataset, requires far fewer samples than heuristic search baselines thanks to its dense low-variance reward, and simultaneously distills goal-conditioned controllability into the simulator.

What carries the argument

The RL fine-tuning loop applied to a pre-trained simulator, driven by a reward that penalizes deviation from real data distributions while rewarding goal achievement.

Load-bearing premise

The pre-trained base model is stable enough to fine-tune with reinforcement learning and the chosen reward actually balances data fidelity against controllability without creating new artifacts.

What would settle it

Evaluating the fine-tuned simulator on held-out real trajectories and finding that distributions of inter-vehicle distances, speeds, or collision rates remain no closer to the real data than those produced by the original pre-trained model.
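The check described above can be sketched in a few lines. This is a minimal illustration, assuming a histogram-based Jensen-Shannon divergence as the distance between distributions; the paper's own realism metric may differ, and the function names `js_divergence` and `moved_closer` and all numbers are hypothetical.

```python
import numpy as np


def js_divergence(real, sim, bins=20, value_range=None):
    """Jensen-Shannon divergence in bits between two 1-D samples.

    Both samples are binned on one shared grid so the histograms are comparable.
    """
    real = np.asarray(real, dtype=float)
    sim = np.asarray(sim, dtype=float)
    if value_range is None:
        value_range = (min(real.min(), sim.min()), max(real.max(), sim.max()))
    p, _ = np.histogram(real, bins=bins, range=value_range)
    q, _ = np.histogram(sim, bins=bins, range=value_range)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # bins with a == 0 contribute nothing to the sum
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def moved_closer(real, base_sim, tuned_sim, bins=20):
    """True if the fine-tuned sample is closer to the real one than the base sample."""
    everything = np.concatenate([np.asarray(x, dtype=float)
                                 for x in (real, base_sim, tuned_sim)])
    shared = (everything.min(), everything.max())  # same grid for both comparisons
    return (js_divergence(real, tuned_sim, bins, shared)
            < js_divergence(real, base_sim, bins, shared))


# Synthetic stand-in data (hypothetical speeds in m/s, not taken from the paper).
rng = np.random.default_rng(0)
real_speeds = rng.normal(10.0, 2.0, size=5000)
base_speeds = rng.normal(12.0, 3.0, size=5000)
tuned_speeds = rng.normal(10.2, 2.1, size=5000)
print(moved_closer(real_speeds, base_speeds, tuned_speeds))
```

If `moved_closer` returned `False` for quantities such as inter-vehicle distance, speed, or collision rate on held-out data, that would count against the core claim in the sense described above.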


If this is right

Simulator outputs achieve higher realism metrics on standard benchmarks such as the Waymo Open Motion Dataset.
Goal conditioning allows direct control over generated traffic scenarios without post-hoc search.
The approach uses substantially fewer environment samples than heuristic fine-tuning methods.
Realism alignment is enforced directly by the training objective rather than by post-processing.
The framework can be layered on top of existing pre-trained simulation models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

More realistic simulators could supply higher-quality synthetic data for training perception and planning models in autonomous vehicles.
The same fine-tuning idea might transfer to other multi-agent domains such as pedestrian crowd simulation or warehouse robotics.
Controllable scenario generation could help systematically create safety-critical edge cases for regulatory testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Referee Report

2 major / 2 minor

Summary. The paper proposes RLFTSim, a reinforcement-learning fine-tuning framework for multi-agent traffic simulation. Starting from a pre-trained simulator, it uses RL to align rollouts with real-world distributions on the Waymo Open Motion Dataset to improve realism while distilling goal-conditioned controllability. The method claims SOTA performance, significantly better sample efficiency than heuristic search baselines, and a low-variance dense reward that directly addresses realism alignment.

Significance. If the central claims hold, the work offers a practical route to improve multi-agent traffic simulators for autonomous-driving validation by combining pre-trained models with RL fine-tuning and controllability distillation. The emphasis on sample efficiency and explicit handling of realism via reward design could influence future simulation pipelines, though the absence of detailed metrics and reward specifications in the provided abstract limits immediate assessment of impact.

major comments (2)

[Abstract] The claim of achieving state-of-the-art realism and sample efficiency is asserted without any quantitative metrics, baseline comparisons, error bars, or statistical tests, making it impossible to evaluate the strength of the reported improvements on the Waymo dataset.
[§3] Method, reward design: the reward that is said to balance fidelity to data against controllability is described only at a high level; without the explicit formulation, component weights, or ablations, it cannot be verified whether the reward is truly low-variance and free of biases that could amplify distribution shifts across multi-agent interactions over long horizons.

minor comments (2)

[§4] Experiments: add error bars, confidence intervals, and details on the number of runs for all reported realism and controllability metrics to allow proper interpretation of the SOTA claims.
[Appendix / Code] Ensure the project page link and any released code include the exact reward implementation and training hyperparameters so that the low-variance claim can be reproduced.
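As an illustration of the reporting requested in the minor comments, the sketch below computes a mean and a percentile-bootstrap confidence interval over repeated runs. It assumes one scalar realism score per independent run; the function name `bootstrap_ci` and the example scores are hypothetical placeholders, not results from the paper.

```python
import numpy as np


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Mean of `scores` with a percentile-bootstrap (1 - alpha) confidence interval."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample run indices with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(scores.mean()), float(lo), float(hi)


# Placeholder scores from five hypothetical training runs.
run_scores = [0.70, 0.72, 0.71, 0.69, 0.73]
mean, lo, hi = bootstrap_ci(run_scores)
print(f"mean={mean:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}], runs={len(run_scores)}")
```

With only a handful of runs the bootstrap interval is itself rough, so reporting the number of runs alongside it, as the comment asks, matters as much as the interval.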

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each major point below and have prepared revisions to improve the manuscript where the feedback identifies gaps in detail or verifiability.


Referee: [Abstract] The claim of achieving state-of-the-art realism and sample efficiency is asserted without any quantitative metrics, baseline comparisons, error bars, or statistical tests, making it impossible to evaluate the strength of the reported improvements on the Waymo dataset.

Authors: We agree that the abstract would benefit from quantitative support to substantiate the SOTA and sample-efficiency claims. In the revised manuscript we will update the abstract to include key numerical results from the Waymo Open Motion Dataset experiments (realism metrics, sample counts versus heuristic baselines, and references to error bars and statistical significance). revision: yes

Referee: [§3] Method, reward design: the reward that is said to balance fidelity to data against controllability is described only at a high level; without the explicit formulation, component weights, or ablations, it cannot be verified whether the reward is truly low-variance and free of biases that could amplify distribution shifts across multi-agent interactions over long horizons.

Authors: We acknowledge that the current §3 presents the reward at a conceptual level. We will add the explicit reward equation, the precise component weights, and dedicated ablation results in the revised §3. These additions will allow direct verification of the claimed low variance and will include analysis of potential bias accumulation in multi-agent rollouts. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation grounded in external data and pre-trained model.


The paper instantiates RLFTSim on a pre-trained simulation model and aligns rollouts to the external Waymo Open Motion Dataset via a designed reward balancing fidelity and controllability. No step in the provided abstract or description reduces a claimed prediction or result to its own inputs by construction, self-definition, or load-bearing self-citation. The method is presented as an independent fine-tuning procedure with empirical validation on real-world data, making the central claims self-contained against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim depends on the existence of a suitable pre-trained base model and on a reward function that can be designed to trade off fidelity and controllability without side effects; neither is detailed in the abstract.

free parameters (1)

reward balance weights
The reward is described as balancing fidelity and controllability, implying tunable coefficients whose values are not reported.
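To see why the unreported coefficients matter, consider a minimal sketch that assumes the reward is a weighted sum of a fidelity term and a goal term. The weighted-sum form, the names `RewardWeights`, `combined_reward`, and `preferred`, and all scores are assumptions for illustration; the paper's actual reward formulation is not given in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical trade-off coefficients; the paper's values are not reported."""
    fidelity: float
    goal: float


def combined_reward(fidelity_score, goal_score, weights):
    """Weighted sum of a realism term and a goal-achievement term."""
    return weights.fidelity * fidelity_score + weights.goal * goal_score


def preferred(rollouts, weights):
    """Name of the rollout with the highest combined reward."""
    return max(rollouts, key=lambda name: combined_reward(*rollouts[name], weights))


# Made-up (fidelity, goal) scores for two candidate rollouts.
rollouts = {"stays_realistic": (0.9, 0.2), "reaches_goal": (0.6, 0.9)}

for w in (RewardWeights(1.0, 0.2), RewardWeights(1.0, 1.0)):
    print(w, "->", preferred(rollouts, w))
```

In this toy setting the preferred rollout flips when the goal weight changes, which is the kind of sensitivity an ablation over the weights would expose.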

pith-pipeline@v0.9.0 · 5752 in / 1146 out tokens · 35741 ms · 2026-05-20T09:15:57.996655+00:00 · methodology

Original abstract

Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. The project page is available at https://ehsan-ami.github.io/rlftsim.

Figures

Figures reproduced from arXiv: 2605.19033 by Behzad Khamidehi, Dongfeng Bai, Ehsan Ahmadi, Fazel Arasteh, Hunter Schofield, Jinjun Shan, Kasra Rezaee, Lili Mou.

**Figure 2.** Empirical reward variance of MLOO and RLOO on the …
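For readers unfamiliar with the RLOO estimator named in the Figure 2 caption, the sketch below shows the standard leave-one-out REINFORCE baseline: each rollout's reward is compared with the mean reward of the other rollouts sampled for the same scenario. This is the textbook form only; it is not the paper's MLOO variant, whose definition is not reproduced on this page, and the function name `rloo_advantages` is hypothetical.

```python
import numpy as np


def rloo_advantages(rewards):
    """Leave-one-out advantages for N rollouts of the same scenario.

    advantage_i = r_i - mean of r_j over all j != i
    """
    rewards = np.asarray(rewards, dtype=float)
    n = rewards.shape[0]
    if n < 2:
        raise ValueError("a leave-one-out baseline needs at least two rollouts")
    # Mean of the other N - 1 rewards, computed for every i at once.
    baseline = (rewards.sum() - rewards) / (n - 1)
    return rewards - baseline


# Made-up rewards for three rollouts.
print(rloo_advantages([1.0, 2.0, 3.0]))
```

Because the baseline for rollout i never includes its own reward, subtracting it leaves the policy-gradient estimate unbiased while typically lowering its variance; the advantages also sum to zero across the N rollouts.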

**Figure 3.** Qualitative Evidence of RLFTSim Effectiveness. (a) Realism Enhancement: Comparison of baseline SMART-tiny (a-2) vs. RLFTSim (a-3) on a challenging intersection scenario. The baseline model generates unrealistic off-road behavior (red trajectory) and a collision with cross-traffic, while RLFTSim produces realistic lane-following behavior that respects traffic rules. (b) Controllability Distillation: Two set…


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions... design a reward that balances fidelity and controllability"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "RMMMLOO_i = 1/N sum (RMM_{-j} - RMM_{-i}) ... unbiased estimator of grad E[RMM(tau_1:N-1)]"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Flow-ERD: Agent-type Aware Flow Matching with Entropy-Regularized Distillation for Diverse Traffic Simulation
cs.RO 2026-07 conditional novelty 6.0

Flow-ERD achieves state-of-the-art realism and diversity on the WOSAC benchmark by coupling agent-type-aware flow matching with entropy-regularized distillation that prevents mode collapse during closed-loop fine-tuning.

[1] [1]

Curb your attention: Causal attention gating for robust trajectory prediction in autonomous driving

Ehsan Ahmadi, Soheil Alizadeh, Ray Mercurius, and Amir Rasouli. Curb your attention: Causal attention gating for robust trajectory prediction in autonomous driving. InICRA,

work page

[2] [2]

Getting SMARTER for motion planning in autonomous driving systems

Montgomery Alban, Ehsan Ahmadi, Randy Goebel, and Amir Rasouli. Getting SMARTER for motion planning in autonomous driving systems. InIEEE IV, 2025. 1

work page 2025

[3] [3]

Albrecht, Cillian Brewitt, John Wilhelm, Balint Gyevnar, Francisco Eiras, Mihai Dobre, and Subramanian Ramamoorthy

Stefano V . Albrecht, Cillian Brewitt, John Wilhelm, Balint Gyevnar, Francisco Eiras, Mihai Dobre, and Subramanian Ramamoorthy. Interpretable goal-based prediction and plan- ning for autonomous driving. InICRA, 2021. 3

work page 2021

[4] [4]

Hind- sight experience replay

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hind- sight experience replay. InNeurIPS, 2017. 2

work page 2017

[5] [5]

NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Bei- jbom, and Sammy Omari. NuPlan: A closed-loop ML- based planning benchmark for autonomous vehicles, 2022. arXiv:2106.11810. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[6] [6]

Reinforcement learning with human feedback for realistic traffic simulation

Yulong Cao, Boris Ivanovic, Chaowei Xiao, and Marco Pavone. Reinforcement learning with human feedback for realistic traffic simulation. InICRA, 2024. 2

work page 2024

[7] [7]

Human-compatible driving agents through data-regularized self-play reinforce- ment learning.Reinforcement Learning Journal, 1, 2024

Daphne Cornelisse and Eugene Vinitsky. Human-compatible driving agents through data-regularized self-play reinforce- ment learning.Reinforcement Learning Journal, 1, 2024. 3

work page 2024

[8] [8]

Causal confusion in imitation learning

Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal confusion in imitation learning. InNeurIPS, 2019. 2

work page 2019

[9] [9]

Large scale in- teractive motion forecasting for autonomous driving : The Waymo open motion dataset

Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles Qi, Yin Zhou, Zoey Yang, Aurelien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander McCauley, Jonathon Shlens, and Dragomir Anguelov. Large scale in- teractive motion forecasting for autonomous driving : The Waymo open motion datas...

work page 2021

[10] [10]

Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research

Cole Gulino, Justin Fu, Wenjie Luo, George Tucker, Eli Bronstein, Yiren Lu, Jean Harb, Xinlei Pan, Yan Wang, Xi- angyu Chen, et al. Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research. In NeurIPS, 2023. 1, 7

work page 2023

[11] [11]

Solv- ing motion planning tasks with a scalable generative model

Yihan Hu, Siqi Chai, Zhening Yang, Jingyu Qian, Kun Li, Wenxin Shao, Haichao Zhang, Wei Xu, and Qiang Liu. Solv- ing motion planning tasks with a scalable generative model. InECCV, 2024. 2, 6

work page 2024

[12] [12]

Solv- ing motion planning tasks with a scalable generative model

Yihan Hu, Siqi Chai, Zhening Yang, Jingyu Qian, Kun Li, Wenxin Shao, Haichao Zhang, Wei Xu, and Qiang Liu. Solv- ing motion planning tasks with a scalable generative model. InECCV, 2024. 2

work page 2024

[13] [13]

Versatile Behavior Diffusion for Generalized Traffic Agent Simulation,

Zhiyu Huang, Zixu Zhang, Ameya Vaidya, Yuxiao Chen, Chen Lv, and Jaime Fern ´andez Fisac. Versatile behav- ior diffusion for generalized traffic agent simulation, 2024. arXiv:2404.02524. 2, 6

work page arXiv 2024

[14] [14]

Sym- phony: Learning realistic and diverse agents for autonomous driving simulation

Maximilian Igl, Daewoo Kim, Alex Kuefler, Paul Mougin, Punit Shah, Kyriacos Shiarlis, Dragomir Anguelov, Mark Palatucci, Brandyn White, and Shimon Whiteson. Sym- phony: Learning realistic and diverse agents for autonomous driving simulation. InICRA, 2022. 2

work page 2022

[15] Nidhi Kalra and Susan M. Paddock. Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transportation Research Part A: Policy and Practice, 94:182–193, 2016. 1

[16] Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free! In DeepRLStructPred@ICLR, 2019. 3

[17] Longzhong Lin, Xuewu Lin, Kechun Xu, Haojian Lu, Lichao Huang, Rong Xiong, and Yue Wang. Revisit mixture models for multi-agent simulation: Experimental study within a unified framework, 2025. arXiv:2501.17015. 2, 6

[18] Minghuan Liu, Menghui Zhu, and Weinan Zhang. Goal-conditioned reinforcement learning: Problems and solutions. In IJCAI, 2022. 3

[19] Yiren Lu, Justin Fu, George Tucker, Xinlei Pan, Eli Bronstein, Rebecca Roelofs, Benjamin Sapp, Brandyn White, Aleksandra Faust, Shimon Whiteson, Dragomir Anguelov, and Sergey Levine. Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios. In IROS, 2023. 2

[20] Nico Montali, John Lambert, Paul Mougin, Alex Kuefler, Nick Rhinehart, Michelle Li, Cole Gulino, Tristan Emrich, Zoey Yang, Shimon Whiteson, Brandyn White, and Dragomir Anguelov. The Waymo Open Sim Agents Challenge. In NeurIPS, 2023. 3, 4, 5, 11

[21] Zhenghao Peng, Wenjie Luo, Yiren Lu, Tianyi Shen, Cole Gulino, Ari Seff, and Justin Fu. Improving agent behaviors with RL fine-tuning for autonomous driving. In ECCV, 2024. 4

[22] Jonah Philion, Xue Bin Peng, and Sanja Fidler. Trajeglish: Traffic modeling as next-token prediction. In ICLR, 2024. 2, 6

[23] Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, and Jürgen Schmidhuber. Hindsight policy gradients. In ICLR,

[24] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011. 2

[25] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. arXiv:2402.03300. 2

[26] Simon Suo, Sebastian Regalado, Sergio Casas, and Raquel Urtasun. TrafficSim: Learning to simulate realistic multi-agent behaviors. In CVPR, 2021. 2

[27] Shuhan Tan, Boris Ivanovic, Yuxiao Chen, Boyi Li, Xinshuo Weng, Yulong Cao, Philipp Kraehenbuehl, and Marco Pavone. Promptable closed-loop traffic simulation. In CoRL,

[28] Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Ávila Pires, Michal Valko, Yong Cheng, and Will Dabney. Understanding the performance gap between online and offline alignment algorithms, 2024. arXiv:2405.08448. 3

[29] Thomas Tian and Kratarth Goel. Direct post-training preference alignment for multi-agent motion generation model using implicit feedback from pre-training demonstrations. In ICLR, 2025. 2, 3

[30] Martin Treiber, Ansgar Hennecke, and Dirk Helbing. Congested traffic states in empirical observations and microscopic simulations. Phys. Rev. E, 2000. 1, 2

[31] Yu Wang, Tiebiao Zhao, and Fan Yi. Multiverse Transformer: 1st place solution for Waymo open sim agents challenge 2023, 2023. arXiv:2306.11868. 6

[32] Yuting Wang, Lu Liu, Maonan Wang, and Xi Xiong. Reinforcement learning from human feedback for lane changing of autonomous vehicles in mixed traffic, 2024. arXiv:2408.04447. 2

[33] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In NeurIPS Datasets and Benchmarks, 2021. 3

[34] Wei Wu, Xiaoxin Feng, Ziyan Gao, and Yuheng KAN. SMART: Scalable multi-agent real-time motion generation via next-token prediction. In NeurIPS, 2024. 1, 2, 6, 7, 15, 16, 17

[35] Danfei Xu, Yuxiao Chen, Boris Ivanovic, and Marco Pavone. BITS: Bi-level imitation for traffic simulation. In ICRA,

[36] Zhejun Zhang, Alexander Liniger, Dengxin Dai, Fisher Yu, and Luc Van Gool. TrafficBots: Towards world models for autonomous driving simulation and motion prediction. In ICRA, 2023. 3

[37] Zhejun Zhang, Christos Sakaridis, and Luc Van Gool. TrafficBots V1.5: Traffic simulation via conditional VAEs and transformers with relative pose encoding, 2024. arXiv:2406.10898. 6, 18

[38] Zhiyuan Zhang, Xiaosong Jia, Guanyu Chen, Qifeng Li, and Junchi Yan. TrajTok: Technical report for 2025 Waymo open sim agents challenge, 2025. arXiv:2506.21618. 2, 6

[39] Zhejun Zhang, Peter Karkus, Maximilian Igl, Wenhao Ding, Yuxiao Chen, Boris Ivanovic, and Marco Pavone. Closed-loop supervised fine-tuning of tokenized traffic models. In CVPR, 2025. 2, 6, 7, 15

[40] Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Ben Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, Congcong Li, and Dragomir Anguelov. TNT: Target-driven trajectory prediction. In CoRL, 2020. 3

[41] Jianbo Zhao, Taiyu Ban, Zhihao Liu, Hangning Zhou, Xiyang Wang, Qibin Zhou, Hailong Qin, Mu Yang, Lei Liu, and Bin Li. DRoPE: Directional rotary position embedding for efficient agent interaction modeling, 2025. arXiv:2503.15029. 6

[42] Jianbo Zhao, Jiaheng Zhuang, Qibin Zhou, Taiyu Ban, Ziyao Xu, Hangning Zhou, Junhe Wang, Guoan Wang, Zhiheng Li, and Bin Li. KiGRAS: Kinematic-driven generative model for realistic agent simulation. IEEE Robotics and Automation Letters, 2025. 2, 6

[43] Ziyuan Zhong, Davis Rempe, Danfei Xu, Yuxiao Chen, Sushant Veer, Tong Che, Baishakhi Ray, and Marco Pavone. Guided conditional diffusion for controllable traffic simulation. In ICRA, 2023. 3

[44] Zikang Zhou, Haibo Hu, Xinhong Chen, Jianping Wang, Nan Guan, Kui Wu, Yung-Hui Li, Yu-Kai Huang, and Chun Jason Xue. BehaviorGPT: Smart agent simulation for autonomous driving with next-patch prediction. In NeurIPS,

[45] Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences, 2019. arXiv:1909.08593. 15

RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning Supplementary Material

A. Methodological D...