
arxiv: 2605.19033 · v1 · pith:SXF7B6M6new · submitted 2026-05-18 · 💻 cs.RO · cs.AI · cs.CV · cs.LG · cs.MA

RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

Pith reviewed 2026-05-20 09:15 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG · cs.MA
keywords multi-agent traffic simulation · reinforcement learning fine-tuning · scenario realism · goal-conditioned controllability · Waymo Open Motion Dataset · autonomous driving

The pith

Reinforcement learning fine-tuning aligns traffic simulator rollouts with real-world data distributions for greater realism and goal-based controllability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Supervised training of traffic simulators often misses the back-and-forth interactions among multiple vehicles in real driving scenes. This paper shows that starting from a pre-trained model and then fine-tuning it with reinforcement learning can shift the generated scenarios to better match the statistical patterns found in actual road data. The same process also embeds the ability to steer entire simulations toward specific goals. Readers care because more faithful simulators let engineers test self-driving systems on rare or risky situations without putting real cars on the road.

Core claim

RLFTSim shows that reinforcement-learning fine-tuning of a pre-trained traffic simulation model, guided by a reward that trades off fidelity to real data against controllability, produces rollouts whose statistics align more closely with real-world distributions. The method reaches state-of-the-art realism scores on the Waymo Open Motion Dataset, requires far fewer samples than heuristic search baselines thanks to its dense low-variance reward, and simultaneously distills goal-conditioned controllability into the simulator.

What carries the argument

The RL fine-tuning loop applied to a pre-trained simulator, driven by a reward that penalizes deviation from real data distributions while rewarding goal achievement.
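
As a rough illustration of what such a reward could look like, the sketch below blends a fidelity term with a goal term through a single weight. The paper's actual reward is not given on this page, so everything here is an assumption made for illustration: the function name `blended_reward`, the parameters `weight` and `goal_scale_m`, and the choice of a negative log-likelihood as the fidelity signal.

```python
import math


def blended_reward(fidelity_nll: float, goal_distance_m: float,
                   weight: float = 0.5, goal_scale_m: float = 5.0) -> float:
    """Hypothetical reward: trade data fidelity off against goal achievement.

    fidelity_nll    -- assumed negative log-likelihood of the rollout under
                       real-data statistics (lower means more realistic)
    goal_distance_m -- assumed distance from the agent to its goal, in metres
    weight          -- 0.0 means pure fidelity, 1.0 means pure goal seeking
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    # Fidelity term: penalize deviation from the real data distribution.
    fidelity_term = -fidelity_nll
    # Goal term: 1.0 at the goal, decaying smoothly with distance.
    goal_term = math.exp(-goal_distance_m / goal_scale_m)
    return (1.0 - weight) * fidelity_term + weight * goal_term


# With the same fidelity, a rollout that ends closer to its goal scores higher.
near = blended_reward(fidelity_nll=1.2, goal_distance_m=1.0)
far = blended_reward(fidelity_nll=1.2, goal_distance_m=20.0)
print(near > far)
```

A single scalar weight is the simplest possible trade-off; whether the authors use one weight, several, or a different functional form is not stated here.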

If this is right

  • Simulator outputs achieve higher realism metrics on standard benchmarks such as the Waymo Open Motion Dataset.
  • Goal conditioning allows direct control over generated traffic scenarios without post-hoc search.
  • The approach uses substantially fewer environment samples than heuristic search-based fine-tuning methods.
  • Realism alignment is enforced directly by the training objective rather than by post-processing.
  • The framework can be layered on top of existing pre-trained simulation models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • More realistic simulators could supply higher-quality synthetic data for training perception and planning models in autonomous vehicles.
  • The same fine-tuning idea might transfer to other multi-agent domains such as pedestrian crowd simulation or warehouse robotics.
  • Controllable scenario generation could help systematically create safety-critical edge cases for regulatory testing.

Load-bearing premise

The pre-trained base model is stable enough to fine-tune with reinforcement learning and the chosen reward actually balances data fidelity against controllability without creating new artifacts.

What would settle it

Evaluating the fine-tuned simulator on held-out real trajectories and finding that distributions of inter-vehicle distances, speeds, or collision rates remain no closer to the real data than those produced by the original pre-trained model.
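
One hypothetical way to run such a check is to histogram a statistic (speed, for example) from real logs and from each simulator, then compare Jensen-Shannon divergences against the real histogram. The helper names `histogram` and `js_divergence`, the bin settings, and the toy numbers are illustrative assumptions; the paper's own evaluation metrics are not specified on this page.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Normalized histogram over [lo, hi]; out-of-range values are clamped."""
    if not values:
        raise ValueError("values must be non-empty")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(max(int((v - lo) / width), 0), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy, made-up speed samples (m/s); real use would load logged trajectories.
real = [4.8, 5.1, 5.3, 9.7, 10.2, 10.4]
base = [0.5, 1.0, 14.0, 14.5, 15.0, 15.5]
tuned = [4.5, 5.0, 5.6, 9.5, 10.0, 10.8]

h_real = histogram(real, 0.0, 16.0, 8)
gap_base = js_divergence(h_real, histogram(base, 0.0, 16.0, 8))
gap_tuned = js_divergence(h_real, histogram(tuned, 0.0, 16.0, 8))
# The claim would be in trouble if gap_tuned were not smaller than gap_base.
print(gap_tuned < gap_base)
```

The same comparison could be repeated for inter-vehicle distances or collision rates; a single statistic matching would not by itself settle the question.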

Figures

Figures reproduced from arXiv: 2605.19033 by Behzad Khamidehi, Dongfeng Bai, Ehsan Ahmadi, Fazel Arasteh, Hunter Schofield, Jinjun Shan, Kasra Rezaee, Lili Mou.

Figure 1: Post-training with RLFTSim. For a seed scenario from …
Figure 2: Empirical reward variance of MLOO and RLOO on the …
Figure 3: Qualitative Evidence of RLFTSim Effectiveness. (a) Realism Enhancement: Comparison of baseline SMART-tiny (a-2) vs. RLFTSim (a-3) on a challenging intersection scenario. The baseline model generates unrealistic off-road behavior (red trajectory) and a collision with cross-traffic, while RLFTSim produces realistic lane-following behavior that respects traffic rules. (b) Controllability Distillation: Two set…

Abstract

Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. The project page is available at https://ehsan-ami.github.io/rlftsim.
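
The Figure 2 caption above compares MLOO with RLOO. Assuming RLOO there refers to the standard REINFORCE leave-one-out estimator, the sketch below shows how that baseline turns a group of rollout rewards into advantages. The paper's MLOO variant is not described on this page and is not reproduced here; the function name `leave_one_out_advantages` is a hypothetical choice.

```python
from typing import List, Sequence


def leave_one_out_advantages(rewards: Sequence[float]) -> List[float]:
    """Standard leave-one-out baseline over k rollouts of the same scenario.

    Each rollout's baseline is the mean reward of the other k - 1 rollouts,
    so the baseline never depends on the sample it is applied to.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two rollouts per scenario")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Four hypothetical rollout rewards for one seed scenario.
advantages = leave_one_out_advantages([1.0, 2.0, 3.0, 6.0])
print(advantages)
```

Each advantage would then weight the log-probability gradient of its own rollout. Because the advantages sum to zero within a group, rollouts are effectively ranked against their siblings rather than against an absolute reward scale.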

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RLFTSim, a reinforcement-learning fine-tuning framework for multi-agent traffic simulation. Starting from a pre-trained simulator, it uses RL to align rollouts with real-world distributions on the Waymo Open Motion Dataset to improve realism while distilling goal-conditioned controllability. The method claims SOTA performance, significantly better sample efficiency than heuristic search baselines, and a low-variance dense reward that directly addresses realism alignment.

Significance. If the central claims hold, the work offers a practical route to improve multi-agent traffic simulators for autonomous-driving validation by combining pre-trained models with RL fine-tuning and controllability distillation. The emphasis on sample efficiency and explicit handling of realism via reward design could influence future simulation pipelines, though the absence of detailed metrics and reward specifications in the provided abstract limits immediate assessment of impact.

major comments (2)
  1. [Abstract] Abstract: the claim of achieving state-of-the-art realism and sample efficiency is asserted without any quantitative metrics, baseline comparisons, error bars, or statistical tests, making it impossible to evaluate the strength of the reported improvements on the Waymo dataset.
  2. [§3] §3 (Method, reward design): the reward that is said to balance fidelity to data against controllability is described only at a high level; without the explicit formulation, component weights, or ablations, it cannot be verified whether the reward is truly low-variance and free of biases that could amplify distribution shifts across multi-agent interactions over long horizons.
minor comments (2)
  1. [§4] §4 (Experiments): add error bars, confidence intervals, and details on the number of runs for all reported realism and controllability metrics to allow proper interpretation of the SOTA claims.
  2. [Appendix / Code] Ensure the project page link and any released code include the exact reward implementation and training hyperparameters so that the low-variance claim can be reproduced.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each major point below and have prepared revisions to improve the manuscript where the feedback identifies gaps in detail or verifiability.

  1. Referee: [Abstract] Abstract: the claim of achieving state-of-the-art realism and sample efficiency is asserted without any quantitative metrics, baseline comparisons, error bars, or statistical tests, making it impossible to evaluate the strength of the reported improvements on the Waymo dataset.

    Authors: We agree that the abstract would benefit from quantitative support to substantiate the SOTA and sample-efficiency claims. In the revised manuscript we will update the abstract to include key numerical results from the Waymo Open Motion Dataset experiments (realism metrics, sample counts versus heuristic baselines, and references to error bars and statistical significance).

    revision: yes

  2. Referee: [§3] §3 (Method, reward design): the reward that is said to balance fidelity to data against controllability is described only at a high level; without the explicit formulation, component weights, or ablations, it cannot be verified whether the reward is truly low-variance and free of biases that could amplify distribution shifts across multi-agent interactions over long horizons.

    Authors: We acknowledge that the current §3 presents the reward at a conceptual level. We will add the explicit reward equation, the precise component weights, and dedicated ablation results in the revised §3. These additions will allow direct verification of the claimed low variance and will include analysis of potential bias accumulation in multi-agent rollouts.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation grounded in external data and pre-trained model.


The paper instantiates RLFTSim on a pre-trained simulation model and aligns rollouts to the external Waymo Open Motion Dataset via a designed reward balancing fidelity and controllability. No step in the provided abstract or description reduces a claimed prediction or result to its own inputs by construction, self-definition, or load-bearing self-citation. The method is presented as an independent fine-tuning procedure with empirical validation on real-world data, making the central claims self-contained against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim depends on the existence of a suitable pre-trained base model and on a reward function that can be designed to trade off fidelity and controllability without side effects; neither is detailed in the abstract.

free parameters (1)
  • reward balance weights
    The reward is described as balancing fidelity and controllability, implying tunable coefficients whose values are not reported.



