arxiv: 2606.31209 · v1 · pith:NG2OBJ2Tnew · submitted 2026-06-30 · 💻 cs.AI · cs.RO

Long-term Traffic Simulation via Structured Autoregressive Modeling

Pith reviewed 2026-07-01 05:53 UTC · model grok-4.3

classification 💻 cs.AI cs.RO
keywords traffic simulation · autoregressive modeling · large language models · multi-agent interactions · Waymo Open Sim Agent Challenge · long-horizon simulation · motion tokens · retrieval-based evaluation

The pith

Small frozen LLMs adapt to traffic simulation through motion-language token consistency, powering RosettaSim for stable long-horizon multi-agent modeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that the inductive biases and statistical priors of LLMs, specifically the transferability of their attention mechanisms and the distributional match between motion tokens and natural language tokens, let small heavily frozen models rapidly adapt to interactive traffic simulation without major architectural changes or fine-tuning. This addresses the core difficulty of sustaining multi-agent interactions over long horizons where agents enter and exit dynamically. A sympathetic reader would care because it suggests traffic world models for autonomous driving can reuse existing sequence-model infrastructure rather than building specialized simulators from scratch. The work also introduces a new evaluation method called Retrieval-based Traffic Evaluation to handle the fading one-to-one agent correspondence in extended rollouts.

Core claim

RosettaSim projects scene topology, agent states, and spawning intents into a structured autoregressive stream of variable length; small frozen LLMs then generate sustained multi-agent traffic behavior, reaching state-of-the-art results on both short- and long-term metrics of the Waymo Open Sim Agent Challenge. Retrieval-based Traffic Evaluation retrieves semantically similar real-world scenarios as context-aware anchors and achieves a higher correlation (r=0.83) with standard metrics than prior evaluation approaches (r=0.74).

What carries the argument

RosettaSim, the unified framework that converts dynamic traffic scenes into a single variable-length structured autoregressive token stream, leveraging LLM attention transfer and motion-natural-language distributional consistency.

If this is right

  • RosettaSim reaches state-of-the-art accuracy on both short-term and long-term simulation tasks in the Waymo Open Sim Agent Challenge.
  • Retrieval-based Traffic Evaluation supplies reference anchors that raise correlation with long-horizon fidelity from r=0.74 to r=0.83.
  • Variable-length autoregressive streams naturally accommodate agents entering and exiting the scene.
  • Heavily frozen small LLMs suffice for the adaptation once the scene is projected into the structured token stream.
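
To make the variable-length point above concrete, here is a minimal sketch of how a scene could be flattened into one token stream whose per-step length changes as agents spawn and exit. The page does not describe RosettaSim's actual token layout, so the tags (`MAP`, `AGENT`, `MOTION`, `SPAWN`, `END_STEP`), the `AgentState` class, and both functions are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentState:
    agent_id: int
    motion_token: int  # index into a hypothetical motion vocabulary


def flatten_step(agents, spawn_ids):
    """Flatten one step: agent states, then spawn intents, then a step delimiter."""
    tokens = []
    for agent in sorted(agents, key=lambda a: a.agent_id):
        tokens.append(("AGENT", agent.agent_id))
        tokens.append(("MOTION", agent.motion_token))
    for new_id in spawn_ids:
        tokens.append(("SPAWN", new_id))
    tokens.append(("END_STEP", 0))
    return tokens


def flatten_rollout(map_tokens, steps):
    """Emit map tokens once, then every step back to back (assumed layout)."""
    stream = [("MAP", t) for t in map_tokens]
    for agents, spawn_ids in steps:
        stream.extend(flatten_step(agents, spawn_ids))
    return stream


steps = [
    ([AgentState(0, 5)], [3]),                   # agent 3 is announced
    ([AgentState(0, 6), AgentState(3, 9)], []),  # agent 3 is now present
    ([AgentState(3, 2)], []),                    # agent 0 has exited
]
stream = flatten_rollout([11, 12], steps)
step_lengths = [len(flatten_step(a, s)) for a, s in steps]
```

In this toy layout an exiting agent simply stops contributing tokens, so no padding or fixed agent slots are needed; whether the paper handles exits the same way is not stated on this page.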

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the token-consistency mechanism generalizes, the same projection technique could be tested on other continuous physical domains such as pedestrian crowds or robotic manipulation without domain-specific retraining.
  • RTE-style retrieval anchors might be applied to other long-horizon simulation benchmarks to reduce reliance on fading one-to-one agent matching.
  • Stable long-horizon traffic models produced this way could serve as drop-in world models inside closed-loop autonomous-driving planners.

Load-bearing premise

Distributional consistency between motion tokens and natural language tokens is sufficient for small frozen LLMs to adapt rapidly to traffic modeling without substantial fine-tuning or architectural changes.

What would settle it

A controlled experiment in which small frozen LLMs given the same structured stream but without the claimed motion-language consistency show no advantage over non-LLM autoregressive baselines on long-horizon WOSAC rollouts.

Figures

Figures reproduced from arXiv: 2606.31209 by Lingyu Xiao, Xintao Yan, Zexin Feng.

Figure 1
Figure 1: Performance. (a) Visualization of long-term simulation compared with [72]. More demo videos can be found on the webpage. (b) Quantitative performance comparison on WOMD under close-loop short/long-term simulation. recent works [57, 72] have reframed long-term traffic simulation as a joint modeling problem that integrates scene generation [8, 15, 48, 58] (agent injection) and motion generation within a uni…
Figure 2
Figure 2: The token frequency distribution of traffic motion tokens on WOMD [39]. The distribution follows Zipf’s law [80], similar to natural language. (a) The frequency proportion of the language tokens [19]. (b) The frequency proportion of motion tokens. (c) The comparison between motion tokens and language tokens in log-log space. Recent advancements in traffic simulators [43, 62, 65, 70, 75] have increasingly e…
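
The Zipf-like shape named in the Figure 2 caption can be checked on any token stream with a rank-frequency fit in log-log space. The sketch below is a generic illustration, not the paper's procedure: `zipf_slope` is a hypothetical helper, and the token stream is synthetic, built so that the k-th most frequent token appears roughly 1000/k times. A slope near -1 is the classic Zipf signature.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two distinct tokens, otherwise the fit is undefined.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic stream (not WOMD data): token k occurs about 1000 / k times.
tokens = [k for k in range(1, 51) for _ in range(round(1000 / k))]
slope = zipf_slope(tokens)
```

Applying the same function to motion-token IDs and to word tokens from a text corpus would give two slopes that can be compared directly, which is one simple way to quantify the "distributional consistency" the paper appeals to.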
Figure 3
Figure 3: Architecture of RosettaSim. We formulate the long-term traffic simulation as a structured sequence generation problem. Parallel Motion Generation. The primary goal is to effectively leverage the capabilities of pretrained large sequence model fθ, e.g., LLMs. The overall architecture is illustrated in …
Figure 4
Figure 4: Overview of our proposed retrieval-based evaluation (RTE) framework for long-term traffic simulation. segments from the validation set based on the Wasserstein distance [61] between their latent representations: \text{similarity}(\mathbf{z}, \hat{\mathbf{z}}) = W_2(\mathbf{z}_{\text{object}}, \hat{\mathbf{z}}_{\text{object}}) + \lambda \cdot W_2(\mathbf{z}_{\text{map}}, \hat{\mathbf{z}}_{\text{map}}…
Figure 5
Figure 5: Ablation study on LLM’s prior. All parameters are tunable. Can all LLMs be adapted to traffic simulation? To investigate the impact of foundational architectures, we adapt four LLMs under a fixed single-epoch budget to evaluate their rapid adaptation capability. As shown in Tab. 3, modern LLMs significantly outperform older architectures like GPT-2, proving that modern LLMs are adaptable. Notably, Qwen2.5…
Figure 6
Figure 8
Figure 8: Correlations. Left: Pearson correlations comparison between RTE and log-based metrics against standard metrics. Right: Ablation of RTE’s parameters. How well does RTE reflect the quality of long-term traffic simulation? Ideally, metrics designed for long-term simulation should exhibit high correlation with standard short-term WOSAC metrics when evaluated on a single short-term window. To validate this, w…
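
The r values quoted for RTE (0.83) and for prior approaches (0.74) are Pearson correlations between a candidate metric and the standard metric. For readers who want the arithmetic, here is a minimal standard-library sketch; `pearson_r` is a hypothetical helper and the input lists are placeholder numbers, not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # A constant sequence has zero spread and no defined correlation.
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Placeholder scores for four imaginary simulators (illustrative only).
standard_metric = [1.0, 2.0, 3.0, 4.0]
candidate_metric = [1.0, 3.0, 2.0, 4.0]
r = pearson_r(standard_metric, candidate_metric)
```

With only a handful of models, r is sensitive to any single point, which is part of why the referee's question about how the correlation data were partitioned matters.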
Figure 9
Figure 9: Evaluation of the retrieval model’s faithfulness. Top: Evaluated with an equal number of scenarios. Bottom: Evaluated with an unequal number of scenarios. This section validates whether the retrieval model Eϕ is faithful enough to reliably retrieve the most semantically similar real-world scenarios, which is the foundational premise of our long-term metrics. If the retrieval model fails to find accurate ma…
Figure 10
Figure 10: Traffic density vs. Collision indication. We plot the relationship between traffic density (flow) and critical behavior metrics (e.g., collision likelihood) in …
Figure 11
Figure 11: Comparison between RTE placement metrics and log-based placement metrics across rollout windows for short-term specific baselines [65, 75]. Best viewed zoomed in.
Figure 12
Figure 12: Due to actively increasing traffic density via autoregressive spawning, …
Figure 12
Figure 12: Relative performance changes regarding Behavior Realism across progressive evaluation windows. Best viewed zoomed in. RosettaSim demonstrates superior temporal stability compared to baselines. F.7 Ablation on heuristic. As stated in Sec. D.3, we have three heuristic adaptations: (1) centerline snap, (2) overlap rejection, and (3) boundary removal. Disabling overlap rejection makes RMM-F1 drop from 0.7346 to…
Figure 13
Figure 13: Visualization of retrieved scenarios. The left column shows the query rollout segment, while the right column displays the top semantically retrieved reference anchors. G.2 Visualization of Long-term Simulation. We provide additional visualizations of long-term simulation rollouts in the following figures. Every two rows correspond to a distinct scenario, with the top and bottom halves comparing InfGen and Ro…
Figure 14
Figure 14: Visualization #1
Figure 15
Figure 15: Visualization #2
Figure 16
Figure 16: Visualization #3
Figure 17
Figure 17: Visualization #4
Original abstract

Interactive traffic simulation is a vital world model for autonomous driving. A central challenge in long-horizon simulation is modeling sustained multi-agent interactions, which is further exacerbated by dynamic token cardinality as agents continuously enter and exit the scene. In this work, we propose that the solution lies in the synergy between the architectural inductive biases and statistical priors of large-scale sequence models, e.g., Large Language Models (LLMs). Our probing experiments reveal that the transferability of attention mechanisms and the distributional consistency between motion tokens and natural language enable small-scale, heavily frozen LLMs to rapidly adapt to traffic modeling. Building on this insight, we introduce RosettaSim, a unified framework that projects scene topology, agent states, and spawning intents into a structured autoregressive stream with variable length, achieving both strong short-term accuracy and stable long-horizon simulation fidelity. Furthermore, evaluating extended rollouts presents yet another hurdle, as one-to-one agent correspondence inevitably fades over time. To address this, we introduce Retrieval-based Traffic Evaluation (RTE), which retrieves semantically similar real-world scenarios as context-aware reference anchors. Experiments on the Waymo Open Sim Agent Challenge (WOSAC) demonstrate that RosettaSim achieves state-of-the-art performance in both short- and long-term simulation. Furthermore, RTE exhibits a stronger correlation with standard metrics ($r=0.83$) than existing approaches ($r=0.74$), indicating improved alignment with long-horizon simulation fidelity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that attention transferability and distributional consistency between motion tokens and natural-language tokens allow small, heavily frozen LLMs to adapt rapidly to variable-cardinality traffic sequences. It introduces RosettaSim, which encodes scene topology, agent states, and spawning intents as a structured autoregressive stream, and reports SOTA results on both short- and long-horizon metrics of the Waymo Open Sim Agent Challenge. It further proposes Retrieval-based Traffic Evaluation (RTE) that retrieves semantically similar real-world scenarios as anchors and shows that RTE correlates more strongly (r=0.83) with standard metrics than prior approaches (r=0.74).

Significance. If the adaptation mechanism and the reported WOSAC numbers are substantiated, the work would provide evidence that pre-trained sequence models can be repurposed for sustained multi-agent simulation with limited architectural change, addressing a recognized bottleneck in long-horizon traffic world models. The RTE metric would also supply a concrete, reference-anchored alternative for evaluating rollouts where agent identity is lost.

major comments (2)
  1. [probing-experiments paragraph] Probing-experiments paragraph (abstract): the central claim that distributional consistency plus attention transferability suffices for rapid adaptation of heavily frozen small LLMs is load-bearing for the SOTA assertion, yet the supplied text contains no controls (freezing-ratio curves, from-scratch autoregressive baseline of identical size, or explicit frozen-parameter counts) that would confirm the adaptation occurs under the stated constraints.
  2. [RTE paragraph] Abstract, RTE paragraph: the reported correlation improvement (r=0.83 vs. r=0.74) is presented as evidence of better long-horizon fidelity, but without a description of how the retrieval anchors were selected or whether the correlation was computed on held-out data, it is impossible to rule out that the anchors were tuned on the same test distribution used for the WOSAC numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: Probing-experiments paragraph (abstract): the central claim that distributional consistency plus attention transferability suffices for rapid adaptation of heavily frozen small LLMs is load-bearing for the SOTA assertion, yet the supplied text contains no controls (freezing-ratio curves, from-scratch autoregressive baseline of identical size, or explicit frozen-parameter counts) that would confirm the adaptation occurs under the stated constraints.

    Authors: We agree that the current probing experiments lack the requested controls. In the revised manuscript we will add freezing-ratio curves, a from-scratch autoregressive baseline of identical size, and explicit frozen-parameter counts to directly substantiate that adaptation occurs under heavy freezing. revision: yes

  2. Referee: Abstract, RTE paragraph: the reported correlation improvement (r=0.83 vs. r=0.74) is presented as evidence of better long-horizon fidelity, but without a description of how the retrieval anchors were selected or whether the correlation was computed on held-out data, it is impossible to rule out that the anchors were tuned on the same test distribution used for the WOSAC numbers.

    Authors: We agree that the description of anchor selection and data partitioning is insufficient. We will revise the abstract and add explicit text stating that anchors are drawn from a held-out training subset and that the reported correlation is computed on the official test split, thereby eliminating any possibility of overlap with the WOSAC evaluation set. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation chain remains self-contained

Full rationale

The paper introduces RosettaSim as a projection of scene elements into a structured autoregressive stream and RTE as a retrieval-based metric, then reports empirical WOSAC results and an r=0.83 correlation. No equation, definition, or self-citation reduces a claimed prediction or uniqueness result to its own inputs by construction. The probing-experiment insight on attention transferability and token distributional consistency is presented as an empirical observation supporting the framework, not as a fitted parameter renamed as output. RTE correlation is reported as a measured property rather than a tautological consequence of its own anchors. The central claims therefore rest on external benchmark performance rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on unstated modeling assumptions about tokenization and LLM transfer that cannot be audited from the given text.

Reference graph

Works this paper leans on

83 extracted references · 15 canonical work pages · 7 internal anchors

  1. [1]

    In: CVPR Workshop on Autonomous Driving (WAD) (2025)

    Ahmadi, E., Schofield, H.: Rlftsim: Multi-agent traffic simulation via reinforcement learning fine-tuning. In: CVPR Workshop on Autonomous Driving (WAD) (2025)

  2. [2]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

  3. [3]

    Advances in neural information processing systems 33, 1877–1901 (2020)

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)

  4. [4]

    NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

    Caesar, H., Kabzan, J., Tan, K.S., Fong, W.K., Wolff, E.M., Lang, A.H., Fletcher, L., Beijbom, O., Omari, S.: nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810 (2021), https://arxiv.org/abs/2106.11810

  5. [5]

    Advances in neural information processing systems 35, 18878–18891 (2022)

    Chan, S., Santoro, A., Lampinen, A., Wang, J., Singh, A., Richemond, P., McClelland, J., Hill, F.: Data distributional properties drive emergent in-context learning in transformers. Advances in neural information processing systems 35, 18878–18891 (2022)

  6. [6]

    arXiv preprint arXiv:2510.18060 (2025)

    Chang, W.J., Rangesh, A., Joseph, K., Strong, M., Tomizuka, M., Hu, Y., Zhan, W.: Spacer: Self-play anchoring with centralized reference models. arXiv preprint arXiv:2510.18060 (2025)

  7. [7]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Chang, W.J., Zhan, W., Tomizuka, M., Chandraker, M., Pittaluga, F.: Langtraj: Diffusion model and dataset for language-conditioned trajectory simulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 26622–26631 (2025)

  8. [8]

    In: European Conference on Computer Vision

    Chitta, K., Dauner, D., Geiger, A.: Sledge: Synthesizing driving environments with generative models and rule-based traffic. In: European Conference on Computer Vision. pp. 57–74. Springer (2024)

  9. [9]

    In: International Conference on Machine Learning

    Cusumano-Towner, M., Hafner, D., Hertzberg, A., Huval, B., Petrenko, A., Vinitsky, E., Wijmans, E., Killian, T.W., Bowers, S., Sener, O., et al.: Robust autonomy emerges from self-play. In: International Conference on Machine Learning. pp. 11710–11737. PMLR (2025)

  10. [10]

    In: European Conference on Computer Vision

    Ding, W., Cao, Y., Zhao, D., Xiao, C., Pavone, M.: Realgen: Retrieval augmented generation for controllable traffic scenarios. In: European Conference on Computer Vision. pp. 93–110. Springer (2024)

  11. [11]

    arXiv preprint arXiv:2505.24808 (2025)

    Ding, W., Veer, S., Chen, Y., Cao, Y., Xiao, C., Pavone, M.: Realdrive: Retrieval-augmented driving with diffusion models. arXiv preprint arXiv:2505.24808 (2025)

  12. [12]

    Advances in Neural Information Processing Systems 35, 11763–11784 (2022)

    Dinh, T., Zeng, Y., Zhang, R., Lin, Z., Gira, M., Rajput, S., Sohn, J.y., Papailiopoulos, D., Lee, K.: Lift: Language-interfaced fine-tuning for non-language machine learning tasks. Advances in Neural Information Processing Systems 35, 11763–11784 (2022)

  13. [13]

    In: Conference on robot learning

    Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: Carla: An open urban driving simulator. In: Conference on robot learning. pp. 1–16. PMLR (2017)

  14. [14]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Ettinger, S., Cheng, S., Caine, B., Liu, C., Zhao, H., Pradhan, S., Chai, Y., Sapp, B., Qi, C.R., Zhou, Y., et al.: Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9710–9719 (2021)

  15. [15]

    In: 2023 IEEE International Conference on Robotics and Automation (ICRA)

    Feng, L., Li, Q., Peng, Z., Tan, S., Zhou, B.: Trafficgen: Learning to generate diverse and realistic traffic scenarios. In: 2023 IEEE International Conference on Robotics and Automation (ICRA). pp. 3567–3575 (2023)

  16. [16]

    Nature 615(7953), 620–627 (2023)

    Feng, S., Sun, H., Yan, X., Zhu, H., Zou, Z., Shen, S., Liu, H.X.: Dense reinforcement learning for safety validation of autonomous vehicles. Nature 615(7953), 620–627 (2023)

  17. [17]

    Nature Communications (2026)

    Feng, S., Zhu, H., Sun, H., Yan, X., He, L., Yang, J., Su, G., Li, B., Li, S., Wang, L., et al.: Breaking through safety performance stagnation in autonomous vehicles with dense learning. Nature Communications (2026)

  18. [18]

    In: CVPR Workshop on Simulation for Autonomous Driving (2026)

    Feng, Z., Xiao, L., Yan, X.: Beyond binary metrics: Unveiling the safety illusion in autonomous driving simulation. In: CVPR Workshop on Simulation for Autonomous Driving (2026)

  19. [19]

    Dept. of Linguistics, Brown University, Providence, R.I., revised and amplified 1979

    Francis, W.N.: Brown corpus manual: manual of information to accompany a standard corpus of present-day edited American English for use with digital computers. Dept. of Linguistics, Brown University, Providence, R.I., revised and amplified 1979 edn. (1979)

  20. [20]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

  21. [21]

    Advances in Neural Information Processing Systems 36 (2024)

    Gulino, C., Fu, J., Luo, W., Tucker, G., Bronstein, E., Lu, Y., Harb, J., Pan, X., Wang, Y., Chen, X., et al.: Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research. Advances in Neural Information Processing Systems 36 (2024)

  22. [22]

    arXiv preprint arXiv:2510.06913 (2025)

    Guo, K., Liu, H., Wu, X., Lv, C.: Decompgail: Learning realistic traffic behaviors with decomposed multi-agent generative adversarial imitation learning. arXiv preprint arXiv:2510.06913 (2025)

  23. [23]

    In: International Conference on Learning Representations

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. In: International Conference on Learning Representations

  24. [24]

    In: European Conference on Computer Vision

    Hu, Y., Chai, S., Yang, Z., Qian, J., Li, K., Shao, W., Zhang, H., Xu, W., Liu, Q.: Solving motion planning tasks with a scalable generative model. In: European Conference on Computer Vision. pp. 386–404. Springer (2024)

  25. [25]

    Transactions on Machine Learning Research

    Hwang, J.J., Xu, R., Lin, H., Hung, W.C., Ji, J., Choi, K., Huang, D., He, T., Covington, P., Sapp, B., et al.: Emma: End-to-end multimodal model for autonomous driving. Transactions on Machine Learning Research

  26. [26]

    In: International Conference on Machine Learning

    Janner, M., Du, Y., Tenenbaum, J., Levine, S.: Planning with diffusion for flexible behavior synthesis. In: International Conference on Machine Learning. pp. 9902–

  27. [27]

    Advances in Neural Information Processing Systems 37, 55729–55760 (2024)

    Jiang, C.M., Bai, Y., Cornman, A., Davis, C., Huang, X., Jeon, H., Kulshrestha, S., Lambert, J., Li, S., Zhou, X., et al.: Scenediffuser: Efficient and controllable driving simulation initialization and rollout. Advances in Neural Information Processing Systems 37, 55729–55760 (2024)

  28. [28]

    In: International Conference on Learning Representations

    Kazemkhani, S., Pandya, A., Cornelisse, D., Shacklett, B., Vinitsky, E.: Gpudrive: Data-driven, multi-agent driving simulation at 1 million fps. In: International Conference on Learning Representations. vol. 2025, pp. 19320–19336 (2025)

  29. [29]

    In: Conference on Robot Learning

    Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E.P., Sanketi, P.R., Vuong, Q., et al.: Openvla: An open-source vision-language-action model. In: Conference on Robot Learning. pp. 2679–2713. PMLR (2025)

  30. [30]

    Auto-Encoding Variational Bayes

    Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)

  31. [31]

    LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

    Krojer, B., Nayak, S., Mañas, O., Adlakha, V., Elliott, D., Reddy, S., Mosbach, M.: Latentlens: Revealing highly interpretable visual tokens in llms. arXiv preprint arXiv:2602.00462 (2026)

  32. [32]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)

    Lin, L., Lin, X., Xu, K., Lu, H., Huang, L., Xiong, R., Wang, Y.: Unimm: A unified mixture model framework for multi-agent simulation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)

  33. [33]

    Advances in neural information processing systems 36, 34892–34916 (2023)

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems 36, 34892–34916 (2023)

  34. [34]

    Authorea Preprints (2025)

    Liu, H., Cao, Z., Yan, X., Feng, S., Lu, Q.: Autonomous vehicles: A critical review (2004-2024) and a vision for the future. Authorea Preprints (2025)

  35. [35]

    In: 2024 IEEE International Conference on Robotics and Automation (ICRA)

    Lu, J., Wong, K., Zhang, C., Suo, S., Urtasun, R.: Scenecontrol: Diffusion for controllable traffic scene generation. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 16908–16914. IEEE (2024)

  36. [36]

    In: Proceedings of the AAAI conference on artificial intelligence

    Lu, K., Grover, A., Abbeel, P., Mordatch, I.: Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI conference on artificial intelligence. vol. 36, pp. 7628–7636 (2022)

  37. [37]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Mi, L., Zhao, H., Nash, C., Jin, X., Gao, J., Sun, C., Schmid, C., Shavit, N., Chai, Y., Anguelov, D.: Hdmapgen: A hierarchical graph generative model of high definition maps. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4227–4236 (2021)

  38. [38]

    In: Conference on Robot Learning

    Mirchandani, S., Xia, F., Florence, P., Ichter, B., Driess, D., Arenas, M.G., Rao, K., Sadigh, D., Zeng, A.: Large language models as general pattern machines. In: Conference on Robot Learning. pp. 2498–2518. PMLR (2023)

  39. [39]

    Advances in Neural Information Processing Systems 36, 59151–59171 (2023)

    Montali, N., Lambert, J., Mougin, P., Kuefler, A., Rhinehart, N., Li, M., Gulino, C., Emrich, T., Yang, Z., Whiteson, S., et al.: The waymo open sim agents challenge. Advances in Neural Information Processing Systems 36, 59151–59171 (2023)

  40. [40]

    In: The Twelfth International Conference on Learning Representations

    Pang, Z., Xie, Z., Man, Y., Wang, Y.X.: Frozen transformers in language models are effective visual encoder layers. In: The Twelfth International Conference on Learning Representations

  41. [41]

    arXiv preprint arXiv:2509.23993 (2025)

    Pei, M., Shi, S., Shen, S.: Advancing multi-agent traffic simulation via r1-style reinforcement fine-tuning. arXiv preprint arXiv:2509.23993 (2025)

  42. [42]

    In: The Fourteenth International Conference on Learning Representations (2026)

    Peng, Z., Liu, Y., Zhou, B.: Scenestreamer: Continuous scenario generation as next token group prediction. In: The Fourteenth International Conference on Learning Representations (2026)

  43. [43]

    In: The Twelfth International Conference on Learning Representations

    Philion, J., Peng, X.B., Fidler, S.: Trajeglish: Traffic modeling as next-token prediction. In: The Twelfth International Conference on Learning Representations

  44. [44]

    arXiv preprint arXiv:2306.15914 (2023)

    Qian, C., Xiu, D., Tian, M.: The 2nd place solution for 2023 waymo open sim agents challenge. arXiv preprint arXiv:2306.15914 (2023)

  45. [45]

    Renz, K., Chen, L., Arani, E., Sinavski, O.: Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 11993–12003 (2025)

  46. [46]

    Rossert, C., Drever, J., Brostek, L.: combot: an ensemble combination model combining results from smart-tiny-clsft with a cognitive behavior mode. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Workshop on Autonomous Driving (WAD) (2025), https://storage.googleapis.com/waymo-uploads/files/resea...

  47. [47]

    In: Conference on Robot Learning

    Rowe, L., Girgis, R., Gosselin, A., Carrez, B., Golemo, F., Heide, F., Paull, L., Pal, C.: Ctrl-sim: Reactive and controllable driving agents with offline reinforcement learning. In: Conference on Robot Learning. pp. 3600–3621. PMLR (2025)

  48. [48]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Rowe, L., Girgis, R., Gosselin, A., Paull, L., Pal, C., Heide, F.: Scenario dreamer: Vectorized latent diffusion for generating driving simulation environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17207–17218 (2025)

  49. [49]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Seff, A., Cera, B., Chen, D., Ng, M., Zhou, A., Nayakanti, N., Refaat, K.S., Al-Rfou, R., Sapp, B.: Motionlm: Multi-agent motion forecasting as language modeling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8579–8590 (2023)

  50. [50]

    In: International Conference on Machine Learning

    Shen, J., Li, L., Dery, L.M., Staten, C., Khodak, M., Neubig, G., Talwalkar, A.: Cross-modal fine-tuning: Align then refine. In: International Conference on Machine Learning. pp. 31030–31056. PMLR (2023)

  51. [51]

    Advances in Neural Information Processing Systems (2022)

    Shi, S., Jiang, L., Dai, D., Schiele, B.: Motion transformer with global intention localization and local movement refinement. Advances in Neural Information Processing Systems (2022)

  52. [52]

    IEEE Transactions on Pattern Analysis and Machine Intelligence 46(5), 3955–3971 (2024)

    Shi, S., Jiang, L., Dai, D., Schiele, B.: Mtr++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(5), 3955–3971 (2024)

  53. [53]

    IEEE Robotics and Automation Letters 9(8), 7007–7014 (2024)

    Sun, S., Gu, Z., Sun, T., Sun, J., Yuan, C., Han, Y., Li, D., Ang, M.H.: Drivescenegen: Generating diverse and realistic driving scenarios from scratch. IEEE Robotics and Automation Letters 9(8), 7007–7014 (2024)

  54. [54]

    Suo, S., Regalado, S., Casas, S., Urtasun, R.: Trafficsim: Learning to simulate realistic multi-agent behaviors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10400–10409 (2021)

  55. [55]

    Tan, S., Ivanovic, B., Chen, Y., Li, B., Weng, X., Cao, Y., Krähenbühl, P., Pavone, M.: Promptable closed-loop traffic simulation. In: 8th Annual Conference on Robot Learning (CoRL) (2024)

  56. [56]

    Tan, S., Ivanovic, B., Weng, X., Pavone, M., Kraehenbuehl, P.: Language conditioned traffic generation. In: 7th Annual Conference on Robot Learning (CoRL) (2023), https://openreview.net/forum?id=PK2debCKaG

  57. [57]

    Tan, S., Lambert, J., Jeon, H., Kulshrestha, S., Bai, Y., Luo, J., Anguelov, D., Tan, M., Jiang, C.M.: Scenediffuser++: City-scale traffic simulation via a generative world model. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR). pp. 1570–1580 (June 2025)

  58. [58]

    Tan, S., Wong, K., Wang, S., Manivasagam, S., Ren, M., Urtasun, R.: Scenegen: Learning to generate realistic traffic scenes. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2021)

  59. [59]

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  60. [60]

    Tsimpoukelli, M., Menick, J.L., Cabi, S., Eslami, S., Vinyals, O., Hill, F.: Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34, 200–212 (2021)

  61. [61]

    Villani, C.: The wasserstein distances. In: Optimal transport: old and new, pp. 93–111. Springer (2009)

  62. [62]

    Wang, M., Wang, J., Ye, T., Yu, K.: Do llm modules generalize? a study on motion generation for autonomous driving. In: Conference on Robot Learning. pp. 4657–. PMLR (2025)

  64. [64]

    Wang, S., Xu, J., Zhang, X., Hu, F., Huang, Z., Luo, J., Zhu, K., Zhu, J., Zhou, Y., Chen, Z.: Improving tokenization of agents and maps with transformers for multi-agent simulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Workshop on Autonomous Driving (WAD) (2025)

  65. [65]

    Wang, Y., Zhao, T., Yi, F.: Multiverse transformer: 1st place solution for waymo open sim agents challenge 2023. arXiv preprint arXiv:2306.11868 (2023)

  66. [66]

    Wu, W., Feng, X., Gao, Z., KAN, Y.: Smart: Scalable multi-agent real-time motion generation via next-token prediction. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems. vol. 37, pp. 114048–114071. Curran Associates, Inc. (2024), https://proceedings.neurips....

  67. [67]

    Xiao, L., Liu, J.J., Yang, S., Li, X., Ye, X., Yang, W., Wang, J.: Learning multiple probabilistic decisions from latent world model in autonomous driving. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 1279–

  68. [68]

    Xiao, L., Liu, J.J., Ye, X., Yang, W., Wang, J.: Easychauffeur: A baseline advancing simplicity and efficiency on waymax. arXiv preprint arXiv:2408.16375 (2024)

  69. [69]

    Xu, Z., Zhang, Y., Xie, E., Zhao, Z., Guo, Y., Wong, K.Y.K., Li, Z., Zhao, H.: Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters 9(10), 8186–8193 (2024)

  70. [70]

    Yan, X., Feng, S., Sun, H., Liu, H.X.: Distributionally consistent simulation of naturalistic driving environment for autonomous vehicle testing. IEEE Transactions on Intelligent Transportation Systems 26(7), 9187–9200 (2025). https://doi.org/10.1109/TITS.2025.3571966

  71. [71]

    Yan, X., Zou, Z., Feng, S., Zhu, H., Sun, H., Liu, H.X.: Learning naturalistic driving environment with statistical realism. Nature communications 14(1), 2037 (2023)

  72. [72]

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  73. [73]

    Yang, X., Tan, S., Krähenbühl, P.: Long-term traffic simulation with interleaved autoregressive motion and scenario generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 25305–25314 (2025)

  74. [74]

    Zhang, P., Zeng, G., Wang, T., Lu, W.: Tinyllama: An open-source small language model (2024)

  75. [75]

    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)

  76. [76]

    Zhang, Z., Karkus, P., Igl, M., Ding, W., Chen, Y., Ivanovic, B., Pavone, M.: Closed-loop supervised fine-tuning of tokenized traffic models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

  77. [77]

    Zhang, Z., Jia, X., Chen, G., Li, Q., Wu, Z., Jiang, Y.G., Yan, J.: Trajtok: What makes for a good trajectory tokenizer in behavior generation? In: The Fourteenth International Conference on Learning Representations

  78. [78]

    Zhao, J., Zhuang, J., Zhou, Q., Ban, T., Xu, Z., Zhou, H., Wang, J., Wang, G., Li, Z., Li, B.: Kigras: Kinematic-driven generative model for realistic agent simulation. IEEE Robotics and Automation Letters 10(2), 1082–1089 (2024)

  79. [79]

    Zhong, Z., Rempe, D., Xu, D., Chen, Y., Veer, S., Che, T., Ray, B., Pavone, M.: Guided conditional diffusion for controllable traffic simulation. In: 2023 IEEE International Conference on Robotics and Automation (ICRA). pp. 3560–3566. IEEE (2023)

  80. [80]

    Zhou, Z., Hu, H., Chen, X., Wang, J., Guan, N., Wu, K., Li, Y.H., Huang, Y.K., Xue, C.J.: Behaviorgpt: Smart agent simulation for autonomous driving with next-patch prediction. Advances in Neural Information Processing Systems 37, 79597–79617 (2024)
