Bridging Local Observation and Global Simulation in Closed-Loop Traffic Modeling

Peng Chen; Tan Xiang; Xintao Yan; Ziyan Wang

arxiv: 2606.31844 · v1 · pith:RLCF5FJTnew · submitted 2026-06-30 · 💻 cs.RO · cs.AI

Ziyan Wang, Tan Xiang, Peng Chen, Xintao Yan

Pith reviewed 2026-07-01 05:06 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords traffic simulation, closed-loop modeling, contextual preference alignment, autoregressive simulators, autonomous driving, preference learning, what-if rollouts, test-time alignment

The pith

A plug-in evaluator scores simulator actions under full scene context, reducing collisions by 31.2 percent without retraining the base simulator.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive traffic simulators trained on ego-centric logs learn incomplete context-action mappings because surrounding agents are only partially observed. When these models run in closed-loop settings with complete global observations, the mismatch produces unrealistic stops, unsafe interactions, and rule violations. CRAFT generates diverse what-if rollouts from logged starting states inside the base simulator to surface such failures, grounds the failures in human-aligned driving priors, and converts them into preference data that trains a Contextual Preference Evaluator. At test time the evaluator reweights candidate actions toward globally coherent choices. The result is measured improvement in closed-loop realism achieved solely by test-time alignment.

Core claim

CRAFT treats the base simulator as a globally observable sandbox that produces self-supervised what-if rollouts from logged initial states; these rollouts expose context-induced failures that are then grounded with human-aligned driving priors and turned into preference supervision for a Contextual Preference Evaluator. The evaluator scores candidate actions under complete scene context and reweights autoregressive decoding at inference time, yielding a 31.2 percent reduction in collisions and a 33.2 percent reduction in traffic violations while leaving the original simulator weights unchanged.
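The reweighting step described above can be sketched as follows. This is a minimal illustration, assuming the evaluator returns one score in (0, 1] per candidate token and that the posterior is the base probability multiplied by that score raised to an exponent `beta`. The function names, `beta`, and the product form are assumptions made for illustration; the page does not state the paper's exact formula.

```python
import math
import random

def reweight_candidates(base_logits, evaluator_scores, beta=1.0):
    """Combine base-simulator logits with evaluator scores into a posterior.

    Assumed form (illustrative, not taken from the paper):
        posterior_k ∝ softmax(base_logits)_k * evaluator_scores_k ** beta
    """
    if len(base_logits) != len(evaluator_scores):
        raise ValueError("one evaluator score is needed per candidate")
    eps = 1e-12  # guards against log(0) when a score is exactly zero
    # Work in log space: log softmax differs from the logit by a constant.
    log_post = [
        logit + beta * math.log(max(score, eps))
        for logit, score in zip(base_logits, evaluator_scores)
    ]
    m = max(log_post)
    weights = [math.exp(v - m) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]

def sample_action(posterior, rng=random):
    """Draw one candidate index from the reweighted distribution."""
    return rng.choices(range(len(posterior)), weights=posterior, k=1)[0]
```

Setting `beta=0` recovers the base simulator's own distribution, which makes it easy to check that the evaluator is the only source of any behavioral change.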

What carries the argument

The Contextual Preference Evaluator, a plug-in module trained on preference pairs derived from self-supervised failure rollouts, that scores and reweights actions according to complete global scene context.
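One way to derive preference pairs from failure rollouts is sketched below. It assumes each rollout has already been annotated with the rule checks that fired, and it pairs every clean rollout with every failing rollout that started from the same logged scene. The `Rollout` fields and the all-pairs strategy are hypothetical choices for illustration, not the paper's documented procedure.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Rollout:
    scene_id: str      # logged initial scene the rollout started from
    rollout_id: int
    violations: tuple  # names of rule checks that fired; empty if clean

def build_preference_pairs(rollouts):
    """Pair every clean rollout with every failing rollout of the same scene."""
    by_scene = {}
    for r in rollouts:
        by_scene.setdefault(r.scene_id, []).append(r)
    pairs = []
    for group in by_scene.values():
        clean = [r for r in group if not r.violations]
        failing = [r for r in group if r.violations]
        # Each tuple is (preferred, rejected).
        pairs.extend(product(clean, failing))
    return pairs
```

Scenes whose rollouts are all clean or all failing contribute no pairs under this scheme, so the amount of supervision depends on how diverse the what-if rollouts are.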

If this is right

Existing autoregressive simulators can be deployed in fully observable environments with fewer unsafe behaviors.
Alignment occurs at inference time, so no additional training of the base model is required.
Preference supervision derived from internal rollouts substitutes for large-scale human labeling of global scenes.
The same failure-discovery loop can be repeated on new initial states to adapt the evaluator without changing the simulator.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could be tested on other autoregressive sequence models that face partial-observation training versus full-observation deployment mismatches.
If the evaluator generalizes across different base simulators, it would reduce the need to collect globally observable training data for each new model.
One could measure whether the preference pairs also improve long-horizon stability metrics that the paper does not report.

Load-bearing premise

Self-supervised what-if rollouts from logged initial states will reliably expose the precise context-induced failures that human-aligned priors can convert into effective preference supervision.

What would settle it

Run the base simulator and the CRAFT-augmented version on the same closed-loop test set; if the collision and violation rates show no statistically significant drop or an increase, the alignment mechanism has not mitigated the mismatch.
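The significance check described above could be approximated with a two-proportion z-test, sketched here using only the standard library. The function name and the counts passed to it are hypothetical. The test treats the two sets of rollouts as independent samples; because both simulators would run on the same scenes, a paired test such as McNemar's may be the more appropriate choice.

```python
import math

def two_proportion_z_test(fail_a, n_a, fail_b, n_b):
    """Two-sided z-test for a difference between two failure rates.

    fail_a / n_a: failures and rollouts for the base simulator.
    fail_b / n_b: failures and rollouts for the aligned simulator.
    Returns (z, p_value).
    """
    p_a, p_b = fail_a / n_a, fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value
```

A positive `z` with a small `p_value` would indicate that the aligned simulator fails less often than the base simulator on the chosen metric.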

Figures

Figures reproduced from arXiv: 2606.31844 by Peng Chen, Tan Xiang, Xintao Yan, Ziyan Wang.

**Figure 1.** Motivation of CRAFT. Incomplete ego-centric logs can induce flawed context–action mappings, leading to abnormal behaviors in global autoregressive rollouts. CRAFT mitigates this mismatch by learning complete-context preferences and guiding simulation toward rational behaviors.

**Figure 2.** The training pipeline of CRAFT. Starting from logged initial scenes, a frozen autoregressive simulator generates complete-context what-if rollouts, which are annotated by rule-based evaluators and organized into grouped preference datasets. The CPE is trained on grouped rollouts to predict dense preference scores with token-level, intra-trajectory, and inter-rollout supervision.

**Figure 3.** The inference pipeline of CRAFT. During inference, CPE evaluates behaviors via a lookahead search algorithm, computing the posterior probabilities before sampling.

CPE is trained with absolute token-level supervision and relative preference constraints. We formulate the weighted Binary Cross Entropy loss [25] as

$$\mathcal{L}_{\mathrm{bce}} = -\frac{1}{N_a T_f} \sum_{i=1}^{N_a} \sum_{t=1}^{T+T_f} \left[ y_{i,t} \log R_{i,t} + (1 - y_{i,t}) \log\left(1 - R_{i,t}\right) \right] \tag{9}$$

**Figure 4.** Qualitative comparison between CAT-K and CRAFT. (a) CAT-K violates the traffic signal, while CRAFT reliably complies with the traffic rule. (b) CAT-K exhibits abnormal stopping and disrupts traffic flow, whereas CRAFT drives smoothly and maintains flow stability.

**Figure 5.** Abnormal agent warning rate of CPE over different warning lead times.

**Figure 7.** Results are reported as relative improvement over the CAT-K baseline under different

**Figure 8.** Contextual Preference Evaluator architecture.

**Figure 9.** Case study illustrating the limitation of WOSAC-style trajectory matching. In the driving log, the leading vehicle decelerates and disappears after approximately 4 seconds, leaving only a partial observation of its original intent. CAT-K closely imitates this pattern and eventually stops in the middle of the road, even though no leading vehicle exists in the simulated scene. In contrast, CRAFT uses the com…

**Figure 10.** Visualization of partial negative-sample scenarios in CRAFT15K. The scenes are generated by our grouped complete-context what-if simulation pipeline. Vehicles with anomalous behaviors are color-coded as green (collision), dark blue (offroad), purple (abnormal stopping), red (traffic violation), black (front–rear jitter), and yellow (wrong-way driving). Blue indicates normal vehicles. The bar on the right …

**Figure 11.** Distribution of behavioral anomalies in the CRAFT15K dataset. The chart visualizes the composition of anomalous tokens across a training set (17.1M tokens, 13.42% negative) and a testing set (1.1M tokens, 13.19% negative). The highly consistent distribution across both splits ensures reliable and unbiased evaluation for our predefined violation criteria.

**Figure 12.** Qualitative evaluation of CAT-K generated trajectories using CRAFT. (a) CRAFT accurately penalizes red-light infractions and unnatural lateral drifts. (b) The model consistently rewards stable car-following behaviors while assigning low scores to hazardous spatial conflicts. Scenario 1: Rule violation. This scenario highlights the model's sensitivity to rule grounding. At T=5s, the orange agent commits a …

**Figure 13.** Additional qualitative comparisons between CRAFT and CAT-K. (a) Given an initial anomalous state, CRAFT gradually corrects the trajectory back to the lane center, whereas the baseline destabilizes into the oncoming lane. (b) CRAFT maintains strict lane adherence during a large-angle maneuver, successfully avoiding the offroad and wrong-way violations frequently triggered by the baseline's token selection …

**Figure 14.** Failure case due to candidate set limitations. (Left) Qualitative visualization at a T-junction, where both CAT-K and CRAFT exhibit anomalous behaviors. (Right) The velocity-time profile of the red circle vehicle showing an unnatural deceleration. (Bottom) The token probability distribution from the base simulator at T=4s, indicating that the optimal turning tokens are absent from the Top-32 candidate poo…
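The binary cross-entropy loss quoted in the Figure 3 excerpt can be illustrated with a small pure-Python sketch. It assumes `labels[i][t]` is 1 for a valid token and 0 for an anomalous one, and that `scores[i][t]` is the evaluator output R for agent i at step t. The per-term weighting that the excerpt calls "weighted" is not visible on this page, so the sketch uses a plain mean; the function name and the clipping constant are illustrative.

```python
import math

def bce_loss(labels, scores, eps=1e-7):
    """Mean binary cross-entropy over agents and time steps.

    labels[i][t]: 1 for a valid token, 0 otherwise (assumed convention).
    scores[i][t]: evaluator output R_{i,t}, expected to lie in [0, 1].
    """
    total, count = 0.0, 0
    for y_row, r_row in zip(labels, scores):
        for y, r in zip(y_row, r_row):
            r = min(max(r, eps), 1.0 - eps)  # clip to avoid log(0)
            total += y * math.log(r) + (1 - y) * math.log(1.0 - r)
            count += 1
    return -total / count
```

With roughly 13% negative tokens reported for CRAFT15K, a class-dependent weight on the two terms would be a natural extension, though the actual weights are an open detail here.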

Original abstract

A local-to-global context mismatch arises when autoregressive traffic simulators trained on ego-centric driving logs are deployed in globally observable closed-loop environments. In such logs, the ego vehicle has rich local observations, while surrounding agents are only partially observed due to perception limits and occlusions. As a result, simulators may learn incomplete context–action mappings that remain hidden in log-based training but emerge during closed-loop rollouts, leading to unrealistic behaviors such as abnormal stops, unsafe interactions, and rule violations. We propose CRAFT, a Contextual pReference Alignment Framework for Traffic Simulation, to mitigate this mismatch via self-supervised failure discovery and preference-guided test-time alignment. CRAFT treats the base simulator as a globally observable sandbox, generating diverse what-if rollouts from logged initial states to expose context-induced failures. These failures are grounded with human-aligned driving priors and converted into preference supervision for training a Contextual Preference Evaluator (CPE). At inference time, CPE acts as a plug-in alignment module that scores candidate actions under complete scene context and reweights autoregressive decoding toward globally coherent behaviors. CRAFT mitigates this local-to-global contextual bias, reducing collisions by 31.2% and traffic violations by 33.2% without retraining the base simulator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CRAFT targets a real mismatch in traffic simulators with test-time preference reweighting, but the abstract leaves the experimental claims hard to assess.

The paper's core idea is that autoregressive traffic models trained on ego-centric logs develop incomplete context-action mappings that only show up in closed-loop global simulation, and CRAFT tries to correct this at inference time. It generates diverse what-if rollouts from logged starts inside the global sandbox, turns the resulting failures into preference pairs using human driving priors, trains a Contextual Preference Evaluator on those pairs, and uses the evaluator to reweight candidate actions during decoding. The reported outcome is a 31.2% drop in collisions and 33.2% drop in violations without touching the base simulator weights.

What stands out is the practical framing: the method treats the simulator as a sandbox for self-supervised discovery rather than requiring new training data or full retraining. That matches a deployment pain point in AV testing pipelines where log-trained models produce unrealistic stops or violations once they have full scene context.

The main weakness is that the abstract supplies almost no information on how the rollouts are made diverse, how the preference pairs are constructed or validated, which baselines are compared, or how the percentage reductions were measured. Without those details it is difficult to tell whether the gains come from the proposed alignment or from other factors in the evaluation setup. The concern raised under the load-bearing premise, that rollouts may fail to isolate the exact local-to-global failures, applies directly to the description given: nothing in the abstract shows that the discovered failures are systematically tied to context mismatch rather than simulator noise.

This is aimed at people building or using closed-loop traffic simulators for autonomous driving research. A reader already working on preference alignment or test-time adaptation in simulation would see the most direct value. The work deserves a serious referee because the problem is concrete and the proposed fix is straightforward to implement and test, even though the current write-up needs the methods and results sections expanded before the claims can be properly evaluated.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes CRAFT (Contextual pReference Alignment Framework for Traffic Simulation) to address the local-to-global context mismatch in autoregressive traffic simulators. These simulators are trained on ego-centric logs with rich local observations but deployed in globally observable closed-loop settings, leading to incomplete context-action mappings that cause unrealistic behaviors. CRAFT generates self-supervised what-if rollouts from logged initial states inside a global sandbox to expose failures, grounds them using human-aligned driving priors to create preference pairs, trains a Contextual Preference Evaluator (CPE), and deploys CPE as a plug-in module to reweight autoregressive decoding toward globally coherent actions at test time. The central empirical claim is a 31.2% reduction in collisions and 33.2% reduction in traffic violations without retraining the base simulator.

Significance. If the quantitative claims are supported by detailed, reproducible experiments with appropriate controls, this approach could meaningfully advance closed-loop traffic simulation by providing a test-time alignment method that avoids costly retraining. The self-supervised failure discovery combined with preference-based reweighting is a potentially generalizable idea for mitigating distribution shifts in simulation. However, the absence of any experimental protocol, dataset description, baseline comparisons, or statistical analysis in the abstract makes it impossible to evaluate whether the reported gains are robust or attributable to the proposed components.

major comments (2)

[Abstract] The central claim of 31.2% collision and 33.2% violation reductions is presented with no information on experimental setup, datasets used, baselines, number of rollouts, statistical tests, or how the percentages were computed. This information is load-bearing for verifying that the gains arise from CPE reweighting rather than simulator artifacts or evaluation choices.
[Method (implied by abstract description of rollout generation and CPE training)] The manuscript does not provide evidence that the self-supervised what-if rollouts from logged initial states systematically surface context-induced failures (incomplete context-action mappings) rather than random noise or simulator-specific artifacts. Without an ablation or diagnostic that isolates this distinction, the preference supervision signal for CPE may be misaligned, undermining the reported improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate the revisions we will incorporate.

Referee: [Abstract] The central claim of 31.2% collision and 33.2% violation reductions is presented with no information on experimental setup, datasets used, baselines, number of rollouts, statistical tests, or how the percentages were computed. This information is load-bearing for verifying that the gains arise from CPE reweighting rather than simulator artifacts or evaluation choices.

Authors: We agree that the abstract's brevity leaves the central claims difficult to evaluate in isolation. The full manuscript details the experimental protocol in its experiments section (including the traffic dataset, base simulator, number of closed-loop rollouts, baseline comparisons, and relative reduction formula). To address the concern directly, we will revise the abstract to include a concise statement of the setup and evaluation (e.g., "evaluated over 500 closed-loop rollouts on logged urban scenes against standard autoregressive baselines, with percentages computed as relative reductions versus the unaligned simulator").

Revision: yes

Referee: [Method (implied by abstract description of rollout generation and CPE training)] The manuscript does not provide evidence that the self-supervised what-if rollouts from logged initial states systematically surface context-induced failures (incomplete context-action mappings) rather than random noise or simulator-specific artifacts. Without an ablation or diagnostic that isolates this distinction, the preference supervision signal for CPE may be misaligned, undermining the reported improvements.

Authors: The reported 31.2% and 33.2% gains are measured against the base simulator without CPE, providing indirect support that the preference signal from the what-if rollouts is effective. However, we acknowledge the value of a direct diagnostic. We will add an ablation that (i) compares CPE trained on context-induced failure pairs versus randomly sampled or noise-augmented pairs and (ii) reports the resulting difference in downstream collision/violation rates. This will isolate whether the rollouts systematically expose the targeted local-to-global mismatches.

Revision: yes
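The "relative reduction versus the unaligned simulator" mentioned in the rebuttal is simple arithmetic, sketched here for concreteness. The function name is illustrative, and the absolute rates behind the reported 31.2% and 33.2% figures are not given on this page.

```python
def relative_reduction(baseline_rate, aligned_rate):
    """Relative reduction of a failure rate versus the unaligned baseline.

    Returns a fraction: 0.312 corresponds to a 31.2% reduction.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline rate must be positive")
    return (baseline_rate - aligned_rate) / baseline_rate
```

As a purely hypothetical example, a baseline collision rate of 5.00% that falls to 3.44% gives `relative_reduction(0.05, 0.0344)` of about 0.312. The same relative figure can arise from very different absolute rates, which is why the referee asks for the underlying counts.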

Circularity Check

0 steps flagged

No circularity: empirical framework with no equations or self-referential derivations

The paper presents CRAFT as a procedural framework involving self-supervised what-if rollouts, grounding with priors, CPE training, and test-time reweighting. The reviewed abstract describes no equations, derivations, or parameter-fitting steps that would reduce the reported collision and violation reductions to quantities defined by construction from the inputs. The central claims rest on empirical outcomes rather than on any self-definitional, fitted-prediction, or self-citation chain. The abstract contains no citations at all, let alone load-bearing self-citations. This is a standard non-finding for a methods paper whose gains are not mathematically forced.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete free parameters, axioms, or invented entities; the method description mentions human-aligned driving priors and a Contextual Preference Evaluator but gives no further specification.

Reference graph

Works this paper leans on

44 extracted references · 13 canonical work pages · 7 internal anchors

[1] A. Bordes, N. Usunier, A. Garcia-Duran, et al. Translating embeddings for modeling multi-relational data. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), volume 26, 2013.
[2] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
[3] H. Caesar, J. Kabzan, K. S. Tan, W. K. Fong, E. Wolff, A. Lang, L. Fletcher, O. Beijbom, and S. Omari. NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021.
[4] F. Codevilla, E. Santana, A. M. López, and A. Gaidon. Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9329–9338, 2019.
[5] J. Cui, F. Liang, G. Yang, C. Tang, and J. Cui. Safer: Safety-critical scenario generation for autonomous driving test via feasibility-constrained token resampling. arXiv preprint arXiv:2603.04071, 2026.
[6] D. Dauner, M. Hallgarten, A. Geiger, and K. Chitta. Parting with misconceptions about learning-based vehicle motion planning. In Conference on Robot Learning, pages 1268–1281. PMLR, 2023.
[7] P. De Haan, D. Jayaraman, and S. Levine. Causal confusion in imitation learning. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), volume 32, 2019.
[8] S. Ettinger, S. Cheng, B. Caine, et al. Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9710–9719, 2021.
[9] A. Fan, M. Lewis, and Y. Dauphin. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, 2018.
[10] S. Feng, H. Sun, X. Yan, et al. Dense reinforcement learning for safety validation of autonomous vehicles. Nature, 615(7953):620–627, 2023.
[11] H. Gao, S. Chen, Y. Zhu, Y. Song, W. Liu, Q. Zhang, and X. Wang. Rad-2: Scaling reinforcement learning in a generator-discriminator framework. arXiv preprint arXiv:2604.15308, 2026.
[12] R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.
[13] Y. Hu, S. Chai, Z. Yang, et al. Solving motion planning tasks with a scalable generative model. In Proceedings of the European Conference on Computer Vision (ECCV), pages 386–404, 2024.
[14] Z. Huang, X. Weng, M. Igl, et al. Gen-drive: Enhancing diffusion generative driving policies with reward modeling and reinforcement learning fine-tuning. In Proceedings of 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3445–3451, 2025.
[15] Z. Huang, Z. Zhang, A. Vaidya, et al. Versatile behavior diffusion for generalized traffic agent simulation. arXiv preprint arXiv:2404.02524, 2024.
[16] L. Lin, X. Lin, K. Xu, et al. Revisit mixture models for multi-agent simulation: Experimental study within a unified framework. arXiv preprint arXiv:2501.17015, 2025.
[17] J. Lu, K. Wong, C. Zhang, et al. Scenecontrol: Diffusion for controllable traffic scene generation. In Proceedings of 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16908–16914, 2024.
[18] S. Meng and C. Ai. Infrastructure-centric world models: Bridging temporal depth and spatial breadth for roadside perception. arXiv preprint arXiv:2604.17651, 2026.
[19] N. Montali, J. Lambert, P. Mougin, et al. The waymo open sim agents challenge. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), volume 36, pages 59151–59171, 2023.
[20] L. Nguyen, M. Fauth, and B. a. a. Jaeger. Lead: Minimizing learner-expert asymmetry in end-to-end driving. arXiv preprint arXiv:2512.20563, 2025.
[21] M. Pei, S. Shi, and S. Shen. Advancing multi-agent traffic simulation via r1-style reinforcement fine-tuning. arXiv preprint arXiv:2509.23993, 2025.
[22] J. Philion, X. B. Peng, and S. Fidler. Trajeglish: Traffic modeling as next-token prediction. arXiv preprint arXiv:2312.04535, 2023.
[23] L. Rowe, R. Girgis, A. Gosselin, et al. Ctrl-sim: Reactive and controllable driving agents with offline reinforcement learning. In Proceedings of the Conference on Robot Learning (CoRL), pages 3600–3621, 2025.
[24] A. Seff, B. Cera, D. Chen, et al. Motionlm: Multi-agent motion forecasting as language modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8579–8590, 2023.
[25] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
[26] C. Si, D. Friedman, N. Joshi, S. Feng, D. Chen, and H. He. Measuring inductive biases of in-context learning with underspecified demonstrations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11289–11310, 2023.
[27] S. Suo, S. Regalado, S. Casas, and R. Urtasun. Trafficsim: Learning to simulate realistic multi-agent behaviors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10400–10409, 2021.
[28] E. Vinitsky, N. Lichtlé, X. Yang, et al. Nocturne: A scalable driving benchmark for bringing multi-agent learning one step closer to the real world. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), volume 35, pages 3962–3974, 2022.
[29] M. Wang, J. Wang, T. Ye, and K. Yu. Do llm modules generalize? A study on motion generation for autonomous driving. In Proceedings of the Conference on Robot Learning (CoRL), pages 4657–4683, 2025.
[30] Y. Wang, T. Zhao, and F. Yi. Multiverse transformer: 1st place solution for waymo open sim agents challenge 2023. arXiv preprint arXiv:2306.11868, 2023.
[31] Z. Wang, P. Chen, D. Li, C. Li, Q. Zhang, Z. Xia, and G. Yu. Learning rollout from sampling: An r1-style tokenized traffic simulation model. IEEE Robotics and Automation Letters, 11(5):6336–6343, 2026.
[32] B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493, 2023.
[33] W. Wu, X. Feng, Z. Gao, and Y. Kan. Smart: Scalable multi-agent real-time motion generation via next-token prediction. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), volume 37, pages 114048–114071, 2024.
[34] X. Yan, Z. Zou, S. Feng, et al. Learning naturalistic driving environment with statistical realism. Nature Communications, 14(1):2037, 2023.
[35] Z. Zhang, P. Karkus, M. Igl, et al. Closed-loop supervised fine-tuning of tokenized traffic models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5422–5432, 2025.
[36] Z. Zhang, A. Liniger, D. Dai, et al. Trafficbots: Towards world models for autonomous driving simulation and motion prediction. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 1522–1529, 2023.
[37] S. Z. Zhao, L. Wang, H. Ruan, Y. Bao, Y. Chen, Z. Leng, A. Ravichandran, H. He, Z. Zhou, X. Han, et al. Bridgesim: Unveiling the ol-cl gap in end-to-end autonomous driving. arXiv preprint arXiv:2604.10856, 2026.
[38] Z. Zhong, D. Rempe, D. Xu, et al. Guided conditional diffusion for controllable traffic simulation. In Proceedings of 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 3560–3566. IEEE, 2023.
[39] Y. Zhou, N. Ye, W. Ljungbergh, et al. Decoupled diffusion sparks adaptive scene generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 27760–27770, 2025.
[40] Z. Zhou, H. Haibo, X. Chen, J. Wang, N. Guan, K. Wu, Y.-H. Li, Y.-K. Huang, and C. J. Xue. Behaviorgpt: Smart agent simulation for autonomous driving with next-patch prediction. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), volume 37, pages 79597–79617, 2024.
[41] Z. Zhou, J. Wang, Y.-H. Li, and Y.-K. Huang. Query-centric trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17863–17873, 2023.

In this supplementary material, we provide additional details in a question-driven manner to support the methodology and experiments discussed in the ma...

[42] Safety • Collision (Vcol): We check whether an agent in the scene exhibits any explicit collision or hazardous spatial conflict with other agents over the entire time horizon. A time step is included in Vcol if no such violation is detected.
[43] Compliance • Abnormal stopping (Vstop): We check whether an agent undergoes substantial deceleration and comes to an evident stop from a normal driving state when there is no leading blocking vehicle, no red light, and no large-angle turning maneuver ahead. A time step is included in Vstop if no such violation is detected. • Traffic rule violation (Vtraffi...
[44] Kinematics • Trajectory jitter (Vjitter): We check whether an agent exhibits abnormal back-and-forth jittering in the lateral or longitudinal direction. A time step is included in Vjitter if no such violation is detected. Consequently, the final valid set V is strictly defined as the intersection of all individual criterion sets: V = Vcol ∩ Vstop ∩ Vtraffi...
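The criteria excerpted above define the valid set as the intersection of per-criterion valid sets. A minimal sketch of that set logic for a single agent follows, assuming each rule check reports the time steps at which it detected a violation. The function name and the dictionary layout are hypothetical, and the criterion names in the usage example are shorthand for those listed in the excerpt.

```python
def valid_time_steps(num_steps, violations):
    """Return the time steps that pass every criterion for one agent.

    violations maps a criterion name to the set of time steps at which
    that criterion detected a violation.
    """
    all_steps = set(range(num_steps))
    valid = set(all_steps)
    for violated in violations.values():
        # Per-criterion valid set: steps where this check did not fire.
        valid &= all_steps - violated
    return valid
```

For example, `valid_time_steps(5, {"col": {1}, "stop": {3}, "jitter": set()})` keeps steps 0, 2, and 4. A token-level label can then be set to 1 for steps inside the returned set and 0 otherwise, which is the convention assumed here for the evaluator's supervision.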

In this supplementary material, we provide additional details in a question-driven manner to support the methodology and experiments discussed in the ma...

Safety
• Collision (Vcol): We check whether an agent in the scene exhibits any explicit collision or hazardous spatial conflict with other agents over the entire time horizon. A time step is included in Vcol if no such violation is detected.

Compliance
• Abnormal stopping (Vstop): We check whether an agent undergoes substantial deceleration and comes to an evident stop from a normal driving state when there is no leading blocking vehicle, no red light, and no large-angle turning maneuver ahead. A time step is included in Vstop if no such violation is detected.
• Traffic rule violation (Vtraffi...
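As a rough illustration of the abnormal-stopping criterion, the sketch below flags time steps at which an agent is stopped, was recently driving at normal speed, and has none of the three listed justifications. The `StepContext` fields, the speed thresholds, and the look-back window are all assumptions made for this example; the source does not specify how the check is implemented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepContext:
    speed: float            # agent speed in m/s at this time step
    lead_blocking: bool     # a leading vehicle blocks the lane
    red_light: bool         # a red light applies to the agent
    large_turn_ahead: bool  # a large-angle turn is coming up


def abnormal_stop_steps(
    steps: List[StepContext],
    moving_speed: float = 2.0,  # assumed threshold for "normal driving"
    stop_speed: float = 0.3,    # assumed threshold for "evident stop"
    window: int = 3,            # assumed look-back length in time steps
) -> List[int]:
    """Return indices of time steps flagged as unjustified stops."""
    flagged = []
    for t, ctx in enumerate(steps):
        if ctx.speed > stop_speed:
            continue  # agent is not stopped at this step
        recent = steps[max(0, t - window):t]
        was_moving = any(s.speed >= moving_speed for s in recent)
        justified = ctx.lead_blocking or ctx.red_light or ctx.large_turn_ahead
        if was_moving and not justified:
            flagged.append(t)
    return flagged


# Hypothetical rollout: the agent brakes to a halt on an open road.
open_road = [StepContext(v, False, False, False) for v in (5.0, 3.0, 0.2, 0.1)]
flagged = abnormal_stop_steps(open_road)

# Same speeds, but a red light justifies the stop.
at_red_light = [StepContext(v, False, True, False) for v in (5.0, 3.0, 0.2, 0.1)]
flagged_red = abnormal_stop_steps(at_red_light)

# The violation-free set keeps every step that was not flagged.
v_stop = set(range(len(open_road))) - set(flagged)
```

In this example `flagged` contains steps 2 and 3 on the open road, while the red-light rollout yields no flagged steps, so the violation-free set for the open-road case is `{0, 1}`.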
