Beyond Self-Play: Hierarchical Reasoning for Continuous Motion in Closed-Loop Traffic Simulation

Adel Bazzi; Dengfeng Sun; Mingrui Li; Weifan Zhang; Xiaofeng Zhao; Yifan Wei

arXiv: 2605.09153 · v1 · submitted 2026-05-09 · cs.RO · cs.AI

Weifan Zhang, Xiaofeng Zhao, Adel Bazzi, Mingrui Li, Yifan Wei, Dengfeng Sun

Pith reviewed 2026-05-12 01:50 UTC · model grok-4.3

keywords: closed-loop traffic simulation · hierarchical reinforcement learning · Stackelberg MARL · continuous motion control · multi-agent interaction · hybrid co-training · SUMO simulator · safety and smoothness

The pith

A hierarchical Stackelberg MARL planner with continuous low-level control generates smoother and safer agent behaviors in closed-loop traffic simulation than self-play baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that self-play reinforcement learning scales for traffic simulation but produces equilibria lacking the social awareness of real drivers. It introduces a two-layer architecture where a high-level Stackelberg-style multi-agent module reasons about interactions to output intention commands, which a low-level continuous module converts into physically consistent control sequences. A hybrid co-training procedure that mixes MARL with recovery supervision counters distribution shift during closed-loop rollout. Experiments on a SUMO urban network show the resulting agents achieve better smoothness and safety scores while keeping traffic efficiency competitive with baselines.

Core claim

The central claim is that a hierarchical system pairing Stackelberg-style multi-agent reinforcement learning for interaction-aware intention commands with a low-level continuous motion module for scene-responsive trajectory execution, trained through hybrid MARL and auxiliary recovery supervision, produces closed-loop agent behaviors with superior smoothness and safety relative to pure self-play and passive imitation methods while preserving competitive traffic flow efficiency.

What carries the argument

The hierarchical architecture in which a Stackelberg MARL module generates interaction intention commands that condition a continuous low-level motion module, stabilized by hybrid co-training that combines strategic learning with recovery supervision.
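
The two-layer split described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Intention` values, the gap threshold, the target speeds, and the proportional controller are placeholders, not the paper's Stackelberg MARL module or its learned motion module. It only shows the data flow, in which a discrete intention conditions a bounded continuous control.

```python
from dataclasses import dataclass
from enum import Enum


class Intention(Enum):
    """Hypothetical intention commands emitted by the high-level layer."""
    YIELD = 0
    PROCEED = 1


@dataclass
class VehicleState:
    position: float  # metres along the lane
    speed: float     # metres per second


def high_level_intention(ego: VehicleState, other: VehicleState) -> Intention:
    """Placeholder for interaction reasoning: yield when the other vehicle
    is ahead within a short gap. A learned policy would replace this rule."""
    gap = other.position - ego.position
    return Intention.YIELD if 0.0 < gap < 15.0 else Intention.PROCEED


def low_level_control(ego: VehicleState, intention: Intention,
                      max_accel: float = 2.0) -> float:
    """Placeholder for the continuous module: track an intention-dependent
    target speed with a clipped proportional acceleration."""
    target_speed = 3.0 if intention is Intention.YIELD else 12.0
    accel = 0.5 * (target_speed - ego.speed)
    return max(-max_accel, min(max_accel, accel))


def step(ego: VehicleState, other: VehicleState, dt: float = 0.1) -> VehicleState:
    """One closed-loop step: choose an intention, then apply the control."""
    intention = high_level_intention(ego, other)
    accel = low_level_control(ego, intention)
    new_speed = max(0.0, ego.speed + accel * dt)
    return VehicleState(ego.position + new_speed * dt, new_speed)
```

Calling `step` repeatedly closes the loop: each new state feeds the next intention decision, which is where distribution shift between training and deployment would show up.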

If this is right

Agents display more socially responsive interaction patterns during multi-vehicle maneuvers.
Closed-loop deployment maintains lower variance in control smoothness across varying traffic densities.
Safety metrics improve without measurable loss in overall network throughput.
The hybrid training reduces the gap between training and deployment distributions enough to support longer simulation horizons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of strategic intention from continuous execution could be applied to other continuous multi-agent domains where high-level coordination must respect low-level dynamics.
If the low-level module were replaced by a learned policy conditioned on the same intentions, the same co-training logic might transfer to non-traffic settings.
The approach suggests a route for embedding human-like social reasoning into simulators used for testing autonomous driving stacks.

Load-bearing premise

High-level intention commands can be translated by the low-level continuous module into stable, scene-responsive controls without residual instability that hybrid co-training fails to correct.

What would settle it

A direct comparison on the same SUMO network showing that the full hierarchical model produces higher collision rates or jerk than a non-hierarchical self-play baseline with identical training budget would falsify the superiority claim.
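
A minimal sketch of that comparison follows, assuming per-vehicle speed traces sampled at a fixed interval and a collision count per distance driven. The function names, the dictionary keys, and the choice of mean absolute jerk as the smoothness metric are assumptions; the summary does not state which metrics the paper reports.

```python
def mean_abs_jerk(speeds, dt):
    """Mean absolute jerk (m/s^3) of a uniformly sampled speed trace,
    estimated with second finite differences of speed."""
    if len(speeds) < 3:
        raise ValueError("need at least three speed samples")
    jerks = [
        (speeds[i + 1] - 2.0 * speeds[i] + speeds[i - 1]) / (dt * dt)
        for i in range(1, len(speeds) - 1)
    ]
    return sum(abs(j) for j in jerks) / len(jerks)


def collision_rate(collisions, vehicle_km):
    """Collisions per vehicle-kilometre driven."""
    if vehicle_km <= 0:
        raise ValueError("vehicle_km must be positive")
    return collisions / vehicle_km


def superiority_claim_survives(hierarchical, baseline):
    """The claim is falsified if the hierarchical model is worse on either
    metric; both inputs are dicts with 'jerk' and 'collision_rate' keys."""
    return (hierarchical["jerk"] <= baseline["jerk"]
            and hierarchical["collision_rate"] <= baseline["collision_rate"])
```

In practice the check would be run on averages over several seeds with an identical training budget for both models, as the falsification condition requires.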

Figures

Figures reproduced from arXiv: 2605.09153 by Adel Bazzi, Dengfeng Sun, Mingrui Li, Weifan Zhang, Xiaofeng Zhao, Yifan Wei.

**Figure 2.** SUMO-based testing network covering a 1.5 × 2 mile urban region in California.

**Figure 3.** Illustrative comparisons of multi-vehicle control using strategically uninformed realization module (Wayformer), the ...

Original abstract

Closed-loop traffic simulation requires agents that are both scalable and behaviorally realistic. Recent self-play reinforcement learning approaches demonstrate strong scalability, but their equilibrium strategies fail to capture the socially aware behaviors of real human drivers. We propose a hierarchical architecture that goes beyond self-play by combining high-level multi-agent interaction reasoning with low-level continuous trajectory realization. Specifically, a Stackelberg-style Multi-Agent Reinforcement Learning (MARL) module generates interaction-aware intention commands. These commands condition a low-level continuous motion module, translating the strategic intent into physically consistent, scene-responsive control sequences. To mitigate distribution shift in closed-loop deployment, we introduce a hybrid co-training scheme combining MARL with auxiliary recovery supervision. Experiments on a SUMO-based urban network demonstrate that the proposed framework achieves superior control smoothness and safety compared to self-play and passive imitation baselines, while maintaining competitive traffic efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The hierarchical split with Stackelberg MARL on top and a continuous low-level module is a reasonable architectural idea for traffic simulation, but the abstract supplies zero numbers or ablations so the claimed gains in smoothness and safety cannot be checked.

The paper's core move is to layer a Stackelberg-style MARL module that outputs interaction-aware intention commands on top of a separate continuous-motion module that turns those commands into scene-responsive trajectories. A hybrid co-training scheme mixes the MARL objective with auxiliary recovery supervision to reduce distribution shift when the system runs closed-loop. This is framed as an advance over pure self-play, which the authors say produces equilibria that miss the socially aware behavior of real drivers. The setup targets exactly the kind of scalable yet realistic simulation needed for autonomous-vehicle validation on networks like SUMO. That direction is sensible and the split between strategic intentions and physical realization is a clean way to address the usual self-play limitations.

The experiments are described as showing better control smoothness and safety than self-play and passive imitation baselines while keeping traffic efficiency competitive. The problem is that none of those claims are backed by numbers, error bars, statistical tests, or even basic ablation details in the abstract. Without seeing the actual metrics or how the low-level module behaves under distribution shift, it is impossible to tell whether the hierarchy delivers real gains or whether the co-training is simply masking mismatches between the discrete SUMO dynamics and the continuous controller. The stress-test concern about instability in the intention-to-control translation is therefore still open.

This work is aimed at people already building MARL-based traffic simulators for robotics and transportation applications. If the full paper contains solid quantitative results and controls for the stability issue, it would be worth a serious referee's time; right now the evidence is too thin to judge.

Referee Report

2 major / 1 minor

Summary. The paper proposes a hierarchical architecture for closed-loop traffic simulation that combines a Stackelberg-style Multi-Agent Reinforcement Learning (MARL) module to generate interaction-aware intention commands with a low-level continuous motion module that translates these into physically consistent, scene-responsive controls. A hybrid co-training scheme is introduced to mitigate distribution shift between training and closed-loop deployment. Experiments on a SUMO-based urban network are claimed to show superior control smoothness and safety relative to self-play and passive imitation baselines while preserving competitive traffic efficiency.

Significance. If the empirical claims hold under rigorous evaluation, the work offers a meaningful step toward more behaviorally realistic multi-agent traffic simulators by explicitly separating strategic interaction reasoning from continuous trajectory execution. The hybrid co-training approach targets a practical deployment challenge in closed-loop settings and could influence downstream applications in autonomous driving validation and robotics simulation.

major comments (2)

[Experiments] Experiments section: the central claim of superior smoothness and safety is asserted without any numerical metrics, error bars, statistical tests, ablation tables, or protocol details, rendering the data-to-claim link unevaluable and directly undermining the paper's primary empirical contribution.
[§3.2] §3.2 (hybrid co-training description): no analysis or closed-loop verification is provided showing that the auxiliary recovery supervision actually resolves mismatches between discrete-time SUMO dynamics and the continuous low-level module, leaving open the possibility that co-training merely masks rather than eliminates instability in the intention-to-control mapping.

minor comments (1)

[§3] The notation distinguishing high-level intention commands from low-level control sequences could be made more explicit with additional equations or a diagram.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the empirical presentation and analysis.

Referee: [Experiments] Experiments section: the central claim of superior smoothness and safety is asserted without any numerical metrics, error bars, statistical tests, ablation tables, or protocol details, rendering the data-to-claim link unevaluable and directly undermining the paper's primary empirical contribution.

Authors: We agree that the experiments section as currently written presents the claims without sufficient quantitative support. In the revised manuscript we will add explicit numerical metrics for smoothness (e.g., mean jerk) and safety (e.g., collision rate and near-miss frequency), error bars from multiple random seeds, statistical significance tests against baselines, ablation tables isolating the hierarchical and co-training components, and a complete experimental protocol section detailing hyperparameters, SUMO settings, and evaluation procedures.
Revision: yes
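
The kind of reporting promised here (error bars over seeds plus a significance test against a baseline) can be sketched with the standard library alone. This is an illustration under assumptions: one scalar metric per seed, and an exact two-sided permutation test on the difference of means as the chosen test. The rebuttal does not say which test the authors will use.

```python
import itertools
import statistics


def mean_and_stderr(values):
    """Mean and standard error of a per-seed metric (needs >= 2 seeds)."""
    mean = statistics.mean(values)
    stderr = statistics.stdev(values) / len(values) ** 0.5
    return mean, stderr


def exact_permutation_pvalue(a, b):
    """Two-sided exact permutation test on the difference of group means.
    Enumerates every way to relabel the pooled seeds, so it is only
    practical for the small seed counts typical of RL experiments."""
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    extreme = 0
    splits = 0
    for idx in itertools.combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # tolerance for float round-off
            extreme += 1
        splits += 1
    return extreme / splits
```

With three seeds per method the smallest attainable two-sided p-value is 2/20 = 0.1, which is one reason a protocol section should state the seed count alongside any significance claim.
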
Referee: [§3.2] §3.2 (hybrid co-training description): no analysis or closed-loop verification is provided showing that the auxiliary recovery supervision actually resolves mismatches between discrete-time SUMO dynamics and the continuous low-level module, leaving open the possibility that co-training merely masks rather than eliminates instability in the intention-to-control mapping.

Authors: The observation is correct. We will expand §3.2 with a dedicated analysis of how the auxiliary recovery supervision mitigates the discrete-to-continuous dynamics mismatch. The revision will also include new closed-loop verification results, such as stability metrics (e.g., trajectory deviation and control saturation rates) measured during deployment, together with direct comparisons of the full model versus an ablated version without recovery supervision, to demonstrate that the co-training resolves rather than conceals the underlying instability.
Revision: yes
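
The two stability metrics named in this response could be computed along the following lines. This is a sketch under assumptions: controls are scalar accelerations with a symmetric actuator limit, and trajectory deviation is taken as the RMS distance between matched executed and reference points. The exact definitions the authors intend are not given.

```python
import math


def saturation_rate(controls, limit, tol=1e-9):
    """Fraction of control samples at or beyond the actuator limit."""
    if not controls:
        raise ValueError("controls must be non-empty")
    saturated = sum(1 for u in controls if abs(u) >= limit - tol)
    return saturated / len(controls)


def rms_deviation(executed, reference):
    """Root-mean-square Euclidean distance between time-matched
    (x, y) points of the executed and reference trajectories."""
    if len(executed) != len(reference) or not executed:
        raise ValueError("trajectories must be non-empty and equal length")
    squared = [
        (ex - rx) ** 2 + (ey - ry) ** 2
        for (ex, ey), (rx, ry) in zip(executed, reference)
    ]
    return math.sqrt(sum(squared) / len(squared))
```

Reporting both for the full model and for the ablation without recovery supervision would show whether co-training removes instability or only hides it behind saturated controls.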

Circularity Check

0 steps flagged

No circularity: empirical architectural proposal with independent experimental validation

The paper proposes a hierarchical framework (Stackelberg MARL for high-level intentions + low-level continuous motion module + hybrid co-training) and supports its claims of improved smoothness/safety via direct SUMO experiments against self-play and imitation baselines. No algebraic derivation chain exists; there are no equations presented as first-principles predictions that reduce to fitted parameters or self-citations by construction. Central claims remain falsifiable through external simulation benchmarks rather than tautological. Minor self-citations, if present, are not load-bearing for the empirical results.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on standard reinforcement-learning assumptions plus the novel architectural split; no machine-checked proofs or parameter-free derivations are present.

free parameters (1)

MARL and co-training hyperparameters
Typical RL training parameters and auxiliary loss weights are required but not enumerated in the abstract.

axioms (2)

domain assumption: Stackelberg leader-follower dynamics adequately model multi-agent traffic interactions
Invoked to justify the high-level module.
domain assumption: Low-level continuous controller can faithfully realize high-level intentions without violating physics or scene constraints
Core premise of the hierarchical decomposition.
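
The first assumption can be made concrete with a toy leader-follower game. The sketch below solves a pure-strategy Stackelberg game by enumeration: the leader commits to an action, the follower best-responds, and the leader picks the commitment with the best resulting payoff. The payoff matrices in the usage note are invented for illustration and are not taken from the paper.

```python
def follower_best_response(follower_payoff, leader_action):
    """Index of the follower action that maximizes the follower's payoff
    once the leader's action is fixed (ties go to the lowest index)."""
    row = follower_payoff[leader_action]
    return max(range(len(row)), key=lambda j: row[j])


def stackelberg_solution(leader_payoff, follower_payoff):
    """Return (leader_action, follower_action, leader_value) for a
    pure-strategy Stackelberg equilibrium found by enumeration.
    Payoffs are indexed as payoff[leader_action][follower_action]."""
    best = None
    for a in range(len(leader_payoff)):
        b = follower_best_response(follower_payoff, a)
        value = leader_payoff[a][b]
        if best is None or value > best[2]:
            best = (a, b, value)
    return best
```

For a hypothetical merge with actions 0 = yield and 1 = go, `leader_payoff = [[-1, 0], [2, -10]]` and `follower_payoff = [[-1, 2], [0, -10]]` give the solution `(1, 0, 2)`: the leader commits to going and the follower yields. Whether real drivers follow such an ordering is exactly what this axiom assumes.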

invented entities (1)

Hybrid co-training scheme (no independent evidence)
purpose: Mitigate distribution shift between training and closed-loop deployment
New training procedure introduced to stabilize the hierarchy.
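
One plausible reading of "hybrid co-training" is a weighted sum of the MARL objective and an auxiliary recovery term. The sketch below assumes that form, with a mean-squared-error recovery loss and a hypothetical `recovery_weight`; the abstract does not specify the actual loss, so this is only a way to see where the free parameter listed above enters.

```python
def recovery_loss(predicted_controls, recovery_targets):
    """Mean squared error between the controls the low-level module
    predicts from a perturbed state and the controls that would steer
    it back toward the nominal trajectory (assumed supervision signal)."""
    if len(predicted_controls) != len(recovery_targets) or not predicted_controls:
        raise ValueError("inputs must be non-empty and equal length")
    squared = [(p - t) ** 2 for p, t in zip(predicted_controls, recovery_targets)]
    return sum(squared) / len(squared)


def hybrid_loss(marl_loss, predicted_controls, recovery_targets,
                recovery_weight=0.5):
    """Assumed combined objective: MARL loss plus a weighted recovery
    term. recovery_weight is a hypothetical hyperparameter."""
    return marl_loss + recovery_weight * recovery_loss(
        predicted_controls, recovery_targets)
```

Setting `recovery_weight` to zero recovers pure MARL training, which is the ablation the referee asks for.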

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: hierarchical architecture... Stackelberg-style Multi-Agent Reinforcement Learning (MARL) module generates interaction-aware intention commands... command-conditioned Wayformer... hybrid co-training scheme

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: Experiments on a SUMO-based urban network demonstrate... superior control smoothness and safety

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1] H. X. Liu and S. Feng, “Curse of rarity for autonomous vehicles,” Nature Communications, vol. 15, no. 1, p. 4808, 2024.
[2] M. Cusumano-Towner, D. Hafner, A. Hertzberg, B. Huval, A. Petrenko, E. Vinitsky, E. Wijmans, T. Killian, S. Bowers, O. Sener et al., “Robust autonomy emerges from self-play,” arXiv preprint arXiv:2502.03349, 2025.
[3] D. Cornelisse, A. Pandya, K. Joseph, J. Suárez, and E. Vinitsky, “Building reliable sim driving agents by scaling self-play,” arXiv preprint arXiv:2502.14706, 2025.
[4] Z. Wang, S. Rahmani, D. Cornelisse, B. Sarkar, A. D. Goldie, J. N. Foerster, and S. Whiteson, “Learning to drive in new cities without human demonstrations,” arXiv preprint arXiv:2602.15891, 2026.
[5] Y. Lu, J. Fu, G. Tucker, X. Pan, E. Bronstein, R. Roelofs, B. Sapp, B. White, A. Faust, S. Whiteson et al., “Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023, pp. 7553–7560.
[6] C. Zhang, J. Tu, L. Zhang, K. Wong, S. Suo, and R. Urtasun, “Learning realistic traffic agents in closed-loop,” arXiv preprint arXiv:2311.01394, 2023.
[7] M. Pei, S. Shi, and S. Shen, “Advancing multi-agent traffic simulation via r1-style reinforcement fine-tuning,” arXiv preprint arXiv:2509.23993, 2025.
[8] J. Booher, K. Rohanimanesh, J. Xu, V. Isenbaev, A. Balakrishna, I. Gupta, W. Liu, and A. Petiushko, “Cimrl: Combining imitation and reinforcement learning for safe autonomous driving,” arXiv preprint arXiv:2406.08878, 2024.
[9] N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp, “Wayformer: Motion forecasting via simple & efficient attention networks,” 2023 IEEE International Conference on Robotics and Automation (ICRA), 2022.
[10] A. Seff, B. Cera, D. Chen, M. Ng, A. Zhou, N. Nayakanti, K. S. Refaat, R. Al-Rfou, and B. Sapp, “Motionlm: Multi-agent motion forecasting as language modeling,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8579–8590.
[11] Z. Huang, H. Liu, and C. Lv, “Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3903–3913.
[12] B. Zhang, H. Mao, L. Li, Z. Xu, D. Li, R. Zhao, and G. Fan, “Sequential asynchronous action coordination in multi-agent systems: A stackelberg decision transformer approach,” in Forty-first International Conference on Machine Learning, 2024.
[13] Z. Qiao, J. Schneider, and J. M. Dolan, “Behavior planning at urban intersections through hierarchical reinforcement learning,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 2667–2673.
[14] Z. Li, G. Jin, R. Yu, Z. Chen, N. Li, W. Han, L. Xiong, B. Leng, J. Hu, I. Kolmanovsky et al., “A survey of reinforcement learning-based motion planning for autonomous driving: Lessons learned from a driving task perspective,” arXiv preprint arXiv:2503.23650, 2025.
[15] X. Zhao, W. Zhang, and D. Sun, “Cooperation with humans of unknown intentions in confined spaces using the stackelberg friend-or-foe game,” IEEE Transactions on Intelligent Transportation Systems, 2026.
[16] X. Zhao, H. Hu, and D. Sun, “Cooperation with humans of unknown intentions in confined spaces using the stackelberg friend-or-foe game,” IEEE Transactions on Aerospace and Electronic Systems, vol. 61, no. 3, pp. 5814–5825, 2025.
[17] J. Spencer, S. Choudhury, A. Venkatraman, B. Ziebart, and J. A. Bagnell, “Feedback in imitation learning: The three regimes of covariate shift,” arXiv preprint arXiv:2102.02872, 2021.
[18] P. De Haan, D. Jayaraman, and S. Levine, “Causal confusion in imitation learning,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[19] S. Suo, S. Regalado, S. Casas, and R. Urtasun, “Trafficsim: Learning to simulate realistic multi-agent behaviors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10400–10409.
[20] W. Wu, X. Feng, Z. Gao, and Y. Kan, “Smart: Scalable multi-agent real-time motion generation via next-token prediction,” Advances in Neural Information Processing Systems, vol. 37, pp. 114048–114071, 2024.
