pith. sign in

arxiv: 2512.05812 · v5 · submitted 2025-12-05 · 💻 cs.RO · cs.CV

Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation

Pith reviewed 2026-05-17 00:53 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords multi-agent driving simulationinstance-centric representationbehavior modelingadversarial inverse reinforcement learningrelative positional encodingstraffic simulationrobust trajectory prediction
0
0 comments X

The pith

Instance-centric local frames with relative encodings let behavior models for multi-agent driving simulation scale efficiently while improving accuracy and robustness over agent-centric baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a behavior model to control individual vehicles in multi-agent driving simulations that must remain both realistic and fast as the number of agents grows. Each traffic participant and map element is placed in its own local coordinate frame so that static map information can be encoded once and reused at every time step. A query-centric symmetric encoder uses relative positional encodings between these local frames to capture interactions without needing a single global viewpoint. Training relies on adversarial inverse reinforcement learning together with an adaptive reward transformation that automatically trades off realism against robustness. The resulting model reduces training and inference time as token count rises and produces more accurate and stable trajectory predictions than several agent-centric alternatives.

Core claim

By placing every traffic participant and map element inside its own local coordinate frame, the method obtains a viewpoint-invariant scene encoding that reuses static map tokens across simulation steps. Interactions are modeled through a query-centric symmetric context encoder that applies relative positional encodings between the local frames. Adversarial inverse reinforcement learning combined with an adaptive reward transformation learns the policy, yielding a behavior model whose training and inference cost grows more slowly with the number of tokens and whose positional accuracy and robustness exceed those of agent-centric baselines in multi-agent driving simulation.

What carries the argument

Instance-centric scene representation that encodes each agent and map element in its own local coordinate frame, paired with relative positional encodings inside a query-centric symmetric context encoder.

Load-bearing premise

The local frames and relative encodings between them are assumed to preserve every interaction detail that matters without losing context that would only be visible from a shared global viewpoint.

What would settle it

Run the learned policy on a set of held-out scenarios containing agents whose relative positions create strong viewpoint asymmetry or partial occlusions; if the instance-centric model then shows higher average displacement error than a matched agent-centric baseline, the sufficiency claim is falsified.

Figures

Figures reproduced from arXiv: 2512.05812 by Christoph Stiller, Fabian Konstantinidis, Moritz Sackmann, Ulrich Hofmann.

Figure 1
Figure 1. Figure 1: Illustration of different scene representations. Our instance-centric [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Single simulation step: The behavior model maps observations to [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of an example situation using instance-centric observa [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of the proposed instance-centric behavior model mapping observations to actions. Instance encoders convert observations into latent [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Regressed inference latency of a single policy-network forward [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Peak throughput of the behavior model. a) Inference Steps per Second [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
read the original abstract

Scalable multi-agent driving simulation requires behavior models that are both realistic and computationally efficient. We address this by optimizing the behavior model that controls individual traffic participants. To improve efficiency, we adopt an instance-centric scene representation, where each traffic participant and map element is modeled in its own local coordinate frame. This design enables efficient, viewpoint-invariant scene encoding and allows static map tokens to be reused across simulation steps. To model interactions, we employ a query-centric symmetric context encoder with relative positional encodings between local frames. We use Adversarial Inverse Reinforcement Learning to learn the behavior model and propose an adaptive reward transformation that automatically balances robustness and realism during training. Experiments demonstrate that our approach scales efficiently with the number of tokens, significantly reducing training and inference times, while outperforming several agent-centric baselines in terms of positional accuracy and robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an instance-centric scene representation for multi-agent driving simulation, in which each traffic participant and map element is encoded in its own local coordinate frame. This enables viewpoint-invariant encoding and reuse of static map tokens across timesteps. Interactions are modeled with a query-centric symmetric context encoder that employs relative positional encodings between local frames. Behavior is learned via Adversarial Inverse Reinforcement Learning (AIRL) together with a proposed adaptive reward transformation that balances robustness and realism. The central experimental claim is that the approach scales efficiently with token count, reduces training and inference time, and outperforms several agent-centric baselines on positional accuracy and robustness.

Significance. If the representation and learning claims are substantiated with quantitative evidence, the work could offer a practical route to scalable, realistic multi-agent simulation for autonomous-driving validation. The combination of local-frame efficiency, token reuse, and AIRL-based policy learning addresses both computational and behavioral realism bottlenecks that currently limit large-scale simulation.

major comments (2)
  1. [§3.2] §3.2 (Instance-centric representation and relative encodings): The claim that relative positional encodings between local frames are sufficient to capture all relevant multi-agent interactions rests on an untested assumption. No ablation or analysis is provided showing that long-range relations (e.g., distant vehicles at an intersection or merging lane) are recovered without introducing viewpoint artifacts or loss of absolute context. This directly affects the robustness results reported in §4.
  2. [§4] §4 (Experiments): The abstract and results section assert clear outperformance in positional accuracy and robustness together with large efficiency gains, yet no quantitative metrics (ADE/FDE, collision rates, timing numbers), error bars, baseline implementation details, or data-exclusion criteria are supplied. Without these, the central empirical claim cannot be evaluated or reproduced.
minor comments (2)
  1. [Figure 1] Figure 1 or §3.1: A diagram explicitly showing the local-frame transformations and how relative encodings are computed between agents would improve clarity of the instance-centric design.
  2. [§3] Notation in §3: The symbols used for local frames, query tokens, and the adaptive reward transformation should be defined in a single table or paragraph to avoid scattered definitions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and describe the revisions we will make to improve clarity, substantiation, and reproducibility.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Instance-centric representation and relative encodings): The claim that relative positional encodings between local frames are sufficient to capture all relevant multi-agent interactions rests on an untested assumption. No ablation or analysis is provided showing that long-range relations (e.g., distant vehicles at an intersection or merging lane) are recovered without introducing viewpoint artifacts or loss of absolute context. This directly affects the robustness results reported in §4.

    Authors: We appreciate the referee's point that an explicit ablation would provide stronger support. The query-centric symmetric context encoder with relative positional encodings is specifically designed to model interactions through relative geometry, which is viewpoint-invariant and avoids the need for absolute coordinates. Our robustness experiments in §4 already include complex multi-agent scenarios such as intersections and merges where long-range relations are present, and the performance gains over agent-centric baselines indicate that these relations are captured effectively. Nevertheless, we agree that a dedicated analysis would address the concern directly. In the revision we will add an ablation study in §4 that compares variants with and without relative positional encodings on subsets of scenes containing distant agents, reporting effects on both accuracy and robustness metrics. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract and results section assert clear outperformance in positional accuracy and robustness together with large efficiency gains, yet no quantitative metrics (ADE/FDE, collision rates, timing numbers), error bars, baseline implementation details, or data-exclusion criteria are supplied. Without these, the central empirical claim cannot be evaluated or reproduced.

    Authors: We regret that the quantitative details were not presented with sufficient prominence. Section 4 contains ADE/FDE values for positional accuracy, collision rates for robustness evaluation, and wall-clock timing measurements for training and inference efficiency, all compared against the listed agent-centric baselines. Error bars reflect standard deviation across three random seeds, baseline implementations follow the original papers with hyperparameters listed in the appendix, and data-exclusion criteria follow the standard train/validation splits of the nuScenes dataset with no additional filtering beyond scene length. To resolve the referee's concern we will (i) insert the key numerical results into the abstract, (ii) expand §4 with a consolidated results table that includes all metrics and error bars, and (iii) add a short paragraph detailing baseline re-implementation choices and data criteria. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents an instance-centric scene representation with relative positional encodings and an AIRL-based learning procedure with adaptive reward transformation. Claims of efficiency scaling and outperformance rest on experimental comparisons to external agent-centric baselines rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations or steps in the provided text equate outputs to inputs by construction; results are presented as empirically validated on benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5437 in / 1033 out tokens · 65780 ms · 2026-05-17T00:53:40.776913+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    we employ instance-centric observations, representing each instance... in its respective local coordinate frame... relative positional encoding between a target agent i and any of the combined instance tokens... ri→j = [Δαi→j, ψi→j, ∥pi→j∥]⊺

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 4 internal anchors

  1. [1]

    In [6], AIRL is used, where a discriminator is trained to distinguish real from simulated behavior, assigning higher scores to more realistic samples

    address this issue by reconstructing a reward signal from real-world data. In [6], AIRL is used, where a discriminator is trained to distinguish real from simulated behavior, assigning higher scores to more realistic samples. As the goal is to drive as realistically as possible, the output of the discriminator is then used as a reward signal for RL traini...

  2. [2]

    In [18], a global map is rasterized and encoded using a CNN

    is the only work on learning a scene-centric multi-agent behavior model for closed-loop simulation. In [18], a global map is rasterized and encoded using a CNN. Then, local map features are extracted via Rotated Region of Interest Align and fused with the agent features. Lastly, a joint decoder model, realized as a message passing network, processes all a...

  3. [3]

    to reconstruct a surrogate reward signal from real data, given as D={(o 1, a1),(o 2, a2), . . .}. In AIRL, an additional discriminator model Dϕ is trained to distinguish generated from real samples, outputting the probability Dϕ(o, a)∈[0,1] for the observation-action pair being real, i. e., stemming from D. The policy is trained via RL using the surrogate...

  4. [4]

    Constant Velocity (CV): A learning-free baseline where agents are assumed to continue moving forward at a constant velocity

  5. [5]

    LateFusionMLP[8]: Following [7], [8], [16], this compact agent-centric model consists solely of MLPs and max-pooling operations. We adopt the public implementation [8], replacing its discrete action decoder with ours to support continuous actions and training it within our framework for realistic behavior modeling

  6. [6]

    GraphAIRL[6]: A more sophisticated agent-centric model that leverages a vectorized scene representation

  7. [7]

    We evaluate two variants: 1) trained with c= 5 , as proposed in [6], and 2) trained with our proposed adaptive reward offset, defined in (3)

    and attention-based interaction modeling. We evaluate two variants: 1) trained with c= 5 , as proposed in [6], and 2) trained with our proposed adaptive reward offset, defined in (3)

  8. [8]

    Our agent-centric observations include both nearby agents and map elements within the observation radius

    Behavior Cloning (BC): A supervised learning variant of our instance-centric approach, trained for 600 epochs by minimizing the negative log-likelihood of expert actions under the predicted action distribution. Our agent-centric observations include both nearby agents and map elements within the observation radius. The start and end points of a vector v a...

  9. [9]

    Mixsim: A hierarchical framework for mixed reality traffic simulation,

    S. Suo, K. Wong, J. Xu, J. Tu, A. Cui, S. Casas, and R. Urtasun, “Mixsim: A hierarchical framework for mixed reality traffic simulation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9622–9631

  10. [10]

    Sledge: Synthesizing driving environments with generative models and rule-based traffic,

    K. Chitta, D. Dauner, and A. Geiger, “Sledge: Synthesizing driving environments with generative models and rule-based traffic,” in European Conference on Computer Vision. Springer, 2024, pp. 57–74

  11. [11]

    Learning robust control policies for end-to- end autonomous driving from data-driven simulation,

    A. Amini, I. Gilitschenski, J. Phillips, J. Moseyko, R. Banerjee, S. Karaman, and D. Rus, “Learning robust control policies for end-to- end autonomous driving from data-driven simulation,”IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1143–1150, 2020

  12. [12]

    Trafficbots: Towards world models for autonomous driving simulation and motion prediction,

    Z. Zhang, A. Liniger, D. Dai, F. Yu, and L. Van Gool, “Trafficbots: Towards world models for autonomous driving simulation and motion prediction,” in2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 1522–1529

  13. [13]

    Modeling human driving behavior through generative adversarial imitation learning,

    R. Bhattacharyya, B. Wulfe, D. J. Phillips, A. Kuefler, J. Morton, R. Senanayake, and M. J. Kochenderfer, “Modeling human driving behavior through generative adversarial imitation learning,”IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 3, pp. 2874–2887, 2022

  14. [14]

    Graph- based adversarial imitation learning for predicting human driving behavior,

    F. Konstantinidis, M. Sackmann, U. Hofmann, and C. Stiller, “Graph- based adversarial imitation learning for predicting human driving behavior,” in2024 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2024, pp. 857–864

  15. [15]

    Robust autonomy emerges from self-play

    M. Cusumano-Towner, D. Hafner, A. Hertzberg, B. Huval, A. Petrenko, E. Vinitsky, E. Wijmans, T. Killian, S. Bowers, O. Seneret al., “Robust autonomy emerges from self-play,”arXiv preprint arXiv:2502.03349, 2025

  16. [16]

    Building reliable sim driving agents by scaling self-play,

    D. Cornelisse, A. Pandya, K. Joseph, J. Su ´arez, and E. Vinitsky, “Building reliable sim driving agents by scaling self-play,”arXiv preprint arXiv:2502.14706, 2025

  17. [17]

    ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst

    M. Bansal, A. Krizhevsky, and A. Ogale, “Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst,”arXiv preprint arXiv:1812.03079, 2018

  18. [18]

    Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios,

    Y . Lu, J. Fu, G. Tucker, X. Pan, E. Bronstein, R. Roelofs, B. Sapp, B. White, A. Faust, S. Whitesonet al., “Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios,” in2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023, pp. 7553–7560

  19. [19]

    Model-free deep reinforcement learning for urban autonomous driving,

    J. Chen, B. Yuan, and M. Tomizuka, “Model-free deep reinforcement learning for urban autonomous driving,” in2019 IEEE intelligent transportation systems conference (ITSC). IEEE, 2019, pp. 2765– 2771

  20. [20]

    Modeling interaction-aware driving behavior using graph-based representations and multi-agent reinforcement learning,

    F. Konstantinidis, M. Sackmann, U. Hofmann, and C. Stiller, “Modeling interaction-aware driving behavior using graph-based representations and multi-agent reinforcement learning,” in2023 IEEE 26th Interna- tional Conference on Intelligent Transportation Systems (ITSC). IEEE, 2023, pp. 1643–1650

  21. [21]

    Importance sampling-guided meta-training for intelligent agents in highly interactive environments,

    M. Arief, M. Timmerman, J. Li, D. Isele, and M. J. Kochenderfer, “Importance sampling-guided meta-training for intelligent agents in highly interactive environments,”IEEE Robotics and Automation Letters, 2024

  22. [22]

    Learning robust rewards with adverse- rial inverse reinforcement learning,

    J. Fu, K. Luo, and S. Levine, “Learning robust rewards with adverse- rial inverse reinforcement learning,” inInternational Conference on Learning Representations, 2018

  23. [23]

    Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research,

    C. Gulino, J. Fu, W. Luo, G. Tucker, E. Bronstein, Y . Lu, J. Harb, X. Pan, Y . Wang, X. Chenet al., “Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research,”Advances in Neural Information Processing Systems, vol. 36, pp. 7730–7742, 2023

  24. [24]

    Gpudrive: Data-driven, multi-agent driving simulation at 1 million fps,

    S. Kazemkhani, A. Pandya, D. Cornelisse, B. Shacklett, and E. Vinitsky, “Gpudrive: Data-driven, multi-agent driving simulation at 1 million fps,” arXiv preprint arXiv:2408.01584, 2024

  25. [25]

    Imagining the road ahead: Multi-agent trajectory prediction via differentiable simulation,

    A. ´Scibior, V . Lioutas, D. Reda, P. Bateni, and F. Wood, “Imagining the road ahead: Multi-agent trajectory prediction via differentiable simulation,” in2021 IEEE International Intelligent Transportation Systems Conference (ITSC), 2021, pp. 720–725

  26. [26]

    Trafficsim: Learning to simulate realistic multi-agent behaviors,

    S. Suo, S. Regalado, S. Casas, and R. Urtasun, “Trafficsim: Learning to simulate realistic multi-agent behaviors,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10 400–10 409

  27. [27]

    Scene transformer: A unified architecture for predicting multiple agent trajectories.arXiv preprint arXiv:2106.08417, 2021

    J. Ngiam, B. Caine, V . Vasudevan, Z. Zhang, H.-T. L. Chiang, J. Ling, R. Roelofs, A. Bewley, C. Liu, A. Venugopalet al., “Scene transformer: A unified architecture for predicting multiple agent trajectories,”arXiv preprint arXiv:2106.08417, 2021

  28. [28]

    Simnet: Learning reactive self-driving simulations from real-world observations,

    L. Bergamini, Y . Ye, O. Scheel, L. Chen, C. Hu, L. Del Pero, B. Osi´nski, H. Grimmett, and P. Ondruska, “Simnet: Learning reactive self-driving simulations from real-world observations,” in2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 5119–5125

  29. [29]

    General lane-changing model mobil for car-following models,

    A. Kesting, M. Treiber, and D. Helbing, “General lane-changing model mobil for car-following models,”Transportation Research Record, vol. 1999, no. 1, pp. 86–94, 2007

  30. [30]

    Enhanced intelligent driver model to access the impact of driving strategies on traffic capacity,

    ——, “Enhanced intelligent driver model to access the impact of driving strategies on traffic capacity,”Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 368, no. 1928, pp. 4585–4605, 2010

  31. [31]

    Feedback in imitation learning: The three regimes of covariate shift,

    J. Spencer, S. Choudhury, A. Venkatraman, B. Ziebart, and J. A. Bagnell, “Feedback in imitation learning: The three regimes of covariate shift,” arXiv preprint arXiv:2102.02872, 2021

  32. [32]

    Modelling two-dimensional driving behaviours at unsignalised intersection using multi-agent imitation learning,

    J. Sun and J. Kim, “Modelling two-dimensional driving behaviours at unsignalised intersection using multi-agent imitation learning,” Transportation Research Part C: Emerging Technologies, vol. 165, p. 104702, 2024

  33. [33]

    Betail: Behavior transformer adversarial imitation learning from human racing gameplay,

    C. Weaver, C. Tang, C. Hao, K. Kawamoto, M. Tomizuka, and W. Zhan, “Betail: Behavior transformer adversarial imitation learning from human racing gameplay,”IEEE Robotics and Automation Letters, 2024

  34. [34]

    Narrowing the coordinate-frame gap in behavior prediction models: Distillation for efficient and accurate scene-centric motion forecasting,

    D. A. Su, B. Douillard, R. Al-Rfou, C. Park, and B. Sapp, “Narrowing the coordinate-frame gap in behavior prediction models: Distillation for efficient and accurate scene-centric motion forecasting,” in2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 653–659

  35. [35]

    Simpl: A simple and efficient multi-agent motion prediction baseline for autonomous driving,

    L. Zhang, P. Li, S. Liu, and S. Shen, “Simpl: A simple and efficient multi-agent motion prediction baseline for autonomous driving,”IEEE Robotics and Automation Letters (RA-L), 2024

  36. [36]

    Query-centric trajectory prediction,

    Z. Zhou, J. Wang, Y .-H. Li, and Y .-K. Huang, “Query-centric trajectory prediction,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 863–17 873

  37. [37]

    Real-time motion prediction via heterogeneous polyline transformer with relative pose encoding,

    Z. Zhang, A. Liniger, C. Sakaridis, F. Yu, and L. V . Gool, “Real-time motion prediction via heterogeneous polyline transformer with relative pose encoding,”Advances in Neural Information Processing Systems, vol. 36, pp. 57 481–57 499, 2023

  38. [38]

    Mtr++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying,

    S. Shi, L. Jiang, D. Dai, and B. Schiele, “Mtr++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, pp. 3955–3971, 2024

  39. [39]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High- dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015

  40. [40]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

  41. [41]

    Vectornet: Encoding hd maps and agent dynamics from vectorized representation,

    J. Gao, C. Sun, H. Zhao, Y . Shen, D. Anguelov, C. Li, and C. Schmid, “Vectornet: Encoding hd maps and agent dynamics from vectorized representation,” in2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 522–11 530

  42. [42]

    Film: Visual reasoning with a general conditioning layer,

    E. Perez, F. Strub, H. De Vries, V . Dumoulin, and A. Courville, “Film: Visual reasoning with a general conditioning layer,” inProceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018

  43. [43]

    Driving with llms: Fusing object- level vector modality for explainable autonomous driving,

    L. Chen, O. Sinavski, J. H ¨unermann, A. Karnsund, A. J. Willmott, D. Birch, D. Maund, and J. Shotton, “Driving with llms: Fusing object- level vector modality for explainable autonomous driving,” in2024 IEEE International Conference on Robotics and Automation (ICRA), 2024

  44. [44]

    Perceiver: General perception with iterative attention,

    A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira, “Perceiver: General perception with iterative attention,” inInternational conference on machine learning. PMLR, 2021, pp. 4651–4664

  45. [45]

    Interaction dataset: An international, adversarial and cooperative motion dataset in interactive driving scenarios with semantic maps,

    W. Zhan, L. Sun, D. Wang, H. Shi, A. Clausse, M. Naumann, J. Kummerle, H. Konigshof, C. Stiller, A. de La Fortelleet al., “Interaction dataset: An international, adversarial and cooperative motion dataset in interactive driving scenarios with semantic maps,” arXiv preprint arXiv:1910.03088, 2019

  46. [46]

    Highly accurate and diverse traffic data: The deepscenario open 3d dataset,

    O. Dhaouadi, J. Meier, L. Wahl, J. Kaiser, L. Scalerandi, N. Wandelburg, Z. Zhou, N. Berinpanathan, H. Banzhaf, and D. Cremers, “Highly accurate and diverse traffic data: The deepscenario open 3d dataset,” arXiv preprint arXiv:2504.17371, 2025

  47. [47]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017