pith. sign in

arxiv: 2410.07191 · v2 · pith:DZI7HRW3new · submitted 2024-09-23 · 💻 cs.RO · cs.LG· stat.ME

Curb Your Attention: Causal Attention Gating for Robust Trajectory Prediction in Autonomous Driving

Pith reviewed 2026-05-23 20:20 UTC · model grok-4.3

classification 💻 cs.RO cs.LGstat.ME
keywords trajectory predictionautonomous drivingcausal discoveryattention gatingtransformerrobustnessdomain generalization
0
0 comments X

The pith

Causal attention gating in trajectory models filters non-causal agent signals to raise robustness by up to 54 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a trajectory prediction model that first runs a causal discovery network over past observations to find which surrounding agents actually influence the ego vehicle. It then inserts a causal attention gating step inside a transformer so that attention weights ignore agents whose actions lack causal links. This design targets the problem of predictions being thrown off by irrelevant agents whose behavior should not matter. If the gating works as intended, predictions stay stable under added noise from non-causal agents while accuracy on normal cases holds steady. Experiments on standard driving datasets also show the same architecture transfers better to new domains.

Core claim

The model CRiTIC identifies inter-agent causal relations over a window of past time steps with a Causal Discovery Network, then applies a Causal Attention Gating mechanism inside its transformer encoder to pass only causally relevant information forward; this yields up to 54 percent higher robustness against non-causal perturbations with little loss in prediction accuracy and up to 29 percent better performance when tested across different driving datasets.

What carries the argument

Causal Attention Gating, which multiplies standard attention scores by a binary or soft mask derived from the output of the Causal Discovery Network so that only agents with identified causal influence contribute to the prediction.

If this is right

  • Trajectory forecasts become less sensitive to the movements of agents whose actions have no causal bearing on the ego vehicle.
  • Prediction accuracy on standard benchmarks stays comparable while robustness metrics rise.
  • The same architecture produces higher accuracy when the test distribution shifts to a different driving dataset or city.
  • Downstream planning modules receive more stable inputs because fewer spurious correlations reach the output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same discovery-plus-gating pattern could be inserted into other multi-agent forecasting settings such as pedestrian crowd modeling.
  • Running causal discovery on longer histories or with uncertainty estimates might further tighten the mask and reduce residual errors.
  • Directly feeding the discovered causal graph into a planner could let the vehicle plan around only the agents that truly matter.
  • Replacing the discovery network with a learned module trained end-to-end might relax the requirement for an accurate separate causal estimator.

Load-bearing premise

The causal discovery network must correctly label which agents exert causal influence on the ego-agent over the observed time window.

What would settle it

A controlled test set in which non-causal agents are deliberately injected into scenes but the discovery network still assigns them high causal scores would remove the reported robustness gains if the gating step is the source of improvement.

Figures

Figures reproduced from arXiv: 2410.07191 by Amir Rasouli, Ehsan Ahmadi, Kasra Rezaee, Ray Mercurius, Soheil Alizadeh.

Figure 1
Figure 1. Figure 1: Robustness qualitative samples. The AV is shown in green. In CRiTIC’s [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: An overview of CRiTIC. In this architecture, Causal Discovery Network receives the agent representations and generates a causality adjacency matrix. The [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Precision, recall, and the robustness against RemoveNonCausal [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
read the original abstract

Trajectory prediction models in autonomous driving are vulnerable to perturbations from non-causal agents whose actions should not affect the ego-agent's behavior. Such perturbations can lead to incorrect predictions of other agents' trajectories, potentially compromising the safety and efficiency of the ego-vehicle's decision-making process. Motivated by this challenge, we propose $\textit{Causal tRajecTory predICtion}$ $\textbf{(CRiTIC)}$, a novel model that utilizes a $\textit{Causal Discovery Network}$ to identify inter-agent causal relations over a window of past time steps. To incorporate discovered causal relationships, we propose a novel $\textit{Causal Attention Gating}$ mechanism to selectively filter information in the proposed Transformer-based architecture. We conduct extensive experiments on two autonomous driving benchmark datasets to evaluate the robustness of our model against non-causal perturbations and its generalization capacity. Our results indicate that the robustness of predictions can be improved by up to $\textbf{54%}$ without a significant detriment to prediction accuracy. Lastly, we demonstrate the superior domain generalizability of the proposed model, which achieves up to $\textbf{29%}$ improvement in cross-domain performance. These results underscore the potential of our model to enhance both robustness and generalization capacity for trajectory prediction in diverse autonomous driving domains. Further details can be found on our project page: https://ehsan-ami.github.io/critic.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes CRiTIC, a Transformer-based trajectory prediction model for autonomous driving. It introduces a Causal Discovery Network to identify inter-agent causal relations from past trajectories and a Causal Attention Gating mechanism to selectively filter non-causal information. The central empirical claims are up to 54% improvement in robustness against non-causal perturbations and up to 29% better cross-domain performance on two public benchmarks, with no significant loss in nominal prediction accuracy.

Significance. If the reported robustness and generalization gains are shown to stem specifically from accurate causal discovery rather than architectural side-effects or dataset artifacts, the approach could meaningfully improve safety margins in autonomous driving by reducing sensitivity to irrelevant agents. The work evaluates on standard public datasets and provides a project page, which supports reproducibility of the empirical protocol.

major comments (2)
  1. [Method description (Causal Discovery Network)] The 54% robustness and 29% cross-domain claims rest on the Causal Discovery Network correctly recovering inter-agent causal edges. No section reports an independent validation metric (e.g., edge F1, intervention test, or synthetic-graph recovery) that quantifies discovery precision on either benchmark; the method description only states that the network “identifies” relations without reporting its own error rate or ablation against a non-causal baseline using the same architecture but random or correlation-based masks.
  2. [Abstract and Experiments] Abstract and experimental claims provide no information on perturbation generation procedure, choice of baseline models, statistical significance testing, or ablation controls that isolate the contribution of the gating mechanism. These omissions make it impossible to evaluate whether the quantitative gains are load-bearing evidence for the causal-attention hypothesis.
minor comments (1)
  1. [Abstract] The acronym construction “Causal tRajecTory predICtion (CRiTIC)” is unconventional and may confuse readers; a standard descriptive name would improve clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments identify areas where additional clarity and controls would strengthen the presentation of the causal discovery and gating components. We respond to each major comment below and outline the corresponding revisions.

read point-by-point responses
  1. Referee: [Method description (Causal Discovery Network)] The 54% robustness and 29% cross-domain claims rest on the Causal Discovery Network correctly recovering inter-agent causal edges. No section reports an independent validation metric (e.g., edge F1, intervention test, or synthetic-graph recovery) that quantifies discovery precision on either benchmark; the method description only states that the network “identifies” relations without reporting its own error rate or ablation against a non-causal baseline using the same architecture but random or correlation-based masks.

    Authors: We agree that the manuscript does not provide direct validation metrics (such as edge F1 or synthetic-graph recovery) for the Causal Discovery Network, as the real-world benchmarks lack ground-truth causal edges. The reported robustness gains are shown via end-to-end performance under controlled perturbations rather than explicit causal accuracy metrics. To address the concern, we will add an ablation study comparing the full model against variants that replace the discovered relations with random masks and with correlation-based masks, using the identical Transformer architecture. We will also add a limitations paragraph discussing the absence of ground-truth causal labels in public driving datasets. revision: partial

  2. Referee: [Abstract and Experiments] Abstract and experimental claims provide no information on perturbation generation procedure, choice of baseline models, statistical significance testing, or ablation controls that isolate the contribution of the gating mechanism. These omissions make it impossible to evaluate whether the quantitative gains are load-bearing evidence for the causal-attention hypothesis.

    Authors: We acknowledge that the current abstract and experimental sections omit these procedural and control details. In the revised version we will (i) expand the abstract to briefly note the perturbation protocol and evaluation protocol, (ii) add a dedicated subsection describing how non-causal perturbations are generated, (iii) list all baseline models with citations, (iv) report statistical significance (e.g., mean and standard deviation over multiple seeds together with paired statistical tests), and (v) include an explicit ablation that isolates the Causal Attention Gating by comparing the full model to an ablated version without the gating mechanism. revision: yes

standing simulated objections not resolved
  • Direct edge-level validation metrics (e.g., edge F1) for the Causal Discovery Network cannot be reported on the public benchmarks because those datasets do not contain ground-truth causal relations between agents.

Circularity Check

0 steps flagged

No circularity: empirical architecture evaluated on external benchmarks

full rationale

The paper introduces CRiTIC as a Transformer-based model augmented by a Causal Discovery Network and Causal Attention Gating. All performance claims (54% robustness, 29% cross-domain) are obtained by direct measurement against baselines on two public autonomous-driving datasets. No equations, fitted parameters, or self-citations are shown to reduce the reported metrics to the model's own inputs by construction; the derivation chain consists of standard architectural choices followed by empirical validation, rendering the results externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The abstract does not enumerate free parameters or background axioms; the model introduces a new gating mechanism whose correctness rests on the empirical performance of the causal discovery component.

invented entities (1)
  • Causal Attention Gating mechanism no independent evidence
    purpose: Selectively filter transformer attention based on discovered causal relations
    Newly proposed component whose independent evidence is the reported robustness gains.

pith-pipeline@v0.9.0 · 5796 in / 1104 out tokens · 24277 ms · 2026-05-23T20:20:21.008880+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Forecasting the Past: Gradient-Based Distribution Shift Detection in Trajectory Prediction

    cs.LG 2026-04 unverdicted novelty 7.0

    A gradient norm from a post-hoc self-supervised trajectory forecasting decoder detects distribution shifts in prediction models, with reported improvements on Shifts and Argoverse datasets.

  2. Super Agents and Confounders: Influence of surrounding agents on vehicle trajectory prediction

    cs.LG 2026-04 unverdicted novelty 6.0

    Surrounding agents frequently degrade trajectory prediction accuracy in interactive driving scenes, and integrating a Conditional Information Bottleneck improves results by ignoring non-beneficial contextual signals.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 2 Pith papers

  1. [1]

    Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction,

    B. Varadarajan, A. Hefny, A. Srivastava, K. S. Refaat, N. Nayakanti, A. Cornman, K. Chen, B. Douillard, C. P. Lam, D. Anguelov, et al. , “Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction,” in ICRA, 2022

  2. [2]

    Motion transformer with global intention localization and local movement refinement,

    S. Shi, L. Jiang, D. Dai, and B. Schiele, “Motion transformer with global intention localization and local movement refinement,” in NeurIPS, 2022

  3. [3]

    Scene transformer: A unified architecture for predicting future trajectories of multiple agents,

    J. Ngiam, V. Vasudevan, B. Caine, Z. Zhang, H. T. L. Chiang, J. Ling, R. Roelofs, A. Bewley, C. Liu, A. Venugopal, D. J. Weiss, B. Sapp, Z. Chen, and J. Shlens, “Scene transformer: A unified architecture for predicting future trajectories of multiple agents,” in ICLR, 2022

  4. [4]

    Wayformer: Motion forecasting via simple & efficient attention networks,

    N. Nayakanti, R. Al Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp, “Wayformer: Motion forecasting via simple & efficient attention networks,” in ICRA, 2023

  5. [5]

    CausalAgents: A robustness benchmark for motion forecasting,

    L. Sun, R. Roelofs, B. Caine, K. S. Refaat, B. Sapp, S. Ettinger, and W. Chai, “CausalAgents: A robustness benchmark for motion forecasting,” in ICRA, 2024

  6. [6]

    Toward causal representation learning,

    B. Sch¨ olkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio, “Toward causal representation learning,” Proceedings of the IEEE , 2021

  7. [7]

    Causal imitative model for autonomous driving,

    M. R. Samsami, M. Bahari, S. Salehkaleybar, and A. Alahi, “Causal imitative model for autonomous driving,” arXiv:2112.03908, 2021

  8. [8]

    Towards robust and adaptive motion forecasting: A causal representation perspective,

    Y. Liu, R. Cadei, J. Schweizer, S. Bahmani, and A. Alahi, “Towards robust and adaptive motion forecasting: A causal representation perspective,” in CVPR, 2022

  9. [9]

    A survey on graph structure learning: Progress and opportunities,

    Y. Zhu, W. Xu, J. Zhang, Y. Du, J. Zhang, Q. Liu, C. Yang, and S. Wu, “A survey on graph structure learning: Progress and opportunities,” arXiv:2103.03036, 2021

  10. [10]

    Neural relational inference for interacting systems,

    T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel, “Neural relational inference for interacting systems,” in ICML, 2018

  11. [11]

    Iterative deep graph learning for graph neural networks: Better and robust node embeddings,

    Y. Chen, L. Wu, and M. Zaki, “Iterative deep graph learning for graph neural networks: Better and robust node embeddings,” in NeurIPS, 2020

  12. [12]

    Learning discrete structures for graph neural networks,

    L. Franceschi, M. Niepert, M. Pontil, and X. He, “Learning discrete structures for graph neural networks,” in ICML, 2019

  13. [13]

    SLAPS: Self-supervision improves structure learning for graph neural networks,

    B. Fatemi, L. E. Asri, and S. M. Kazemi, “SLAPS: Self-supervision improves structure learning for graph neural networks,” in NeurIPS, 2021

  14. [14]

    On causal discovery from time series data using FCI,

    D. Entner and P. O. Hoyer, “On causal discovery from time series data using FCI,” Probabilistic Graphical Models, pp. 121–128, 2010

  15. [15]

    Optimal structure identification with greedy search,

    D. M. Chickering, “Optimal structure identification with greedy search,” JMLR, vol. 3, pp. 507–554, 2002

  16. [16]

    Neural Granger causality,

    A. Tank, I. Covert, N. Foti, A. Shojaie, and E. B. Fox, “Neural Granger causality,” PAMI, vol. 44, no. 8, pp. 4267–4279, 2021

  17. [17]

    Amortized causal discovery: Learning to infer causal graphs from time-series data,

    S. L¨ owe, D. Madras, R. Zemel, and M. Welling, “Amortized causal discovery: Learning to infer causal graphs from time-series data,” in CLeaR, 2022

  18. [18]

    Pedformer: Pedestrian behavior prediction via cross-modal attention modulation and gated multitask learning,

    A. Rasouli and I. Kotseruba, “Pedformer: Pedestrian behavior prediction via cross-modal attention modulation and gated multitask learning,” in ICRA, 2023

  19. [19]

    Graph-based spatial transformer with memory replay for multi-future pedestrian trajectory prediction,

    L. Li, M. Pagnucco, and Y. Song, “Graph-based spatial transformer with memory replay for multi-future pedestrian trajectory prediction,” in CVPR, 2022

  20. [20]

    Learning pedestrian group representations for multi-modal trajectory prediction,

    I. Bae, J.-H. Park, and H.-G. Jeon, “Learning pedestrian group representations for multi-modal trajectory prediction,” in ECCV, 2022

  21. [21]

    Dice: Diverse diffusion model with scoring for trajectory prediction,

    Y. Choi, R. C. Mercurius, S. Mohamad Alizadeh Shabestary, and A. Rasouli, “Dice: Diverse diffusion model with scoring for trajectory prediction,” in IV, 2024

  22. [22]

    Bifold and semantic reasoning for pedestrian behavior prediction,

    A. Rasouli, M. Rohani, and J. Luo, “Bifold and semantic reasoning for pedestrian behavior prediction,” in ICCV, 2021

  23. [23]

    SGCN: Sparse graph convolution network for pedestrian trajectory prediction,

    L. Shi, L. Wang, C. Long, S. Zhou, M. Zhou, Z. Niu, and G. Hua, “SGCN: Sparse graph convolution network for pedestrian trajectory prediction,” in CVPR, 2021

  24. [24]

    Cadet: a causal disentanglement approach for robust trajectory prediction in autonomous driving,

    M. Pourkeshavarz, J. Zhang, and A. Rasouli, “Cadet: a causal disentanglement approach for robust trajectory prediction in autonomous driving,” in CVPR, 2024

  25. [25]

    Destine: Dynamic goal queries with temporal transductive alignment for trajectory prediction,

    R. Karim, S. M. A. Shabestary, and A. Rasouli, “Destine: Dynamic goal queries with temporal transductive alignment for trajectory prediction,” in ICRA, 2024

  26. [26]

    LaPred: Lane-aware prediction of multi-modal future trajectories of dynamic agents,

    B. Kim, S. H. Park, S. Lee, E. Khoshimjonov, D. Kum, J. Kim, J. S. Kim, and J. W. Choi, “LaPred: Lane-aware prediction of multi-modal future trajectories of dynamic agents,” in CVPR, 2021

  27. [27]

    HiVT: Hierarchical vector transformer for multi-agent motion prediction,

    Z. Zhou, L. Ye, J. Wang, K. Wu, and K. Lu, “HiVT: Hierarchical vector transformer for multi-agent motion prediction,” in CVPR, 2022

  28. [28]

    LTP: Lane-based trajectory prediction for autonomous driving,

    J. Wang, T. Ye, Z. Gu, and J. Chen, “LTP: Lane-based trajectory prediction for autonomous driving,” in CVPR, 2022

  29. [29]

    Learning lane graph representations for motion forecasting,

    M. Liang, B. Yang, R. Hu, Y. Chen, R. Liao, S. Feng, and R. Urtasun, “Learning lane graph representations for motion forecasting,” in ECCV, 2020

  30. [30]

    LatentFormer: Multi-agent transformer-based interaction modeling and trajectory prediction,

    E. Amirloo, A. Rasouli, P. Lakner, M. Rohani, and J. Luo, “LatentFormer: Multi-agent transformer-based interaction modeling and trajectory prediction,” arXiv:2203.01880, 2022

  31. [31]

    LookOut: Diverse multi-future prediction and planning for self-driving,

    A. Cui, S. Casas, A. Sadat, R. Liao, and R. Urtasun, “LookOut: Diverse multi-future prediction and planning for self-driving,” in ICCV, 2021

  32. [32]

    Implicit latent variable model for scene-consistent motion forecasting,

    S. Casas, C. Gulino, S. Suo, K. Luo, R. Liao, and R. Urtasun, “Implicit latent variable model for scene-consistent motion forecasting,” in ECCV, 2020

  33. [33]

    Latent variable sequential set transformers for joint multi-agent motion prediction,

    R. Girgis, F. Golemo, F. Codevilla, M. Weiss, J. A. D’Souza, S. E. Kahou, F. Heide, and C. Pal, “Latent variable sequential set transformers for joint multi-agent motion prediction,” in ICLR, 2022

  34. [34]

    VectorNet: Encoding hd maps and agent dynamics from vectorized representation,

    J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid, “VectorNet: Encoding hd maps and agent dynamics from vectorized representation,” in CVPR, 2020

  35. [35]

    Learn tarot with mentor: A meta-learned self-supervised approach for trajectory prediction,

    M. Pourkeshavarz, C. Chen, and A. Rasouli, “Learn tarot with mentor: A meta-learned self-supervised approach for trajectory prediction,” in ICCV, 2023

  36. [36]

    Tract: A training dynamics aware contrastive learning framework for long-tail trajectory prediction,

    J. Zhang, M. Pourkeshavarz, and A. Rasouli, “Tract: A training dynamics aware contrastive learning framework for long-tail trajectory prediction,” in IV, 2024

  37. [37]

    Multiple futures 7 prediction,

    C. Tang and R. R. Salakhutdinov, “Multiple futures 7 prediction,” in NeurIPS, 2019

  38. [38]

    Trajectron++: Multi-agent generative trajectory forecasting with heterogeneous data for control,

    T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone, “Trajectron++: Multi-agent generative trajectory forecasting with heterogeneous data for control,” in ECCV, 2020

  39. [39]

    GOHOME: Graph-oriented heatmap output for future motion estimation,

    T. Gilles, S. Sabatini, D. Tsishkou, B. Stanciulescu, and F. Moutarde, “GOHOME: Graph-oriented heatmap output for future motion estimation,” in ICRA, 2022

  40. [40]

    MUSE-VAE: Multi-scale VAE for environment-aware long term trajectory prediction,

    M. Lee, S. S. Sohn, S. Moon, S. Yoon, M. Kapadia, and V. Pavlovic, “MUSE-VAE: Multi-scale VAE for environment-aware long term trajectory prediction,” in CVPR, 2022

  41. [41]

    TNT: Target-driven trajectory prediction,

    H. Zhao, J. Gao, T. Lan, C. Sun, B. Sapp, B. Varadarajan, Y. Shen, Y. Shen, Y. Chai, C. Schmid, et al. , “TNT: Target-driven trajectory prediction,” in CoRL, 2020

  42. [42]

    A novel benchmarking paradigm and a scale- and motion-aware model for egocentric pedestrian trajectory prediction,

    A. Rasouli, “A novel benchmarking paradigm and a scale- and motion-aware model for egocentric pedestrian trajectory prediction,” in ICRA, 2024

  43. [43]

    AgentFormer: Agent-aware transformers for socio-temporal multi-agent forecasting,

    Y. Yuan, X. Weng, Y. Ou, and K. M. Kitani, “AgentFormer: Agent-aware transformers for socio-temporal multi-agent forecasting,” in ICCV, 2021

  44. [44]

    Convolutional social pooling for vehicle trajectory prediction,

    N. Deo and M. M. Trivedi, “Convolutional social pooling for vehicle trajectory prediction,” in CVPR W, 2018

  45. [45]

    Human trajectory prediction via counterfactual analysis,

    G. Chen, J. Li, J. Lu, and J. Zhou, “Human trajectory prediction via counterfactual analysis,” in ICCV, 2021

  46. [46]

    Investigating causal relations by econometric models and cross-spectral methods,

    C. W. Granger, “Investigating causal relations by econometric models and cross-spectral methods,” Econometrica: Journal of the Econometric Society , pp. 424–438, 1969

  47. [47]

    PointNet: Deep learning on point sets for 3d classification and segmentation,

    C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3d classification and segmentation,” in CVPR, 2017

  48. [48]

    Learning continuous phrase representations and syntactic parsing with recursive neural networks,

    R. Socher, C. D. Manning, and A. Y. Ng, “Learning continuous phrase representations and syntactic parsing with recursive neural networks,” in NeurIPS, 2010

  49. [49]

    Neural message passing for quantum chemistry,

    J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in ICML, 2017

  50. [50]

    The concrete distribution: A continuous relaxation of discrete random variables,

    C. J. Maddison, A. Mnih, and Y. W. Teh, “The concrete distribution: A continuous relaxation of discrete random variables,” in ICLR, 2017

  51. [51]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, and Others, “Attention is all you need,” in NeurIPS, 2017

  52. [52]

    Ignorance is bliss: Robust control via information gating,

    M. Tomar, R. Islam, M. E. Taylor, S. Levine, and P. Bachman, “Ignorance is bliss: Robust control via information gating,” in NeurIPS, 2023

  53. [53]

    Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset,

    S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y. Chai, B. Sapp, C. R. Qi, Y. Zhou, Z. Yang, A. Chouard, P. Sun, J. Ngiam, V. Vasudevan, A. McCauley, J. Shlens, and D. Anguelov, “Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset,” in ICCV, 2021

  54. [54]

    INTERACTION dataset: An international, adversarial and cooperative motion dataset in interactive driving scenarios with semantic maps,

    W. Zhan, L. Sun, D. Wang, H. Shi, A. Clausse, M. Naumann, J. Kummerle, H. Konigshof, C. Stiller, A. de La Fortelle, et al. , “INTERACTION dataset: An international, adversarial and cooperative motion dataset in interactive driving scenarios with semantic maps,” arXiv:1910.03088, 2019

  55. [55]

    Decoupled weight decay regularization,

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in ICLR, 2019

  56. [56]

    MotionLM: Multi-agent motion forecasting as language modeling,

    A. Seff, B. Cera, D. Chen, M. Ng, A. Zhou, N. Nayakanti, K. S. Refaat, R. Al-Rfou, and B. Sapp, “MotionLM: Multi-agent motion forecasting as language modeling,” in ICCV, 2023

  57. [57]

    HDGT: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding,

    X. Jia, P. Wu, L. Chen, Y. Liu, H. Li, and J. Yan, “HDGT: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding,” PAMI, 2023

  58. [58]

    CILF: Causality inspired learning framework for out-of-distribution vehicle trajectory prediction,

    S. Li, Q. Xue, Y. Zhang, and X. Li, “CILF: Causality inspired learning framework for out-of-distribution vehicle trajectory prediction,” in Asian Conference on Pattern Recognition, 2023

  59. [59]

    Social LSTM: Human trajectory prediction in crowded spaces,

    A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social LSTM: Human trajectory prediction in crowded spaces,” in CVPR, 2016. 8