Recall to Predict: Grounding Motion Forecasting in Interpretable Motion Bank

Abhishek Vivekanandan; Ahmed Abouelazm; J. Marius Zöllner

arxiv: 2605.01393 · v1 · submitted 2026-05-02 · 💻 cs.CV

Pith reviewed 2026-05-09 15:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords motion forecasting · motion bank · anchor retrieval · interpretable prediction · multi-modal forecasting · contrastive learning · Gumbel-Softmax · autonomous driving

The pith

Conditioning motion forecasts on retrieved primitives from a contrastively learned motion bank removes opaque latent queries while matching competitive multi-modal accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an end-to-end differentiable architecture for motion forecasting that builds a motion bank of physically realizable trajectories via contrastive learning. Rather than starting from blank latent queries, a novel Anchor Retrieval Layer uses orthogonally initialized queries, dual-level gated cross-attention, and straight-through Gumbel-Softmax to select explicit anchors, which a DETR-style decoder then refines. Joint training with a winner-takes-all kinematic GMM, diversity penalty, and soft-min endpoint loss keeps multi-modal performance high. A sympathetic reader would care because the method promises traceable predictions for safety-critical uses such as vehicle trajectory planning on Argoverse 2 and Waymo without an accuracy penalty.

Core claim

By strictly conditioning the decoding phase on diverse, interpretable motion primitives from a contrastively learned motion bank, the architecture eliminates the black box of standard latent queries and achieves competitive multi-modal accuracy on the Argoverse 2 and Waymo Open Motion datasets through the use of an Anchor Retrieval Layer with Dual-Level Gated Cross-Attention and Straight-Through Gumbel-Softmax.

What carries the argument

The motion bank, a structured embedding space of physically realizable trajectories constructed via contrastive learning, together with the Anchor Retrieval Layer that adapts queries via dual-level gated cross-attention and executes discrete selection via straight-through Gumbel-Softmax to preserve gradient flow.
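
To make the discrete selection step concrete, the following is a minimal sketch of a Gumbel-Softmax draw over bank entries, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy scores, and the temperature value are all hypothetical, and plain Python has no autograd, so the straight-through gradient path is only described in a comment.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_select(logits, tau=1.0, rng=random):
    """Sample one bank entry; return (hard one-hot, soft relaxation)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    soft = softmax(noisy)
    winner = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    # Straight-through idea: the forward pass uses `hard`, while an autograd
    # framework would backpropagate through `soft`
    # (commonly written as hard + soft - stop_gradient(soft)).
    return hard, soft


# Hypothetical similarity scores between one query and four bank entries.
scores = [2.0, 0.5, -1.0, 0.1]
hard, soft = gumbel_softmax_select(scores, tau=0.5, rng=random.Random(0))
print("selected anchor index:", hard.index(1.0))
```

In PyTorch, `torch.nn.functional.gumbel_softmax` with `hard=True` provides this forward and backward behavior in one call; whether the released code uses it is not stated on this page.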

If this is right

Decoding becomes directly conditioned on explicit, semantically grounded anchors instead of latent variables prone to collapse.
Multi-modal diversity is preserved through the latent diversity penalty and soft-min weighted endpoint loss.
Joint optimization with a kinematic Gaussian Mixture Model enforces both accuracy and physical feasibility.
Performance stays competitive with prior methods on Argoverse 2 and Waymo Open Motion datasets.
Predictions become traceable to specific motion primitives, supporting inspection and debugging.
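
The two auxiliary terms named in the list above can be sketched generically. The paper's exact formulas are not quoted on this page, so the forms below are assumptions: a soft-min weighting that uses a softmax over negative endpoint distances, and a diversity penalty defined as mean pairwise cosine similarity between mode embeddings. All names and toy values are hypothetical.

```python
import math


def softmin_endpoint_loss(endpoints, target, temperature=1.0):
    """Endpoint error averaged with soft-min weights (closer modes weigh more)."""
    dists = [math.dist(p, target) for p in endpoints]
    best = min(dists)
    # Subtracting the minimum keeps the exponentials numerically stable.
    raw = [math.exp(-(d - best) / temperature) for d in dists]
    total = sum(raw)
    return sum((r / total) * d for r, d in zip(raw, dists))


def cosine(a, b):
    """Cosine similarity between two vectors given as tuples."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)


def diversity_penalty(embeddings):
    """Mean pairwise cosine similarity; identical modes score close to 1."""
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)


endpoints = [(10.0, 0.0), (0.0, 10.0), (9.0, 1.0)]  # toy predicted endpoints
target = (10.0, 0.0)                                # toy ground-truth endpoint
loss = softmin_endpoint_loss(endpoints, target, temperature=1.0)
collapsed = diversity_penalty([(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)])
spread = diversity_penalty([(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)])
print("soft-min endpoint loss:", round(loss, 3))
print("penalty (collapsed vs spread):", round(collapsed, 3), round(spread, 3))
```

Under this formulation a collapsed set of modes receives a higher penalty than a spread-out set, which is the behavior the diversity term is meant to encourage.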

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The discrete selection step may allow easier hybridization with rule-based planners that can inspect or override chosen primitives.
If the bank can be updated incrementally, the method might adapt to new environments without full retraining.
Testing coverage on rare edge-case motions would directly probe whether the finite bank assumption holds in practice.
Similar retrieval mechanisms could transfer to other sequence tasks where explicit priors aid interpretability.

Load-bearing premise

A contrastively learned motion bank of finite size will contain sufficiently diverse and physically realizable trajectories for all relevant scenarios, and the Anchor Retrieval Layer will reliably select appropriate anchors.

What would settle it

A held-out test set of motion patterns outside the span of the learned bank where accuracy drops below latent-query baselines or where the retrieval layer repeatedly selects implausible anchors.
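
One way to operationalize this test is a coverage probe: for each held-out trajectory, measure the distance to its nearest bank entry and flag those beyond a tolerance. The sketch below is an assumption about how such a probe could look, using average displacement error as the distance; the toy bank, the held-out paths, and the threshold are invented for illustration.

```python
import math


def ade(traj_a, traj_b):
    """Average displacement error between two equal-length 2-D trajectories."""
    return sum(math.dist(p, q) for p, q in zip(traj_a, traj_b)) / len(traj_a)


def coverage_report(bank, held_out, threshold):
    """Nearest-anchor ADE per held-out trajectory, plus indices above threshold."""
    nearest = [min(ade(traj, anchor) for anchor in bank) for traj in held_out]
    uncovered = [i for i, d in enumerate(nearest) if d > threshold]
    return nearest, uncovered


# Toy bank: one straight primitive and one gentle left turn (5 waypoints each).
bank = [
    [(float(t), 0.0) for t in range(5)],
    [(float(t), 0.25 * t * t) for t in range(5)],
]
# Toy held-out set: a near-straight path and a reversing path the bank lacks.
held_out = [
    [(float(t), 0.1) for t in range(5)],
    [(-float(t), 0.0) for t in range(5)],
]
nearest, uncovered = coverage_report(bank, held_out, threshold=1.0)
print("nearest-anchor ADE:", [round(d, 2) for d in nearest])
print("uncovered indices:", uncovered)
```

Trajectories flagged as uncovered are the natural subset on which to compare the retrieval model against a latent-query baseline.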

Figures

Figures reproduced from arXiv: 2605.01393 by Abhishek Vivekanandan, Ahmed Abouelazm, J. Marius Zöllner.

**Figure 1.** Overview of the Proposed Architecture. The system initially processes heterogeneous driving context using a PointNet-based projection layer. Because agent-centric pooling inherently strips absolute geometric grounding, global pose features are explicitly restored to the embeddings via residual additions. Guided by the projected context, an auxiliary retrieval module fetches N_q discrete Anchor Tokens from a…

**Figure 2.** Anchor Retrieval Layer. Orthogonally initialized latent…

**Figure 3.** Each row depicts a distinct driving scenario, illustrating three randomly selected latent queries alongside the top five scene…

**Figure 4.** Qualitative forecasting results on the Argoverse 2 (AV2) dataset. The…

**Figure 5.** Optional Physical Grouped Query Aggregation (PGQA)…

Original abstract

Motion forecasting often requires trading interpretability for predictive accuracy. Standard anchor-based architectures rely on opaque latent queries that are highly prone to latent collapse, or naive trajectory sampling that limits multi-modal diversity. We propose an end-to-end differentiable framework that grounds predictions in a comprehensive "motion bank", a structured embedding space of physically realizable trajectories constructed via contrastive learning. Rather than regressing paths from a blank slate, our architecture dynamically retrieves explicit motion priors using a novel Anchor Retrieval Layer. This module adapts orthogonally initialized queries via a Dual-Level Gated Cross-Attention mechanism and executes discrete trajectory selection using a Straight-Through Gumbel-Softmax estimator to preserve continuous gradient flow. The retrieved semantically grounded anchors are then geometrically refined by a DETR-style decoder, optimized jointly with a Winner-Takes-All (WTA) kinematic Gaussian Mixture Model (GMM), a latent diversity penalty, and a soft-min weighted endpoint loss. By strictly conditioning the decoding phase on diverse, interpretable motion primitives, our approach eliminates the "black box" of standard latent queries while achieving competitive multi-modal accuracy on the Argoverse 2 and Waymo Open Motion datasets. Code is available at: https://github.com/abviv/recall2predict
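
The Winner-Takes-All objective mentioned in the abstract can be illustrated with a simplified sketch. This is not the paper's kinematic GMM: it assumes an isotropic, position-only 2-D Gaussian with a fixed `sigma`, and adds a cross-entropy term on the mode probabilities. The function names and toy trajectories are hypothetical.

```python
import math


def mean_displacement(pred, target):
    """Average point-wise distance between a predicted and a target trajectory."""
    return sum(math.dist(p, q) for p, q in zip(pred, target)) / len(target)


def wta_gaussian_loss(modes, mode_probs, target, sigma=1.0):
    """Winner-takes-all: only the mode closest to the target is penalized."""
    errors = [mean_displacement(m, target) for m in modes]
    winner = min(range(len(modes)), key=errors.__getitem__)
    # Isotropic 2-D Gaussian negative log-likelihood, averaged over waypoints.
    nll = 0.0
    for (px, py), (qx, qy) in zip(modes[winner], target):
        sq = (px - qx) ** 2 + (py - qy) ** 2
        nll += sq / (2.0 * sigma ** 2) + math.log(2.0 * math.pi * sigma ** 2)
    nll /= len(target)
    # Cross-entropy term pushes probability mass toward the winning mode.
    ce = -math.log(mode_probs[winner] + 1e-12)
    return winner, nll + ce


target = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
modes = [
    [(1.0, 0.2), (2.0, 0.1), (3.0, 0.0)],  # close to the target
    [(0.5, 1.0), (1.0, 2.0), (1.0, 3.0)],  # veers away
]
mode_probs = [0.6, 0.4]
winner, loss = wta_gaussian_loss(modes, mode_probs, target)
print("winning mode:", winner, "loss:", round(loss, 3))
```

Because only the winning mode contributes, the losing modes are free to cover other outcomes, which is why this style of loss is usually paired with an explicit diversity term.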

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a retrieval-based architecture that grounds forecasts in an explicit contrastive motion bank, which is a concrete step toward interpretability but leaves coverage questions open.

The main takeaway is that this replaces standard latent queries with dynamic retrieval from a learned bank of trajectories, using a new Anchor Retrieval Layer built on dual-level gated cross-attention and straight-through Gumbel-Softmax selection, followed by DETR-style refinement and joint WTA-GMM training. That combination is new enough to stand out from prior anchor or query methods, and the public code link is a real plus for anyone who wants to test it directly on Argoverse 2 or Waymo data. The contrastive construction of the bank and the diversity penalty are sensible ways to push for both coverage and variety without collapsing to a single mode. If the full experiments back the competitive accuracy numbers and show that retrieved anchors are actually used in practice, the interpretability gain is tangible for safety-critical applications.

The soft spot is the finite bank itself. A contrastively learned set drawn from the training distribution can still miss low-frequency maneuvers, and nothing in the setup guarantees that the retrieval step will not default to poor anchors in those cases; the subsequent refinement cannot fully fix selection errors. The abstract is light on tables and error bars, so the elimination of latent collapse and the claimed robustness need the ablations to be convincing. The mild circularity from training and evaluating on overlapping data is minor but would benefit from explicit held-out checks.

This is worth a serious referee for groups working on motion prediction in autonomous driving who already use anchor or query decoders and want a more traceable alternative. The technical proposal is clear, the code is available, and the problem it targets is real, so it deserves review time even if the coverage and validation details need tightening.

Referee Report

3 major / 2 minor

Summary. The paper proposes an end-to-end differentiable motion forecasting framework that constructs an interpretable 'motion bank' of physically realizable trajectories via contrastive learning. Predictions are grounded by retrieving anchors through a novel Anchor Retrieval Layer (Dual-Level Gated Cross-Attention + Straight-Through Gumbel-Softmax), followed by DETR-style geometric refinement. Training uses a Winner-Takes-All kinematic GMM loss, latent diversity penalty, and soft-min endpoint loss. The central claim is that this eliminates the black-box latent queries of prior methods while delivering competitive multi-modal accuracy on Argoverse 2 and Waymo Open Motion datasets.

Significance. If validated, the approach would meaningfully advance interpretable motion prediction for autonomous driving by replacing opaque latent queries with explicit, retrievable motion primitives. This addresses a key tension between accuracy and explainability. The public code release is a clear strength for reproducibility. Significance depends on demonstrating that the finite motion bank provides adequate coverage without introducing new failure modes in retrieval or refinement.

major comments (3)

[Abstract] The claim of 'competitive multi-modal accuracy' and elimination of latent collapse is asserted without any quantitative results, tables, ablation studies, or error bars. This is load-bearing for the empirical contribution and must be supported by explicit benchmark numbers, comparisons to baselines, and statistical significance tests in the experiments section.
[Anchor Retrieval Layer and motion bank construction] The central interpretability claim rests on the assumption that a finite contrastively-learned motion bank supplies sufficiently diverse, physically realizable primitives for all relevant scenarios, including rare maneuvers. No coverage analysis, diversity metrics, or failure-case evaluation on low-frequency trajectories is described; without this, retrieval failures would reintroduce opacity and undermine the 'eliminates the black box' assertion.
[Optimization and Losses] The Straight-Through Gumbel-Softmax discretization for anchor selection can introduce selection bias or gradient artifacts absent from continuous latent queries. No ablation isolating this choice versus alternatives (e.g., soft attention or REINFORCE) is mentioned, leaving open whether the claimed gains are due to the motion bank itself or the discretization mechanism.

minor comments (2)

The description of the Dual-Level Gated Cross-Attention would benefit from an explicit equation or diagram distinguishing the two gating levels from standard cross-attention to improve clarity for readers.
Figure captions should explicitly label retrieved anchors versus refined trajectories and indicate whether examples are from training or held-out validation sets.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Thank you for the detailed and constructive referee report. We appreciate the recognition of our approach's potential to advance interpretable motion forecasting. We address each major comment point by point below, with clarifications from the manuscript and commitments to revisions where the feedback identifies areas for strengthening.

Referee: [Abstract] The claim of 'competitive multi-modal accuracy' and elimination of latent collapse is asserted without any quantitative results, tables, ablation studies, or error bars. This is load-bearing for the empirical contribution and must be supported by explicit benchmark numbers, comparisons to baselines, and statistical significance tests in the experiments section.

Authors: The manuscript's Experiments section (Section 4) presents quantitative benchmark results on Argoverse 2 and Waymo Open Motion, including comparisons to baselines, ablations on key components, and performance metrics with error bars. The abstract summarizes these findings at a high level. To address the concern directly, we will revise the abstract to explicitly include key competitive accuracy numbers, baseline comparisons, and a brief reference to the supporting evidence from the experiments, ensuring the claims are quantitatively grounded while remaining concise.
revision: yes

Referee: [Anchor Retrieval Layer and motion bank construction] The central interpretability claim rests on the assumption that a finite contrastively-learned motion bank supplies sufficiently diverse, physically realizable primitives for all relevant scenarios, including rare maneuvers. No coverage analysis, diversity metrics, or failure-case evaluation on low-frequency trajectories is described; without this, retrieval failures would reintroduce opacity and undermine the 'eliminates the black box' assertion.

Authors: Section 3.1 details the contrastive learning process used to construct the motion bank from physically realizable trajectories, with the Anchor Retrieval Layer designed to dynamically select from this bank. We agree that explicit validation of coverage would reinforce the interpretability claims. In the revision, we will add a dedicated analysis in the experiments, including diversity metrics (e.g., intra-bank trajectory variance), coverage statistics across the dataset distribution, and retrieval performance on low-frequency/rare maneuvers, to demonstrate adequate coverage and address potential failure modes.
revision: yes

Referee: [Optimization and Losses] The Straight-Through Gumbel-Softmax discretization for anchor selection can introduce selection bias or gradient artifacts absent from continuous latent queries. No ablation isolating this choice versus alternatives (e.g., soft attention or REINFORCE) is mentioned, leaving open whether the claimed gains are due to the motion bank itself or the discretization mechanism.

Authors: Section 3.2 explains the choice of Straight-Through Gumbel-Softmax to enable discrete, interpretable anchor selection while maintaining end-to-end differentiability. We agree that an isolating ablation would strengthen the analysis. We will add this to the experiments section, comparing Straight-Through Gumbel-Softmax against soft attention and REINFORCE variants in terms of accuracy, training dynamics, and stability, to clarify the contributions of the motion bank versus the discretization approach.
revision: yes
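
The rebuttal names intra-bank trajectory variance as a candidate diversity metric without defining it. One plausible reading, offered here as an assumption, is the positional variance across bank entries at each waypoint, averaged over the horizon. The function name and the toy banks below are hypothetical.

```python
from statistics import pvariance


def intra_bank_variance(bank):
    """Mean over waypoints of the x and y population variance across entries."""
    horizon = len(bank[0])
    total = 0.0
    for t in range(horizon):
        xs = [traj[t][0] for traj in bank]
        ys = [traj[t][1] for traj in bank]
        total += pvariance(xs) + pvariance(ys)
    return total / horizon


# Toy banks with three waypoints per trajectory.
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
diagonal = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
collapsed_bank = [straight, straight]
varied_bank = [straight, diagonal]
print("collapsed:", intra_bank_variance(collapsed_bank))
print("varied:", round(intra_bank_variance(varied_bank), 4))
```

A single scalar like this cannot show which maneuvers are missing, so it would complement rather than replace a per-scenario coverage analysis.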

Circularity Check

0 steps flagged

No significant circularity detected; derivation remains self-contained.

The architecture learns a finite motion bank via contrastive loss on training trajectories, retrieves anchors via gated cross-attention and Gumbel-Softmax, then refines them with a DETR decoder under WTA-GMM and auxiliary losses. All components are trained end-to-end and evaluated on held-out splits of Argoverse 2 and Waymo; the interpretability claim follows directly from conditioning on explicit retrieved primitives rather than opaque latents. No equation or claim reduces a downstream prediction to a fitted parameter or self-citation by construction, and no uniqueness theorem or ansatz is imported from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the existence of a finite, contrastively separable set of physically realizable trajectories that can be retrieved and refined without loss of coverage; no explicit free parameters are named in the abstract, but the Gumbel-Softmax temperature and contrastive loss margins are implicit hyperparameters.
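
The temperature is worth singling out because it controls how close the relaxed selection is to a one-hot choice. The short sketch below, with arbitrary toy logits and temperatures chosen for illustration, shows the entropy of a tempered softmax shrinking as the temperature drops.

```python
import math


def softmax_with_temperature(logits, tau):
    """Softmax of logits divided by temperature tau (tau > 0)."""
    scaled = [value / tau for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


logits = [2.0, 1.0, 0.0]  # toy query-to-anchor scores
entropies = {}
for tau in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, tau)
    entropies[tau] = entropy(probs)
    print(tau, [round(p, 3) for p in probs], round(entropies[tau], 3))
```

A low temperature gives near-discrete selections with higher-variance gradients, while a high temperature gives smooth gradients but blurs the choice of anchor, so the value or schedule used is a real tuning decision.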

axioms (1)

Domain assumption: A finite motion bank constructed via contrastive learning contains all relevant physically realizable trajectories for the target domain.
Invoked when stating that retrieved anchors are 'physically realizable' and sufficient to eliminate latent collapse.

invented entities (1)

Motion bank (no independent evidence)
Purpose: Structured embedding space of explicit trajectory priors used for retrieval instead of latent queries.
New data structure introduced to ground predictions; no independent falsifiable evidence provided beyond the claim of contrastive construction.
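
For readers unfamiliar with how a contrastive objective shapes such an embedding space, here is a generic InfoNCE-style loss in plain Python. The paper's actual objective is not quoted on this page, so this form, the temperature, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as tuples."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-probability of the positive among positive and negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[0]


anchor = (1.0, 0.0)
# Positive embedding close to the anchor: low loss.
good = info_nce(anchor, (0.9, 0.1), [(0.0, 1.0), (-1.0, 0.0)])
# Positive embedding far from the anchor: high loss.
bad = info_nce(anchor, (0.0, 1.0), [(0.9, 0.1), (-1.0, 0.0)])
print("aligned positive:", round(good, 4), "misaligned positive:", round(bad, 4))
```

Minimizing a loss of this kind pulls embeddings of similar trajectories together and pushes dissimilar ones apart, which is what makes nearest-neighbor retrieval from the bank meaningful.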

pith-pipeline@v0.9.0 · 5529 in / 1445 out tokens · 25840 ms · 2026-05-09T15:08:52.682269+00:00 · methodology

Supplemental 8.1. Physical Grouped-Query Aggregation (PGQA). To handle cases where N_q > K and to maintain a constant computational overhead for the retrieved anchors that are used as queries in the decoder, we use Grouped Query Aggregation. This module mainly functions to reduce the unnecessary computational overhead when the number of gener...

[1] [1]

arXiv preprint arXiv:2506.08228 , year=

Mustafa Baniodeh, Kratarth Goel, Scott Ettinger, Carlos Fuertes, Ari Seff, Tim Shen, Cole Gulino, Chenjie Yang, Ghassen Jerfel, Dokook Choe, Rui Wang, Benjamin Char- row, Vinutha Kallem, Sergio Casas, Rami Al-Rfou, Ben- jamin Sapp, and Dragomir Anguelov. Scaling Laws of Mo- tion Forecasting and Planning – Technical Report, 2025. arXiv:2506.08228 [cs]. 6

work page arXiv 2025

[2] [2]

PRANK: motion Prediction based on RANKing

Yuriy Biktairov, Maxim Stebelev, Irina Rudenko, Oleh Shli- azhko, and Boris Yangel. PRANK: motion Prediction based on RANKing. InAdvances in Neural Information Processing Systems, pages 2553–2563. Curran Associates, Inc., 2020. 1

work page 2020

[3] [3]

End- to-End Object Detection with Transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End- to-End Object Detection with Transformers. InEuropean Conference on Computer Vision (ECCV), pages 213–229. Springer, 2020. 1, 5

work page 2020

[4] [4]

Forecast-MAE: Self-supervised pre-training for motion forecasting with masked autoencoders.Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision, 2023

Jie Cheng, Xiaodong Mei, and Ming Liu. Forecast-MAE: Self-supervised pre-training for motion forecasting with masked autoencoders.Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision, 2023. 7

work page 2023

[5] [5]

Gorela: Go relative for viewpoint-invariant motion forecasting, 2022

Alexander Cui, Sergio Casas, Kelvin Wong, Simon Suo, and Raquel Urtasun. Gorela: Go relative for viewpoint-invariant motion forecasting, 2022. 7

work page 2022

[6] [6]

An image is worth 16x16 words: Transformers for image recognition at scale, 2021

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021. 5

work page 2021

[7] [7]

Qi, Yin Zhou, Zoey Yang, Aur’elien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander Mc- Cauley, Jonathon Shlens, and Dragomir Anguelov

Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R. Qi, Yin Zhou, Zoey Yang, Aur’elien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander Mc- Cauley, Jonathon Shlens, and Dragomir Anguelov. Large scale interactive motion forecasting for autonomous driv- ing: The waymo open motion ...

work page 2021

[8] [8]

https://arxiv.org/abs/2005.04259

Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, and Cordelia Schmid. VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation, 2020. arXiv:2005.04259. 3

work page arXiv 2020

[9] [9]

Densetnt: End-to-end trajectory prediction from dense goal sets

Junru Gu, Chen Sun, and Hang Zhao. Densetnt: End-to-end trajectory prediction from dense goal sets. InProceedings of the IEEE/CVF international conference on computer vision, pages 15303–15312, 2021. 2

work page 2021

[10] [10]

Categorical repa- rameterization with gumbel-softmax, 2017

Eric Jang, Shixiang Gu, and Ben Poole. Categorical repa- rameterization with gumbel-softmax, 2017. 2

work page 2017

[11]

Jülich Supercomputing Centre. JUWELS Cluster and Booster: Exascale Pathfinder with Modular Supercomputing Architecture at Juelich Supercomputing Centre. Journal of large-scale research facilities, 7(A183), 2021. 9

[12]

Longzhong Lin, Xuewu Lin, Tianwei Lin, Lichao Huang, Rong Xiong, and Yue Wang. EDA: Evolving and Distinct Anchors for Multimodal Motion Prediction, 2023. arXiv:2312.09501. 2, 7

[13]

Nigamaa Nayakanti, Rami Al-Rfou, Aurick Zhou, Kratarth Goel, Khaled S. Refaat, and Benjamin Sapp. Wayformer: Motion Forecasting via Simple & Efficient Attention Networks, 2022. arXiv:2207.05844. 1

[14]

Tung Phan-Minh, Elena Corina Grigore, Freddy A. Boulton, Oscar Beijbom, and Eric M. Wolff. CoverNet: Multimodal Behavior Prediction using Trajectory Sets, 2020. arXiv:1911.10298. 1

[15]

Alexander Prutsch, Horst Bischof, and Horst Possegger. Efficient Motion Prediction: A Lightweight & Accurate Trajectory Prediction Model With Fast Training and Inference Speed, 2024. arXiv:2409.16154 [cs]. 7

[16]

Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 3, 7

[17]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. 4

[18]

Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion Transformer with Global Intention Localization and Local Movement Refinement, 2023. arXiv:2209.13508. 1, 7

[19]

Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. MTR++: Multi-Agent Motion Prediction with Symmetric Scene Modeling and Guided Intention Querying, 2024. arXiv:2306.17770. 2, 6, 7

[20]

Leslie N. Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates, 2018. 6

[21]

Haoran Song, Di Luan, Wenchao Ding, Michael Y Wang, and Qifeng Chen. Learning to predict vehicle trajectories with model-based planning. In Conference on Robot Learning, pages 1035–1045. PMLR, 2022. 2

[22]

Jiawei Sun, Chengran Yuan, Shuo Sun, Shanze Wang, Yuhang Han, Shuailei Ma, Zefan Huang, Anthony Wong, Keng Peng Tee, and Marcelo H. Ang Jr. ControlMTR: Control-Guided Motion Transformer with Scene-Compliant Intention Points for Feasible Motion Prediction, 2024. arXiv:2404.10295. 2, 7

[23]

Balakrishnan Varadarajan, Ahmed Hefny, Avikalp Srivastava, Khaled S. Refaat, Nigamaa Nayakanti, Andre Cornman, Kan Chen, Bertrand Douillard, Chi Pang Lam, Dragomir Anguelov, and Benjamin Sapp. MultiPath++: Efficient Information Fusion and Trajectory Aggregation for Behavior Prediction, 2021. arXiv:2111.14973. 1

[24]

Abhishek Vivekanandan and J. Marius Zöllner. Efficient Data Representation for Motion Forecasting: A Scene-Specific Trajectory Set Approach, 2024. arXiv:2407.20732. 2

[25]

Abhishek Vivekanandan, Ahmed Abouelazm, Philip Schörner, and J. Marius Zöllner. KI-PMF: Knowledge Integrated Plausible Motion Forecasting. In 2024 IEEE Intelligent Vehicles Symposium (IV), pages 176–183, 2024. 1, 2

[26]

Abhishek Vivekanandan, Christian Hubschneider, and J. Marius Zöllner. Contrast & Compress: Learning Lightweight Embeddings for Short Trajectories, 2025. arXiv:2506.02571. 1, 2, 7

[27]

Xishun Wang, Tong Su, Fang Da, and Xiaodong Yang. ProphNet: Efficient Agent-Centric Motion Forecasting with Anchor-Informed Proposals. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21995–22003. IEEE, 2023. 1, 2

[28]

Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting, 2023. 6

[29]

Bozhou Zhang, Nan Song, and Li Zhang. Decoupling motion forecasting into directional intentions and dynamic states. Advances in Neural Information Processing Systems, 37: 106582–106606, 2024. 2

[30]

Bozhou Zhang, Nan Song, and Li Zhang. Demo: Decoupling motion forecasting into directional intentions and dynamic states. In NeurIPS, 2024. 7

[31]

Lu Zhang, Peiliang Li, Sikang Liu, and Shaojie Shen. Simpl: A simple and efficient multi-agent motion prediction baseline for autonomous driving, 2024. 7

[32]

Zhejun Zhang, Alexander Liniger, Christos Sakaridis, Fisher Yu, and Luc Van Gool. Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding, 2023. arXiv:2310.12970 [cs]. 3, 6, 7

[33]

Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Ben Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, Congcong Li, and Dragomir Anguelov. TNT: Target-driven Trajectory Prediction. In Proceedings of the 2020 Conference on Robot Learning, pages 895–904. PMLR, 2021. 1, 2

[34]

Yang Zhou, Hao Shao, Letian Wang, Steven L. Waslander, Hongsheng Li, and Yu Liu. Smartrefine: A scenario-adaptive refinement framework for efficient motion prediction, 2024. 7

Supplemental

8.1. Physical Grouped-Query Aggregation (PGQA)

To handle cases where N_q > K and to maintain a constant computational overhead for the retrieved anchors that serve as queries in the decoder, we use Grouped Query Aggregation. This module mainly functions to reduce the unnecessary computational overhead when the number of gener...