arxiv: 2604.21479 · v3 · submitted 2026-04-23 · 💻 cs.CV

Recognition: unknown

Frozen LLMs as Map-Aware Spatio-Temporal Reasoners for Vehicle Trajectory Prediction

Yanjiao Liu , Jiawei Liu , Xun Gong , Zifei Nie

Authors on Pith no claims yet

Pith reviewed 2026-05-09 21:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords trajectory predictionfrozen LLMsHD mapsautonomous drivingspatio-temporal reasoningreprogramming adaptervehicle trajectoriesmulti-modal features

0 comments

The pith

Frozen LLMs can act as map-aware reasoners to predict vehicle trajectories after a simple feature adapter converts scene and road data into tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that keeps large language models completely frozen and assigns them the task of forecasting future paths of vehicles in traffic. A traffic encoder pulls features from observed agent movements while a small CNN processes nearby HD map data; both are then turned into token sequences by a reprogramming adapter so the LLM can reason over them. A basic linear layer at the end produces the trajectory outputs. This design isolates how much map information improves accuracy and lets the same pipeline run on many different LLMs with almost no extra training.

Core claim

By encoding past trajectories with a traffic encoder and local HD maps with a CNN, then routing the combined features through a reprogramming adapter into a frozen LLM, the model generates future vehicle trajectories via a linear decoder. The authors state that this arrangement lets the LLM perform spatio-temporal reasoning over dynamic agents and road topology, supports quantitative measurement of each input modality's contribution especially map semantics, and works across varied LLM backbones with minimal adaptation.

What carries the argument

The reprogramming adapter that converts multi-modal scene features from trajectories and HD maps into token sequences the frozen LLM can process for trajectory generation.

If this is right

Map semantics can be isolated and measured for their direct effect on prediction accuracy.
The same pipeline can evaluate many different LLM architectures with only the adapter and decoder adjusted.
Prediction performance depends mainly on the LLM's internal reasoning once features are tokenized.
A single platform now exists for comparing how various input modalities affect trajectory forecasts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the adapter truly elicits reasoning, the approach could be extended to include other static inputs such as traffic light states without redesigning the LLM.
The frozen setup suggests that scaling LLM size alone might raise accuracy on spatial prediction tasks without additional fine-tuning.
This token-conversion pattern might transfer to other layout-based forecasting problems like pedestrian paths in city environments.

Load-bearing premise

That the tokens created by the adapter cause the frozen LLM to reason about traffic agents and road layout rather than simply relaying transformed features to the final decoder.

What would settle it

Running the same scenes with the map encoder removed and finding no increase in prediction error, or seeing the LLM outputs stay unchanged when map inputs are altered, would show the model is not using map semantics for reasoning.

Figures

Figures reproduced from arXiv: 2604.21479 by Jiawei Liu, Xun Gong, Yanjiao Liu, Zifei Nie.

**Figure 1.** Figure 1: Overview of the proposed multi-modal evaluation framework. The model integrates ego vehicle trajectories, [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Turning Scenarios. As shown in [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Ablation Comparison of ADE and FDE at different time horizons. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of LLaMA2 and LLaMA3 with/without utilizing Map for ADE and FDE across Different Time Horizons. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Visualization of map-aware trajectory predictions. The first row represents straight scenarios, [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Generalizability evaluation across six LLM backbones. [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

read the original abstract

Large language models (LLMs) have recently demonstrated strong reasoning capabilities and attracted increasing research attention in the field of autonomous driving (AD). However, safe application of LLMs on AD perception and prediction still requires a thorough understanding of both the dynamic traffic agents and the static road infrastructure. To this end, this study introduces a framework to evaluate the capability of LLMs in understanding the behaviors of dynamic traffic agents and the topology of road networks. The framework leverages frozen LLMs as the reasoning engine, employing a traffic encoder to extract spatial-level scene features from observed trajectories of agents, while a lightweight Convolutional Neural Network (CNN) encodes the local high-definition (HD) maps. To assess the intrinsic reasoning ability of LLMs, the extracted scene features are then transformed into LLM-compatible tokens via a reprogramming adapter. By residing the prediction burden with the LLMs, a simpler linear decoder is applied to output future trajectories. The framework enables a quantitative analysis of the influence of multi-modal information, especially the impact of map semantics on trajectory prediction accuracy, and allows seamless integration of frozen LLMs with minimal adaptation, thereby demonstrating strong generalizability across diverse LLM architectures and providing a unified platform for model evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper offers a modular adapter-based way to plug frozen LLMs into map-aware trajectory prediction, but the abstract supplies no numbers or controls, so the claim that the LLM itself is reasoning remains untested.

read the letter

The core contribution is a clean pipeline: a traffic encoder pulls spatial features from agent trajectories, a small CNN handles local HD maps, a learned reprogramming adapter turns both into LLM tokens, and a linear head produces future paths. Keeping the LLM frozen and swapping only the adapter and encoders is a practical design choice that lets the same setup run on different base models with little extra work. That part is useful for anyone who wants to measure how much map semantics actually move the needle on prediction error in driving scenes.

Referee Report

2 major / 1 minor

Summary. The paper introduces a framework for vehicle trajectory prediction in autonomous driving that positions frozen LLMs as map-aware spatio-temporal reasoners. A traffic encoder extracts spatial features from observed agent trajectories, a lightweight CNN encodes local HD maps, a reprogramming adapter converts the combined features into LLM-compatible tokens, and a linear decoder produces future trajectories. The approach is presented as enabling quantitative analysis of multi-modal information (especially map semantics) on prediction accuracy, seamless integration of diverse frozen LLMs with minimal adaptation, and a unified evaluation platform.

Significance. If the empirical results and controls hold, the framework could offer a practical route to leverage pre-trained LLMs' reasoning capabilities for trajectory prediction without full fine-tuning, while providing a standardized way to measure the contribution of map semantics across models. This would be valuable for understanding how static road topology interacts with dynamic agent behavior in AD systems.

major comments (2)

[framework description / abstract] The claim that frozen LLMs perform genuine map-aware spatio-temporal reasoning (abstract and framework description) is load-bearing but unsupported by controls that isolate the LLM's contribution. No adapter-only ablation, random-weight LLM baseline, or frozen-vs-unfrozen comparison is described, leaving open the possibility that the reprogramming adapter and encoders perform the core mapping while the LLM acts as a passive token processor.
[abstract / evaluation section] No quantitative results, ablation tables, or specific metrics (e.g., ADE/FDE on standard datasets like nuScenes or Argoverse) are provided to substantiate the stated accuracy gains, map-semantics influence, or cross-LLM generalizability. Without these, the central empirical claims cannot be evaluated.

minor comments (1)

[abstract] The abstract and framework overview would benefit from explicit notation for the reprogramming adapter (e.g., its input/output dimensions and loss function) to clarify how scene features become LLM tokens.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will implement to strengthen the empirical support for our claims.

read point-by-point responses

Referee: The claim that frozen LLMs perform genuine map-aware spatio-temporal reasoning (abstract and framework description) is load-bearing but unsupported by controls that isolate the LLM's contribution. No adapter-only ablation, random-weight LLM baseline, or frozen-vs-unfrozen comparison is described, leaving open the possibility that the reprogramming adapter and encoders perform the core mapping while the LLM acts as a passive token processor.

Authors: We agree that additional controls are needed to rigorously isolate the LLM's reasoning contribution. The framework positions the frozen LLM as the central spatio-temporal reasoner after the reprogramming adapter converts encoded features into tokens, but we acknowledge the current description does not include explicit ablations. In the revised manuscript, we will add an adapter-only ablation and a random-weight LLM baseline to quantify the LLM's role. A frozen-versus-unfrozen comparison falls outside the core premise of minimal adaptation with pre-trained models and may not be included, but we will clarify the design rationale. revision: partial
Referee: No quantitative results, ablation tables, or specific metrics (e.g., ADE/FDE on standard datasets like nuScenes or Argoverse) are provided to substantiate the stated accuracy gains, map-semantics influence, or cross-LLM generalizability. Without these, the central empirical claims cannot be evaluated.

Authors: We acknowledge that the current manuscript version focuses on framework introduction and does not include the requested quantitative results or tables. This limits evaluation of the claims. In the revised manuscript, we will add comprehensive experimental results with ADE/FDE metrics on nuScenes and Argoverse, ablation studies on map semantics impact, and cross-LLM evaluations to demonstrate generalizability and accuracy gains. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework without derivation chain

full rationale

The paper describes an empirical architecture that combines a traffic encoder, CNN map encoder, reprogramming adapter, frozen LLM, and linear decoder for trajectory prediction. No equations, first-principles derivations, or closed-form predictions are presented that reduce outputs to inputs by construction. Claims rest on experimental evaluation of multi-modal inputs and cross-LLM generalizability rather than any self-referential mathematical steps. The approach is self-contained against external benchmarks with no load-bearing self-citations or fitted inputs renamed as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no details on free parameters, axioms, or invented entities; the framework implicitly assumes LLMs possess transferable spatio-temporal reasoning that can be unlocked via token reprogramming.

pith-pipeline@v0.9.0 · 5517 in / 1157 out tokens · 22184 ms · 2026-05-09T21:38:44.686875+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 6 canonical work pages · 2 internal anchors

[1]

A comprehensive review of autonomous driving algorithms: Tackling adverse weather conditions, unpredictable traffic vi- olations, blind spot monitoring, and emergency maneuvers,

C. Xu and R. Sankar, “A comprehensive review of autonomous driving algorithms: Tackling adverse weather conditions, unpredictable traffic vi- olations, blind spot monitoring, and emergency maneuvers,”Algorithms, vol. 17, no. 11, p. 526, 2024

2024
[2]

Improving intelligent per- ception and decision optimization of pedestrian crossing scenarios in autonomous driving environments through large visual language mod- els,

X. Teng, L. Huang, Z. Shen, and W. Li, “Improving intelligent per- ception and decision optimization of pedestrian crossing scenarios in autonomous driving environments through large visual language mod- els,”Scientific Reports, vol. 15, no. 1, p. 31283, 2025

2025
[3]

Summary and reflections on pedestrian trajectory prediction in the field of autonomous driving,

Z. Fu, K. Jiang, C. Xie, Y . Xu, J. Huang, and D. Yang, “Summary and reflections on pedestrian trajectory prediction in the field of autonomous driving,”IEEE Transactions on Intelligent Vehicles, 2024

2024
[4]

A review of decision-making and planning for autonomous vehicles in intersection environments,

S. Chen, X. Hu, J. Zhao, R. Wang, and M. Qiao, “A review of decision-making and planning for autonomous vehicles in intersection environments,”World Electric Vehicle Journal, vol. 15, no. 3, p. 99, 2024

2024
[5]

Trajectory Prediction for Autonomous Driving: Progress, Limitations, and Future Directions

N. A. Madjid, A. Ahmad, M. Mebrahtu, Y . Babaa, A. Nasser, S. Malik, B. Hassan, N. Werghi, J. Dias, and M. Khonji, “Trajectory prediction for autonomous driving: Progress, limitations, and future directions,”arXiv preprint arXiv:2503.03262, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Large language models: a survey of their development, capabilities, and applications,

Y . Annepaka and P. Pakray, “Large language models: a survey of their development, capabilities, and applications,”Knowledge and Informa- tion Systems, vol. 67, no. 3, pp. 2967–3022, 2025

2025
[7]

Semi- supervised feature selection with minimal redundancy based on group optimization strategy for multi-label data,

D. Qing, Y . Zheng, W. Zhang, W. Ren, X. Zeng, and G. Li, “Semi- supervised feature selection with minimal redundancy based on group optimization strategy for multi-label data,”Knowledge and Information Systems, vol. 67, no. 2, pp. 1271–1308, 2025

2025
[8]

Knowledge and task- driven multimodal adaptive transfer through llms with limited data,

X. Zhang, Z. Chen, H. Ren, and Y . Tian, “Knowledge and task- driven multimodal adaptive transfer through llms with limited data,” in 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2024, pp. 5343–5348

2024
[9]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

2017
[10]

Karl Duncker

L. Berti, F. Giorgi, and G. Kasneci, “Emergent abilities in large language models: A survey,”arXiv preprint arXiv:2503.05788, 2025

work page arXiv 2025
[11]

Logical reasoning in large language models: A survey, 2025 b

H. Liu, Z. Fu, M. Ding, R. Ning, C. Zhang, X. Liu, and Y . Zhang, “Logical reasoning in large language models: A survey,”arXiv preprint arXiv:2502.09100, 2025

work page arXiv 2025
[12]

Advancing reasoning in large language models: Promising methods and approaches,

A. Patil and A. Jadon, “Advancing reasoning in large language models: Promising methods and approaches,”arXiv preprint arXiv:2502.03671, 2025

work page arXiv 2025
[13]

Empowering autonomous driving with large language models: A safety perspective,

Y . Wang, R. Jiao, S. S. Zhan, C. Lang, C. Huang, Z. Wang, Z. Yang, and Q. Zhu, “Empowering autonomous driving with large language models: A safety perspective,”arXiv preprint arXiv:2312.00812, 2023

work page arXiv 2023
[14]

Time-LLM: Time Series Forecasting by Reprogramming Large Language Models

M. Jin, S. Wang, L. Ma, Z. Chu, J. Y . Zhang, X. Shi, P.-Y . Chen, Y . Liang, Y .-F. Li, S. Pan, and Q. Wen, “Time-llm: Time series forecasting by reprogramming large language models,” inInternational Conference on Learning Representations (ICLR), 2024, accepted at ICLR 2024. [Online]. Available: https://arxiv.org/abs/2310.01728

work page internal anchor Pith review arXiv 2024
[15]

Spatial- temporal large language model for traffic prediction,

C. Liu, S. Yang, Q. Xu, Z. Li, C. Long, Z. Li, and R. Zhao, “Spatial- temporal large language model for traffic prediction,” in2024 25th IEEE International Conference on Mobile Data Management (MDM). IEEE, 2024, pp. 31–40

2024
[16]

Harnessing and evaluating the intrinsic extrapolation ability of large language models for vehicle trajectory prediction,

J. Liu, Y . Liu, X. Gong, T. Wang, H. Chen, and Y . Hu, “Harnessing and evaluating the intrinsic extrapolation ability of large language models for vehicle trajectory prediction,” inProceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025, pp. 4379–4391

2025
[17]

Lscenellm: Enhancing large 3d scene understanding using adaptive visual preferences,

H. Zhi, P. Chen, J. Li, S. Ma, X. Sun, T. Xiang, Y . Lei, M. Tan, and C. Gan, “Lscenellm: Enhancing large 3d scene understanding using adaptive visual preferences,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3761–3771

2025
[18]

A survey on evaluation of large language models,

Y . Chang, X. Wang, J. Wang, Y . Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y . Wanget al., “A survey on evaluation of large language models,”ACM transactions on intelligent systems and technology, vol. 15, no. 3, pp. 1–45, 2024

2024
[19]

Machine learning for autonomous vehicle’s trajectory prediction: A comprehensive survey, challenges, and future research directions,

V . Bharilya and N. Kumar, “Machine learning for autonomous vehicle’s trajectory prediction: A comprehensive survey, challenges, and future research directions,”Vehicular Communications, vol. 46, p. 100733, 2024

2024
[20]

A survey of autonomous driving trajectory prediction: Methodologies, challenges, and future prospects

M. Xu, Z. Liu, B. Wang, and S. Li, “A survey of autonomous driving trajectory prediction: Methodologies, challenges, and future prospects.” Machines, vol. 13, no. 9, 2025

2025
[21]

A vehicle trajectory prediction model that integrates spatial interaction and multiscale temporal fea- tures,

Y . Gao, K. Yang, Y . Yue, and Y . Wu, “A vehicle trajectory prediction model that integrates spatial interaction and multiscale temporal fea- tures,”Scientific Reports, vol. 15, no. 1, p. 8217, 2025

2025
[22]

nuscenes: A multimodal dataset for autonomous driving,

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631

2020