NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving

Jiahui Li; Jiamin Shi; Jiawei Sun; Kaidi Yang; Liying Liu; Ming Liu; Ruiteng Zhao; Zhiyang Liu; Zixiang Ren; Zuoguan Wang

arxiv: 2605.31116 · v1 · pith:IJ7ES7ZLnew · submitted 2026-05-29 · 💻 cs.CV · cs.RO

NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving

Jiahui Li , Jiawei Sun , Zixiang Ren , Ming Liu , Jiamin Shi , Ruiteng Zhao , Zhiyang Liu , Liying Liu

show 2 more authors

Zuoguan Wang Kaidi Yang

This is my paper

Pith reviewed 2026-06-28 22:55 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords end-to-end autonomous drivingscene tokenstoken reconstructionself-distillationvisual bottleneckmasked reconstructionperception-free planning

0 comments

The pith

A masked self-distillation reconstruction objective on scene tokens produces richer visual representations that improve end-to-end driving performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

End-to-end driving models compress image patches into a small set of scene tokens that feed directly into the planner. These tokens currently receive supervision only from the final planning loss, leaving their visual content under-constrained. The paper adds a reconstruction task that requires the scene tokens to rebuild masked patch features, routing all gradients through the bottleneck alone. Semantic cues from foundation models guide the reconstruction targets toward relevant structures. The result is less redundant tokens and higher scores on standard driving benchmarks, all without changing the model at test time.

Core claim

NTR constrains the scene-token bottleneck in perception-free E2E driving by a self-distillation masked latent reconstruction objective that reconstructs masked patch-level latent features using only the compact scene tokens as reconstruction memory. This forces reconstruction gradients to pass exclusively through the scene-token bottleneck, encouraging scene tokens to preserve richer and less redundant visual representations for planning. Weak semantic priors derived from foundation-model annotations bias reconstruction targets toward driving-related structures without introducing explicit perception heads. All auxiliary reconstruction components are removed at inference time.

What carries the argument

Neural Token Reconstruction (NTR) via self-distillation masked latent reconstruction objective that reconstructs masked patch latents using only scene tokens

If this is right

Scene tokens exhibit lower pairwise redundancy and higher effective rank.
Planning performance reaches 8.0461 RFS on Waymo E2E and 94.1 PDMS / 90.9 EPDMS on NavSim1&2.
All auxiliary reconstruction components are removed at inference, leaving the deployed planner unchanged.
Effective bottleneck supervision improves both compact visual representation learning and planning performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reconstruction pattern could be applied to other compressed token bottlenecks where downstream supervision is indirect.
Token effective rank and redundancy metrics may serve as lightweight diagnostics for representation quality in other vision-planning systems.
If foundation-model priors prove hard to obtain, simpler unsupervised reconstruction targets might still deliver part of the bottleneck improvement.

Load-bearing premise

Forcing reconstruction gradients to pass exclusively through the scene-token bottleneck will cause the tokens to preserve richer and less redundant visual representations that directly improve downstream planning.

What would settle it

An ablation removing only the masked reconstruction objective while keeping all other training elements fixed, followed by no change in planning scores or token redundancy metrics on the same benchmarks, would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.31116 by Jiahui Li, Jiamin Shi, Jiawei Sun, Kaidi Yang, Liying Liu, Ming Liu, Ruiteng Zhao, Zhiyang Liu, Zixiang Ren, Zuoguan Wang.

**Figure 2.** Figure 2: Overview of Neural Token Reconstruction (NTR). [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative examples on different driving scenarios. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

Recent perception-free end-to-end (E2E) autonomous driving methods bypass explicit perception outputs by compressing dense image patch tokens into compact scene tokens for downstream trajectory generation and scoring. While these scene tokens form a compact visual bottleneck for the planner, they receive supervision solely from the planning objective, providing limited constraints on the encoded visual information. To address this limitation, we introduce Neural Token Reconstruction (NTR), a representation learning framework to directly constrain the compact scene-token bottleneck in perception-free driving. NTR introduces a self-distillation masked latent reconstruction objective that reconstructs masked patch-level latent features using only compact scene tokens as reconstruction memory. This forces reconstruction gradients to pass exclusively through the scene-token bottleneck, encouraging scene tokens to preserve richer and less redundant visual representations for planning. We further introduce semantic priors derived from foundation-model annotations as a weak semantic interface biasing reconstruction targets toward driving-related structures without introducing explicit perception heads. All auxiliary reconstruction components are removed at inference time, leaving the deployed planner unchanged. NTR achieves state-of-the-art performance on three public autonomous driving benchmarks, including 8.0461 RFS on Waymo E2E and 94.1 PDMS / 90.9 EPDMS on NavSim1&2. The learned scene tokens exhibit lower pairwise redundancy and higher effective rank, indicating that effective bottleneck supervision improves both compact visual representation learning and planning performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NTR adds a masked self-distillation loss that routes gradients only through the scene-token bottleneck during training, then drops the whole thing at inference; the abstract reports SOTA numbers on Waymo and NavSim plus cleaner token statistics.

read the letter

The main thing here is a training-only auxiliary objective that tries to make the compact scene tokens in perception-free E2E driving carry more visual information without adding anything to the deployed model. They mask patch latents and reconstruct them using only the scene tokens as memory, plus some weak semantic targets pulled from foundation models. All of that is stripped out at test time, so the planner stays unchanged.

What stands out is the clean routing of gradients: the reconstruction loss is forced to go exclusively through the bottleneck, which directly targets the problem that scene tokens otherwise only see planning supervision. The token-level diagnostics (lower pairwise redundancy, higher effective rank) are a reasonable way to check whether the objective did what it was supposed to. The reported numbers—8.0461 RFS on Waymo E2E and 94.1/90.9 on the NavSim variants—are specific enough to be worth checking against the baselines.

The soft spot is that everything rests on the abstract. We have no ablations showing that the gains actually trace to the reconstruction term rather than other training tweaks, no error bars, and no details on how the foundation-model priors were applied or whether they introduce their own biases. Without those controls it is hard to know how much of the improvement is causal versus incidental.

This is for people working on compact token bottlenecks in driving stacks who want a lightweight way to add representation supervision. It is worth a serious referee because the mechanism is internally consistent and the empirical claims are concrete, even if the current write-up leaves the usual verification steps for the full review process.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Neural Token Reconstruction (NTR), a self-distillation masked latent reconstruction objective that supervises the compact scene-token bottleneck in perception-free end-to-end autonomous driving. Scene tokens reconstruct masked patch-level latent features (using only the bottleneck as memory), with semantic priors from foundation-model annotations serving as weak targets; all auxiliary components are discarded at inference. The paper reports state-of-the-art results on three benchmarks (8.0461 RFS on Waymo E2E; 94.1 PDMS / 90.9 EPDMS on NavSim1&2) together with improved token statistics (lower pairwise redundancy, higher effective rank).

Significance. If the reported gains and token metrics hold under scrutiny, the work shows that an auxiliary reconstruction objective can enrich the visual information retained by a planning-only bottleneck without altering the deployed model. The mechanism is internally consistent: gradients are routed exclusively through the scene tokens, semantic priors act only as soft targets, and direct measurements of redundancy and rank provide supporting evidence beyond planning metrics alone. This addresses a clear limitation of prior perception-free E2E methods and could influence future bottleneck designs.

major comments (2)

[§4] §4 (Experiments) and associated tables: the SOTA claims rest on single-run point estimates (e.g., 8.0461 RFS, 94.1 PDMS) without reported standard deviations, multiple random seeds, or statistical tests; this weakens the ability to attribute gains specifically to the NTR objective versus training variance.
[§3.2, §4.3] §3.2 and §4.3: while token redundancy and effective-rank metrics are presented as evidence that the reconstruction objective improves representations, the manuscript does not include an ablation that isolates the masked latent reconstruction loss from the semantic-prior term; without this control it remains unclear which component drives the reported token-quality improvements.

minor comments (2)

[§3.1] The description of the self-distillation target construction (how masked latents are generated and aligned) would benefit from an explicit equation or pseudocode block for reproducibility.
[Figure 3] Figure 3 (token visualization) and the associated redundancy plots would be clearer with explicit axis labels and a statement of the exact distance metric used for pairwise redundancy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive feedback. We address the two major comments point by point below.

read point-by-point responses

Referee: [§4] §4 (Experiments) and associated tables: the SOTA claims rest on single-run point estimates (e.g., 8.0461 RFS, 94.1 PDMS) without reported standard deviations, multiple random seeds, or statistical tests; this weakens the ability to attribute gains specifically to the NTR objective versus training variance.

Authors: We agree that single-run point estimates limit the strength of the SOTA claims and that variance estimates would better isolate the contribution of NTR from training stochasticity. End-to-end driving models are computationally expensive to train, which is why we initially reported single runs (a practice seen in several recent E2E driving papers). In the revision we will rerun the key experiments with at least three random seeds, report means and standard deviations, and add a brief statistical comparison where appropriate. revision: yes
Referee: [§3.2, §4.3] §3.2 and §4.3: while token redundancy and effective-rank metrics are presented as evidence that the reconstruction objective improves representations, the manuscript does not include an ablation that isolates the masked latent reconstruction loss from the semantic-prior term; without this control it remains unclear which component drives the reported token-quality improvements.

Authors: We acknowledge that the current manuscript lacks an ablation that separates the masked latent reconstruction loss from the semantic-prior term. The semantic priors are used only as soft targets to bias reconstruction toward driving-relevant structures; however, without the requested control experiment it is indeed difficult to quantify their individual impact on the token statistics. We will add this ablation (full NTR vs. reconstruction loss only) to §4.3 and update the corresponding discussion in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an auxiliary masked latent reconstruction objective used only during training to supervise the scene-token bottleneck; this objective and all associated components are explicitly removed at inference time, leaving the planner unchanged. Performance claims rest on empirical SOTA metrics across external benchmarks plus post-hoc token statistics (pairwise redundancy, effective rank), none of which are defined in terms of the same fitted parameters or reduced to self-citations. No equations, uniqueness theorems, or ansatzes are shown to collapse the reported gains to quantities already present in the inputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review supplies no concrete free parameters, domain assumptions, or invented entities beyond the high-level description of the NTR objective itself.

axioms (1)

standard math Standard assumptions of gradient-based optimization in neural networks.
Implicit in any deep-learning training procedure described in the abstract.

pith-pipeline@v0.9.1-grok · 5813 in / 1242 out tokens · 31441 ms · 2026-06-28T22:55:33.797333+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

63 extracted references · 25 canonical work pages · 4 internal anchors

[1]

End to End Learning for Self-Driving Cars

M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Mon- fort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba. End to end learning for self-driving cars, 2016. URLhttps://arxiv.org/abs/1604.07316

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

F. Codevilla, E. Santana, A. Lopez, and A. Gaidon. Exploring the limitations of behavior cloning for autonomous driving. In2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9328–9337, 2019. doi:10.1109/ICCV .2019.00942

work page doi:10.1109/iccv 2019
[3]

L. Wang, Z. Yang, C. Bai, G. Zhang, X. Liu, X. Zheng, X.-X. Long, C.-T. Lu, and C. Lu. Drive-jepa: Video jepa meets multimodal trajectory distillation for end-to-end driving, 2026. URLhttps://arxiv.org/abs/2601.22032

work page arXiv 2026
[4]

K. Guo, H. Liu, X. Wu, J. Pan, and C. Lv. ipad: Iterative proposal-centric end-to-end au- tonomous driving, 2025. URLhttps://arxiv.org/abs/2505.15111

work page arXiv 2025
[5]

L. Feng, Y . Gao, E. Zablocki, Q. Li, W. Li, S. Liu, M. Cord, and A. Alahi. RAP: 3d rasterization augmented end-to-end planning. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=a9bOgeqbdB

2026
[6]

Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, L. Lu, X. Jia, Q. Liu, J. Dai, Y . Qiao, and H. Li. Planning-oriented autonomous driving. In2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17853–17862, 2023. doi:10.1109/CVPR52729.2023.01712

work page doi:10.1109/cvpr52729.2023.01712 2023
[7]

In: IEEE/CVF International Conference on Computer Vision

B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. Vad: Vectorized scene representation for efficient autonomous driving. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8306–8316, Los Alamitos, CA, USA, Oct 2023. IEEE Computer Society. doi:10.1109/ICCV51070.2023.00766. URLhttps: //doi.i...

work page doi:10.1109/iccv51070.2023.00766 2023
[8]

Jiang, S

B. Jiang, S. Chen, H. Gao, B. Liao, Q. Zhang, W. Liu, and X. Wang. V ADv2: End-to-end vectorized autonomous driving via probabilistic planning. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum? id=0a4dA6eUHN

2026
[9]

Y . Li, Y . Wang, Y . Liu, J. He, L. Fan, and Z. Zhang. End-to-end driving with online trajectory evaluation via bev world model. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 27137–27146, October 2025

2025
[10]

W. Sun, X. Lin, Y . Shi, C. Zhang, H. Wu, and S. Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801, 2025. doi:10.1109/ICRA55743.2025.11128800

work page doi:10.1109/icra55743.2025.11128800 2025
[11]

W. Sun, X. Lin, K. Chen, Z. Pei, X. Li, Y . Shi, and S. Zheng. SparseDriveV2: Scoring is all you need for end-to-end autonomous driving.arXiv preprint arXiv:2603.29163, 2026

work page arXiv 2026
[12]

Y . Li, D. Yuan, H. Zhang, Y . Yang, and X. Luo. End to end autonomous driving via occu- pancy and motion flow. In2024 IEEE International Conference on Real-time Computing and Robotics (RCAR), pages 360–365, 2024. doi:10.1109/RCAR61438.2024.10670964

work page doi:10.1109/rcar61438.2024.10670964 2024
[13]

Kirby, A

E. Kirby, A. Boulch, Y . Xu, Y . Yin, G. Puy, E. Zablocki, A. Bursuc, S. Gidaris, R. Marlet, F. Bartoccioni, A.-Q. Cao, N. Samet, T.-H. VU, and M. Cord. Driving on registers. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 32058–32069, June 2026

2026
[14]

S. Ang, Y . Yang, C. Chen, and Y . Wang. CLOVER: Closed-loop value estimation and ranking for end-to-end autonomous driving planning.arXiv preprint arXiv:2605.15120, 2026. 9

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

Q. Yu, M. Weber, X. Deng, X. Shen, D. Cremers, and L.-C. Chen. An image is worth 32 tokens for reconstruction and generation. InThe Thirty-eighth Annual Conference on Neural Informa- tion Processing Systems, 2024. URLhttps://openreview.net/forum?id=tOXoQPRzPL

2024
[16]

Bachmann, J

R. Bachmann, J. Allardice, D. Mizrahi, E. Fini, O. F. Kar, E. Amirloo, A. El-Nouby, A. Zamir, and A. Dehghan. Flextok: Resampling images into 1d token sequences of flex- ible length. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=DgdOkUUBzf

2025
[17]

Marouani, O

A. Marouani, O. Sim ´eoni, H. Jegou, P. Bojanowski, and H. V . V o. Revisiting [CLS] and patch token interaction in vision transformers. InThe Fourteenth International Conference on Learn- ing Representations, 2026. URLhttps://openreview.net/forum?id=xs9xwFphLy

2026
[18]

C. Shi, Y . Yu, and S. Yang. Vision transformers need more than registers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26328–26337, June 2026

2026
[19]

Oquab, T

M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. HAZIZA, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. DINOv2: Learning robust visual features without s...

2024
[20]

Sim ´eoni, H

O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. E. Yi, M. Ramamonjisoa, F. Massa, D. HAZIZA, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Cou- prie, J. Mairal, H. Jegou, P. Labatut, and P. Bojanowski. DINOv3.Transactions on Machine Lear...

2026
[21]

Assran, Q

M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y . LeCun, and N. Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15619–15629, June 2023

2023
[22]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Komeili, M. J. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, S. Arnaud, A. Gejji, A. Martin, F. R. Hogan, D. Dugas, P. Bojanowski, V . Khalidov, P. Labatut, F. Massa, M. Szafraniec, K. Krishnakumar, Y . Li, X. Ma, S. Chandar, F. Meier, Y . LeCun, M. Rabbat, and N. Ballas. V-jepa 2: Self-supervis...

work page internal anchor Pith review Pith/arXiv arXiv
[23]

URLhttps://doi.org/10.48550/arXiv.2506.09985

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.09985
[24]

M. K. Wozniak, L. Liu, Y . Cai, and P. Jensfelt. Prix: Learning to plan from raw pixels for end-to-end autonomous driving.IEEE Robotics and Automation Letters, 11:6400–6407, 2025

2025
[25]

Chitta, A

K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analy- sis and machine intelligence, 45(11):12878–12895, 2022

2022
[26]

Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y . Qiao, and J. Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In S. Avidan, G. Brostow, M. Ciss´e, G. M. Farinella, and T. Hassner, editors,Computer Vision – ECCV 2022, pages 1–18, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-20077- 9

2022
[27]

K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, June 2022. 10

2022
[28]

Z. Tong, Y . Song, J. Wang, and L. Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors,Advances in Neural Information Processing Systems, 2022. URLhttps: //openreview.net/forum?id=AhccnBXSne

2022
[29]

Baevski, W.-N

A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli. data2vec: A general framework for self-supervised learning in speech, vision and language. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors,Proceedings of the 39th Interna- tional Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Resear...

2022
[30]

J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong. Image BERT pre-training with online tokenizer. InInternational Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=ydopy-e6Dg

2022
[31]

Y . Li, S. Shang, W. Liu, B. Zhan, H. Wang, Y . Wang, Y . Chen, X. Wang, AnYasong, C. Tang, L. Hou, L. Fan, and Z. Zhang. DriveVLA-w0: World models amplify data scaling law in au- tonomous driving. InThe Fourteenth International Conference on Learning Representations,
[32]

URLhttps://openreview.net/forum?id=plrGn3RdzN
[33]

Y . Chen, Y . Wang, and Z. Zhang. Drivinggpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 26890–26900, October 2025

2025
[34]

J. Wang, G. Li, Z. Huang, C. Dang, H. Ye, Y . Han, and L. Chen. Vggdrive: Empowering vision-language models with cross-view geometric grounding for autonomous driving. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10954–10964, June 2026

2026
[35]

Zhang, H

Z. Zhang, H. Li, Y . Dai, Z. Zhu, L. Zhou, C. Liu, D. Wang, F. E. H. Tay, S. Chen, Z. Liu, Y . Liu, X. Li, and P. Zhou. From spatial to actions: Grounding vision-language-action model in spatial foundation priors. InThe Fourteenth International Conference on Learning Representations,
[36]

URLhttps://openreview.net/forum?id=fzmittHfq3
[37]

S. Zeng, X. Chang, M. Xie, X. Liu, Y . Bai, Z. Pan, M. Xu, and X. Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview. net/forum?id=fCirUh6FRb

2026
[38]

W. Song, Z. Zhou, H. Zhao, J. Chen, P. Ding, H. Yan, Y . Huang, F. Tang, D. Wang, and H. Li. Reconvla: Reconstructive vision-language-action model as effective robot perceiver.CoRR, abs/2508.10333, August 2025. URLhttps://doi.org/10.48550/arXiv.2508.10333

work page doi:10.48550/arxiv.2508.10333 2025
[39]

Carion, L

N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. S. Coll-Vinent, C. Ryali, K. V . Al- wala, H. Khedr, A. Huang, J. Lei, T. Ma, B. Guo, A. Kalla, M. Marks, J. Greer, M. Wang, P. Sun, R. R¨adle, T. Afouras, E. Mavroudi, K. Xu, T.-H. Wu, Y . Zhou, L. Momeni, R. HAZRA, S. Ding, S. Vaze, F. Porcher, F. Li, S. Li, A. Kamath, H. K. Cheng, P. Dollar, N. ...

2026
[40]

E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Rep- resentations, 2022. URLhttps://openreview.net/forum?id=nZeVKeeFYf9. 11

2022
[42]

Dauner, M

D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, A. Geiger, and K. Chitta. NA VSIM: Data-driven non-reactive au- tonomous vehicle simulation and benchmarking. InThe Thirty-eight Conference on Neu- ral Information Processing Systems Datasets and Benchmarks Track, 2024. URLhttps: //openreview.net/...

2024
[43]

W. Cao, M. Hallgarten, T. Li, D. Dauner, X. Gu, C. Wang, Y . Miron, M. Aiello, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, A. Geiger, and K. Chitta. Pseudo-simulation for autonomous driving. In9th Annual Conference on Robot Learning, 2025. URLhttps: //openreview.net/forum?id=9uKL9FJBiz

2025
[44]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. InInternational Con- ference on Learning Representations, 2019. URLhttps://openreview.net/forum?id= Bkg6RiCqY7

2019
[45]

L. Rowe, R. de Schaetzen, R. Girgis, C. Pal, and L. Paull. Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving, 2025. URLhttps://arxiv.org/abs/2506.11234

work page arXiv 2025
[46]

Available: https://arxiv.org/abs/2508.06571

A. Jiang, Y . Gao, Y . Wang, Z. Sun, S. Wang, Y . Heng, H. Sun, S. Tang, L. Zhu, J. Chai, J. Wang, Z. Gu, H. Jiang, and L. Sun. Irl-vla: Training an vision-language-action policy via reward world model, 2025. URLhttps://arxiv.org/abs/2508.06571

work page arXiv 2025
[47]

Z. Dong, Y . Zhu, Y . Wu, and Y . Sun. Frost-drive: Scalable and efficient end-to-end driving with a frozen vision encoder, 2026. URLhttps://arxiv.org/abs/2601.03460

work page arXiv 2026
[48]

D. Wang, Y . Song, Z. He, K. Chen, X. Pan, L. Deng, and W. Gu. Hmvlm: Multistage reasoning-enhanced vision-language model for long-tailed driving scenarios, 2025. URL https://arxiv.org/abs/2506.05883

work page arXiv 2025
[49]

Y . Ma, Y . Cao, W. Ding, S. Zhang, Y . Wang, B. Ivanovic, M. Jiang, M. Pavone, and C. Xiao. dvlm-ad: Enhance diffusion vision-language-model for driving via controllable reasoning,
[50]

URLhttps://arxiv.org/abs/2512.04459

work page arXiv
[51]

Z. Zhou, T. Cai, S. Zhao, Y . Zhang, Z. Huang, B. Zhou, and J. Ma. Autovla: A vision-language- action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.Advances in Neural Information Processing Systems, 38:27920–27956, 2026

2026
[52]

R. Xu, H. Lin, W. Jeon, H. Feng, Y . Zou, L. Sun, J. Gorman, E. Tolstaya, S. Tang, B. White, B. Sapp, M. Tan, J.-J. Hwang, and D. Anguelov. Wod-e2e: Waymo open dataset for end-to- end driving in challenging long-tail scenarios, 2025. URLhttps://arxiv.org/abs/2510. 26125

2025
[53]

Z. Liu, R. Huang, R. Yang, S. Yan, Z. Wang, L. Hou, D. Lin, X. Bai, and H. Zhao. Drivepi: Spatial-aware 4d mllm for unified autonomous driving understanding, perception, prediction and planning, 2025. URLhttps://arxiv.org/abs/2512.12799

work page arXiv 2025
[54]

S. Xing, C. Qian, Y . Wang, H. Hua, K. Tian, Y . Zhou, and Z. Tu. Openemma: Open-source multimodal model for end-to-end autonomous driving, 2025. URLhttps://arxiv.org/ abs/2412.15208

work page arXiv 2025
[55]

Dauner, M

D. Dauner, M. Hallgarten, A. Geiger, and K. Chitta. Parting with misconceptions about learning-based vehicle motion planning. InConference on Robot Learning, pages 1268–1281. PMLR, 2023. 12

2023
[56]

Chitta, A

K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE Transactions on Pattern Analy- sis and Machine Intelligence, 45(11):12878–12895, 2023. doi:10.1109/TPAMI.2022.3200245

work page doi:10.1109/tpami.2022.3200245 2023
[57]

B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y . Zhang, Q. Zhang, and X. Wang. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 12037–12047, June 2025

2025
[58]

Z. Xing, X. Zhang, Y . Hu, B. Jiang, T. He, Q. Zhang, X. Long, and W. Yin. Goalflow: Goal- driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1602–1611, June 2025

2025
[59]

Y . Li, K. Xiong, X. Guo, F. Li, S. Yan, G. Xu, L. Zhou, L. Chen, H. Sun, B. Wang, G. Chen, H. Ye, W. Liu, and X. Wang. ReCogDrive: A reinforced cognitive framework for end-to-end autonomous driving. InInternational Conference on Learning Representations (ICLR), 2026

2026
[60]

Z. Li, Z. Yu, S. Lan, J. Li, J. Kautz, T. Lu, and J. M. Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14864–14873, June 2024

2024
[61]

R. Feng, N. Xi, D. Chu, R. Wang, Z. Deng, A. Wang, L. Lu, J. Wang, and Y . Huang. Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for au- tonomous driving.IEEE Robotics and Automation Letters, 11(1):226–233, 2026. doi: 10.1109/LRA.2025.3632616

work page doi:10.1109/lra.2025.3632616 2026
[62]

K. Li, Z. Li, S. Lan, Y . Xie, Z. Zhang, J. Liu, Z. Wu, Z. Yu, and J. M. Alvarez. Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation, 2025. URLhttps:// arxiv.org/abs/2503.12820

work page arXiv 2025
[63]

W. Yao, Z. Li, S. Lan, Z. Wang, X. Sun, J. M. Alvarez, and Z. Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning.Proceedings of the AAAI Conference on Artificial Intelligence, 40:11910–11918, 03 2026. doi:10.1609/aaai.v40i14.38178

work page doi:10.1609/aaai.v40i14.38178 2026
[64]

J. Zou, S. Chen, B. Liao, Z. Zheng, Y . Song, L. Zhang, Q. Zhang, W. Liu, and X. Wang. Diffusiondrivev2: Reinforcement learning-constrained truncated diffusion modeling in end- to-end autonomous driving, 2025. URLhttps://arxiv.org/abs/2512.07745. 13

work page arXiv 2025

[1] [1]

End to End Learning for Self-Driving Cars

M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Mon- fort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba. End to end learning for self-driving cars, 2016. URLhttps://arxiv.org/abs/1604.07316

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

F. Codevilla, E. Santana, A. Lopez, and A. Gaidon. Exploring the limitations of behavior cloning for autonomous driving. In2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9328–9337, 2019. doi:10.1109/ICCV .2019.00942

work page doi:10.1109/iccv 2019

[3] [3]

L. Wang, Z. Yang, C. Bai, G. Zhang, X. Liu, X. Zheng, X.-X. Long, C.-T. Lu, and C. Lu. Drive-jepa: Video jepa meets multimodal trajectory distillation for end-to-end driving, 2026. URLhttps://arxiv.org/abs/2601.22032

work page arXiv 2026

[4] [4]

K. Guo, H. Liu, X. Wu, J. Pan, and C. Lv. ipad: Iterative proposal-centric end-to-end au- tonomous driving, 2025. URLhttps://arxiv.org/abs/2505.15111

work page arXiv 2025

[5] [5]

L. Feng, Y . Gao, E. Zablocki, Q. Li, W. Li, S. Liu, M. Cord, and A. Alahi. RAP: 3d rasterization augmented end-to-end planning. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=a9bOgeqbdB

2026

[6] [6]

Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, L. Lu, X. Jia, Q. Liu, J. Dai, Y . Qiao, and H. Li. Planning-oriented autonomous driving. In2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17853–17862, 2023. doi:10.1109/CVPR52729.2023.01712

work page doi:10.1109/cvpr52729.2023.01712 2023

[7] [7]

In: IEEE/CVF International Conference on Computer Vision

B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. Vad: Vectorized scene representation for efficient autonomous driving. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8306–8316, Los Alamitos, CA, USA, Oct 2023. IEEE Computer Society. doi:10.1109/ICCV51070.2023.00766. URLhttps: //doi.i...

work page doi:10.1109/iccv51070.2023.00766 2023

[8] [8]

Jiang, S

B. Jiang, S. Chen, H. Gao, B. Liao, Q. Zhang, W. Liu, and X. Wang. V ADv2: End-to-end vectorized autonomous driving via probabilistic planning. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum? id=0a4dA6eUHN

2026

[9] [9]

Y . Li, Y . Wang, Y . Liu, J. He, L. Fan, and Z. Zhang. End-to-end driving with online trajectory evaluation via bev world model. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 27137–27146, October 2025

2025

[10] [10]

W. Sun, X. Lin, Y . Shi, C. Zhang, H. Wu, and S. Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801, 2025. doi:10.1109/ICRA55743.2025.11128800

work page doi:10.1109/icra55743.2025.11128800 2025

[11] [11]

W. Sun, X. Lin, K. Chen, Z. Pei, X. Li, Y . Shi, and S. Zheng. SparseDriveV2: Scoring is all you need for end-to-end autonomous driving.arXiv preprint arXiv:2603.29163, 2026

work page arXiv 2026

[12] [12]

Y . Li, D. Yuan, H. Zhang, Y . Yang, and X. Luo. End to end autonomous driving via occu- pancy and motion flow. In2024 IEEE International Conference on Real-time Computing and Robotics (RCAR), pages 360–365, 2024. doi:10.1109/RCAR61438.2024.10670964

work page doi:10.1109/rcar61438.2024.10670964 2024

[13] [13]

Kirby, A

E. Kirby, A. Boulch, Y . Xu, Y . Yin, G. Puy, E. Zablocki, A. Bursuc, S. Gidaris, R. Marlet, F. Bartoccioni, A.-Q. Cao, N. Samet, T.-H. VU, and M. Cord. Driving on registers. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 32058–32069, June 2026

2026

[14] [14]

S. Ang, Y . Yang, C. Chen, and Y . Wang. CLOVER: Closed-loop value estimation and ranking for end-to-end autonomous driving planning.arXiv preprint arXiv:2605.15120, 2026. 9

work page internal anchor Pith review Pith/arXiv arXiv 2026

[15] [15]

Q. Yu, M. Weber, X. Deng, X. Shen, D. Cremers, and L.-C. Chen. An image is worth 32 tokens for reconstruction and generation. InThe Thirty-eighth Annual Conference on Neural Informa- tion Processing Systems, 2024. URLhttps://openreview.net/forum?id=tOXoQPRzPL

2024

[16] [16]

Bachmann, J

R. Bachmann, J. Allardice, D. Mizrahi, E. Fini, O. F. Kar, E. Amirloo, A. El-Nouby, A. Zamir, and A. Dehghan. Flextok: Resampling images into 1d token sequences of flex- ible length. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=DgdOkUUBzf

2025

[17] [17]

Marouani, O

A. Marouani, O. Sim ´eoni, H. Jegou, P. Bojanowski, and H. V . V o. Revisiting [CLS] and patch token interaction in vision transformers. InThe Fourteenth International Conference on Learn- ing Representations, 2026. URLhttps://openreview.net/forum?id=xs9xwFphLy

2026

[18] [18]

C. Shi, Y . Yu, and S. Yang. Vision transformers need more than registers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26328–26337, June 2026

2026

[19] [19]

Oquab, T

M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. HAZIZA, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. DINOv2: Learning robust visual features without s...

2024

[20] [20]

Sim ´eoni, H

O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. E. Yi, M. Ramamonjisoa, F. Massa, D. HAZIZA, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Cou- prie, J. Mairal, H. Jegou, P. Labatut, and P. Bojanowski. DINOv3.Transactions on Machine Lear...

2026

[21] [21]

Assran, Q

M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y . LeCun, and N. Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15619–15629, June 2023

2023

[22] [22]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Komeili, M. J. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, S. Arnaud, A. Gejji, A. Martin, F. R. Hogan, D. Dugas, P. Bojanowski, V . Khalidov, P. Labatut, F. Massa, M. Szafraniec, K. Krishnakumar, Y . Li, X. Ma, S. Chandar, F. Meier, Y . LeCun, M. Rabbat, and N. Ballas. V-jepa 2: Self-supervis...

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

URLhttps://doi.org/10.48550/arXiv.2506.09985

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.09985

[24] [24]

M. K. Wozniak, L. Liu, Y . Cai, and P. Jensfelt. Prix: Learning to plan from raw pixels for end-to-end autonomous driving.IEEE Robotics and Automation Letters, 11:6400–6407, 2025

2025

[25] [25]

Chitta, A

K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analy- sis and machine intelligence, 45(11):12878–12895, 2022

2022

[26] [26]

Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y . Qiao, and J. Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In S. Avidan, G. Brostow, M. Ciss´e, G. M. Farinella, and T. Hassner, editors,Computer Vision – ECCV 2022, pages 1–18, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-20077- 9

2022

[27] [27]

K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, June 2022. 10

2022

[28] [28]

Z. Tong, Y . Song, J. Wang, and L. Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors,Advances in Neural Information Processing Systems, 2022. URLhttps: //openreview.net/forum?id=AhccnBXSne

2022

[29] [29]

Baevski, W.-N

A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli. data2vec: A general framework for self-supervised learning in speech, vision and language. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors,Proceedings of the 39th Interna- tional Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Resear...

2022

[30] [30]

J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong. Image BERT pre-training with online tokenizer. InInternational Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=ydopy-e6Dg

2022

[31] [31]

Y . Li, S. Shang, W. Liu, B. Zhan, H. Wang, Y . Wang, Y . Chen, X. Wang, AnYasong, C. Tang, L. Hou, L. Fan, and Z. Zhang. DriveVLA-w0: World models amplify data scaling law in au- tonomous driving. InThe Fourteenth International Conference on Learning Representations,

[32] [32]

URLhttps://openreview.net/forum?id=plrGn3RdzN

[33] [33]

Y . Chen, Y . Wang, and Z. Zhang. Drivinggpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 26890–26900, October 2025

2025

[34] [34]

J. Wang, G. Li, Z. Huang, C. Dang, H. Ye, Y . Han, and L. Chen. Vggdrive: Empowering vision-language models with cross-view geometric grounding for autonomous driving. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10954–10964, June 2026

2026

[35] [35]

Zhang, H

Z. Zhang, H. Li, Y . Dai, Z. Zhu, L. Zhou, C. Liu, D. Wang, F. E. H. Tay, S. Chen, Z. Liu, Y . Liu, X. Li, and P. Zhou. From spatial to actions: Grounding vision-language-action model in spatial foundation priors. InThe Fourteenth International Conference on Learning Representations,

[36] [36]

URLhttps://openreview.net/forum?id=fzmittHfq3

[37] [37]

S. Zeng, X. Chang, M. Xie, X. Liu, Y . Bai, Z. Pan, M. Xu, and X. Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview. net/forum?id=fCirUh6FRb

2026

[38] [38]

W. Song, Z. Zhou, H. Zhao, J. Chen, P. Ding, H. Yan, Y . Huang, F. Tang, D. Wang, and H. Li. Reconvla: Reconstructive vision-language-action model as effective robot perceiver.CoRR, abs/2508.10333, August 2025. URLhttps://doi.org/10.48550/arXiv.2508.10333

work page doi:10.48550/arxiv.2508.10333 2025

[39] [39]

Carion, L

N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. S. Coll-Vinent, C. Ryali, K. V . Al- wala, H. Khedr, A. Huang, J. Lei, T. Ma, B. Guo, A. Kalla, M. Marks, J. Greer, M. Wang, P. Sun, R. R¨adle, T. Afouras, E. Mavroudi, K. Xu, T.-H. Wu, Y . Zhou, L. Momeni, R. HAZRA, S. Ding, S. Vaze, F. Porcher, F. Li, S. Li, A. Kamath, H. K. Cheng, P. Dollar, N. ...

2026

[40] [40]

E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Rep- resentations, 2022. URLhttps://openreview.net/forum?id=nZeVKeeFYf9. 11

2022

[41] [42]

Dauner, M

D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, A. Geiger, and K. Chitta. NA VSIM: Data-driven non-reactive au- tonomous vehicle simulation and benchmarking. InThe Thirty-eight Conference on Neu- ral Information Processing Systems Datasets and Benchmarks Track, 2024. URLhttps: //openreview.net/...

2024

[42] [43]

W. Cao, M. Hallgarten, T. Li, D. Dauner, X. Gu, C. Wang, Y . Miron, M. Aiello, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, A. Geiger, and K. Chitta. Pseudo-simulation for autonomous driving. In9th Annual Conference on Robot Learning, 2025. URLhttps: //openreview.net/forum?id=9uKL9FJBiz

2025

[43] [44]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. InInternational Con- ference on Learning Representations, 2019. URLhttps://openreview.net/forum?id= Bkg6RiCqY7

2019

[44] [45]

L. Rowe, R. de Schaetzen, R. Girgis, C. Pal, and L. Paull. Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving, 2025. URLhttps://arxiv.org/abs/2506.11234

work page arXiv 2025

[45] [46]

Available: https://arxiv.org/abs/2508.06571

A. Jiang, Y . Gao, Y . Wang, Z. Sun, S. Wang, Y . Heng, H. Sun, S. Tang, L. Zhu, J. Chai, J. Wang, Z. Gu, H. Jiang, and L. Sun. Irl-vla: Training an vision-language-action policy via reward world model, 2025. URLhttps://arxiv.org/abs/2508.06571

work page arXiv 2025

[46] [47]

Z. Dong, Y . Zhu, Y . Wu, and Y . Sun. Frost-drive: Scalable and efficient end-to-end driving with a frozen vision encoder, 2026. URLhttps://arxiv.org/abs/2601.03460

work page arXiv 2026

[47] [48]

D. Wang, Y . Song, Z. He, K. Chen, X. Pan, L. Deng, and W. Gu. Hmvlm: Multistage reasoning-enhanced vision-language model for long-tailed driving scenarios, 2025. URL https://arxiv.org/abs/2506.05883

work page arXiv 2025

[48] [49]

Y . Ma, Y . Cao, W. Ding, S. Zhang, Y . Wang, B. Ivanovic, M. Jiang, M. Pavone, and C. Xiao. dvlm-ad: Enhance diffusion vision-language-model for driving via controllable reasoning,

[49] [50]

URLhttps://arxiv.org/abs/2512.04459

work page arXiv

[50] [51]

Z. Zhou, T. Cai, S. Zhao, Y . Zhang, Z. Huang, B. Zhou, and J. Ma. Autovla: A vision-language- action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.Advances in Neural Information Processing Systems, 38:27920–27956, 2026

2026

[51] [52]

R. Xu, H. Lin, W. Jeon, H. Feng, Y . Zou, L. Sun, J. Gorman, E. Tolstaya, S. Tang, B. White, B. Sapp, M. Tan, J.-J. Hwang, and D. Anguelov. Wod-e2e: Waymo open dataset for end-to- end driving in challenging long-tail scenarios, 2025. URLhttps://arxiv.org/abs/2510. 26125

2025

[52] [53]

Z. Liu, R. Huang, R. Yang, S. Yan, Z. Wang, L. Hou, D. Lin, X. Bai, and H. Zhao. Drivepi: Spatial-aware 4d mllm for unified autonomous driving understanding, perception, prediction and planning, 2025. URLhttps://arxiv.org/abs/2512.12799

work page arXiv 2025

[53] [54]

S. Xing, C. Qian, Y . Wang, H. Hua, K. Tian, Y . Zhou, and Z. Tu. Openemma: Open-source multimodal model for end-to-end autonomous driving, 2025. URLhttps://arxiv.org/ abs/2412.15208

work page arXiv 2025

[54] [55]

Dauner, M

D. Dauner, M. Hallgarten, A. Geiger, and K. Chitta. Parting with misconceptions about learning-based vehicle motion planning. InConference on Robot Learning, pages 1268–1281. PMLR, 2023. 12

2023

[55] [56]

Chitta, A

K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE Transactions on Pattern Analy- sis and Machine Intelligence, 45(11):12878–12895, 2023. doi:10.1109/TPAMI.2022.3200245

work page doi:10.1109/tpami.2022.3200245 2023

[56] [57]

B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y . Zhang, Q. Zhang, and X. Wang. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 12037–12047, June 2025

2025

[57] [58]

Z. Xing, X. Zhang, Y . Hu, B. Jiang, T. He, Q. Zhang, X. Long, and W. Yin. Goalflow: Goal- driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1602–1611, June 2025

2025

[58] [59]

Y . Li, K. Xiong, X. Guo, F. Li, S. Yan, G. Xu, L. Zhou, L. Chen, H. Sun, B. Wang, G. Chen, H. Ye, W. Liu, and X. Wang. ReCogDrive: A reinforced cognitive framework for end-to-end autonomous driving. InInternational Conference on Learning Representations (ICLR), 2026

2026

[59] [60]

Z. Li, Z. Yu, S. Lan, J. Li, J. Kautz, T. Lu, and J. M. Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14864–14873, June 2024

2024

[60] [61]

R. Feng, N. Xi, D. Chu, R. Wang, Z. Deng, A. Wang, L. Lu, J. Wang, and Y . Huang. Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for au- tonomous driving.IEEE Robotics and Automation Letters, 11(1):226–233, 2026. doi: 10.1109/LRA.2025.3632616

work page doi:10.1109/lra.2025.3632616 2026

[61] [62]

K. Li, Z. Li, S. Lan, Y . Xie, Z. Zhang, J. Liu, Z. Wu, Z. Yu, and J. M. Alvarez. Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation, 2025. URLhttps:// arxiv.org/abs/2503.12820

work page arXiv 2025

[62] [63]

W. Yao, Z. Li, S. Lan, Z. Wang, X. Sun, J. M. Alvarez, and Z. Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning.Proceedings of the AAAI Conference on Artificial Intelligence, 40:11910–11918, 03 2026. doi:10.1609/aaai.v40i14.38178

work page doi:10.1609/aaai.v40i14.38178 2026

[63] [64]

J. Zou, S. Chen, B. Liao, Z. Zheng, Y . Song, L. Zhang, Q. Zhang, W. Liu, and X. Wang. Diffusiondrivev2: Reinforcement learning-constrained truncated diffusion modeling in end- to-end autonomous driving, 2025. URLhttps://arxiv.org/abs/2512.07745. 13

work page arXiv 2025