pith. sign in

arxiv: 2605.31116 · v1 · pith:IJ7ES7ZLnew · submitted 2026-05-29 · 💻 cs.CV · cs.RO

NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving

Pith reviewed 2026-06-28 22:55 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords end-to-end autonomous drivingscene tokenstoken reconstructionself-distillationvisual bottleneckmasked reconstructionperception-free planning
0
0 comments X

The pith

A masked self-distillation reconstruction objective on scene tokens produces richer visual representations that improve end-to-end driving performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

End-to-end driving models compress image patches into a small set of scene tokens that feed directly into the planner. These tokens currently receive supervision only from the final planning loss, leaving their visual content under-constrained. The paper adds a reconstruction task that requires the scene tokens to rebuild masked patch features, routing all gradients through the bottleneck alone. Semantic cues from foundation models guide the reconstruction targets toward relevant structures. The result is less redundant tokens and higher scores on standard driving benchmarks, all without changing the model at test time.

Core claim

NTR constrains the scene-token bottleneck in perception-free E2E driving by a self-distillation masked latent reconstruction objective that reconstructs masked patch-level latent features using only the compact scene tokens as reconstruction memory. This forces reconstruction gradients to pass exclusively through the scene-token bottleneck, encouraging scene tokens to preserve richer and less redundant visual representations for planning. Weak semantic priors derived from foundation-model annotations bias reconstruction targets toward driving-related structures without introducing explicit perception heads. All auxiliary reconstruction components are removed at inference time.

What carries the argument

Neural Token Reconstruction (NTR) via self-distillation masked latent reconstruction objective that reconstructs masked patch latents using only scene tokens

If this is right

  • Scene tokens exhibit lower pairwise redundancy and higher effective rank.
  • Planning performance reaches 8.0461 RFS on Waymo E2E and 94.1 PDMS / 90.9 EPDMS on NavSim1&2.
  • All auxiliary reconstruction components are removed at inference, leaving the deployed planner unchanged.
  • Effective bottleneck supervision improves both compact visual representation learning and planning performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reconstruction pattern could be applied to other compressed token bottlenecks where downstream supervision is indirect.
  • Token effective rank and redundancy metrics may serve as lightweight diagnostics for representation quality in other vision-planning systems.
  • If foundation-model priors prove hard to obtain, simpler unsupervised reconstruction targets might still deliver part of the bottleneck improvement.

Load-bearing premise

Forcing reconstruction gradients to pass exclusively through the scene-token bottleneck will cause the tokens to preserve richer and less redundant visual representations that directly improve downstream planning.

What would settle it

An ablation removing only the masked reconstruction objective while keeping all other training elements fixed, followed by no change in planning scores or token redundancy metrics on the same benchmarks, would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.31116 by Jiahui Li, Jiamin Shi, Jiawei Sun, Kaidi Yang, Liying Liu, Ming Liu, Ruiteng Zhao, Zhiyang Liu, Zixiang Ren, Zuoguan Wang.

Figure 1
Figure 1. Figure 1: Diagnostic visualization of scene-token behavior. (a) The planning-only baseline shows [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of Neural Token Reconstruction (NTR). [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative examples on different driving scenarios. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

Recent perception-free end-to-end (E2E) autonomous driving methods bypass explicit perception outputs by compressing dense image patch tokens into compact scene tokens for downstream trajectory generation and scoring. While these scene tokens form a compact visual bottleneck for the planner, they receive supervision solely from the planning objective, providing limited constraints on the encoded visual information. To address this limitation, we introduce Neural Token Reconstruction (NTR), a representation learning framework to directly constrain the compact scene-token bottleneck in perception-free driving. NTR introduces a self-distillation masked latent reconstruction objective that reconstructs masked patch-level latent features using only compact scene tokens as reconstruction memory. This forces reconstruction gradients to pass exclusively through the scene-token bottleneck, encouraging scene tokens to preserve richer and less redundant visual representations for planning. We further introduce semantic priors derived from foundation-model annotations as a weak semantic interface biasing reconstruction targets toward driving-related structures without introducing explicit perception heads. All auxiliary reconstruction components are removed at inference time, leaving the deployed planner unchanged. NTR achieves state-of-the-art performance on three public autonomous driving benchmarks, including 8.0461 RFS on Waymo E2E and 94.1 PDMS / 90.9 EPDMS on NavSim1&2. The learned scene tokens exhibit lower pairwise redundancy and higher effective rank, indicating that effective bottleneck supervision improves both compact visual representation learning and planning performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Neural Token Reconstruction (NTR), a self-distillation masked latent reconstruction objective that supervises the compact scene-token bottleneck in perception-free end-to-end autonomous driving. Scene tokens reconstruct masked patch-level latent features (using only the bottleneck as memory), with semantic priors from foundation-model annotations serving as weak targets; all auxiliary components are discarded at inference. The paper reports state-of-the-art results on three benchmarks (8.0461 RFS on Waymo E2E; 94.1 PDMS / 90.9 EPDMS on NavSim1&2) together with improved token statistics (lower pairwise redundancy, higher effective rank).

Significance. If the reported gains and token metrics hold under scrutiny, the work shows that an auxiliary reconstruction objective can enrich the visual information retained by a planning-only bottleneck without altering the deployed model. The mechanism is internally consistent: gradients are routed exclusively through the scene tokens, semantic priors act only as soft targets, and direct measurements of redundancy and rank provide supporting evidence beyond planning metrics alone. This addresses a clear limitation of prior perception-free E2E methods and could influence future bottleneck designs.

major comments (2)
  1. [§4] §4 (Experiments) and associated tables: the SOTA claims rest on single-run point estimates (e.g., 8.0461 RFS, 94.1 PDMS) without reported standard deviations, multiple random seeds, or statistical tests; this weakens the ability to attribute gains specifically to the NTR objective versus training variance.
  2. [§3.2, §4.3] §3.2 and §4.3: while token redundancy and effective-rank metrics are presented as evidence that the reconstruction objective improves representations, the manuscript does not include an ablation that isolates the masked latent reconstruction loss from the semantic-prior term; without this control it remains unclear which component drives the reported token-quality improvements.
minor comments (2)
  1. [§3.1] The description of the self-distillation target construction (how masked latents are generated and aligned) would benefit from an explicit equation or pseudocode block for reproducibility.
  2. [Figure 3] Figure 3 (token visualization) and the associated redundancy plots would be clearer with explicit axis labels and a statement of the exact distance metric used for pairwise redundancy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive feedback. We address the two major comments point by point below.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments) and associated tables: the SOTA claims rest on single-run point estimates (e.g., 8.0461 RFS, 94.1 PDMS) without reported standard deviations, multiple random seeds, or statistical tests; this weakens the ability to attribute gains specifically to the NTR objective versus training variance.

    Authors: We agree that single-run point estimates limit the strength of the SOTA claims and that variance estimates would better isolate the contribution of NTR from training stochasticity. End-to-end driving models are computationally expensive to train, which is why we initially reported single runs (a practice seen in several recent E2E driving papers). In the revision we will rerun the key experiments with at least three random seeds, report means and standard deviations, and add a brief statistical comparison where appropriate. revision: yes

  2. Referee: [§3.2, §4.3] §3.2 and §4.3: while token redundancy and effective-rank metrics are presented as evidence that the reconstruction objective improves representations, the manuscript does not include an ablation that isolates the masked latent reconstruction loss from the semantic-prior term; without this control it remains unclear which component drives the reported token-quality improvements.

    Authors: We acknowledge that the current manuscript lacks an ablation that separates the masked latent reconstruction loss from the semantic-prior term. The semantic priors are used only as soft targets to bias reconstruction toward driving-relevant structures; however, without the requested control experiment it is indeed difficult to quantify their individual impact on the token statistics. We will add this ablation (full NTR vs. reconstruction loss only) to §4.3 and update the corresponding discussion in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an auxiliary masked latent reconstruction objective used only during training to supervise the scene-token bottleneck; this objective and all associated components are explicitly removed at inference time, leaving the planner unchanged. Performance claims rest on empirical SOTA metrics across external benchmarks plus post-hoc token statistics (pairwise redundancy, effective rank), none of which are defined in terms of the same fitted parameters or reduced to self-citations. No equations, uniqueness theorems, or ansatzes are shown to collapse the reported gains to quantities already present in the inputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review supplies no concrete free parameters, domain assumptions, or invented entities beyond the high-level description of the NTR objective itself.

axioms (1)
  • standard math Standard assumptions of gradient-based optimization in neural networks.
    Implicit in any deep-learning training procedure described in the abstract.

pith-pipeline@v0.9.1-grok · 5813 in / 1242 out tokens · 31441 ms · 2026-06-28T22:55:33.797333+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

63 extracted references · 25 canonical work pages · 4 internal anchors

  1. [1]

    End to End Learning for Self-Driving Cars

    M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Mon- fort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba. End to end learning for self-driving cars, 2016. URLhttps://arxiv.org/abs/1604.07316

  2. [2]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

    F. Codevilla, E. Santana, A. Lopez, and A. Gaidon. Exploring the limitations of behavior cloning for autonomous driving. In2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9328–9337, 2019. doi:10.1109/ICCV .2019.00942

  3. [3]

    L. Wang, Z. Yang, C. Bai, G. Zhang, X. Liu, X. Zheng, X.-X. Long, C.-T. Lu, and C. Lu. Drive-jepa: Video jepa meets multimodal trajectory distillation for end-to-end driving, 2026. URLhttps://arxiv.org/abs/2601.22032

  4. [4]

    K. Guo, H. Liu, X. Wu, J. Pan, and C. Lv. ipad: Iterative proposal-centric end-to-end au- tonomous driving, 2025. URLhttps://arxiv.org/abs/2505.15111

  5. [5]

    L. Feng, Y . Gao, E. Zablocki, Q. Li, W. Li, S. Liu, M. Cord, and A. Alahi. RAP: 3d rasterization augmented end-to-end planning. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=a9bOgeqbdB

  6. [6]

    Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, L. Lu, X. Jia, Q. Liu, J. Dai, Y . Qiao, and H. Li. Planning-oriented autonomous driving. In2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17853–17862, 2023. doi:10.1109/CVPR52729.2023.01712

  7. [7]

    In: IEEE/CVF International Conference on Computer Vision

    B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. Vad: Vectorized scene representation for efficient autonomous driving. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8306–8316, Los Alamitos, CA, USA, Oct 2023. IEEE Computer Society. doi:10.1109/ICCV51070.2023.00766. URLhttps: //doi.i...

  8. [8]

    Jiang, S

    B. Jiang, S. Chen, H. Gao, B. Liao, Q. Zhang, W. Liu, and X. Wang. V ADv2: End-to-end vectorized autonomous driving via probabilistic planning. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum? id=0a4dA6eUHN

  9. [9]

    Y . Li, Y . Wang, Y . Liu, J. He, L. Fan, and Z. Zhang. End-to-end driving with online trajectory evaluation via bev world model. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 27137–27146, October 2025

  10. [10]

    W. Sun, X. Lin, Y . Shi, C. Zhang, H. Wu, and S. Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801, 2025. doi:10.1109/ICRA55743.2025.11128800

  11. [11]

    W. Sun, X. Lin, K. Chen, Z. Pei, X. Li, Y . Shi, and S. Zheng. SparseDriveV2: Scoring is all you need for end-to-end autonomous driving.arXiv preprint arXiv:2603.29163, 2026

  12. [12]

    Y . Li, D. Yuan, H. Zhang, Y . Yang, and X. Luo. End to end autonomous driving via occu- pancy and motion flow. In2024 IEEE International Conference on Real-time Computing and Robotics (RCAR), pages 360–365, 2024. doi:10.1109/RCAR61438.2024.10670964

  13. [13]

    Kirby, A

    E. Kirby, A. Boulch, Y . Xu, Y . Yin, G. Puy, E. Zablocki, A. Bursuc, S. Gidaris, R. Marlet, F. Bartoccioni, A.-Q. Cao, N. Samet, T.-H. VU, and M. Cord. Driving on registers. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 32058–32069, June 2026

  14. [14]

    S. Ang, Y . Yang, C. Chen, and Y . Wang. CLOVER: Closed-loop value estimation and ranking for end-to-end autonomous driving planning.arXiv preprint arXiv:2605.15120, 2026. 9

  15. [15]

    Q. Yu, M. Weber, X. Deng, X. Shen, D. Cremers, and L.-C. Chen. An image is worth 32 tokens for reconstruction and generation. InThe Thirty-eighth Annual Conference on Neural Informa- tion Processing Systems, 2024. URLhttps://openreview.net/forum?id=tOXoQPRzPL

  16. [16]

    Bachmann, J

    R. Bachmann, J. Allardice, D. Mizrahi, E. Fini, O. F. Kar, E. Amirloo, A. El-Nouby, A. Zamir, and A. Dehghan. Flextok: Resampling images into 1d token sequences of flex- ible length. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=DgdOkUUBzf

  17. [17]

    Marouani, O

    A. Marouani, O. Sim ´eoni, H. Jegou, P. Bojanowski, and H. V . V o. Revisiting [CLS] and patch token interaction in vision transformers. InThe Fourteenth International Conference on Learn- ing Representations, 2026. URLhttps://openreview.net/forum?id=xs9xwFphLy

  18. [18]

    C. Shi, Y . Yu, and S. Yang. Vision transformers need more than registers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26328–26337, June 2026

  19. [19]

    Oquab, T

    M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. HAZIZA, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. DINOv2: Learning robust visual features without s...

  20. [20]

    Sim ´eoni, H

    O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. E. Yi, M. Ramamonjisoa, F. Massa, D. HAZIZA, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Cou- prie, J. Mairal, H. Jegou, P. Labatut, and P. Bojanowski. DINOv3.Transactions on Machine Lear...

  21. [21]

    Assran, Q

    M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y . LeCun, and N. Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15619–15629, June 2023

  22. [22]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Komeili, M. J. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, S. Arnaud, A. Gejji, A. Martin, F. R. Hogan, D. Dugas, P. Bojanowski, V . Khalidov, P. Labatut, F. Massa, M. Szafraniec, K. Krishnakumar, Y . Li, X. Ma, S. Chandar, F. Meier, Y . LeCun, M. Rabbat, and N. Ballas. V-jepa 2: Self-supervis...

  23. [23]

    URLhttps://doi.org/10.48550/arXiv.2506.09985

  24. [24]

    M. K. Wozniak, L. Liu, Y . Cai, and P. Jensfelt. Prix: Learning to plan from raw pixels for end-to-end autonomous driving.IEEE Robotics and Automation Letters, 11:6400–6407, 2025

  25. [25]

    Chitta, A

    K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analy- sis and machine intelligence, 45(11):12878–12895, 2022

  26. [26]

    Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y . Qiao, and J. Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In S. Avidan, G. Brostow, M. Ciss´e, G. M. Farinella, and T. Hassner, editors,Computer Vision – ECCV 2022, pages 1–18, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-20077- 9

  27. [27]

    K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, June 2022. 10

  28. [28]

    Z. Tong, Y . Song, J. Wang, and L. Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors,Advances in Neural Information Processing Systems, 2022. URLhttps: //openreview.net/forum?id=AhccnBXSne

  29. [29]

    Baevski, W.-N

    A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli. data2vec: A general framework for self-supervised learning in speech, vision and language. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors,Proceedings of the 39th Interna- tional Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Resear...

  30. [30]

    J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong. Image BERT pre-training with online tokenizer. InInternational Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=ydopy-e6Dg

  31. [31]

    Y . Li, S. Shang, W. Liu, B. Zhan, H. Wang, Y . Wang, Y . Chen, X. Wang, AnYasong, C. Tang, L. Hou, L. Fan, and Z. Zhang. DriveVLA-w0: World models amplify data scaling law in au- tonomous driving. InThe Fourteenth International Conference on Learning Representations,

  32. [32]

    URLhttps://openreview.net/forum?id=plrGn3RdzN

  33. [33]

    Y . Chen, Y . Wang, and Z. Zhang. Drivinggpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 26890–26900, October 2025

  34. [34]

    J. Wang, G. Li, Z. Huang, C. Dang, H. Ye, Y . Han, and L. Chen. Vggdrive: Empowering vision-language models with cross-view geometric grounding for autonomous driving. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10954–10964, June 2026

  35. [35]

    Zhang, H

    Z. Zhang, H. Li, Y . Dai, Z. Zhu, L. Zhou, C. Liu, D. Wang, F. E. H. Tay, S. Chen, Z. Liu, Y . Liu, X. Li, and P. Zhou. From spatial to actions: Grounding vision-language-action model in spatial foundation priors. InThe Fourteenth International Conference on Learning Representations,

  36. [36]

    URLhttps://openreview.net/forum?id=fzmittHfq3

  37. [37]

    S. Zeng, X. Chang, M. Xie, X. Liu, Y . Bai, Z. Pan, M. Xu, and X. Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview. net/forum?id=fCirUh6FRb

  38. [38]

    W. Song, Z. Zhou, H. Zhao, J. Chen, P. Ding, H. Yan, Y . Huang, F. Tang, D. Wang, and H. Li. Reconvla: Reconstructive vision-language-action model as effective robot perceiver.CoRR, abs/2508.10333, August 2025. URLhttps://doi.org/10.48550/arXiv.2508.10333

  39. [39]

    Carion, L

    N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. S. Coll-Vinent, C. Ryali, K. V . Al- wala, H. Khedr, A. Huang, J. Lei, T. Ma, B. Guo, A. Kalla, M. Marks, J. Greer, M. Wang, P. Sun, R. R¨adle, T. Afouras, E. Mavroudi, K. Xu, T.-H. Wu, Y . Zhou, L. Momeni, R. HAZRA, S. Ding, S. Vaze, F. Porcher, F. Li, S. Li, A. Kamath, H. K. Cheng, P. Dollar, N. ...

  40. [40]

    E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Rep- resentations, 2022. URLhttps://openreview.net/forum?id=nZeVKeeFYf9. 11

  41. [42]

    Dauner, M

    D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, A. Geiger, and K. Chitta. NA VSIM: Data-driven non-reactive au- tonomous vehicle simulation and benchmarking. InThe Thirty-eight Conference on Neu- ral Information Processing Systems Datasets and Benchmarks Track, 2024. URLhttps: //openreview.net/...

  42. [43]

    W. Cao, M. Hallgarten, T. Li, D. Dauner, X. Gu, C. Wang, Y . Miron, M. Aiello, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, A. Geiger, and K. Chitta. Pseudo-simulation for autonomous driving. In9th Annual Conference on Robot Learning, 2025. URLhttps: //openreview.net/forum?id=9uKL9FJBiz

  43. [44]

    Loshchilov and F

    I. Loshchilov and F. Hutter. Decoupled weight decay regularization. InInternational Con- ference on Learning Representations, 2019. URLhttps://openreview.net/forum?id= Bkg6RiCqY7

  44. [45]

    L. Rowe, R. de Schaetzen, R. Girgis, C. Pal, and L. Paull. Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving, 2025. URLhttps://arxiv.org/abs/2506.11234

  45. [46]

    Available: https://arxiv.org/abs/2508.06571

    A. Jiang, Y . Gao, Y . Wang, Z. Sun, S. Wang, Y . Heng, H. Sun, S. Tang, L. Zhu, J. Chai, J. Wang, Z. Gu, H. Jiang, and L. Sun. Irl-vla: Training an vision-language-action policy via reward world model, 2025. URLhttps://arxiv.org/abs/2508.06571

  46. [47]

    Z. Dong, Y . Zhu, Y . Wu, and Y . Sun. Frost-drive: Scalable and efficient end-to-end driving with a frozen vision encoder, 2026. URLhttps://arxiv.org/abs/2601.03460

  47. [48]

    D. Wang, Y . Song, Z. He, K. Chen, X. Pan, L. Deng, and W. Gu. Hmvlm: Multistage reasoning-enhanced vision-language model for long-tailed driving scenarios, 2025. URL https://arxiv.org/abs/2506.05883

  48. [49]

    Y . Ma, Y . Cao, W. Ding, S. Zhang, Y . Wang, B. Ivanovic, M. Jiang, M. Pavone, and C. Xiao. dvlm-ad: Enhance diffusion vision-language-model for driving via controllable reasoning,

  49. [50]

    URLhttps://arxiv.org/abs/2512.04459

  50. [51]

    Z. Zhou, T. Cai, S. Zhao, Y . Zhang, Z. Huang, B. Zhou, and J. Ma. Autovla: A vision-language- action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.Advances in Neural Information Processing Systems, 38:27920–27956, 2026

  51. [52]

    R. Xu, H. Lin, W. Jeon, H. Feng, Y . Zou, L. Sun, J. Gorman, E. Tolstaya, S. Tang, B. White, B. Sapp, M. Tan, J.-J. Hwang, and D. Anguelov. Wod-e2e: Waymo open dataset for end-to- end driving in challenging long-tail scenarios, 2025. URLhttps://arxiv.org/abs/2510. 26125

  52. [53]

    Z. Liu, R. Huang, R. Yang, S. Yan, Z. Wang, L. Hou, D. Lin, X. Bai, and H. Zhao. Drivepi: Spatial-aware 4d mllm for unified autonomous driving understanding, perception, prediction and planning, 2025. URLhttps://arxiv.org/abs/2512.12799

  53. [54]

    S. Xing, C. Qian, Y . Wang, H. Hua, K. Tian, Y . Zhou, and Z. Tu. Openemma: Open-source multimodal model for end-to-end autonomous driving, 2025. URLhttps://arxiv.org/ abs/2412.15208

  54. [55]

    Dauner, M

    D. Dauner, M. Hallgarten, A. Geiger, and K. Chitta. Parting with misconceptions about learning-based vehicle motion planning. InConference on Robot Learning, pages 1268–1281. PMLR, 2023. 12

  55. [56]

    Chitta, A

    K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE Transactions on Pattern Analy- sis and Machine Intelligence, 45(11):12878–12895, 2023. doi:10.1109/TPAMI.2022.3200245

  56. [57]

    B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y . Zhang, Q. Zhang, and X. Wang. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 12037–12047, June 2025

  57. [58]

    Z. Xing, X. Zhang, Y . Hu, B. Jiang, T. He, Q. Zhang, X. Long, and W. Yin. Goalflow: Goal- driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1602–1611, June 2025

  58. [59]

    Y . Li, K. Xiong, X. Guo, F. Li, S. Yan, G. Xu, L. Zhou, L. Chen, H. Sun, B. Wang, G. Chen, H. Ye, W. Liu, and X. Wang. ReCogDrive: A reinforced cognitive framework for end-to-end autonomous driving. InInternational Conference on Learning Representations (ICLR), 2026

  59. [60]

    Z. Li, Z. Yu, S. Lan, J. Li, J. Kautz, T. Lu, and J. M. Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14864–14873, June 2024

  60. [61]

    R. Feng, N. Xi, D. Chu, R. Wang, Z. Deng, A. Wang, L. Lu, J. Wang, and Y . Huang. Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for au- tonomous driving.IEEE Robotics and Automation Letters, 11(1):226–233, 2026. doi: 10.1109/LRA.2025.3632616

  61. [62]

    K. Li, Z. Li, S. Lan, Y . Xie, Z. Zhang, J. Liu, Z. Wu, Z. Yu, and J. M. Alvarez. Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation, 2025. URLhttps:// arxiv.org/abs/2503.12820

  62. [63]

    W. Yao, Z. Li, S. Lan, Z. Wang, X. Sun, J. M. Alvarez, and Z. Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning.Proceedings of the AAAI Conference on Artificial Intelligence, 40:11910–11918, 03 2026. doi:10.1609/aaai.v40i14.38178

  63. [64]

    J. Zou, S. Chen, B. Liao, Z. Zheng, Y . Song, L. Zhang, Q. Zhang, W. Liu, and X. Wang. Diffusiondrivev2: Reinforcement learning-constrained truncated diffusion modeling in end- to-end autonomous driving, 2025. URLhttps://arxiv.org/abs/2512.07745. 13