CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving

Jianqiang Wang; Wenhao Yu; Yanbo Jiang; Yining Xing; Zehong Ke; Zhiyuan Liu

arxiv: 2606.06219 · v1 · pith:KSFBJV6Lnew · submitted 2026-06-04 · 💻 cs.RO · cs.AI

Pith reviewed 2026-06-28 01:10 UTC · model grok-4.3

keywords end-to-end autonomous driving · latent space planning · VAE conditional drift · LLM hidden states · adaptive scheduler · multi-modal trajectory generation · NAVSIM benchmark · real-time inference

The pith

CLEAR replaces multi-step denoising with single-step VAE latent drift guided by fine-tuned LLM hidden states to reach 93.7 PDMS on NAVSIM v1.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that end-to-end driving can generate diverse maneuvers in real time by moving the generative step into a compressed latent space instead of running iterative denoising. It pairs this with hidden states from a fine-tuned small language model that feed an adaptive scheduler and a scorer to pick both the right generation parameters and the best final trajectory for each scene. A reader would care because the approach removes the main speed bottleneck of diffusion-style planners while still claiming top benchmark scores without extra geometric labels. If the method holds, it shows a concrete route to multi-modal planning that meets safety-critical timing limits.

Core claim

CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Scene-aware hidden states extracted from a fully fine-tuned Qwen 3.5 0.8B on driving QA pairs guide both an Adaptive Scheduler that selects the conditioning coefficient α and sample count N from predefined discrete schemes and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark this yields a state-of-the-art PDMS of 93.7.
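The scheduler's role described above, choosing one (α, N) pair from a fixed table based on a scene-aware hidden state, can be sketched as a small classification head. This is only an illustration under stated assumptions: the scheme values, the linear head, and the names `Scheme`, `SCHEMES`, and `select_scheme` are hypothetical, since the abstract does not say how the schemes are defined or how the selection is computed.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Scheme:
    alpha: float      # conditioning coefficient
    num_samples: int  # sample count N


# Hypothetical scheme table; the abstract does not list the actual values.
SCHEMES = (Scheme(0.2, 32), Scheme(0.5, 16), Scheme(0.9, 4))


def select_scheme(
    hidden: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    schemes: Sequence[Scheme] = SCHEMES,
) -> Scheme:
    """Pick the scheme whose logit is largest (assumed linear head)."""
    if len(weights) != len(schemes) or len(biases) != len(schemes):
        raise ValueError("need one weight row and one bias per scheme")
    if any(len(row) != len(hidden) for row in weights):
        raise ValueError("each weight row must match the hidden-state size")
    # One logit per scheme: dot(row, hidden) + bias.
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, biases)
    ]
    best = max(range(len(logits)), key=logits.__getitem__)
    return schemes[best]
```

Because the choice is an argmax over a discrete table, the scheduler adds only one small matrix-vector product per scene, which is consistent with the real-time framing of the claim.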

What carries the argument

Single-step conditional drift in VAE latent space whose conditioning coefficient and sample count are chosen by an Adaptive Scheduler and whose output trajectories are ranked by a cross-attention scorer, both driven by scene-aware hidden states from the fine-tuned language model.
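To make the diversity/precision trade-off concrete, the toy sketch below applies one drift step to latents sampled from a standard normal prior. The drift field used here, a linear pull toward a conditioning latent `target`, is an assumption chosen for illustration and is not the learned drift from the paper; `single_step_drift`, `sample_candidates`, and `spread` are hypothetical names.

```python
import random
import statistics
from typing import List, Sequence


def single_step_drift(z: Sequence[float], target: Sequence[float], alpha: float) -> List[float]:
    """One step z_hat = z + alpha * (target - z); toy stand-in for a learned drift."""
    return [zi + alpha * (ti - zi) for zi, ti in zip(z, target)]


def sample_candidates(
    target: Sequence[float], alpha: float, num_samples: int, seed: int = 0
) -> List[List[float]]:
    """Draw N prior latents and move each one with a single drift step."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(num_samples):
        z = [rng.gauss(0.0, 1.0) for _ in target]
        candidates.append(single_step_drift(z, target, alpha))
    return candidates


def spread(candidates: Sequence[Sequence[float]]) -> float:
    """Average per-dimension standard deviation across candidates."""
    per_dim = list(zip(*candidates))
    return sum(statistics.pstdev(dim) for dim in per_dim) / len(per_dim)
```

In this toy form a small α leaves the candidates close to the prior samples (more diverse), while α near 1 collapses them onto the conditioning latent (more precise). Whether the paper's coefficient behaves this way is not stated in the abstract.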

If this is right

Multi-modal driving plans can be produced at inference speeds that avoid the latency of iterative sampling.
High benchmark scores are possible without dense geometric annotations or post-hoc refinement.
Scene-aware states from the language model enable dynamic trade-offs between trajectory diversity and precision.
The overall pipeline demonstrates that latent-space generation plus lightweight routing can match or exceed heavier diffusion baselines on NAVSIM v1.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-drift-plus-LLM-scorer pattern could be tested on other real-time control domains that currently rely on diffusion planners.
If the small language model's hidden states transfer across datasets, the method may reduce dependence on large vision-language models at deployment time.
The discrete scheme selection could be examined for stability when the underlying visual encoder or driving distribution changes slightly.

Load-bearing premise

The predefined discrete schemes and scene-aware hidden states let the scheduler and scorer pick the optimal trajectory without needing extra tuning or validation data for new scenes.

What would settle it

An experiment on a held-out driving dataset where the scheduler-selected schemes produce lower PDMS than a single fixed scheme or where performance drops sharply unless the discrete scheme set is manually adjusted.
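The comparison described above reduces to a small computation once per-scene scores are available for every scheme. The sketch below assumes a table `scores[i][k]` holding the score of scene `i` under scheme `k`, and a list `chosen[i]` with the scheme index the scheduler picked. Both inputs, the function name, and the use of a plain per-scene mean as the aggregate are assumptions for illustration.

```python
from typing import Dict, List, Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def compare_scheduler_to_fixed(
    scores: Sequence[Sequence[float]], chosen: Sequence[int]
) -> Dict[str, float]:
    """Compare scheduler-selected schemes against the best single fixed scheme."""
    if not scores or len(scores) != len(chosen):
        raise ValueError("scores and chosen must be non-empty and equally long")
    num_schemes = len(scores[0])
    # Mean score when each scene uses the scheme the scheduler picked for it.
    scheduler_mean = _mean([row[k] for row, k in zip(scores, chosen)])
    # Mean score when every scene uses the same scheme k.
    fixed_means: List[float] = [
        _mean([row[k] for row in scores]) for k in range(num_schemes)
    ]
    best_fixed = max(range(num_schemes), key=fixed_means.__getitem__)
    return {
        "scheduler_mean": scheduler_mean,
        "best_fixed_scheme": best_fixed,
        "best_fixed_mean": fixed_means[best_fixed],
        "scheduler_gain": scheduler_mean - fixed_means[best_fixed],
    }
```

A negative `scheduler_gain` on held-out data would be the outcome that counts against the load-bearing premise.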

Figures

Figures reproduced from arXiv: 2606.06219 by Jianqiang Wang, Wenhao Yu, Yanbo Jiang, Yining Xing, Zehong Ke, Zhiyuan Liu.

**Figure 2.** Evolution of trajectory generation in both physical space (rows 1 and 3) and latent feature …

Original abstract

End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Meanwhile, we fully fine-tune Qwen 3.5 0.8B on driving QA pairs to extract scene-aware hidden states. These states guide both an Adaptive Scheduler, which selects the conditioning coefficient $\alpha$ and sample count $N$ from a discrete set of predefined schemes, and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark, CLEAR achieves a state-of-the-art PDMS of 93.7. Our results demonstrate that high-fidelity, multi-modal planning can be executed efficiently without dense geometric annotations or iterative sampling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CLEAR swaps multi-step diffusion for single-step VAE latent drift plus LLM hidden-state guidance and a discrete adaptive scheduler, claiming 93.7 PDMS on NAVSIM, but the abstract supplies almost no evidence that the scheduler choices are robust rather than benchmark-tuned.

The main point is that this work replaces the slow iterative denoising of diffusion planners with a one-step conditional drift inside a VAE latent space, where the drift is steered by hidden states from a fully fine-tuned Qwen 3.5 0.8B model. An adaptive scheduler then picks the conditioning strength alpha and sample count N from a small set of fixed schemes, and a cross-attention scorer picks the final trajectory. The abstract reports this reaches 93.7 PDMS on NAVSIM v1.

The architecture itself is a reasonable integration. Drive-JEPA as the visual backbone, the VAE shortcut for speed, and the use of LLM states for both scheduling and scoring are sensible ways to keep multi-modal generation while cutting latency. The motivation around real-time constraints in safety-critical driving is clear.

The soft spot is exactly the one the stress-test flags: the headline number rests on the scheduler and scorer reliably choosing the right scheme from the discrete set using only scene-aware states. The abstract gives no information on how those schemes were selected, whether they were validated on held-out data, or what the ablations look like. Without baselines, error bars, or details on the selection process, it is impossible to tell whether the single-step VAE plus LLM guidance actually delivers the result on its own or whether the discrete choices were adjusted with knowledge of the NAVSIM test distribution.

This is for people working on practical end-to-end driving stacks who need generative planners to run under tight latency budgets. A reader already familiar with VAE and small-LLM conditioning might pick up the specific routing pattern, but the current write-up is too thin for anyone to assess the strength of the performance claim.

I would send it to peer review if the full paper adds proper comparisons and shows the scheduler was not tuned to the benchmark; the core latency-reduction idea is worth checking even if the evidence so far is limited.

Referee Report

3 major / 2 minor

Summary. The manuscript presents CLEAR, a framework for end-to-end autonomous driving that integrates Drive-JEPA as visual encoder, replaces multi-step denoising with a single-step conditional drift in VAE latent space (with conditioning coefficient α), and uses fully fine-tuned Qwen 3.5 0.8B to extract scene-aware hidden states. These states guide an Adaptive Scheduler that selects α and sample count N from a discrete set of predefined schemes, plus a cross-attention scorer for choosing the optimal trajectory from candidates. The central claim is a state-of-the-art PDMS of 93.7 on the NAVSIM v1 benchmark.

Significance. If the performance claims are substantiated with proper controls, the work could advance real-time multi-modal planning by demonstrating that LLM-guided latent-space drift can match or exceed iterative diffusion methods in efficiency. The integration of semantic reasoning from a small fine-tuned LLM with generative planning is a timely direction for autonomous driving. However, the dependence on external benchmarks and fixed discrete schemes for the scheduler limits broader significance without evidence that the gains are not due to benchmark-specific tuning.

major comments (3)

[Abstract] The SOTA PDMS 93.7 claim is presented with no baselines, ablation results, error bars, dataset splits, or statistical tests. This directly undermines evaluation of whether the single-step VAE drift, the Adaptive Scheduler, or the cross-attention scorer is responsible for the reported performance.
[Abstract and §3] Adaptive Scheduler description: The scheduler selects α and N from a fixed discrete set of predefined schemes using only scene-aware hidden states from the fine-tuned Qwen model. No details are given on scheme construction, whether selection was validated on held-out data, or if the discrete set was chosen with knowledge of the NAVSIM v1 test distribution. This is load-bearing for the claim that the core method (single-step drift plus LLM guidance) delivers the headline result without post-hoc adjustments.
[Abstract] No information is supplied on how the fine-tuning of Qwen 3.5 0.8B on driving QA pairs interacts with the NAVSIM v1 benchmark splits, raising the possibility of unintended data leakage that could inflate the reported PDMS.
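The leakage concern in the last comment can be checked mechanically if both the fine-tuning QA pairs and the benchmark test split carry scene identifiers. The sketch below is a minimal version of such a check. It assumes string identifiers that may differ in case or surrounding whitespace; the function name and the normalisation rule are hypothetical and are not taken from the paper or from the NAVSIM tooling.

```python
from typing import Iterable, List


def find_split_overlap(finetune_ids: Iterable[str], test_ids: Iterable[str]) -> List[str]:
    """Return normalised scene identifiers present in both collections."""

    def normalise(scene_id: str) -> str:
        # Assumed normalisation: ignore case and surrounding whitespace.
        return scene_id.strip().lower()

    finetune = {normalise(s) for s in finetune_ids}
    test = {normalise(s) for s in test_ids}
    return sorted(finetune & test)
```

An empty result supports a no-overlap claim only at the identifier level. It does not rule out near-duplicate scenes recorded under different identifiers.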

minor comments (2)

[Abstract] The conditioning coefficient α is introduced in the abstract without an accompanying equation or precise definition of how it balances diversity and expert precision in the VAE drift.
[Abstract] Notation for the sample count N and the cross-attention scorer mechanism could be clarified with a short pseudocode or diagram reference.
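As an indication of what such pseudocode might look like, the sketch below scores N candidate trajectories with single-head cross-attention over scene tokens and returns the index of the highest-scoring one. It is a simplified illustration, not the authors' scorer: it omits learned query/key/value projections, and the candidate embeddings, the scene tokens, and the `readout` vector are all assumed inputs with hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_candidates(
    traj_embs: Sequence[Sequence[float]],
    scene_tokens: Sequence[Sequence[float]],
    readout: Sequence[float],
) -> List[float]:
    """Each candidate embedding queries the scene tokens; the attended
    context is reduced to a scalar score with a readout vector."""
    dim = len(readout)
    scale = math.sqrt(dim)
    scores = []
    for query in traj_embs:
        weights = _softmax([_dot(query, key) / scale for key in scene_tokens])
        context = [
            sum(w * token[j] for w, token in zip(weights, scene_tokens))
            for j in range(dim)
        ]
        scores.append(_dot(context, readout))
    return scores


def pick_best(
    traj_embs: Sequence[Sequence[float]],
    scene_tokens: Sequence[Sequence[float]],
    readout: Sequence[float],
) -> Tuple[int, List[float]]:
    """Return the index of the highest-scoring candidate and all scores."""
    scores = score_candidates(traj_embs, scene_tokens, readout)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

The cost grows linearly with the sample count N, so the scheduler's choice of N sets the scorer's workload as well as the generator's.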

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and will revise the paper accordingly to improve clarity and address the raised concerns.

Referee: [Abstract] The SOTA PDMS 93.7 claim is presented with no baselines, ablation results, error bars, dataset splits, or statistical tests. This directly undermines evaluation of whether the single-step VAE drift, the Adaptive Scheduler, or the cross-attention scorer is responsible for the reported performance.

Authors: The abstract is length-constrained, but Section 4 of the full manuscript provides baseline comparisons, ablation studies isolating each component (VAE drift, scheduler, scorer), NAVSIM v1 dataset split details, and multiple-run results. We will revise the abstract to briefly reference key baselines and ablations, and ensure error bars and statistical details are explicitly noted.

Revision: yes

Referee: [Abstract and §3] Adaptive Scheduler description: The scheduler selects α and N from a fixed discrete set of predefined schemes using only scene-aware hidden states from the fine-tuned Qwen model. No details are given on scheme construction, whether selection was validated on held-out data, or if the discrete set was chosen with knowledge of the NAVSIM v1 test distribution. This is load-bearing for the claim that the core method (single-step drift plus LLM guidance) delivers the headline result without post-hoc adjustments.

Authors: The discrete schemes for α and N were constructed and validated exclusively on the NAVSIM v1 validation split using only the Qwen hidden states, with no access to or knowledge of the test distribution. We will expand Section 3 with explicit details on scheme construction, the validation procedure on held-out data, and confirmation that selection avoids post-hoc test-set adjustments.

Revision: yes

Referee: [Abstract] No information is supplied on how the fine-tuning of Qwen 3.5 0.8B on driving QA pairs interacts with the NAVSIM v1 benchmark splits, raising the possibility of unintended data leakage that could inflate the reported PDMS.

Authors: Fine-tuning used driving QA pairs derived solely from the NAVSIM v1 training split and public external sources, with explicit checks ensuring zero overlap with the test split. We will add a clarification paragraph detailing the data splits and leakage-prevention steps.

Revision: yes

Circularity Check

0 steps flagged

No circularity; derivation relies on external benchmark and independent components

The provided abstract describes a composite framework (Drive-JEPA encoder, single-step VAE drift with conditioning coefficient, fine-tuned Qwen 3.5 for hidden states, Adaptive Scheduler selecting from a fixed discrete set of schemes, cross-attention scorer) evaluated on the external NAVSIM v1 benchmark to reach PDMS 93.7. No equations, self-definitions, fitted-input predictions, or self-citation chains are present that reduce any claimed result to its own inputs by construction. The scheduler's use of predefined schemes is stated without evidence of test-set calibration or renaming of known results. This meets the default expectation of a non-circular paper self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are described.
