CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving
Pith reviewed 2026-06-28 01:10 UTC · model grok-4.3
The pith
CLEAR replaces multi-step denoising with single-step VAE latent drift guided by fine-tuned LLM hidden states to reach 93.7 PDMS on NAVSIM v1.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Scene-aware hidden states extracted from a fully fine-tuned Qwen 3.5 0.8B on driving QA pairs guide both an Adaptive Scheduler that selects the conditioning coefficient α and sample count N from predefined discrete schemes and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark this yields a state-of-the-art PDMS of 93.7.
What carries the argument
Single-step conditional drift in VAE latent space whose conditioning coefficient and sample count are chosen by an Adaptive Scheduler and whose output trajectories are ranked by a cross-attention scorer, both driven by scene-aware hidden states from the fine-tuned language model.
If this is right
- Multi-modal driving plans can be produced at inference speeds that avoid the latency of iterative sampling.
- High benchmark scores are possible without dense geometric annotations or post-hoc refinement.
- Scene-aware states from the language model enable dynamic trade-offs between trajectory diversity and precision.
- The overall pipeline demonstrates that latent-space generation plus lightweight routing can match or exceed heavier diffusion baselines on NAVSIM v1.
Where Pith is reading between the lines
- The same latent-drift-plus-LLM-scorer pattern could be tested on other real-time control domains that currently rely on diffusion planners.
- If the small language model's hidden states transfer across datasets, the method may reduce dependence on large vision-language models at deployment time.
- The discrete scheme selection could be examined for stability when the underlying visual encoder or driving distribution changes slightly.
Load-bearing premise
The predefined discrete schemes and scene-aware hidden states let the scheduler and scorer pick the optimal trajectory without needing extra tuning or validation data for new scenes.
What would settle it
An experiment on a held-out driving dataset where the scheduler-selected schemes produce lower PDMS than a single fixed scheme or where performance drops sharply unless the discrete scheme set is manually adjusted.
Figures
read the original abstract
End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Meanwhile, we fully fine-tune Qwen~3.5~0.8B on driving QA pairs to extract scene-aware hidden states. These states guide both an Adaptive Scheduler, which selects the conditioning coefficient $\alpha$ and sample count $N$ from a discrete set of predefined schemes, and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark, CLEAR achieves a state-of-the-art PDMS of 93.7. Our results demonstrate that high-fidelity, multi-modal planning can be executed efficiently without dense geometric annotations or iterative sampling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents CLEAR, a framework for end-to-end autonomous driving that integrates Drive-JEPA as visual encoder, replaces multi-step denoising with a single-step conditional drift in VAE latent space (with conditioning coefficient α), and uses fully fine-tuned Qwen 3.5 0.8B to extract scene-aware hidden states. These states guide an Adaptive Scheduler that selects α and sample count N from a discrete set of predefined schemes, plus a cross-attention scorer for choosing the optimal trajectory from candidates. The central claim is a state-of-the-art PDMS of 93.7 on the NAVSIM v1 benchmark.
Significance. If the performance claims are substantiated with proper controls, the work could advance real-time multi-modal planning by demonstrating that LLM-guided latent-space drift can match or exceed iterative diffusion methods in efficiency. The integration of semantic reasoning from a small fine-tuned LLM with generative planning is a timely direction for autonomous driving. However, the dependence on external benchmarks and fixed discrete schemes for the scheduler limits broader significance without evidence that the gains are not due to benchmark-specific tuning.
major comments (3)
- [Abstract] Abstract: The SOTA PDMS 93.7 claim is presented with no baselines, ablation results, error bars, dataset splits, or statistical tests. This directly undermines evaluation of whether the single-step VAE drift, the Adaptive Scheduler, or the cross-attention scorer is responsible for the reported performance.
- [Abstract and §3] Adaptive Scheduler description (Abstract and §3): The scheduler selects α and N from a fixed discrete set of predefined schemes using only scene-aware hidden states from the fine-tuned Qwen model. No details are given on scheme construction, whether selection was validated on held-out data, or if the discrete set was chosen with knowledge of the NAVSIM v1 test distribution. This is load-bearing for the claim that the core method (single-step drift plus LLM guidance) delivers the headline result without post-hoc adjustments.
- [Abstract] No information is supplied on how the fine-tuning of Qwen 3.5 0.8B on driving QA pairs interacts with the NAVSIM v1 benchmark splits, raising the possibility of unintended data leakage that could inflate the reported PDMS.
minor comments (2)
- [Abstract] The conditioning coefficient α is introduced in the abstract without an accompanying equation or precise definition of how it balances diversity and expert precision in the VAE drift.
- [Abstract] Notation for the sample count N and the cross-attention scorer mechanism could be clarified with a short pseudocode or diagram reference.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment below and will revise the paper accordingly to improve clarity and address the raised concerns.
read point-by-point responses
-
Referee: [Abstract] Abstract: The SOTA PDMS 93.7 claim is presented with no baselines, ablation results, error bars, dataset splits, or statistical tests. This directly undermines evaluation of whether the single-step VAE drift, the Adaptive Scheduler, or the cross-attention scorer is responsible for the reported performance.
Authors: The abstract is length-constrained, but Section 4 of the full manuscript provides baseline comparisons, ablation studies isolating each component (VAE drift, scheduler, scorer), NAVSIM v1 dataset split details, and multiple-run results. We will revise the abstract to briefly reference key baselines and ablations, and ensure error bars and statistical details are explicitly noted. revision: yes
-
Referee: [Abstract and §3] Adaptive Scheduler description (Abstract and §3): The scheduler selects α and N from a fixed discrete set of predefined schemes using only scene-aware hidden states from the fine-tuned Qwen model. No details are given on scheme construction, whether selection was validated on held-out data, or if the discrete set was chosen with knowledge of the NAVSIM v1 test distribution. This is load-bearing for the claim that the core method (single-step drift plus LLM guidance) delivers the headline result without post-hoc adjustments.
Authors: The discrete schemes for α and N were constructed and validated exclusively on the NAVSIM v1 validation split using only the Qwen hidden states, with no access to or knowledge of the test distribution. We will expand Section 3 with explicit details on scheme construction, the validation procedure on held-out data, and confirmation that selection avoids post-hoc test-set adjustments. revision: yes
-
Referee: [Abstract] No information is supplied on how the fine-tuning of Qwen 3.5 0.8B on driving QA pairs interacts with the NAVSIM v1 benchmark splits, raising the possibility of unintended data leakage that could inflate the reported PDMS.
Authors: Fine-tuning used driving QA pairs derived solely from the NAVSIM v1 training split and public external sources, with explicit checks ensuring zero overlap with the test split. We will add a clarification paragraph detailing the data splits and leakage-prevention steps. revision: yes
Circularity Check
No circularity; derivation relies on external benchmark and independent components
full rationale
The provided abstract describes a composite framework (Drive-JEPA encoder, single-step VAE drift with conditioning coefficient, fine-tuned Qwen 3.5 for hidden states, Adaptive Scheduler selecting from a fixed discrete set of schemes, cross-attention scorer) evaluated on the external NAVSIM v1 benchmark to reach PDMS 93.7. No equations, self-definitions, fitted-input predictions, or self-citation chains are present that reduce any claimed result to its own inputs by construction. The scheduler's use of predefined schemes is stated without evidence of test-set calibration or renaming of known results. This meets the default expectation of a non-circular paper self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, et al. Planning- oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023
2023
-
[2]
Jiang, S
B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023
2023
-
[3]
Z. Li, K. Li, S. Wang, S. Lan, Z. Yu, Y . Ji, Z. Li, Z. Zhu, J. Kautz, Z. Wu, et al. Hydra- mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Jiang, A
C. Jiang, A. Cornman, C. Park, B. Sapp, Y . Zhou, D. Anguelov, et al. Motiondiffuser: Control- lable multi-agent motion prediction using diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9644–9653, 2023
2023
-
[5]
Zhong, D
Z. Zhong, D. Rempe, Y . Chen, B. Ivanovic, Y . Cao, D. Xu, M. Pavone, and B. Ray. Language- guided traffic simulation via scene-level diffusion. InConference on robot learning, pages 144–177. PMLR, 2023
2023
-
[6]
Zheng, R
Y . Zheng, R. Liang, K. ZHENG, J. Zheng, L. Mao, J. Li, W. Gu, R. Ai, S. E. Li, X. Zhan, et al. Diffusion-based planning for autonomous driving with flexible guidance. InICLR 2025 Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy
2025
-
[7]
Z. Xing, X. Zhang, Y . Hu, B. Jiang, T. He, Q. Zhang, X. Long, and W. Yin. Goalflow: Goal- driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025
2025
-
[8]
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
2020
-
[9]
J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[10]
Y . Song, P. Dhariwal, M. Chen, and I. Sutskever. Consistency models. 2023
2023
-
[11]
X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[12]
Lipman, R
Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations, 2023
2023
-
[13]
M. Deng, H. Li, T. Li, Y . Du, and K. He. Generative modeling via drifting.arXiv preprint arXiv:2602.04770, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
- [14]
-
[15]
A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[16]
K. He, X. Chen, S. Xie, Y . Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022. 9
2022
-
[17]
Y . Li, K. Xiong, X. Guo, F. Li, S. Yan, G. Xu, L. Zhou, L. Chen, H. Sun, B. Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [18]
-
[19]
Qwen3.5: Towards native multimodal agents, February 2026
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5
2026
-
[20]
Dauner, M
D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advances in Neural Information Processing Systems, 37:28706–28719, 2024
2024
-
[21]
Jiang, S
B. Jiang, S. Chen, H. Gao, B. Liao, Q. Zhang, W. Liu, and X. Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. InThe Fourteenth International Conference on Learning Representations, 2024
2024
-
[22]
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022
2022
-
[23]
Peebles and S
W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
2023
- [24]
- [25]
-
[26]
W. Yao, Z. Li, S. Lan, Z. Wang, X. Sun, J. M. Alvarez, and Z. Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11910–11918, 2026
2026
- [27]
-
[28]
Chitta, A
K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. InIEEE Transactions on Pattern Analysis and Machine Intelligence, volume 45, pages 12878–12955, 2022. 10
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.