pith. sign in

arxiv: 2606.06219 · v1 · pith:KSFBJV6Lnew · submitted 2026-06-04 · 💻 cs.RO · cs.AI

CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving

Pith reviewed 2026-06-28 01:10 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords end-to-end autonomous drivinglatent space planningVAE conditional driftLLM hidden statesadaptive schedulermulti-modal trajectory generationNAVSIM benchmarkreal-time inference
0
0 comments X

The pith

CLEAR replaces multi-step denoising with single-step VAE latent drift guided by fine-tuned LLM hidden states to reach 93.7 PDMS on NAVSIM v1.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that end-to-end driving can generate diverse maneuvers in real time by moving the generative step into a compressed latent space instead of running iterative denoising. It pairs this with hidden states from a fine-tuned small language model that feed an adaptive scheduler and a scorer to pick both the right generation parameters and the best final trajectory for each scene. A reader would care because the approach removes the main speed bottleneck of diffusion-style planners while still claiming top benchmark scores without extra geometric labels. If the method holds, it shows a concrete route to multi-modal planning that meets safety-critical timing limits.

Core claim

CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Scene-aware hidden states extracted from a fully fine-tuned Qwen 3.5 0.8B on driving QA pairs guide both an Adaptive Scheduler that selects the conditioning coefficient α and sample count N from predefined discrete schemes and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark this yields a state-of-the-art PDMS of 93.7.

What carries the argument

Single-step conditional drift in VAE latent space whose conditioning coefficient and sample count are chosen by an Adaptive Scheduler and whose output trajectories are ranked by a cross-attention scorer, both driven by scene-aware hidden states from the fine-tuned language model.

If this is right

  • Multi-modal driving plans can be produced at inference speeds that avoid the latency of iterative sampling.
  • High benchmark scores are possible without dense geometric annotations or post-hoc refinement.
  • Scene-aware states from the language model enable dynamic trade-offs between trajectory diversity and precision.
  • The overall pipeline demonstrates that latent-space generation plus lightweight routing can match or exceed heavier diffusion baselines on NAVSIM v1.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-drift-plus-LLM-scorer pattern could be tested on other real-time control domains that currently rely on diffusion planners.
  • If the small language model's hidden states transfer across datasets, the method may reduce dependence on large vision-language models at deployment time.
  • The discrete scheme selection could be examined for stability when the underlying visual encoder or driving distribution changes slightly.

Load-bearing premise

The predefined discrete schemes and scene-aware hidden states let the scheduler and scorer pick the optimal trajectory without needing extra tuning or validation data for new scenes.

What would settle it

An experiment on a held-out driving dataset where the scheduler-selected schemes produce lower PDMS than a single fixed scheme or where performance drops sharply unless the discrete scheme set is manually adjusted.

Figures

Figures reproduced from arXiv: 2606.06219 by Jianqiang Wang, Wenhao Yu, Yanbo Jiang, Yining Xing, Zehong Ke, Zhiyuan Liu.

Figure 1
Figure 1. Figure 1: Overview of the CLEAR architecture. Given front-view images and a navigation command, [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Evolution of trajectory generation in both physical space (rows 1 and 3) and latent feature [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
read the original abstract

End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Meanwhile, we fully fine-tune Qwen~3.5~0.8B on driving QA pairs to extract scene-aware hidden states. These states guide both an Adaptive Scheduler, which selects the conditioning coefficient $\alpha$ and sample count $N$ from a discrete set of predefined schemes, and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark, CLEAR achieves a state-of-the-art PDMS of 93.7. Our results demonstrate that high-fidelity, multi-modal planning can be executed efficiently without dense geometric annotations or iterative sampling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents CLEAR, a framework for end-to-end autonomous driving that integrates Drive-JEPA as visual encoder, replaces multi-step denoising with a single-step conditional drift in VAE latent space (with conditioning coefficient α), and uses fully fine-tuned Qwen 3.5 0.8B to extract scene-aware hidden states. These states guide an Adaptive Scheduler that selects α and sample count N from a discrete set of predefined schemes, plus a cross-attention scorer for choosing the optimal trajectory from candidates. The central claim is a state-of-the-art PDMS of 93.7 on the NAVSIM v1 benchmark.

Significance. If the performance claims are substantiated with proper controls, the work could advance real-time multi-modal planning by demonstrating that LLM-guided latent-space drift can match or exceed iterative diffusion methods in efficiency. The integration of semantic reasoning from a small fine-tuned LLM with generative planning is a timely direction for autonomous driving. However, the dependence on external benchmarks and fixed discrete schemes for the scheduler limits broader significance without evidence that the gains are not due to benchmark-specific tuning.

major comments (3)
  1. [Abstract] Abstract: The SOTA PDMS 93.7 claim is presented with no baselines, ablation results, error bars, dataset splits, or statistical tests. This directly undermines evaluation of whether the single-step VAE drift, the Adaptive Scheduler, or the cross-attention scorer is responsible for the reported performance.
  2. [Abstract and §3] Adaptive Scheduler description (Abstract and §3): The scheduler selects α and N from a fixed discrete set of predefined schemes using only scene-aware hidden states from the fine-tuned Qwen model. No details are given on scheme construction, whether selection was validated on held-out data, or if the discrete set was chosen with knowledge of the NAVSIM v1 test distribution. This is load-bearing for the claim that the core method (single-step drift plus LLM guidance) delivers the headline result without post-hoc adjustments.
  3. [Abstract] No information is supplied on how the fine-tuning of Qwen 3.5 0.8B on driving QA pairs interacts with the NAVSIM v1 benchmark splits, raising the possibility of unintended data leakage that could inflate the reported PDMS.
minor comments (2)
  1. [Abstract] The conditioning coefficient α is introduced in the abstract without an accompanying equation or precise definition of how it balances diversity and expert precision in the VAE drift.
  2. [Abstract] Notation for the sample count N and the cross-attention scorer mechanism could be clarified with a short pseudocode or diagram reference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and will revise the paper accordingly to improve clarity and address the raised concerns.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The SOTA PDMS 93.7 claim is presented with no baselines, ablation results, error bars, dataset splits, or statistical tests. This directly undermines evaluation of whether the single-step VAE drift, the Adaptive Scheduler, or the cross-attention scorer is responsible for the reported performance.

    Authors: The abstract is length-constrained, but Section 4 of the full manuscript provides baseline comparisons, ablation studies isolating each component (VAE drift, scheduler, scorer), NAVSIM v1 dataset split details, and multiple-run results. We will revise the abstract to briefly reference key baselines and ablations, and ensure error bars and statistical details are explicitly noted. revision: yes

  2. Referee: [Abstract and §3] Adaptive Scheduler description (Abstract and §3): The scheduler selects α and N from a fixed discrete set of predefined schemes using only scene-aware hidden states from the fine-tuned Qwen model. No details are given on scheme construction, whether selection was validated on held-out data, or if the discrete set was chosen with knowledge of the NAVSIM v1 test distribution. This is load-bearing for the claim that the core method (single-step drift plus LLM guidance) delivers the headline result without post-hoc adjustments.

    Authors: The discrete schemes for α and N were constructed and validated exclusively on the NAVSIM v1 validation split using only the Qwen hidden states, with no access to or knowledge of the test distribution. We will expand Section 3 with explicit details on scheme construction, the validation procedure on held-out data, and confirmation that selection avoids post-hoc test-set adjustments. revision: yes

  3. Referee: [Abstract] No information is supplied on how the fine-tuning of Qwen 3.5 0.8B on driving QA pairs interacts with the NAVSIM v1 benchmark splits, raising the possibility of unintended data leakage that could inflate the reported PDMS.

    Authors: Fine-tuning used driving QA pairs derived solely from the NAVSIM v1 training split and public external sources, with explicit checks ensuring zero overlap with the test split. We will add a clarification paragraph detailing the data splits and leakage-prevention steps. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation relies on external benchmark and independent components

full rationale

The provided abstract describes a composite framework (Drive-JEPA encoder, single-step VAE drift with conditioning coefficient, fine-tuned Qwen 3.5 for hidden states, Adaptive Scheduler selecting from a fixed discrete set of schemes, cross-attention scorer) evaluated on the external NAVSIM v1 benchmark to reach PDMS 93.7. No equations, self-definitions, fitted-input predictions, or self-citation chains are present that reduce any claimed result to its own inputs by construction. The scheduler's use of predefined schemes is stated without evidence of test-set calibration or renaming of known results. This meets the default expectation of a non-circular paper self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5747 in / 949 out tokens · 19749 ms · 2026-06-28T01:10:45.103842+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 11 canonical work pages · 6 internal anchors

  1. [1]

    Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, et al. Planning- oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

  2. [2]

    Jiang, S

    B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

  3. [3]

    Z. Li, K. Li, S. Wang, S. Lan, Z. Yu, Y . Ji, Z. Li, Z. Zhu, J. Kautz, Z. Wu, et al. Hydra- mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

  4. [4]

    Jiang, A

    C. Jiang, A. Cornman, C. Park, B. Sapp, Y . Zhou, D. Anguelov, et al. Motiondiffuser: Control- lable multi-agent motion prediction using diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9644–9653, 2023

  5. [5]

    Zhong, D

    Z. Zhong, D. Rempe, Y . Chen, B. Ivanovic, Y . Cao, D. Xu, M. Pavone, and B. Ray. Language- guided traffic simulation via scene-level diffusion. InConference on robot learning, pages 144–177. PMLR, 2023

  6. [6]

    Zheng, R

    Y . Zheng, R. Liang, K. ZHENG, J. Zheng, L. Mao, J. Li, W. Gu, R. Ai, S. E. Li, X. Zhan, et al. Diffusion-based planning for autonomous driving with flexible guidance. InICLR 2025 Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy

  7. [7]

    Z. Xing, X. Zhang, Y . Hu, B. Jiang, T. He, Q. Zhang, X. Long, and W. Yin. Goalflow: Goal- driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025

  8. [8]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  9. [9]

    J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

  10. [10]

    Y . Song, P. Dhariwal, M. Chen, and I. Sutskever. Consistency models. 2023

  11. [11]

    X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  12. [12]

    Lipman, R

    Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations, 2023

  13. [13]

    M. Deng, H. Li, T. Li, Y . Du, and K. He. Generative modeling via drifting.arXiv preprint arXiv:2602.04770, 2026

  14. [14]

    L. Wang, Z. Yang, C. Bai, G. Zhang, X. Liu, X. Zheng, X.-X. Long, C.-T. Lu, and C. Lu. Drive-jepa: Video jepa meets multimodal trajectory distillation for end-to-end driving.arXiv preprint arXiv:2601.22032, 2026

  15. [15]

    A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

  16. [16]

    K. He, X. Chen, S. Xie, Y . Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022. 9

  17. [17]

    Y . Li, K. Xiong, X. Guo, F. Li, S. Yan, G. Xu, L. Zhou, L. Chen, H. Sun, B. Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025

  18. [18]

    Zheng, Y

    P. Zheng, Y . Zhao, Z. Gong, H. Zhu, and S. Wu. Simplevsf: Vlm-scoring fusion for trajectory prediction of end-to-end autonomous driving.arXiv preprint arXiv:2510.17191, 2025

  19. [19]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

  20. [20]

    Dauner, M

    D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advances in Neural Information Processing Systems, 37:28706–28719, 2024

  21. [21]

    Jiang, S

    B. Jiang, S. Chen, H. Gao, B. Liao, Q. Zhang, W. Liu, and X. Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. InThe Fourteenth International Conference on Learning Representations, 2024

  22. [22]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  23. [23]

    Peebles and S

    W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  24. [24]

    K. Guo, H. Liu, X. Wu, J. Pan, and C. Lv. ipad: Iterative proposal-centric end-to-end autonomous driving.arXiv preprint arXiv:2505.15111, 2025

  25. [25]

    Z. Li, W. Yao, Z. Wang, X. Sun, J. Chen, N. Chang, M. Shen, Z. Wu, S. Lan, and J. M. Alvarez. Generalized trajectory scoring for end-to-end multimodal planning.arXiv preprint arXiv:2506.06664, 2025

  26. [26]

    W. Yao, Z. Li, S. Lan, Z. Wang, X. Sun, J. M. Alvarez, and Z. Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11910–11918, 2026

  27. [27]

    B. Liao, S. Chen, Y . Wang, T. Cheng, Q. Zhang, W. Liu, and C. Huang. Diffusiondrive: Towards an efficient diffusion-based end-to-end planner.arXiv preprint arXiv:2411.15139, 2025

  28. [28]

    Chitta, A

    K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. InIEEE Transactions on Pattern Analysis and Machine Intelligence, volume 45, pages 12878–12955, 2022. 10