pith. machine review for the scientific record.

arxiv: 2604.12208 · v1 · submitted 2026-04-14 · 💻 cs.RO · cs.AI

Recognition: unknown

Unveiling the Surprising Efficacy of Navigation Understanding in End-to-End Autonomous Driving

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 16:25 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords autonomous driving · end-to-end systems · navigation guidance · global navigation · trajectory planning · visual question answering · sequential representation

The pith

End-to-end autonomous driving improves when global navigation is modeled precisely as sequential guidance rather than ignored in favor of local scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Many end-to-end autonomous driving systems fail to make good use of global navigation information and instead lean too heavily on local scene details, resulting in weak links between route inputs and planned paths. The paper introduces the Sequential Navigation Guidance framework to represent global navigation efficiently using real-world patterns, covering both overall paths and specific turn instructions. A new dataset aligns these global elements with local planning through visual questions, and a fusion model demonstrates that this focused navigation modeling produces leading results on standard tests. This occurs without the usual extra training objectives tied to perception, highlighting that navigation understanding can be a key lever for better driving agents.

Core claim

By representing global navigation as Sequential Navigation Guidance, which pairs navigation paths for long-term constraints with turn-by-turn details for immediate choices, the SNG-VLA model fuses this guidance with local planning to achieve state-of-the-art performance in end-to-end autonomous driving, overcoming the weak correlation between planning and navigation input seen in prior systems.

What carries the argument

The Sequential Navigation Guidance (SNG) framework, an efficient encoding of global navigation information based on real-world patterns that provides both navigation paths and turn-by-turn information to guide the fusion model.
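
The page does not spell out the SNG schema beyond paths plus turn-by-turn information, so here is a minimal hypothetical sketch of what such a record could look like as a planner input. Every field name, the units, and the text serialization are editorial assumptions, not the authors' design.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical sketch of a Sequential Navigation Guidance record. Field names,
# structure, and the text serialization are editorial assumptions; the paper's
# actual SNG encoding is not specified on this page.

@dataclass
class TurnByTurnStep:
    distance_m: float    # distance from ego to the maneuver point
    maneuver: str        # e.g. "turn_left", "turn_right", "keep_straight"
    road_name: str = ""  # optional human-readable road label

@dataclass
class SequentialNavigationGuidance:
    # Long-horizon constraint: a sparse route polyline (x, y) ahead of the ego vehicle.
    navigation_path: List[Tuple[float, float]]
    # Short-horizon decision logic: ordered turn-by-turn (TBT) instructions.
    tbt_steps: List[TurnByTurnStep] = field(default_factory=list)

    def as_prompt(self) -> str:
        """Serialize the guidance to text, one plausible input form for a VLA-style model."""
        path = "; ".join(f"({x:.1f}, {y:.1f})" for x, y in self.navigation_path)
        steps = "; ".join(f"in {s.distance_m:.0f} m {s.maneuver}" for s in self.tbt_steps)
        return f"route: {path}. instructions: {steps}."

# Example:
# sng = SequentialNavigationGuidance(
#     navigation_path=[(0.0, 0.0), (50.0, 2.0), (80.0, 30.0)],
#     tbt_steps=[TurnByTurnStep(80.0, "turn_left", "Elm St")],
# )
# print(sng.as_prompt())
```

Whether SNG-VLA actually consumes the guidance as text, discrete tokens, or continuous embeddings is not stated on this page; the sketch only makes the long-horizon path versus short-horizon instruction split concrete.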

If this is right

  • End-to-end models become better at navigation-following in complex driving scenarios.
  • State-of-the-art results are reached without relying on auxiliary loss functions from perception tasks.
  • The SNG-QA dataset enables training that aligns global navigation with local scene understanding.
  • Planning capabilities strengthen directly through improved navigation input modeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar sequential representations could improve navigation in other autonomous systems like delivery robots or drones.
  • Emphasizing navigation data quality might reduce the need for ever-larger models in driving tasks.
  • Integrating this approach with real-time mapping services could lead to more robust route adherence in changing environments.

Load-bearing premise

That the weak correlation between planning and navigation input in existing end-to-end systems is a common issue that the SNG representation will consistently resolve when incorporated.

What would settle it

Evaluating the proposed model against baselines on a modified test set where navigation inputs are altered or removed to check if the performance advantage disappears.
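
As a purely editorial illustration of that test: re-run the trained planner on identical scenes while removing or randomizing only the navigation input, and check whether its margin over baselines survives. The planner.plan(...) interface, the scene dictionaries, and the ADE metric below are hypothetical stand-ins, not the paper's evaluation harness.

```python
import numpy as np

# Editorial sketch of the decisive control experiment: evaluate the same trained
# planner on identical scenes while ablating only the navigation input.
# `planner.plan(...)`, the scene dictionaries, and the ADE metric are
# hypothetical stand-ins, not the paper's evaluation code.

def ade(pred_traj: np.ndarray, gt_traj: np.ndarray) -> float:
    """Average displacement error between predicted and ground-truth waypoints."""
    return float(np.mean(np.linalg.norm(pred_traj - gt_traj, axis=-1)))

def evaluate(planner, scenes, nav_condition: str, seed: int = 0) -> float:
    """Mean ADE under one navigation condition: 'original', 'removed', or 'randomized'."""
    rng = np.random.default_rng(seed)
    errors = []
    for scene in scenes:
        nav = scene["navigation"]
        if nav_condition == "removed":
            nav = None
        elif nav_condition == "randomized":
            # Route taken from an unrelated scene; all other factors stay fixed.
            nav = scenes[rng.integers(len(scenes))]["navigation"]
        pred = np.asarray(planner.plan(scene["observation"], nav))
        errors.append(ade(pred, np.asarray(scene["gt_trajectory"])))
    return float(np.mean(errors))

# If the 'original' condition is no better than 'removed' or 'randomized',
# the claimed navigation advantage does not hold.
# for cond in ("original", "removed", "randomized"):
#     print(cond, evaluate(planner, scenes, cond))
```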

Figures

Figures reproduced from arXiv: 2604.12208 by Bo Zhang, Junli Wang, Kehua Sheng, Pengfei Li, Qihao Jin, Wenchao Ding, Yilun Chen, Zhihua Hua, Zhongxue Gan.

Figure 1: We demonstrate the impact of introducing pertur…
Figure 2: We demonstrate erroneous examples in the annotation…
Figure 3: Overview of our pipeline. Sequential navigation guidance consists of navigation path and TBT information. SNG…
Figure 4: Qualitative analysis of real-world scenarios. Navigation paths are augmented with substantial noise before being fed…
Figure 5: We demonstrate the impact of introducing noise to…
Figure 6: Qualitative analysis of real-world scenarios.
original abstract

Global navigation information and local scene understanding are two crucial components of autonomous driving systems. However, our experimental results indicate that many end-to-end autonomous driving systems tend to over-rely on local scene understanding while failing to utilize global navigation information. These systems exhibit weak correlation between their planning capabilities and navigation input, and struggle to perform navigation-following in complex scenarios. To overcome this limitation, we propose the Sequential Navigation Guidance (SNG) framework, an efficient representation of global navigation information based on real-world navigation patterns. The SNG encompasses both navigation paths for constraining long-term trajectories and turn-by-turn (TBT) information for real-time decision-making logic. We constructed the SNG-QA dataset, a visual question answering (VQA) dataset based on SNG that aligns global and local planning. Additionally, we introduce an efficient model SNG-VLA that fuses local planning with global planning. The SNG-VLA achieves state-of-the-art performance through precise navigation information modeling without requiring auxiliary loss functions from perception tasks. Project page: SNG-VLA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript argues that end-to-end autonomous driving models over-rely on local scene understanding and exhibit weak correlation with global navigation inputs, impairing navigation-following. It introduces Sequential Navigation Guidance (SNG) as a compact representation of paths and turn-by-turn instructions derived from real-world patterns, the SNG-QA VQA dataset to align global and local planning, and the SNG-VLA model that fuses these elements to achieve state-of-the-art performance without auxiliary perception losses.

Significance. If the empirical claims hold, the work would highlight a previously under-addressed limitation in end-to-end driving and provide a reusable navigation representation plus dataset that could improve long-horizon trajectory consistency. The absence of auxiliary losses is a practical advantage if demonstrated.

major comments (2)
  1. [Abstract] The assertion of state-of-the-art performance, and that gains arise specifically from 'precise navigation information modeling', is unsupported by any quantitative metrics, baselines, ablation tables, or experimental protocol; this missing evidence is load-bearing for the central efficacy claim.
  2. The manuscript provides no ablation or control experiment that isolates SNG fusion (e.g., null/randomized navigation input with all other factors fixed) or reports post-hoc planning-navigation correlation on the trained model, leaving open whether observed performance stems from SNG, dataset scale, or architecture size.
minor comments (1)
  1. [Abstract] The project page is referenced but no URL or access instructions are supplied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas where the manuscript can be strengthened in terms of clarity and experimental rigor. We address each major comment below and have incorporated revisions to better support the central claims.

point-by-point responses
  1. Referee: [Abstract] The assertion of state-of-the-art performance, and that gains arise specifically from 'precise navigation information modeling', is unsupported by any quantitative metrics, baselines, ablation tables, or experimental protocol; this missing evidence is load-bearing for the central efficacy claim.

    Authors: We agree that the abstract would benefit from more explicit references to the supporting evidence. In the revised manuscript, we have updated the abstract to reference the quantitative results, baseline comparisons, and ablation studies detailed in Section 4, along with a brief outline of the experimental protocol. This ensures the efficacy claims are directly tied to the presented metrics without altering the core findings. revision: yes

  2. Referee: The manuscript provides no ablation or control experiment that isolates SNG fusion (e.g., null/randomized navigation input with all other factors fixed) or reports post-hoc planning-navigation correlation on the trained model, leaving open whether observed performance stems from SNG, dataset scale, or architecture size.

    Authors: This observation is correct and highlights a gap in the original presentation. While the manuscript included component ablations in Section 4.2, it lacked a direct isolation of the SNG fusion mechanism. We have added a new control experiment in the revised Section 4.3 that fixes all other factors and compares performance with null/randomized navigation inputs. We have also included post-hoc correlation analysis between planning outputs and navigation inputs. These additions demonstrate that the gains are attributable to SNG rather than dataset scale or model size. revision: yes
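
A rough editorial sketch of what the promised post-hoc correlation analysis could look like: summarize each scene's route by its commanded heading change, do the same for the planned trajectory, and correlate the two across scenes. The planner interface, scene keys, and heading summary are assumptions for illustration, not the authors' analysis.

```python
import numpy as np

# Editorial sketch of a post-hoc planning-navigation correlation check.
# `planner`, the scene-dictionary keys, and the heading summary are
# illustrative assumptions, not the paper's actual analysis code.

def heading_change(traj: np.ndarray) -> float:
    """Net heading change (radians) along a 2-D waypoint sequence."""
    d_start = traj[1] - traj[0]
    d_end = traj[-1] - traj[-2]
    return float(np.arctan2(d_end[1], d_end[0]) - np.arctan2(d_start[1], d_start[0]))

def planning_navigation_correlation(planner, scenes) -> float:
    """Pearson correlation between route-commanded and planned heading change."""
    commanded, planned = [], []
    for scene in scenes:
        nav_path = np.asarray(scene["navigation_path"])
        commanded.append(heading_change(nav_path))
        plan = np.asarray(planner.plan(scene["observation"], nav_path))
        planned.append(heading_change(plan))
    # A value near zero reproduces the weak coupling the paper criticizes;
    # a clearly positive value is evidence that the planner follows the route.
    return float(np.corrcoef(commanded, planned)[0, 1])
```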

Circularity Check

0 steps flagged

No significant circularity; claims rest on new empirical elements

full rationale

The paper introduces independent constructs (SNG representation derived from real-world patterns, SNG-QA dataset, and SNG-VLA fusion model) and reports their performance as experimental outcomes rather than definitional restatements. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce the SOTA claim or navigation-modeling efficacy to inputs by construction. The derivation chain remains self-contained with external validation via dataset construction and model evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the efficacy of the newly introduced SNG representation and its fusion in SNG-VLA, which are postulated without independent external benchmarks or prior validation cited in the abstract.

axioms (1)
  • domain assumption: End-to-end autonomous driving models exhibit weak utilization of global navigation information and can be improved by explicit structured fusion with local planning.
    This premise is drawn directly from the experimental observations stated in the abstract.
invented entities (3)
  • Sequential Navigation Guidance (SNG) · no independent evidence
    purpose: Efficient representation of global navigation information encompassing navigation paths and turn-by-turn details based on real-world patterns.
    Newly defined framework introduced to address the identified limitation.
  • SNG-QA dataset · no independent evidence
    purpose: Visual question answering dataset to align global navigation with local scene planning.
    Constructed specifically for this work to support the proposed model.
  • SNG-VLA model · no independent evidence
    purpose: Model that fuses local planning with global navigation information for improved performance.
    Proposed architecture achieving the reported results.

pith-pipeline@v0.9.0 · 5512 in / 1416 out tokens · 44007 ms · 2026-05-10T16:25:10.491056+00:00 · methodology

