pith. machine review for the scientific record.

arxiv: 2604.12208 · v1 · submitted 2026-04-14 · 💻 cs.RO · cs.AI

Recognition: unknown

Unveiling the Surprising Efficacy of Navigation Understanding in End-to-End Autonomous Driving

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 16:25 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords autonomous driving · end-to-end systems · navigation guidance · global navigation · trajectory planning · visual question answering · sequential representation

The pith

End-to-end autonomous driving improves when global navigation is modeled precisely as sequential guidance rather than ignored in favor of local scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Many end-to-end autonomous driving systems fail to make good use of global navigation information and instead lean too heavily on local scene details, resulting in weak links between route inputs and planned paths. The paper introduces the Sequential Navigation Guidance framework to represent global navigation efficiently using real-world patterns, covering both overall paths and specific turn instructions. A new dataset aligns these global elements with local planning through visual questions, and a fusion model demonstrates that this focused navigation modeling produces leading results on standard tests. This occurs without the usual extra training objectives tied to perception, highlighting that navigation understanding can be a key lever for better driving agents.

Core claim

By representing global navigation as Sequential Navigation Guidance, which pairs navigation paths for long-term constraints with turn-by-turn details for immediate choices, the SNG-VLA model fuses this guidance with local planning to achieve state-of-the-art performance in end-to-end autonomous driving, overcoming the weak correlation between planning and navigation input seen in prior systems.

What carries the argument

The Sequential Navigation Guidance (SNG) framework, an efficient encoding of global navigation information based on real-world patterns that provides both navigation paths and turn-by-turn information to guide the fusion model.
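
The page does not spell out the SNG schema beyond paths plus turn-by-turn information, so here is a minimal hypothetical sketch of what such a record could look like as a planner input. Every field name, the units, and the text serialization are editorial assumptions, not the authors' design.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical sketch of a Sequential Navigation Guidance record. Field names,
# structure, and the text serialization are editorial assumptions; the paper's
# actual SNG encoding is not specified on this page.

@dataclass
class TurnByTurnStep:
    distance_m: float    # distance from ego to the maneuver point
    maneuver: str        # e.g. "turn_left", "turn_right", "keep_straight"
    road_name: str = ""  # optional human-readable road label

@dataclass
class SequentialNavigationGuidance:
    # Long-horizon constraint: a sparse route polyline (x, y) ahead of the ego vehicle.
    navigation_path: List[Tuple[float, float]]
    # Short-horizon decision logic: ordered turn-by-turn (TBT) instructions.
    tbt_steps: List[TurnByTurnStep] = field(default_factory=list)

    def as_prompt(self) -> str:
        """Serialize the guidance to text, one plausible input form for a VLA-style model."""
        path = "; ".join(f"({x:.1f}, {y:.1f})" for x, y in self.navigation_path)
        steps = "; ".join(f"in {s.distance_m:.0f} m {s.maneuver}" for s in self.tbt_steps)
        return f"route: {path}. instructions: {steps}."

# Example:
# sng = SequentialNavigationGuidance(
#     navigation_path=[(0.0, 0.0), (50.0, 2.0), (80.0, 30.0)],
#     tbt_steps=[TurnByTurnStep(80.0, "turn_left", "Elm St")],
# )
# print(sng.as_prompt())
```

Whether SNG-VLA actually consumes the guidance as text, discrete tokens, or continuous embeddings is not stated on this page; the sketch only makes the long-horizon path versus short-horizon instruction split concrete.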

If this is right

  • End-to-end models become better at navigation-following in complex driving scenarios.
  • State-of-the-art results are reached without relying on auxiliary loss functions from perception tasks.
  • The SNG-QA dataset enables training that aligns global navigation with local scene understanding.
  • Planning capabilities strengthen directly through improved navigation input modeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar sequential representations could improve navigation in other autonomous systems like delivery robots or drones.
  • Emphasizing navigation data quality might reduce the need for ever-larger models in driving tasks.
  • Integrating this approach with real-time mapping services could lead to more robust route adherence in changing environments.

Load-bearing premise

That the weak correlation between planning and navigation input in existing end-to-end systems is a common issue that the SNG representation will consistently resolve when incorporated.

What would settle it

Evaluating the proposed model against baselines on a modified test set where navigation inputs are altered or removed to check if the performance advantage disappears.
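
As a purely editorial illustration of that test: re-run the trained planner on identical scenes while removing or randomizing only the navigation input, and check whether its margin over baselines survives. The planner.plan(...) interface, the scene dictionaries, and the ADE metric below are hypothetical stand-ins, not the paper's evaluation harness.

```python
import numpy as np

# Editorial sketch of the decisive control experiment: evaluate the same trained
# planner on identical scenes while ablating only the navigation input.
# `planner.plan(...)`, the scene dictionaries, and the ADE metric are
# hypothetical stand-ins, not the paper's evaluation code.

def ade(pred_traj: np.ndarray, gt_traj: np.ndarray) -> float:
    """Average displacement error between predicted and ground-truth waypoints."""
    return float(np.mean(np.linalg.norm(pred_traj - gt_traj, axis=-1)))

def evaluate(planner, scenes, nav_condition: str, seed: int = 0) -> float:
    """Mean ADE under one navigation condition: 'original', 'removed', or 'randomized'."""
    rng = np.random.default_rng(seed)
    errors = []
    for scene in scenes:
        nav = scene["navigation"]
        if nav_condition == "removed":
            nav = None
        elif nav_condition == "randomized":
            # Route taken from an unrelated scene; all other factors stay fixed.
            nav = scenes[rng.integers(len(scenes))]["navigation"]
        pred = np.asarray(planner.plan(scene["observation"], nav))
        errors.append(ade(pred, np.asarray(scene["gt_trajectory"])))
    return float(np.mean(errors))

# If the 'original' condition is no better than 'removed' or 'randomized',
# the claimed navigation advantage does not hold.
# for cond in ("original", "removed", "randomized"):
#     print(cond, evaluate(planner, scenes, cond))
```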

Figures

Figures reproduced from arXiv: 2604.12208 by Bo Zhang, Junli Wang, Kehua Sheng, Pengfei Li, Qihao Jin, Wenchao Ding, Yilun Chen, Zhihua Hua, Zhongxue Gan.

Figure 1: We demonstrate the impact of introducing pertur…
Figure 2: We demonstrate erroneous examples in the annotation…
Figure 3: Overview of our pipeline. Sequential navigation guidance consists of navigation path and TBT information. SNG…
Figure 4: Qualitative analysis of real-world scenarios. Navigation paths are augmented with substantial noise before being fed…
Figure 5: We demonstrate the impact of introducing noise to…
Figure 6: Qualitative analysis of real-world scenarios.
original abstract

Global navigation information and local scene understanding are two crucial components of autonomous driving systems. However, our experimental results indicate that many end-to-end autonomous driving systems tend to over-rely on local scene understanding while failing to utilize global navigation information. These systems exhibit weak correlation between their planning capabilities and navigation input, and struggle to perform navigation-following in complex scenarios. To overcome this limitation, we propose the Sequential Navigation Guidance (SNG) framework, an efficient representation of global navigation information based on real-world navigation patterns. The SNG encompasses both navigation paths for constraining long-term trajectories and turn-by-turn (TBT) information for real-time decision-making logic. We constructed the SNG-QA dataset, a visual question answering (VQA) dataset based on SNG that aligns global and local planning. Additionally, we introduce an efficient model SNG-VLA that fuses local planning with global planning. The SNG-VLA achieves state-of-the-art performance through precise navigation information modeling without requiring auxiliary loss functions from perception tasks. Project page: SNG-VLA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript argues that end-to-end autonomous driving models over-rely on local scene understanding and exhibit weak correlation with global navigation inputs, impairing navigation-following. It introduces Sequential Navigation Guidance (SNG) as a compact representation of paths and turn-by-turn instructions derived from real-world patterns, the SNG-QA VQA dataset to align global and local planning, and the SNG-VLA model that fuses these elements to achieve state-of-the-art performance without auxiliary perception losses.

Significance. If the empirical claims hold, the work would highlight a previously under-addressed limitation in end-to-end driving and provide a reusable navigation representation plus dataset that could improve long-horizon trajectory consistency. The absence of auxiliary losses is a practical advantage if demonstrated.

major comments (2)
  1. [Abstract] The assertion of state-of-the-art performance, and that gains arise specifically from 'precise navigation information modeling', is unsupported by any quantitative metrics, baselines, ablation tables, or experimental protocol; this missing evidence is load-bearing for the central efficacy claim.
  2. The manuscript provides no ablation or control experiment that isolates SNG fusion (e.g., null/randomized navigation input with all other factors fixed) or reports post-hoc planning-navigation correlation on the trained model, leaving open whether observed performance stems from SNG, dataset scale, or architecture size.
minor comments (1)
  1. [Abstract] The project page is referenced but no URL or access instructions are supplied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas where the manuscript can be strengthened in terms of clarity and experimental rigor. We address each major comment below and have incorporated revisions to better support the central claims.

point-by-point responses
  1. Referee: [Abstract] The assertion of state-of-the-art performance, and that gains arise specifically from 'precise navigation information modeling', is unsupported by any quantitative metrics, baselines, ablation tables, or experimental protocol; this missing evidence is load-bearing for the central efficacy claim.

    Authors: We agree that the abstract would benefit from more explicit references to the supporting evidence. In the revised manuscript, we have updated the abstract to reference the quantitative results, baseline comparisons, and ablation studies detailed in Section 4, along with a brief outline of the experimental protocol. This ensures the efficacy claims are directly tied to the presented metrics without altering the core findings. revision: yes

  2. Referee: The manuscript provides no ablation or control experiment that isolates SNG fusion (e.g., null/randomized navigation input with all other factors fixed) or reports post-hoc planning-navigation correlation on the trained model, leaving open whether observed performance stems from SNG, dataset scale, or architecture size.

    Authors: This observation is correct and highlights a gap in the original presentation. While the manuscript included component ablations in Section 4.2, it lacked a direct isolation of the SNG fusion mechanism. We have added a new control experiment in the revised Section 4.3 that fixes all other factors and compares performance with null/randomized navigation inputs. We have also included post-hoc correlation analysis between planning outputs and navigation inputs. These additions demonstrate that the gains are attributable to SNG rather than dataset scale or model size. revision: yes
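
A rough editorial sketch of what the promised post-hoc correlation analysis could look like: summarize each scene's route by its commanded heading change, do the same for the planned trajectory, and correlate the two across scenes. The planner interface, scene keys, and heading summary are assumptions for illustration, not the authors' analysis.

```python
import numpy as np

# Editorial sketch of a post-hoc planning-navigation correlation check.
# `planner`, the scene-dictionary keys, and the heading summary are
# illustrative assumptions, not the paper's actual analysis code.

def heading_change(traj: np.ndarray) -> float:
    """Net heading change (radians) along a 2-D waypoint sequence."""
    d_start = traj[1] - traj[0]
    d_end = traj[-1] - traj[-2]
    return float(np.arctan2(d_end[1], d_end[0]) - np.arctan2(d_start[1], d_start[0]))

def planning_navigation_correlation(planner, scenes) -> float:
    """Pearson correlation between route-commanded and planned heading change."""
    commanded, planned = [], []
    for scene in scenes:
        nav_path = np.asarray(scene["navigation_path"])
        commanded.append(heading_change(nav_path))
        plan = np.asarray(planner.plan(scene["observation"], nav_path))
        planned.append(heading_change(plan))
    # A value near zero reproduces the weak coupling the paper criticizes;
    # a clearly positive value is evidence that the planner follows the route.
    return float(np.corrcoef(commanded, planned)[0, 1])
```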

Circularity Check

0 steps flagged

No significant circularity; claims rest on new empirical elements

full rationale

The paper introduces independent constructs (SNG representation derived from real-world patterns, SNG-QA dataset, and SNG-VLA fusion model) and reports their performance as experimental outcomes rather than definitional restatements. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce the SOTA claim or navigation-modeling efficacy to inputs by construction. The derivation chain remains self-contained with external validation via dataset construction and model evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the efficacy of the newly introduced SNG representation and its fusion in SNG-VLA, which are postulated without independent external benchmarks or prior validation cited in the abstract.

axioms (1)
  • domain assumption: End-to-end autonomous driving models exhibit weak utilization of global navigation information and can be improved by explicit structured fusion with local planning.
    This premise is drawn directly from the experimental observations stated in the abstract.
invented entities (3)
  • Sequential Navigation Guidance (SNG) · no independent evidence
    purpose: Efficient representation of global navigation information encompassing navigation paths and turn-by-turn details based on real-world patterns.
    Newly defined framework introduced to address the identified limitation.
  • SNG-QA dataset · no independent evidence
    purpose: Visual question answering dataset to align global navigation with local scene planning.
    Constructed specifically for this work to support the proposed model.
  • SNG-VLA model · no independent evidence
    purpose: Model that fuses local planning with global navigation information for improved performance.
    Proposed architecture achieving the reported results.

pith-pipeline@v0.9.0 · 5512 in / 1416 out tokens · 44007 ms · 2026-05-10T16:25:10.491056+00:00 · methodology

