VERDI: VLM-Embedded Reasoning for Autonomous Driving
Pith reviewed 2026-05-22 13:24 UTC · model grok-4.3
The pith
VERDI aligns intermediate outputs from perception, prediction and planning modules with VLM text features at training time so the driving stack absorbs commonsense reasoning without paying VLM inference costs at runtime.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VERDI augments modular differentiable end-to-end AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs, enabling the modular AD stack to internalize structured reasoning without incurring the inference-time costs of large VLMs.
What carries the argument
Latent-space alignment loss that matches AD module outputs at perception, prediction and planning stages to VLM-generated text features describing driving reasoning.
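A minimal sketch of what such an alignment term could look like, assuming the cosine-similarity form quoted from the paper later on this page. The `1 - cosine` loss form, the function names, the per-stage averaging, and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def alignment_loss(module_feats, text_feats):
    """Mean of (1 - cosine similarity) over the stages present in both dicts.

    Assumption: each stage contributes equally; the paper may weight stages
    differently.
    """
    stages = sorted(set(module_feats) & set(text_feats))
    if not stages:
        raise ValueError("no shared stages to align")
    total = sum(1.0 - cosine_similarity(module_feats[s], text_feats[s])
                for s in stages)
    return total / len(stages)

# Hypothetical toy features. In practice these would come from learned
# projection heads on the AD modules and from a text encoder.
module_feats = {"perception": [1.0, 0.0],
                "prediction": [0.0, 1.0],
                "planning": [3.0, 4.0]}
text_feats = {"perception": [2.0, 0.0],   # same direction: contributes 0
              "prediction": [1.0, 0.0],   # orthogonal: contributes 1
              "planning": [3.0, 4.0]}     # identical: contributes 0
loss = alignment_loss(module_feats, text_feats)
print(loss)
```

The loss only depends on direction, not magnitude, which is why the perception pair contributes nothing even though the vectors differ in length.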
If this is right
- The aligned models achieve up to 11 percent lower L2 distance than prior end-to-end methods without embedded reasoning.
- Closed-loop driving in the HugSim simulator reaches the highest overall score with a 10 percent gain in non-collision rate.
- Inference speed remains fast because no VLM runs at test time.
- Modular structure is preserved, supporting safety decomposition that monolithic VLM planners lack.
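The percentages above can be read in more than one way, and this page does not say whether the non-collision gain is relative or in percentage points. The sketch below shows the two readings side by side; all numeric values are hypothetical placeholders, not results from the paper.

```python
def relative_reduction(baseline, new):
    """Fractional drop of an error metric (lower is better), e.g. L2 distance."""
    return (baseline - new) / baseline

def relative_gain(baseline, new):
    """Fractional rise of a score metric (higher is better)."""
    return (new - baseline) / baseline

def absolute_gain(baseline, new):
    """Plain difference, e.g. percentage points when inputs are rates."""
    return new - baseline

# Hypothetical L2 distances in metres: an 11 percent relative reduction.
l2_drop = relative_reduction(1.00, 0.89)

# Hypothetical non-collision rates: the same pair of numbers is a
# 10 percent relative gain but only a 5 point absolute gain.
ncr_relative = relative_gain(0.50, 0.55)
ncr_absolute = absolute_gain(0.50, 0.55)

print(round(l2_drop, 4), round(ncr_relative, 4), round(ncr_absolute, 4))
```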
Where Pith is reading between the lines
- The same latent-alignment idea could transfer commonsense from large models into other sequential control tasks that must run at real-time rates.
- Measuring how much each stage (perception versus planning) benefits from the alignment would clarify where the reasoning transfer occurs most strongly.
- Replacing the VLM text features with cheaper synthetic descriptions might test how much of the gain depends on the specific VLM chosen.
Load-bearing premise
That matching the latent representations of driving modules to VLM text features is enough to transfer useful commonsense reasoning into the driving policy.
What would settle it
A closed-loop test in which the VERDI-trained model shows no improvement or a drop in non-collision rate and trajectory accuracy relative to the identical baseline without any VLM alignment would falsify the claimed benefit.
Original abstract
While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of applying commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite their success in benchmark evaluations, these methods are often impractical to deploy (a 70B parameter VLM inference at merely 8 tokens per second requires more than 160G of memory), and their monolithic network structure prohibits safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous DrIving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs. By encouraging alignment in latent space, VERDI enables the modular AD stack to internalize structured reasoning, without incurring the inference-time costs of large VLMs. We evaluate VERDI in both open-loop and closed-loop settings. Our method outperforms existing end-to-end approaches without embedded reasoning by up to 11% in $\ell_{2}$ distance, and achieves the best overall driving performance in the closed-loop HugSim simulator, including a 10% improvement in Non-Collision Rate, while maintaining fast inference speed.
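A rough back-of-envelope check of the deployment figures quoted in the abstract. It assumes 16-bit weights (2 bytes per parameter), which the abstract does not state; under that assumption the weights alone take about 140 GB, so the quoted "more than 160G" would also have to cover activations, KV cache, and runtime overhead. The 100-token response length is a hypothetical value chosen only to illustrate latency.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for model weights only, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

def generation_seconds(num_tokens, tokens_per_second):
    """Time to decode num_tokens at a fixed throughput."""
    return num_tokens / tokens_per_second

# Assumption: fp16/bf16 storage, i.e. 2 bytes per parameter.
weights_gb = weight_memory_gb(70e9, 2)

# Hypothetical 100-token reasoning trace at the quoted 8 tokens per second.
latency_s = generation_seconds(100, 8)

print(weights_gb, latency_s)
```

Even a short textual rationale would take on the order of seconds at this throughput, which is the practical motivation for moving the VLM out of the inference path.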
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes VERDI, a training-time framework that augments modular differentiable end-to-end autonomous driving models by aligning intermediate outputs at the perception, prediction, and planning stages with text features from VLMs that describe driving reasoning. This alignment is intended to distill commonsense knowledge and structured reasoning into the AD stack without incurring VLM inference costs at deployment. The method is evaluated in open- and closed-loop settings, reporting up to 11% improvement in L2 distance over existing end-to-end approaches and a 10% gain in non-collision rate in the HugSim simulator while preserving fast inference.
Significance. If the alignment mechanism demonstrably transfers functional reasoning rather than providing generic regularization, the approach would offer a practical route to embedding human-like decision-making in efficient, safety-decomposable AD stacks. The reported closed-loop gains on non-collision rate and trajectory accuracy would then represent a meaningful advance for handling partial observability without monolithic VLM deployment.
major comments (2)
- [Abstract / alignment objective] Abstract and the paragraph describing the alignment objective: the central claim that minimizing distance between module outputs and VLM text embeddings causes the stack to internalize and apply commonsense reasoning (e.g., inference under partial observability) is not supported by any mechanism or ablation showing that VLM-derived logic becomes causally active in the forward pass; performance deltas could arise from auxiliary supervision alone.
- [Evaluation] Evaluation section: the abstract states concrete gains (11% L2, 10% non-collision) but supplies no statistical significance tests, exact baseline configurations, or controls for post-hoc hyperparameter choices, leaving the strength of the outperformance claim moderate at best.
minor comments (2)
- [Method] Clarify whether the VLM text features are high-level summaries or step-wise decision traces, as this affects the interpretation of what reasoning is being transferred.
- [Discussion] Add a brief discussion of potential failure modes when VLM features contain hallucinations or biases that could propagate through the alignment.
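One way to probe the first major comment is a control run in which every training sample is aligned to the text features of a different sample, so the auxiliary loss is still present but its content no longer matches the scene. The sketch below builds such a mismatched pairing. The function name and the cyclic-shift scheme are assumptions made for illustration; the paper does not describe this control.

```python
import random

def mismatched_pairing(num_samples, seed=0):
    """Return, for each sample index, the index of the text feature to use.

    A cyclic shift by a random non-zero offset guarantees that no sample is
    paired with its own text. This is a simple (not uniformly random)
    derangement, chosen here for clarity.
    """
    if num_samples < 2:
        raise ValueError("need at least two samples to mismatch")
    rng = random.Random(seed)
    offset = rng.randrange(1, num_samples)  # offset in 1 .. num_samples - 1
    return [(i + offset) % num_samples for i in range(num_samples)]

# Hypothetical usage: text_feats[pairing[i]] replaces text_feats[i]
# when computing the alignment loss for sample i.
pairing = mismatched_pairing(6, seed=42)
print(pairing)
```

If the model trained with mismatched text performs as well as the model trained with matched text, the gain is more plausibly due to generic regularization than to transferred reasoning.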
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify our contributions. We address each major comment below, providing the strongest honest defense of the manuscript while indicating planned revisions where appropriate.
- Referee: [Abstract / alignment objective] Abstract and the paragraph describing the alignment objective: the central claim that minimizing distance between module outputs and VLM text embeddings causes the stack to internalize and apply commonsense reasoning (e.g., inference under partial observability) is not supported by any mechanism or ablation showing that VLM-derived logic becomes causally active in the forward pass; performance deltas could arise from auxiliary supervision alone.
  Authors: We agree that stronger evidence is needed to distinguish reasoning transfer from generic auxiliary supervision. The alignment objective specifically projects module outputs onto VLM text embeddings that encode explicit driving reasoning (e.g., descriptions of occluded agents or intent inference), rather than arbitrary features. The closed-loop gains, especially the 10% non-collision improvement in scenarios requiring partial-observability reasoning, provide indirect support that the internalized representations are functionally relevant. To directly address the concern, we will add an ablation replacing reasoning-specific VLM text with random or non-driving captions and demonstrate degraded performance, isolating the contribution of the structured reasoning content.
  revision: yes
- Referee: [Evaluation] Evaluation section: the abstract states concrete gains (11% L2, 10% non-collision) but supplies no statistical significance tests, exact baseline configurations, or controls for post-hoc hyperparameter choices, leaving the strength of the outperformance claim moderate at best.
  Authors: We concur that statistical rigor and precise experimental details are necessary to substantiate the reported improvements. In the revised manuscript we will include statistical significance tests (e.g., paired t-tests across multiple seeds with p-values) for both L2 distance and non-collision rate. We will also document the exact hyperparameter settings, training protocols, and baseline configurations used, together with additional controls such as fixed random seeds and sensitivity analysis to post-hoc choices.
  revision: yes
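The paired t-test mentioned in the rebuttal can be sketched with the standard library alone. The code computes only the t statistic and degrees of freedom; turning these into a p-value requires the Student t distribution (for example from a statistics table or a library such as SciPy). The per-seed numbers are hypothetical and are not taken from the paper.

```python
import math
import statistics

def paired_t_statistic(baseline, treated):
    """Paired t statistic for matched runs (same seed in both conditions).

    Returns (t, degrees_of_freedom). A positive t means the baseline values
    are larger on average, i.e. the treated runs have the lower error.
    """
    if len(baseline) != len(treated):
        raise ValueError("runs must be paired one-to-one")
    if len(baseline) < 2:
        raise ValueError("need at least two pairs")
    diffs = [b - t for b, t in zip(baseline, treated)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1

# Hypothetical L2 distances (metres) for five seeds per condition.
baseline_l2 = [1.02, 0.98, 1.00, 1.01, 0.99]
aligned_l2 = [0.90, 0.88, 0.91, 0.89, 0.87]
t_stat, dof = paired_t_statistic(baseline_l2, aligned_l2)
print(t_stat, dof)
```

Pairing by seed removes seed-to-seed variance from the comparison, which matters when only a handful of runs per condition are affordable.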
Circularity Check
No significant circularity detected
The paper's derivation consists of a training-time latent alignment objective between modular AD outputs and externally generated VLM text features, followed by separate empirical evaluation of the resulting model in open-loop and closed-loop settings. The alignment serves as an auxiliary supervision signal rather than a self-referential definition of the reported metrics; L2 distance and non-collision rate are measured against independent baselines and simulators. No self-citations, uniqueness theorems, or fitted parameters that are later renamed as predictions appear in the abstract or method description. The central claim therefore remains an empirical hypothesis about transfer via alignment, not a reduction to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: VLM-generated text features encode useful commonsense reasoning for driving decisions under partial observability.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  VERDI augments modular differentiable end-to-end AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs... $L_f(f_P, f_M) = \frac{f_P \cdot f_M}{\lVert f_P \rVert \, \lVert f_M \rVert}$
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  We evaluate VERDI in both open-loop and closed-loop settings... 11% in $\ell_{2}$ distance... 10% improvement in Non-Collision Rate
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
  Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.
- How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study
  VENUSS benchmark shows top VLMs achieve 57% accuracy on sequential driving scenes, strong on static objects but weak on vehicle dynamics and temporal relations.
- How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study
  VENUSS evaluates 25+ VLMs across 2600+ sequential driving scenarios and finds top models reach only 57% accuracy versus 65% for humans, with good static detection but poor performance on vehicle dynamics and temporal ...