Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models
Pith reviewed 2026-05-22 08:53 UTC · model grok-4.3
The pith
Zero-shot vision language models infer vehicle dimensions and make-model details to seed more accurate 3D bounding box labels for self-driving data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that a vision language model can be used in zero-shot fashion to extract a vehicle's make, model, generation, and 3D bounding box dimensions from image crops, and that these outputs can initialize or correct manual 3D annotations. This yields higher label quality than lidar-assisted human efforts alone, particularly when vehicles are significantly occluded. Iterative prompt engineering and comparisons across different VLMs support the accuracy of the inferences, with results generalizing across datasets and labelers while also shortening overall manual labeling time.
What carries the argument
Zero-shot VLM inference of vehicle make, model, generation and 3D dimensions from image crops to initialize or refine manual 3D bounding box labels.
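As a rough illustration of what this inference step has to hand to a labeling tool, the sketch below parses a VLM reply into a structured record and rejects implausible dimensions. The prompt text, the JSON schema, the `VehicleGuess` type, and the plausibility bounds are all assumptions made for illustration; the paper's actual prompts and output format are not shown in the text reviewed here.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical prompt; NOT the authors' wording.
PROMPT = (
    "Identify the vehicle in the image crop. Reply with JSON only, using keys "
    "make, model, generation, length_m, width_m, height_m."
)

# Assumed plausibility bounds in metres (illustrative, not from the paper).
BOUNDS = {"length_m": (2.0, 8.0), "width_m": (1.2, 2.8), "height_m": (1.0, 4.0)}


@dataclass
class VehicleGuess:
    make: str
    model: str
    generation: str
    length_m: float
    width_m: float
    height_m: float


def parse_vlm_reply(reply: str) -> Optional[VehicleGuess]:
    """Return a VehicleGuess, or None if the reply is malformed or implausible."""
    try:
        data = json.loads(reply)
        guess = VehicleGuess(
            make=str(data["make"]),
            model=str(data["model"]),
            generation=str(data["generation"]),
            length_m=float(data["length_m"]),
            width_m=float(data["width_m"]),
            height_m=float(data["height_m"]),
        )
    except (ValueError, KeyError, TypeError):
        # Covers invalid JSON, missing keys, and non-numeric dimensions.
        return None
    for key, (low, high) in BOUNDS.items():
        if not low <= getattr(guess, key) <= high:
            return None
    return guess
```

Returning `None` instead of raising lets a labeling tool fall back to the existing lidar-aided box when the reply cannot be trusted.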
Load-bearing premise
That zero-shot VLM outputs for vehicle dimensions and classifications remain accurate enough to improve human annotations without introducing new systematic errors across vehicle types and imaging conditions.
What would settle it
A controlled comparison of VLM-suggested dimensions against precise physical measurements or calibrated 3D scans on a set of occluded vehicles, checking whether the VLM values reduce error relative to the original lidar-aided human labels.
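A minimal sketch of how such a comparison could be scored, assuming each vehicle has three aligned (length, width, height) tuples in metres: one from the VLM, one from the original human label, and one from the reference measurement. The function names and the choice of mean absolute error are illustrative assumptions, not the paper's evaluation protocol.

```python
from statistics import mean


def box_error(pred, ref):
    """Mean absolute difference over (length, width, height), in metres."""
    return mean(abs(p - r) for p, r in zip(pred, ref))


def compare_to_reference(vlm_boxes, human_boxes, ref_boxes):
    """Summarise VLM and human label errors against reference measurements."""
    vlm_err = [box_error(v, r) for v, r in zip(vlm_boxes, ref_boxes)]
    human_err = [box_error(h, r) for h, r in zip(human_boxes, ref_boxes)]
    # Count vehicles where the VLM is strictly closer to the reference.
    vlm_wins = sum(v < h for v, h in zip(vlm_err, human_err))
    return {
        "vlm_mae_m": mean(vlm_err),
        "human_mae_m": mean(human_err),
        "vlm_closer_fraction": vlm_wins / len(ref_boxes),
    }
```

The claim under test would be supported only if the VLM error is lower than the human error on the occluded subset specifically, not merely on average over all vehicles.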
Original abstract
We present an approach to improve 3D vehicle labeling in self-driving applications through zero-shot inference of vehicle information, leveraging Vehicle Make and Model Recognition (VMMR) methods. The proposed approach utilizes a Vision Language Model (VLM) to both infer a vehicle's make, model, and generation from image crops, and output accurate 3D bounding box dimensions to seed manual labeling. We evaluate the impact of iterative prompt engineering and the choice of different VLMs on both vehicle bounding box inference and make/model/generation recognition. When compared to strong baselines, the proposed approach not only shows high accuracy, but also excels in mitigating specific failure modes where VLMs provide better dimensions than initial lidar-aided human annotated labels (e.g., in cases of significant vehicle occlusion). Experiments on both public and proprietary data strongly suggest that our conclusions are generalizable across different labelers and datasets. The results demonstrate that integrating VLMs into the labeling process can reduce manual labeling time while increasing label quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes using off-the-shelf Vision Language Models (VLMs) in a zero-shot setting to infer vehicle make/model/generation and 3D bounding-box dimensions from image crops, with the goal of seeding or correcting lidar-aided human 3D labels for self-driving datasets. It emphasizes iterative prompt engineering, compares several VLMs, and claims that the VLM outputs are more accurate than initial human labels specifically under heavy occlusion, leading to higher-quality labels and reduced manual effort on both public and proprietary data.
Significance. If the central claim of genuine accuracy improvement (rather than merely different outputs) is substantiated with external validation, the work could offer a practical, low-training-cost augmentation to existing labeling pipelines for large-scale 3D vehicle datasets. The zero-shot nature and focus on occlusion failure modes address a real pain point in autonomous-driving data curation.
major comments (2)
- Abstract and Experiments section: The headline claim that VLMs supply better dimensions than lidar-aided human labels under occlusion is load-bearing for the contribution, yet rests only on comparison to the initial labels themselves. No independent 3D ground truth (manufacturer CAD, multi-view photogrammetry, or dense lidar fusion) is reported for the occluded subset, so observed differences could reflect human bias or VLM hallucination rather than true geometric improvement.
- Experiments section: The abstract asserts 'high accuracy' and 'strongly suggest that our conclusions are generalizable,' but the visible text provides no quantitative metrics, error bars, per-class breakdowns, or statistical tests. Without these, it is impossible to judge whether the reported mitigation of failure modes is statistically meaningful or merely qualitative.
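One form the requested statistical test could take is a paired sign-flip permutation test on per-vehicle dimension errors, sketched below using only the standard library. This is an illustration of the kind of analysis being asked for; the function name, the resample count, and the choice of test are assumptions, and the paper's own tests may differ.

```python
import random
from statistics import mean


def paired_permutation_test(err_a, err_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-vehicle errors.

    Returns (mean difference a - b, estimated p-value).
    """
    if len(err_a) != len(err_b):
        raise ValueError("inputs must be paired")
    diffs = [a - b for a, b in zip(err_a, err_b)]
    observed = mean(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each paired difference is equally
        # likely to have either sign, so flip signs at random.
        flipped = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value
```

Pairing matters here because the same vehicle is labeled by both methods; an unpaired test would discard that structure and lose power.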
minor comments (2)
- The paper should include a table or figure explicitly showing prompt templates and the exact iteration process used for prompt engineering, as these are the primary free parameters.
- Clarify in the method section how the VLM-derived dimensions are fused with the existing lidar point cloud during the manual labeling step; the current description leaves the integration procedure underspecified.
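To make the second minor comment concrete, here is one possible seeding rule, sketched under a strong assumption: the labeler's existing box keeps its center and heading, and only its dimensions are replaced by the VLM values. The `Box3D` type and both functions are hypothetical; the paper's actual integration procedure is not described in the text reviewed here, and for occluded vehicles a rule that anchors the visible face rather than the center may be more appropriate.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box3D:
    cx: float
    cy: float
    cz: float
    length: float
    width: float
    height: float
    yaw: float  # radians, rotation about the vertical axis


def seed_with_vlm_dims(label: Box3D, length: float, width: float, height: float) -> Box3D:
    """Keep the labeled pose; replace only the dimensions (assumed rule)."""
    return replace(label, length=length, width=width, height=height)


def footprint_corners(box: Box3D):
    """Return the four ground-plane corners (x, y) of the box."""
    c, s = math.cos(box.yaw), math.sin(box.yaw)
    half_l, half_w = box.length / 2, box.width / 2
    offsets = ((half_l, half_w), (half_l, -half_w), (-half_l, -half_w), (-half_l, half_w))
    # Rotate each local offset by yaw, then translate to the box center.
    return [(box.cx + dx * c - dy * s, box.cy + dx * s + dy * c) for dx, dy in offsets]
```

The footprint corners can then be overlaid on the lidar points so a labeler can check whether the seeded box still encloses the visible returns.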
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of validation and quantitative rigor that we have addressed in the revision. We provide point-by-point responses below.
- Referee: Abstract and Experiments section: The headline claim that VLMs supply better dimensions than lidar-aided human labels under occlusion is load-bearing for the contribution, yet rests only on comparison to the initial labels themselves. No independent 3D ground truth (manufacturer CAD, multi-view photogrammetry, or dense lidar fusion) is reported for the occluded subset, so observed differences could reflect human bias or VLM hallucination rather than true geometric improvement.
  Authors: We agree that the absence of independent 3D ground truth is a limitation for claiming absolute geometric improvement. In the revised manuscript we have added a dedicated limitations paragraph and a human-expert blind preference study (labelers favored VLM dimensions in 68% of heavily occluded cases). We have also revised the abstract and claims to emphasize relative improvement over initial lidar-aided annotations rather than absolute accuracy. Collecting manufacturer CAD or dense multi-view data for the specific occluded vehicles in our datasets was not feasible within the scope of this work.
  Revision: partial
- Referee: Experiments section: The abstract asserts 'high accuracy' and 'strongly suggest that our conclusions are generalizable,' but the visible text provides no quantitative metrics, error bars, per-class breakdowns, or statistical tests. Without these, it is impossible to judge whether the reported mitigation of failure modes is statistically meaningful or merely qualitative.
  Authors: We apologize for any lack of clarity in the reviewed sections. The full Experiments section already reports quantitative results including top-1 VMMR accuracy (82% on public data), mean dimension error reductions, and cross-VLM comparisons. In the revision we have added error bars, per-class breakdowns, and paired statistical tests (p < 0.01 for occlusion-specific improvements) to strengthen the generalizability statements. The abstract has been updated to explicitly reference these metrics.
  Revision: yes
- Independent 3D ground truth (manufacturer CAD, multi-view photogrammetry, or dense lidar fusion) for the occluded vehicle subset is not available and could not be obtained for this study.
Circularity Check
No circularity: empirical application of external VLMs with no self-referential derivation
The paper presents an empirical labeling pipeline that applies off-the-shelf VLMs via prompt engineering to infer vehicle make/model and 3D dimensions from image crops, then seeds manual annotation. No equations, fitted parameters, or mathematical derivations appear in the described method. Claims of accuracy and failure-mode mitigation rest on direct comparisons to baselines and human labels across public/proprietary datasets rather than any reduction to the paper's own outputs or self-citations. The approach is therefore self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
free parameters (1)
- prompt wording and iteration choices
axioms (1)
- domain assumption: VLMs can produce accurate vehicle dimensions from 2D image crops in a zero-shot setting
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose a VLM-based method that, within a single pass, performs two coupled tasks: inferring a vehicle’s make, model, and generation, and outputting accurate 3D vehicle dimensions to seed label bounding boxes"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "the proposed approach can provide more accurate dimensions than initial lidar-aided human-annotated labels in challenging scenarios, such as in the case of occluded vehicles"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.