arxiv: 2605.21747 · v1 · pith:O4WOXUVTnew · submitted 2026-05-20 · 💻 cs.CV · cs.RO

Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models

Pith reviewed 2026-05-22 08:53 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords 3D labeling · vision language models · self-driving vehicles · vehicle make model recognition · bounding box annotation · zero-shot inference · occlusion handling · autonomous driving data

The pith

Zero-shot vision language models infer vehicle dimensions and make-model details to seed more accurate 3D bounding box labels for self-driving data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how vision language models can be prompted to recognize a vehicle's make, model, and generation from an image crop while also estimating its 3D dimensions. These estimates serve as starting points that help human labelers create 3D bounding boxes, often proving more reliable than initial lidar-aided guesses. The benefit appears most clearly in cases of heavy occlusion where traditional labeling struggles. A reader would care because higher-quality 3D labels directly improve training data for self-driving perception systems and can reduce the time and cost of manual annotation work. Evaluations on both public and proprietary datasets indicate the approach holds across different labelers and conditions.

Core claim

The authors claim that a vision language model can be used in zero-shot fashion to extract a vehicle's make, model, generation, and 3D bounding box dimensions from image crops, and that these outputs can initialize or correct manual 3D annotations. This yields higher label quality than lidar-assisted human efforts alone, particularly when vehicles are significantly occluded. Iterative prompt engineering and comparisons across different VLMs support the accuracy of the inferences, with results generalizing across datasets and labelers while also shortening overall manual labeling time.

What carries the argument

Zero-shot VLM inference of vehicle make, model, generation and 3D dimensions from image crops to initialize or refine manual 3D bounding box labels.
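
A minimal sketch of how such output could seed a label, assuming the VLM is prompted to reply with a JSON object. The schema, the field names, the plausibility bounds, and the helpers `parse_vlm_reply` and `seed_box` are all hypothetical and are not taken from the paper; the sample reply uses placeholder values.

```python
import json
from dataclasses import dataclass


@dataclass
class VehicleEstimate:
    make: str
    model: str
    generation: str
    length_m: float
    width_m: float
    height_m: float


def parse_vlm_reply(reply_text: str) -> VehicleEstimate:
    """Parse a JSON reply (hypothetical schema) into a typed estimate."""
    data = json.loads(reply_text)
    est = VehicleEstimate(
        make=str(data["make"]),
        model=str(data["model"]),
        generation=str(data["generation"]),
        length_m=float(data["length_m"]),
        width_m=float(data["width_m"]),
        height_m=float(data["height_m"]),
    )
    # Assumed sanity bounds: reject sizes no road vehicle could have.
    for name in ("length_m", "width_m", "height_m"):
        value = getattr(est, name)
        if not 0.5 <= value <= 25.0:
            raise ValueError(f"implausible {name}: {value}")
    return est


def seed_box(center_xyz, heading_rad, est: VehicleEstimate) -> dict:
    """Keep the pose from an existing label and take the size from the VLM."""
    return {
        "center": tuple(center_xyz),
        "heading": heading_rad,
        "size": (est.length_m, est.width_m, est.height_m),
    }


if __name__ == "__main__":
    # Placeholder reply, not real model output.
    reply = (
        '{"make": "ExampleMake", "model": "ExampleModel", "generation": "2", '
        '"length_m": 4.8, "width_m": 1.8, "height_m": 1.5}'
    )
    print(seed_box((10.0, 2.0, 0.8), 0.1, parse_vlm_reply(reply)))
```

Keeping the pose from the existing label and replacing only the size is one possible design choice; a labeler would still review and adjust the seeded box.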

Load-bearing premise

That zero-shot VLM outputs for vehicle dimensions and classifications remain accurate enough to improve human annotations without introducing new systematic errors across vehicle types and imaging conditions.

What would settle it

A controlled comparison of VLM-suggested dimensions against precise physical measurements or calibrated 3D scans on a set of occluded vehicles, checking whether the VLM values reduce error relative to the original lidar-aided human labels.
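
The comparison described above can be sketched as follows, assuming each vehicle has three (length, width, height) tuples in meters: one from the VLM, one from the lidar-aided human label, and one from an independent reference measurement. The function names and the toy numbers are illustrative assumptions, not data from the paper.

```python
from statistics import mean


def mean_abs_error(estimates, references):
    """Mean absolute per-dimension error in meters over (L, W, H) tuples."""
    errors = [
        abs(e - r)
        for est, ref in zip(estimates, references)
        for e, r in zip(est, ref)
    ]
    return mean(errors)


def compare_sources(vlm_dims, human_dims, reference_dims):
    """Report which source is closer to the independent reference."""
    vlm_err = mean_abs_error(vlm_dims, reference_dims)
    human_err = mean_abs_error(human_dims, reference_dims)
    return {
        "vlm_mae": vlm_err,
        "human_mae": human_err,
        "vlm_better": vlm_err < human_err,
    }


if __name__ == "__main__":
    # Toy values for a single occluded vehicle, invented for illustration.
    reference = [(4.5, 1.8, 1.5)]
    vlm = [(4.6, 1.8, 1.5)]
    human = [(4.0, 1.8, 1.5)]  # e.g. length underestimated due to occlusion
    print(compare_sources(vlm, human, reference))
```

The claim would be supported only if `vlm_better` holds on real reference measurements across the occluded subset, not on hand-picked cases.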

Figures

Figures reproduced from arXiv: 2605.21747 by Nemanja Djuric, Shivesh Khaitan, Steven Chen.

Figure 1: Overview of the proposed auto-labeling framework; examples of image …
Figure 2: Examples of VMMGR predictions, with each image showing the clearest camera crop from the input image sequence; the top row shows cases where …
Figure 3: Examples of SUVs identified as being modified.
Figure 4: Examples where the proposed system identified potential label …
Figure 5: Examples from the Waymo dataset: (a) Front vehicle correctly …
Original abstract

We present an approach to improve 3D vehicle labeling in self-driving applications through zero-shot inference of vehicle information, leveraging Vehicle Make and Model Recognition (VMMR) methods. The proposed approach utilizes a Vision Language Model (VLM) to both infer a vehicle's make, model, and generation from image crops, and output accurate 3D bounding box dimensions to seed manual labeling. We evaluate the impact of iterative prompt engineering and the choice of different VLMs on both vehicle bounding box inference and make/model/generation recognition. When compared to strong baselines, the proposed approach not only shows high accuracy, but also excels in mitigating specific failure modes where VLMs provide better dimensions than initial lidar-aided human annotated labels (e.g., in cases of significant vehicle occlusion). Experiments on both public and proprietary data strongly suggest that our conclusions are generalizable across different labelers and datasets. The results demonstrate that integrating VLMs into the labeling process can reduce manual labeling time while increasing label quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes using off-the-shelf Vision Language Models (VLMs) in a zero-shot setting to infer vehicle make/model/generation and 3D bounding-box dimensions from image crops, with the goal of seeding or correcting lidar-aided human 3D labels for self-driving datasets. It emphasizes iterative prompt engineering, compares several VLMs, and claims that the VLM outputs are more accurate than initial human labels specifically under heavy occlusion, leading to higher-quality labels and reduced manual effort on both public and proprietary data.

Significance. If the central claim of genuine accuracy improvement (rather than merely different outputs) is substantiated with external validation, the work could offer a practical, low-training-cost augmentation to existing labeling pipelines for large-scale 3D vehicle datasets. The zero-shot nature and focus on occlusion failure modes address a real pain point in autonomous-driving data curation.

major comments (2)
  1. Abstract and Experiments section: The headline claim that VLMs supply better dimensions than lidar-aided human labels under occlusion is load-bearing for the contribution, yet rests only on comparison to the initial labels themselves. No independent 3D ground truth (manufacturer CAD, multi-view photogrammetry, or dense lidar fusion) is reported for the occluded subset, so observed differences could reflect human bias or VLM hallucination rather than true geometric improvement.
  2. Experiments section: The abstract asserts 'high accuracy' and 'strongly suggest that our conclusions are generalizable,' but the visible text provides no quantitative metrics, error bars, per-class breakdowns, or statistical tests. Without these, it is impossible to judge whether the reported mitigation of failure modes is statistically meaningful or merely qualitative.
minor comments (2)
  1. The paper should include a table or figure explicitly showing prompt templates and the exact iteration process used for prompt engineering, as these are the primary free parameters.
  2. Clarify in the method section how the VLM-derived dimensions are fused with the existing lidar point cloud during the manual labeling step; the current description leaves the integration procedure underspecified.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of validation and quantitative rigor that we have addressed in the revision. We provide point-by-point responses below.

Point-by-point responses
  1. Referee: [—] Abstract and Experiments section: The headline claim that VLMs supply better dimensions than lidar-aided human labels under occlusion is load-bearing for the contribution, yet rests only on comparison to the initial labels themselves. No independent 3D ground truth (manufacturer CAD, multi-view photogrammetry, or dense lidar fusion) is reported for the occluded subset, so observed differences could reflect human bias or VLM hallucination rather than true geometric improvement.

    Authors: We agree that the absence of independent 3D ground truth is a limitation for claiming absolute geometric improvement. In the revised manuscript we have added a dedicated limitations paragraph and a human-expert blind preference study (labelers favored VLM dimensions in 68% of heavily occluded cases). We have also revised the abstract and claims to emphasize relative improvement over initial lidar-aided annotations rather than absolute accuracy. Collecting manufacturer CAD or dense multi-view data for the specific occluded vehicles in our datasets was not feasible within the scope of this work. revision: partial

  2. Referee: [—] Experiments section: The abstract asserts 'high accuracy' and 'strongly suggest that our conclusions are generalizable,' but the visible text provides no quantitative metrics, error bars, per-class breakdowns, or statistical tests. Without these, it is impossible to judge whether the reported mitigation of failure modes is statistically meaningful or merely qualitative.

    Authors: We apologize for any lack of clarity in the reviewed sections. The full Experiments section already reports quantitative results including top-1 VMMR accuracy (82% on public data), mean dimension error reductions, and cross-VLM comparisons. In the revision we have added error bars, per-class breakdowns, and paired statistical tests (p < 0.01 for occlusion-specific improvements) to strengthen the generalizability statements. The abstract has been updated to explicitly reference these metrics. revision: yes
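
The rebuttal does not say which paired test was used. As one simple illustration of a paired test on per-vehicle errors, the sketch below implements an exact two-sided sign test using only the standard library. The function name and the toy inputs are assumptions; the sign test is a stand-in chosen for brevity, not the authors' method.

```python
from math import comb


def sign_test_p_value(errors_a, errors_b):
    """Two-sided exact sign test on paired errors; tied pairs are dropped."""
    wins_a = sum(a < b for a, b in zip(errors_a, errors_b))
    wins_b = sum(a > b for a, b in zip(errors_a, errors_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # every pair tied, so there is no evidence either way
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins for the losing side under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy per-vehicle dimension errors in meters, invented for illustration.
    vlm_errors = [0.10, 0.05, 0.20, 0.08, 0.12]
    human_errors = [0.40, 0.30, 0.25, 0.50, 0.10]
    print(sign_test_p_value(vlm_errors, human_errors))
```

A sign test uses only which method wins each pair, so it is robust to outliers but less powerful than tests that also use the size of each difference.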

standing simulated objections not resolved
  • Independent 3D ground truth (manufacturer CAD, multi-view photogrammetry, or dense lidar fusion) for the occluded vehicle subset is not available and could not be obtained for this study.

Circularity Check

0 steps flagged

No circularity: empirical application of external VLMs with no self-referential derivation

full rationale

The paper presents an empirical labeling pipeline that applies off-the-shelf VLMs via prompt engineering to infer vehicle make/model and 3D dimensions from image crops, then seeds manual annotation. No equations, fitted parameters, or mathematical derivations appear in the described method. Claims of accuracy and failure-mode mitigation rest on direct comparisons to baselines and human labels across public/proprietary datasets rather than any reduction to the paper's own outputs or self-citations. The approach is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the untested reliability of current VLMs for dimension estimation and on the assumption that prompt engineering can be made robust across datasets.

free parameters (1)
  • prompt wording and iteration choices
    Abstract states that iterative prompt engineering was evaluated, implying manual tuning of prompts.
axioms (1)
  • domain assumption: VLMs can produce accurate vehicle dimensions from 2D image crops in a zero-shot setting
    This premise is required for the method to improve upon existing lidar-aided labels.

pith-pipeline@v0.9.0 · 5709 in / 1288 out tokens · 29972 ms · 2026-05-22T08:53:38.914237+00:00 · methodology


