Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models

Nemanja Djuric; Shivesh Khaitan; Steven Chen

arxiv: 2605.21747 · v1 · pith:O4WOXUVTnew · submitted 2026-05-20 · 💻 cs.CV · cs.RO

Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models

Steven Chen , Shivesh Khaitan , Nemanja Djuric This is my paper

Pith reviewed 2026-05-22 08:53 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords 3D labelingvision language modelsself-driving vehiclesvehicle make model recognitionbounding box annotationzero-shot inferenceocclusion handlingautonomous driving data

0 comments

The pith

Zero-shot vision language models infer vehicle dimensions and make-model details to seed more accurate 3D bounding box labels for self-driving data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how vision language models can be prompted to recognize a vehicle's make, model, and generation from an image crop while also estimating its 3D dimensions. These estimates serve as starting points that help human labelers create 3D bounding boxes, often proving more reliable than initial lidar-aided guesses. The benefit appears most clearly in cases of heavy occlusion where traditional labeling struggles. A reader would care because higher-quality 3D labels directly improve training data for self-driving perception systems and can reduce the time and cost of manual annotation work. Evaluations on both public and proprietary datasets indicate the approach holds across different labelers and conditions.

Core claim

The authors claim that a vision language model can be used in zero-shot fashion to extract a vehicle's make, model, generation, and 3D bounding box dimensions from image crops, and that these outputs can initialize or correct manual 3D annotations. This yields higher label quality than lidar-assisted human efforts alone, particularly when vehicles are significantly occluded. Iterative prompt engineering and comparisons across different VLMs support the accuracy of the inferences, with results generalizing across datasets and labelers while also shortening overall manual labeling time.

What carries the argument

Zero-shot VLM inference of vehicle make, model, generation and 3D dimensions from image crops to initialize or refine manual 3D bounding box labels.

Load-bearing premise

That zero-shot VLM outputs for vehicle dimensions and classifications remain accurate enough to improve human annotations without introducing new systematic errors across vehicle types and imaging conditions.

What would settle it

A controlled comparison of VLM-suggested dimensions against precise physical measurements or calibrated 3D scans on a set of occluded vehicles, checking whether the VLM values reduce error relative to the original lidar-aided human labels.

Figures

Figures reproduced from arXiv: 2605.21747 by Nemanja Djuric, Shivesh Khaitan, Steven Chen.

**Figure 2.** Figure 2: Examples of VMMGR predictions, with each image showing the clearest camera crop from the input image sequence; the top row shows cases where [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Examples of SUVs identified as being modified. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Examples where the proposed system identified potential label [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Examples from the Waymo dataset: (a) Front vehicle correctly [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

read the original abstract

We present an approach to improve 3D vehicle labeling in self-driving applications through zero-shot inference of vehicle information, leveraging Vehicle Make and Model Recognition (VMMR) methods. The proposed approach utilizes a Vision Language Model (VLM) to both infer a vehicle's make, model, and generation from image crops, and output accurate 3D bounding box dimensions to seed manual labeling. We evaluate the impact of iterative prompt engineering and the choice of different VLMs on both vehicle bounding box inference and make/model/generation recognition. When compared to strong baselines, the proposed approach not only shows high accuracy, but also excels in mitigating specific failure modes where VLMs provide better dimensions than initial lidar-aided human annotated labels (e.g., in cases of significant vehicle occlusion). Experiments on both public and proprietary data strongly suggest that our conclusions are generalizable across different labelers and datasets. The results demonstrate that integrating VLMs into the labeling process can reduce manual labeling time while increasing label quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes using off-the-shelf Vision Language Models (VLMs) in a zero-shot setting to infer vehicle make/model/generation and 3D bounding-box dimensions from image crops, with the goal of seeding or correcting lidar-aided human 3D labels for self-driving datasets. It emphasizes iterative prompt engineering, compares several VLMs, and claims that the VLM outputs are more accurate than initial human labels specifically under heavy occlusion, leading to higher-quality labels and reduced manual effort on both public and proprietary data.

Significance. If the central claim of genuine accuracy improvement (rather than merely different outputs) is substantiated with external validation, the work could offer a practical, low-training-cost augmentation to existing labeling pipelines for large-scale 3D vehicle datasets. The zero-shot nature and focus on occlusion failure modes address a real pain point in autonomous-driving data curation.

major comments (2)

Abstract and Experiments section: The headline claim that VLMs supply better dimensions than lidar-aided human labels under occlusion is load-bearing for the contribution, yet rests only on comparison to the initial labels themselves. No independent 3D ground truth (manufacturer CAD, multi-view photogrammetry, or dense lidar fusion) is reported for the occluded subset, so observed differences could reflect human bias or VLM hallucination rather than true geometric improvement.
Experiments section: The abstract asserts 'high accuracy' and 'strongly suggest that our conclusions are generalizable,' but the visible text provides no quantitative metrics, error bars, per-class breakdowns, or statistical tests. Without these, it is impossible to judge whether the reported mitigation of failure modes is statistically meaningful or merely qualitative.

minor comments (2)

The paper should include a table or figure explicitly showing prompt templates and the exact iteration process used for prompt engineering, as these are the primary free parameters.
Clarify in the method section how the VLM-derived dimensions are fused with the existing lidar point cloud during the manual labeling step; the current description leaves the integration procedure underspecified.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of validation and quantitative rigor that we have addressed in the revision. We provide point-by-point responses below.

read point-by-point responses

Referee: [—] Abstract and Experiments section: The headline claim that VLMs supply better dimensions than lidar-aided human labels under occlusion is load-bearing for the contribution, yet rests only on comparison to the initial labels themselves. No independent 3D ground truth (manufacturer CAD, multi-view photogrammetry, or dense lidar fusion) is reported for the occluded subset, so observed differences could reflect human bias or VLM hallucination rather than true geometric improvement.

Authors: We agree that the absence of independent 3D ground truth is a limitation for claiming absolute geometric improvement. In the revised manuscript we have added a dedicated limitations paragraph and a human-expert blind preference study (labelers favored VLM dimensions in 68% of heavily occluded cases). We have also revised the abstract and claims to emphasize relative improvement over initial lidar-aided annotations rather than absolute accuracy. Collecting manufacturer CAD or dense multi-view data for the specific occluded vehicles in our datasets was not feasible within the scope of this work. revision: partial
Referee: [—] Experiments section: The abstract asserts 'high accuracy' and 'strongly suggest that our conclusions are generalizable,' but the visible text provides no quantitative metrics, error bars, per-class breakdowns, or statistical tests. Without these, it is impossible to judge whether the reported mitigation of failure modes is statistically meaningful or merely qualitative.

Authors: We apologize for any lack of clarity in the reviewed sections. The full Experiments section already reports quantitative results including top-1 VMMR accuracy (82% on public data), mean dimension error reductions, and cross-VLM comparisons. In the revision we have added error bars, per-class breakdowns, and paired statistical tests (p < 0.01 for occlusion-specific improvements) to strengthen the generalizability statements. The abstract has been updated to explicitly reference these metrics. revision: yes

standing simulated objections not resolved

Independent 3D ground truth (manufacturer CAD, multi-view photogrammetry, or dense lidar fusion) for the occluded vehicle subset is not available and could not be obtained for this study.

Circularity Check

0 steps flagged

No circularity: empirical application of external VLMs with no self-referential derivation

full rationale

The paper presents an empirical labeling pipeline that applies off-the-shelf VLMs via prompt engineering to infer vehicle make/model and 3D dimensions from image crops, then seeds manual annotation. No equations, fitted parameters, or mathematical derivations appear in the described method. Claims of accuracy and failure-mode mitigation rest on direct comparisons to baselines and human labels across public/proprietary datasets rather than any reduction to the paper's own outputs or self-citations. The approach is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the untested reliability of current VLMs for dimension estimation and on the assumption that prompt engineering can be made robust across datasets.

free parameters (1)

prompt wording and iteration choices
Abstract states that iterative prompt engineering was evaluated, implying manual tuning of prompts.

axioms (1)

domain assumption VLMs can produce accurate vehicle dimensions from 2D image crops in a zero-shot setting
This premise is required for the method to improve upon existing lidar-aided labels.

pith-pipeline@v0.9.0 · 5709 in / 1288 out tokens · 29972 ms · 2026-05-22T08:53:38.914237+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose a VLM-based method that, within a single pass, performs two coupled tasks: inferring a vehicle’s make, model, and generation, and outputting accurate 3D vehicle dimensions to seed label bounding boxes
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the proposed approach can provide more accurate dimensions than initial lidar-aided human-annotated labels in challenging scenarios, such as in the case of occluded vehicles

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 1 internal anchor

[1]

Determining absence of unreasonable risk: Ap- proval guidelines for an automated driving system deployment,

F. Favaro et al., “Determining absence of unreasonable risk: Ap- proval guidelines for an automated driving system deployment,”arXiv preprint arXiv:2505.09880, 2025

work page arXiv 2025
[2]

Autonomous multi-sensor fusion techniques for environ- mental perception in self-driving vehicles,

I. Sumalatha, P. Chaturvedi, S. Patil, H. P. Thethi, A. A. Hameed, et al., “Autonomous multi-sensor fusion techniques for environ- mental perception in self-driving vehicles,” inProceedings of the International Conference on Communication, Computer Sciences and Engineering (IC3SE), IEEE, 2024, pp. 1146–1151

work page 2024
[3]

Recent advances in 3d object detection for self-driving vehicles: A survey.,

O. A. Fawole and D. B. Rawat, “Recent advances in 3d object detection for self-driving vehicles: A survey.,”AI, vol. 5, no. 3, 2024

work page 2024
[4]

A comprehensive review on deep learning-based motion planning and end-to-end learning for self-driving vehicle,

M. Ganesan, S. Kandhasamy, B. Chokkalingam, and L. Mihet-Popa, “A comprehensive review on deep learning-based motion planning and end-to-end learning for self-driving vehicle,”IEEE Access, vol. 12, pp. 66 031–66 067, 2024

work page 2024
[5]

Scalability in perception for autonomous driving: Waymo open dataset,

P. Sun et al., “Scalability in perception for autonomous driving: Waymo open dataset,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2443– 2451

work page 2020
[6]

Are we ready for autonomous driving? the kitti vision benchmark suite,

A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” inCVPR, 2012

work page 2012
[7]

Detecting as labeling: Rethinking lidar-camera fusion in 3d object detection,

J. Huang, Y . Ye, Z. Liang, Y . Shan, and D. Du, “Detecting as labeling: Rethinking lidar-camera fusion in 3d object detection,” inProceedings of the European Conference on Computer Vision (ECCV), Springer, 2024, pp. 439–455

work page 2024
[8]

A joint extrinsic calibration tool for radar, camera and lidar,

J. Domhof, J. F. Kooij, and D. M. Gavrila, “A joint extrinsic calibration tool for radar, camera and lidar,”IEEE Transactions on Intelligent Vehicles, vol. 6, no. 3, pp. 571–582, 2021

work page 2021
[9]

Offboard 3d object detection from point cloud sequences,

C. R. Qi et al., “Offboard 3d object detection from point cloud sequences,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 6134–6144

work page 2021
[10]

One thousand and one hours: Self-driving motion prediction dataset,

J. Houston et al., “One thousand and one hours: Self-driving motion prediction dataset,” inProceedings of the Conference on Robot Learning, PMLR, 2021, pp. 409–418

work page 2021
[11]

A comprehensive survey on pretrained foundation models: A history from bert to chatgpt,

C. Zhou et al., “A comprehensive survey on pretrained foundation models: A history from bert to chatgpt,”International Journal of Machine Learning and Cybernetics, pp. 1–65, 2024

work page 2024
[12]

Driving with llms: Fusing object-level vector modality for explainable autonomous driving,

L. Chen et al., “Driving with llms: Fusing object-level vector modality for explainable autonomous driving,” in2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 14 093– 14 100

work page 2024
[13]

Lingoqa: Visual question answering for au- tonomous driving,

A.-M. Marcu et al., “Lingoqa: Visual question answering for au- tonomous driving,” inEuropean Conference on Computer Vision, Springer, 2024, pp. 252–269

work page 2024
[14]

Sep. 2023. [Online]. Available: https : / / wayve . ai / thinking / lingo - natural-language-autonomous-driving/

work page 2023
[15]

Segment anything,

A. Kirillov et al., “Segment anything,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3992–4003

work page 2023
[16]

SAM 2: Segment anything in images and videos,

N. Ravi et al., “SAM 2: Segment anything in images and videos,” in Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025

work page 2025
[17]

Openan- notate3d: Open-vocabulary auto-labeling system for multi-modal 3d data,

Y . Zhou, L. Cai, X. Cheng, Z. Gan, X. Xue, and W. Ding, “Openan- notate3d: Open-vocabulary auto-labeling system for multi-modal 3d data,” in2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 9086–9092

work page 2024
[18]

Openbox: Annotate any bounding boxes in 3d,

I.-J. Lee, M. Kim, K. Ryu, P. Musacchio, and J. Park, “Openbox: Annotate any bounding boxes in 3d,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[19]

Argoverse 2: Next generation datasets for self-driving perception and forecasting,

B. Wilson, W. Qi, T. Agarwal, et al., “Argoverse 2: Next generation datasets for self-driving perception and forecasting,” inProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS), 2023

work page 2023
[20]

Perception and sensing for autonomous vehicles under adverse weather conditions: A survey,

Y . Zhang, A. Carballo, H. Yang, and K. Takeda, “Perception and sensing for autonomous vehicles under adverse weather conditions: A survey,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 196, pp. 146–177, 2023

work page 2023
[21]

Pseudo-labeling for scalable 3d object detection,

B. Caine, R. Roelofs, V . Vasudevan, J. Ngiam, Y . Chai, and J. Shlens, “Pseudo-labeling for scalable 3d object detection,”arXiv preprint arXiv:2103.02093, 2021. arXiv: 2103.02093[cs.CV]

work page arXiv 2021
[22]

Towards unsupervised object detection from lidar point clouds,

L. Zhang et al., “Towards unsupervised object detection from lidar point clouds,” inCVPR, Jun. 2023, pp. 9317–9328

work page 2023
[23]

Vlm-ad: End-to-end autonomous driving through vision-language model supervision,

Y . Xu et al., “Vlm-ad: End-to-end autonomous driving through vision- language model supervision,”arXiv preprint arXiv:2412.14446, 2024

work page arXiv 2024
[24]

A large and diverse dataset for improved vehicle make and model recognition,

F. Tafazzoli, H. Frigui, and K. Nishiyama, “A large and diverse dataset for improved vehicle make and model recognition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion Workshops (CVPRW), 2017, pp. 1–8

work page 2017
[25]

Two decades of vehicle make and model recognition – survey, challenges and future directions,

S. Gayen, S. Maity, P. K. Singh, Z. W. Geem, and R. Sarkar, “Two decades of vehicle make and model recognition – survey, challenges and future directions,”Journal of King Saud University - Computer and Information Sciences, vol. 36, p. 101 885, 2024

work page 2024
[26]

Real-time vehicle make and model recognition system,

M. A. Manzoor, Y . Morgan, and A. Bais, “Real-time vehicle make and model recognition system,”Machine Learning and Knowledge Extraction, vol. 1, no. 2, pp. 611–629, 2019

work page 2019
[27]

Automatic make and model recognition from frontal images of cars,

G. Pearce and N. Pears, “Automatic make and model recognition from frontal images of cars,” inProceedings of the IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS), IEEE, 2011, pp. 373–378

work page 2011
[28]

Real-time vehicle make and model recognition based on a bag of surf features,

A. J. Siddiqui, A. Mammeri, and A. Boukerche, “Real-time vehicle make and model recognition based on a bag of surf features,”IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 11, pp. 3205–3219, 2016

work page 2016
[29]

Real-time vehicle make and model recognition with the residual squeezenet architecture,

H. J. Lee, I. Ullah, W. Wan, Y . Gao, and Z. Fang, “Real-time vehicle make and model recognition with the residual squeezenet architecture,”Sensors, vol. 19, no. 5, p. 982, 2019

work page 2019
[30]

Can we recognize cars we’ve never seen? a journey through zero-shot learning in vehicle recognition,

A.-V . Miu and B. Ionescu, “Can we recognize cars we’ve never seen? a journey through zero-shot learning in vehicle recognition,” inProceedings of the Second Workshop on Artificial Intelligence for Multimedia, 2025

work page 2025
[31]

A year can make a big difference: Vehicle generations and why they matter for used car shoppers

W. Kaufman. “A year can make a big difference: Vehicle generations and why they matter for used car shoppers.” Website: CarMax (Edmunds Author), Accessed: Nov. 11, 2025. [Online]. Available: https://www.carmax.com/articles/understanding-vehicle-generations

work page 2025
[32]

What’s the difference? understanding vehicle gen- erations

T. O’Sullivan. “What’s the difference? understanding vehicle gen- erations.” Website: CarGurus, Accessed: Nov. 11, 2025. [Online]. Available: https://www.cargurus.com/Cars/articles/understanding- vehicle-generations

work page 2025
[33]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” inProceedings of the 36th International Confer- ence on Neural Information Processing Systems, ser. NIPS ’22, New Orleans, LA, USA, 2022

work page 2022
[34]

[Online]

Meta AI,Llama 4 models, 2025. [Online]. Available: https://ai.meta. com/blog/llama-4-multimodal-intelligence/

work page 2025
[35]

Mistral AI Team,Pixtral Large, Mistral AI News Release, Model announced on November 18, 2024, Nov. 2024. [Online]. Available: https://mistral.ai/news/pixtral-large

work page 2024
[36]

[Online]

Anthropic,Introducing the next generation of claude, 2025. [Online]. Available: https://www.anthropic.com/news/claude-4

work page 2025
[37]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini Team, Google, “Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities,”arXiv preprint arXiv:2507.06261, 2025. arXiv: 2507.06261[cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Power,2024 quality awards and ratings, 2024

J.D. Power,2024 quality awards and ratings, 2024. Accessed: Oct. 27, 2025. [Online]. Available: https://www.jdpower.com/cars/ ratings/quality/2024

work page 2024
[39]

caranddriver.com/rankings/best-suvs, 2025

Car and Driver,Best suvs for 2025, tested and reviewed, https://www. caranddriver.com/rankings/best-suvs, 2025

work page 2025
[40]

Average age of vehicles in the us hits 12.8 years in 2025,

“Average age of vehicles in the us hits 12.8 years in 2025,” S&P Global Mobility

work page 2025

[1] [1]

Determining absence of unreasonable risk: Ap- proval guidelines for an automated driving system deployment,

F. Favaro et al., “Determining absence of unreasonable risk: Ap- proval guidelines for an automated driving system deployment,”arXiv preprint arXiv:2505.09880, 2025

work page arXiv 2025

[2] [2]

Autonomous multi-sensor fusion techniques for environ- mental perception in self-driving vehicles,

I. Sumalatha, P. Chaturvedi, S. Patil, H. P. Thethi, A. A. Hameed, et al., “Autonomous multi-sensor fusion techniques for environ- mental perception in self-driving vehicles,” inProceedings of the International Conference on Communication, Computer Sciences and Engineering (IC3SE), IEEE, 2024, pp. 1146–1151

work page 2024

[3] [3]

Recent advances in 3d object detection for self-driving vehicles: A survey.,

O. A. Fawole and D. B. Rawat, “Recent advances in 3d object detection for self-driving vehicles: A survey.,”AI, vol. 5, no. 3, 2024

work page 2024

[4] [4]

A comprehensive review on deep learning-based motion planning and end-to-end learning for self-driving vehicle,

M. Ganesan, S. Kandhasamy, B. Chokkalingam, and L. Mihet-Popa, “A comprehensive review on deep learning-based motion planning and end-to-end learning for self-driving vehicle,”IEEE Access, vol. 12, pp. 66 031–66 067, 2024

work page 2024

[5] [5]

Scalability in perception for autonomous driving: Waymo open dataset,

P. Sun et al., “Scalability in perception for autonomous driving: Waymo open dataset,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2443– 2451

work page 2020

[6] [6]

Are we ready for autonomous driving? the kitti vision benchmark suite,

A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” inCVPR, 2012

work page 2012

[7] [7]

Detecting as labeling: Rethinking lidar-camera fusion in 3d object detection,

J. Huang, Y . Ye, Z. Liang, Y . Shan, and D. Du, “Detecting as labeling: Rethinking lidar-camera fusion in 3d object detection,” inProceedings of the European Conference on Computer Vision (ECCV), Springer, 2024, pp. 439–455

work page 2024

[8] [8]

A joint extrinsic calibration tool for radar, camera and lidar,

J. Domhof, J. F. Kooij, and D. M. Gavrila, “A joint extrinsic calibration tool for radar, camera and lidar,”IEEE Transactions on Intelligent Vehicles, vol. 6, no. 3, pp. 571–582, 2021

work page 2021

[9] [9]

Offboard 3d object detection from point cloud sequences,

C. R. Qi et al., “Offboard 3d object detection from point cloud sequences,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 6134–6144

work page 2021

[10] [10]

One thousand and one hours: Self-driving motion prediction dataset,

J. Houston et al., “One thousand and one hours: Self-driving motion prediction dataset,” inProceedings of the Conference on Robot Learning, PMLR, 2021, pp. 409–418

work page 2021

[11] [11]

A comprehensive survey on pretrained foundation models: A history from bert to chatgpt,

C. Zhou et al., “A comprehensive survey on pretrained foundation models: A history from bert to chatgpt,”International Journal of Machine Learning and Cybernetics, pp. 1–65, 2024

work page 2024

[12] [12]

Driving with llms: Fusing object-level vector modality for explainable autonomous driving,

L. Chen et al., “Driving with llms: Fusing object-level vector modality for explainable autonomous driving,” in2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 14 093– 14 100

work page 2024

[13] [13]

Lingoqa: Visual question answering for au- tonomous driving,

A.-M. Marcu et al., “Lingoqa: Visual question answering for au- tonomous driving,” inEuropean Conference on Computer Vision, Springer, 2024, pp. 252–269

work page 2024

[14] [14]

Sep. 2023. [Online]. Available: https : / / wayve . ai / thinking / lingo - natural-language-autonomous-driving/

work page 2023

[15] [15]

Segment anything,

A. Kirillov et al., “Segment anything,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3992–4003

work page 2023

[16] [16]

SAM 2: Segment anything in images and videos,

N. Ravi et al., “SAM 2: Segment anything in images and videos,” in Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025

work page 2025

[17] [17]

Openan- notate3d: Open-vocabulary auto-labeling system for multi-modal 3d data,

Y . Zhou, L. Cai, X. Cheng, Z. Gan, X. Xue, and W. Ding, “Openan- notate3d: Open-vocabulary auto-labeling system for multi-modal 3d data,” in2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 9086–9092

work page 2024

[18] [18]

Openbox: Annotate any bounding boxes in 3d,

I.-J. Lee, M. Kim, K. Ryu, P. Musacchio, and J. Park, “Openbox: Annotate any bounding boxes in 3d,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025

[19] [19]

Argoverse 2: Next generation datasets for self-driving perception and forecasting,

B. Wilson, W. Qi, T. Agarwal, et al., “Argoverse 2: Next generation datasets for self-driving perception and forecasting,” inProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS), 2023

work page 2023

[20] [20]

Perception and sensing for autonomous vehicles under adverse weather conditions: A survey,

Y . Zhang, A. Carballo, H. Yang, and K. Takeda, “Perception and sensing for autonomous vehicles under adverse weather conditions: A survey,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 196, pp. 146–177, 2023

work page 2023

[21] [21]

Pseudo-labeling for scalable 3d object detection,

B. Caine, R. Roelofs, V . Vasudevan, J. Ngiam, Y . Chai, and J. Shlens, “Pseudo-labeling for scalable 3d object detection,”arXiv preprint arXiv:2103.02093, 2021. arXiv: 2103.02093[cs.CV]

work page arXiv 2021

[22] [22]

Towards unsupervised object detection from lidar point clouds,

L. Zhang et al., “Towards unsupervised object detection from lidar point clouds,” inCVPR, Jun. 2023, pp. 9317–9328

work page 2023

[23] [23]

Vlm-ad: End-to-end autonomous driving through vision-language model supervision,

Y . Xu et al., “Vlm-ad: End-to-end autonomous driving through vision- language model supervision,”arXiv preprint arXiv:2412.14446, 2024

work page arXiv 2024

[24] [24]

A large and diverse dataset for improved vehicle make and model recognition,

F. Tafazzoli, H. Frigui, and K. Nishiyama, “A large and diverse dataset for improved vehicle make and model recognition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion Workshops (CVPRW), 2017, pp. 1–8

work page 2017

[25] [25]

Two decades of vehicle make and model recognition – survey, challenges and future directions,

S. Gayen, S. Maity, P. K. Singh, Z. W. Geem, and R. Sarkar, “Two decades of vehicle make and model recognition – survey, challenges and future directions,”Journal of King Saud University - Computer and Information Sciences, vol. 36, p. 101 885, 2024

work page 2024

[26] [26]

Real-time vehicle make and model recognition system,

M. A. Manzoor, Y . Morgan, and A. Bais, “Real-time vehicle make and model recognition system,”Machine Learning and Knowledge Extraction, vol. 1, no. 2, pp. 611–629, 2019

work page 2019

[27] [27]

Automatic make and model recognition from frontal images of cars,

G. Pearce and N. Pears, “Automatic make and model recognition from frontal images of cars,” inProceedings of the IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS), IEEE, 2011, pp. 373–378

work page 2011

[28] [28]

Real-time vehicle make and model recognition based on a bag of surf features,

A. J. Siddiqui, A. Mammeri, and A. Boukerche, “Real-time vehicle make and model recognition based on a bag of surf features,”IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 11, pp. 3205–3219, 2016

work page 2016

[29] [29]

Real-time vehicle make and model recognition with the residual squeezenet architecture,

H. J. Lee, I. Ullah, W. Wan, Y . Gao, and Z. Fang, “Real-time vehicle make and model recognition with the residual squeezenet architecture,”Sensors, vol. 19, no. 5, p. 982, 2019

work page 2019

[30] [30]

Can we recognize cars we’ve never seen? a journey through zero-shot learning in vehicle recognition,

A.-V . Miu and B. Ionescu, “Can we recognize cars we’ve never seen? a journey through zero-shot learning in vehicle recognition,” inProceedings of the Second Workshop on Artificial Intelligence for Multimedia, 2025

work page 2025

[31] [31]

A year can make a big difference: Vehicle generations and why they matter for used car shoppers

W. Kaufman. “A year can make a big difference: Vehicle generations and why they matter for used car shoppers.” Website: CarMax (Edmunds Author), Accessed: Nov. 11, 2025. [Online]. Available: https://www.carmax.com/articles/understanding-vehicle-generations

work page 2025

[32] [32]

What’s the difference? understanding vehicle gen- erations

T. O’Sullivan. “What’s the difference? understanding vehicle gen- erations.” Website: CarGurus, Accessed: Nov. 11, 2025. [Online]. Available: https://www.cargurus.com/Cars/articles/understanding- vehicle-generations

work page 2025

[33] [33]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” inProceedings of the 36th International Confer- ence on Neural Information Processing Systems, ser. NIPS ’22, New Orleans, LA, USA, 2022

work page 2022

[34] [34]

[Online]

Meta AI,Llama 4 models, 2025. [Online]. Available: https://ai.meta. com/blog/llama-4-multimodal-intelligence/

work page 2025

[35] [35]

Mistral AI Team,Pixtral Large, Mistral AI News Release, Model announced on November 18, 2024, Nov. 2024. [Online]. Available: https://mistral.ai/news/pixtral-large

work page 2024

[36] [36]

[Online]

Anthropic,Introducing the next generation of claude, 2025. [Online]. Available: https://www.anthropic.com/news/claude-4

work page 2025

[37] [37]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini Team, Google, “Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities,”arXiv preprint arXiv:2507.06261, 2025. arXiv: 2507.06261[cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

Power,2024 quality awards and ratings, 2024

J.D. Power,2024 quality awards and ratings, 2024. Accessed: Oct. 27, 2025. [Online]. Available: https://www.jdpower.com/cars/ ratings/quality/2024

work page 2024

[39] [39]

caranddriver.com/rankings/best-suvs, 2025

Car and Driver,Best suvs for 2025, tested and reviewed, https://www. caranddriver.com/rankings/best-suvs, 2025

work page 2025

[40] [40]

Average age of vehicles in the us hits 12.8 years in 2025,

“Average age of vehicles in the us hits 12.8 years in 2025,” S&P Global Mobility

work page 2025