Scalable Object Detection in the Car Interior With Vision Foundation Models

Ahmet Firintepe; Bálint Mészáros; Sebastian Schmidt; Stephan Günnemann

arxiv: 2508.19651 · v3 · submitted 2025-08-27 · 💻 cs.CV

Pith reviewed 2026-05-18 21:04 UTC · model grok-4.3

classification 💻 cs.CV

keywords car interior object detection · vision foundation models · distributed architecture · model fine-tuning · hallucination reduction · LLaVA · ODALbench · automotive AI

The pith

Fine-tuned 7B vision model reaches an 89% ODAL score on car-interior object detection and outperforms GPT-4o.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the ODAL framework that splits vision foundation model computation between a car's limited on-board hardware and cloud resources to enable object detection and localization inside vehicles. It introduces the ODALbench metric to measure both accurate placement of detected objects and the rate of hallucinations. The authors demonstrate that fine-tuning the LLaVA 1.5 7B model on interior scenes produces an 89% ODAL score, a 71% gain over its untuned baseline, and nearly 20% better results than GPT-4o while tripling the signal-to-noise ratio through fewer false detections. A reader would care because this approach could let resource-constrained vehicles run reliable scene-understanding assistants that notice and locate items introduced by passengers.

Core claim

The ODAL framework applies vision foundation models to car-interior object detection and localization via a distributed on-board and cloud architecture that overcomes on-board compute limits. Fine-tuning the lightweight LLaVA 1.5 7B model on this task yields an ODAL score of 89%, a 71% improvement over the baseline, and outperforms the GPT-4o model by nearly 20% while achieving an ODAL SNR three times higher through reduced hallucinations.
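The abstract does not say whether the "71% improvement" is relative or measured in percentage points, and the two readings imply very different baselines. The sketch below only works through that arithmetic; the function names are hypothetical, and it makes no claim about which reading the authors intend.

```python
def baseline_if_relative(final_score: float, gain_pct: float) -> float:
    """Baseline implied when the gain is relative: final = baseline * (1 + gain/100)."""
    return final_score / (1.0 + gain_pct / 100.0)


def baseline_if_absolute(final_score: float, gain_points: float) -> float:
    """Baseline implied when the gain is in percentage points: final = baseline + gain."""
    return final_score - gain_points


final_score = 89.0  # ODAL score of the fine-tuned model, as stated in the abstract
gain = 71.0         # the "71% improvement", as stated in the abstract

relative_baseline = baseline_if_relative(final_score, gain)
absolute_baseline = baseline_if_absolute(final_score, gain)

print(f"relative reading: implied baseline of about {relative_baseline:.1f}%")
print(f"percentage-point reading: implied baseline of {absolute_baseline:.1f}%")
```

Under the relative reading the untuned model would sit near 52%; under the percentage-point reading it would sit at 18%. The same ambiguity applies to the "nearly 20%" margin over GPT-4o.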

What carries the argument

The ODAL framework, a distributed on-board/cloud split that offloads heavy vision foundation model inference to the cloud while keeping lightweight detection on the vehicle.

If this is right

Lightweight models can be fine-tuned to surpass much larger general models on narrow, safety-relevant tasks such as interior monitoring.
Targeted fine-tuning reduces hallucinations enough to make vision-language outputs more trustworthy for driver-assistance systems.
The ODALbench metric supplies a single number that balances detection precision against false positives, allowing consistent comparison across models.
Distributed execution removes the need for high-end on-board GPUs while preserving most of the accuracy of cloud-scale models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same split-and-fine-tune pattern could extend to other constrained settings such as drones or industrial robots that must interpret their immediate surroundings.
Lower hallucination rates may translate into fewer erroneous assistant responses that distract drivers, a direct safety benefit not quantified in the paper.
Pairing the vision output with existing cabin sensors could further tighten localization without extra cloud calls.

Load-bearing premise

The on-board and cloud split can run with acceptable latency, bandwidth use, and reliability in real driving conditions, and the ODALbench metric captures what matters for actual deployment.

What would settle it

Deploy the full distributed system in a moving vehicle, introduce varied objects under changing lighting and network conditions, and measure whether the fine-tuned model still exceeds 80% ODAL score and maintains three times the SNR of GPT-4o.

Figures

Figures reproduced from arXiv: 2508.19651 by Ahmet Firintepe, Bálint Mészáros, Sebastian Schmidt, Stephan Günnemann.

**Figure 2.** Overview of our ODAL framework. Our framework allows for the decoupled execution of different model parts across on-board and cloud…

**Figure 3.** Example of a label for the object backpack. The isvisible attribute helps for the further evaluation. To address the limited amount of data available for finetuning, three levels of data augmentation were applied: (1) no augmentation, (2) basic augmentation, which included rotation, flipping, and brightness adjustment, and (3) extensive augmentation, which, in addition to the basic techniques, incorpora…

**Figure 4.** Comparison of GPT-4o, LLaVA-1.5, and our ODAL-LLaVA on two different metrics. a) Plot of model performance for ODAL…

Original abstract

AI tasks in the car interior, like identifying and localizing externally introduced objects, are crucial for the response quality of personal assistants. However, computational resources of on-board systems remain highly constrained, restricting the deployment of such solutions directly within the vehicle. To address this limitation, we propose the novel Object Detection and Localization (ODAL) framework for interior scene understanding. Our approach leverages vision foundation models through a distributed architecture, splitting computational tasks between on-board and cloud. This design overcomes the resource constraints of running foundation models directly in the car. To benchmark model performance, we introduce ODALbench, a new metric for comprehensive assessment of detection and localization. Our analysis demonstrates the framework's potential to establish new standards in this domain. We compare the state-of-the-art GPT-4o vision foundation model with the lightweight LLaVA 1.5 7B model and explore how fine-tuning enhances the lightweight model's performance. Remarkably, our fine-tuned ODAL-LLaVA model achieves an ODAL$_{score}$ of 89%, representing a 71% improvement over its baseline performance and outperforming GPT-4o by nearly 20%. Furthermore, the fine-tuned model maintains high detection accuracy while significantly reducing hallucinations, achieving an ODAL$_{SNR}$ three times higher than GPT-4o.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Fine-tuned LLaVA beats GPT-4o on a custom car-interior metric but the new ODAL_score lacks validation against standard measures.

Your colleague should know that this work applies vision foundation models to car interior object detection by splitting compute between the vehicle and the cloud, and shows that fine-tuning LLaVA 1.5 7B yields big gains on their new ODAL score over both its baseline and GPT-4o. The new elements are the ODAL framework for distributed scene understanding and the ODALbench metric that scores detection accuracy, localization, and hallucination reduction together.

The paper does well in identifying the resource constraint in vehicles and proposing a split architecture to work around it. The empirical results on fine-tuning are presented clearly, with specific claims of a 71% relative improvement to 89% and a threefold increase in ODAL_SNR compared to GPT-4o. This kind of domain adaptation for foundation models in a constrained setting is useful to see.

The main soft spot is the reliance on the newly introduced metric without evidence that it correlates with standard detection metrics or human judgments. As the stress test notes, there's no reported correlation with mAP, IoU, or F1, and no preference study, so the claimed superiority could partly be an artifact of how ODAL_score and ODAL_SNR are defined. The abstract also lacks details on the training data, exact evaluation protocol, statistical tests, or ablations on the architecture choices. Without those, it's hard to assess how solid the performance claims are. The assumption that the on-board/cloud split will handle real-world latency and reliability isn't backed by experiments here.

This paper would interest people in automotive AI, edge deployment of large models, or VLM fine-tuning for specific domains. A reader focused on practical applications rather than theoretical advances could get value from the setup and the fine-tuning results. It deserves a serious referee because the problem is well-motivated and the approach is straightforward, though the metric validation and missing experimental details would need attention in review.

Referee Report

1 major / 2 minor

Summary. The paper introduces the Object Detection and Localization (ODAL) framework, which employs a distributed on-board/cloud architecture with vision foundation models to identify and localize externally introduced objects in car interiors. It presents ODALbench as a new composite metric and reports that a fine-tuned ODAL-LLaVA model (based on LLaVA 1.5 7B) reaches an ODAL_score of 89% (71% relative improvement over baseline) while outperforming GPT-4o by nearly 20% and achieving three times higher ODAL_SNR.

Significance. If the central empirical claims hold after validation of the new metric, the work could enable more practical deployment of vision foundation models in resource-constrained automotive settings by demonstrating effective fine-tuning of lightweight models and hallucination reduction.

major comments (1)

[Section 4] ODAL_score and ODAL_SNR are defined as a composite of detection accuracy, localization precision, and hallucination penalty, yet the manuscript provides no correlation analysis against established metrics (mAP@0.5, IoU, F1) or human preference studies on the same images. This makes the reported 89% score, 71% gain, and 3× SNR advantage over GPT-4o difficult to interpret as genuine improvements rather than artifacts of the custom formulation.
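For readers less familiar with the established metrics named in this comment, here is a minimal sketch of two of them: intersection over union (IoU) for axis-aligned boxes and F1 from detection counts. These are the textbook definitions, not the paper's ODALbench formulas. The `(x_min, y_min, x_max, y_max)` box layout is an assumption, and this page does not state whether the paper's labels are bounding boxes at all.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

In a conventional detection benchmark, a prediction counts as a true positive when its IoU with a ground-truth box reaches a threshold such as 0.5, which is what the "@0.5" in mAP@0.5 refers to.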

minor comments (2)

[Abstract] The abstract and methods sections omit dataset details, evaluation protocol, statistical significance tests, and ablation studies, which are needed to support the numerical claims even if the metric validation is addressed.
The distributed on-board/cloud split is presented as overcoming resource constraints, but no quantitative analysis of latency, bandwidth, or reliability under driving conditions is provided.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment point by point below and describe the revisions we will incorporate to strengthen the validation of our proposed metrics.

Referee: [Section 4] ODAL_score and ODAL_SNR are defined as a composite of detection accuracy, localization precision, and hallucination penalty, yet the manuscript provides no correlation analysis against established metrics (mAP@0.5, IoU, F1) or human preference studies on the same images. This makes the reported 89% score, 71% gain, and 3× SNR advantage over GPT-4o difficult to interpret as genuine improvements rather than artifacts of the custom formulation.

Authors: We agree that additional validation would improve interpretability of ODALbench. In the revised manuscript we will add a new analysis subsection in Section 4 that reports Pearson and Spearman correlations between ODAL_score and the standard metrics mAP@0.5, mean IoU, and F1-score evaluated on the same test images. This will show that our composite metric is consistent with conventional measures while explicitly penalizing hallucinations, which is particularly relevant for safety-critical car-interior applications. We did not conduct human preference studies in the present work; we will therefore add an explicit limitations paragraph noting this gap and identifying it as valuable future work. The reported gains remain grounded in the quantitative reduction of hallucinations observed across the benchmark, which is the core practical contribution for resource-constrained automotive deployment.

revision: partial
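The correlation analysis promised in this simulated rebuttal could be computed along the following lines. This is a dependency-free sketch, not the authors' procedure: the function names are hypothetical, and the two score lists are synthetic placeholders chosen only to exercise the code, not results from the paper.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Synthetic placeholder scores per image, NOT taken from the paper.
odal_like = [0.10, 0.40, 0.35, 0.80, 0.90]
f1_like = [0.20, 0.30, 0.50, 0.70, 0.95]
print(f"Pearson:  {pearson(odal_like, f1_like):.3f}")
print(f"Spearman: {spearman(odal_like, f1_like):.3f}")
```

Spearman is the more forgiving of the two here: it only asks whether the custom metric ranks images in the same order as the standard one, which is the property that matters if ODAL_score is used to compare models.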

Circularity Check

0 steps flagged

No circularity: empirical results on custom benchmark

The paper introduces the ODAL framework and ODALbench metric, then reports experimental outcomes from fine-tuning LLaVA and comparing to GPT-4o. Performance figures (89% ODAL_score, 71% relative gain, 3× SNR) are obtained by direct evaluation on held-out test data rather than any derivation, equation, or fitted parameter that reduces to the inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the central claims. The work is self-contained as an empirical demonstration.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The work rests on standard assumptions about fine-tuning effectiveness and the feasibility of distributed inference.

axioms (1)

domain assumption: Vision foundation models can be effectively fine-tuned for the specialized domain of car interior object detection and localization.
The reported 71% improvement and outperformance of GPT-4o depend on this premise.

invented entities (2)

ODAL framework (no independent evidence)
purpose: Distributed on-board/cloud architecture for scalable interior scene understanding.
Newly proposed system whose independent validation is not described in the abstract.

ODALbench (no independent evidence)
purpose: Composite metric for detection and localization performance.
Introduced by the authors to benchmark the framework.

pith-pipeline@v0.9.0 · 5770 in / 1495 out tokens · 51349 ms · 2026-05-18T21:04:30.312811+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost · Jcost_pos_of_ne_one · relation to the paper passage: unclear
Paper passage: "We introduce a novel framework for interior scene understanding with distributed on-board and cloud computing... new benchmark for interior scene understanding... ODAL_score... ODAL_SNR = C/H"

IndisputableMonolith/Foundation/AlphaCoordinateFixation · alpha_pin_under_high_calibration · relation to the paper passage: unclear
Paper passage: "fine-tuned ODAL-LLaVA model achieves an ODAL_score of 89%"
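The first quoted passage gives the ratio ODAL_SNR = C/H without defining its symbols on this page. The sketch below assumes C is the number of correct detections and H the number of hallucinated ones; that reading, the function name, and the handling of H = 0 are all assumptions rather than the paper's definition.

```python
import math


def odal_snr(correct: int, hallucinated: int) -> float:
    """Ratio C/H under the assumed reading: correct detections per hallucination."""
    if correct < 0 or hallucinated < 0:
        raise ValueError("counts must be non-negative")
    if hallucinated == 0:
        # Assumed convention: no hallucinations gives an unbounded ratio,
        # except when nothing was detected at all.
        return math.inf if correct > 0 else 0.0
    return correct / hallucinated
```

Under this reading, a threefold SNR advantage means three times as many correct detections per hallucination, which can come from more correct detections, fewer hallucinations, or both.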

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 4 internal anchors

[1] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2015

[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

[3] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-End Object Detection with Transformers,” in Proc. Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds., Springer International Publishing, Cham, 2020

[4] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully Convolutional One-Stage Object Detection,” in Proc. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019

[5] Z. Zhao, P. Zheng, S. Xu, and X. Wu, “Object Detection with Deep Learning: A Review,” IEEE Transactions on Neural Networks and Learning Systems, 2019

[6] A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” in Proc. International Conference on Machine Learning (ICML), 2021

[7] R. Bommasani et al., “On the Opportunities and Risks of Foundation Models,” arXiv preprint arXiv:2108.07258, 2021

[8] J.-B. Alayrac et al., “Flamingo: a Visual Language Model for Few-Shot Learning,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., Curran Associates, Inc., 2022

[9] W. Dai et al., “InstructBLIP: Towards General-Purpose Vision-Language Models with Instruction Tuning,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), vol. 36, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., Curran Associates, Inc., 2023

[10] K. Duan et al., “CenterNet: Keypoint Triplets for Object Detection,” in Proc. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019

[11] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual Instruction Tuning,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., Curran Associates, Inc., 2023

[12] H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved Baselines with Visual Instruction Tuning,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

[13] E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” arXiv preprint arXiv:2106.09685, 2021

[14] M. Dehghani et al., “Scaling Vision Transformers,” arXiv preprint arXiv:2302.05442, 2023

[15] M. Abdin et al., “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone,” arXiv preprint arXiv:2404.14219, 2024

[16] H. Touvron et al., “LLaMA: Open and Efficient Foundation Language Models,” arXiv preprint arXiv:2302.13971, 2023

[17] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, “Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality,” Mar. 2023. [Online]. Available: https://lmsys.org/blog/2023-03-30-vicuna/

[18] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning Transferable Visual Models From Natural Language Supervision,” in Proc. 38th International Conference on Machine Learning, M. Meila and T. Zhang, Eds., 2021

[19] D. Miller et al., “Open-Set Recognition in the Age of Vision-Language Models,” in Computer Vision – ECCV 2024, A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, and G. Varol, Eds., Springer Nature Switzerland, 2025

[20] H. Huang et al., “Renovating Names in Open-Vocabulary Segmentation Benchmarks,” arXiv preprint arXiv:2403.09593, 2024

[21] T. Wolf et al., “Transformers: State-of-the-Art Natural Language Processing,” in Proc. 2020 Conf. Empirical Methods in Natural Language Processing: System Demonstrations, 2020

[22] L. von Werra et al., “TRL: Transformer Reinforcement Learning,” GitHub repository, 2020. [Online]. Available: https://github.com/huggingface/trl

[23] S. Mangrulkar et al., “PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods,” 2022. [Online]. Available: https://github.com/huggingface/peft
