pith. machine review for the scientific record.

arxiv: 2605.00480 · v1 · submitted 2026-05-01 · 💻 cs.CV

Recognition: unknown

Leveraging Vision-Language Models as Weak Annotators in Active Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords: active learning · vision-language models · weak labels · fine-grained recognition · label noise modeling · annotation efficiency · coarse-grained supervision

The pith

Vision-language models supply reliable coarse labels that combine with sparse human fine labels to outperform standard active learning under fixed budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that vision-language models perform poorly on fine-grained labels but reliably on coarse-grained ones in recognition tasks. It builds an active learning method that assigns VLM coarse labels instance-wise alongside selected human fine labels and corrects VLM noise using a small trusted set of full labels. This hybrid setup lowers the total human annotation needed while raising accuracy. A sympathetic reader cares because detailed human labeling is expensive and the method exploits an existing strength of readily available models. Experiments on CUB200 and FGVC-Aircraft confirm consistent gains over prior active learning approaches at the same budget.

Core claim

The central claim is that an active learning framework can leverage VLMs to generate coarse-grained weak labels, merge them instance-wise with human-provided fine-grained labels, and model the systematic noise in those weak labels from only a small trusted set, thereby achieving higher performance than existing active learning methods under identical annotation budgets on fine-grained datasets such as CUB200 and FGVC-Aircraft.

What carries the argument

Instance-wise label assignment that fuses VLM-generated coarse labels with human fine labels, paired with noise modeling derived from a small trusted set of full labels.
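One way to picture the instance-wise fusion (a purely illustrative sketch; the paper's actual loss formulation is not given here, and every name below is hypothetical) is as target encoding: a human fine label becomes a one-hot target over the fine classes, while a VLM coarse label constrains the fine class to its superclass's members, here expressed as a uniform target over them.

```python
import numpy as np

def fine_target(fine_label, n_fine):
    """One-hot target for a human-provided fine-grained label."""
    t = np.zeros(n_fine)
    t[fine_label] = 1.0
    return t

def coarse_to_fine_target(coarse_label, superclass_of, n_fine):
    """Target for a VLM coarse label: uniform mass over the fine classes
    belonging to that superclass (one plausible encoding, not the paper's)."""
    members = [c for c in range(n_fine) if superclass_of[c] == coarse_label]
    t = np.zeros(n_fine)
    t[members] = 1.0 / len(members)
    return t

# 4 fine classes grouped into 2 superclasses: {0, 1} -> 0, {2, 3} -> 1
sup = {0: 0, 1: 0, 2: 1, 3: 1}
coarse_to_fine_target(1, sup, 4)  # -> [0, 0, 0.5, 0.5]
```

Both target types can then feed a single cross-entropy-style loss, which is what makes the assignment "instance-wise": each sample carries whichever supervision it was given.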

If this is right

  • Fewer total human annotations are required to reach a target accuracy level.
  • The method works on standard fine-grained benchmarks like CUB200 and FGVC-Aircraft.
  • Noise correction remains effective even when the trusted set is small.
  • Active learning query selection can now incorporate weak VLM signals without full supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coarse-versus-fine reliability pattern may appear in other recognition domains and could support similar hybrid supervision.
  • Scaling to larger or newer VLMs might further improve the quality of the coarse labels supplied to the framework.
  • The noise-modeling step offers a template for incorporating other imperfect weak annotators in selection-based learning.

Load-bearing premise

That vision-language models produce accurate coarse-grained labels in fine-grained tasks and that their errors follow a systematic pattern correctable from only a few trusted full labels.
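This premise can be made concrete with a loss-correction sketch in the style of Patrini et al. [30], whom the paper cites; whether its noise model takes exactly this form is an assumption. The idea: estimate a transition matrix T, with T[i, j] ≈ P(VLM label = j | true label = i), from the small trusted set, then push clean predictions through T when computing the loss on VLM-labeled data.

```python
import numpy as np

def estimate_transition(trusted_true, trusted_vlm, n_classes):
    """Estimate T[i, j] = P(VLM says j | true class i) by counting on a
    small trusted set that carries both labels. The tiny prior avoids
    zero rows when a class is unseen in the trusted set."""
    T = np.full((n_classes, n_classes), 1e-8)
    for t, v in zip(trusted_true, trusted_vlm):
        T[t, v] += 1.0
    return T / T.sum(axis=1, keepdims=True)

def forward_corrected_probs(clean_probs, T):
    """Forward correction: map clean-class predictions into noisy-label
    space (p_noisy = p_clean @ T) before scoring against VLM labels."""
    return clean_probs @ T

# Toy trusted set: class 0 is mislabeled half the time, class 1 never.
T = estimate_transition([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
# T row 0 ≈ [0.5, 0.5]; row 1 ≈ [0, 1]
```

If the VLM's errors really are systematic, even a handful of trusted examples pins down T well; if they are idiosyncratic per image, this correction breaks down, which is exactly the load-bearing risk.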

What would settle it

Running the framework on a new fine-grained dataset where the VLM's coarse labels are no better than chance: if the claimed mechanism is what drives the gains, the reported performance advantage over standard active learning should disappear.

Original abstract

Active learning aims to reduce annotation cost by selectively querying informative samples for supervision under a limited labeling budget. In this work, we investigate how vision-language models (VLMs) can be leveraged to further reduce the reliance on costly human annotation within the active learning paradigm. To this end, we find that the reliability of VLMs varies significantly with label granularity in fine-grained recognition tasks: they perform poorly on fine-grained labels but can provide accurate coarse-grained labels. Leveraging this property, we propose an active learning framework that combines fine-grained human annotations with coarse-grained VLM-generated weak labels through instance-wise label assignment. We further model the systematic noise in VLM-generated labels using a small set of trusted full labels. Experiments on CUB200 and FGVC-Aircraft show that the proposed framework consistently outperforms existing active learning methods under the same annotation budget.
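The loop described in the abstract can be sketched as follows (a hedged sketch; the function names and the simple budget policy are assumptions, not the paper's code): each round scores the unlabeled pool by predictive entropy, spends the human budget on the most uncertain samples, and lets the VLM coarsely annotate the rest.

```python
import numpy as np

def entropy_scores(probs):
    """Predictive entropy per sample (the criterion of the Entropy baseline)."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def hybrid_round(probs, budget):
    """Split one round's pool: human fine labels for the `budget` most
    uncertain samples, VLM coarse labels for everything else."""
    order = np.argsort(-entropy_scores(probs))  # most uncertain first
    human_idx = order[:budget]   # queried for costly fine annotations
    vlm_idx = order[budget:]     # annotated coarsely by the VLM, at no cost
    return human_idx, vlm_idx

probs = np.array([[0.50, 0.50],   # maximally uncertain
                  [0.99, 0.01],
                  [0.80, 0.20]])
human_idx, vlm_idx = hybrid_round(probs, budget=1)
# sample 0 goes to the human annotator; samples 2 and 1 go to the VLM
```

The contrast with standard active learning is that the non-queried remainder of the pool is no longer discarded: it receives free, if noisy, coarse supervision.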

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes an active learning framework for fine-grained visual recognition that exploits the differential reliability of vision-language models (VLMs): poor performance on fine-grained labels but usable accuracy on coarse-grained labels. It combines instance-wise assignment of VLM-generated coarse weak labels with a small set of trusted human full labels to model systematic noise in the VLM outputs, claiming that the resulting hybrid supervision consistently outperforms standard active learning baselines on CUB200 and FGVC-Aircraft under identical annotation budgets.

Significance. If the empirical results and supporting ablations are robust, the work demonstrates a concrete, low-cost way to reduce human labeling effort in active learning by capitalizing on existing VLM capabilities, which could meaningfully improve annotation efficiency for fine-grained tasks where full supervision is expensive.

major comments (2)
  1. [Abstract] Abstract: the central claim that the framework 'consistently outperforms existing active learning methods under the same annotation budget' is asserted without any quantitative performance numbers, baseline names, accuracy deltas, or references to tables/figures; this absence prevents verification of the magnitude or reliability of the reported gains.
  2. [Abstract] Abstract and method description: the load-bearing assumption that VLMs supply sufficiently accurate coarse-grained labels (despite poor fine-grained performance) and that their systematic noise is recoverable from only a small trusted full-label subset is stated but unsupported by any reported coarse-label accuracy figures, label-hierarchy details, or ablation isolating the noise-model contribution on CUB200/FGVC-Aircraft.
minor comments (1)
  1. The abstract would be clearer if it named the specific VLM(s) employed and the exact coarse/fine label hierarchy used for the two datasets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the abstract to provide more immediate empirical support for our claims while preserving its conciseness.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the framework 'consistently outperforms existing active learning methods under the same annotation budget' is asserted without any quantitative performance numbers, baseline names, accuracy deltas, or references to tables/figures; this absence prevents verification of the magnitude or reliability of the reported gains.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative results. In the revised manuscript we will add specific accuracy figures from our CUB200 and FGVC-Aircraft experiments, name the primary active-learning baselines, report the observed accuracy deltas under the fixed annotation budget, and include explicit references to the corresponding tables and figures. revision: yes

  2. Referee: [Abstract] Abstract and method description: the load-bearing assumption that VLMs supply sufficiently accurate coarse-grained labels (despite poor fine-grained performance) and that their systematic noise is recoverable from only a small trusted full-label subset is stated but unsupported by any reported coarse-label accuracy figures, label-hierarchy details, or ablation isolating the noise-model contribution on CUB200/FGVC-Aircraft.

    Authors: The full paper reports the coarse-grained VLM accuracies, the label hierarchy employed, and dedicated ablations on the noise-modeling component in the experimental and ablation sections. To make this evidence visible already in the abstract, we will add brief quantitative statements on coarse-label reliability and the noise-model contribution together with references to the relevant figures and tables. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical framework with no derivations

Full rationale

The paper presents an empirical active learning framework that combines human fine-grained labels with VLM coarse-grained weak labels and a noise model trained on a small trusted set. No equations, derivations, or self-definitional steps appear in the provided text or abstract. Central claims rest on experimental outperformance on public benchmarks (CUB200, FGVC-Aircraft) under fixed budgets, not on any fitted parameter renamed as prediction or on self-citation chains. Assumptions about VLM reliability are stated as observations and tested empirically rather than derived by construction. This matches the default case of a self-contained empirical study with no load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach relies on standard active learning selection and existing VLMs without introducing new postulated objects.

pith-pipeline@v0.9.0 · 5454 in / 1112 out tokens · 53002 ms · 2026-05-09T19:50:53.129738+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 4 canonical work pages · 4 internal anchors

  1. [1]

    Leveraging Vision-Language Models as Weak Annotators in Active Learning

    INTRODUCTION Active learning (AL) [1, 2, 3, 4, 5, 6] aims to improve model performance under a limited annotation budget by selectively querying the most informative data samples for annotation. A common strategy is to prioritize samples near the decision boundary, where additional supervision is expected to yield the largest performance gain. In conven...

  2. [2]

    Active Learning Active learning (AL) [1, 2, 3] aims to improve model performance under a limited annotation budget by selectively querying informative samples for labeling

    RELATED WORK 2.1. Active Learning Active learning (AL) [1, 2, 3] aims to improve model performance under a limited annotation budget by selectively querying informative samples for labeling. Many AL methods select samples based on predictive uncertainty, prioritizing instances that are expected to provide the largest performance gain, using criteria ...

  3. [3]

    Specifically, we compare VLM performance on fine-grained and coarse-grained class labels

    PRELIMINARY EXPERIMENTS To use vision-language models (VLMs) as annotators in active learning for fine-grained classification, we first examine how their inference performance depends on label granularity. Specifically, we compare VLM performance on fine-grained and coarse-grained class labels. 3.1. Datasets We conduct preliminary experiments on two f...

  4. [4]

    In our study, species-level labels are treated as fine-grained labels

    Caltech-UCSD Birds-200-2011 (CUB200) [11]: A bird classification dataset containing 11,788 images across 200 species. In our study, species-level labels are treated as fine-grained labels. Coarse-grained labels are defined following [17], where 70 superclasses are constructed based on suffix patterns in the class names. The dataset consists of 5,994 tra...

  5. [5]

    Classify the image into one of the following N classes: CLASS 1, . . . , CLASS N

    FGVC-Aircraft [12]: An aircraft classification dataset consisting of 10,000 images covering 100 variants. We use the coarse-grained labels as predefined in the dataset, where the coarse-grained label corresponds to the manufacturer level. The dataset consists of 6,667 training images and 3,333 evaluation images. 3.2. Experimental settings We adopt Gemini 2...

  6. [6]

    PROPOSED METHOD 4.1. Problem Setting and Overview We consider an active learning setting for fine-grained image classification under a limited annotation budget, where each queried instance is annotated either with a fine-grained human label (full label) or a coarse-grained label (weak label) generated by a vision-language model (VLM), as illustrat...

  7. [7]

    Dataset In this section, we use the same datasets as those employed in the preliminary experiments, namely CUB200 [11] and FGVC-Aircraft [12]

    EXPERIMENTS 5.1. Dataset In this section, we use the same datasets as those employed in the preliminary experiments, namely CUB200 [11] and FGVC-Aircraft [12]. Both datasets are fine-grained image classification benchmarks with predefined hierarchical label structures, which allow us to naturally use fine-grained full labels and coarse-grained weak label...

  8. [8]

    Random: A baseline that randomly samples instances from the unlabeled pool

  9. [9]

    Entropy [4]: An uncertainty-based AL method that selects instances with high entropy of the class probability predictions produced by the classifier

  10. [10]

    BADGE [5]: An AL method that accounts for both predictive uncertainty and diversity by selecting instances with diverse and high-magnitude gradients in the gradient space

  11. [11]

    ISOAL [6]: An AL framework that selects the supervision level for each instance under a fixed annotation budget, assuming all annotations are provided by human annotators. 5.3. Implementation Details The proposed network architecture consists of a shared feature extractor and two classification heads, one for fully supervised learning and the other f...

  12. [12]

    CONCLUSION We proposed an active learning (AL) framework that leverages vision-language models (VLMs) as weak annotators to reduce reliance on costly human supervision. From preliminary experiments, we found that the reliability of VLMs varied significantly with label granularity in fine-grained recognition tasks, where VLMs struggled with fine-grained ...

  13. [13]

    Active learning literature survey,

    Burr Settles, “Active learning literature survey,” 2009

  14. [14]

    A survey of deep active learning,

    Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B Gupta, Xiaojiang Chen, and Xin Wang, “A survey of deep active learning,” ACM Computing Surveys, vol. 54, no. 9, pp. 1–40, 2021

  15. [15]

    Active learning: A survey,

    Charu C Aggarwal, Xiangnan Kong, Quanquan Gu, Jiawei Han, and S Yu Philip, “Active learning: A survey,” in Data Classification, pp. 599–634. Chapman and Hall/CRC, 2014

  16. [16]

    A new active labeling method for deep learning,

    Dan Wang and Yi Shang, “A new active labeling method for deep learning,” in International Joint Conference on Neural Networks, 2014, pp. 112–119

  17. [17]

    Deep batch active learning by diverse, uncertain gradient lower bounds,

    Jordan T Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal, “Deep batch active learning by diverse, uncertain gradient lower bounds,” in International Conference on Learning Representations, 2019

  18. [18]

    Instance-wise supervision-level optimization in active learning,

    Shinnosuke Matsuo, Riku Togashi, Ryoma Bise, Seiichi Uchida, and Masahiro Nomura, “Instance-wise supervision-level optimization in active learning,” in Computer Vision and Pattern Recognition, June 2025, pp. 4939–4947

  19. [19]

    Learning transferable visual models from natural language supervision,

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763

  20. [20]

    Gemini 2.0 flash: The gemini 2 family expands,

    Google DeepMind, “Gemini 2.0 flash: The gemini 2 family expands,” https://developers.googleblog.com/en/gemini-2-family-expands/, Dec. 2024, [Online]. Accessed: Jan 18, 2026

  21. [21]

    Active prompt learning in vision language models,

    Jihwan Bang, Sumyeong Ahn, and Jae-Gil Lee, “Active prompt learning in vision language models,” in Computer Vision and Pattern Recognition, 2024, pp. 27004–27014

  22. [22]

    Active learning for vision-language models,

    Bardia Safaei and Vishal M Patel, “Active learning for vision-language models,” in Winter Conference on Applications of Computer Vision. IEEE, 2025, pp. 4902–4912

  23. [23]

    The Caltech-UCSD Birds-200-2011 dataset,

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie, “The Caltech-UCSD Birds-200-2011 dataset,” 2011

  24. [24]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi, “Fine-grained visual classification of aircraft,” arXiv preprint arXiv:1306.5151, 2013

  25. [25]

    Deep bayesian active learning with image data,

    Yarin Gal, Riashat Islam, and Zoubin Ghahramani, “Deep Bayesian active learning with image data,” in International Conference on Machine Learning, 2017, pp. 1183–1192

  26. [26]

    Active learning for convolutional neural networks: A core-set approach,

    Ozan Sener and Silvio Savarese, “Active learning for convolutional neural networks: A core-set approach,” in International Conference on Learning Representations, 2018

  27. [27]

    Full or weak annotations? An adaptive strategy for budget-constrained annotation campaigns,

    Javier Gamazo Tejero, Martin S Zinkernagel, Sebastian Wolf, Raphael Sznitman, and Pablo Márquez-Neila, “Full or weak annotations? An adaptive strategy for budget-constrained annotation campaigns,” in Conference on Computer Vision and Pattern Recognition, 2023, pp. 11381–11391

  28. [28]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  29. [29]

    Using coarse label constraint for fine-grained visual classification,

    Chao Lu and Yuexian Zou, “Using coarse label constraint for fine-grained visual classification,” in Conference on Multimedia Modeling, 2018

  30. [30]

    Making deep neural networks robust to label noise: A loss correction approach,

    Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu, “Making deep neural networks robust to label noise: A loss correction approach,” in Computer Vision and Pattern Recognition, 2017, pp. 1944–1952

  31. [31]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021

  32. [32]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014