Bi-CoG: Bi-Consistency-Guided Self-Training for Vision-Language Models

Lan-Zhe Guo; Rui Zhu; Song-Lin Lv; Zi-Kang Wang

arxiv: 2510.20477 · v2 · submitted 2025-10-23 · 💻 cs.LG

Pith reviewed 2026-05-18 04:55 UTC · model grok-4.3

keywords: semi-supervised learning · vision-language models · self-training · pseudo-labeling · consistency · fine-tuning · bias reduction

The pith

Bi-CoG produces higher-quality pseudo-labels for vision-language models by checking consistency both across models and inside a single model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to improve semi-supervised fine-tuning of pre-trained vision-language models when labeled data is limited. Prior approaches often produce biased pseudo-labels or depend heavily on fixed confidence thresholds that require careful tuning. Bi-CoG instead applies consistency checks in two directions at once and uses an error-aware rule to decide which labels to trust at each step. If the method succeeds, it supplies labels that are both accurate and less biased, allowing existing fine-tuning pipelines to reach higher accuracy on new tasks. The authors support this with experiments across fourteen datasets showing steady gains over baseline techniques.

Core claim

Bi-CoG assigns high-quality and low-bias pseudo-labels by simultaneously exploiting inter-model and intra-model consistency, along with an error-aware dynamic pseudo-label assignment strategy.

What carries the argument

Bi-Consistency-Guided Self-Training, which pairs agreement checks between separate models with agreement checks inside one model to steer pseudo-label selection.
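A minimal sketch of the two agreement checks described above, assuming hard class predictions are already available. The function name, the data layout (`preds[m][i]` holding one predicted class per augmented view), and the strict unanimity rule are illustrative assumptions; the paper's actual selection rule, including its error-aware dynamic assignment, is not reproduced here.

```python
def bi_consistent_pseudo_labels(preds):
    """Keep samples that pass both agreement checks (hypothetical helper).

    preds[m][i] is the list of class ids that model m predicts for sample i,
    one entry per augmented view of that sample.
    Returns {sample_index: pseudo_label}.
    """
    n_samples = len(preds[0])
    selected = {}
    for i in range(n_samples):
        per_model_labels = []
        for model_preds in preds:
            views = model_preds[i]
            # Intra-model check: one model must give the same label on every view.
            if len(set(views)) != 1:
                break
            per_model_labels.append(views[0])
        else:
            # Inter-model check: every model must agree with the others.
            if len(set(per_model_labels)) == 1:
                selected[i] = per_model_labels[0]
    return selected


# Two hypothetical models, three samples, two views per sample.
model_a = [[0, 0], [1, 2], [2, 2]]
model_b = [[0, 0], [1, 1], [1, 1]]
print(bi_consistent_pseudo_labels([model_a, model_b]))  # {0: 0}
```

Sample 1 is dropped because `model_a` disagrees with itself across views, and sample 2 is dropped because the two models disagree with each other. Unanimity is the strictest possible rule; a softer vote or a score-based rule would fit the same interface.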

If this is right

Existing consistency-based or threshold-based methods for VLM fine-tuning receive consistent accuracy gains.
Model bias in the generated labels is reduced compared with single-direction consistency.
Dependence on pre-chosen confidence thresholds decreases.
The approach functions as a plug-in addition to current self-training pipelines.
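One way to make the bias claim above measurable is to compare the class histogram of the selected pseudo-labels with a reference histogram. The sketch below uses total variation distance as that proxy; the function name and the choice of metric are assumptions for illustration, not a measure taken from the paper.

```python
from collections import Counter


def class_distribution_gap(pseudo_labels, reference_labels, num_classes):
    """Total variation distance between two class histograms.

    0.0 means identical class proportions, 1.0 means disjoint support.
    Hypothetical proxy for class-level bias in a pseudo-label set.
    """
    if not pseudo_labels or not reference_labels:
        raise ValueError("both label lists must be non-empty")

    def freqs(labels):
        counts = Counter(labels)
        return [counts.get(c, 0) / len(labels) for c in range(num_classes)]

    p = freqs(pseudo_labels)
    q = freqs(reference_labels)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


reference = [0, 0, 1, 1, 2, 2]   # balanced reference distribution
skewed = [0, 0, 0, 0, 1, 2]      # pseudo-labels over-select class 0
balanced = [0, 1, 2, 0, 1, 2]

print(class_distribution_gap(skewed, reference, 3))
print(class_distribution_gap(balanced, reference, 3))
```

A smaller gap for one selection method than for another would be consistent with lower class-level bias, though it says nothing about whether the individual labels are correct.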

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dual-consistency idea could be tried on other pre-trained models that process only text or only images.
Combining the dynamic assignment rule with different training losses might further stabilize learning.
The method could be tested on much larger unlabeled collections to check whether gains hold at scale.

Load-bearing premise

That agreement across models and inside one model will reliably mark correct pseudo-labels without creating fresh forms of bias or new tuning problems.

What would settle it

Measure whether the pseudo-labels chosen by Bi-CoG match known ground-truth labels at a higher rate than labels from single-consistency or threshold methods on a dataset where true answers are available.
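The check described above can be sketched as a small evaluation helper, assuming each selection method returns a mapping from sample index to pseudo-label. The names and the toy data are hypothetical; reporting coverage alongside accuracy is a design choice added here so that a method cannot look better merely by labeling fewer samples.

```python
def pseudo_label_accuracy(selected, ground_truth):
    """Return (accuracy on selected samples, fraction of samples selected).

    selected: {sample_index: pseudo_label} from any selection method.
    ground_truth: list of true labels indexed by sample.
    """
    if not selected:
        return 0.0, 0.0
    correct = sum(1 for i, y in selected.items() if ground_truth[i] == y)
    return correct / len(selected), len(selected) / len(ground_truth)


truth = [0, 1, 2, 1]
method_a = {0: 0, 1: 1, 2: 1, 3: 1}  # labels every sample, one mistake
method_b = {0: 0, 3: 1}              # labels half the samples, no mistakes

print(pseudo_label_accuracy(method_a, truth))  # (0.75, 1.0)
print(pseudo_label_accuracy(method_b, truth))  # (1.0, 0.5)
```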

Figures

Figures reproduced from arXiv: 2510.20477 by Lan-Zhe Guo, Rui Zhu, Song-Lin Lv, Zi-Kang Wang.

**Figure 1.** Comparison of existing pseudo-labeling paradigms and our proposed Bi-CoG. (a) Majority voting from multiple VLMs; (b) top-k selection based on predicted logits; (c) Bi-CoG dynamically generates accurate and unbiased pseudo-labels based on bi-consistency and model error.

**Figure 2.** The overall framework of Bi-CoG’s pseudo-label selection. During the self-training phase, Bi-CoG employs inter…

**Figure 3.** (a) Pseudo-label accuracy over training iterations…

**Figure 4.** Comparison between the estimated and true er…

Original abstract

Exploiting unlabeled data through semi-supervised learning (SSL) or leveraging pre-trained models via fine-tuning are two prevailing paradigms for addressing label-scarce scenarios. Recently, growing attention has been given to combining fine-tuning of pre-trained vision-language models (VLMs) with SSL, forming the emerging paradigm of semi-supervised fine-tuning. However, existing methods often suffer from model bias and hyperparameter sensitivity, due to reliance on prediction consistency or pre-defined confidence thresholds. To address these limitations, we propose a simple yet effective plug-and-play methodology named Bi-Consistency-Guided Self-Training (Bi-CoG), which assigns high-quality and low-bias pseudo-labels, by simultaneously exploiting inter-model and intra-model consistency, along with an error-aware dynamic pseudo-label assignment strategy. Both theoretical analysis and extensive experiments over 14 datasets demonstrate the effectiveness of Bi-CoG, which consistently and significantly improves the performance of existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Bi-CoG adds inter-plus-intra consistency and dynamic error-aware assignment to VLM self-training and reports gains on 14 datasets, but the low-bias claim hinges on whether those consistencies actually break shared priors rather than reinforce them.

Bi-CoG is a plug-and-play self-training method for fine-tuning vision-language models under label scarcity. It generates pseudo-labels by checking consistency both across different models and within the same model, then uses an error-aware dynamic rule to decide which labels to keep or adjust. The paper claims this reduces model bias and hyperparameter sensitivity compared with plain consistency or fixed-threshold approaches, backed by theory and results on 14 datasets where it lifts several baselines.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Bi-CoG, a plug-and-play self-training method for semi-supervised fine-tuning of vision-language models. It claims to mitigate model bias and hyperparameter sensitivity in existing consistency- or threshold-based approaches by simultaneously exploiting inter-model and intra-model consistency together with an error-aware dynamic pseudo-label assignment strategy. The authors assert that this yields high-quality, low-bias pseudo-labels, supported by theoretical analysis and consistent, significant gains across 14 datasets when applied to existing methods.

Significance. If the empirical gains and theoretical support hold after detailed verification, Bi-CoG could offer a practical advance in combining SSL with VLM fine-tuning by reducing reliance on fixed thresholds and addressing confirmation bias through dual consistency signals. The plug-and-play design and reported robustness over many datasets would make the contribution broadly usable in label-scarce multimodal settings.

major comments (2)

[Abstract] The central claim that the error-aware dynamic pseudo-label assignment produces low-bias labels by breaking cycles of shared VLM priors is load-bearing, yet the abstract gives no indication of how the independent error signal is obtained or whether it depends on the same consistency statistics used for inter- and intra-model checks; this directly affects whether observed gains reflect genuinely lower bias or simply higher unlabeled-data utilization.
[Theoretical analysis] The manuscript asserts theoretical analysis supporting the low-bias property, but without the explicit derivations it remains unclear whether the analysis establishes that simultaneous inter- and intra-model consistency avoids reinforcing systematic VLM biases (e.g., object co-occurrence or captioning artifacts) rather than merely re-expressing the consistency metric.

minor comments (2)

[Abstract] A short quantitative statement of the magnitude of improvements or the range of the 14 datasets would help readers immediately gauge the empirical scope.
[Method] Ensure all hyperparameters introduced by the dynamic assignment strategy are explicitly listed and their sensitivity is ablated, as the introduction highlights hyperparameter sensitivity as a key limitation of prior work.
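The sensitivity ablation requested in the last comment could take the shape of a simple sweep. The sketch below sweeps a fixed confidence threshold, the kind of hyperparameter the paper attributes to prior work, and reports how many samples survive and how accurate they are at each setting. The function name and toy numbers are assumptions; any hyperparameter introduced by the dynamic assignment strategy would be swept in the same way.

```python
def sweep_confidence_threshold(confidences, predictions, ground_truth, thresholds):
    """For each threshold, return (threshold, samples kept, accuracy of kept samples).

    Accuracy is None when no sample passes the threshold.
    """
    rows = []
    for t in thresholds:
        kept = [i for i, c in enumerate(confidences) if c >= t]
        correct = sum(1 for i in kept if predictions[i] == ground_truth[i])
        accuracy = correct / len(kept) if kept else None
        rows.append((t, len(kept), accuracy))
    return rows


# Hypothetical model outputs on four unlabeled samples with known answers.
conf = [0.95, 0.80, 0.60, 0.55]
pred = [0, 1, 1, 2]
true = [0, 1, 2, 2]

for t, n_kept, acc in sweep_confidence_threshold(conf, pred, true, [0.5, 0.7, 0.9]):
    print(t, n_kept, acc)
```

A flat accuracy and coverage profile across the sweep would support a claim of low sensitivity; a steep one would indicate the tuning burden has only moved to a different parameter.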

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below and revised the manuscript to improve clarity where needed.

Referee: [Abstract] The central claim that the error-aware dynamic pseudo-label assignment produces low-bias labels by breaking cycles of shared VLM priors is load-bearing, yet the abstract gives no indication of how the independent error signal is obtained or whether it depends on the same consistency statistics used for inter- and intra-model checks; this directly affects whether observed gains reflect genuinely lower bias or simply higher unlabeled-data utilization.

Authors: We appreciate the referee highlighting the need for greater precision in the abstract. The error-aware component derives its signal from a dynamic, iteration-dependent error estimation that is computed independently of the inter-model and intra-model consistency statistics; this separation is what enables the method to interrupt reinforcement of shared VLM priors. We have revised the abstract to explicitly note the independent origin of the error signal and its role in bias mitigation, thereby clarifying that the reported gains arise from improved label quality rather than solely from greater unlabeled-data utilization. Revision: yes.

Referee: [Theoretical analysis] The manuscript asserts theoretical analysis supporting the low-bias property, but without the explicit derivations it remains unclear whether the analysis establishes that simultaneous inter- and intra-model consistency avoids reinforcing systematic VLM biases (e.g., object co-occurrence or captioning artifacts) rather than merely re-expressing the consistency metric.

Authors: We agree that the theoretical section would benefit from more explicit derivations. The analysis demonstrates that the joint enforcement of inter-model consistency (across distinct VLMs) and intra-model consistency (under input perturbations) supplies orthogonal constraints that penalize systematic biases such as object co-occurrence and captioning artifacts, rather than simply restating a single consistency measure. We have expanded the revised theoretical analysis section with the full step-by-step derivations to make this distinction rigorous and transparent. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity; Bi-CoG is a procedural recipe validated by external experiments

The paper introduces Bi-CoG as a plug-and-play self-training method that combines inter-model consistency, intra-model consistency, and an error-aware dynamic pseudo-label assignment strategy to generate pseudo-labels for semi-supervised fine-tuning of VLMs. The abstract and description contain no mathematical derivations, equations, or first-principles results that reduce a claimed prediction or uniqueness claim to a fitted parameter or self-citation by construction. The central effectiveness claim is supported by a theoretical analysis (whose details are not shown to be tautological) plus empirical results across 14 datasets, rendering the contribution externally falsifiable rather than internally self-referential. No load-bearing step matches the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that prediction consistency across models and within a model correlates with label correctness in the semi-supervised VLM setting; no free parameters or invented entities are explicitly introduced in the abstract description.

axioms (1)

Domain assumption: Inter-model and intra-model prediction consistency reliably signals low-bias pseudo-labels in vision-language model fine-tuning.
This premise underpins the claim that the proposed strategy produces higher-quality labels than prior consistency or threshold methods.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear


Bi-CoG leverages multiple VLMs ... inter-model consistency ... intra-model consistency ... error-aware dynamic pseudo-label assignment

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 2 internal anchors

[1] Describing Textures in the Wild. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 3606–3613. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L

[2] ImageNet: a large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. Fei-Fei, L.; Fergus, R.; and Perona, P

[3] Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories. In the 2024 IEEE Conference on Computer Vision and Pattern Recognition Workshops,

[4] Unsupervised prompt learning for vision-language models. arXiv preprint arXiv:2204.03649. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; and Duerig, T

[5] 3D object representations for fine-grained categorization. In the 2013 IEEE International Conference on Computer Vision Workshops, 554–561. Krizhevsky, A.; Hinton, G.; et al

[6] Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification. In Al-Onaizan, Y.; Bansal, M.; and Chen, Y.-N., eds., Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 14394–14410. Miami, Florida, USA: Association for Computational Linguistics. Lv, S.-L.; Chen, Y.-Y.; Zhou, Z.; Yang, M.; and Guo,...

[7] Fine-grained visual classification of aircraft. arXiv:1306.5151. Menghini, C.; Delworth, A.; and Bach, S

[8] Cats and dogs. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, 3498–3505. Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al

[9] UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402. Wang, A.; and Russakovsky, O

[10] Sun database: Large-scale scene recognition from abbey to zoo. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition, 3485–3492. Yang, X.; Song, Z.; King, I.; and Xu, Z

[11] Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, 493–510. Berlin, Heidelberg: Springer-Verlag. ISBN 978-3-031-19832-8. Zhou, K.; Yang, J.; Loy, C. C.; and Liu, Z. 2022a. Conditional prompt learning for vision...

[12] The Rich Get Richer: Disparate Impact of Semi-Supervised Learning. In International Conference on Learning Representations. Supplementary Material A. Detailed Theoretical Analysis A.1. Proof of Lemma 1 Proof. When the aggregated training data contains noisy samples, let F denote the hypothesis space, η the noise ratio in the training set (η < 0.5), ϵ > 0 the targ...

[13] padding 28 Strong Augmentations ColorJitter Randomly changes the brightness, contrast, saturation, and hue. brightness 0.4 contrast 0.4 saturation 0.1 hue 0.4 RandomGrayscale Converts the image to grayscale with a probability of p. p 0.2 GaussianBlur Applies Gaussian blur to the image with a probability of p. kernel size 21 p 0.5 Table 6: Details of Weak and Stro...
