pith. machine review for the scientific record.

arxiv: 2605.08156 · v1 · submitted 2026-05-04 · 💻 cs.CV · cs.AI


LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:33 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords zero-shot recognition · visual-text alignment · object-centric focus · language-guided refinement · adaptive candidate selection · prediction loop · fine-grained classification · distribution shift

The pith

LAGO improves zero-shot visual-text alignment by adaptively focusing on object regions with confidence-controlled language guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Zero-shot recognition in fine-grained cases often fails when matching an entire image to class descriptions because key evidence is in parts or textures, yet prior localized methods rely on large sets of random crops that raise cost and introduce redundant or weak candidates. LAGO starts with class-agnostic object-centric discovery to produce a stable set of initial regions, then applies language-guided refinement whose strength is scaled by the model's intermediate confidence to avoid the prediction loop in which early errors bias later steps and compound. It also fuses object, contextual, and full-image signals through dual-channel aggregation. A reader would care because the result is higher accuracy on both standard benchmarks and shifted distributions while using far fewer regions at inference.
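
To make the moving parts concrete, here is a minimal sketch of how such a pipeline could be wired together. It assumes a CLIP-like scorer and a class-agnostic region proposer supplied by the caller; every name below (propose_regions, score_against_texts, the linear confidence gate, the 50/50 channel fusion) is a hypothetical placeholder for illustration, not the authors' implementation.

```python
# Minimal sketch of a LAGO-style inference step (illustrative only).
# `propose_regions(image)` -> list of crops; `score_against_texts(x, texts)` ->
# np.ndarray of per-class similarity scores for a crop or the full image.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lago_like_predict(image, class_texts, propose_regions, score_against_texts):
    # Stage 1: class-agnostic, object-centric candidate discovery.
    regions = propose_regions(image)
    region_scores = np.stack([score_against_texts(r, class_texts) for r in regions])

    # Intermediate ensemble prediction and its confidence.
    probs = softmax(region_scores.mean(axis=0))
    confidence = float(probs.max())
    top_class = int(probs.argmax())

    # Stage 2: adaptive language-guided refinement. Regions that agree with the
    # current top prediction are up-weighted, but only in proportion to the
    # intermediate confidence, so an uncertain early guess cannot dominate
    # later selection (the "prediction loop" safeguard).
    text_affinity = softmax(region_scores[:, top_class])
    weights = (1 - confidence) / len(regions) + confidence * text_affinity

    # Dual-channel aggregation: weighted object-level evidence fused with the
    # full-image score (a contextual channel would be fused the same way).
    object_channel = (weights[:, None] * region_scores).sum(axis=0)
    image_channel = score_against_texts(image, class_texts)
    final_scores = 0.5 * object_channel + 0.5 * image_channel
    return int(final_scores.argmax()), confidence
```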

Core claim

LAGO first performs class-agnostic object-centric candidate discovery to obtain a stable visual initialization, and then applies adaptive language-guided refinement with the strength of semantic guidance controlled by intermediate confidence. It further combines object-level, contextual, and full-image evidence through an effective object-context dual-channel aggregation strategy, achieving state-of-the-art performance on standard zero-shot benchmarks and challenging distribution-shift settings while requiring substantially fewer candidate regions at inference time.

What carries the argument

Class-agnostic object-centric candidate discovery followed by confidence-modulated adaptive language-guided refinement and object-context dual-channel aggregation.

Load-bearing premise

Class-agnostic object-centric candidate discovery yields a stable visual initialization and modulating semantic guidance strength by intermediate confidence effectively mitigates the prediction loop error amplification.
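
Read as an equation, the premise amounts to a gate on how much the text prototype is allowed to steer the region scores. The linear form below is an assumption made for illustration only; the paper's Section 3 defines the actual schedule.

```latex
% Hypothetical confidence gate on semantic guidance (illustrative form only):
% c is the intermediate confidence, s_vis the visual-only score of a region,
% s_text its language-guided score.
\[
  s_{\mathrm{refined}} \;=\; \bigl(1-\lambda(c)\bigr)\, s_{\mathrm{vis}} \;+\; \lambda(c)\, s_{\mathrm{text}},
  \qquad \lambda \ \text{non-decreasing},\quad \lambda(0)\approx 0 .
\]
```

When c is low, the visual initialization dominates and an early wrong guess gets little leverage over later region selection; when c is high, the text prototype is allowed to sharpen the focus. Whether the actual schedule is linear, thresholded, or learned is left to the paper.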

What would settle it

An experiment that disables the confidence-based modulation on fine-grained zero-shot tasks and measures whether accuracy falls due to amplified errors from initial inaccurate predictions, or that compares region counts needed to reach prior methods' accuracy levels.
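
A minimal harness for that experiment could look like the sketch below; the two model variants, the dataset iterable, and the budget grid are all assumed interfaces for illustration, not artifacts from the paper.

```python
# Hypothetical ablation harness: compare confidence-modulated guidance against
# a fixed-strength variant, and find the region budget each needs to match a
# reference accuracy (e.g., a prior method's reported number).
# `dataset` is a list of (image, label) pairs; each model maps
# (image, num_regions) -> (predicted_class, confidence).
def accuracy(model, dataset, num_regions):
    correct = sum(int(model(image, num_regions=num_regions)[0] == label)
                  for image, label in dataset)
    return correct / len(dataset)

def settle_it(adaptive_model, fixed_model, dataset, reference_acc,
              budgets=(4, 8, 16, 32, 64)):
    # (1) Accuracy gap at a fixed budget: does removing the modulation hurt?
    gap = accuracy(adaptive_model, dataset, 16) - accuracy(fixed_model, dataset, 16)
    # (2) Smallest budget at which each variant reaches the reference accuracy.
    needed = {
        name: next((b for b in budgets
                    if accuracy(model, dataset, b) >= reference_acc), None)
        for name, model in (("adaptive", adaptive_model), ("fixed", fixed_model))
    }
    return gap, needed
```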

Figures

Figures reproduced from arXiv: 2605.08156 by Junyi Hu, Lei Zhang, Qiji Zhou, Yue Zhang.

Figure 1. Visual-text alignment for zero-shot label selection. Given an image and a candidate class (e.g., golden retriever), the model first derives fine-grained class descriptions (center), then searches for visual regions that closely correspond to these semantic cues, and finally decides whether the image matches the label. Compared with WCA (left) [6], which relies on random crop sampling, LAGO (right) discover…
Figure 2. Overview of LAGO. The pipeline consists of preprocessing (§3.1), visual-only diverse search (§3.2), ensemble prediction & confidence and adaptive text-guided refinement (§3.3), and merge & calculate (§3.4). Further implementation details of our method are provided in Appendix A.
Figure 3. Candidate-region budget analysis of LAGO. Efficiency analysis on ImageNet-V2, showing that LAGO uses candidate regions more effectively than WCA as the budget increases.
Figure 4. Stage-wise visualization. Each example shows the image, Stage 1/Stage 2 top crops, candidate regions, and confidence/text-similarity changes. Stage 1 selects object-centric regions, whereas Stage 2 refines them toward class-relevant evidence aligned with the ground-truth text prototype.
Figure 5. Component ablations of LAGO. Ablation results showing improved prediction quality through region discovery and refinement. Values denote gains of Full LAGO over each variant.
Figure 6. Stage-wise visualization on Oxford Pets. Each example shows the image, top crops from Stage 1 and Stage 2, their candidate regions, and statistics for confidence and text-similarity changes. Stage 1 selects object-centric regions, while Stage 2 further refines them toward more semantically informative, class-relevant evidence, increasing alignment with the ground-truth text prototype.
Figure 7. Confidence-aware corrections and failure analysis under shift. Left: adaptive-corrected samples across confidence buckets. Right: ImageNet-R failure distribution under natural shift.
Figure 8. Stage-wise visualization across ImageNet-V2, Food101, and CUB. Representative examples from multiple datasets show a consistent pattern: Stage 1 identifies visually salient and object-centric regions, whereas Stage 2 shifts the selected crops toward regions that are more discriminative and semantically relevant for the predicted class. This qualitative trend suggests that the two-stage design consistently …
Figure 9. Representative cases corrected by adaptive guidance. Each example shows an image for which the adaptive strategy predicts the correct class while the fixed-guidance variant fails at inference time. The examples are grouped by intermediate-confidence bucket, illustrating that adaptive refinement is especially beneficial in many low-confidence cases, where aggressive fixed semantic guidance is more likely to…
Figure 10. Representative semantic mismatch failures on ImageNet-R. The selected regions are locally plausible, but the text-side semantics emphasize misleading cues from the predicted text prototype under distribution shift conditions that do not support the ground-truth category.
Figure 11. Representative cluttered or poorly localized failures on ImageNet-R. The proposal set fails to isolate the truly discriminative region, leading the model to attend to incomplete or irrelevant evidence during inference under distribution shift, weakening later semantic refinement.
Figure 12. Representative visually ambiguous failures on ImageNet-R. Even after localized refinement, visually similar categories remain difficult to distinguish because they share highly overlapping local and semantically similar attributes under natural distribution shift.
Figure 13. Representative failure cases categorized as "other" on ImageNet-R. These examples do not fall cleanly into a single dominant error type, but still illustrate challenging cases for localized visual-text alignment under natural distribution shift with mixed semantic and localization errors.
Figure 14. Stage 2-only failure case on an underwater scene. The ground-truth class is brain coral. Without class-agnostic visual initialization, Stage 2-only refinement produces diffuse evidence across related underwater categories. The ground-truth class is not separated after refinement, illustrating how unreliable early semantics can bias region selection and reinforce ambiguous predictions.
Figure 15. Stage 2-only failure case on a fine-grained pet image. The ground-truth class is Sealyham terrier. Without stable class-agnostic initialization, Stage 2-only refinement remains diffuse over related dog categories and confounding classes. This shows that applying semantic refinement from the start can overcommit to unreliable early evidence rather than isolate the discriminative region.
Figure 16. Qualitative example of crop reweighting on Oxford Pets (basset hound). LAGO selectively suppresses low-weight crops whose evidence is weak or less object-focused, while assigning larger weights to crops whose evidence is more concentrated around the ground-truth class. As a result, the prediction distribution after reweighting becomes more favorable to the correct class.
Figure 17. Qualitative example of crop reweighting on Oxford Pets (leonberger). Stage-wise reweighting suppresses weakly informative crop evidence and emphasizes regions whose predictions are more aligned with the target class, leading to a more concentrated final distribution.
Figure 18. Qualitative example of crop reweighting on Oxford Pets (wheaten terrier). LAGO selectively downweights dispersed crop-level evidence and assigns larger weights to object-focused crops, improving the concentration of aggregated evidence around the ground-truth class.
Figure 19. Qualitative example of crop reweighting on Food101 (croque madame). LAGO suppresses less relevant crop evidence and emphasizes crops that better capture the visually and semantically discriminative food structure, leading to a more reliable prediction after aggregation.
Figure 20. Qualitative example of crop reweighting on ImageNet-V2 (mosquito net). The weighted crop evidence after LAGO is more concentrated around the ground-truth class, illustrating how the method suppresses noisy crops and emphasizes semantically informative regions.
Figure 21. Qualitative example of crop reweighting on ImageNet-R (cowboy hat). In this stylized example, LAGO suppresses weakly informative crops and strengthens crop evidence that is more semantically consistent with the target object under natural distribution shift, leading to a clearer separation of the ground-truth class in the final prediction distribution.
Figure 22. Dataset-wise relative preference distribution over Stage 2, tie, and Stage 1. Stage 2 is strongest on ImageNet-R and Oxford Pets, followed by ImageNet and CUB, while DTD shows many ties and remains favorable to Stage 2 among non-tie responses under the human evaluation protocol.
Figure 23. Dataset-wise absolute Stage 2 alignment distribution over Yes, Partially, and No. The strict alignment rate is highest on CUB, followed by DTD and ImageNet-R, while ImageNet and Oxford Pets are lower under the strict criterion in the human evaluation setting used here.
Figure 24. Item-level support for Stage 2 preference across the four evaluators. All four evaluators prefer Stage 2 on 32/100 samples, a strict majority prefer Stage 2 on 57/100 samples, at least half prefer Stage 2 on 78/100 samples, and at least one evaluator prefers Stage 2 on 96/100 samples under the human evaluation protocol.
Figure 25. Stage 2 alignment distribution conditioned on whether the final prediction is correct. In this subset, human-perceived path plausibility does not simply track final decision correctness, especially in fine-grained or semantically ambiguous recognition settings.
Figure 26. Representative human evaluation examples. Each row shows the image, Stage 1 regions, and Stage 2 regions, together with the ground-truth label, model prediction, and evaluator votes.
Original abstract

Zero-shot recognition aims to classify an image by selecting the most compatible label description from a set of candidate classes without any task-specific supervision. In fine-grained settings, however, the relevant evidence often lies in localized parts, attributes, or textures rather than in the full image, making whole-image alignment suboptimal. Recent localized visual-text alignment methods address this by comparing class descriptions with multiple image regions, but they typically rely on large sets of random or redundant crops, increasing inference cost and introducing many highly redundant or weakly relevant candidates. Moreover, introducing semantic guidance too early can create an error-amplifying feedback process in which inaccurate intermediate predictions bias later localization and reinforce subsequent mistakes; we refer to this failure mode as the prediction loop. We propose LAGO (LAnguage-Guided adaptive Object-region focus), a framework for efficient and robust zero-shot localized visual-text alignment. LAGO first performs class-agnostic object-centric candidate discovery to obtain a stable visual initialization, and then applies adaptive language-guided refinement with the strength of semantic guidance controlled by intermediate confidence. It further combines object-level, contextual, and full-image evidence through an effective object-context dual-channel aggregation strategy. Extensive experiments show that LAGO consistently achieves state-of-the-art performance on standard zero-shot benchmarks and challenging distribution-shift settings, while requiring substantially fewer candidate regions at inference time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes LAGO, a framework for zero-shot visual-text alignment. It performs class-agnostic object-centric candidate discovery to obtain stable visual initialization, then applies adaptive language-guided refinement where the strength of semantic guidance is modulated by intermediate confidence to avoid an error-amplifying 'prediction loop.' Object-level, contextual, and full-image evidence are combined via an object-context dual-channel aggregation strategy. The authors claim that LAGO achieves state-of-the-art performance on standard zero-shot benchmarks and distribution-shift settings while using substantially fewer candidate regions at inference time.

Significance. If the empirical results and the effectiveness of the proposed safeguards hold, LAGO would offer a practical advance in efficient localized zero-shot recognition by lowering inference cost and addressing a documented failure mode in iterative alignment. The design choices—object-centric initialization plus confidence-controlled guidance—are well-motivated extensions of prior localized visual-text methods and could influence subsequent work on fine-grained and robust zero-shot tasks.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the central claim that confidence-modulated guidance reliably prevents prediction-loop error amplification is load-bearing yet unsupported by direct evidence. No measurement of loop failure rates (fraction of cases in which an early low-confidence error propagates) or ablation that isolates the modulation component on distribution-shift data is reported. If the class-agnostic candidates already contain the correct region at high frequency, the modulation may be incidental rather than necessary.
  2. [§3] §3 (Method): the description of adaptive refinement and dual-channel aggregation is clear, but the paper must quantify the stability of the initial class-agnostic object-centric discovery (e.g., recall of the correct region in the candidate set) across the evaluated datasets and distribution shifts to substantiate that it provides a reliable starting point.
minor comments (2)
  1. [Figure 1] Figure 1: the schematic of the prediction loop would benefit from an annotated example showing how an early incorrect region selection leads to reinforced error in subsequent steps.
  2. [Tables 1-2] Table 1 and Table 2: report the exact number of candidate regions used by each baseline so that the efficiency claim can be directly compared.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential practical advance offered by LAGO. We address each of the major comments point by point below. We will revise the manuscript to include the requested quantifications and analyses.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the central claim that confidence-modulated guidance reliably prevents prediction-loop error amplification is load-bearing yet unsupported by direct evidence. No measurement of loop failure rates (fraction of cases in which an early low-confidence error propagates) or ablation that isolates the modulation component on distribution-shift data is reported. If the class-agnostic candidates already contain the correct region at high frequency, the modulation may be incidental rather than necessary.

    Authors: We acknowledge that the current manuscript does not include direct measurements of prediction-loop failure rates or an ablation isolating the confidence modulation specifically on distribution-shift data. The empirical results demonstrate consistent state-of-the-art performance, but to directly address the concern that the modulation may be incidental, we will add in the revised version: (1) an analysis of loop failure rates by tracking cases where low-confidence early predictions lead to errors, and (2) an ablation study comparing LAGO with and without the modulation component on the distribution-shift benchmarks. This will provide the requested direct evidence. revision: yes

  2. Referee: [§3] §3 (Method): the description of adaptive refinement and dual-channel aggregation is clear, but the paper must quantify the stability of the initial class-agnostic object-centric discovery (e.g., recall of the correct region in the candidate set) across the evaluated datasets and distribution shifts to substantiate that it provides a reliable starting point.

    Authors: We agree that quantifying the stability of the class-agnostic object-centric candidate discovery is necessary to support its role as a reliable initialization. Although the method section describes the process, we did not report recall metrics for the correct region in the candidate sets. In the revised manuscript, we will add these quantifications, reporting the recall of the ground-truth region within the discovered candidates for each standard benchmark and distribution-shift setting. revision: yes
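
For reference, the recall the authors commit to reporting could be computed along the lines below; the IoU threshold of 0.5, the (x1, y1, x2, y2) box format, and the propose_regions interface are assumptions for illustration, not the paper's protocol.

```python
# Hypothetical recall of the ground-truth region within the class-agnostic
# candidate set: the fraction of samples whose annotated box is matched by at
# least one candidate at IoU >= 0.5. Boxes are (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def candidate_recall(samples, propose_regions, iou_thresh=0.5):
    """samples: list of (image, ground_truth_box) pairs."""
    hits = sum(
        any(iou(box, gt) >= iou_thresh for box in propose_regions(image))
        for image, gt in samples
    )
    return hits / len(samples)
```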

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper proposes LAGO as a new framework combining class-agnostic object-centric candidate discovery with adaptive confidence-modulated language guidance and dual-channel aggregation. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that reduce the claimed performance gains to inputs by construction. The method is described as building on existing techniques with explicit novel adaptations for efficiency and robustness. The skeptic concern addresses empirical validation gaps rather than logical circularity in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The abstract provides no explicit free parameters or detailed axioms. The approach assumes standard components of computer vision pipelines.

axioms (1)
  • domain assumption: Pre-trained object detectors and vision-language models provide reliable initial features for zero-shot tasks.
    The method relies on these for candidate discovery and alignment.
invented entities (1)
  • Prediction loop · no independent evidence
    purpose: To describe the error-amplifying feedback in early semantic guidance.
    Introduced as a named failure mode in the abstract.

pith-pipeline@v0.9.0 · 5548 in / 1350 out tokens · 54521 ms · 2026-05-12T01:33:48.333709+00:00 · methodology


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 2 internal anchors

  1. [1]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the 38th International Conference on Machine Learning, 2021

  2. [2]

    Learning to prompt for vision-language models.International Journal of Computer Vision, 2022

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models.International Journal of Computer Vision, 2022

  3. [3]

    Conditional prompt learning for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  4. [4]

    Visual classification via description from large language models

    Sachit Menon and Carl Vondrick. Visual classification via description from large language models. In International Conference on Learning Representations, 2022

  5. [5]

    What does a platypus look like? generating customized prompts for zero-shot image classification

    Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? generating customized prompts for zero-shot image classification. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15691–15701, 2023

  6. [6]

    Visual-text cross alignment: Refining the similarity score in vision-language models

    Jinhao Li, Haopeng Li, Sarah Erfani, Lei Feng, James Bailey, and Feng Liu. Visual-text cross alignment: Refining the similarity score in vision-language models. InProceedings of the 41st International Conference on Machine Learning, 2024

  7. [7]

    Waffling around for performance: Visual classification with random words and broad concepts

    Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia Schmid, and Zeynep Akata. Waffling around for performance: Visual classification with random words and broad concepts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  8. [8]

    Let's roll a bifta: Bi-refinement for fine-grained text-visual alignment in vision-language models.Transactions on Machine Learning Research, 2026

    Yuhao Sun, Chengyi Cai, Jiacheng Zhang, Zesheng Ye, Xingliang Yuan, and Feng Liu. Let's roll a bifta: Bi-refinement for fine-grained text-visual alignment in vision-language models.Transactions on Machine Learning Research, 2026

  9. [9]

    FG-CLIP: Fine-grained visual and textual alignment

    Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, and Yuhui Yin. FG-CLIP: Fine-grained visual and textual alignment.arXiv preprint arXiv:2505.05071, 2025

  10. [10]

    From local details to global context: Advancing vision-language models with attention-based selection

    Lincan Cai, Jingxuan Kang, Shuang Li, Wenxuan Ma, Binhui Xie, Zhida Qin, and Jian Liang. From local details to global context: Advancing vision-language models with attention-based selection. InProceedings of the 42nd International Conference on Machine Learning (ICML 2025), volume 267 ofProceedings of Machine Learning Research, pages 6229–6242. PMLR, 2025

  11. [11]

    Unified vision and language prompt learning.arXiv preprint arXiv:2210.07225, 2022

    Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Unified vision and language prompt learning.arXiv preprint arXiv:2210.07225, 2022

  12. [12]

    Prompt distribution learning

    Yu Lu, Xiao Liu, Yuxin Zhang, Xiao Liu, and Xinmei Tian. Prompt distribution learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5206–5215, 2022

  13. [13]

    Prompt-aligned gradient for prompt tuning

    Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15659– 15669, 2023

  14. [14]

    Visual-language prompt tuning with knowledge-guided context optimization

    Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6757–6767, 2023

  15. [15]

    Maple: Multi-modal prompt learning

    Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning.arXiv preprint arXiv:2210.03117, 2023

  16. [16]

    Self-regulating prompts: Foundational model adaptation without forgetting

    Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  17. [17]

    Tcp: Textual-based class-aware prompt tuning for visual-language model

    Hongming Yao, Aixi Zhang, Xiaoshan Xu, Sicong Liu, Saining Xie, and Qingming Lu. Tcp: Textual-based class-aware prompt tuning for visual-language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  18. [18]

    Learning to prompt with text only supervision for vision-language models

    Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc Van Gool, and Federico Tombari. Learning to prompt with text only supervision for vision-language models.arXiv preprint arXiv:2401.02418, 2024

  19. [19]

    Contrastive localized language-image pre-training

    Hong-You Chen, Zhengfeng Lai, Haotian Zhang, Xinze Wang, Marcin Eichner, Keen You, Meng Cao, Bowen Zhang, Yinfei Yang, and Zhe Gan. Contrastive localized language-image pre-training. InProceed- ings of the 42nd International Conference on Machine Learning (ICML 2025), volume 267 ofProceedings of Machine Learning Research, pages 8386–8402. PMLR, 2025

  20. [20]

    Flair: Vlm with fine-grained language-informed image representations

    Rui Xiao, Sanghwan Kim, Mariana-Iuliana Georgescu, Zeynep Akata, and Stephan Alaniz. Flair: Vlm with fine-grained language-informed image representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), pages 24884–24894, 2025

  21. [21]

    Test-time prompt tuning for zero-shot generalization in vision-language models

    Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. InAdvances in Neural Information Processing Systems, 2022

  22. [22]

    Diverse data augmentation with diffusions for effective test-time prompt tuning

    Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, and Wangmeng Zuo. Diverse data augmentation with diffusions for effective test-time prompt tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  23. [23]

    C-TPT: Calibrated test-time prompt tuning for vision-language models via text feature dispersion

    Hee Suk Yoon, Eunseop Yoon, Joshua Tian Jin Tee, Mark A. Hasegawa-Johnson, Yingzhen Li, and Chang D. Yoo. C-TPT: Calibrated test-time prompt tuning for vision-language models via text feature dispersion. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=jzzEHTBFOT

  24. [24]

    Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization

    Jameel Hassan Abdul Samadh, Hanan Gani, Noor Hazim Hussein, Muhammad Uzair Khattak, Muzammal Naseer, Salman Khan, and Fahad Shahbaz Khan. Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization. InAdvances in Neural Information Processing Systems, 2023

  25. [25]

    Clipartt: Adaptation of clip to new domains at test time

    Gustavo Adolfo Vargas Hakim, David Osowiechi, Mehrdad Noori, Milad Cheraghalikhani, Ali Bahri, Moslem Yazdanpanah, Ismail Ben Ayed, and Christian Desrosiers. Clipartt: Adaptation of clip to new domains at test time. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

  26. [26]

    On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning?

    Max Zanella, Ismail Ben Ayed, and Jose Dolz. On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  27. [27]

    Pseudo-labeling and confirmation bias in deep semi-supervised learning

    Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. InProceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020

  28. [28]

    Debiased self-training for semi-supervised learning

    Baixu Chen, Junguang Jiang, Ximei Wang, Pengfei Wan, Jianmin Wang, and Mingsheng Long. Debiased self-training for semi-supervised learning. In Advances in Neural Information Processing Systems, volume 35, pages 32424–32437, 2022

  29. [29]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023

  30. [30]

    Towards visual grounding: A survey

    Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan, Yaowei Wang, and Changsheng Xu. Towards visual grounding: A survey.arXiv preprint arXiv:2412.20206, 2024

  31. [31]

    V*: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  32. [32]

    Image-of-thought prompting for visual reasoning refinement in multimodal large language models

    Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, and Yue Zhang. Image-of-thought prompting for visual reasoning refinement in multimodal large language models. arXiv preprint arXiv:2405.13872, 2024

  33. [33]

    Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via RL

    Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, and Qing Li. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via RL. arXiv preprint arXiv:2505.15436, 2025

  34. [34]

    VGR: Visual Grounded Reasoning

    Jiacong Wang, Zijiang Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, and Jun Xiao. Vgr: Visual grounded reasoning.arXiv preprint arXiv:2506.11991, 2025

  35. [35]

    Mattnet: Modular attention network for referring expression comprehension

    Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018

  36. [36]

    TransVG: End-to-end visual grounding with transformers

    Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. TransVG: End-to-end visual grounding with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021

  37. [37]

    Grounded language-image pre-training

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  38. [38]

    Zero-shot referring image segmentation with global- local context features

    Seonghoon Yu, Paul Hongsuck Seo, and Jeany Son. Zero-shot referring image segmentation with global- local context features. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), pages 19456–19465, 2023

  39. [39]

    Zegclip: Towards adapting clip for zero-shot semantic segmentation

    Ziqin Zhou, Yinjie Lei, Bowen Zhang, Lingqiao Liu, and Yifan Liu. Zegclip: Towards adapting clip for zero-shot semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), pages 11175–11185, 2023

  40. [40]

    Fast segment anything

    Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. Fast segment anything.arXiv preprint arXiv:2306.12156, 2023

  41. [41]

    Edge boxes: Locating object proposals from edges

    C. Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. InProceedings of the European Conference on Computer Vision (ECCV), pages 391–405, Cham, September 2014. Springer International Publishing

  42. [42]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009

  43. [43]

    Caltech-ucsd birds 200

    Peter Welinder, Steve Branson, Takeo Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-ucsd birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010

  44. [44]

    Cats and dogs

    Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505, 2012

  45. [45]

    Describing textures in the wild

    Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sami Mohamed, and Andrea Vedaldi. Describing textures in the wild. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014

  46. [46]

    Food-101 – mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. InEuropean Conference on Computer Vision, pages 446–461, 2014

  47. [47]

    Places: A 10 million image database for scene recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2018

    Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2018

  48. [48]

    Do imagenet classifiers generalize to imagenet?

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? InProceedings of the 36th International Conference on Machine Learning, pages 5389–5400, 2019

  49. [49]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Sanjay Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340...

  50. [50]

    Learning robust global representations by penalizing local predictive power

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P. Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, volume 32, 2019

  51. [51]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021

  52. [52]

    a set of cached proposal-guided crops,

  53. [53]

    If the number of valid crops is smaller than the fixed tensor length, we pad the remaining slots with zero crops and use a validity mask to exclude them from subsequent aggregation

    optional random completion crops if the target number of views has not been reached. If the number of valid crops is smaller than the fixed tensor length, we pad the remaining slots with zero crops and use a validity mask to exclude them from subsequent aggregation. This fixed-shape design significantly simplifies batching and implementation at inference ...