TeD-Loc: Text Distillation for Weakly Supervised Object Localization

Alexis Guichemerre; Eric Granger; Marco Pedersoli; Shakeeb Murtaza; Soufiane Belharbi

arxiv: 2501.12632 · v2 · submitted 2025-01-22 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-23 05:08 UTC · model grok-4.3

keywords weakly supervised object localization · CLIP · text distillation · contrastive alignment · patch embeddings · foreground background separation · histopathology · vision-language models

The pith

TeD-Loc distills CLIP text embeddings into patch embeddings via contrastive alignment to produce foreground and background scores for weakly supervised localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Weakly supervised object localization requires identifying both class and spatial extent from image-level labels alone. Standard class activation mapping approaches tend to highlight only the most discriminative object parts rather than the full extent. TeD-Loc transfers semantic knowledge from global CLIP text embeddings to local image patch embeddings through contrastive alignment, generating direct localization scores. A localization-guided classification module then aggregates the scored foreground patches to perform both tasks jointly, while QR orthogonalization of class text embeddings sharpens discrimination among similar categories. The resulting model reports higher localization accuracy than prior methods that depend on conditional denoising and elaborate prompt engineering.

Core claim

TeD-Loc transfers knowledge from CLIP text embeddings to patch embeddings through contrastive alignment, thereby enabling patch-level foreground/background localization. A localization-guided classification module is also introduced that uses localization scores to aggregate foreground patch embeddings for joint classification and localization in a single model. In addition, a QR-based orthogonalization of class text embeddings is applied before distillation to improve discrimination for semantically similar classes.
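The QR step named in the core claim can be pictured with a short sketch. This is an illustration under assumptions, not the authors' code: the function name `orthogonalize_text_embeddings`, the use of NumPy, and the sign convention are choices made here, and the abstract does not say how classes are ordered or how the orthogonalized vectors are used afterwards.

```python
import numpy as np

def orthogonalize_text_embeddings(text):
    """Return class embeddings with orthonormal rows via reduced QR.

    text: (C, D) array, assumed to have C <= D and linearly independent rows.
    """
    # QR of the transpose: q has shape (D, C) with orthonormal columns.
    q, r = np.linalg.qr(text.T)
    # QR is unique only up to column signs; flip so that each new vector
    # keeps a positive dot product with its original embedding.
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return (q * signs).T  # back to (C, D)

# Toy usage with random stand-ins for CLIP text embeddings (hypothetical data).
rng = np.random.default_rng(0)
toy_text = rng.normal(size=(5, 16))
toy_ortho = orthogonalize_text_embeddings(toy_text)
```

One property worth noting: QR behaves like Gram-Schmidt, so the result depends on class order. The first class keeps its direction and each later class is projected away from the earlier ones.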

What carries the argument

Contrastive alignment between global CLIP text embeddings and local patch embeddings that produces foreground/background localization scores.
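A minimal sketch of what such an alignment could look like follows. It assumes a cross-entropy over temperature-scaled cosine similarities between patch embeddings and class text embeddings, applied only to patches pseudo-labelled as foreground. The function names, the temperature value of 0.07, and the foreground mask input are assumptions made for illustration; the abstract gives no loss formulation or sampling strategy.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def patch_text_logits(patches, text, temperature=0.07):
    """Cosine similarity of each patch (N, D) with each class text embedding (C, D)."""
    return l2_normalize(patches) @ l2_normalize(text).T / temperature

def alignment_loss(patches, text, label, fg_mask, temperature=0.07):
    """Cross-entropy pulling pseudo-foreground patches toward the text embedding
    of the image-level label and away from the other classes (assumed form)."""
    logits = patch_text_logits(patches, text, temperature)       # (N, C)
    logits = logits - logits.max(axis=1, keepdims=True)           # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    fg = np.flatnonzero(fg_mask)
    if fg.size == 0:
        return 0.0  # no foreground pseudo-labels, nothing to align
    return float(-log_prob[fg, label].mean())
```

In this form the loss is low when foreground patches already point toward the correct class text embedding and high when they point toward another class, which is the behaviour the alignment is meant to encourage.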

If this is right

Localization extends beyond the most discriminative regions to cover fuller object extent.
Classification and localization are trained and run jointly in one model without separate post-processing stages.
Top-1 Loc accuracy rises by roughly 5 percent on CUB and ILSVRC benchmarks.
PxAP rises by roughly 31 percent on histopathology benchmarks.
Inference runs more efficiently than methods that require conditional denoising and complex prompt learning.
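For readers unfamiliar with the Top-1 Loc metric cited above, the sketch below follows the conventional WSOL definition: a prediction counts as correct only if the predicted class matches the label and the predicted box overlaps the ground-truth box with IoU of at least 0.5. The function names and the `(x0, y0, x1, y1)` box layout are assumptions here, and the paper's exact evaluation protocol is not described on this page.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1) with x1 > x0 and y1 > y0."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def top1_loc(pred_classes, pred_boxes, gt_classes, gt_boxes, iou_thresh=0.5):
    """Fraction of images with both the right class and a sufficiently overlapping box."""
    hits = [
        pc == gc and box_iou(pb, gb) >= iou_thresh
        for pc, pb, gc, gb in zip(pred_classes, pred_boxes, gt_classes, gt_boxes)
    ]
    return sum(hits) / len(hits)
```

Because the metric couples classification and localization, a gain in Top-1 Loc alone does not show which of the two improved.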

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same alignment step could be tested on other vision-language models to check whether the localization benefit generalizes beyond CLIP.
QR orthogonalization may prove useful in any multi-class setting where category embeddings are close in embedding space.
Lower inference cost could enable real-time localization on edge devices that cannot run denoising-based alternatives.
The foreground aggregation idea might combine with other weak signals such as scribbles or points for hybrid supervision.

Load-bearing premise

Contrastive alignment between global text embeddings and local patch embeddings will produce reliable foreground/background scores that can be used directly for both localization and classification without further mechanisms.
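The second half of this premise, that the scores can drive classification directly, can be made concrete with a small sketch. It assumes score-weighted mean pooling of patch embeddings followed by cosine similarity with the class text embeddings. The function name and the pooling rule are hypothetical, and the paper's actual module may differ.

```python
import numpy as np

def localization_guided_logits(patches, fg_scores, text, eps=1e-12):
    """Pool patches (N, D) by foreground score (N,), then score against text (C, D)."""
    weights = fg_scores / (fg_scores.sum() + eps)       # normalized pooling weights
    pooled = weights @ patches                           # (D,) foreground summary
    pooled = pooled / (np.linalg.norm(pooled) + eps)
    text = text / (np.linalg.norm(text, axis=1, keepdims=True) + eps)
    return text @ pooled                                 # (C,) cosine similarities

# Toy usage: two patches look like class 0, two like class 1 (hypothetical data).
toy_text = np.eye(2, 4)
toy_patches = np.vstack([toy_text[0], toy_text[0], toy_text[1], toy_text[1]])
toy_pred = int(np.argmax(localization_guided_logits(
    toy_patches, np.array([0.9, 0.8, 0.05, 0.0]), toy_text)))
```

The sketch also shows why the premise is load-bearing: the predicted class follows whichever patches receive high foreground scores, so unreliable scores corrupt classification as well as localization.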

What would settle it

Run the contrastive alignment on a held-out set with pixel-level masks and check whether the resulting patch scores separate object from background better than a standard class activation mapping baseline. If they do not, the premise fails.
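Such a check could be scored with a pixel-level average precision, sketched below. This is a rank-based approximation written for illustration, not the PxAP implementation used in the paper. The function name is hypothetical, and ties between equal scores are broken by pixel order.

```python
import numpy as np

def pixel_average_precision(score_map, mask):
    """Average precision of per-pixel scores against a binary ground-truth mask."""
    scores = np.asarray(score_map, dtype=float).ravel()
    labels = np.asarray(mask, dtype=bool).ravel()
    order = np.argsort(-scores, kind="stable")      # highest score first
    hits = labels[order]
    tp = np.cumsum(hits)                             # true positives at each rank
    precision = tp / np.arange(1, hits.size + 1)
    # Mean of the precision values taken at the rank of every foreground pixel.
    return float(precision[hits].sum() / max(hits.sum(), 1))

# Toy comparison on a 2x2 image whose top row is the object (hypothetical data).
mask = np.array([[1, 1], [0, 0]])
full_extent = np.array([[0.9, 0.8], [0.2, 0.1]])   # covers the whole object
part_only = np.array([[0.9, 0.1], [0.5, 0.0]])     # highlights one part only
```

On the toy maps, the map covering the full object scores higher than the one that fires on a single part, which is the contrast the proposed test is meant to expose.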

Figures

Figures reproduced from arXiv: 2501.12632 by Alexis Guichemerre, Eric Granger, Marco Pedersoli, Shakeeb Murtaza, Soufiane Belharbi.

**Figure 1.** Comparison of our TeD-Loc versus CLIP-ES [1] methods for extracting localization maps from CLIP. (A) CLIP-ES utilizes Grad-CAM to extract localization maps from CLIP, requiring GT class labels during inference. (B) In contrast, our TeD-Loc model distills knowledge from CLIP text embeddings into the visual encoder during training, allowing it to produce both classification scores and localization maps with…

**Figure 2.** Overview of the TeD-Loc method for distilling FG text embeddings into the patch embedding backbone. First, pseudo-labels are extracted to guide the identification of FG and BG patches. By leveraging these FG/BG regions, the model minimizes the similarity of EV with the relevant text embedding for FG classes, while maximizing dissimilarity with embeddings of other classes. Through a binary FG/BG classifier,…

**Figure 3.** t-SNE visualizations of CLIP text embeddings

**Figure 5.** Visualization of localization map defined via…

Abstract

Weakly supervised object localization (WSOL) models are trained using only image-level class labels. They can predict both the object class and spatial regions corresponding to the object, without requiring explicit bounding box annotations. Given their reliance on classification objectives, traditional WSOL methods, like class activation mapping, tend to focus on the most discriminative object regions, often missing the full spatial extent. Although vision-language models such as CLIP encode rich semantic priors, they are not directly suited for WSOL because global text and class-token embeddings are not explicitly aligned with local patch embeddings, making patch-level localization difficult without additional mechanisms. Recent methods such as GenPrompt address this limitation, but at the cost of increased complexity, as they rely on conditional denoising and elaborate prompt-learning strategies. We propose Text Distillation for Localization (TeD-Loc), which transfers knowledge from CLIP text embeddings to patch embeddings through contrastive alignment, thereby enabling patch-level foreground/background localization. A localization-guided classification module is also introduced that uses localization scores to aggregate foreground patch embeddings for joint classification and localization in a single model. In addition, a QR-based orthogonalization of class text embeddings is applied before distillation to improve discrimination for semantically similar classes. Extensive experiments show that TeD-Loc improves Top-1 Loc by ~5% on CUB and ILSVRC, and PxAP by ~31% on histopathology benchmarks, while achieving more efficient inference than GenPrompt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TeD-Loc gives a workable CLIP adaptation for WSOL via contrastive text-to-patch distillation plus QR orthogonalization, with concrete gains on CUB, ILSVRC and histopathology but thin detail on the core alignment step.

The paper's central move is to distill global CLIP text embeddings into local patch embeddings with a contrastive loss, then feed the resulting foreground scores into a localization-guided classifier that aggregates only the high-scoring patches. They add QR orthogonalization on the class text embeddings to reduce confusion between similar categories. This combination is presented as simpler than GenPrompt while still using the same pretrained CLIP backbone.

The reported numbers are a roughly 5% lift in Top-1 Loc on CUB and ILSVRC and a 31% jump in PxAP on the medical slides, plus faster inference. Those are the concrete results worth noting. The approach is straightforward to describe and the medical-domain test is a plus.

The main gap is that the abstract supplies no loss formulation, no description of positive/negative patch sampling, and no regularizer that would force the patch scores to cover the full object rather than just the most discriminative parts. The stress-test worry about the global-to-local transfer therefore still stands on the information given; without ablations or map visualizations it is hard to tell whether the gains come from better localization or from the extra classification head.

The citation pattern looks standard for the sub-area. This is a paper for people already working on WSOL with vision-language models who need a lighter alternative to prompt-heavy methods. It is coherent on its own terms and shows honest engagement with the benchmarks, so it deserves referee time even if the mechanism needs more unpacking.

Referee Report

3 major / 2 minor

Summary. The paper proposes TeD-Loc for weakly supervised object localization (WSOL). It transfers CLIP text embeddings to image patch embeddings via contrastive alignment to produce foreground/background localization scores, introduces a localization-guided classification module that aggregates foreground patches for joint classification and localization, and applies QR-based orthogonalization to class text embeddings to reduce inter-class confusion. The method is positioned as simpler and more efficient than GenPrompt. Experiments claim ~5% gains in Top-1 Loc on CUB and ILSVRC and ~31% in PxAP on histopathology benchmarks.

Significance. If the contrastive alignment reliably transfers global semantic discrimination to local patches without inheriting the most-discriminative-part bias of standard WSOL, TeD-Loc would provide a lightweight alternative to prompt-engineering and denoising approaches for leveraging pretrained vision-language models in localization tasks. The QR orthogonalization is a clean addition for handling semantically similar classes. Efficiency advantages over GenPrompt are practically relevant. No machine-checked proofs or open code are mentioned.

major comments (3)

[Abstract, §4] Abstract and §4 (Experiments): The reported percentage improvements (~5% Top-1 Loc, ~31% PxAP) are presented without experimental protocol details, error bars, ablation tables, or statistical tests, preventing verification that gains are robust rather than sensitive to post-hoc choices or dataset splits.
[§3.1–3.2] §3.1–3.2 (Method, contrastive alignment): The core transfer of global CLIP text embeddings to local patch embeddings via contrastive alignment is described at a high level but supplies no explicit loss formulation, positive/negative patch sampling strategy, or locality-preserving regularizer; without these, the claim that the resulting scores reliably capture full object extent (rather than reverting to the discriminative-part failure mode noted in the introduction) remains unanchored and load-bearing for both localization and the subsequent aggregation module.
[§3.3] §3.3 (QR orthogonalization): While the orthogonalization is introduced to improve discrimination, no ablation quantifies its isolated contribution to localization scores versus classification accuracy, leaving unclear whether it mitigates the locality gap or merely addresses a secondary inter-class issue.

minor comments (2)

[§3] Notation for patch embeddings and localization scores is introduced without a consolidated table or consistent symbols across equations, complicating traceability from text embeddings to final maps.
[Abstract, §4] The abstract states efficiency gains over GenPrompt but provides no inference-time or parameter-count comparison table in the main text or supplement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight areas where additional clarity and evidence can strengthen the manuscript. We address each major comment below and will revise the paper accordingly.

Referee: [Abstract, §4] Abstract and §4 (Experiments): The reported percentage improvements (~5% Top-1 Loc, ~31% PxAP) are presented without experimental protocol details, error bars, ablation tables, or statistical tests, preventing verification that gains are robust rather than sensitive to post-hoc choices or dataset splits.

Authors: We agree that the current presentation of results limits independent verification of robustness. In the revised manuscript we will expand §4 with the complete experimental protocol (including training details, hyperparameters, and dataset splits), report error bars from multiple random seeds, add further ablation tables, and include statistical significance tests for the reported gains on CUB, ILSVRC, and the histopathology benchmarks.
Revision: yes

Referee: [§3.1–3.2] §3.1–3.2 (Method, contrastive alignment): The core transfer of global CLIP text embeddings to local patch embeddings via contrastive alignment is described at a high level but supplies no explicit loss formulation, positive/negative patch sampling strategy, or locality-preserving regularizer; without these, the claim that the resulting scores reliably capture full object extent (rather than reverting to the discriminative-part failure mode noted in the introduction) remains unanchored and load-bearing for both localization and the subsequent aggregation module.

Authors: We will revise §3.1–3.2 to supply the explicit contrastive loss equation, the precise positive/negative patch sampling procedure, and any locality-preserving regularizer. These additions will directly support the claim that the distilled scores capture full object extent rather than only the most discriminative parts.
Revision: yes

Referee: [§3.3] §3.3 (QR orthogonalization): While the orthogonalization is introduced to improve discrimination, no ablation quantifies its isolated contribution to localization scores versus classification accuracy, leaving unclear whether it mitigates the locality gap or merely addresses a secondary inter-class issue.

Authors: We acknowledge the absence of an isolated ablation for the QR orthogonalization step. In the revision we will add an ablation that isolates its effect on localization metrics (Top-1 Loc, PxAP) versus classification accuracy, clarifying whether its primary benefit is reduced inter-class confusion or improved localization.
Revision: yes

Circularity Check

0 steps flagged

No circularity: method builds on external CLIP pretraining with independent contrastive alignment and localization-guided classification; no derivations reduce to self-fitted quantities or self-citations.

The paper presents TeD-Loc as a new architecture that transfers knowledge from a pretrained external CLIP model via contrastive alignment between global text embeddings and local patch embeddings, followed by a localization-guided classification module and QR orthogonalization. No equations or derivations are provided that define a quantity in terms of itself or rename a fitted parameter as a prediction. The central performance claims rest on experimental results on CUB, ILSVRC, and histopathology benchmarks rather than any internal reduction. Reliance on CLIP is external and independent of the present paper's training, satisfying the criteria for a self-contained derivation with no load-bearing self-citation or self-definitional steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes standard contrastive-loss hyperparameters and that CLIP embeddings contain usable local information once aligned.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
Relation between the paper passage and the cited Recognition theorem.

transfers knowledge from CLIP text embeddings to patch embeddings through contrastive alignment... $\mathcal{L}_{KD} = \sum \mathrm{CE}(y, f(z_p, t_y))$

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
Relation between the paper passage and the cited Recognition theorem.

QR-based orthogonalization of class text embeddings... to improve discrimination

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages

[1] Y. Lin, M. Chen, W. Wang, B. Wu, K. Li, B. Lin, H. Liu, and X. He, “CLIP is also an efficient segmenter: A text-driven approach for weakly supervised semantic segmentation,” in CVPR, 2023.
[2] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in CVPR, 2016.
[3] S. Belharbi, A. Sarraf, M. Pedersoli, I. B. Ayed, L. McCaffrey, and E. Granger, “F-CAM: Full resolution class activation maps via guided parametric upscaling,” in WACV, 2022.
[4] W. Lu, X. Jia, W. Xie, L. Shen, Y. Zhou, and J. Duan, “Geometry constrained weakly supervised object localization,” in ECCV (A. Vedaldi, H. Bischof, T. Brox, and J. Frahm, eds.), pp. 481–496, 2020.
[5] P. Wu, W. Zhai, and Y. Cao, “Background activation suppression for weakly supervised object localization,” in CVPR, pp. 14228–14237, IEEE, 2022.
[6] H. Xue, C. Liu, F. Wan, J. Jiao, X. Ji, and Q. Ye, “Danet: Divergent activation for weakly supervised object localization,” in CVPR, 2019.
[7] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in ICCV, pp. 6023–6032, 2019.
[8] X. Zhang, Y. Wei, G. Kang, Y. Yang, and T. Huang, “Self-produced guidance for weakly-supervised object localization,” in ECCV, 2018.
[9] X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. Huang, “Adversarial complementary learning for weakly supervised object localization,” in CVPR, 2018.
[10] J. Choe and H. Shim, “Attention-based dropout layer for weakly supervised object localization,” in CVPR,
[11] J. Choe, S. Lee, and H. Shim, “Attention-based dropout layer for weakly supervised single object localization and semantic segmentation,” IEEE TPAMI, pp. 4256–4271, 2021.
[12] S. Murtaza, S. Belharbi, M. Pedersoli, A. Sarraf, and E. Granger, “Discriminative sampling of proposals in self-supervised transformers for weakly supervised object localization,” in WACVw, pp. 155–165, 2023.
[13] W. Gao, F. Wan, X. Pan, Z. Peng, Q. Tian, Z. Han, B. Zhou, and Q. Ye, “TS-CAM: Token semantic coupled attention map for weakly supervised object localization,” in ICCV, 2021.
[14] J. Wang and G. Kang, “Learn to rectify the bias of clip for unsupervised semantic segmentation,” in CVPR, pp. 4102–4112, 2024.
[15] Y. Zhao, Q. Ye, W. Wu, C. Shen, and F. Wan, “Generative prompt model for weakly supervised object localization,” in CVPR, pp. 6351–6361, 2023.
[16] W. Gander, “Algorithms for the QR-decomposition,”
[17] K. K. Singh and Y. J. Lee, “Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization,” in ICCV, 2017.
[18] X. Pan, Y. Gao, Z. Lin, F. Tang, W. Dong, H. Yuan, F. Huang, and C. Xu, “Unveiling the potential of structure preserving for weakly supervised object localization,” in CVPR, pp. 11642–11651, 2021.
[19] C.-L. Zhang, Y.-H. Cao, and J. Wu, “Rethinking the route towards weakly supervised object localization,” in CVPR, 2020.
[20] G. Sharir, A. Noy, and L. Zelnik-Manor, “An image is worth 16x16 words, what is a video worth?,” arXiv preprint arXiv:2103.13915, 2021.
[21] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in ECCV, Springer, 2020.
[22] A. Radford, J. Kim, C. Hallacy, et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
[23] J. Xie, X. Hou, K. Ye, and L. Shen, “CLIMS: Cross language image matching for weakly supervised semantic segmentation,” in CVPR, 2022.
[24] F. Wang, J. Mei, and A. Yuille, “Sclip: Rethinking self-attention for dense vision-language inference,” arXiv preprint arXiv:2312.01597, 2023.
[25] S. Hajimiri, I. B. Ayed, and J. Dolz, “Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation,” arXiv preprint arXiv:2404.08181, 2024.
[26] X. Yang and X. Gong, “Foundation model assisted weakly supervised semantic segmentation,” in WACV,
[27] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., “Segment anything,” in ICCV, 2023.
[28] Y. Fang, W. Wang, B. Xie, Q. Sun, L. Wu, X. Wang, T. Huang, X. Wang, and Y. Cao, “Eva: Exploring the limits of masked visual representation learning at scale,” in CVPR, pp. 19358–19369, 2023.
[29] J. Choe, S. Oh, S. Lee, S. Chun, Z. Akata, and H. Shim, “Evaluating weakly supervised object localization methods right,” in CVPR, 2020.
[30] S. Murtaza, S. Belharbi, M. Pedersoli, and E. Granger, “A realistic protocol for evaluation of weakly supervised object localization,” in IEEE WACV, 2025.
[31] J. Rony, S. Belharbi, J. Dolz, I. Ben Ayed, L. McCaffrey, and E. Granger, “Deep weakly-supervised learning methods for classification and localization in histology images: A survey,” Machine Learning for Biomedical Imaging, vol. 2, pp. 96–150, 2023.
[32] S. Murtaza, S. Belharbi, M. Pedersoli, A. Sarraf, and E. Granger, “Dips: Discriminative pseudo-label sampling with self-supervised transformers for weakly supervised object localization,” IVC Journal, vol. 140, p. 104838, 2023.
[33] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
[34] S. Hajimiri, I. B. Ayed, and J. Dolz, “Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation,” in WACV, 2025.
[35] C. Wah, S. Branson, W. Steve, P. Peter, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” Tech. Rep. CNS-TR-2011-001, California Institute of Technology, 2011.
[36] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar, “Do imagenet classifiers generalize to imagenet?,” in ICML,
[37] T. Deselaers, B. Alexe, and V. Ferrari, “Weakly supervised localization and learning with generic knowledge,” IJCV, vol. 100, pp. 275–293, 2012.
[38] H. Bai, R. Zhang, J. Wang, and X. Wan, “Weakly supervised object localization via transformer with implicit spatial calibration,” ECCV, 2022.
[39] Z. Chen, C. Wang, Y. Wang, G. Jiang, Y. Shen, Y. Tai, C. Wang, W. Zhang, and L. Cao, “LCTR: on awakening the local continuity of transformer for weakly supervised object localization,” in AAAI, pp. 410–418,
[40] J. Xie, J. Xiang, J. Chen, X. Hou, X. Zhao, and L. Shen, “C2am: Contrastive learning of class-agnostic activation map for weakly supervised object localization and semantic segmentation,” in CVPR, pp. 989–998, 2022.
[41] Z. Chen, J. Ding, L. Cao, Y. Shen, S. Zhang, G. Jiang, and R. Ji, “Category-aware allocation transformer for weakly supervised object localization,” in ICCV, pp. 6643–6652, 2023.
[42] L. Zhu, Q. She, Q. Chen, Q. Ren, and Y. Lu, “Boosting weakly supervised object localization and segmentation with domain adaption,” IEEE TPAMI, 2024.
[43] N. Otsu et al., “A threshold selection method from gray-level histograms,” Automatica, vol. 11, no. 285-296, pp. 23–27, 1975.

[1] [1]

CLIP is also an efficient segmenter: A text-driven approach for weakly supervised semantic segmentation,

Y. Lin, M. Chen, W. Wang, B. Wu, K. Li, B. Lin, H. Liu, and X. He, “CLIP is also an efficient segmenter: A text-driven approach for weakly supervised semantic segmentation,” in CVPR, 2023. 1, 2, 3, 7, 8, 9, 10

work page 2023

[2] [2]

Learning deep features for discrimina- tive localization,

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discrimina- tive localization,” in CVPR, 2016. 1, 3

work page 2016

[3] [3]

F-CAM: Full resolution class activation maps via guided parametric upscaling,

S. Belharbi, A. Sarraf, M. Pedersoli, I. B. Ayed, L. Mc- Caffrey, and E. Granger, “F-CAM: Full resolution class activation maps via guided parametric upscaling,” in W ACV, 2022. 1, 2, 5, 9

work page 2022

[4] [4]

Geometry constrained weakly supervised object local- ization,

W. Lu, X. Jia, W. Xie, L. Shen, Y. Zhou, and J. Duan, “Geometry constrained weakly supervised object local- ization,” in ECCV (A. Vedaldi, H. Bischof, T. Brox, and J. Frahm, eds.), pp. 481–496, 2020. 2

work page 2020

[5] [5]

Background activation suppression for weakly supervised object localization,

P. Wu, W. Zhai, and Y. Cao, “Background activation suppression for weakly supervised object localization,” in CVPR, pp. 14228–14237, IEEE, 2022. 3, 7, 8

work page 2022

[6] [6]

Danet: Divergent activation for weakly supervised object localization,

H. Xue, C. Liu, F. Wan, J. Jiao, X. Ji, and Q. Ye, “Danet: Divergent activation for weakly supervised object localization,” in CVPR, 2019

work page 2019

[7] [7]

Cutmix: Regularization strategy to train strong classifiers with localizable features,

S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in ICCV, pp. 6023–6032, 2019. 3

work page 2019

[8] [8]

Self-produced guidance for weakly-supervised object localization,

X. Zhang, Y. Wei, G. Kang, Y. Yang, and T. Huang, “Self-produced guidance for weakly-supervised object localization,” in ECCV, 2018. 2, 3

work page 2018

[9] [9]

Adversarial complementary learning for weakly su- pervised object localization,

X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. Huang, “Adversarial complementary learning for weakly su- pervised object localization,” in CVPR, 2018. 2, 3

work page 2018

[10] [10]

Attention-based dropout layer for weakly supervised object localization,

J. Choe and H. Shim, “Attention-based dropout layer for weakly supervised object localization,” in CVPR,

work page

[11] [11]

Attention-based dropout layer for weakly supervised single object lo- calization and semantic segmentation,

J. Choe, S. Lee, and H. Shim, “Attention-based dropout layer for weakly supervised single object lo- calization and semantic segmentation,” IEEE TPAMI, pp. 4256–4271, 2021. 2

work page 2021

[12] S. Murtaza, S. Belharbi, M. Pedersoli, A. Sarraf, and E. Granger, “Discriminative sampling of proposals in self-supervised transformers for weakly supervised object localization,” in WACVW, pp. 155–165, 2023.

[13] W. Gao, F. Wan, X. Pan, Z. Peng, Q. Tian, Z. Han, B. Zhou, and Q. Ye, “TS-CAM: Token semantic coupled attention map for weakly supervised object localization,” in ICCV, 2021.

[14] J. Wang and G. Kang, “Learn to rectify the bias of CLIP for unsupervised semantic segmentation,” in CVPR, pp. 4102–4112, 2024.

[15] Y. Zhao, Q. Ye, W. Wu, C. Shen, and F. Wan, “Generative prompt model for weakly supervised object localization,” in CVPR, pp. 6351–6361, 2023.

[16] W. Gander, “Algorithms for the QR-decomposition,”

[17] K. K. Singh and Y. J. Lee, “Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization,” in ICCV, 2017.

[18] X. Pan, Y. Gao, Z. Lin, F. Tang, W. Dong, H. Yuan, F. Huang, and C. Xu, “Unveiling the potential of structure preserving for weakly supervised object localization,” in CVPR, pp. 11642–11651, 2021.

[19] C.-L. Zhang, Y.-H. Cao, and J. Wu, “Rethinking the route towards weakly supervised object localization,” in CVPR, 2020.

[20] G. Sharir, A. Noy, and L. Zelnik-Manor, “An image is worth 16x16 words, what is a video worth?,” arXiv preprint arXiv:2103.13915, 2021.

[21] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in ECCV, Springer, 2020.

[22] A. Radford, J. Kim, C. Hallacy, et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.

[23] J. Xie, X. Hou, K. Ye, and L. Shen, “CLIMS: Cross language image matching for weakly supervised semantic segmentation,” in CVPR, 2022.

[24] F. Wang, J. Mei, and A. Yuille, “SCLIP: Rethinking self-attention for dense vision-language inference,” arXiv preprint arXiv:2312.01597, 2023.

[25] S. Hajimiri, I. B. Ayed, and J. Dolz, “Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation,” arXiv preprint arXiv:2404.08181, 2024.

[26] X. Yang and X. Gong, “Foundation model assisted weakly supervised semantic segmentation,” in WACV,

[27] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., “Segment anything,” in ICCV, 2023.

[28] Y. Fang, W. Wang, B. Xie, Q. Sun, L. Wu, X. Wang, T. Huang, X. Wang, and Y. Cao, “EVA: Exploring the limits of masked visual representation learning at scale,” in CVPR, pp. 19358–19369, 2023.

[29] J. Choe, S. Oh, S. Lee, S. Chun, Z. Akata, and H. Shim, “Evaluating weakly supervised object localization methods right,” in CVPR, 2020.

[30] S. Murtaza, S. Belharbi, M. Pedersoli, and E. Granger, “A realistic protocol for evaluation of weakly supervised object localization,” in IEEE WACV, 2025.

[31] J. Rony, S. Belharbi, J. Dolz, I. Ben Ayed, L. McCaffrey, and E. Granger, “Deep weakly-supervised learning methods for classification and localization in histology images: A survey,” Machine Learning for Biomedical Imaging, vol. 2, pp. 96–150, 2023.

[32] S. Murtaza, S. Belharbi, M. Pedersoli, A. Sarraf, and E. Granger, “DiPS: Discriminative pseudo-label sampling with self-supervised transformers for weakly supervised object localization,” IVC Journal, vol. 140, p. 104838, 2023.

[33] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.

[34] S. Hajimiri, I. B. Ayed, and J. Dolz, “Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation,” in WACV, 2025.

[35] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 dataset,” Tech. Rep. CNS-TR-2011-001, California Institute of Technology, 2011.

[36] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar, “Do ImageNet classifiers generalize to ImageNet?,” in ICML,

[37] T. Deselaers, B. Alexe, and V. Ferrari, “Weakly supervised localization and learning with generic knowledge,” IJCV, vol. 100, pp. 275–293, 2012.

[38] H. Bai, R. Zhang, J. Wang, and X. Wan, “Weakly supervised object localization via transformer with implicit spatial calibration,” in ECCV, 2022.

[39] Z. Chen, C. Wang, Y. Wang, G. Jiang, Y. Shen, Y. Tai, C. Wang, W. Zhang, and L. Cao, “LCTR: On awakening the local continuity of transformer for weakly supervised object localization,” in AAAI, pp. 410–418,

[40] J. Xie, J. Xiang, J. Chen, X. Hou, X. Zhao, and L. Shen, “C2AM: Contrastive learning of class-agnostic activation map for weakly supervised object localization and semantic segmentation,” in CVPR, pp. 989–998, 2022.

[41] Z. Chen, J. Ding, L. Cao, Y. Shen, S. Zhang, G. Jiang, and R. Ji, “Category-aware allocation transformer for weakly supervised object localization,” in ICCV, pp. 6643–6652, 2023.

[42] L. Zhu, Q. She, Q. Chen, Q. Ren, and Y. Lu, “Boosting weakly supervised object localization and segmentation with domain adaption,” IEEE TPAMI, 2024.

[43] N. Otsu et al., “A threshold selection method from gray-level histograms,” Automatica, vol. 11, no. 285–296, pp. 23–27, 1975.