pith. sign in

arxiv: 2501.12632 · v2 · submitted 2025-01-22 · 💻 cs.CV · cs.LG

TeD-Loc: Text Distillation for Weakly Supervised Object Localization

Pith reviewed 2026-05-23 05:08 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords weakly supervised object localizationCLIPtext distillationcontrastive alignmentpatch embeddingsforeground background separationhistopathologyvision-language models
0
0 comments X

The pith

TeD-Loc distills CLIP text embeddings into patch embeddings via contrastive alignment to produce foreground and background scores for weakly supervised localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Weakly supervised object localization requires identifying both class and spatial extent from image-level labels alone. Standard class activation mapping approaches tend to highlight only the most discriminative object parts rather than the full extent. TeD-Loc transfers semantic knowledge from global CLIP text embeddings to local image patch embeddings through contrastive alignment, generating direct localization scores. A localization-guided classification module then aggregates the scored foreground patches to perform both tasks jointly, while QR orthogonalization of class text embeddings sharpens discrimination among similar categories. The resulting model reports higher localization accuracy than prior methods that depend on conditional denoising and elaborate prompt engineering.

Core claim

TeD-Loc transfers knowledge from CLIP text embeddings to patch embeddings through contrastive alignment, thereby enabling patch-level foreground/background localization. A localization-guided classification module is also introduced that uses localization scores to aggregate foreground patch embeddings for joint classification and localization in a single model. In addition, a QR-based orthogonalization of class text embeddings is applied before distillation to improve discrimination for semantically similar classes.

What carries the argument

Contrastive alignment between global CLIP text embeddings and local patch embeddings that produces foreground/background localization scores.

If this is right

  • Localization extends beyond the most discriminative regions to cover fuller object extent.
  • Classification and localization are trained and run jointly in one model without separate post-processing stages.
  • Top-1 Loc accuracy rises by roughly 5 percent on CUB and ILSVRC benchmarks.
  • PxAP rises by roughly 31 percent on histopathology benchmarks.
  • Inference runs more efficiently than methods that require conditional denoising and complex prompt learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment step could be tested on other vision-language models to check whether the localization benefit generalizes beyond CLIP.
  • QR orthogonalization may prove useful in any multi-class setting where category embeddings are close in embedding space.
  • Lower inference cost could enable real-time localization on edge devices that cannot run denoising-based alternatives.
  • The foreground aggregation idea might combine with other weak signals such as scribbles or points for hybrid supervision.

Load-bearing premise

Contrastive alignment between global text embeddings and local patch embeddings will produce reliable foreground/background scores that can be used directly for both localization and classification without further mechanisms.

What would settle it

Run the contrastive alignment on a held-out set with pixel-level masks and observe whether the resulting patch scores fail to separate object from background at rates better than a standard class activation mapping baseline.

Figures

Figures reproduced from arXiv: 2501.12632 by Alexis Guichemerre, Eric Granger, Marco Pedersoli, Shakeeb Murtaza, Soufiane Belharbi.

Figure 1
Figure 1. Figure 1: Comparison of our TeD-Loc versus CLIP-ES [1] methods for extracting localization maps from CLIP. (A) CLIP-ES utilizes Grad-CAM to extract localization maps from CLIP, requiring GT class labels during inference. (B) In contrast, our TeD-Loc model distills knowledge from CLIP text embeddings into the visual encoder during train￾ing, allowing it to produce both classification scores and localization maps with… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the TeD-Loc method for distilling FG text embeddings into the patch embedding backbone. First, pseudo-labels are extracted to guide the identification of FG and BG patches. By leveraging these FG/BG regions, the model minimizes the similarity of EV with the relevant text embedding for FG classes, while maximizing dissimilarity with embeddings of other classes. Through a binary FG/BG classifier,… view at source ↗
Figure 3
Figure 3. Figure 3: t-SNE visualizations of CLIP text embeddings [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of localization map defined via [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
read the original abstract

Weakly supervised object localization (WSOL) models are trained using only image-level class labels. They can predict both the object class and spatial regions corresponding to the object, without requiring explicit bounding box annotations. Given their reliance on classification objectives, traditional WSOL methods, like class activation mapping, tend to focus on the most discriminative object regions, often missing the full spatial extent. Although vision-language models such as CLIP encode rich semantic priors, they are not directly suited for WSOL because global text and class-token embeddings are not explicitly aligned with local patch embeddings, making patch-level localization difficult without additional mechanisms. Recent methods such as GenPrompt address this limitation, but at the cost of increased complexity, as they rely on conditional denoising and elaborate prompt-learning strategies. We propose Text Distillation for Localization (TeD-Loc), which transfers knowledge from CLIP text embeddings to patch embeddings through contrastive alignment, thereby enabling patch-level foreground/background localization. A localization-guided classification module is also introduced that uses localization scores to aggregate foreground patch embeddings for joint classification and localization in a single model. In addition, a QR-based orthogonalization of class text embeddings is applied before distillation to improve discrimination for semantically similar classes. Extensive experiments show that TeD-Loc improves Top-1 Loc by ~5% on CUB and ILSVRC, and PxAP by ~31% on histopathology benchmarks, while achieving more efficient inference than GenPrompt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes TeD-Loc for weakly supervised object localization (WSOL). It transfers CLIP text embeddings to image patch embeddings via contrastive alignment to produce foreground/background localization scores, introduces a localization-guided classification module that aggregates foreground patches for joint classification and localization, and applies QR-based orthogonalization to class text embeddings to reduce inter-class confusion. The method is positioned as simpler and more efficient than GenPrompt. Experiments claim ~5% gains in Top-1 Loc on CUB and ILSVRC and ~31% in PxAP on histopathology benchmarks.

Significance. If the contrastive alignment reliably transfers global semantic discrimination to local patches without inheriting the most-discriminative-part bias of standard WSOL, TeD-Loc would provide a lightweight alternative to prompt-engineering and denoising approaches for leveraging pretrained vision-language models in localization tasks. The QR orthogonalization is a clean addition for handling semantically similar classes. Efficiency advantages over GenPrompt are practically relevant. No machine-checked proofs or open code are mentioned.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (Experiments): The reported percentage improvements (~5% Top-1 Loc, ~31% PxAP) are presented without experimental protocol details, error bars, ablation tables, or statistical tests, preventing verification that gains are robust rather than sensitive to post-hoc choices or dataset splits.
  2. [§3.1–3.2] §3.1–3.2 (Method, contrastive alignment): The core transfer of global CLIP text embeddings to local patch embeddings via contrastive alignment is described at a high level but supplies no explicit loss formulation, positive/negative patch sampling strategy, or locality-preserving regularizer; without these, the claim that the resulting scores reliably capture full object extent (rather than reverting to the discriminative-part failure mode noted in the introduction) remains unanchored and load-bearing for both localization and the subsequent aggregation module.
  3. [§3.3] §3.3 (QR orthogonalization): While the orthogonalization is introduced to improve discrimination, no ablation quantifies its isolated contribution to localization scores versus classification accuracy, leaving unclear whether it mitigates the locality gap or merely addresses a secondary inter-class issue.
minor comments (2)
  1. [§3] Notation for patch embeddings and localization scores is introduced without a consolidated table or consistent symbols across equations, complicating traceability from text embeddings to final maps.
  2. [Abstract, §4] The abstract states efficiency gains over GenPrompt but provides no inference-time or parameter-count comparison table in the main text or supplement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight areas where additional clarity and evidence can strengthen the manuscript. We address each major comment below and will revise the paper accordingly.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (Experiments): The reported percentage improvements (~5% Top-1 Loc, ~31% PxAP) are presented without experimental protocol details, error bars, ablation tables, or statistical tests, preventing verification that gains are robust rather than sensitive to post-hoc choices or dataset splits.

    Authors: We agree that the current presentation of results limits independent verification of robustness. In the revised manuscript we will expand §4 with the complete experimental protocol (including training details, hyperparameters, and dataset splits), report error bars from multiple random seeds, add further ablation tables, and include statistical significance tests for the reported gains on CUB, ILSVRC, and the histopathology benchmarks. revision: yes

  2. Referee: [§3.1–3.2] §3.1–3.2 (Method, contrastive alignment): The core transfer of global CLIP text embeddings to local patch embeddings via contrastive alignment is described at a high level but supplies no explicit loss formulation, positive/negative patch sampling strategy, or locality-preserving regularizer; without these, the claim that the resulting scores reliably capture full object extent (rather than reverting to the discriminative-part failure mode noted in the introduction) remains unanchored and load-bearing for both localization and the subsequent aggregation module.

    Authors: We will revise §3.1–3.2 to supply the explicit contrastive loss equation, the precise positive/negative patch sampling procedure, and any locality-preserving regularizer. These additions will directly support the claim that the distilled scores capture full object extent rather than only the most discriminative parts. revision: yes

  3. Referee: [§3.3] §3.3 (QR orthogonalization): While the orthogonalization is introduced to improve discrimination, no ablation quantifies its isolated contribution to localization scores versus classification accuracy, leaving unclear whether it mitigates the locality gap or merely addresses a secondary inter-class issue.

    Authors: We acknowledge the absence of an isolated ablation for the QR orthogonalization step. In the revision we will add an ablation that isolates its effect on localization metrics (Top-1 Loc, PxAP) versus classification accuracy, clarifying whether its primary benefit is reduced inter-class confusion or improved localization. revision: yes

Circularity Check

0 steps flagged

No circularity: method builds on external CLIP pretraining with independent contrastive alignment and localization-guided classification; no derivations reduce to self-fitted quantities or self-citations.

full rationale

The paper presents TeD-Loc as a new architecture that transfers knowledge from a pretrained external CLIP model via contrastive alignment between global text embeddings and local patch embeddings, followed by a localization-guided classification module and QR orthogonalization. No equations or derivations are provided that define a quantity in terms of itself or rename a fitted parameter as a prediction. The central performance claims rest on experimental results on CUB, ILSVRC, and histopathology benchmarks rather than any internal reduction. Reliance on CLIP is external and independent of the present paper's training, satisfying the criteria for a self-contained derivation with no load-bearing self-citation or self-definitional steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes standard contrastive-loss hyperparameters and that CLIP embeddings contain usable local information once aligned.

pith-pipeline@v0.9.0 · 5809 in / 1177 out tokens · 36265 ms · 2026-05-23T05:08:05.616164+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages

  1. [1]

    CLIP is also an efficient segmenter: A text-driven approach for weakly supervised semantic segmentation,

    Y. Lin, M. Chen, W. Wang, B. Wu, K. Li, B. Lin, H. Liu, and X. He, “CLIP is also an efficient segmenter: A text-driven approach for weakly supervised semantic segmentation,” in CVPR, 2023. 1, 2, 3, 7, 8, 9, 10

  2. [2]

    Learning deep features for discrimina- tive localization,

    B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discrimina- tive localization,” in CVPR, 2016. 1, 3

  3. [3]

    F-CAM: Full resolution class activation maps via guided parametric upscaling,

    S. Belharbi, A. Sarraf, M. Pedersoli, I. B. Ayed, L. Mc- Caffrey, and E. Granger, “F-CAM: Full resolution class activation maps via guided parametric upscaling,” in W ACV, 2022. 1, 2, 5, 9

  4. [4]

    Geometry constrained weakly supervised object local- ization,

    W. Lu, X. Jia, W. Xie, L. Shen, Y. Zhou, and J. Duan, “Geometry constrained weakly supervised object local- ization,” in ECCV (A. Vedaldi, H. Bischof, T. Brox, and J. Frahm, eds.), pp. 481–496, 2020. 2

  5. [5]

    Background activation suppression for weakly supervised object localization,

    P. Wu, W. Zhai, and Y. Cao, “Background activation suppression for weakly supervised object localization,” in CVPR, pp. 14228–14237, IEEE, 2022. 3, 7, 8

  6. [6]

    Danet: Divergent activation for weakly supervised object localization,

    H. Xue, C. Liu, F. Wan, J. Jiao, X. Ji, and Q. Ye, “Danet: Divergent activation for weakly supervised object localization,” in CVPR, 2019

  7. [7]

    Cutmix: Regularization strategy to train strong classifiers with localizable features,

    S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in ICCV, pp. 6023–6032, 2019. 3

  8. [8]

    Self-produced guidance for weakly-supervised object localization,

    X. Zhang, Y. Wei, G. Kang, Y. Yang, and T. Huang, “Self-produced guidance for weakly-supervised object localization,” in ECCV, 2018. 2, 3

  9. [9]

    Adversarial complementary learning for weakly su- pervised object localization,

    X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. Huang, “Adversarial complementary learning for weakly su- pervised object localization,” in CVPR, 2018. 2, 3

  10. [10]

    Attention-based dropout layer for weakly supervised object localization,

    J. Choe and H. Shim, “Attention-based dropout layer for weakly supervised object localization,” in CVPR,

  11. [11]

    Attention-based dropout layer for weakly supervised single object lo- calization and semantic segmentation,

    J. Choe, S. Lee, and H. Shim, “Attention-based dropout layer for weakly supervised single object lo- calization and semantic segmentation,” IEEE TPAMI, pp. 4256–4271, 2021. 2

  12. [12]

    Discriminative sampling of proposals in self-supervised transformers for weakly supervised ob- ject localization,

    S. Murtaza, S. Belharbi, M. Pedersoli, A. Sarraf, and E. Granger, “Discriminative sampling of proposals in self-supervised transformers for weakly supervised ob- ject localization,” in W ACVw, pp. 155–165, 2023. 2, 9

  13. [13]

    TS-CAM: Token semantic cou- pled attention map for weakly supervised object local- ization,

    W. Gao, F. Wan, X. Pan, Z. Peng, Q. Tian, Z. Han, B. Zhou, and Q. Ye, “TS-CAM: Token semantic cou- pled attention map for weakly supervised object local- ization,” in ICCV, 2021. 2, 3, 7, 8, 9, 11

  14. [14]

    Learn to rectify the bias of clip for unsupervised semantic segmentation,

    J. Wang and G. Kang, “Learn to rectify the bias of clip for unsupervised semantic segmentation,” in CVPR, pp. 4102–4112, 2024. 2, 5

  15. [15]

    Gen- erative prompt model for weakly supervised object lo- calization,

    Y. Zhao, Q. Ye, W. Wu, C. Shen, and F. Wan, “Gen- erative prompt model for weakly supervised object lo- calization,” in CVPR, pp. 6351–6361, 2023. 2, 7, 8, 9, 11

  16. [16]

    Algorithms for the qr-decomposition,

    W. GANDER, “Algorithms for the qr-decomposition,”

  17. [17]

    Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization,

    K. K. Singh and Y. J. Lee, “Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization,” in ICCV, 2017. 3, 6

  18. [18]

    Unveiling the potential of struc- ture preserving for weakly supervised object localiza- tion,

    X. Pan, Y. Gao, Z. Lin, F. Tang, W. Dong, H. Yuan, F. Huang, and C. Xu, “Unveiling the potential of struc- ture preserving for weakly supervised object localiza- tion,” in CVPR, pp. 11642–11651, 2021. 3

  19. [19]

    Rethinking the route towards weakly supervised object localization,

    C.-L. Zhang, Y.-H. Cao, and J. Wu, “Rethinking the route towards weakly supervised object localization,” in CVPR, 2020. 3, 7, 8

  20. [20]

    An image is worth 16x16 words, what is a video worth? arXiv preprint arXiv:2103.13915, 2021

    G. Sharir, A. Noy, and L. Zelnik-Manor, “An image is worth 16x16 words, what is a video worth?,” arXiv preprint arXiv:2103.13915, 2021. 3

  21. [21]

    End-to-end object detection with transformers,

    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kir- illov, and S. Zagoruyko, “End-to-end object detection with transformers,” in ECCV, Springer, 2020. 3

  22. [22]

    Learning transferable visual models from natural language su- pervision,

    A. Radford, J. Kim, C. Hallacy, and et al, “Learning transferable visual models from natural language su- pervision,” in ICML, 2021. 3, 5, 9, 10

  23. [23]

    CLIMS: Cross lan- guage image matching for weakly supervised semantic segmentation,

    J. Xie, X. Hou, K. Ye, and L. Shen, “CLIMS: Cross lan- guage image matching for weakly supervised semantic segmentation,” in CVPR, 2022. 3

  24. [24]

    Sclip: Rethinking self- attention for dense vision-language inference,

    F. Wang, J. Mei, and A. Yuille, “Sclip: Rethinking self- attention for dense vision-language inference,” arXiv preprint arXiv:2312.01597, 2023. 3, 10

  25. [25]

    Pay at- tention to your neighbours: Training-free open- vocabulary semantic segmentation,

    S. Hajimiri, I. B. Ayed, and J. Dolz, “Pay at- tention to your neighbours: Training-free open- vocabulary semantic segmentation,” arXiv preprint arXiv:2404.08181, 2024. 4, 10

  26. [26]

    Foundation model assisted weakly supervised semantic segmentation,

    X. Yang and X. Gong, “Foundation model assisted weakly supervised semantic segmentation,” in W ACV,

  27. [27]

    Segment anything,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.- Y. Lo, et al. , “Segment anything,” in ICCV, 2023. 4

  28. [28]

    Eva: Exploring the limits of masked visual representation learning at scale,

    Y. Fang, W. Wang, B. Xie, Q. Sun, L. Wu, X. Wang, T. Huang, X. Wang, and Y. Cao, “Eva: Exploring the limits of masked visual representation learning at scale,” in CVPR, pp. 19358–19369, 2023. 4

  29. [29]

    Evaluating weakly supervised object local- ization methods right,

    J. Choe, S. Oh, S. Lee, S. Chun, Z. Akata, and H. Shim, “Evaluating weakly supervised object local- ization methods right,” in CVPR, 2020. 5, 6

  30. [30]

    A realistic protocol for evaluation of weakly super- vised object localization,

    S. Murtaza, S. Belharbi, M. Pedersoli, and E. Granger, “A realistic protocol for evaluation of weakly super- vised object localization,” in IEEE W ACV, 2025

  31. [31]

    Deep weakly-supervised learn- ing methods for classification and localization in histol- ogy images: A survey,

    J. Rony, S. Belharbi, J. Dolz, I. Ben Ayed, L. McCaf- frey, and E. Granger, “Deep weakly-supervised learn- ing methods for classification and localization in histol- ogy images: A survey,” Machine Learning for Biomed- ical Imaging, vol. 2, pp. 96–150, 2023. 5

  32. [32]

    Dips: Discriminative pseudo-label sam- pling with self-supervised transformers for weakly su- pervised object localization,

    S. Murtaza, S. Belharbi, M. Pedersoli, A. Sarraf, and E. Granger, “Dips: Discriminative pseudo-label sam- pling with self-supervised transformers for weakly su- pervised object localization,” IVC Journal , vol. 140, p. 104838, 2023. 5, 7, 9

  33. [33]

    Imagenet: A large-scale hierarchical im- age database,

    J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical im- age database,” in CVPR, 2009. 5, 6

  34. [34]

    Pay attention to your neighbours: Training-free open-vocabulary se- mantic segmentation,

    S. Hajimiri, I. B. Ayed, and J. Dolz, “Pay attention to your neighbours: Training-free open-vocabulary se- mantic segmentation,” in W ACV, 2025. 5

  35. [35]

    The caltech-ucsd birds-200-2011 dataset,

    C. Wah, S. Branson, W. Steve, P. Peter, and S. Be- longie, “The caltech-ucsd birds-200-2011 dataset,” Tech. Rep. CNS-TR-2011-001, California Institute of Technology, 2011. 6

  36. [36]

    Do imagenet classifiers generalize to imagenet?,

    B. Recht, R. Roelofs, L. Schmidt, and V. Shankar, “Do imagenet classifiers generalize to imagenet?,” inICML,

  37. [37]

    Weakly su- pervised localization and learning with generic knowl- edge,

    T. Deselaers, B. Alexe, and V. Ferrari, “Weakly su- pervised localization and learning with generic knowl- edge,” IJCV, vol. 100, pp. 275–293, 2012. 6

  38. [38]

    Weakly su- pervised object localization via transformer with im- plicit spatial calibration,

    H. Bai, R. Zhang, J. Wang, and X. Wan, “Weakly su- pervised object localization via transformer with im- plicit spatial calibration,” ECCV, 2022. 7, 8

  39. [39]

    LCTR: on awak- ening the local continuity of transformer for weakly supervised object localization,

    Z. Chen, C. Wang, Y. Wang, G. Jiang, Y. Shen, Y. Tai, C. Wang, W. Zhang, and L. Cao, “LCTR: on awak- ening the local continuity of transformer for weakly supervised object localization,” in AAAI, pp. 410–418,

  40. [40]

    C2am: Contrastive learning of class-agnostic activation map for weakly supervised object localiza- tion and semantic segmentation,

    J. Xie, J. Xiang, J. Chen, X. Hou, X. Zhao, and L. Shen, “C2am: Contrastive learning of class-agnostic activation map for weakly supervised object localiza- tion and semantic segmentation,” in CVPR, pp. 989– 998, 2022. 7, 8

  41. [41]

    Category-aware allocation transformer for weakly supervised object localization,

    Z. Chen, J. Ding, L. Cao, Y. Shen, S. Zhang, G. Jiang, and R. Ji, “Category-aware allocation transformer for weakly supervised object localization,” in ICCV, pp. 6643–6652, 2023. 7, 8

  42. [42]

    Boost- ing weakly supervised object localization and segmen- tation with domain adaption,

    L. Zhu, Q. She, Q. Chen, Q. Ren, and Y. Lu, “Boost- ing weakly supervised object localization and segmen- tation with domain adaption,” IEEE TPAMI, 2024. 7, 8

  43. [43]

    A threshold selection method from gray-level histograms,

    N. Otsu et al. , “A threshold selection method from gray-level histograms,” Automatica, vol. 11, no. 285- 296, pp. 23–27, 1975. 9