TeD-Loc: Text Distillation for Weakly Supervised Object Localization
Pith reviewed 2026-05-23 05:08 UTC · model grok-4.3
The pith
TeD-Loc distills CLIP text embeddings into patch embeddings via contrastive alignment to produce foreground and background scores for weakly supervised localization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TeD-Loc transfers knowledge from CLIP text embeddings to patch embeddings through contrastive alignment, thereby enabling patch-level foreground/background localization. A localization-guided classification module is also introduced that uses localization scores to aggregate foreground patch embeddings for joint classification and localization in a single model. In addition, a QR-based orthogonalization of class text embeddings is applied before distillation to improve discrimination for semantically similar classes.
What carries the argument
Contrastive alignment between global CLIP text embeddings and local patch embeddings that produces foreground/background localization scores.
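As a rough illustration of how aligning patch embeddings with text embeddings can yield foreground/background scores, the sketch below scores each patch with a two-way softmax over its cosine similarity to a class text embedding and to a background embedding. The background embedding, the temperature value, and every function name here are assumptions made for illustration; the source does not specify how TeD-Loc computes these scores. Plain Python lists are used to keep the example dependency-free.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def foreground_scores(patches, text_fg, text_bg, temperature=0.07):
    """Return one foreground probability per patch.

    Hypothetical scoring rule: softmax over the similarity to a class
    (foreground) text embedding and to a background embedding.
    The temperature is an illustrative value, not one taken from the paper.
    """
    scores = []
    for p in patches:
        s_fg = cosine(p, text_fg) / temperature
        s_bg = cosine(p, text_bg) / temperature
        m = max(s_fg, s_bg)  # subtract the max for numerical stability
        e_fg, e_bg = math.exp(s_fg - m), math.exp(s_bg - m)
        scores.append(e_fg / (e_fg + e_bg))
    return scores

# Toy 2-D embeddings: the first patch points toward the class text,
# the second toward the background embedding.
patches = [[1.0, 0.1], [0.1, 1.0]]
scores = foreground_scores(patches, text_fg=[1.0, 0.0], text_bg=[0.0, 1.0])
```

In this toy setup the first patch receives a score close to 1 and the second a score close to 0. In a real model the patch embeddings would come from the image encoder and the scores would be reshaped into a spatial map.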
If this is right
- Localization extends beyond the most discriminative regions to cover fuller object extent.
- Classification and localization are trained and run jointly in one model without separate post-processing stages.
- Top-1 Loc accuracy rises by roughly 5 percent on CUB and ILSVRC benchmarks.
- PxAP rises by roughly 31 percent on histopathology benchmarks.
- Inference runs more efficiently than methods that require conditional denoising and complex prompt learning.
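For readers unfamiliar with the Top-1 Loc metric cited above, the following sketch computes it under the commonly used WSOL definition: a sample counts as correct when the predicted class matches the label and the predicted box overlaps the ground-truth box with IoU of at least 0.5. The page does not restate the paper's evaluation protocol, so the threshold, the box format, and the dictionary keys are assumptions, and the sample values are toy data.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def top1_loc(samples, iou_threshold=0.5):
    """Fraction of samples with the right class AND a sufficiently overlapping box."""
    hits = sum(
        1
        for s in samples
        if s["pred_class"] == s["true_class"]
        and box_iou(s["pred_box"], s["true_box"]) >= iou_threshold
    )
    return hits / len(samples)

# Toy samples (hypothetical values, not results from the paper).
samples = [
    {"pred_class": 3, "true_class": 3, "pred_box": (0, 0, 10, 10), "true_box": (0, 0, 10, 10)},
    {"pred_class": 3, "true_class": 3, "pred_box": (0, 0, 10, 10), "true_box": (5, 5, 15, 15)},
    {"pred_class": 1, "true_class": 3, "pred_box": (0, 0, 10, 10), "true_box": (0, 0, 10, 10)},
    {"pred_class": 3, "true_class": 3, "pred_box": (0, 0, 10, 10), "true_box": (0, 0, 10, 8)},
]
accuracy = top1_loc(samples)
```

Because the metric requires both conditions at once, a gain in Top-1 Loc can come from better boxes, better classification, or both, which is one reason the referee asks for ablations.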
Where Pith is reading between the lines
- The same alignment step could be tested on other vision-language models to check whether the localization benefit generalizes beyond CLIP.
- QR orthogonalization may prove useful in any multi-class setting where category embeddings are close in embedding space.
- Lower inference cost could enable real-time localization on edge devices that cannot run denoising-based alternatives.
- The foreground aggregation idea might combine with other weak signals such as scribbles or points for hybrid supervision.
Load-bearing premise
Contrastive alignment between global text embeddings and local patch embeddings will produce reliable foreground/background scores that can be used directly for both localization and classification without further mechanisms.
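To make the premise concrete, here is a minimal sketch of how localization scores could be reused for classification: the patch embeddings are averaged with the foreground scores as weights, and the pooled embedding is matched against class text embeddings. The weighted-mean pooling rule, the cosine-similarity classifier, and all names are assumptions; the source only says that localization scores are used to aggregate foreground patch embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def aggregate_foreground(patches, fg_scores):
    """Score-weighted mean of patch embeddings (hypothetical aggregation rule)."""
    total = sum(fg_scores)
    if total == 0:
        # No patch is marked as foreground: fall back to a plain average.
        fg_scores, total = [1.0] * len(patches), float(len(patches))
    dim = len(patches[0])
    return [sum(w * p[d] for w, p in zip(fg_scores, patches)) / total for d in range(dim)]

def classify(embedding, class_text_embeddings):
    """Index of the class text embedding most similar to the pooled embedding."""
    sims = [cosine(embedding, t) for t in class_text_embeddings]
    return max(range(len(sims)), key=lambda k: sims[k])

# Toy example: two object-like patches and one background-like patch.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
fg_scores = [0.9, 0.8, 0.1]
class_texts = [[0.0, 1.0], [1.0, 0.0]]  # hypothetical class text embeddings

pooled = aggregate_foreground(patches, fg_scores)
predicted = classify(pooled, class_texts)
```

The sketch also shows why the premise is load-bearing: if the scores are unreliable, the pooled embedding mixes in background patches and the classification degrades along with the localization.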
What would settle it
Run the contrastive alignment on a held-out set with pixel-level masks and check whether the resulting patch scores separate object from background better than a standard class activation mapping baseline. If they do not, the premise fails.
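One way to run such a comparison is to rank pixels by score and compute an average precision against the mask, in the spirit of PxAP. The sketch below uses a standard ranking-based average precision as a stand-in; the function name and the toy score values are assumptions, and the paper's exact PxAP protocol is not given on this page.

```python
def average_precision(scores, labels):
    """Ranking-based average precision for binary labels (1 = object, 0 = background).

    Pixels are sorted by descending score; precision is averaged over the
    ranks at which an object pixel appears. Ties keep their input order.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one object pixel")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos

# Toy flattened mask and two hypothetical score maps (not real results).
mask = [1, 1, 1, 0, 0, 0]
aligned_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]     # covers the whole object
baseline_scores = [0.9, 0.2, 0.1, 0.8, 0.3, 0.05]   # fires on one part only

ap_aligned = average_precision(aligned_scores, mask)
ap_baseline = average_precision(baseline_scores, mask)
premise_supported = ap_aligned > ap_baseline
```

The baseline values mimic the discriminative-part failure mode: one object pixel scores high while the rest rank below background, which lowers the average precision.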
Original abstract
Weakly supervised object localization (WSOL) models are trained using only image-level class labels. They can predict both the object class and spatial regions corresponding to the object, without requiring explicit bounding box annotations. Given their reliance on classification objectives, traditional WSOL methods, like class activation mapping, tend to focus on the most discriminative object regions, often missing the full spatial extent. Although vision-language models such as CLIP encode rich semantic priors, they are not directly suited for WSOL because global text and class-token embeddings are not explicitly aligned with local patch embeddings, making patch-level localization difficult without additional mechanisms. Recent methods such as GenPrompt address this limitation, but at the cost of increased complexity, as they rely on conditional denoising and elaborate prompt-learning strategies. We propose Text Distillation for Localization (TeD-Loc), which transfers knowledge from CLIP text embeddings to patch embeddings through contrastive alignment, thereby enabling patch-level foreground/background localization. A localization-guided classification module is also introduced that uses localization scores to aggregate foreground patch embeddings for joint classification and localization in a single model. In addition, a QR-based orthogonalization of class text embeddings is applied before distillation to improve discrimination for semantically similar classes. Extensive experiments show that TeD-Loc improves Top-1 Loc by ~5% on CUB and ILSVRC, and PxAP by ~31% on histopathology benchmarks, while achieving more efficient inference than GenPrompt.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TeD-Loc for weakly supervised object localization (WSOL). It transfers CLIP text embeddings to image patch embeddings via contrastive alignment to produce foreground/background localization scores, introduces a localization-guided classification module that aggregates foreground patches for joint classification and localization, and applies QR-based orthogonalization to class text embeddings to reduce inter-class confusion. The method is positioned as simpler and more efficient than GenPrompt. Experiments claim ~5% gains in Top-1 Loc on CUB and ILSVRC and ~31% in PxAP on histopathology benchmarks.
Significance. If the contrastive alignment reliably transfers global semantic discrimination to local patches without inheriting the most-discriminative-part bias of standard WSOL, TeD-Loc would provide a lightweight alternative to prompt-engineering and denoising approaches for leveraging pretrained vision-language models in localization tasks. The QR orthogonalization is a clean addition for handling semantically similar classes. Efficiency advantages over GenPrompt are practically relevant. No machine-checked proofs or open code are mentioned.
major comments (3)
- [Abstract, §4] Abstract and §4 (Experiments): The reported percentage improvements (~5% Top-1 Loc, ~31% PxAP) are presented without experimental protocol details, error bars, ablation tables, or statistical tests, preventing verification that gains are robust rather than sensitive to post-hoc choices or dataset splits.
- [§3.1–3.2] §3.1–3.2 (Method, contrastive alignment): The core transfer of global CLIP text embeddings to local patch embeddings via contrastive alignment is described at a high level but supplies no explicit loss formulation, positive/negative patch sampling strategy, or locality-preserving regularizer; without these, the claim that the resulting scores reliably capture full object extent (rather than reverting to the discriminative-part failure mode noted in the introduction) remains unanchored and load-bearing for both localization and the subsequent aggregation module.
- [§3.3] §3.3 (QR orthogonalization): While the orthogonalization is introduced to improve discrimination, no ablation quantifies its isolated contribution to localization scores versus classification accuracy, leaving unclear whether it mitigates the locality gap or merely addresses a secondary inter-class issue.
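To ground the discussion of the QR step, the following sketch orthogonalizes a set of class text embeddings with modified Gram-Schmidt, which yields the columns of Q in a thin QR decomposition of the matrix whose columns are the embeddings (up to sign). How TeD-Loc applies the factorization, including the class ordering and whether Q is used directly, is not stated on this page, so this is an assumption-laden illustration. In practice a library routine such as `numpy.linalg.qr` would replace the hand-written loop.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonalize(embeddings):
    """Modified Gram-Schmidt over class embeddings.

    Returns orthonormal vectors spanning the same space as the inputs.
    Assumes the inputs are linearly independent.
    """
    basis = []
    for e in embeddings:
        v = list(e)
        for q in basis:
            proj = dot(v, q)  # component along an earlier direction
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(dot(v, v))
        if norm < 1e-12:
            raise ValueError("embeddings are linearly dependent")
        basis.append([a / norm for a in v])
    return basis

# Two hypothetical, nearly parallel class embeddings (semantically similar classes).
class_embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
similarity_before = dot(class_embeddings[0], class_embeddings[1]) / (
    math.sqrt(dot(class_embeddings[0], class_embeddings[0]))
    * math.sqrt(dot(class_embeddings[1], class_embeddings[1]))
)
ortho = orthogonalize(class_embeddings)
similarity_after = dot(ortho[0], ortho[1])
```

Note that the result depends on the order of the classes: the first embedding is kept as is (up to scale), while later ones lose their overlap with earlier ones. An ablation of the kind the referee requests would show whether this affects localization or only classification.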
minor comments (2)
- [§3] Notation for patch embeddings and localization scores is introduced without a consolidated table or consistent symbols across equations, complicating traceability from text embeddings to final maps.
- [Abstract, §4] The abstract states efficiency gains over GenPrompt but provides no inference-time or parameter-count comparison table in the main text or supplement.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight areas where additional clarity and evidence can strengthen the manuscript. We address each major comment below and will revise the paper accordingly.
- Referee: [Abstract, §4] Abstract and §4 (Experiments): The reported percentage improvements (~5% Top-1 Loc, ~31% PxAP) are presented without experimental protocol details, error bars, ablation tables, or statistical tests, preventing verification that gains are robust rather than sensitive to post-hoc choices or dataset splits.
  Authors: We agree that the current presentation of results limits independent verification of robustness. In the revised manuscript we will expand §4 with the complete experimental protocol (including training details, hyperparameters, and dataset splits), report error bars from multiple random seeds, add further ablation tables, and include statistical significance tests for the reported gains on CUB, ILSVRC, and the histopathology benchmarks.
  Revision: yes
- Referee: [§3.1–3.2] §3.1–3.2 (Method, contrastive alignment): The core transfer of global CLIP text embeddings to local patch embeddings via contrastive alignment is described at a high level but supplies no explicit loss formulation, positive/negative patch sampling strategy, or locality-preserving regularizer; without these, the claim that the resulting scores reliably capture full object extent (rather than reverting to the discriminative-part failure mode noted in the introduction) remains unanchored and load-bearing for both localization and the subsequent aggregation module.
  Authors: We will revise §3.1–3.2 to supply the explicit contrastive loss equation, the precise positive/negative patch sampling procedure, and any locality-preserving regularizer. These additions will directly support the claim that the distilled scores capture full object extent rather than only the most discriminative parts.
  Revision: yes
- Referee: [§3.3] §3.3 (QR orthogonalization): While the orthogonalization is introduced to improve discrimination, no ablation quantifies its isolated contribution to localization scores versus classification accuracy, leaving unclear whether it mitigates the locality gap or merely addresses a secondary inter-class issue.
  Authors: We acknowledge the absence of an isolated ablation for the QR orthogonalization step. In the revision we will add an ablation that isolates its effect on localization metrics (Top-1 Loc, PxAP) versus classification accuracy, clarifying whether its primary benefit is reduced inter-class confusion or improved localization.
  Revision: yes
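The responses above commit to an explicit loss and a positive/negative patch sampling procedure. As a generic illustration of what such a loss could look like, the sketch below applies binary cross-entropy to per-patch foreground scores, using pseudo-labels only for the patches that were sampled. This is not the paper's formulation: the pseudo-label convention (1 for foreground, 0 for background, `None` for unsampled patches), the function name, and the clamping constant are all assumptions.

```python
import math

def patch_bce_loss(fg_scores, pseudo_labels, eps=1e-7):
    """Mean binary cross-entropy over sampled patches.

    fg_scores:     foreground probability per patch, in [0, 1]
    pseudo_labels: 1 (foreground), 0 (background), or None (patch not sampled)
    """
    total, count = 0.0, 0
    for s, y in zip(fg_scores, pseudo_labels):
        if y is None:
            continue  # unsampled patches do not contribute to the loss
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(s) + (1 - y) * math.log(1.0 - s))
        count += 1
    if count == 0:
        raise ValueError("no patches were sampled")
    return total / count

# Hypothetical scores for three patches; the third one is not sampled.
labels = [1, 0, None]
loss_good = patch_bce_loss([0.9, 0.2, 0.5], labels)  # scores agree with labels
loss_bad = patch_bce_loss([0.2, 0.9, 0.5], labels)   # scores contradict labels
```

Which patches receive pseudo-labels determines what the model is pushed toward, so the sampling rule matters as much as the loss itself for avoiding the discriminative-part failure mode.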
Circularity Check
No circularity: method builds on external CLIP pretraining with independent contrastive alignment and localization-guided classification; no derivations reduce to self-fitted quantities or self-citations.
The paper presents TeD-Loc as a new architecture that transfers knowledge from a pretrained external CLIP model via contrastive alignment between global text embeddings and local patch embeddings, followed by a localization-guided classification module and QR orthogonalization. No equations or derivations are provided that define a quantity in terms of itself or rename a fitted parameter as a prediction. The central performance claims rest on experimental results on CUB, ILSVRC, and histopathology benchmarks rather than any internal reduction. Reliance on CLIP is external and independent of the present paper's training, satisfying the criteria for a self-contained derivation with no load-bearing self-citation or self-definitional steps.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: transfers knowledge from CLIP text embeddings to patch embeddings through contrastive alignment... $L_{KD} = \sum \mathrm{CE}(y, f(z_p, t_y))$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: QR-based orthogonalization of class text embeddings... to improve discrimination
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.