On the Provable Importance of Gradients for Language-Assisted Image Clustering

Bo Peng; Guangquan Zhang; Jie Lu; Zhen Fang

arxiv: 2510.16335 · v4 · pith:QUIQHC3Cnew · submitted 2025-10-18 · 💻 cs.CV

Pith reviewed 2026-05-25 07:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords language-assisted image clustering · gradient-based filtering · positive noun selection · error bound · GradNorm · CLIP features · image clustering · theoretical analysis

The pith

The magnitude of gradients from a cross-entropy loss provides a theoretically grounded measure to filter positive nouns in language-assisted image clustering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes GradNorm to filter positive nouns in language-assisted image clustering by using the magnitude of back-propagated gradients from a cross-entropy loss. This addresses the lack of theoretical foundation in existing CLIP feature-based methods. A rigorous error bound shows how well GradNorm separates positive nouns, and it proves that prior strategies are special cases of this approach. If correct, this leads to improved image clustering performance by better leveraging textual semantics.

Core claim

GradNorm measures the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. It provides a rigorous error bound to quantify the separability of positive nouns and proves that it subsumes existing filtering strategies as special cases, achieving state-of-the-art clustering performance.

What carries the argument

GradNorm, which measures noun positiveness via the magnitude of gradients back-propagated from the cross-entropy between a predicted target distribution and the softmax output.
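
A minimal sketch of how such a score could be computed, assuming the gradient is taken with respect to the logits (the abstract does not say which parameters the gradient is taken over). Under that assumption the gradient of the cross-entropy has the closed form softmax(z) - q for a target q that sums to 1, so no automatic differentiation is needed. The function names, the logit vector, the target distribution, and the choice of the L1 norm are all hypothetical and are not taken from the paper or its released code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_grad(target, logits):
    """Gradient of -sum_k target[k] * log(softmax(logits)[k]) w.r.t. the logits.

    For a target that sums to 1 this equals softmax(logits) - target.
    """
    probs = softmax(logits)
    return [p - q for p, q in zip(probs, target)]


def grad_norm_score(target, logits):
    """L1 norm of the logit gradient (assumed norm; the paper may use another)."""
    return sum(abs(g) for g in cross_entropy_grad(target, logits))


# Hypothetical noun: logits over 3 clusters and a hypothetical predicted target.
noun_logits = [2.0, 0.5, -1.0]
predicted_target = [0.7, 0.2, 0.1]
score = grad_norm_score(predicted_target, noun_logits)
print(f"gradient-magnitude score: {score:.4f}")
```

In this sketch the score is zero exactly when the softmax output equals the target and grows as the two diverge. Whether a large or a small value marks a positive noun depends on the paper's definition, which the abstract does not state.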

If this is right

Provides a rigorous error bound quantifying the separability of positive nouns.
Subsumes existing filtering strategies as special cases.
Achieves state-of-the-art clustering performance on various benchmarks.
Enhances discriminability of visual representations through better noun selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The gradient-based filtering could extend to other unsupervised tasks involving text selection.
Different loss functions might yield similar or improved separation bounds.
The subsumption result suggests a unified view of noun filtering methods.

Load-bearing premise

The magnitude of gradients back-propagated from the cross-entropy loss reliably indicates the semantic positiveness of nouns without true class labels.

What would settle it

A dataset where known positive nouns show low gradient magnitudes or the error bound is violated in empirical tests.
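
One way to run that check, sketched under the assumption that a benchmark with known class names supplies ground-truth positive and negative nouns for evaluation only. The AUROC below is the probability that a randomly chosen positive noun outscores a randomly chosen negative one; a value near 0.5 would mean the scores carry no separating signal. The function name and the toy scores are illustrative and are not results from the paper.

```python
def auroc(positive_scores, negative_scores):
    """Probability that a positive outscores a negative; ties count as half."""
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up scores for illustration only.
toy_positive = [0.9, 0.7, 0.4]
toy_negative = [0.6, 0.3, 0.2]
separation = auroc(toy_positive, toy_negative)
print(f"AUROC: {separation:.3f}")
```

If the scoring convention assigns lower values to positive nouns, negate the scores before calling the function, or read the result as 1 - AUROC.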

Original abstract

This paper investigates the recently emerged problem of Language-assisted Image Clustering (LaIC), where textual semantics are leveraged to improve the discriminability of visual representations to facilitate image clustering. Due to the unavailability of true class names, one of the core challenges of LaIC lies in how to filter positive nouns, i.e., those semantically close to the images of interest, from unlabeled wild corpus data. Existing filtering strategies are predominantly based on the off-the-shelf feature space learned by CLIP; however, despite being intuitive, these strategies lack a rigorous theoretical foundation. To fill this gap, we propose a novel gradient-based framework, termed GradNorm, which is theoretically guaranteed and shows strong empirical performance. In particular, we measure the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. Theoretically, we provide a rigorous error bound to quantify the separability of positive nouns by GradNorm and prove that GradNorm naturally subsumes existing filtering strategies as extremely special cases of itself. Empirically, extensive experiments show that GradNorm achieves state-of-the-art clustering performance on various benchmarks. Code is publicly available at https://github.com/60pen9/On-the-Provable-Importance-of-Gradients-for-Language-Assisted-Image-Clustering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GradNorm frames noun filtering in LaIC via gradient magnitudes with a claimed error bound and subsumption of prior CLIP methods, but the target distribution construction likely introduces dependence that weakens the separability guarantee.

The main takeaway is that this paper shifts noun selection in language-assisted image clustering from heuristic CLIP feature similarity to a gradient-magnitude signal derived from cross-entropy against a predicted target distribution. It supplies an error bound on positive-noun separability and shows that earlier filtering rules emerge as special cases under particular target choices. That formal step is the clearest addition over the existing literature, which treated filtering as an engineering choice without much justification. The empirical side reports state-of-the-art clustering numbers on standard benchmarks, and the code release is a practical plus for anyone wanting to test the method directly. The derivation appears independent rather than circular on its face, which is worth crediting.

The soft spot is exactly the one flagged in the stress test. The target distribution has to be produced without ground-truth labels, so it is almost certainly built from the same image-noun pairs or from an initial clustering step. Once that happens, the gradient magnitude used to score a noun already carries information about that noun, which undercuts the claim that the bound cleanly quantifies separability. The abstract does not spell out the target-construction procedure in enough detail to see whether the proof survives this dependence or only holds for an oracle target. Minor additional concerns are the usual ones for an abstract-only view: we do not yet know the precise experimental controls or whether the reported gains survive different random seeds and hyper-parameter choices.

This work is aimed at the small but active LaIC sub-community inside multimodal vision. A reader already thinking about theoretical grounding for unsupervised noun selection will find the gradient framing and the subsumption result useful to engage with.

The paper is coherent enough on its own terms and makes a concrete, falsifiable claim, so it deserves a serious referee rather than a desk reject. I would send it out for review with the expectation that the authors will need to clarify or strengthen the target-distribution step in the bound.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes GradNorm, a gradient-based framework for language-assisted image clustering (LaIC) that filters positive nouns by the magnitude of gradients back-propagated from the cross-entropy between a predicted target distribution and the softmax output. It claims to provide a rigorous error bound quantifying the separability of positive nouns by this measure and to prove that GradNorm subsumes existing CLIP feature-based filtering strategies as special cases. Extensive experiments are reported to achieve state-of-the-art clustering performance on various benchmarks, with code released publicly.

Significance. If the error bound and subsumption hold without unstated dependencies, the work supplies the first theoretical foundation for noun filtering in LaIC and could replace heuristic approaches. The public code release is a clear strength that enables direct verification of the empirical claims.

major comments (2)

[Theoretical analysis / error bound] The error bound (described in the theoretical analysis) assumes the predicted target distribution is formed independently of the noun features whose positiveness is being measured. Because the target is constructed from model predictions or clustering on the same unlabeled image-noun pairs, the gradient signal may be circularly dependent on the nouns under evaluation; this dependence is not addressed and directly affects whether the bound guarantees separability in the actual algorithm.
[Proof of subsumption] The subsumption proof states that existing strategies are 'extremely special cases' of GradNorm. The proof must specify the exact target-distribution choices under which this reduction occurs; without those conditions, it is unclear whether the subsumption holds for the general target used in the proposed method or only for specially chosen targets that are not justified as canonical.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and insightful comments. The points raised highlight important aspects of the theoretical analysis that require clarification. We address each major comment below and will incorporate revisions to strengthen the presentation of the error bound and subsumption result.

Referee: [Theoretical analysis / error bound] The error bound (described in the theoretical analysis) assumes the predicted target distribution is formed independently of the noun features whose positiveness is being measured. Because the target is constructed from model predictions or clustering on the same unlabeled image-noun pairs, the gradient signal may be circularly dependent on the nouns under evaluation; this dependence is not addressed and directly affects whether the bound guarantees separability in the actual algorithm.

Authors: We acknowledge the concern about potential dependence. The target distribution in GradNorm is obtained from an initial clustering step performed on image features alone (prior to noun evaluation), which is designed to be independent of the specific noun embeddings under consideration. Nevertheless, to fully address the referee's point, we will revise the theoretical analysis section to explicitly state the independence assumption, derive the bound under the precise construction used in the algorithm, and add a paragraph discussing why the gradient magnitudes remain a valid separability measure even when the target is estimated from the same unlabeled pairs. This revision will make the applicability of the bound transparent. revision: yes
Referee: [Proof of subsumption] The subsumption proof states that existing strategies are 'extremely special cases' of GradNorm. The proof must specify the exact target-distribution choices under which this reduction occurs; without those conditions, it is unclear whether the subsumption holds for the general target used in the proposed method or only for specially chosen targets that are not justified as canonical.

Authors: We agree that the subsumption claim benefits from explicit conditions. The reduction to prior CLIP feature-based methods occurs precisely when the target distribution is set to the softmax of the CLIP similarity logits (or a uniform distribution in the degenerate case). Under these choices the gradient magnitude simplifies exactly to the feature similarity used by existing heuristics. We will revise the proof to state these target-distribution choices explicitly, include the corresponding derivations, and note that they correspond to the canonical instantiations of prior methods, thereby showing GradNorm as their generalization. revision: yes

Circularity Check

0 steps flagged

GradNorm framework derives error bound and subsumption independently without reduction to inputs by construction

The abstract presents a gradient magnitude measurement from CE(predicted target distribution, softmax) as the core positiveness indicator, followed by a derived error bound for separability and a proof that existing filters are special cases. These steps establish an independent theoretical structure rather than redefining the output in terms of the input or fitting parameters that are then relabeled as predictions. No self-citation chains, ansatzes smuggled via citation, or self-definitional loops are evident in the provided text; the subsumption claim further indicates the derivation extends beyond prior methods without circular dependence on the target construction itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on the abstract; no explicit free parameters, ad-hoc axioms, or invented entities beyond the method name itself are detailed.

axioms (1)

standard math Softmax function and cross-entropy loss are standard mathematical objects.
Used to define the gradient computation for noun positiveness.

invented entities (1)

GradNorm no independent evidence
purpose: Gradient-magnitude measure for noun positiveness in LaIC
Newly introduced framework

pith-pipeline@v0.9.0 · 5775 in / 1175 out tokens · 35316 ms · 2026-05-25T07:28:52.095141+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "we measure the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Theorem 1 … ERR_pos(k) … min_W Q(W) + O(1/√N) + O(1/√B_y)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Debiased Negative Mining Improves Out-of-distribution Detection with Pre-trained Vision-Language Models
cs.LG 2026-05 unverdicted novelty 6.0

Debiased negative mining via Monte-Carlo sampling from ID labels and unlabeled wild data improves OOD detection with VLMs and achieves new state-of-the-art results.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 1 Pith paper · 5 internal anchors

