pith. machine review for the scientific record.

arxiv: 2604.14762 · v1 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords generalized category discovery · modality-agnostic · zero-shot · synthetic data · transformer · clustering · multi-modal · category discovery

The pith

A model trained only on synthetic data performs zero-shot category discovery across vision, text, audio, and remote sensing without fine-tuning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that generalized category discovery can be made modality-agnostic by reducing inputs to a shared latent space and then using a transformer trained solely on synthetic data to reshape that space for clustering. This matters to a sympathetic reader because prior approaches stay locked to one data type and demand retraining for each new dataset, whereas this setup lets one system handle images, words, sound, and satellite measurements immediately after a single training run. If the claim holds, representation learning through encoders becomes separable from the discovery process itself, so advances in one need not wait on the other. The work also introduces a zero-shot evaluation setting that tests generality directly on sixteen real datasets spanning four modalities.

Core claim

OmniGCD routes inputs through modality-specific encoders, reduces their dimension to form a GCD latent space, and applies a transformer trained exclusively on synthetic data to convert that space into a form better suited for clustering at test time. Trained once on synthetic data, the system then carries out zero-shot generalized category discovery on sixteen datasets across four modalities, yielding higher accuracy on both known and novel classes than prior baselines.
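The pipeline above can be sketched end to end. This is a minimal, illustrative stand-in, not the authors' implementation: the names `reduce_dim` and `kmeans` are hypothetical, the dimension reduction is a plain PCA-style projection, and the synthetically trained Transformer stage is marked by a comment where it would slot in.

```python
# Hypothetical sketch of an OmniGCD-style test-time pipeline (assumed names,
# not the paper's API): encoder embeddings -> dimension-reduced GCD latent
# space -> (Transformer transformation would go here) -> k-means clustering.
import numpy as np

def reduce_dim(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Project encoder embeddings onto their top-k principal directions."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD yields the principal axes; keep the first k as the GCD latent space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def kmeans(latents: np.ndarray, n_clusters: int, iters: int = 50, seed: int = 0):
    """Plain Lloyd's k-means over the (transformed) latent space."""
    rng = np.random.default_rng(seed)
    centers = latents[rng.choice(len(latents), n_clusters, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        assign = np.argmin(((latents[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            if (assign == c).any():
                centers[c] = latents[assign == c].mean(axis=0)
    return assign

embeddings = np.random.default_rng(1).normal(size=(200, 64))  # stand-in encoder output
latents = reduce_dim(embeddings, k=16)
# ... a synthetically trained Transformer would reshape `latents` here ...
labels = kmeans(latents, n_clusters=5)
print(latents.shape, labels.shape)  # (200, 16) (200,)
```

The point of the sketch is the decoupling the paper claims: nothing in the clustering step depends on which modality produced `embeddings`, only on the shape of the reduced latent space.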

What carries the argument

The synthetically trained Transformer-based model that transforms the dimension-reduced GCD latent space at test time into a representation suitable for clustering

Load-bearing premise

A transformer trained exclusively on synthetic data can turn reduced-dimensional representations from real inputs of any of the four modalities into forms suitable for clustering with no further adaptation or fine-tuning
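Since the premise rests entirely on synthetic training data, it helps to see what such data could look like. The sketch below is an assumption, not the paper's recipe: labeled Gaussian-mixture latent spaces with randomized, anisotropic cluster shapes, from which a transformation model could learn a clustering-friendly mapping.

```python
# Illustrative generator of synthetic GCD latent spaces (assumed construction,
# not the authors' procedure): random Gaussian mixtures with per-dimension
# noise scales so clusters overlap, loosely mimicking real encoder output.
import numpy as np

def synth_latent_space(n_clusters: int, dim: int, pts_per_cluster: int, seed: int = 0):
    rng = np.random.default_rng(seed)
    means = rng.normal(scale=5.0, size=(n_clusters, dim))  # cluster centers
    X, y = [], []
    for c in range(n_clusters):
        # Anisotropic noise: each dimension gets its own spread.
        scale = rng.uniform(0.5, 2.0, size=dim)
        X.append(means[c] + rng.normal(size=(pts_per_cluster, dim)) * scale)
        y.append(np.full(pts_per_cluster, c))
    return np.concatenate(X), np.concatenate(y)

X, y = synth_latent_space(n_clusters=4, dim=16, pts_per_cluster=50)
print(X.shape, y.shape)  # (200, 16) (200,)
```

If the premise holds, a model trained on endless variations of `(X, y)` pairs like these transfers to real encoder embeddings with no further adaptation; if the synthetic distribution is too narrow, the mapping stays distribution-specific, which is exactly the referee's concern below.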

What would settle it

Evaluating the trained model on additional real datasets from one of the four modalities and checking whether known-class and novel-class accuracies exceed those of standard baselines in the zero-shot setting
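The check above reduces to the standard GCD metric: clustering accuracy under an optimal one-to-one matching between predicted cluster ids and ground-truth class ids. A small self-contained version, using brute-force matching (fine for a handful of classes; production implementations typically use the Hungarian algorithm):

```python
# Clustering accuracy with optimal label matching, as used to score known-
# and novel-class predictions in GCD evaluations. Brute-force over label
# permutations for clarity; this is a sketch, not the paper's code.
from itertools import permutations
import numpy as np

def cluster_accuracy(y_true, y_pred, n_classes: int) -> float:
    """Best accuracy over all one-to-one assignments of cluster ids to classes."""
    conf = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        conf[t, p] += 1
    best = max(sum(conf[c, perm[c]] for c in range(n_classes))
               for perm in permutations(range(n_classes)))
    return best / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])  # same partition, permuted labels
print(cluster_accuracy(y_true, y_pred, 3))  # 1.0
```

Because the matching absorbs label permutations, the metric scores the discovered partition itself, which is what a zero-shot comparison against baselines needs.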

Figures

Figures reproduced from arXiv: 2604.14762 by Arnold Wiliem, Clinton Fookes, Jordan Shipard, Kien Nguyen Thanh, Wei Xiang.

Figure 1: Summary of OmniGCD for modality-agnostic zero… (full image: figures/full_fig_p001_1.png)
Figure 2: Overview diagram of OmniGCD. Data tokens are pro… (full image: figures/full_fig_p003_2.png)
Figure 3: Qualitative examples of GCDformer's ability to optimize the GCD latent spaces for k-means clustering. We show 10 examples… (full image: figures/full_fig_p013_3.png)
read the original abstract

Generalized Category Discovery (GCD) challenges methods to identify known and novel classes using partially labeled data, mirroring human category learning. Unlike prior GCD methods, which operate within a single modality and require dataset-specific fine-tuning, we propose a modality-agnostic GCD approach inspired by the human brain's abstract category formation. Our $\textbf{OmniGCD}$ leverages modality-specific encoders (e.g., vision, audio, text, remote sensing) to process inputs, followed by dimension reduction to construct a $\textbf{GCD latent space}$, which is transformed at test-time into a representation better suited for clustering using a novel synthetically trained Transformer-based model. To evaluate OmniGCD, we introduce a $\textbf{zero-shot GCD setting}$ where no dataset-specific fine-tuning is allowed, enabling modality-agnostic category discovery. $\textbf{Trained once on synthetic data}$, OmniGCD performs zero-shot GCD across 16 datasets spanning four modalities, improving classification accuracy for known and novel classes over baselines (average percentage point improvement of $\textbf{+6.2}$, $\textbf{+17.9}$, $\textbf{+1.5}$ and $\textbf{+12.7}$ for vision, text, audio and remote sensing). This highlights the importance of strong encoders while decoupling representation learning from category discovery. Improving modality-agnostic methods will propagate across modalities, enabling encoder development independent of GCD. Our work serves as a benchmark for future modality-agnostic GCD works, paving the way for scalable, human-inspired category discovery. All code is available $\href{https://github.com/Jordan-HS/OmniGCD}{here}$

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces OmniGCD, a modality-agnostic approach to Generalized Category Discovery (GCD). Modality-specific encoders produce inputs that are dimension-reduced into a GCD latent space; a Transformer trained exclusively on synthetic data then transforms this space at test time into a representation better suited for clustering known and novel classes. The method is evaluated in a zero-shot GCD protocol (no dataset-specific fine-tuning) across 16 datasets spanning vision, text, audio, and remote sensing, reporting average accuracy gains of +6.2, +17.9, +1.5, and +12.7 percentage points respectively over baselines.

Significance. If the zero-shot synthetic-to-real transfer holds, the work would be significant for decoupling representation learning from category discovery and for establishing a benchmark that allows encoder development to proceed independently of GCD. The emphasis on abstract, human-inspired category formation across modalities could influence future multi-modal discovery methods.

major comments (2)
  1. [Abstract] Abstract and experimental results: the headline claim of average percentage-point gains (+6.2, +17.9, +1.5, +12.7) supplies no information on the concrete baseline methods, statistical significance, error bars, data splits, or controls for synthetic-data leakage; without these the support for the central empirical claim cannot be assessed.
  2. [Method] Method section (description of the synthetically trained Transformer): the central assumption that a single Transformer maps dimension-reduced real embeddings from four modalities into a clustering-friendly space without per-dataset adaptation lacks direct validation; experiments demonstrating that the learned mapping is not distribution-specific to the synthetic training data are required.
minor comments (1)
  1. [Abstract] The code repository link is provided, supporting reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, providing clarifications from the full manuscript and indicating where revisions will strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental results: the headline claim of average percentage-point gains (+6.2, +17.9, +1.5, +12.7) supplies no information on the concrete baseline methods, statistical significance, error bars, data splits, or controls for synthetic-data leakage; without these the support for the central empirical claim cannot be assessed.

    Authors: We agree the abstract is concise and omits these specifics. The full manuscript details the baselines in Section 4.2 (standard GCD methods such as GCD, SimGCD, and ORCA adapted to the zero-shot protocol), reports mean accuracy with standard deviation over five runs using different random seeds, follows the canonical data splits and known/novel class partitions for each of the 16 datasets, and prevents synthetic-data leakage by generating purely synthetic samples via a controlled procedure with no real-data overlap (Section 3.1). We will revise the abstract to briefly reference these elements. revision: partial

  2. Referee: [Method] Method section (description of the synthetically trained Transformer): the central assumption that a single Transformer maps dimension-reduced real embeddings from four modalities into a clustering-friendly space without per-dataset adaptation lacks direct validation; experiments demonstrating that the learned mapping is not distribution-specific to the synthetic training data are required.

    Authors: The zero-shot results across 16 real datasets from four modalities already provide strong indirect evidence that the mapping generalizes beyond the synthetic distribution, as overfitting to synthetic data would not yield consistent gains on diverse real embeddings. To supply the requested direct validation, we will add an ablation study in the revision that evaluates the Transformer on held-out synthetic distributions and on mixtures of synthetic and real embeddings, confirming the mapping is not distribution-specific. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation consists of an explicit pipeline (modality encoders to dimension-reduced GCD latent space, followed by a Transformer trained exclusively on synthetic data and applied zero-shot) whose central claim is an empirical performance gain on 16 held-out real datasets. No equations or steps reduce by construction to fitted inputs, self-definitions, or self-citation chains; the synthetic training set is distinct from evaluation data, the zero-shot protocol forbids per-dataset adaptation, and no load-bearing uniqueness theorems or ansatzes are invoked. The reported improvements are therefore falsifiable experimental outcomes rather than tautological restatements of the method's inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The method rests on the assumption that modality-specific encoders plus dimension reduction produce a usable shared latent space and that synthetic training suffices for test-time transformation; no free parameters or invented entities are quantified in the abstract.

axioms (1)
  • domain assumption Modality-specific encoders followed by dimension reduction yield a GCD latent space that is amenable to clustering after synthetic Transformer transformation.
    Invoked in the description of the pipeline and zero-shot setting.
invented entities (2)
  • GCD latent space (no independent evidence)
    purpose: Shared representation obtained by dimension reduction from modality-specific encoders
    Introduced as the intermediate space before the Transformer transformation.
  • Synthetically trained Transformer-based model (no independent evidence)
    purpose: Test-time transformation of the latent space for improved clustering
    Trained once on synthetic data and applied zero-shot.

pith-pipeline@v0.9.0 · 5619 in / 1394 out tokens · 54455 ms · 2026-05-10T11:15:02.563822+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

81 extracted references · 7 canonical work pages · 4 internal anchors

  1. [1]

    Generalized category discov- ery with large language models in the loop

    Wenbin An, Wenkai Shi, Feng Tian, Haonan Lin, QianYing Wang, Yaqiang Wu, Mingxiang Cai, Luyan Wang, Yan Chen, Haiping Zhu, and Ping Chen. Generalized category discov- ery with large language models in the loop. InFindings of ACL, 2024. 1, 2, 8

  2. [2]

    Gregory Ashby and W

    F. Gregory Ashby and W. Todd Maddox. Human category learning.Annu. Rev. Psychol., 56(1):149–178, 2005. 1, 3

  3. [3]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bo- janowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023. 2

  4. [4]

    Bellman.Dynamic Programming

    R. Bellman.Dynamic Programming. Princeton University Press, 1957. 4

  5. [5]

    Emerg- ing properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021. 2, 6, 7, 12, 14, 16, 17

  6. [6]

    Efficient intent detection with dual sentence encoders

    I ˜nigo Casanueva, Tadas Tem ˇcinas, Daniela Gerz, Matthew Henderson, and Ivan Vuli ´c. Efficient intent detection with dual sentence encoders. InProceedings of the 2nd Work- shop on Natural Language Processing for Conversational AI, 2020. 6, 7

  7. [7]

    Promptccd: Learning gaussian mixture prompt pool for con- tinual category discovery

    Fernando Julio Cendra, Bingchen Zhao, and Kai Han. Promptccd: Learning gaussian mixture prompt pool for con- tinual category discovery. InEuropean Conference on Com- puter Vision, pages 188–205, 2024. 2

  8. [8]

    Decision transformer: Reinforce- ment learning via sequence modeling.Adv

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srini- vas, and Igor Mordatch. Decision transformer: Reinforce- ment learning via sequence modeling.Adv. Neural Inf. Pro- cess. Syst., 34:15084–15097, 2021. 2

  9. [9]

    Remote sens- ing image scene classification: Benchmark and state of the art.Proc

    Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sens- ing image scene classification: Benchmark and state of the art.Proc. IEEE, 105(10):1865–1883, 2017. 6, 7

  10. [10]

    Parametric information max- imization for generalized category discovery

    Florent Chiaroni, Jose Dolz, Ziko Imtiaz Masud, Amar Mitiche, and Ismail Ben Ayed. Parametric information max- imization for generalized category discovery. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 1729–1739, 2023. 2

  11. [11]

    Contrastive mean- shift learning for generalized category discovery

    Sua Choi, Dahyun Kang, and Minsu Cho. Contrastive mean- shift learning for generalized category discovery. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23094–23104, 2024. 2

  12. [12]

    ImageNet: A large-scale hierarchical im- age database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical im- age database. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009. 6, 7

  13. [13]

    Meg evidence that modality-independent conceptual representations contain se- mantic and visual features.J

    Julien Dirani and Liina Pylkk ¨anen. Meg evidence that modality-independent conceptual representations contain se- mantic and visual features.J. Neurosci., 44(27), 2024. 1

  14. [14]

    An image is worth 16x16 words: Trans- formers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. InInternational Con- ference on Learning Representations, 2021. 2

  15. [15]

    On-the-fly cate- gory discovery

    Ruoyi Du, Dongliang Chang, Kongming Liang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. On-the-fly cate- gory discovery. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11691– 11700, 2023. 2

  16. [16]

    Freedman, Maximilian Riesenhuber, Tomaso Pog- gio, and Earl K

    David J. Freedman, Maximilian Riesenhuber, Tomaso Pog- gio, and Earl K. Miller. Categorical representation of visual stimuli in the primate prefrontal cortex.Science, 291(5502): 312–316, 2001. 1

  17. [17]

    Robust semi-supervised learning when not all classes have labels.Adv

    Lan-Zhe Guo, Yi-Ge Zhang, Zhi-Fan Wu, Jie-Jing Shao, and Yu-Feng Li. Robust semi-supervised learning when not all classes have labels.Adv. Neural Inf. Process. Syst., 35:3305– 3317, 2022. 2

  18. [18]

    Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE J

    Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 2019. 6, 7

  19. [19]

    Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V

    Andrew G. Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V . Le, and Hartwig Adam. Searching for MobileNetV3. InProceedings of the IEEE/CVF International Conference on Computer Vision,

  20. [20]

    A review of t-sne.Image Processing On Line, 14:250–270, 2024

    Sangwon Jung, Tristan Dagobert, Jean-Michel Morel, and Gabriele Facciolo. A review of t-sne.Image Processing On Line, 14:250–270, 2024. 12

  21. [21]

    Yaelan Jung, Bart Larsen, and Dirk B. Walther. Modality- independent coding of scene categories in prefrontal cortex. J. Neurosci., 38(26):5969–5981, 2018. 1

  22. [22]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InInternational Conference on Learning Representations, 2015. 6

  23. [23]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 554–561, 2013. 6, 7, 17

  24. [24]

    Learning multiple layers of features from tiny images, 2009

    Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009. Technical report, University of Toronto. 6, 7, 17

  25. [25]

    Peper, Christo- pher Clarke, Andrew Lee, Parker Hill, Jonathan K

    Stefan Larson, Anish Mahendran, Joseph J. Peper, Christo- pher Clarke, Andrew Lee, Parker Hill, Jonathan K. Kummer- feld, Kevin Leach, Michael A. Laurenzano, Lingjia Tang, and Jason Mars. An evaluation dataset for intent classifica- tion and out-of-scope prediction. InProceedings of EMNLP- IJCNLP, 2019. 6, 7

  26. [26]

    Mert: Acoustic music understanding model with large-scale self-supervised train- ing

    Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, Chenghua Lin, Anton Ragni, Emmanouil Benetos, et al. Mert: Acoustic music understanding model with large-scale self-supervised train- ing. InInternational Conference on Learning Representa- tions, 2024. 6, 7

  27. [27]

    Zhang, Deva Ramanan, and Shubham Tulsiani

    Amy Lin, Jason Y . Zhang, Deva Ramanan, and Shubham Tulsiani. Relpose++: Recovering 6d poses from sparse-view observations. In2024 Int. Conf. 3D Vis. (3DV), 2024. 3

  28. [28]

    Open-world semi-supervised novel class discovery.arXiv preprint arXiv:2305.13095,

    Jiaming Liu, Yangqiming Wang, Tongze Zhang, Yulu Fan, Qinli Yang, and Junming Shao. Open-world semi-supervised novel class discovery.arXiv preprint arXiv:2305.13095,

  29. [29]

    DebGCD: Debiased learning with distribution guidance for generalized category discov- ery

    Yuanpei Liu and Kai Han. DebGCD: Debiased learning with distribution guidance for generalized category discov- ery. InInternational Conference on Learning Representa- tions, 2025. 2

  30. [30]

    Hyperbolic cate- gory discovery

    Yuanpei Liu, Zhenqi He, and Kai Han. Hyperbolic cate- gory discovery. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9891– 9900, 2025. 8

  31. [31]

    S. Lloyd. Least squares quantization in PCM.IEEE Trans. Inf. Theory, 1982. 7

  32. [32]

    Happy: A debiased learn- ing framework for continual generalized category discovery

    Shijie Ma, Fei Zhu, Zhun Zhong, Wenzhuo Liu, Xu-Yao Zhang, and Cheng-Lin Liu. Happy: A debiased learn- ing framework for continual generalized category discovery. Adv. Neural Inf. Process. Syst., 37:50850–50875, 2024. 2

  33. [33]

    Active generalized category discovery

    Shijie Ma, Fei Zhu, Zhun Zhong, Xu-Yao Zhang, and Cheng- Lin Liu. Active generalized category discovery. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16890–16900, 2024. 2

  34. [34]

    S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft, 2013. 6, 7, 14, 17

  35. [35]

    Umap: Uniform manifold approximation and projection.J

    Leland McInnes, John Healy, Nathaniel Saul, and Lukas Grossberger. Umap: Uniform manifold approximation and projection.J. Open Source Softw., 3(29):861, 2018. 5, 8, 16, 17

  36. [36]

    Miller and Jonathan D

    Earl K. Miller and Jonathan D. Cohen. An integrative theory of prefrontal cortex function.Annu. Rev. Neurosci., 24(1): 167–202, 2001. 1

  37. [37]

    Miller, David J

    Earl K. Miller, David J. Freedman, and Jonathan D. Wallis. The prefrontal cortex: Categories, concepts and cognition. Philos. Trans. R. Soc. Lond. B Biol. Sci., 357(1424):1123– 1136, 2002. 1

  38. [38]

    Maxime Oquab, Timoth ´ee Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Rus- sell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang- Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nico- las Ballas, Gabriel Synnaeve, Ishan Misra, Herv ´e J ´egou, Julien Mairal, P...

  39. [39]

    Guided cluster aggregation: A hierarchical approach to generalized category discovery

    Jona Otholt, Christoph Meinel, and Haojin Yang. Guided cluster aggregation: A hierarchical approach to generalized category discovery. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision, pages 2618–2627, 2024. 2

  40. [40]

    Karl Pearson. Liii. on lines and planes of closest fit to sys- tems of points in space.Philosophical Magazine, 2(11):559– 572, 1901. 5, 8, 16, 17

  41. [41]

    Language models are unsuper- vised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsuper- vised multitask learners. OpenAI technical report. 3

  42. [42]

    Asano, Hazel Doughty, and Cees G

    Sarah Rastegar, Mohammadreza Salehi, Yuki M. Asano, Hazel Doughty, and Cees G. M. Snoek. Selex: Self-expertise in fine-grained generalized category discovery. InEuropean Conference on Computer Vision, pages 440–458, 2024. 8

  43. [43]

    A dataset and taxonomy for urban sound research

    Justin Salamon, Christopher Jacoby, and Juan Pablo Bello. A dataset and taxonomy for urban sound research. InACMMM, pages 1041–1044, 2014. 6, 7

  44. [44]

    Hunter, Costas Bekas, and Al- pha A

    Philippe Schwaller, Teodoro Laino, Th ´eophile Gaudin, Pe- ter Bolgar, Christopher A. Hunter, Costas Bekas, and Al- pha A. Lee. Molecular transformer: A model for uncertainty- calibrated chemical reaction prediction.ACS Central Sci- ence, 5(9):1572–1583, 2019. 2

  45. [45]

    DINOv3

    Oriane Sim ´eoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 7, 8, 12, 14, 15

  46. [46]

    Zhang, Andrea Tagliasacchi, Igor Gilitschenski, and David B

    Samarth Sinha, Jason Y . Zhang, Andrea Tagliasacchi, Igor Gilitschenski, and David B. Lindell. Sparsepose: Sparse- view camera pose regression and refinement. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 21349–21359, 2023. 3

  47. [47]

    The herbarium challenge 2019 dataset

    Kiat Chuan Tan, Yulong Liu, Barbara Ambrose, Melissa Tulig, and Serge Belongie. The herbarium challenge 2019 dataset. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019. 6, 7, 17

  48. [48]

    Aden: Adaptive density representations for sparse-view camera pose estimation

    Hao Tang, Weiyao Wang, Pierre Gleize, and Matt Feis- zli. Aden: Adaptive density representations for sparse-view camera pose estimation. InEuropean Conference on Com- puter Vision, 2024. 3

  49. [49]

    Dissecting generalized category discovery: Multiplex con- sensus under self-deconstruction

    Luyao Tang, Kunze Huang, Chaoqi Chen, Yuxuan Yuan, Chenxin Li, Xiaotong Tu, Xinghao Ding, and Yue Huang. Dissecting generalized category discovery: Multiplex con- sensus under self-deconstruction. InProceedings of the IEEE/CVF International Conference on Computer Vision,

  50. [50]

    Visualizing data using t-sne.J

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.J. Mach. Learn. Res., 9:2579–2605, 2008. 5, 8, 16, 17

  51. [51]

    Gomez, Łukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Adv. Neural Inf. Pro- cess. Syst., 30, 2017. 4

  52. [52]

    Generalized category discovery

    Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisser- man. Generalized category discovery. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. 1, 2, 3, 6, 7, 8, 15, 17

  53. [53]

    No representation rules them all in category discovery.Adv

    Sagar Vaze, Andrea Vedaldi, and Andrew Zisserman. No representation rules them all in category discovery.Adv. Neu- ral Inf. Process. Syst., 36:19962–19989, 2023. 1, 2

  54. [54]

    C. Wah, S. Branson, P. Welinder, P. Perona, and S. Be- longie. The caltech-ucsd birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technol- ogy, 2011. 6, 7, 17

  55. [55]

    Get: Unlocking the multi-modal potential of clip for generalized category dis- covery

    Enguang Wang, Zhimao Peng, Zhengyuan Xie, Fei Yang, Xialei Liu, and Ming-Ming Cheng. Get: Unlocking the multi-modal potential of clip for generalized category dis- covery. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20296– 20306, 2025. 2

  56. [56]

    Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment

    Jianyuan Wang, Christian Rupprecht, and David Novotny. Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 9773–9783,

  57. [57]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre- training.arXiv preprint arXiv:2212.03533, 2022. 6, 7, 8

  58. [58]

    Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion.Adv

    Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Ro- main Br´egier, Yohann Cabon, Vaibhav Arora, Leonid Ants- feld, Boris Chidlovskii, Gabriela Csurka, and J ´erˆome Re- vaud. Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion.Adv. Neural Inf. Process. Syst., 35:3502–3516, 2022. 2

  59. [59]

    Parametric classification for generalized category discovery: A baseline study

    Xin Wen, Bingchen Zhao, and Xiaojuan Qi. Parametric classification for generalized category discovery: A baseline study. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 16590–16600, 2023. 2, 8

  60. [60]

    V ocalset: A singing voice dataset, 2018

    Julia Wilkins, Prem Seetharaman, Alison Wahl, and Bryan Pardo. V ocalset: A singing voice dataset, 2018. 6, 7

  61. [61]

    Metagcd: Learning to continually learn in generalized cat- egory discovery

    Yanan Wu, Zhixiang Chi, Yang Wang, and Songhe Feng. Metagcd: Learning to continually learn in generalized cat- egory discovery. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 1655–1665,

  62. [62]

    Deep active learning in the open world.arXiv preprint arXiv:2411.06353, 2024

    Tian Xie, Jifan Zhang, Haoyue Bai, and Robert Nowak. Deep active learning in the open world.arXiv preprint arXiv:2411.06353, 2024. 2

  63. [63]

    Neural plasticity- inspired multimodal foundation model for earth observation.arXiv preprint arXiv:2403.15356, 2024

    Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J. Stewart, Jo¨elle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural plasticity-inspired foundation model for observing the Earth crossing modalities.arXiv preprint arXiv:2403.15356, 2024. 6, 7

  64. [64]

    Short text clustering via convolutional neural networks

    Jiaming Xu, Peng Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, and Hongwei Hao. Short text clustering via convolutional neural networks. InProceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 62–69, 2015. 6, 7

  65. [65]

    Generalized category discovery for remote sensing image scene classification

    Wenjia Xu, Zijian Yu, Zhiwei Wei, Jiuniu Wang, and Mugen Peng. Generalized category discovery for remote sensing image scene classification. InIGARSS 2024, pages 8731–

  66. [66]

    Consistent prompt tuning for generalized category discovery.IJCV, pages 1–28, 2025

    Muli Yang, Jie Yin, Yanan Gu, Cheng Deng, Hanwang Zhang, and Hongyuan Zhu. Consistent prompt tuning for generalized category discovery.IJCV, pages 1–28, 2025. 2

  67. [67]

    Bag-of-visual-words and spa- tial extensions for land-use classification

    Yi Yang and Shawn Newsam. Bag-of-visual-words and spa- tial extensions for land-use classification. InProceedings of the 18th SIGSPATIAL Int. Conf. Advances in Geographic In- formation Systems, pages 270–279, 2010. 6, 7

  68. [68]

    Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J. Kim. Graph transformer networks.Adv. Neural Inf. Process. Syst., 32, 2019. 2

  69. [69]

    Learning semi- supervised gaussian mixture models for generalized category discovery

    Bingchen Zhao, Xin Wen, and Kai Han. Learning semi- supervised gaussian mixture models for generalized category discovery. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16623–16633, 2023. 2

  70. [70]

    Prototypical hash encoding for on-the-fly fine- grained category discovery.Adv

    Haiyang Zheng, Nan Pu, Wenjing Li, Nicu Sebe, and Zhun Zhong. Prototypical hash encoding for on-the-fly fine- grained category discovery.Adv. Neural Inf. Process. Syst., 37:101428–101455, 2024. 2

  71. [71]

    Textual knowledge matters: Cross-modality co- teaching for generalized visual class discovery

    Haiyang Zheng, Nan Pu, Wenjing Li, Nicu Sebe, and Zhun Zhong. Textual knowledge matters: Cross-modality co- teaching for generalized visual class discovery. InEuropean Conference on Computer Vision, pages 41–58, 2024. 2

  72. [72]

    Yifan Zhou, Haoran Zhu, Yan Zhang, Shuo Liang, Yujing Wang, and Wen Yang. Generalized category discovery in aerial image classification via slot attention. Drones, 8(4):160, 2024. 2

  73. [73]

    Xiao Xiang Zhu, Jingliang Hu, Chunping Qiu, Yilei Shi, Jian Kang, Lichao Mou, Hossein Bagheri, Matthias Haberle, Yuansheng Hua, Rong Huang, Lloyd Hughes, Hao Li, Yao Sun, Guichen Zhang, Shiyao Han, Michael Schmitt, and Yuanyuan Wang. So2Sat LCZ42: A benchmark data set for the classification of global local climate zones [software and data sets]. IEEE Geosc...

  74. [74]

    Henry Peng Zou, Siffi Singh, Yi Nian, Jianfeng He, Jason Cai, Saab Mansour, and Hang Su. GLEAN: Generalized category discovery with diverse and quality-enhanced LLM feedback. arXiv preprint arXiv:2502.18414, 2025. 1, 2, 8

Supplementary Material for “OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism”

In our supplementary mater...

GCDformer Optimization of GCD Latent Spaces

The GCDformer in OmniGCD is trained exclusively on synthetic data to optimize the GCD latent space for clustering. In Figure 3, we present qualitative examples illustrating OmniGCD’s ability to enhance GCD latent spaces. These examples comprise randomly selected synthetic GCD latent spaces from GCDformer’s tra...

Details regarding OmniGCD’s Zero-shot Setup

In this section, we provide additional details on the sizes of the labeled and unlabeled datasets used by OmniGCD for the main results in Table 4 of the main paper. As noted, at test time, OmniGCD’s input comprises both the labeled and unlabeled subsets. In the zero-shot GCD setting, the labeled subset is constru...
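The labeled/unlabeled split described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact protocol: the labeled fraction, class counts, and helper name are hypothetical assumptions; the only structural constraint taken from the text is that the labeled subset is drawn solely from known classes while novel-class samples appear only in the unlabeled subset.

```python
import random

def make_zero_shot_gcd_split(samples, known_classes, labeled_fraction=0.5, seed=0):
    """Split (example, class_label) pairs for zero-shot GCD evaluation.

    Hypothetical sketch: the labeled subset is drawn only from known
    classes; the unlabeled subset holds the remaining known-class
    samples plus every sample from novel (unseen) classes.
    """
    rng = random.Random(seed)
    labeled, unlabeled = [], []
    for x, y in samples:
        if y in known_classes and rng.random() < labeled_fraction:
            labeled.append((x, y))       # label visible at test time
        else:
            unlabeled.append((x, y))     # label hidden; to be discovered
    return labeled, unlabeled

# Toy usage: classes 0-1 are known, classes 2-3 are novel.
data = [(i, i % 4) for i in range(200)]
lab, unlab = make_zero_shot_gcd_split(data, known_classes={0, 1})
```

The split is the input format only; discovery of the novel classes happens downstream in the clustering stage.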

OmniGCD Dimension Ablations

In the main paper, we chose 2D data for GCDformer training, as low dimensionality ensures tractability in generating training data. In this section, we present ablations on training with higher-dimensional data (specifically, 32, 64, and 128 dimensions). Due to the O(N²k) time complexity of t-SNE [20], where N is the number of ...
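One plausible way to generate synthetic GCD latent spaces in an arbitrary target dimension is to sample labeled Gaussian clusters directly, as in the sketch below. This is an assumption about the data-generation scheme for illustration; the cluster count, spread, and function name are hypothetical, not the paper's recipe. It does show why 2D is cheap and why 32/64/128-dimensional variants follow the same pattern at higher cost.

```python
import random

def sample_synthetic_gcd_space(n_clusters=4, points_per_cluster=50,
                               spread=0.3, dim=2, seed=0):
    """Sample one synthetic latent space as a labeled Gaussian mixture.

    Hypothetical generator: each cluster gets a random center in
    [-1, 1]^dim, then isotropic Gaussian points around it. Setting
    dim=2 matches the cheap 2D training regime; dim=32/64/128
    reproduces the higher-dimensional ablation setting.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for c in range(n_clusters):
        center = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for _ in range(points_per_cluster):
            points.append([rng.gauss(mu, spread) for mu in center])
            labels.append(c)
    return points, labels

pts, lbls = sample_synthetic_gcd_space()   # 200 points in 2D, 4 clusters
```

Generation cost grows linearly in dim here; the quadratic-in-N cost cited above comes from t-SNE at reduction time, not from sampling.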

Additional Vision Encoders Ablations

Here, we present results using two additional pretrained vision encoders: MobileNetV3 [19] and DINOv3 [45]. These results are not merely to demonstrate that OmniGCD works with additional encoders, as the main results with multi-modal encoders already establish this, but serve two specific purposes: (1) MobileNetV3...

Further Analysis of OmniGCD Zero-shot GCD Performance

In this section, we provide a more granular analysis of OmniGCD’s performance, along with standard deviations for the main results in Table 4 of the main paper. These results are presented in Table 13. For the granular analysis, we compare the individual contributions of t-SNE dimensionality redu...

Additional Results for Vision Encoder Fine-tuning

Here, we provide additional results for Table 5 in the main paper, focusing on OmniGCD’s performance with a fine-tuned DINOv1 [5] encoder. As noted in the main paper, this vision encoder was fine-tuned using GCD’s [52] supervised and self-supervised contrastive training methods. The results in Table 12 ...
