pith. sign in

arxiv: 2605.05674 · v2 · submitted 2026-05-07 · 💻 cs.CV · cs.AI· cs.LG

EGA: Adapting Frozen Encoders for Vector Search with Bounded Out-of-Distribution Degradation

Pith reviewed 2026-05-11 00:43 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG
keywords Euclidean Geodesic Alignmentfrozen encodersout-of-distribution adaptationvector searchadapter trainingtriplet losslabel precisiongradient sparsity
0
0 comments X

The pith

EGA adapter uses self-limiting updates to refine seen classes without harming unseen ones in frozen-encoder vector search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Frozen vision encoders for vector search encounter queries from unseen classes at deployment, but standard adapter training with global losses reassigns those samples to wrong clusters and drops worst-case label precision by more than 40 points. EGA couples zero initialization, local triplet loss, and hypersphere projection so that triplets already satisfying a small margin produce no gradient; the adapter therefore stops changing geometry that is already correct. At convergence 96.5 percent of triplets are gradient-free, leaving unseen-class regions largely untouched while still allowing full refinement of seen classes. Across five diverse OOD benchmarks EGA records the highest worst-case label precision on four splits and a steady gain on the fifth, and the same pattern holds when stronger backbones replace CLIP.

Core claim

EGA is a residual adapter whose three coupled principles induce a self-limiting dynamic: triplets that already satisfy a small margin stop producing gradients, so the adapter automatically stops updating where the local geometry is already correct, leaving unseen-class regions largely untouched while enabling full-capacity refinement of seen classes.

What carries the argument

The self-limiting dynamic produced by coupling zero initialization, local triplet loss, and hypersphere projection in Euclidean Geodesic Alignment, which drives gradient sparsity on satisfied triplets.

If this is right

  • Worst-case label precision is highest on four of five primary OOD splits and improves on the fifth.
  • At convergence 96.5 percent of triplets produce no gradient, so unseen-class regions remain largely untouched.
  • The same bounded-degradation pattern holds when stronger backbones replace the original CLIP encoder.
  • Gradient sparsity supplies an analytical justification for bounded OOD perturbation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Local triplet objectives may be systematically preferable to global contrastive losses whenever preservation of certain geometric regions is required.
  • The same self-limiting mechanism could be ported to text or multimodal encoders facing class-shift at deployment.
  • Production vector-search pipelines could adopt EGA to reduce the risk of silent accuracy drops on novel queries.
  • The gradient-sparsity statistic itself could serve as a lightweight monitor for OOD stability during adapter training.

Load-bearing premise

The three design principles together create a self-limiting update rule that automatically halts where local geometry is already correct and that this rule generalizes across OOD benchmarks and stronger backbones.

What would settle it

A new OOD benchmark in which EGA's worst-case label precision falls below the frozen baseline, or in which fewer than 90 percent of triplets are gradient-free at convergence, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.05674 by Dongfang Zhao.

Figure 1
Figure 1. Figure 1: Label Precision@K on CIFAR-100 across K ∈ {1, 3, 5, 10} and nprobe ∈ {1, 5, 10}. 0.0 0.5 1.0 1.5 2.0 0 2 4 6 Density Latent Manifold (K = 1) Top 1 Neighbors Background 0.0 0.5 1.0 1.5 2.0 0 2 4 6 Latent Manifold (K = 2) Top 2 Neighbors Background 0.0 0.5 1.0 1.5 2.0 0 2 4 6 Latent Manifold (K = 4) Top 4 Neighbors Background 0.0 0.5 1.0 1.5 2.0 0 2 4 6 Latent Manifold (K = 8) Top 8 Neighbors Background 0.0 … view at source ↗
Figure 2
Figure 2. Figure 2: Distance distributions for Top-K neighbors vs. random background points on CIFAR-100. 4 Evaluation Experiments were conducted on Chameleon Cloud [17] with 256 AMD EPYC-7763 CPU cores, 512 GiB RAM, and an NVIDIA A100 GPU of 80 GB HBM2e. Datasets include CIFAR-100 and ImageNet-1K for in-distribution, and CIFAR-10, FGVC-Aircraft, Food-101, ImageNet-1K held-out classes, and Oxford-IIIT Pet for OOD evaluation. … view at source ↗
Figure 3
Figure 3. Figure 3: ANNS Recall@K versus nprobe on CIFAR-100. two distributions overlap substantially, which is the condition under which IVF-based ANN indexing degrades. After EGA, the two distributions separate cleanly across all K, and this separation drives the recall-efficiency improvement in view at source ↗
Figure 4
Figure 4. Figure 4: ID/OOD operating points of adapter training in the (ID, OOD) Label Precision plane. view at source ↗
Figure 5
Figure 5. Figure 5: Gradient sparsity as OOD protection. Route 1: Gradient sparsity (EGA). A triplet loss with margin m pro￾duces a non-zero gradient only when d(a, p) − d(a, n) + m > 0 view at source ↗
Figure 6
Figure 6. Figure 6: Label Precision@1 across three OOD benchmarks and three search budgets ( view at source ↗
read the original abstract

Vector search systems built on frozen vision encoders face queries from unseen classes at deployment, yet existing adapter training collapses under this shift: high-capacity adapters with global contrastive losses silently reassign unseen-class samples to wrong seen-class clusters, dropping worst-case Label Precision by over 40 points below the frozen baseline in our tests. We propose Euclidean Geodesic Alignment (EGA), a residual adapter that couples three principles: zero initialization, local triplet loss, and hypersphere projection. These collectively induce a self-limiting dynamic: triplets that already satisfy a small margin stop producing gradients, so the adapter automatically stops updating where the local geometry is already correct. Our experiments show that at convergence $96.5\%$ of triplets are gradient-free, leaving unseen-class regions largely untouched while still enabling full-capacity refinement of seen classes. Across five diverse out-of-distribution (OOD) benchmarks, EGA achieves the highest worst-case Label Precision on the four primary splits and a consistent improvement on the fifth. The design also transfers to stronger backbones in addition to CLIP, and we provide an analytical justification linking gradient sparsity to bounded OOD perturbation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Euclidean Geodesic Alignment (EGA), a residual adapter for frozen vision encoders that combines zero initialization, local triplet loss, and hypersphere projection. This design is claimed to induce a self-limiting dynamic in which triplets satisfying a small margin produce no gradients, resulting in 96.5% gradient-free triplets at convergence and thereby bounding out-of-distribution degradation on unseen classes. Experiments across five OOD benchmarks report that EGA attains the highest worst-case Label Precision on four primary splits (with consistent improvement on the fifth), transfers to stronger backbones, and is supported by an analytical justification linking gradient sparsity to bounded OOD perturbation.

Significance. If the analytical justification and empirical robustness hold, the result would be significant for practical vector-search systems that must adapt frozen encoders without access to deployment-time classes. The explicit attempt to derive a bound from the self-limiting property, the high reported gradient sparsity, and the transfer experiments constitute concrete strengths that distinguish the work from standard adapter training.

major comments (3)
  1. [Section 3.3 (analytical justification)] The central analytical justification (Section 3.3) links gradient sparsity measured on seen-class triplets to bounded perturbation of unseen-class regions, yet the adapter is a single global residual function. Because updates are driven exclusively by seen triplets, it remains to be shown that sparsity on the training distribution precludes non-local shifts in the embedding geometry of unseen classes; a concrete bound or invariance argument addressing this global coupling is required.
  2. [Table 2 and Section 4.2] Table 2 and the associated experimental protocol report highest worst-case Label Precision, but the 96.5% gradient-free statistic is computed only on training (seen-class) triplets. An auxiliary measurement or bound demonstrating that the same sparsity pattern holds, or at least does not increase perturbation, for held-out unseen-class samples would directly support the OOD claim.
  3. [Section 4.3 (ablations)] The three design principles are asserted to act collectively, yet no ablation isolates the contribution of hypersphere projection versus local triplet loss alone to the observed gradient sparsity and OOD preservation. Removing one principle at a time while keeping the others fixed would clarify whether the self-limiting dynamic is robust or depends on a specific combination.
minor comments (2)
  1. [Section 3.1] The definition of the triplet margin and its interaction with the zero-initialization scheme could be stated more explicitly with a single equation in the method section to avoid ambiguity when readers reproduce the gradient-free condition.
  2. [Figure 3] Figure 3 (embedding visualizations) would benefit from an additional panel showing the change in nearest-neighbor assignments for a few unseen-class queries before and after adaptation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and positive assessment of the work's potential significance. We address each major comment below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Section 3.3 (analytical justification)] The central analytical justification (Section 3.3) links gradient sparsity measured on seen-class triplets to bounded perturbation of unseen-class regions, yet the adapter is a single global residual function. Because updates are driven exclusively by seen triplets, it remains to be shown that sparsity on the training distribution precludes non-local shifts in the embedding geometry of unseen classes; a concrete bound or invariance argument addressing this global coupling is required.

    Authors: We agree that the current analysis in Section 3.3 would benefit from an explicit treatment of global coupling. The existing derivation shows that the self-limiting triplet loss combined with zero initialization and hypersphere projection bounds the adapter's output norm, thereby limiting perturbation magnitude. To address the referee's concern, we will revise Section 3.3 to include an additional invariance argument: because the residual adapter is applied uniformly and the projection constrains all embeddings to the unit hypersphere, any local update on seen-class support induces only a Lipschitz-bounded shift that cannot arbitrarily rearrange distant unseen-class regions. A formal statement and short proof sketch will be added. revision: yes

  2. Referee: [Table 2 and Section 4.2] Table 2 and the associated experimental protocol report highest worst-case Label Precision, but the 96.5% gradient-free statistic is computed only on training (seen-class) triplets. An auxiliary measurement or bound demonstrating that the same sparsity pattern holds, or at least does not increase perturbation, for held-out unseen-class samples would directly support the OOD claim.

    Authors: The 96.5% gradient-free statistic characterizes the training dynamics on seen-class triplets. To strengthen the link to OOD preservation, we will add an auxiliary measurement in the revised Section 4.2: we will evaluate the fraction of gradient-free triplets (using the same margin) on held-out samples drawn from the unseen classes in the OOD benchmarks. This will be reported alongside the existing worst-case Label Precision results to show that sparsity does not degrade for unseen regions. revision: yes

  3. Referee: [Section 4.3 (ablations)] The three design principles are asserted to act collectively, yet no ablation isolates the contribution of hypersphere projection versus local triplet loss alone to the observed gradient sparsity and OOD preservation. Removing one principle at a time while keeping the others fixed would clarify whether the self-limiting dynamic is robust or depends on a specific combination.

    Authors: We concur that isolating the hypersphere projection from the local triplet loss would improve clarity. We will expand Section 4.3 with two new ablation configurations while retaining zero initialization: (i) local triplet loss without hypersphere projection, and (ii) hypersphere projection paired with a global contrastive loss in place of the local triplet loss. Both will report gradient sparsity at convergence and worst-case Label Precision on the OOD benchmarks, thereby quantifying the contribution of each principle to the self-limiting behavior. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

full rationale

The paper's core derivation rests on three explicitly stated design principles (zero initialization, local triplet loss, hypersphere projection) whose interaction is described as producing gradient sparsity on satisfied margins. This mechanism is a direct, standard consequence of the triplet loss formulation rather than a redefinition of the target OOD bound. The 96.5% gradient-free statistic is an empirical measurement on seen-class training triplets, not a fitted parameter renamed as a prediction. The link from sparsity to bounded OOD perturbation is asserted via a separate analytical justification whose content is not shown to collapse into the input definitions or prior self-citations. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the provided text, so the central claim retains independent content beyond its own assumptions.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is populated from stated design choices; no explicit free parameters or invented entities are named, but the triplet margin and projection radius are implicit hyperparameters.

free parameters (1)
  • triplet margin
    Standard hyperparameter in the local triplet loss; value not reported in abstract but required for the self-limiting dynamic to function.

pith-pipeline@v0.9.0 · 5497 in / 1209 out tokens · 49039 ms · 2026-05-11T00:43:34.671170+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 1 internal anchor

  1. [1]

    I-con: A unifying framework for representation learning

    Shaden Naif Alshammari, John R Hershey, Axel Feldmann, William T Freeman, and Mark Hamilton. I-con: A unifying framework for representation learning. InThe Thirteenth Interna- tional Conference on Learning Representations (ICLR), 2025

  2. [2]

    Spann: highly-efficient billion-scale approximate nearest neighbor search

    Qi Chen, Bing Zhao, Haidong Wang, Mingqin Li, Chuanjie Liu, Zengzhong Li, Mao Yang, and Jingdong Wang. Spann: highly-efficient billion-scale approximate nearest neighbor search. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS ’21, Red Hook, NY , USA, 2021. Curran Associates Inc

  3. [3]

    Navigable graphs for high-dimensional nearest neighbor search: Constructions and limits

    Haya Diwan, Jinrui Gou, Cameron Musco, Christopher Musco, and Torsten Suel. Navigable graphs for high-dimensional nearest neighbor search: Constructions and limits. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 59513–59531. Curran Associates...

  4. [4]

    Improve representation for imbalanced regression through geometric constraints

    Zijian Dong, Yilei Wu, Chongyao Chen, Yingtian Zou, Yichi Zhang, and Juan Helen Zhou. Improve representation for imbalanced regression through geometric constraints. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  5. [5]

    Mitigate the gap: Improving cross-modal alignment in CLIP

    Sedigheh Eslami and Gerard de Melo. Mitigate the gap: Improving cross-modal alignment in CLIP. InThe Thirteenth International Conference on Learning Representations, 2025

  6. [6]

    SimCSE: Simple contrastive learning of sentence embeddings

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic, November 2021. Assoc...

  7. [7]

    Achieving low-latency graph-based vector search via aligning best- first search algorithm with ssd

    Hao Guo and Youyou Lu. Achieving low-latency graph-based vector search via aligning best- first search algorithm with ssd. In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 171–186, Boston, MA, 2025. USENIX Association

  8. [8]

    Odinann: Direct insert for consistently stable performance in billion- scale graph-based vector search

    Hao Guo and Youyou Lu. Odinann: Direct insert for consistently stable performance in billion- scale graph-based vector search. In24th USENIX Conference on File and Storage Technologies (FAST), Santa Clara, CA, 2026. USENIX Association

  9. [9]

    Parameter-efficient transfer learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learn...

  10. [10]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  11. [11]

    Squeeze-and-excitation networks

    Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell., 42(8):2011–2023, August 2020

  12. [12]

    Worst-case performance of popular approximate nearest neighbor search implementations: Guarantees and limitations

    Piotr Indyk and Haike Xu. Worst-case performance of popular approximate nearest neighbor search implementations: Guarantees and limitations. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 66239–66256. Curran Associates, Inc., 2023. 10

  13. [13]

    LoRANN: Low-rank matrix factorization for approximate nearest neighbor search

    Elias Jääsaari, Ville Hyvönen, and Teemu Roos. LoRANN: Low-rank matrix factorization for approximate nearest neighbor search. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  14. [14]

    Visual prompt tuning

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. InComputer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIII, page 709–727, Berlin, Heidelberg, 2022. Springer-Verlag

  15. [15]

    FineCLIP: Self-distilled region-based CLIP for better fine-grained understanding

    Dong Jing, Xiaolong He, Yutian Luo, Nanyi Fei, Guoxing Yang, Wei Wei, Huiwen Zhao, and Zhiwu Lu. FineCLIP: Self-distilled region-based CLIP for better fine-grained understanding. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  16. [16]

    Billion-scale similarity search with gpus.IEEE Transactions on Big Data, 7(3):535–547, 2021

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus.IEEE Transactions on Big Data, 7(3):535–547, 2021

  17. [17]

    Gunawi, Cody Hammock, Joe Mambretti, Alexander Barnes, François Halbach, Alex Rocha, and Joe Stubbs

    Kate Keahey, Jason Anderson, Zhuo Zhen, Pierre Riteau, Paul Ruth, Dan Stanzione, Mert Cevik, Jacob Colleran, Haryadi S. Gunawi, Cody Hammock, Joe Mambretti, Alexander Barnes, François Halbach, Alex Rocha, and Joe Stubbs. Lessons learned from the chameleon testbed. In Proceedings of the 2020 USENIX Annual Technical Conference (USENIX ATC ’20). USENIX Assoc...

  18. [18]

    Decoupled contrastive learning for federated learning, 2025

    Hyungbin Kim, Incheol Baek, and Yon Dohn Chung. Decoupled contrastive learning for federated learning, 2025

  19. [19]

    Nicolaou, and Yannis Panagakis

    Panagiotis Koromilas, Efthymios Georgiou, Giorgos Bouritsas, Theodoros Giannakopoulos, Mihalis A. Nicolaou, and Yannis Panagakis. A principled framework for multi-view contrastive learning, 2025

  20. [20]

    Degradation of feature space in continual learning, 2026

    Chiara Lanza, Roberto Pereira, Marco Miozzo, Eduard Angelats, and Paolo Dini. Degradation of feature space in continual learning, 2026

  21. [21]

    Vision language models: A survey of 26k papers, 2025

    Fengming Lin. Vision language models: A survey of 26k papers, 2025

  22. [22]

    Malkov and D

    Yu A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs.IEEE Trans. Pattern Anal. Mach. Intell., 42(4):824–836, April 2020

  23. [23]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...

  24. [24]

    Compositional entailment learning for hyperbolic vision-language models

    Avik Pal, Max van Spengler, Guido Maria D’Amely di Melendugno, Alessandro Flaborea, Fabio Galasso, and Pascal Mettes. Compositional entailment learning for hyperbolic vision-language models. InThe Thirteenth International Conference on Learning Representations, 2025

  25. [25]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machin...

  26. [26]

    Accept the modality gap: An exploration in the hyperbolic space

    Sameera Ramasinghe, Violetta Shevchenko, Gil Avraham, and Ajanthan Thalaiyasingam. Accept the modality gap: An exploration in the hyperbolic space. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27253–27262, 2024

  27. [27]

    $\boldsymbol{\lambda}$-orthogonality regularization for compatible representation learning

    Simone Ricci, Niccolò Biondi, Federico Pernici, Ioannis Patras, and Alberto Del Bimbo. $\boldsymbol{\lambda}$-orthogonality regularization for compatible representation learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. 11

  28. [28]

    Cohen, and Alan Perotti

    Daniele Savietto, Declan Campbell, André Panisson, Marco Nurisso, Giovanni Petri, Jonathan D. Cohen, and Alan Perotti. The geometry of representational failures in vision language models, 2026

  29. [29]

    Sevilla, Brittany N Dugger, and Chen-Nee Chuah

    Shivam Rajendra Rai Sharma, Xiaoguang Zhu, Luca Cerny Oliveira, Kartik Patwari, La Rissa Vasquez, David Garcia, Louise Nicole C. Sevilla, Brittany N Dugger, and Chen-Nee Chuah. Benchmarking parameter efficient adaptation of vision language models on pathology. In NeurIPS 2025 Workshop for Imageomics: Discovering Biological Knowledge from Images Using AI, 2025

  30. [30]

    Diskann: fast accurate billion-point nearest neighbor search on a single node

    Suhas Jayaram Subramanya, Devvrit, Rohan Kadekodi, Ravishankar Krishaswamy, and Har- sha Vardhan Simhadri. Diskann: fast accurate billion-point nearest neighbor search on a single node. InProceedings of the 33rd International Conference on Neural Information Processing Systems, 2019

  31. [31]

    Hangke Sui, Yuqing Wang, and Minh N. Do. Unicon: Unified framework for efficient contrastive alignment via kernels. InThe Fourteenth International Conference on Learning Representations, 2026

  32. [32]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

  33. [33]

    Rethinking graph masked autoencoders through alignment and uniformity

    Liang Wang, Xiang Tao, Qiang Liu, Shu Wu, and Liang Wang. Rethinking graph masked autoencoders through alignment and uniformity. InProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intel...

  34. [34]

    Understanding contrastive representation learning through alignment and uniformity on the hypersphere

    Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. InProceedings of the 37th International Confer- ence on Machine Learning, ICML’20. JMLR.org, 2020

  35. [35]

    Mingqi Wu, Qiang Sun, and Archer Y . Yang. PCA++: How uniformity induces robustness to background noise in contrastive learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  36. [36]

    Easemvc: Efficient dual selection mechanism for deep multi-view clustering

    Baili Xiao, Zhibin Dong, Ke Liang, Suyuan Liu, Siwei Wang, Tianrui Liu, Xingchen Hu, En Zhu, and Xinwang Liu. Easemvc: Efficient dual selection mechanism for deep multi-view clustering. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20716–20726, 2025

  37. [37]

    PROTOCOL: Partial optimal transport-enhanced contrastive learning for imbalanced multi-view clustering

    Xuqian Xue, Yiming Lei, Qi Cai, Hongming Shan, and Junping Zhang. PROTOCOL: Partial optimal transport-enhanced contrastive learning for imbalanced multi-view clustering. In Forty-second International Conference on Machine Learning, 2025

  38. [38]

    Cspg: Crossing sparse proximity graphs for approximate nearest neighbor search

    Ming Yang, Yuzheng Cai, and Weiguo Zheng. Cspg: Crossing sparse proximity graphs for approximate nearest neighbor search. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 103076–103100. Curran Associates, Inc., 2024

  39. [39]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11941–11952, 2023

  40. [40]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3813–3824, 2023

  41. [41]

    Tip-adapter: Training-free adaption of clip for few-shot classification

    Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free adaption of clip for few-shot classification. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, page 493–510, Berlin, Heidelberg, 2022. Springer-Verlag. 12

  42. [42]

    An information-theoretic regularizer for lossy neural image compression

    Yingwen Zhang, Meng Wang, Xihua Sheng, Peilin Chen, Junru Li, Li Zhang, and Shiqi Wang. An information-theoretic regularizer for lossy neural image compression. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15573–15582, October 2025

  43. [43]

    Representation de- generation problem in prompt-based models for natural language understanding

    Qingyan Zhao, Ruifang He, Jinpeng Zhang, Chang Liu, and Bo Wang. Representation de- generation problem in prompt-based models for natural language understanding. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors,Proceedings of the 2024 Joint International Conference on Computational Linguistic...

  44. [44]

    Conditional prompt learning for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022

  45. [45]

    Learning to prompt for vision-language models.Int

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models.Int. J. Comput. Vision, 130(9):2337–2348, September 2022

  46. [46]

    Cm 3: Calibrating multimodal recommendation, 2025

    Xin Zhou, Yongjie Wang, and Zhiqi Shen. Cm 3: Calibrating multimodal recommendation, 2025. 13 A Theoretical Analysis of EGA’s OOD Stability We provide a theoretical analysis to explain why the proposed EGA adapter limits the perturbation of unseen-class geometry, in contrast to global contrastive methods. The analysis is structured in five steps, proceedi...