pith. machine review for the scientific record.

arxiv: 2605.09859 · v1 · submitted 2026-05-11 · 💻 cs.CV


Learning to Align Generative Appearance Priors for Fine-grained Image Retrieval

Shijie Wang, Xin Yu, Yadan Luo, Zi Huang, Zijian Wang


Pith reviewed 2026-05-12 04:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords fine-grained image retrieval · normalizing flows · generative priors · appearance modeling · embedding alignment · generalization · image retrieval · density estimation

The pith

Aligning retrieval embeddings to generative appearance priors learned from normalizing flows improves generalization to unseen categories in fine-grained image retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fine-grained image retrieval models typically learn from labeled seen categories and end up biased toward their specific semantics, which hurts accuracy when retrieving images from new categories. This paper reframes the objective around modeling underlying appearance characteristics instead. It maps retrieval features through an invertible normalizing flow into a latent space, where each seen category gets its own Gaussian prior fitted by maximum likelihood. Samples drawn from the high-density regions of those priors are mapped back to feature space as anchors that supervise an alignment loss. The result is embeddings that better reflect category-specific appearance variation and transfer more readily to unseen categories.
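
A worked rendering of the likelihood step, in our notation rather than the paper's: for an invertible flow f_θ and a class-conditional Gaussian prior N(μ_c, Σ_c), the exact log-likelihood of a feature x from seen category c follows from the change of variables,

```latex
\log p_\theta(x \mid y = c)
  = \log \mathcal{N}\bigl(f_\theta(x);\ \mu_c,\ \Sigma_c\bigr)
  + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|
```

Maximizing this over labeled seen-category features jointly fits the flow and the per-class priors; the Jacobian term is exactly and cheaply computable for coupling-based flows such as RealNVP [3] and Glow [14], which is what makes exact likelihood estimation practical here.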

Core claim

GAPan reformulates retrieval learning as appearance modeling by fitting an invertible density model, based on normalizing flows, to instance features. In the forward pass, the flow sends all features into a latent density space in which each seen category is represented by a class-conditional Gaussian prior optimized by exact likelihood estimation. In the reverse pass, samples from the high-density regions of these priors are transformed back into the original feature space to yield appearance-aware anchors; a prior-driven alignment objective then pulls retrieval embeddings toward the corresponding category-specific distributions.
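
A minimal sketch of how this forward direction could look, assuming a RealNVP-style affine-coupling flow and diagonal Gaussian priors; every name here (AffineCoupling, Flow, ClassPriors, flow_nll) is our illustration, not the authors' code:

```python
import math

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One RealNVP-style coupling layer: invertible, with a cheap log-det."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                       # bounded scales for stability
        z2 = x2 * torch.exp(s) + t
        return torch.cat([x1, z2], dim=-1), s.sum(dim=-1)  # (z, log|det J|)

    def inverse(self, z):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        s, t = self.net(z1).chunk(2, dim=-1)
        s = torch.tanh(s)
        return torch.cat([z1, (z2 - t) * torch.exp(-s)], dim=-1)

class Flow(nn.Module):
    """Stack of couplings with flips so every dimension gets transformed."""
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(AffineCoupling(dim) for _ in range(n_layers))

    def forward(self, x):
        log_det = x.new_zeros(x.size(0))
        for layer in self.layers:
            x, ld = layer(x)
            # flipping is an invertible permutation with zero log-det
            x, log_det = torch.flip(x, dims=[-1]), log_det + ld
        return x, log_det

    def inverse(self, z):
        for layer in reversed(self.layers):
            z = layer.inverse(torch.flip(z, dims=[-1]))
        return z

class ClassPriors(nn.Module):
    """One diagonal Gaussian per seen category, fitted by maximum likelihood."""
    def __init__(self, n_classes, dim):
        super().__init__()
        self.mu = nn.Parameter(0.1 * torch.randn(n_classes, dim))
        self.log_sigma = nn.Parameter(torch.zeros(n_classes, dim))

    def log_prob(self, z, y):
        mu, log_sigma = self.mu[y], self.log_sigma[y]
        return (-0.5 * ((z - mu) / log_sigma.exp()) ** 2
                - log_sigma - 0.5 * math.log(2 * math.pi)).sum(dim=-1)

def flow_nll(flow, priors, feats, labels):
    """Exact negative log-likelihood via the change of variables."""
    z, log_det = flow(feats)
    return -(priors.log_prob(z, labels) + log_det).mean()
```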

What carries the argument

An invertible normalizing flow that maps retrieval features into a latent space for class-conditional Gaussian prior modeling, and that generates appearance-aware anchors by mapping samples from those priors back through the flow.
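
A matching sketch of the reverse direction, reusing the hypothetical Flow and ClassPriors above; the truncation factor and the cosine form of the alignment loss are our guesses at details the summary does not pin down:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_anchors(flow, priors, labels, truncation=0.5):
    """Draw latents near each class prior's mode, then invert the flow
    to obtain appearance-aware anchors in the original feature space."""
    mu = priors.mu[labels]
    sigma = priors.log_sigma[labels].exp()
    z = mu + truncation * sigma * torch.randn_like(mu)  # high-density region
    return flow.inverse(z)

def alignment_loss(embeddings, anchors):
    """Prior-driven alignment: pull each embedding toward its class anchor."""
    return (1.0 - F.cosine_similarity(embeddings, anchors, dim=-1)).mean()
```

In training, these terms would presumably be combined with a standard retrieval loss, e.g. loss = retrieval + α·flow_nll + β·alignment, consistent with the weighted objective implied by the α, β, γ hyperparameters of the paper's Eq. (10).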

If this is right

  • Retrieval embeddings become aligned with intra-category appearance distributions rather than seen-category semantics.
  • The invertible property of the flow preserves richer appearance detail than non-invertible density models.
  • The method reaches state-of-the-art retrieval accuracy on both fine-grained and coarse-grained benchmarks.
  • Generalization improves because the alignment objective explicitly encourages features to match the learned priors of their categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prior-alignment idea could be tested on other embedding tasks where training labels introduce unwanted semantic bias.
  • Replacing the Gaussian assumption with more flexible priors inside the same flow architecture would be a direct next experiment.
  • The generated anchors could serve as synthetic training data for low-shot retrieval scenarios.

Load-bearing premise

That the class-conditional Gaussian priors fitted in the flow's latent space encode appearance traits that remain valid for categories never seen during training.

What would settle it

Measuring whether retrieval precision on held-out unseen categories drops when the alignment objective is removed or when the Gaussian priors are replaced by non-generative alternatives on standard FGIR benchmarks.
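
A minimal harness for that test, assuming the usual cosine-similarity retrieval protocol and Recall@K metric of FGIR benchmarks; the function is ours, not the paper's evaluation code:

```python
import torch
import torch.nn.functional as F

def recall_at_k(embeddings, labels, k=1):
    """Fraction of queries whose top-k neighbors include a same-class item."""
    emb = F.normalize(embeddings, dim=-1)
    sim = emb @ emb.t()
    sim.fill_diagonal_(float('-inf'))     # a query must not retrieve itself
    topk = sim.topk(k, dim=-1).indices    # (N, k) nearest-neighbor indices
    hits = (labels[topk] == labels.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()
```

Running this on held-out categories for the full model, a no-alignment variant, and a non-generative-prior variant would put numbers on the load-bearing premise above.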

Figures

Figures reproduced from arXiv: 2605.09859 by Shijie Wang, Xin Yu, Yadan Luo, Zi Huang, Zijian Wang.

Figure 1
Figure 1: Motivation of GAPan. Beyond one-hot labels, GAPan … view at source ↗
Figure 2
Figure 2: Detailed illustration of Generative Appearance Prior alignment network. See §3 for more details. view at source ↗
Figure 3
Figure 3: Hyperparameter analysis of α, β and γ in Eq. (10). view at source ↗
Figure 5
Figure 5: Examples of top-5 retrieval results on CUB. view at source ↗
read the original abstract

Fine-grained image retrieval (FGIR) typically relies on supervision from seen categories to learn discriminative embeddings for retrieving unseen categories. However, such supervision often biases retrieval models toward the semantics of seen categories rather than the underlying appearance characteristics that generalize across categories, thereby limiting retrieval performance on unseen categories. To tackle this, we propose GAPan, a Generative Appearance Prior alignment network that reformulates the learning objective from category prediction toward appearance modeling. Technically, GAPan treats retrieval features with an invertible density model based on normalizing flows. In the forward direction, the flow maps all instance features into a latent density space, where each seen category is modeled by a class-conditional Gaussian prior and optimized via exact likelihood estimation. This formulation preserves richer appearance details by leveraging the invertible property of the flows. In the reverse direction, samples from the high-density regions of these learned priors are mapped back to the feature space to produce appearance-aware anchors that reflect intra-category variation. These anchors supervise a prior-driven alignment objective that aligns retrieval embeddings with category-specific appearance distributions, thereby improving generalization to unseen categories. Evaluations demonstrate that our GAPan achieves state-of-the-art performance on both widely-used fine- and coarse-grained benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes GAPan, a Generative Appearance Prior alignment network for fine-grained image retrieval. It reformulates the objective using normalizing flows to map retrieval features into a latent space, where each seen category is modeled by a class-conditional Gaussian prior optimized via exact likelihood estimation. Reverse sampling from high-density regions of these priors produces appearance-aware anchors that supervise an alignment loss on the retrieval embeddings, with the goal of reducing semantic bias from seen-category supervision and improving generalization to unseen categories. The abstract claims state-of-the-art results on standard fine- and coarse-grained benchmarks.

Significance. If the generative priors successfully encode transferable appearance characteristics rather than remaining entangled with semantic labels, the approach would offer a principled way to incorporate density modeling into retrieval, complementing purely discriminative methods. The use of invertible flows for both exact likelihood optimization and anchor generation is a technically attractive component that could influence future work on generative priors in embedding spaces.

major comments (2)
  1. [§3] §3 (Method), the class-conditional Gaussian priors in the flow latent space: the central claim that these priors capture intra-category appearance variation generalizable to unseen categories is load-bearing, yet the input features originate from a backbone trained under category supervision. The manuscript must supply concrete evidence (e.g., likelihood values of unseen-category features under the learned seen priors or a controlled ablation separating appearance axes from semantic axes) to substantiate that the Gaussian assumption and invertibility remove the entanglement; without it the generalization argument remains unverified.
  2. [Experiments] Experiments section, Table reporting benchmark results: the SOTA claim is asserted, but the contribution of the prior-driven alignment is not isolated from standard losses or recent FGIR baselines. An ablation removing the anchor generation step or replacing the flow priors with simpler class-conditional models would be required to establish that the generative component drives the reported gains.
minor comments (2)
  1. [§3.2] The description of the reverse sampling procedure for generating anchors could benefit from an accompanying diagram or pseudocode to clarify how high-density samples are selected and mapped back to feature space.
  2. [§3] Notation for the flow transformation and the prior-driven alignment objective should be made consistent between the text and any equations to avoid ambiguity in the likelihood term.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to provide the requested evidence and ablations.

read point-by-point responses
  1. Referee: [§3] §3 (Method), the class-conditional Gaussian priors in the flow latent space: the central claim that these priors capture intra-category appearance variation generalizable to unseen categories is load-bearing, yet the input features originate from a backbone trained under category supervision. The manuscript must supply concrete evidence (e.g., likelihood values of unseen-category features under the learned seen priors or a controlled ablation separating appearance axes from semantic axes) to substantiate that the Gaussian assumption and invertibility remove the entanglement; without it the generalization argument remains unverified.

    Authors: We agree that direct evidence is needed to show the priors capture transferable appearance rather than remaining entangled with seen-category semantics. The invertible flow is trained via exact likelihood on the supervised features, and its bijective property is intended to retain full appearance information without the information loss typical of non-invertible mappings. However, the current manuscript does not include the specific likelihood analysis on unseen features or the requested controlled ablation. In the revision we will add (1) log-likelihood values of unseen-category features evaluated under the learned seen priors and (2) an ablation comparing the full model against a non-invertible class-conditional Gaussian fitted directly in feature space. These additions will substantiate whether the Gaussian assumption plus invertibility reduces semantic entanglement. (A sketch of this likelihood diagnostic follows these responses.) revision: yes

  2. Referee: Experiments section, Table reporting benchmark results: the SOTA claim is asserted, but the contribution of the prior-driven alignment is not isolated from standard losses or recent FGIR baselines. An ablation removing the anchor generation step or replacing the flow priors with simpler class-conditional models would be required to establish that the generative component drives the reported gains.

    Authors: We concur that isolating the generative prior and anchor-alignment contribution is necessary to support the SOTA claim. While the manuscript already compares against recent FGIR baselines, it lacks the specific ablations suggested. In the revised version we will add two experiments: (1) a variant that removes the anchor-generation step (i.e., no prior-driven alignment loss) and (2) a variant that replaces the normalizing-flow priors with simpler class-conditional Gaussians estimated directly on the retrieval features. These results will quantify how much of the reported gains on unseen categories are attributable to the invertible generative modeling versus standard discriminative losses. revision: yes
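
For reference, the likelihood diagnostic promised in response 1 could be as simple as the following sketch, reusing the hypothetical Flow and ClassPriors from the forward-direction sketch above; markedly lower values on unseen-category features than on seen-category features would indicate semantic entanglement:

```python
import torch

@torch.no_grad()
def best_seen_loglik(flow, priors, feats):
    """Mean log-likelihood of features under their best-fitting seen prior."""
    z, log_det = flow(feats)
    # evaluate every feature under every seen-class prior, keep the best fit
    lls = torch.stack([
        priors.log_prob(z, torch.full((z.size(0),), c,
                                      dtype=torch.long, device=z.device))
        for c in range(priors.mu.size(0))
    ], dim=-1)                            # (N, n_seen_classes)
    return (lls.max(dim=-1).values + log_det).mean().item()
```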

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent generative modeling steps

full rationale

The paper's core chain fits an invertible normalizing flow to map supervised retrieval features into latent space, places per-class Gaussian priors there, maximizes exact likelihood on seen categories, then reverse-samples high-density regions to create anchors for a separate alignment loss. These steps introduce new trainable components (flow parameters, priors, anchor generation) whose values are determined from data rather than being algebraically equivalent to any input embedding or prior fit. No self-citation is invoked to justify uniqueness or to smuggle an ansatz; the method does not rename an existing empirical pattern and does not treat a fitted quantity as a prediction of itself. The derivation therefore remains self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the invertibility of normalizing flows and the suitability of Gaussian priors for modeling intra-category appearance variation; these are standard but the specific application to retrieval alignment is new.

free parameters (1)
  • class-conditional Gaussian parameters
    Mean and covariance for each seen category's prior in latent space are learned via maximum likelihood; these are fitted to the training data.
axioms (1)
  • standard math Normalizing flows are bijective and allow exact likelihood computation
    Invoked to justify mapping features to latent density space and back while preserving probability densities.
invented entities (1)
  • appearance-aware anchors no independent evidence
    purpose: Synthetic points sampled from high-density regions of learned priors to supervise embedding alignment
    Introduced to provide intra-category variation signals without direct labels; no independent evidence outside the model is given.

pith-pipeline@v0.9.0 · 5516 in / 1415 out tokens · 35255 ms · 2026-05-12T04:45:11.886354+00:00 · methodology

discussion (0)



Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 1 internal anchor

  1. [1]

    Learning attribute representations with localization for flexible fashion search

    Kenan E. Ak, Ashraf A. Kassim, Joo-Hwee Lim, and Jo Yew Tham. Learning attribute representations with localization for flexible fashion search. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 7708–7717. Computer Vision Foundation / IEEE Computer Society, 2018

  2. [2]

    Bird species categorization using pose normalized deep convolutional nets

    Steve Branson, Grant Van Horn, Serge J. Belongie, and Pietro Perona. Bird species categorization using pose normalized deep convolutional nets. CoRR, abs/1406.2952, 2014

  3. [3]

    Density estimation using real NVP

    Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017

  4. [4]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  5. [5]

    Link the head to the "beak": Zero shot learning from noisy text description at part precision

    Mohamed Elhoseiny, Yizhe Zhu, Han Zhang, and Ahmed M. Elgammal. Link the head to the "beak": Zero shot learning from noisy text description at part precision. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6288–6297. IEEE Computer Society, 2017

  6. [6]

    Hyperbolic vision transformers: Combining improvements in metric learning

    Aleksandr Ermolov, Leyla Mirvakhabova, Valentin Khrulkov, Nicu Sebe, and Ivan V. Oseledets. Hyperbolic vision transformers: Combining improvements in metric learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 7399–7409. IEEE, 2022

  7. [7]

    Mean field theory in deep metric learning

    Takuya Furusawa. Mean field theory in deep metric learning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  8. [8]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016

  9. [9]

    Normalizing flows for human pose anomaly detection

    Or Hirschorn and Shai Avidan. Normalizing flows for human pose anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13545–13554, 2023

  10. [10]

    Fine-grained image retrieval via dual-vision adaptation

    Xin Jiang, Meiqi Cao, Hao Tang, Fei Shen, and Zechao Li. Fine-grained image retrieval via dual-vision adaptation. arXiv preprint arXiv:2506.16273, 2025

  11. [11]

    Contrastive bayesian analysis for deep metric learning

    Shichao Kan, Zhiquan He, Yigang Cen, Yang Li, Vladimir Mladenovic, and Zhihai He. Contrastive bayesian analysis for deep metric learning. IEEE Trans. Pattern Anal. Mach. Intell., 45(6):7220–7238, 2023

  12. [12]

    HIER: metric learning beyond class labels via hierarchical regularization

    Sungyeon Kim, Boseung Jeong, and Suha Kwak. HIER: metric learning beyond class labels via hierarchical regularization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 19903–19912. IEEE, 2023

  13. [13]

    Embedding transfer with label relaxation for improved metric learning

    Sungyeon Kim, Dongwon Kim, Minsu Cho, and Suha Kwak. Embedding transfer with label relaxation for improved metric learning. In CVPR, pages 3967–3976. Computer Vision Foundation / IEEE, 2021

  14. [14]

    Glow: Generative flow with invertible 1x1 convolutions

    Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems, 31, 2018

  15. [15]

    A non-isotropic probabilistic take on proxy-based deep metric learning

    Michael Kirchhof, Karsten Roth, Zeynep Akata, and Enkelejda Kasneci. A non-isotropic probabilistic take on proxy-based deep metric learning. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part...

  16. [16]

    Why normalizing flows fail to detect out-of-distribution data

    Polina Kirichenko, Pavel Izmailov, and Andrew G. Wilson. Why normalizing flows fail to detect out-of-distribution data. Advances in Neural Information Processing Systems, 33:20578–20589, 2020

  17. [17]

    Learning with memory-based virtual classes for deep metric learning

    ByungSoo Ko, Geonmo Gu, and Han-Gyu Kim. Learning with memory-based virtual classes for deep metric learning. In ICCV, pages 11772–11781. IEEE, 2021

  18. [18]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV Workshops 2013, Sydney, Australia, December 1-8, 2013, pages 554–561, 2013

  19. [19]

    SEE: spherical embedding expansion for improving deep metric learning (extended abstract)

    Binh Minh Le and Simon S. Woo. SEE: spherical embedding expansion for improving deep metric learning (extended abstract). In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2025, Montreal, Canada, August 16-22, 2025, pages 10906–10911. ijcai.org, 2025

  20. [20]

    Pyramidflow: High-resolution defect contrastive localization using pyramid normalizing flow

    Jiarui Lei, Xiaobo Hu, Yue Wang, and Dong Liu. Pyramidflow: High-resolution defect contrastive localization using pyramid normalizing flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14143–14152, 2023

  21. [21]

    Hypergraph-induced semantic tuplet loss for deep metric learning

    Jongin Lim, Sangdoo Yun, Seulki Park, and Jin Young Choi. Hypergraph-induced semantic tuplet loss for deep metric learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 212–222. IEEE, 2022

  22. [22]

    Deepfashion: Powering robust clothes recognition and retrieval with rich annotations

    Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 1096–1104. IEEE Computer Society, 2016

  23. [23]

    Generative classifiers as a basis for trustworthy image classification

    Radek Mackowiak, Lynton Ardizzone, Ullrich Kothe, and Carsten Rother. Generative classifiers as a basis for trustworthy image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2971–2981, 2021

  24. [24]

    Flowification: Everything is a normalizing flow

    Bálint Máté, Samuel Klein, Tobias Golling, and François Fleuret. Flowification: Everything is a normalizing flow. Advances in Neural Information Processing Systems, 35:35478–35489, 2022

  25. [25]

    Keypoint-aligned embeddings for image retrieval and re-identification

    Olga Moskvyak, Frédéric Maire, Feras Dayoub, and Mahsa Baktashmotlagh. Keypoint-aligned embeddings for image retrieval and re-identification. In Winter Conference on Applications of Computer Vision, pages 676–685. IEEE, 2021

  26. [26]

    Deep disentangled metric learning

    Jinhee Park, Jisoo Park, Dagyeong Na, and Junseok Kwon. Deep disentangled metric learning. In Toby Walsh, Julie Shah, and Zico Kolter, editors, AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, pages 19830–19838. AAAI Press, 2025

  27. [27]

    Learning semantic proxies from visual prompts for parameter-efficient fine-tuning in deep metric learning

    Li Ren, Chen Chen, Liqiang Wang, and Kien A. Hua. Learning semantic proxies from visual prompts for parameter-efficient fine-tuning in deep metric learning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  28. [28]

    Imagenet-21k pretraining for the masses

    Tal Ridnik, Emanuel Ben Baruch, Asaf Noy, and Lihi Zelnik. Imagenet-21k pretraining for the masses. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021

  29. [29]

    Simultaneous similarity-based self-distillation for deep metric learning

    Karsten Roth, Timo Milbich, Björn Ommer, Joseph Paul Cohen, and Marzyeh Ghassemi. Simultaneous similarity-based self-distillation for deep metric learning. In Marina Meila and Tong Zhang, editors, Proceedings of Machine Learning Research, volume 139 of Proceedings of Machine Learning Research, pages 9095–9106. PMLR, 2021

  30. [30]

    Non-isotropy regularization for proxy-based deep metric learning

    Karsten Roth, Oriol Vinyals, and Zeynep Akata. Non-isotropy regularization for proxy-based deep metric learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 7410–7420. IEEE, 2022

  31. [31]

    Facenet: A unified embedding for face recognition and clustering

    Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823. IEEE Computer Society, 2015

  32. [32]

    Deep metric learning via lifted structured feature embedding

    Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, pages 4004–4012. IEEE Computer Society, 2016

  33. [33]

    Proxynca++: Revisiting and revitalizing proxy neighborhood component analysis

    Eu Wern Teh, Terrance DeVries, and Graham W. Taylor. Proxynca++: Revisiting and revitalizing proxy neighborhood component analysis. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, ECCV, volume 12369 of Lecture Notes in Computer Science, pages 448–464. Springer, 2020

  34. [34]

    Deep factorized metric learning

    Chengkun Wang, Wenzhao Zheng, Junlong Li, Jie Zhou, and Jiwen Lu. Deep factorized metric learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 7672–7682. IEEE, 2023

  35. [35]

    Introspective deep metric learning

    Chengkun Wang, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Introspective deep metric learning. IEEE Trans. Pattern Anal. Mach. Intell., 46(4):1964–1980, 2024

  36. [36]

    Language-driven fine-grained retrieval

    Shijie Wang, Xin Yu, Yadan Luo, Zijian Wang, Pengfei Zhang, and Zi Huang. Language-driven fine-grained retrieval. CoRR, abs/2512.06255, 2025

  37. [37]

    Low-light image enhancement with normalizing flow

    Yufei Wang, Renjie Wan, Wenhan Yang, Haoliang Li, Lap-Pui Chau, and Alex Kot. Low-light image enhancement with normalizing flow. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2604–2612, 2022

  38. [38]

    A2-net: Learning attribute-aware hash codes for large-scale fine-grained image retrieval

    Xiu-Shen Wei, Yang Shen, Xuhao Sun, Han-Jia Ye, and Jian Yang. A2-net: Learning attribute-aware hash codes for large-scale fine-grained image retrieval. Advances in Neural Information Processing Systems, 34:5720–5730, 2021

  39. [39]

    Fine-grained image analysis with deep learning: A survey

    Xiu-Shen Wei, Yi-Zhe Song, Oisin Mac Aodha, Jianxin Wu, Yuxin Peng, Jinhui Tang, Jian Yang, and Serge Belongie. Fine-grained image analysis with deep learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):8927–8948, 2021

  40. [40]

    Deep metric learning in projected-hypersphere space

    Yunhao Xu, Zhentao Chen, and Junlin Hu. Deep metric learning in projected-hypersphere space. Pattern Recognit., 161:111245, 2025

  41. [41]

    Lm-metric: Learned pair weighting and contextual memory for deep metric learning

    Shiyang Yan, Lin Xu, Xinyao Shu, Zhenyu Lu, and Jialie Shen. Lm-metric: Learned pair weighting and contextual memory for deep metric learning. Pattern Recognit., 155:110722, 2024

  42. [42]

    HSE: hybrid species embedding for deep metric learning

    Bailin Yang, Haoqiang Sun, Frederick W. B. Li, Zheng Chen, Jianlu Cai, and Chao Song. HSE: hybrid species embedding for deep metric learning. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 11013–11023. IEEE, 2023

  43. [43]

    DIML: deep interpretable metric learning via structural matching

    Wenliang Zhao, Yongming Rao, Jie Zhou, and Jiwen Lu. DIML: deep interpretable metric learning via structural matching. IEEE Trans. Pattern Anal. Mach. Intell., 46(4):2518–2532, 2024

  44. [44]

    Deep compositional metric learning

    Wenzhao Zheng, Chengkun Wang, Jiwen Lu, and Jie Zhou. Deep compositional metric learning. In CVPR, pages 9320–9329. Computer Vision Foundation / IEEE, 2021

  45. [45]

    Deep relational metric learning

    Wenzhao Zheng, Borui Zhang, Jiwen Lu, and Jie Zhou. Deep relational metric learning. In ICCV, pages 12045–12054. IEEE, 2021

  46. [46]

    Centralized ranking loss with weakly supervised localization for fine-grained object retrieval

    Xiawu Zheng, Rongrong Ji, Xiaoshuai Sun, Yongjian Wu, Feiyue Huang, and Yanhua Yang. Centralized ranking loss with weakly supervised localization for fine-grained object retrieval. In Jérôme Lang, editor, IJCAI, pages 1226–1233. ijcai.org, 2018
