Learning to Align Generative Appearance Priors for Fine-grained Image Retrieval
Pith reviewed 2026-05-12 04:45 UTC · model grok-4.3
The pith
Aligning retrieval embeddings to generative appearance priors learned from normalizing flows improves generalization to unseen categories in fine-grained image retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GAPan reformulates retrieval learning as appearance modeling by fitting instance features with an invertible density model based on normalizing flows. In the forward pass the flow maps all features into a latent density space in which each seen category is represented by a class-conditional Gaussian prior optimized via exact likelihood estimation. In the reverse pass, samples drawn from the high-density regions of these priors are transformed back into the original feature space to yield appearance-aware anchors; a prior-driven alignment objective then pulls retrieval embeddings toward the corresponding category-specific distributions.
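The forward pass can be sketched in miniature. The toy below is an illustrative assumption, not the paper's implementation: a 1-D affine flow z = a·x + b with unit-variance class-conditional priors, trained by stochastic gradient ascent on the exact log-likelihood given by the change-of-variables formula.

```python
import math
import random

# Toy sketch of the forward (likelihood) pass, a hypothetical 1-D
# simplification: an affine flow z = a * x + b with class-conditional
# priors N(mu_c, 1), trained by exact maximum likelihood via
#   log p(x | c) = log N(a * x + b; mu_c, 1) + log |a|

def log_likelihood(x, c, a, b, mus):
    z = a * x + b
    return -0.5 * (z - mus[c]) ** 2 - 0.5 * math.log(2 * math.pi) + math.log(abs(a))

def train(data, n_classes, steps=4000, lr=0.01, seed=0):
    rng = random.Random(seed)
    a, b, mus = 1.0, 0.0, [0.0] * n_classes
    for _ in range(steps):
        x, c = rng.choice(data)
        err = a * x + b - mus[c]
        a += lr * (-err * x + 1.0 / a)  # gradient of quadratic term + log-det term
        b += lr * (-err)
        mus[c] += lr * err              # move the class prior toward its latents
    return a, b, mus

# Two toy "seen categories" with separated appearance features.
data = [(2.0 + 0.1 * i, 0) for i in range(5)] + [(-2.0 - 0.1 * i, 1) for i in range(5)]
a, b, mus = train(data, n_classes=2)
mean_ll = sum(log_likelihood(x, c, a, b, mus) for x, c in data) / len(data)
```

At convergence each class prior sits on its category's latent features, and the log-det term rewards rescaling within-class variation toward unit scale; the actual model replaces the affine map with a multi-layer invertible flow over high-dimensional features.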
What carries the argument
An invertible normalizing flow that maps retrieval features into a latent space for class-conditional Gaussian prior modeling and generates appearance-aware anchors by reversing samples from those priors.
If this is right
- Retrieval embeddings become aligned with intra-category appearance distributions rather than seen-category semantics.
- The invertible property of the flow preserves richer appearance detail than non-invertible density models.
- The method reaches state-of-the-art retrieval accuracy on both fine-grained and coarse-grained benchmarks.
- Generalization improves because the alignment objective explicitly encourages features to match the learned priors of their categories.
Where Pith is reading between the lines
- The same prior-alignment idea could be tested on other embedding tasks where training labels introduce unwanted semantic bias.
- Replacing the Gaussian assumption with more flexible priors inside the same flow architecture would be a direct next experiment.
- The generated anchors could serve as synthetic training data for low-shot retrieval scenarios.
Load-bearing premise
That the class-conditional Gaussian priors fitted in the flow's latent space encode appearance traits that remain valid for categories never seen during training.
What would settle it
Measuring whether retrieval precision on held-out unseen categories drops when the alignment objective is removed or when the Gaussian priors are replaced by non-generative alternatives on standard FGIR benchmarks.
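The decisive measurement needs nothing exotic: score Recall@1 on embeddings of held-out categories, once for the full model and once with the alignment objective removed or the priors swapped. A minimal evaluator for that comparison (Recall@1 is a standard FGIR metric; the protocol details here are an assumption, not taken from the paper):

```python
def recall_at_1(embeddings, labels):
    """Fraction of queries whose nearest neighbor (self excluded) shares the label."""
    hits = 0
    for i, q in enumerate(embeddings):
        best_j, best_d = -1, float("inf")
        for j, g in enumerate(embeddings):
            if j == i:
                continue  # a query never retrieves itself
            d = sum((u - v) ** 2 for u, v in zip(q, g))
            if d < best_d:
                best_d, best_j = d, j
        hits += int(labels[best_j] == labels[i])
    return hits / len(embeddings)
```

Running this on the same held-out gallery for each ablation would show directly how much of the unseen-category precision is attributable to the prior-driven alignment.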
Original abstract
Fine-grained image retrieval (FGIR) typically relies on supervision from seen categories to learn discriminative embeddings for retrieving unseen categories. However, such supervision often biases retrieval models toward the semantics of seen categories rather than the underlying appearance characteristics that generalize across categories, thereby limiting retrieval performance on unseen categories. To tackle this, we propose GAPan, a Generative Appearance Prior alignment network that reformulates the learning objective from category prediction toward appearance modeling. Technically, GAPan treats retrieval features with an invertible density model based on normalizing flows. In the forward direction, the flow maps all instance features into a latent density space, where each seen category is modeled by a class-conditional Gaussian prior and optimized via exact likelihood estimation. This formulation preserves richer appearance details by leveraging the invertible property of the flows. In the reverse direction, samples from the high-density regions of these learned priors are mapped back to the feature space to produce appearance-aware anchors that reflect intra-category variation. These anchors supervise a prior-driven alignment objective that aligns retrieval embeddings with category-specific appearance distributions, thereby improving generalization to unseen categories. Evaluations demonstrate that our GAPan achieves state-of-the-art performance on both widely-used fine- and coarse-grained benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GAPan, a Generative Appearance Prior alignment network for fine-grained image retrieval. It reformulates the objective using normalizing flows to map retrieval features into a latent space, where each seen category is modeled by a class-conditional Gaussian prior optimized via exact likelihood estimation. Reverse sampling from high-density regions of these priors produces appearance-aware anchors that supervise an alignment loss on the retrieval embeddings, with the goal of reducing semantic bias from seen-category supervision and improving generalization to unseen categories. The abstract claims state-of-the-art results on standard fine- and coarse-grained benchmarks.
Significance. If the generative priors successfully encode transferable appearance characteristics rather than remaining entangled with semantic labels, the approach would offer a principled way to incorporate density modeling into retrieval, complementing purely discriminative methods. The use of invertible flows for both exact likelihood optimization and anchor generation is a technically attractive component that could influence future work on generative priors in embedding spaces.
major comments (2)
- [§3] §3 (Method), the class-conditional Gaussian priors in the flow latent space: the central claim that these priors capture intra-category appearance variation generalizable to unseen categories is load-bearing, yet the input features originate from a backbone trained under category supervision. The manuscript must supply concrete evidence (e.g., likelihood values of unseen-category features under the learned seen priors or a controlled ablation separating appearance axes from semantic axes) to substantiate that the Gaussian assumption and invertibility remove the entanglement; without it the generalization argument remains unverified.
- [Experiments] Experiments section, Table reporting benchmark results: the SOTA claim is asserted, but the contribution of the prior-driven alignment is not isolated from standard losses or recent FGIR baselines. An ablation removing the anchor generation step or replacing the flow priors with simpler class-conditional models would be required to establish that the generative component drives the reported gains.
minor comments (2)
- [§3.2] The description of the reverse sampling procedure for generating anchors could benefit from an accompanying diagram or pseudocode to clarify how high-density samples are selected and mapped back to feature space.
- [§3] Notation for the flow transformation and the prior-driven alignment objective should be made consistent between the text and any equations to avoid ambiguity in the likelihood term.
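A pseudocode rendering of the reverse sampling step, as the first minor comment requests, might look like the following hypothetical sketch; the 1-D affine inverse x = (z − b)/a, the radius-based stand-in for "high-density region" selection, and all names are illustrative assumptions, not the paper's definitions:

```python
import random

def sample_anchors(c, mus, a, b, n=8, radius=0.5, seed=0):
    # Draw latents near the class prior's mean (a proxy for high-density
    # selection), then invert the toy affine flow back to feature space.
    rng = random.Random(seed)
    return [(rng.gauss(mus[c], radius) - b) / a for _ in range(n)]

def alignment_loss(feature, anchors):
    # Prior-driven alignment: mean squared distance to the class anchors.
    return sum((feature - x) ** 2 for x in anchors) / len(anchors)
```

A feature already consistent with its category's appearance distribution incurs a small loss; an off-distribution feature is pulled toward the anchor set.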
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to provide the requested evidence and ablations.
Point-by-point responses
- Referee: [§3] §3 (Method), the class-conditional Gaussian priors in the flow latent space: the central claim that these priors capture intra-category appearance variation generalizable to unseen categories is load-bearing, yet the input features originate from a backbone trained under category supervision. The manuscript must supply concrete evidence (e.g., likelihood values of unseen-category features under the learned seen priors or a controlled ablation separating appearance axes from semantic axes) to substantiate that the Gaussian assumption and invertibility remove the entanglement; without it the generalization argument remains unverified.
Authors: We agree that direct evidence is needed to show the priors capture transferable appearance rather than remaining entangled with seen-category semantics. The invertible flow is trained via exact likelihood on the supervised features, and its bijective property is intended to retain full appearance information without the information loss typical of non-invertible mappings. However, the current manuscript does not include the specific likelihood analysis on unseen features or the requested controlled ablation. In the revision we will add (1) log-likelihood values of unseen-category features evaluated under the learned seen priors and (2) an ablation comparing the full model against a non-invertible class-conditional Gaussian fitted directly in feature space. These additions will substantiate whether the Gaussian assumption plus invertibility reduces semantic entanglement. revision: yes
- Referee: Experiments section, Table reporting benchmark results: the SOTA claim is asserted, but the contribution of the prior-driven alignment is not isolated from standard losses or recent FGIR baselines. An ablation removing the anchor generation step or replacing the flow priors with simpler class-conditional models would be required to establish that the generative component drives the reported gains.
Authors: We concur that isolating the generative prior and anchor-alignment contribution is necessary to support the SOTA claim. While the manuscript already compares against recent FGIR baselines, it lacks the specific ablations suggested. In the revised version we will add two experiments: (1) a variant that removes the anchor-generation step (i.e., no prior-driven alignment loss) and (2) a variant that replaces the normalizing-flow priors with simpler class-conditional Gaussians estimated directly on the retrieval features. These results will quantify how much of the reported gains on unseen categories are attributable to the invertible generative modeling versus standard discriminative losses. revision: yes
Circularity Check
No significant circularity; derivation relies on independent generative modeling steps
Rationale
The paper's core chain fits an invertible normalizing flow to map supervised retrieval features into latent space, places per-class Gaussian priors there, maximizes exact likelihood on seen categories, then reverse-samples high-density regions to create anchors for a separate alignment loss. These steps introduce new trainable components (flow parameters, priors, anchor generation) whose values are determined from data rather than being algebraically equivalent to any input embedding or prior fit. No self-citation is invoked to justify uniqueness or to smuggle an ansatz; the method does not rename an existing empirical pattern and does not treat a fitted quantity as a prediction of itself. The derivation therefore remains self-contained against external benchmarks and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- class-conditional Gaussian parameters
axioms (1)
- standard math: normalizing flows are bijective and allow exact likelihood computation
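Stated explicitly, that axiom is the change-of-variables identity behind exact likelihood training (symbols here are generic, not the paper's notation):

```latex
\log p_X(x \mid c) = \log \mathcal{N}\bigl(f_\theta(x);\, \mu_c, \Sigma_c\bigr)
                   + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|
```

where \(f_\theta\) is the bijective flow and \(\mathcal{N}(\mu_c, \Sigma_c)\) the class-conditional latent prior; invertibility guarantees the Jacobian term is well-defined and that samples from the prior map back to unique feature-space points.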
invented entities (1)
- appearance-aware anchors (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (relevance: unclear). Matched text: "GAPan treats retrieval features with an invertible density model based on normalizing flows... each seen category is modeled by a class-conditional Gaussian prior and optimized via exact likelihood estimation."