Deep Learning for Fine-Grained Image Analysis: A Survey

Jianxin Wu; Quan Cui; Xiu-Shen Wei

arxiv: 1907.03069 · v1 · pith:RUHPD6KYnew · submitted 2019-07-06 · 💻 cs.CV

Xiu-Shen Wei, Jianxin Wu, Quan Cui

Pith reviewed 2026-05-25 01:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords fine-grained image analysis · deep learning · fine-grained recognition · fine-grained retrieval · fine-grained generation · computer vision survey · benchmark datasets

The pith

Recent deep learning progress in fine-grained image analysis is organized into recognition, retrieval, and generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper provides a systematic survey of deep learning techniques for fine-grained image analysis. It divides the field into three categories based on the tasks of recognition, retrieval, and generation. The survey also discusses benchmark datasets, domain-specific applications, and future research directions. This organization helps in understanding how deep learning has advanced the analysis of objects with subtle differences between classes.

Core claim

During the development of deep learning, fine-grained image analysis has made remarkable progress. Existing studies are organized into three major categories: fine-grained image recognition, fine-grained image retrieval, and fine-grained image generation. The survey covers publicly available benchmark datasets and related applications, concluding with directions and open problems.

What carries the argument

The three-category taxonomy consisting of fine-grained image recognition, fine-grained image retrieval, and fine-grained image generation, which structures the review of deep learning methods for analyzing subordinate categories of visual objects.

Load-bearing premise

Existing studies of FGIA techniques can be partitioned into the three categories of recognition, retrieval, and generation without major omissions or overlaps.

What would settle it

A deep learning based FGIA method that cannot be classified under recognition, retrieval, or generation would challenge the survey's organizational framework.

Figures

Figures reproduced from arXiv: 1907.03069 by Jianxin Wu, Quan Cui, Xiu-Shen Wei.

**Figure 1.** Fine-grained image analysis vs. generic image analysis (taking the recognition task as an example). [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]

**Figure 2.** Main aspects of our hierarchical and structural organization [PITH_FULL_IMAGE:figures/full_fig_p002_2.png]

**Figure 3.** Key challenge of fine-grained image analysis, [PITH_FULL_IMAGE:figures/full_fig_p002_3.png]

**Figure 4.** Example fine-grained images belonging to different species of flowers/vegetable, different models of cars/aircrafts and different kinds [PITH_FULL_IMAGE:figures/full_fig_p003_4.png]

**Figure 5.** An example image with its supervisions associated with [PITH_FULL_IMAGE:figures/full_fig_p003_5.png]

**Figure 6.** An example knowledge graph for modeling the category [PITH_FULL_IMAGE:figures/full_fig_p005_6.png]

Original abstract

Computer vision (CV) is the process of using machines to understand and analyze imagery, which is an integral branch of artificial intelligence. Among various research areas of CV, fine-grained image analysis (FGIA) is a longstanding and fundamental problem, and has become ubiquitous in diverse real-world applications. The task of FGIA targets analyzing visual objects from subordinate categories, e.g., species of birds or models of cars. The small inter-class variations and the large intra-class variations caused by the fine-grained nature makes it a challenging problem. During the booming of deep learning, recent years have witnessed remarkable progress of FGIA using deep learning techniques. In this paper, we aim to give a survey on recent advances of deep learning based FGIA techniques in a systematic way. Specifically, we organize the existing studies of FGIA techniques into three major categories: fine-grained image recognition, fine-grained image retrieval and fine-grained image generation. In addition, we also cover some other important issues of FGIA, such as publicly available benchmark datasets and its related domain specific applications. Finally, we conclude this survey by highlighting several directions and open problems which need be further explored by the community in the future.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A standard survey that organizes FGIA deep learning work into recognition, retrieval, and generation, useful for overview but with questions on whether the split covers the field cleanly.

This survey lays out recent deep learning work on fine-grained image analysis by splitting it into three main categories: fine-grained image recognition, retrieval, and generation. It also reviews datasets and some applications.

The paper does a decent job of giving structure to a growing area. Organizing the literature this way can help readers see the main directions without having to hunt through dozens of papers. Covering benchmarks is useful too, since FGIA often depends on specific datasets like CUB or Stanford Cars.

The soft spot is the three-category split itself. Tasks like fine-grained object detection or segmentation might not slot cleanly into recognition, retrieval, or generation, and techniques often cross over between them. If the paper doesn't discuss why these are the major categories or how it handles overlaps, the organization could look incomplete. The abstract presents it as systematic, so the full text needs to back that up with thorough coverage.

This is the kind of paper that helps newcomers or people outside the subfield get up to speed. It won't change how experts work, but it can serve as a reference point. A serious editor should send it to peer review because good surveys are worth the time even if they need some tightening on the taxonomy.

Referee Report

2 major / 1 minor

Summary. This survey reviews deep learning advances in fine-grained image analysis (FGIA), a subfield of computer vision focused on subordinate categories with small inter-class and large intra-class variations. It organizes the literature into three major categories—fine-grained image recognition, fine-grained image retrieval, and fine-grained image generation—while additionally covering benchmark datasets, domain-specific applications, and open problems for future work.

Significance. If the categorization and coverage prove comprehensive and balanced, the survey would serve as a useful reference point for the FGIA community, consolidating progress across recognition, retrieval, and generation tasks during the deep learning era and highlighting applications and datasets.

major comments (2)

[Abstract] Abstract: The central organizational claim partitions existing FGIA studies into exactly three major categories (recognition, retrieval, generation). This is load-bearing for the survey's systematic character, yet the abstract provides no explicit justification or mapping for how tasks such as fine-grained detection, localization, or segmentation—which appear frequently in the broader FGIA literature—are assigned or excluded, raising the risk of omissions that would undermine completeness.
[Abstract] Abstract: The partition treats recognition and retrieval as distinct despite substantial shared methodology (e.g., CNN backbones, part-based representations, and metric-learning losses). Without a dedicated discussion of how overlaps are handled or why they do not collapse the categories, the claimed systematic organization risks appearing artificial rather than reflective of the literature structure.

minor comments (1)

[Abstract] Abstract, final sentence: 'need be further explored' is grammatically incomplete and should read 'need to be further explored'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments on the abstract. We agree that the abstract should better justify the scope and categorization to strengthen the survey's systematic presentation. We will revise the abstract accordingly while preserving the paper's core organization, which is elaborated in the introduction and subsequent sections.

Referee: [Abstract] Abstract: The central organizational claim partitions existing FGIA studies into exactly three major categories (recognition, retrieval, generation). This is load-bearing for the survey's systematic character, yet the abstract provides no explicit justification or mapping for how tasks such as fine-grained detection, localization, or segmentation—which appear frequently in the broader FGIA literature—are assigned or excluded, raising the risk of omissions that would undermine completeness.

Authors: We accept this point. The three categories were chosen because they constitute the dominant task formulations in the deep-learning FGIA literature (as evidenced by the volume of papers and benchmark protocols). Detection, localization, and segmentation are typically treated as enabling components within recognition pipelines rather than independent FGIA tasks with their own large-scale benchmarks; they are therefore discussed inside the recognition section when relevant. To address the concern, we will expand the abstract with one sentence that explicitly states the scope and notes that auxiliary tasks are subsumed under the three primary categories. revision: yes
Referee: [Abstract] Abstract: The partition treats recognition and retrieval as distinct despite substantial shared methodology (e.g., CNN backbones, part-based representations, and metric-learning losses). Without a dedicated discussion of how overlaps are handled or why they do not collapse the categories, the claimed systematic organization risks appearing artificial rather than reflective of the literature structure.

Authors: We agree that methodological overlap exists and is already noted in the body of the survey (e.g., shared backbone architectures and attention mechanisms appear in both recognition and retrieval sections). The separation is maintained because the problem definitions differ—recognition is a closed-set classification task while retrieval is an open-set ranking/similarity task—leading to distinct evaluation protocols and loss formulations. We will add a short clarifying clause in the revised abstract and ensure the introduction explicitly references the overlap discussion that appears later in the manuscript. revision: yes
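The distinction the authors draw between closed-set recognition and ranking-based retrieval can be made concrete with a small sketch. The code below is illustrative only: the function names, the toy scores and embeddings, and the choice of top-1 accuracy and Recall@K as metrics are assumptions made for this example, not protocols taken from the survey. It shows that recognition is scored against a fixed list of class indices, whereas retrieval is scored by ranking a gallery by similarity, so the labels involved never need to pass through a classifier head.

```python
import math


def top1_accuracy(scores, labels):
    """Closed-set recognition: each row of `scores` has one score per known class."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        if predicted == label:
            correct += 1
    return correct / len(labels)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(query_emb, query_labels, gallery_emb, gallery_labels, k):
    """Retrieval: a query is a hit if any of its k most similar gallery items shares its label."""
    hits = 0
    for q, label in zip(query_emb, query_labels):
        ranked = sorted(
            range(len(gallery_emb)),
            key=lambda i: cosine(q, gallery_emb[i]),
            reverse=True,
        )
        if any(gallery_labels[i] == label for i in ranked[:k]):
            hits += 1
    return hits / len(query_labels)


# Hypothetical toy data: three images scored over three known classes.
scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
labels = [1, 0, 2]
acc = top1_accuracy(scores, labels)

# Hypothetical toy embeddings: labels are only compared for equality,
# so they could belong to categories unseen during training.
gallery_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
gallery_labels = ["sparrow", "sparrow", "finch", "finch"]
query_emb = [[0.8, 0.2], [0.6, 0.5]]
query_labels = ["sparrow", "finch"]
r1 = recall_at_k(query_emb, query_labels, gallery_emb, gallery_labels, k=1)
r3 = recall_at_k(query_emb, query_labels, gallery_emb, gallery_labels, k=3)

print(f"top-1 accuracy: {acc:.3f}, Recall@1: {r1:.2f}, Recall@3: {r3:.2f}")
```

Both functions could sit on top of the same backbone features, which is consistent with the overlap the referee points out; what differs is how the output is scored.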

Circularity Check

0 steps flagged

No circularity: survey organizes external literature without derivations or self-referential reductions

This is a literature survey paper with no equations, fitted parameters, predictions, or derivation chain. The central organization into recognition/retrieval/generation is an explicit taxonomic choice stated in the abstract, not derived from any prior result or self-citation within the paper. No load-bearing step reduces to a fit, ansatz, or self-citation; the partition is presented as the authors' systematic framing of existing external work. Self-citations, if present, are not required to justify the taxonomy. The paper is self-contained as a review and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Survey paper with no new derivations, parameters, or entities; the ledger is empty by nature of the contribution type.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Positive-First Most Ambiguous: A Simple Active Learning Criterion for Interactive Retrieval of Rare Categories
cs.CV 2026-03 unverdicted novelty 7.0

PF-MA is a new active learning rule that favors likely-positive uncertain samples to speed up discovery of rare categories in imbalanced visual retrieval.

