
arxiv: 1907.03069 · v1 · pith:RUHPD6KYnew · submitted 2019-07-06 · 💻 cs.CV

Deep Learning for Fine-Grained Image Analysis: A Survey

Pith reviewed 2026-05-25 01:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords fine-grained image analysis · deep learning · fine-grained recognition · fine-grained retrieval · fine-grained generation · computer vision survey · benchmark datasets

The pith

Recent deep learning progress in fine-grained image analysis is organized into recognition, retrieval, and generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper provides a systematic survey of deep learning techniques for fine-grained image analysis. It divides the field into three categories based on the tasks of recognition, retrieval, and generation. The survey also discusses benchmark datasets, domain-specific applications, and future research directions. This organization helps in understanding how deep learning has advanced the analysis of objects with subtle differences between classes.

Core claim

During the development of deep learning, fine-grained image analysis has made remarkable progress. Existing studies are organized into three major categories: fine-grained image recognition, fine-grained image retrieval, and fine-grained image generation. The survey covers publicly available benchmark datasets and related applications, concluding with directions and open problems.

What carries the argument

The three-category taxonomy consisting of fine-grained image recognition, fine-grained image retrieval, and fine-grained image generation, which structures the review of deep learning methods for analyzing subordinate categories of visual objects.

Load-bearing premise

Existing studies of FGIA techniques can be partitioned into the three categories of recognition, retrieval, and generation without major omissions or overlaps.

What would settle it

A deep learning based FGIA method that cannot be classified under recognition, retrieval, or generation would challenge the survey's organizational framework.

Figures

Figures reproduced from arXiv: 1907.03069 by Jianxin Wu, Quan Cui, Xiu-Shen Wei.

Figure 1: Fine-grained image analysis vs. generic image analysis (taking the recognition task as an example).
Figure 2: Main aspects of our hierarchical and structural organization ...
Figure 3: Key challenge of fine-grained image analysis, ...
Figure 4: Example fine-grained images belonging to different species of flowers/vegetable, different models of cars/aircrafts and different kinds ...
Figure 5: An example image with its supervisions associated with ...
Figure 6: An example knowledge graph for modeling the category ...
Original abstract

Computer vision (CV) is the process of using machines to understand and analyze imagery, which is an integral branch of artificial intelligence. Among various research areas of CV, fine-grained image analysis (FGIA) is a longstanding and fundamental problem, and has become ubiquitous in diverse real-world applications. The task of FGIA targets analyzing visual objects from subordinate categories, e.g., species of birds or models of cars. The small inter-class variations and the large intra-class variations caused by the fine-grained nature makes it a challenging problem. During the booming of deep learning, recent years have witnessed remarkable progress of FGIA using deep learning techniques. In this paper, we aim to give a survey on recent advances of deep learning based FGIA techniques in a systematic way. Specifically, we organize the existing studies of FGIA techniques into three major categories: fine-grained image recognition, fine-grained image retrieval and fine-grained image generation. In addition, we also cover some other important issues of FGIA, such as publicly available benchmark datasets and its related domain specific applications. Finally, we conclude this survey by highlighting several directions and open problems which need be further explored by the community in the future.
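
The abstract's point about small inter-class and large intra-class variation can be made concrete with a toy calculation. The sketch below is illustrative only and is not taken from the survey: the function names, the 2-D feature vectors, and the choice of squared Euclidean distance to class centroids are all assumptions made for this example.

```python
from statistics import mean


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Per-dimension mean of a list of vectors."""
    return [mean(dim) for dim in zip(*points)]


def variation_ratio(features_by_class):
    """Return (intra, inter, inter / intra) for a dict: class -> feature vectors.

    intra: mean squared distance of each sample to its own class centroid.
    inter: mean squared distance between every pair of class centroids.
    """
    centroids = {c: centroid(pts) for c, pts in features_by_class.items()}
    intra = mean(
        sq_dist(x, centroids[c])
        for c, pts in features_by_class.items()
        for x in pts
    )
    names = sorted(centroids)
    inter = mean(
        sq_dist(centroids[a], centroids[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    )
    return intra, inter, inter / intra


# Hypothetical features: generic classes sit far apart, fine-grained ones do not.
generic = {"dog": [[0.0, 0.0], [0.0, 2.0]], "car": [[10.0, 0.0], [10.0, 2.0]]}
fine = {"gull_a": [[0.0, 0.0], [0.0, 2.0]], "gull_b": [[1.0, 0.0], [1.0, 2.0]]}

print(variation_ratio(generic))  # inter-class spread dominates
print(variation_ratio(fine))     # inter-class spread is comparable to intra-class spread
```

Under these made-up numbers both settings have the same within-class spread, but the class centroids in the fine-grained case are as close to each other as samples are to their own centroid, which is the regime the abstract describes as challenging.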

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. This survey reviews deep learning advances in fine-grained image analysis (FGIA), a subfield of computer vision focused on subordinate categories with small inter-class and large intra-class variations. It organizes the literature into three major categories—fine-grained image recognition, fine-grained image retrieval, and fine-grained image generation—while additionally covering benchmark datasets, domain-specific applications, and open problems for future work.

Significance. If the categorization and coverage prove comprehensive and balanced, the survey would serve as a useful reference point for the FGIA community, consolidating progress across recognition, retrieval, and generation tasks during the deep learning era and highlighting applications and datasets.

major comments (2)
  1. [Abstract] Abstract: The central organizational claim partitions existing FGIA studies into exactly three major categories (recognition, retrieval, generation). This is load-bearing for the survey's systematic character, yet the abstract provides no explicit justification or mapping for how tasks such as fine-grained detection, localization, or segmentation—which appear frequently in the broader FGIA literature—are assigned or excluded, raising the risk of omissions that would undermine completeness.
  2. [Abstract] Abstract: The partition treats recognition and retrieval as distinct despite substantial shared methodology (e.g., CNN backbones, part-based representations, and metric-learning losses). Without a dedicated discussion of how overlaps are handled or why they do not collapse the categories, the claimed systematic organization risks appearing artificial rather than reflective of the literature structure.
minor comments (1)
  1. [Abstract] Abstract, final sentence: 'need be further explored' is grammatically incomplete and should read 'need to be further explored'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments on the abstract. We agree that the abstract should better justify the scope and categorization to strengthen the survey's systematic presentation. We will revise the abstract accordingly while preserving the paper's core organization, which is elaborated in the introduction and subsequent sections.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central organizational claim partitions existing FGIA studies into exactly three major categories (recognition, retrieval, generation). This is load-bearing for the survey's systematic character, yet the abstract provides no explicit justification or mapping for how tasks such as fine-grained detection, localization, or segmentation—which appear frequently in the broader FGIA literature—are assigned or excluded, raising the risk of omissions that would undermine completeness.

    Authors: We accept this point. The three categories were chosen because they constitute the dominant task formulations in the deep-learning FGIA literature (as evidenced by the volume of papers and benchmark protocols). Detection, localization, and segmentation are typically treated as enabling components within recognition pipelines rather than independent FGIA tasks with their own large-scale benchmarks; they are therefore discussed inside the recognition section when relevant. To address the concern, we will expand the abstract with one sentence that explicitly states the scope and notes that auxiliary tasks are subsumed under the three primary categories. revision: yes

  2. Referee: [Abstract] Abstract: The partition treats recognition and retrieval as distinct despite substantial shared methodology (e.g., CNN backbones, part-based representations, and metric-learning losses). Without a dedicated discussion of how overlaps are handled or why they do not collapse the categories, the claimed systematic organization risks appearing artificial rather than reflective of the literature structure.

    Authors: We agree that methodological overlap exists and is already noted in the body of the survey (e.g., shared backbone architectures and attention mechanisms appear in both recognition and retrieval sections). The separation is maintained because the problem definitions differ—recognition is a closed-set classification task while retrieval is an open-set ranking/similarity task—leading to distinct evaluation protocols and loss formulations. We will add a short clarifying clause in the revised abstract and ensure the introduction explicitly references the overlap discussion that appears later in the manuscript. revision: yes
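
    The distinction drawn in this response, closed-set classification versus similarity ranking, shows up most directly in how each task is scored. The following is a minimal sketch under stated assumptions, not the survey's evaluation code: `top1_accuracy` and `recall_at_k` are hypothetical helper names, the distance is plain squared Euclidean, and the toy scores and embeddings are invented for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top1_accuracy(scores, labels):
    """Recognition metric: each row of `scores` holds one score per known class.

    A prediction is correct when the highest-scoring class index equals the label.
    """
    correct = 0
    for row, y in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == y)
    return correct / len(labels)


def recall_at_k(queries, query_labels, gallery, gallery_labels, k):
    """Retrieval metric: rank gallery items by distance to each query embedding.

    A query is a hit when any of its k nearest gallery items shares its label.
    No fixed class list is needed, only labels to compare for equality.
    """
    hits = 0
    for q, y in zip(queries, query_labels):
        ranked = sorted(range(len(gallery)), key=lambda i: sq_dist(q, gallery[i]))
        hits += int(any(gallery_labels[i] == y for i in ranked[:k]))
    return hits / len(query_labels)


# Toy recognition example: 3 images, 3 known classes.
scores = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1]]
labels = [0, 2, 1]
print(top1_accuracy(scores, labels))

# Toy retrieval example: 2 queries ranked against a 3-item gallery.
gallery = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
gallery_labels = ["a", "b", "c"]
queries = [[0.1, 0.0], [0.6, 0.0]]
query_labels = ["a", "a"]
print(recall_at_k(queries, query_labels, gallery, gallery_labels, k=1))
print(recall_at_k(queries, query_labels, gallery, gallery_labels, k=2))
```

    The recognition metric needs a score for every class fixed in advance, whereas the retrieval metric only compares embeddings and labels, so it can in principle be applied to categories that never appeared during training. That difference is one way to read the "distinct evaluation protocols" mentioned above.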

Circularity Check

0 steps flagged

No circularity: survey organizes external literature without derivations or self-referential reductions

full rationale

This is a literature survey paper with no equations, fitted parameters, predictions, or derivation chain. The central organization into recognition/retrieval/generation is an explicit taxonomic choice stated in the abstract, not derived from any prior result or self-citation within the paper. No load-bearing step reduces to a fit, ansatz, or self-citation; the partition is presented as the authors' systematic framing of existing external work. Self-citations, if present, are not required to justify the taxonomy. The paper is self-contained as a review and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Survey paper with no new derivations, parameters, or entities; the ledger is empty by nature of the contribution type.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Positive-First Most Ambiguous: A Simple Active Learning Criterion for Interactive Retrieval of Rare Categories

    cs.CV 2026-03 unverdicted novelty 7.0

    PF-MA is a new active learning rule that favors likely-positive uncertain samples to speed up discovery of rare categories in imbalanced visual retrieval.
