Visualizing and Describing Fine-grained Categories as Textures

Chenyun Wu; Mikayla Timm; Subhransu Maji; Tsung-Yu Lin

arxiv: 1907.05288 · v1 · pith:I2LFAVRZnew · submitted 2019-07-02 · 💻 cs.CV


Pith reviewed 2026-05-25 11:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords: fine-grained visual categorization · texture visualization · maximal images · texture attributes · bilinear CNN · DTD dataset · FGVC


The pith

Fine-grained categories such as bird and butterfly species can be visualized and described through their distinctive textures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that categories from fine-grained visual categorization challenges can be understood by their textural content. It generates maximal images by optimizing network inputs to maximize the probability of each class under a texture-based model, then automatically describes those images with texture attributes. This matters because subtle species differences are often textural and because top-performing networks already rely on orderless second-order pooling. The resulting visualizations highlight the most discriminative texture elements while the descriptions supply verbal explanations of the same properties.

Core claim

For each category the authors obtain maximal images by finding inputs that maximize the class probability according to a texture-based deep network, then caption those images with texture attributes learned from an extended DTD dataset. These maximal images and their descriptions together indicate which textural aspects are most responsible for distinguishing the category.
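
The input optimization described above can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes an identity feature extractor, a two-class linear classifier on averaged outer products (a stand-in for a bilinear CNN), and a plain L2 penalty in place of whatever image prior the paper uses. All names (`bilinear_pool`, `maximal_input`, `WEIGHTS`) and numeric settings are hypothetical.

```python
import math

def bilinear_pool(x):
    """Average of outer products x_i x_i^T over all locations (orderless)."""
    n, d = len(x), len(x[0])
    pooled = [[0.0] * d for _ in range(d)]
    for feat in x:
        for a in range(d):
            for b in range(d):
                pooled[a][b] += feat[a] * feat[b] / n
    return pooled

def class_probs(x, weights):
    """Softmax over linear scores of the pooled second-order descriptor."""
    pooled = bilinear_pool(x)
    d = len(pooled)
    logits = [sum(w[a][b] * pooled[a][b] for a in range(d) for b in range(d))
              for w in weights]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ascent_step(x, weights, target, lr, weight_decay):
    """One gradient-ascent step on log p(target | x) - weight_decay * ||x||^2."""
    n, d = len(x), len(x[0])
    probs = class_probs(x, weights)
    # d log p_target / d logit_k = 1[k == target] - p_k
    coeff = [(1.0 if k == target else 0.0) - p for k, p in enumerate(probs)]
    # m_mat = sum_k coeff_k * (W_k + W_k^T); identical for every location
    m_mat = [[sum(c * (w[a][b] + w[b][a]) for c, w in zip(coeff, weights))
              for b in range(d)] for a in range(d)]
    new_x = []
    for feat in x:
        grad = [sum(m_mat[a][b] * feat[b] for b in range(d)) / n
                - 2.0 * weight_decay * feat[a] for a in range(d)]
        new_x.append([f + lr * g for f, g in zip(feat, grad)])
    return new_x

def maximal_input(x0, weights, target, steps=200, lr=0.5, weight_decay=0.01):
    """Run gradient ascent from x0 and return the optimized input."""
    x = [list(f) for f in x0]
    for _ in range(steps):
        x = ascent_step(x, weights, target, lr, weight_decay)
    return x

# Hypothetical toy problem: 4 locations, 2 channels, 2 classes.
WEIGHTS = [
    [[1.0, 0.0], [0.0, -1.0]],   # class 0 favours energy in channel 0
    [[-1.0, 0.0], [0.0, 1.0]],   # class 1 favours energy in channel 1
]
X_START = [[0.5, 0.5] for _ in range(4)]
X_MAX = maximal_input(X_START, WEIGHTS, target=0)
P_START = class_probs(X_START, WEIGHTS)[0]
P_MAX = class_probs(X_MAX, WEIGHTS)[0]
print(f"p(class 0) before: {P_START:.3f}, after: {P_MAX:.3f}")
```

In this toy setting the optimized input concentrates its energy in the channel that class 0 rewards. With a real network the gradient would come from automatic differentiation rather than the hand-derived formula used here.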

What carries the argument

Maximal images produced by input optimization in texture-based networks, paired with automatic texture-attribute captioning.

If this is right

Subtle inter-category differences in FGVC datasets can be captured by textural properties alone.
Texture-based models such as bilinear CNNs become more interpretable through the generated maximal images and attribute lists.
Language-based texture descriptions can be produced automatically for any category that has a trained texture network.
The same pipeline applies directly to recent large-scale FGVC collections such as iNaturalist.
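
One simple way to turn attribute predictions into a verbal description is to keep the highest-scoring attributes and join them into a phrase. The sketch below assumes a hypothetical multi-label attribute scorer whose outputs are already available as a dictionary; the scores are made-up placeholders, and the paper's captioning model may work quite differently. The attribute names are drawn from the DTD vocabulary.

```python
def describe_texture(attribute_scores, top_k=3, threshold=0.5):
    """Join the top-scoring attributes above a threshold into a short phrase."""
    ranked = sorted(attribute_scores.items(), key=lambda kv: kv[1], reverse=True)
    chosen = [name for name, score in ranked[:top_k] if score >= threshold]
    if not chosen:
        return "no confident texture attribute"
    if len(chosen) == 1:
        return chosen[0]
    return ", ".join(chosen[:-1]) + " and " + chosen[-1]

# Placeholder scores for one maximal image (illustrative values only).
SCORES = {"striped": 0.91, "banded": 0.74, "dotted": 0.12,
          "scaly": 0.58, "woven": 0.05}
DESCRIPTION = describe_texture(SCORES)
print(DESCRIPTION)
```

The threshold and `top_k` are arbitrary choices for illustration; in practice they would be tuned on held-out annotated textures.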

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be tested by measuring whether the generated descriptions improve human accuracy when identifying fine-grained categories from images.
Similar optimization-plus-description pipelines might be applied to other sensory domains that admit texture-like representations.
If the maximal images prove reliable, they could serve as synthetic training examples to augment small FGVC datasets.

Load-bearing premise

The maximal images obtained by optimization actually reflect the textural properties that humans would judge as discriminative for the category rather than artifacts of the network or optimizer.

What would settle it

A controlled comparison in which humans rate how well the texture attributes of maximal images match those of real category examples versus control images produced by unrelated optimizations.

Figures

Figures reproduced from arXiv: 1907.05288 by Chenyun Wu, Mikayla Timm, Subhransu Maji, Tsung-Yu Lin.

**Figure 2.** Visualization of fine-grained categories from Caltech-UCSD birds and Oxford flowers. Each example is shown as a column of …

**Figure 3.** Visualization of fine-grained categories from FGVC butterflies and moths, fungi, and flowers. Each example is shown as a …

Original abstract

We analyze how categories from recent FGVC challenges can be described by their textural content. The motivation is that subtle differences between species of birds or butterflies can often be described in terms of the texture associated with them and that several top-performing networks are inspired by texture-based representations. These representations are characterized by orderless pooling of second-order filter activations such as in bilinear CNNs and the winner of the iNaturalist 2018 challenge. Concretely, for each category we (i) visualize the "maximal images" by obtaining inputs x that maximize the probability of the particular class according to a texture-based deep network, and (ii) automatically describe the maximal images using a set of texture attributes. The models for texture captioning were trained on our ongoing efforts on collecting a dataset of describable textures building on the DTD dataset. These visualizations indicate what aspects of the texture is most discriminative for each category while the descriptions provide a language-based explanation of the same.
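
To make "orderless pooling of second-order filter activations" concrete, the sketch below averages outer products of local feature vectors and checks that reordering the spatial locations leaves the pooled descriptor unchanged, while a plain concatenation of the same features does change. It is a toy illustration with hypothetical names and synthetic features; any normalization a real bilinear model applies is omitted.

```python
import math

def second_order_pool(features):
    """Flattened average of outer products f f^T over all locations."""
    n, d = len(features), len(features[0])
    pooled = [0.0] * (d * d)
    for f in features:
        for a in range(d):
            for b in range(d):
                pooled[a * d + b] += f[a] * f[b] / n
    return pooled

def first_order_concat(features):
    """Order-sensitive baseline: concatenate features location by location."""
    return [v for f in features for v in f]

# Synthetic local features: 16 locations, 3 channels (assumed sizes).
FEATURES = [[math.sin(i + c) for c in range(3)] for i in range(16)]
REORDERED = FEATURES[::-1]  # same features, different spatial order

SAME_POOLED = all(
    math.isclose(p, q, abs_tol=1e-9)
    for p, q in zip(second_order_pool(FEATURES), second_order_pool(REORDERED))
)
SAME_LAYOUT = first_order_concat(FEATURES) == first_order_concat(REORDERED)
print("pooled descriptor unchanged:", SAME_POOLED)
print("concatenated layout unchanged:", SAME_LAYOUT)
```

Because the descriptor discards spatial arrangement, it responds to which feature co-occurrences are present rather than where they appear, which is why such representations are described as texture-like.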

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper generates maximal images from texture networks for FGVC classes, then captions them with DTD attributes, but supplies no checks that these reflect human-discriminative textures rather than optimization artifacts.


The core move here is to take a bilinear CNN or similar orderless pooling network, run gradient ascent to produce a maximal image per FGVC category, and feed that image to a captioner trained on describable texture attributes. The output is a set of visualizations plus short texture descriptions for things like bird or butterfly species. That specific pipeline applied to recent FGVC benchmarks is not in the cited prior work, so the combination counts as new even if the pieces are extensions of bilinear pooling and DTD captioning. The paper does a clean job of laying out the motivation that texture representations already dominate top FGVC entries and that language descriptions could make them more interpretable. The examples are straightforward to follow and stay within the texture-modeling literature without overclaiming new theory or data.

The soft spot is exactly the one flagged in the stress test. Nothing in the abstract or described procedure shows that the optimized images align with textures a human would pick out as category-discriminative. There are no real-image comparisons, no perceptual ratings, and no ablation against non-texture networks to rule out architecture-specific directions. Without those, the claim that the visualizations and captions explain what is most discriminative rests on an untested assumption. The soundness score of 3.0 is fair given the absence of any quantitative validation or error analysis. The citation pattern is appropriate and does not hide the reliance on external networks.

This is for readers already working on interpretability of second-order pooling models or on texture datasets; someone outside that niche will not get much downstream use. A serious referee could usefully press for the missing human studies or controls, so the work is worth sending out rather than desk-rejecting. I would bring it to a reading group as a maybe to talk through the visualization method, but I would not cite it in my own work until the validation gap is closed.

Referee Report

2 major / 2 minor

Summary. The paper claims that fine-grained visual categories (e.g., bird and butterfly species from FGVC challenges) can be analyzed via their textural content. It generates 'maximal images' for each category by gradient ascent to maximize class probability under texture-based networks such as bilinear CNNs, then feeds these images to a captioning model trained on the DTD dataset (and ongoing extensions) to produce automatic texture-attribute descriptions. The visualizations and descriptions are presented as indicating the most discriminative textural aspects for each category.

Significance. If the maximal images are shown to align with human-perceived discriminative textures, the work could provide an interpretable bridge between orderless second-order pooling representations (known to perform well on FGVC) and linguistic explanations, potentially aiding analysis of why texture models succeed on subtle category distinctions. The approach builds directly on established texture networks and the DTD dataset.

major comments (2)

[Abstract] Abstract and method description: The central claim that the visualizations 'indicate what aspects of the texture is most discriminative for each category' rests on the untested assumption that inputs x* = argmax_x p(class | texture-network(x)) capture human-discriminative textural properties rather than optimization artifacts or network-specific directions. No section reports side-by-side comparisons of x* against real category exemplars, human perceptual similarity ratings, or ablations against non-texture networks to validate this alignment.
[Abstract] The manuscript supplies no quantitative validation, error analysis, or downstream-task evaluation (e.g., whether the generated descriptions improve retrieval or classification) showing that the texture captions match human judgments. This absence leaves the language-based explanations without empirical grounding for the claimed explanatory power.
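
The expression x* = argmax_x p(class | texture-network(x)) in the first major comment can be written out more explicitly. The formulation below is a sketch under assumed notation, not taken from the paper: f_i(x) is the network feature at location i of N, w_k is the classifier weight vector for class k, and R is an unspecified image prior with weight λ. Normalization steps that a real bilinear model may apply to the pooled feature are omitted.

```latex
% Assumed notation (not from the paper): f_i(x) \in \mathbb{R}^d, w_k, R, \lambda.
\begin{align}
  \phi(x) &= \frac{1}{N}\sum_{i=1}^{N}
             \operatorname{vec}\!\big(f_i(x)\, f_i(x)^{\top}\big), \\
  p(c \mid x) &= \frac{\exp\!\big(w_c^{\top}\phi(x)\big)}
                      {\sum_{k}\exp\!\big(w_k^{\top}\phi(x)\big)}, \\
  x^{*} &= \arg\max_{x}\; \log p(c \mid x) - \lambda R(x).
\end{align}
```

Written this way, the referee's concern is visible in the objective itself: nothing in it ties x* to human judgments, so any alignment with perceived textures has to be checked empirically.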

minor comments (2)

[Abstract] Abstract contains a grammatical error: 'what aspects of the texture is most discriminative' should read 'are most discriminative'.
[Abstract] The abstract refers to 'our ongoing efforts on collecting a dataset of describable textures building on the DTD dataset' without providing details on the new data collection, size, or annotation protocol; this should be clarified or referenced to a specific section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned changes to the manuscript.


Referee: [Abstract] Abstract and method description: The central claim that the visualizations 'indicate what aspects of the texture is most discriminative for each category' rests on the untested assumption that inputs x* = argmax_x p(class | texture-network(x)) capture human-discriminative textural properties rather than optimization artifacts or network-specific directions. No section reports side-by-side comparisons of x* against real category exemplars, human perceptual similarity ratings, or ablations against non-texture networks to validate this alignment.

Authors: We agree the manuscript does not provide quantitative validation (human ratings or ablations) that the maximal images align with human perception rather than network artifacts. The visualizations follow standard gradient-ascent practice on texture networks known to succeed on FGVC, and are presented as qualitative indications. In revision we will add side-by-side comparisons of maximal images with real dataset exemplars and a limitations paragraph noting the absence of perceptual studies. A full human-subject validation remains outside the scope of this work.

Revision: partial

Referee: [Abstract] The manuscript supplies no quantitative validation, error analysis, or downstream-task evaluation (e.g., whether the generated descriptions improve retrieval or classification) showing that the texture captions match human judgments. This absence leaves the language-based explanations without empirical grounding for the claimed explanatory power.

Authors: The paper is methodological and demonstrates automatic texture captioning on maximal images using models trained on DTD extensions; it does not include quantitative agreement metrics or downstream-task results. In revision we will add a short error analysis of the captioning model on a held-out texture validation split and will explicitly state that the descriptions are exploratory rather than validated explanations. We will not claim downstream improvements.

Revision: partial

Circularity Check

0 steps flagged

No circularity in the paper's analysis pipeline


The paper presents an empirical visualization and captioning pipeline that applies gradient ascent on pre-trained bilinear-CNN or similar orderless pooling networks to produce maximal images, then feeds those images to a captioner trained on the external DTD dataset. No equations, fitted parameters, or predictions are defined in terms of the target outputs; the central claims rest on the outputs of independently trained external models rather than any self-referential derivation, self-citation load-bearing premise, or renaming of known results. The approach is therefore self-contained against external benchmarks and contains no load-bearing steps that reduce to tautology by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no free parameters, axioms, or invented entities are identifiable.

pith-pipeline@v0.9.0 · 5704 in / 998 out tokens · 23574 ms · 2026-05-25T11:38:52.474819+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1] FGVC Butterflies and Moths Dataset. https://sites.google.com/view/fgvc6/competitions/butterflies-moths-2019

[2] FGVC Flowers Dataset. https://sites.google.com/view/fgvc5/competitions/fgvcx/flowers

[3] FGVC Fungi Dataset. https://sites.google.com/view/fgvc5/competitions/fgvcx/fungi

[4] The Fifth Fine-Grained Visual Categorization (FGVC) Workshop. https://sites.google.com/view/fgvc5

[5] The Sixth Fine-Grained Visual Categorization (FGVC) Workshop. https://sites.google.com/view/fgvc6

[6] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[8] Peihua Li, Jiangtao Xie, Qilong Wang, and Zilin Gao. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[9] Tsung-Yu Lin and Subhransu Maji. Visualizing and Understanding Deep Texture Representations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2791–2799, 2016.

[10] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear Convolutional Neural Networks for Fine-grained Visual Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 40(6):1309–1322, 2018.

[11] Avinash Mahendran and Andrea Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision (IJCV), 2016.

[12] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), Dec 2008.

[13] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[14] Catherine Wah, Steven Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011.
