pith. sign in

arxiv: 1907.04983 · v2 · pith:XTXUHMEBnew · submitted 2019-07-11 · 💻 cs.CV

Aesthetic Attributes Assessment of Images

Pith reviewed 2026-05-24 23:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords image aesthetic assessmentaesthetic attributesimage captioningmulti-attribute predictionknowledge transferattention modelDPC-Captions datasetAMAN network
0
0 comments X

The pith

Aesthetic assessment of images now predicts both captions and numerical scores for five separate attributes using a single network trained on mixed labeled data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces aesthetic attributes assessment as a new task that generates text captions describing up to five qualities of an image while also assigning a numerical score to each. It creates the large DPC-Captions dataset by transferring labels from the smaller fully annotated PCCD set, then trains the Aesthetic Multi-Attribute Network on the combined data. The network combines transfer learning with an attention mechanism to handle both fully and weakly labeled examples in one framework. Experiments show the model produces attribute captions and scores together and beats standard CNN-LSTM and SCA-CNN captioning baselines on standard image-caption metrics.

Core claim

Aesthetic Attributes Assessment predicts captions of five aesthetic attributes together with a numerical score for each attribute; the AMAN model, trained on a mixture of the small fully-annotated PCCD dataset and the large weakly-annotated DPC-Captions dataset obtained via knowledge transfer, jointly performs both tasks and outperforms traditional CNN-LSTM and modern SCA-CNN models on image caption evaluation criteria.

What carries the argument

Aesthetic Multi-Attribute Network (AMAN), a single framework that applies transfer learning and attention to predict multiple attribute captions and scores from mixed fully and weakly labeled image data.

If this is right

  • Images receive richer feedback than a single overall score, with separate text and numeric output for each of five attributes.
  • Large-scale weakly labeled data can be leveraged without requiring full annotation of every image.
  • Attention inside the network focuses on relevant image regions when generating each attribute's caption and score.
  • The same architecture handles both caption generation and regression in one forward pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Tools that edit photos could use the per-attribute outputs to suggest targeted changes such as adjusting composition or color balance.
  • The approach may extend to other visual domains where both descriptive text and quantitative ratings are desired, such as product photography or architectural images.
  • If the transferred labels contain hidden biases, downstream applications risk amplifying those biases in generated captions.

Load-bearing premise

The knowledge transfer process that builds the large DPC-Captions dataset from the small PCCD dataset preserves the intended aesthetic attribute labels without adding substantial noise or bias.

What would settle it

Run the trained AMAN model on the held-out PCCD images and compare its generated attribute captions against human-written ones using BLEU, METEOR, or CIDEr; if scores fall below the CNN-LSTM baseline or if the predicted numerical scores deviate systematically from the original PCCD ground-truth ratings, the joint prediction claim does not hold.

Figures

Figures reproduced from arXiv: 1907.04983 by Bin Zhou, Dongqing Zou, Geng Zhao, Le Wu, Shiming Ge, Xiaodong Li, Xiaokun Zhang, Xinghui Zhou, Xin Jin.

Figure 1
Figure 1. Figure 1: Aesthetic Attributes Assessment of Images. We predict caption and score of each aesthetic attribute of an image. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The knowledge transfer method from PCCD to our DPC-Captions. The PCCD dataset includes 7 aesthetic attributes [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Aesthetic Multi-Attribute Network (AMAN) contains Multi-Attribute Feature Network (MAFN), Channel and Spatial [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The results of aesthetic multi-attribute network on DPC-Captions dataset. The predicted captions and score each [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
read the original abstract

Image aesthetic quality assessment has been a relatively hot topic during the last decade. Most recently, comments type assessment (aesthetic captions) has been proposed to describe the general aesthetic impression of an image using text. In this paper, we propose Aesthetic Attributes Assessment of Images, which means the aesthetic attributes captioning. This is a new formula of image aesthetic assessment, which predicts aesthetic attributes captions together with the aesthetic score of each attribute. We introduce a new dataset named \emph{DPC-Captions} which contains comments of up to 5 aesthetic attributes of one image through knowledge transfer from a full-annotated small-scale dataset. Then, we propose Aesthetic Multi-Attribute Network (AMAN), which is trained on a mixture of fully-annotated small-scale PCCD dataset and weakly-annotated large-scale DPC-Captions dataset. Our AMAN makes full use of transfer learning and attention model in a single framework. The experimental results on our DPC-Captions and PCCD dataset reveal that our method can predict captions of 5 aesthetic attributes together with numerical score assessment of each attribute. We use the evaluation criteria used in image captions to prove that our specially designed AMAN model outperforms traditional CNN-LSTM model and modern SCA-CNN model of image captions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces aesthetic attributes assessment as a new formulation of image aesthetic quality assessment: jointly predicting captions for up to five aesthetic attributes and a numerical score for each attribute. It constructs the large-scale DPC-Captions dataset via knowledge transfer from the small fully-annotated PCCD dataset, proposes the Aesthetic Multi-Attribute Network (AMAN) trained on the mixture of fully- and weakly-annotated data, and reports that AMAN outperforms CNN-LSTM and SCA-CNN baselines on standard image-caption metrics.

Significance. If the transferred labels prove reliable, the joint caption-plus-score formulation and the use of attention within a transfer-learning framework could support more fine-grained, multi-attribute aesthetic analysis at scale. The approach directly addresses the data scarcity problem common in aesthetic assessment by leveraging weak supervision.

major comments (3)
  1. [Dataset section (DPC-Captions construction)] The construction and use of DPC-Captions is load-bearing for all training and evaluation claims, yet the manuscript provides no quantitative validation (agreement metrics, noise analysis, or held-out human verification) that the knowledge-transfer process preserves the original five-attribute semantics without systematic bias or label noise.
  2. [Experiments section] The experimental claims of outperformance rest on standard caption metrics, but the manuscript supplies no statistical significance tests, error bars, cross-validation details, or ablation studies on the contribution of the attention or transfer components.
  3. [Evaluation criteria paragraph] Suitability of BLEU/CIDEr-style metrics for evaluating aesthetic-attribute captions (as opposed to general scene descriptions) is not justified or compared against attribute-specific alternatives.
minor comments (2)
  1. [Model section] Notation for the five attributes and the precise form of the joint loss (caption + score) should be defined explicitly with equations rather than prose.
  2. [Tables and figures] Figure captions and table headers should clarify whether reported scores are on the PCCD test set, DPC-Captions, or both.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating planned revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: The construction and use of DPC-Captions is load-bearing for all training and evaluation claims, yet the manuscript provides no quantitative validation (agreement metrics, noise analysis, or held-out human verification) that the knowledge-transfer process preserves the original five-attribute semantics without systematic bias or label noise.

    Authors: We agree that quantitative validation of the knowledge-transfer process used to construct DPC-Captions would strengthen the claims. In the revised manuscript we will add agreement metrics (e.g., Cohen's kappa on a held-out subset), a noise analysis comparing transferred labels to the original PCCD annotations, and a brief discussion of potential semantic drift across the five attributes. revision: yes

  2. Referee: The experimental claims of outperformance rest on standard caption metrics, but the manuscript supplies no statistical significance tests, error bars, cross-validation details, or ablation studies on the contribution of the attention or transfer components.

    Authors: We acknowledge the need for greater statistical rigor. The revised version will include paired statistical significance tests on the reported metrics, error bars computed over multiple random seeds, details of the train/validation splits, and ablation studies isolating the attention mechanism and the mixed fully/weakly supervised training regime. revision: yes

  3. Referee: Suitability of BLEU/CIDEr-style metrics for evaluating aesthetic-attribute captions (as opposed to general scene descriptions) is not justified or compared against attribute-specific alternatives.

    Authors: We will add an explicit justification paragraph noting that BLEU and CIDEr remain the de-facto standards for comparing against the CNN-LSTM and SCA-CNN baselines in the captioning literature; we will also discuss their limitations for attribute-specific text and note that attribute-specific alternatives (e.g., attribute-wise precision) could be explored in future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; conventional supervised training on external datasets

full rationale

The paper's core claims rest on training the AMAN model via standard supervised learning on the PCCD dataset and the derived DPC-Captions dataset, then evaluating with conventional image captioning metrics against baselines like CNN-LSTM and SCA-CNN. No equations, predictions, or uniqueness claims reduce reported performance to quantities defined by the fitted parameters themselves. The knowledge transfer step for creating DPC-Captions is a preprocessing choice whose fidelity is an external validity concern rather than a self-referential derivation. Any self-citations to prior dataset work are not load-bearing for the performance results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The central claim depends on the correctness of the knowledge-transfer labeling process for DPC-Captions and on the assumption that standard captioning metrics are appropriate proxies for aesthetic attribute quality. No explicit free parameters or invented physical entities are introduced beyond the new dataset and model architecture.

invented entities (1)
  • DPC-Captions dataset no independent evidence
    purpose: Large-scale weakly annotated training data for the five aesthetic attributes
    Constructed by knowledge transfer from the smaller PCCD dataset; no independent verification of label quality is described.

pith-pipeline@v0.9.0 · 5768 in / 1182 out tokens · 19873 ms · 2026-05-24T23:31:53.817051+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

  1. [1]

    Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic Propositional Image Caption Evaluation. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V (Lecture Notes in Computer Science) , Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (Eds.), V...

  2. [2]

    Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  3. [3]

    Jyoti Aneja, Aditya Deshpande, and Alexander G. Schwing. 2018. Convolu- tional Image Captioning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  4. [4]

    Kuang-Yu Chang, Kung-Hung Lu, and Chu-Song Chen. 2017. Aesthetic Critiques Generation for Photos. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017 . IEEE Computer Society, 3534–3543. https://doi.org/10.1109/ICCV.2017.380

  5. [5]

    Fuhai Chen, Rongrong Ji, Xiaoshuai Sun, Yongjian Wu, and Jinsong Su. 2018. GroupCap: Group-Based Image Captioning With Structured Relevance and Di- versity Constraints. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  6. [6]

    Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. 2017. SCA-CNN: Spatial and Channel-Wise Attention in Convo- lutional Networks for Image Captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017 . 6298–6306. https://doi.org/10.1109/CVPR.2017.667

  7. [7]

    Xiaowu Chen, Xin Jin, Hongyu Wu, and Qinping Zhao. 2015. Learning Templates for Artistic Portrait Lighting Analysis. IEEE Trans. Image Processing 24, 2 (2015), 608–618

  8. [8]

    C. Cui, H. Liu, T. Lian, L. Nie, L. Zhu, and Y. Yin. 2018. Distribution-oriented Aesthetics Assessment with Semantic-Aware Hybrid Network. IEEE Transactions on Multimedia (2018), 1–1. https://doi.org/10.1109/TMM.2018.2875357

  9. [9]

    Yubin Deng, Chen Change Loy, and Xiaoou Tang. 2017. Image Aesthetic Assess- ment: An experimental survey. IEEE Signal Process. Mag. 34, 4 (2017), 80–106. https://doi.org/10.1109/MSP.2017.2696576

  10. [10]

    Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Trevor Darrell. 2017. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Trans. Pattern Anal. Mach. Intell. 39, 4 (2017), 677–691. https://doi.org/10.1109/TPAMI. 2016.2599174

  11. [11]

    Zhe Dong and Xinmei Tian. 2015. Multi-level photo quality assessment with multi-view features. Neurocomputing 168 (2015), 308–319

  12. [12]

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR. IEEE Computer Society, 770–778

  13. [13]

    Weinberger

    Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger

  14. [14]

    In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017

    Densely Connected Convolutional Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. 2261–2269. https://doi.org/10.1109/CVPR.2017.243

  15. [15]

    X. Jin, J. Chi, S. Peng, Y. Tian, C. Ye, and X. Li. 2016. Deep Image Aesthetics Classification using Inception Modules and Fine-tuning Connected Layer. In The 8th International Conference on Wireless Communications and Signal Processing (WCSP). 1–6

  16. [16]

    Xin Jin, Le Wu, Xiaodong Li, Siyu Chen, Siwei Peng, Jingying Chi, Shiming Ge, Chenggen Song, and Geng Zhao. 2018. Predicting Aesthetic Score Distribution Through Cumulative Jensen-Shannon Divergence. In Proceedings of the Thirty- Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018. https://www.aaai.org/ocs/in...

  17. [17]

    Xin Jin, Mingtian Zhao, Xiaowu Chen, Qinping Zhao, and Song Chun Zhu. 2010. Learning Artistic Lighting Template from Portrait Photographs. In Computer Vision - ECCV 2010, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV . 101–114

  18. [18]

    Yueying Kao, Ran He, and Kaiqi Huang. 2017. Deep Aesthetic Quality Assessment With Semantic Information. IEEE Trans. Image Processing 26, 3 (2017), 1482–1495. https://doi.org/10.1109/TIP.2017.2651399

  19. [19]

    Yueying Kao, Kaiqi Huang, and Steve J. Maybank. 2016. Hierarchical aesthetic quality assessment using deep convolutional neural networks. Sig. Proc.: Image Comm. 47 (2016), 500–510. https://doi.org/10.1016/j.image.2016.05.004

  20. [20]

    Andrej Karpathy and Li Fei-Fei. 2017. Deep Visual-Semantic Alignments for Generating Image Descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 39, 4 (2017), 664–676. https://doi.org/10.1109/TPAMI.2016.2598339

  21. [21]

    Shu Kong, Xiaohui Shen, Zhe Lin, Radomir Mech, and Charless Fowlkes. 2016. Photo Aesthetics Ranking Network with Attributes and Content Adaptation. In European Conference on Computer Vision (ECCV)

  22. [22]

    Xin Lu, Zhe Lin, Hailin Jin, Jianchao Yang, and James Zijun Wang. 2014. RAPID: Rating Pictorial Aesthetics using Deep Learning. In Proceedings of the ACM International Conference on Multimedia, MM’14, Orlando, FL, USA, November 03 - 07, 2014. 457–466

  23. [23]

    Ruotian Luo, Brian Price, Scott Cohen, and Gregory Shakhnarovich. 2018. Dis- criminability Objective for Training Descriptive Captions. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  24. [24]

    Shuang Ma, Jing Liu, and Chang Wen Chen. 2017. A-Lamp: Adaptive Layout- Aware Multi-patch Deep Convolutional Neural Network for Photo Aesthetic Assessment. In CVPR. IEEE Computer Society, 722–731

  25. [25]

    Long Mai, Hailin Jin, and Feng Liu. 2016. Composition-Preserving Deep Photo Aesthetics Assessment. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  26. [26]

    Yuille, and Kevin Murphy

    Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. 2016. Generation and Comprehension of Unambiguous Object Descriptions. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016 . 11–20. https: //doi.org/10.1109/CVPR.2016.9

  27. [27]

    Alexander Mathews, Lexing Xie, and Xuming He. 2018. SemStyle: Learning to Generate Stylised Image Captions Using Unaligned Text. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  28. [28]

    Naila Murray, Luca Marchesotti, and Florent Perronnin. 2012. AVA: A large-scale database for aesthetic visual analysis. In IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012 . 2408–2415

  29. [29]

    Hossein Talebi and Peyman Milanfar. 2018. NIMA: Neural Image Assessment. IEEE Trans. Image Processing 27, 8 (2018), 3998–4011. https://doi.org/10.1109/TIP. 2018.2831899

  30. [30]

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. InIEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015 . 3156–3164. https://doi.org/10.1109/CVPR.2015.7298935

  31. [31]

    Wenshan Wang, Su Yang, Weishan Zhang, and Jiulong Zhang. 2018. Neural Aesthetic Image Reviewer. CoRR abs/1802.10240 (2018). arXiv:1802.10240 http: //arxiv.org/abs/1802.10240

  32. [32]

    Weining Wang, Mingquan Zhao, Li Wang, Jiexiong Huang, Chengjia Cai, and Xiangmin Xu. 2016. A multi-scene deep learning model for image aesthetic evaluation. Sig. Proc.: Image Comm. 47 (2016), 511–518

  33. [33]

    Ye Zhou, Xin Lu, Junping Zhang, and James Z. Wang. 2016. Joint Image and Text Representation for Aesthetics Analysis. InProceedings of the 2016 ACM Conference on Multimedia Conference, MM 2016, Amsterdam, The Netherlands, October 15-19, 2016, Alan Hanjalic, Cees Snoek, Marcel Worring, Dick C. A. Bulterman, Benoit Huet, Aisling Kelliher, Yiannis Kompatsiar...