Image Aesthetics Assessment using Multi Channel Convolutional Neural Networks

Gitam Shikhenawis; Nishi Doshi; Suman K Mitra

arxiv: 1911.09301 · v1 · submitted 2019-11-21 · 💻 cs.CV

Pith reviewed 2026-05-24 15:58 UTC · model grok-4.3

classification 💻 cs.CV

keywords image aesthetics assessment · multi-channel CNN · saliency maps · AVA database · image classification · deep convolutional networks · aesthetic quality · crops

The pith

A multi-channel CNN that feeds raw images, crops, and saliency maps together improves binary aesthetics classification accuracy on the AVA dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a convolutional neural network architecture for deciding whether a photograph has high or low aesthetic quality. Rather than passing only the original image, the network receives multiple parallel inputs consisting of the full image, several cropped versions, and a saliency map. These inputs are processed through separate channels that are later combined. Experiments on the widely used AVA collection of human-rated images report higher accuracy than earlier single-input methods. Accurate automatic assessment could support photo selection tools and editing applications that rely on visual appeal judgments.
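A minimal sketch of how cropped inputs might be prepared, assuming the image is a NumPy array in height × width × channel layout. The crop size, the crop count, and the helper name `random_crops` are illustrative assumptions; this summary does not say how the paper selects its crops.

```python
import numpy as np

def random_crops(image, crop_size, num_crops, seed=0):
    """Return `num_crops` random square crops of side `crop_size` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    height, width = image.shape[:2]
    if crop_size > min(height, width):
        raise ValueError("crop_size must fit inside the image")
    crops = []
    for _ in range(num_crops):
        # Upper bound is exclusive, so the crop always stays inside the image.
        top = int(rng.integers(0, height - crop_size + 1))
        left = int(rng.integers(0, width - crop_size + 1))
        crops.append(image[top:top + crop_size, left:left + crop_size])
    return np.stack(crops)

# Synthetic 64 x 48 RGB image standing in for a photograph.
image = np.arange(64 * 48 * 3, dtype=np.float32).reshape(64, 48, 3)
crops = random_crops(image, crop_size=32, num_crops=4)
print(crops.shape)  # (4, 32, 32, 3)
```

Fixing the seed keeps the crops reproducible, which matters if different model variants are to be compared on identical inputs.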

Core claim

The authors state that a multi-channel CNN architecture, which receives the raw image together with multiple crops and the corresponding saliency map as inputs, produces higher accuracy when classifying images into high-quality and low-quality aesthetic categories than existing single-channel approaches, as measured on the AVA database.
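For readers unfamiliar with AVA, the sketch below shows one way binary labels could be derived. AVA stores, for each image, the number of votes received for every rating from 1 to 10. Thresholding the mean rating at 5 is a common convention in AVA work and is an assumption here; this summary does not state which rule the paper uses. The function names are hypothetical.

```python
def mean_score(vote_counts):
    """Mean rating from a 10-bin histogram; vote_counts[i] is the vote count for rating i + 1."""
    if len(vote_counts) != 10:
        raise ValueError("expected one vote count per rating 1..10")
    total = sum(vote_counts)
    if total == 0:
        raise ValueError("image has no votes")
    return sum((i + 1) * n for i, n in enumerate(vote_counts)) / total

def binary_label(vote_counts, threshold=5.0):
    """1 = high quality, 0 = low quality. The threshold of 5 is an assumption."""
    return 1 if mean_score(vote_counts) > threshold else 0

# Made-up vote histograms for two images.
votes_a = [0, 1, 2, 5, 20, 40, 30, 10, 3, 1]
votes_b = [5, 10, 30, 40, 20, 5, 2, 0, 0, 0]
print(binary_label(votes_a), binary_label(votes_b))  # 1 0
```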

What carries the argument

Multi-channel CNN that processes the original image, image crops, and saliency map through parallel input channels before combining features for binary classification.
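The idea of parallel columns with late feature fusion can be sketched in a few lines of NumPy. This is not the paper's network: each column here is a single 1×1 convolution with ReLU and global average pooling, standing in for a deep CNN column, and all weights are random. The names `column_features` and `multi_column_forward` and the feature width of 8 are assumptions made for illustration.

```python
import numpy as np

def column_features(x, weights):
    """One column: 1x1 convolution + ReLU + global average pooling.

    x: (H, W, C_in), weights: (C_in, C_out). A stand-in for a deep CNN column.
    """
    activations = np.maximum(x @ weights, 0.0)  # (H, W, C_out)
    return activations.mean(axis=(0, 1))        # (C_out,)

def multi_column_forward(inputs, column_weights, classifier_w, classifier_b):
    """Concatenate per-column features, then apply a logistic classifier."""
    features = np.concatenate(
        [column_features(x, w) for x, w in zip(inputs, column_weights)]
    )
    logit = features @ classifier_w + classifier_b
    return 1.0 / (1.0 + np.exp(-logit))         # estimated P(high quality)

rng = np.random.default_rng(0)
raw = rng.random((64, 64, 3))        # full image
crop = raw[16:48, 16:48]             # one crop of the same image
saliency = rng.random((64, 64, 1))   # placeholder single-channel saliency map
inputs = [raw, crop, saliency]

# One untrained weight matrix per column, 8 output features each.
column_weights = [rng.normal(size=(x.shape[-1], 8)) for x in inputs]
classifier_w = rng.normal(size=8 * len(inputs))
prob = multi_column_forward(inputs, column_weights, classifier_w, 0.0)
print(float(prob))
```

Global pooling lets columns accept inputs of different spatial sizes, which is why the crop and the full image can share one fusion layer in this sketch.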

If this is right

The multi-channel inputs allow the network to combine global composition with local detail and attention information for the classification decision.
Binary high/low quality labeling on AVA becomes more accurate than prior single-input CNN baselines.
The same input strategy could be applied to other image rating tasks that benefit from both full-frame and localized visual cues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Controlled ablation studies that isolate each added channel would make the source of the gain clearer to future readers.
Applying the architecture to datasets with finer-grained aesthetic scores rather than binary labels could test whether the same inputs remain useful.

Load-bearing premise

The performance gain on AVA stems from the multi-channel design itself rather than from differences in training procedure, data splits, or hyper-parameters that are not fully specified.

What would settle it

Training and evaluating a single-channel CNN using exactly the same data splits, augmentation, and optimization settings as the multi-channel model and obtaining equal or higher accuracy on AVA would show that the channel structure is not responsible for the reported improvement.
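A controlled comparison of this kind could be organised as a grid of runs that differ only in which input channels are enabled. The sketch below enumerates such runs; the configuration values are placeholders rather than the paper's settings, and `ablation_runs` is a hypothetical helper.

```python
from itertools import combinations

CHANNELS = ("raw", "crops", "saliency")

# Hypothetical shared settings; the values are placeholders, not the paper's.
SHARED_CONFIG = {
    "split_seed": 0,
    "epochs": 20,
    "learning_rate": 1e-3,
    "batch_size": 32,
}

def ablation_runs(channels=CHANNELS, shared=SHARED_CONFIG):
    """One run per non-empty channel subset; everything else is held fixed."""
    runs = []
    for k in range(1, len(channels) + 1):
        for subset in combinations(channels, k):
            runs.append({"channels": subset, **shared})
    return runs

runs = ablation_runs()
for run in runs:
    print(run["channels"])
print(len(runs))  # 7
```

Because every run shares the same split seed and optimization settings, any accuracy difference between the single-channel and multi-channel rows can be attributed to the inputs alone.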

Figures

Figures reproduced from arXiv: 1911.09301 by Gitam Shikhenawis, Nishi Doshi, Suman K Mitra.

**Figure 2.** High quality images.

**Figure 3.** AlexNet Convolutional Neural Network architecture. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

| Block number | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Number of convolutional layers | 2 | 2 | 4 | 4 | 4 |

**Figure 4.** In order from left: original high quality image taken from the PhotoQuality Dataset [5], the corresponding spectral residual saliency map, and the fine grained saliency map.
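For context on the spectral residual map named in the caption, here is a compact NumPy sketch of the general spectral residual technique: subtract a locally averaged log-amplitude spectrum from the log-amplitude spectrum, then transform back using the original phase. It is not taken from the paper, which may well use a library implementation. The box blur replaces the Gaussian smoothing of the original method as a simplification, and the function names are hypothetical.

```python
import numpy as np

def box_blur(a, k=3):
    """k x k mean filter with edge padding (k odd)."""
    pad = k // 2
    padded = np.pad(a, pad, mode="edge")
    out = np.zeros_like(a, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (k * k)

def spectral_residual_saliency(gray):
    """Saliency map in [0, 1] for a 2-D grayscale array."""
    spectrum = np.fft.fft2(gray)
    log_amplitude = np.log(np.abs(spectrum) + 1e-8)
    phase = np.angle(spectrum)
    # Spectral residual: log amplitude minus its local average.
    residual = log_amplitude - box_blur(log_amplitude, 3)
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    # Assumption: box blur instead of the Gaussian used in the original method.
    saliency = box_blur(saliency, 5)
    saliency -= saliency.min()
    peak = saliency.max()
    return saliency / peak if peak > 0 else saliency

# Synthetic test image: a bright square on a dark background.
gray = np.zeros((64, 64))
gray[28:36, 28:36] = 1.0
saliency_map = spectral_residual_saliency(gray)
print(saliency_map.shape)  # (64, 64)
```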

**Figure 5.** Triple column network design.

Original abstract

Image Aesthetics Assessment is one of the emerging domains in research. The domain deals with classification of images into categories depending on the basis of how pleasant they are for the users to watch. In this article, the focus is on categorizing the images in high quality and low quality image. Deep convolutional neural networks are used to classify the images. Instead of using just the raw image as input, different crops and saliency maps of the images are also used, as input to the proposed multi channel CNN architecture. The experiments reported on widely used AVA database show improvement in the aesthetic assessment performance over existing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The multi-channel CNN on AVA shows no isolated evidence that the architecture drives any gain over baselines.

The paper's main move is to feed raw images plus crops plus saliency maps into separate channels of a CNN for binary high/low aesthetics classification on AVA. The abstract says this beats existing approaches, but supplies no accuracy figures, no listed baselines, and no ablation results. That is the entire reported contribution on the data side. Multi-input networks and saliency maps were already routine by 2019, and CNNs had already been run on AVA, so the work is an application of known pieces rather than a new method. The architecture description itself is straightforward and uses standard components without added complexity.

The real limitation is the missing controls. Nothing in the abstract shows that training schedules, data splits, or hyperparameters were matched to the cited prior work, and there are no channel-ablation runs that hold everything else fixed. Without those, any reported lift cannot be credited to the multi-channel design. The stress-test note correctly flags this gap.

A reader who already works on aesthetics assessment might still want to see the exact numbers and training details once they appear, but the paper as described does not advance the broader literature on CNN design or evaluation practice. It is too narrow and too lightly controlled to justify referee time.

Referee Report

3 major / 1 minor

Summary. The paper proposes a multi-channel CNN architecture for binary image aesthetics assessment (high vs. low quality) that feeds the raw image, multiple crops, and saliency maps into parallel channels. It reports experiments on the AVA dataset and claims improved performance over existing approaches.

Significance. If the performance gains can be shown to arise from the multi-channel design rather than uncontrolled variables, the work would offer a modest incremental contribution to multi-input CNN designs for subjective visual tasks in computer vision. The current lack of quantitative evidence and controls prevents any stronger assessment of significance.

major comments (3)

[Abstract] The central claim that 'experiments reported on widely used AVA database show improvement in the aesthetic assessment performance over existing approaches' supplies no accuracy numbers, baseline values, metrics (e.g., accuracy, AUC), or comparison tables, so the claim cannot be evaluated.
[Experimental results] No ablation studies are described that disable individual input channels (raw image only, crops only, saliency only) while holding training schedule, data split, and hyperparameters fixed; without these, gains cannot be attributed to the multi-channel architecture.
[Method and Experiments] The manuscript provides no evidence that the cited baseline methods were re-trained or evaluated under identical data splits, optimization schedules, or hyper-parameter settings, leaving open the possibility that reported gains stem from procedural differences rather than the proposed architecture.
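The two metrics the report asks for are simple to compute from predicted scores and ground-truth labels. The sketch below uses plain Python; the labels and scores are made up for illustration and are not results from the paper.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of images whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of samples, which is
    fine for a sketch.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Made-up labels (1 = high quality) and classifier scores.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
print(round(accuracy(labels, scores), 4))  # 0.6667
print(round(roc_auc(labels, scores), 4))   # 0.6667
```

Reporting AUC alongside accuracy helps because accuracy depends on the chosen threshold and on class balance, while AUC summarises ranking quality across all thresholds.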

minor comments (1)

[Abstract] The phrasing 'categorizing the images in high quality and low quality image' is grammatically awkward and should be revised for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and agree that revisions are needed to strengthen the manuscript.

Referee: [Abstract] The central claim that 'experiments reported on widely used AVA database show improvement in the aesthetic assessment performance over existing approaches' supplies no accuracy numbers, baseline values, metrics (e.g., accuracy, AUC), or comparison tables, so the claim cannot be evaluated.

Authors: We agree that the abstract should include specific quantitative results to support the claim. In the revised manuscript we will add the key performance metrics (accuracy and AUC) along with the main baseline values and a reference to the comparison table. Revision: yes.

Referee: [Experimental results] No ablation studies are described that disable individual input channels (raw image only, crops only, saliency only) while holding training schedule, data split, and hyperparameters fixed; without these, gains cannot be attributed to the multi-channel architecture.

Authors: The referee correctly identifies the absence of controlled ablations. We will add ablation experiments that isolate each input channel while keeping the training schedule, data split, and hyperparameters identical, and report the resulting accuracies. Revision: yes.

Referee: [Method and Experiments] The manuscript provides no evidence that the cited baseline methods were re-trained or evaluated under identical data splits, optimization schedules, or hyper-parameter settings, leaving open the possibility that reported gains stem from procedural differences rather than the proposed architecture.

Authors: We acknowledge that the manuscript does not demonstrate identical evaluation conditions for the baselines. In revision we will explicitly state which baselines were re-implemented by the authors, detail the data splits and optimization settings used, and note any unavoidable differences with the original publications. Revision: yes.

Circularity Check

0 steps flagged

No circularity: purely empirical ML evaluation with no derivation chain

The paper reports experimental results from training a multi-channel CNN on the AVA dataset and comparing accuracy to prior methods. No equations, derivations, or theoretical claims are present in the provided text. Performance numbers are direct outputs of training and evaluation on held-out data; they do not reduce to any fitted parameter or self-citation by construction. Self-citations, if any, are not load-bearing for a mathematical result. This matches the default expectation for empirical ML papers.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Standard supervised CNN training on labeled images; no new entities or ad-hoc axioms beyond the usual assumption that deep networks can learn aesthetic features from the provided inputs.

free parameters (1)

network weights and biases
Learned from AVA training data; central to any performance claim.

axioms (1)

Domain assumption: convolutional layers extract useful visual features from raw pixels, crops, and saliency maps.
Invoked by the choice of multi-channel CNN architecture.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 1 internal anchor

[1] Y. Deng, C. C. Loy, and X. Tang, "Image Aesthetic Assessment: An experimental survey," in IEEE Signal Processing Magazine, 2017, pp. 80-106.

[2] G. Lihua and L. Fudi, "Image Aesthetic Evaluation using Parallelled Deep Convolution Neural Network," in 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2016.

[3] N. Murray, L. Marchesotti, and F. Perronnin, "AVA: A large-scale database for aesthetic visual analysis," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2408-2415.

[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, Neural Information Processing Systems Foundation, 2012, pp. 1097-1105.

[5] W. Luo, X. Wang, and X. Tang, "Content-based photo quality assessment," in Proc. IEEE Int. Conf. Computer Vision (ICCV), 2011, pp. 2206-2213.

[6] X. Lu, Z. Lin, H. Jin, J. Yang, and J. Z. Wang, "RAPID: Rating pictorial aesthetics using deep learning," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 457-466.

[7] Z. Wang, F. Dolcos, D. Beck, S. Chang, and T. S. Huang, "Brain-inspired deep networks for image aesthetics assessment," arXiv preprint arXiv:1601.04155, 2016.

[8] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, 2015, pp. 211-252.

[9] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition."

[10] B. Wang and P. Dudek, "A Fast Self-tuning Background Subtraction Algorithm," in Proc. IEEE Workshop on Change Detection, 2014.
