pith. sign in

arxiv: 1907.02157 · v1 · pith:KGZQUPQ7new · submitted 2019-07-03 · 💻 cs.CV

Slim-CNN: A Light-Weight CNN for Face Attribute Prediction

Pith reviewed 2026-05-25 09:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords face attribute predictionlightweight CNNdepthwise separable convolutionpointwise convolutionCelebA datasetSlim ModuleSlim-Netmobile applications
0
0 comments X

The pith

Slim Modules built from depthwise separable and pointwise convolutions let Slim-Net reach 91.24 percent accuracy on face attribute prediction with at least 25 times fewer parameters than prior models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a micro-architecture called the Slim Module that combines depthwise separable convolutions with pointwise convolutions to form the building blocks of a compact network named Slim-Net. This design targets face attribute prediction, a task made difficult by variations in pose, background, illumination, and class imbalance. Stacking the modules produces a network that keeps accuracy high while shrinking the parameter count and memory footprint enough for mobile and embedded hardware. A sympathetic reader would care because the result shows how to run reliable face analysis on devices that cannot store or run full-scale CNNs. Experiments on CelebA confirm the efficiency claim by reporting the stated accuracy alongside the parameter and storage reductions.

Core claim

Slim Modules are constructed by assembling depthwise separable convolutions with pointwise convolution to produce a computationally efficient module, and stacking these modules yields Slim-Net which achieves an accuracy of 91.24 percent on the CelebA dataset with at least 25 times fewer parameters than comparably performing methods, reducing the memory storage requirement by at least 87 percent.

What carries the argument

Slim Module, a micro-architecture assembled from depthwise separable convolutions and pointwise convolutions that reduces computational cost while preserving the ability to extract features for face attributes.

If this is right

  • Slim-Net becomes suitable for mobile and embedded applications because of its low memory footprint.
  • The stacked modules maintain very high accuracy even when input images show large variations in pose, background, illumination, and dataset imbalance.
  • Memory storage drops by at least 87 percent relative to comparably accurate networks.
  • Parameter count falls by a factor of at least 25 while accuracy stays at 91.24 percent on CelebA.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same module construction could be tested on related tasks such as facial landmark detection or expression recognition on edge hardware.
  • Combining Slim Modules with further quantization might push the same accuracy onto even smaller microcontrollers.
  • The efficiency margin suggests it is feasible to run several attribute predictors in parallel on a single mobile device without exceeding typical RAM limits.

Load-bearing premise

The design of Slim Modules assembled from depthwise separable and pointwise convolutions will maintain very high accuracy on face attribute prediction despite large variations in pose, background, illumination, and dataset imbalance.

What would settle it

Running Slim-Net on the CelebA test set or an equivalent balanced evaluation and measuring accuracy below 85 percent at the reported parameter count would falsify the claim that the modules preserve high accuracy under the stated efficiency constraints.

Figures

Figures reproduced from arXiv: 1907.02157 by Ankit Sharma, Hassan Foroosh.

Figure 1
Figure 1. Figure 1: Facial Attribute Prediction: Given an image, the task is to detect the presence or [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Slim Module: The micro-architecture shown on the left is the Separable Squeeze [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Full Architecture of our Slim-CNN. The network consists of 4 stacked Slim Modules. Each module is followed by a max-pooling layer. For each Slim Module, the value of filter dimensions for different layers are shown in brackets in the following order: Squeeze, Expand and last 3x3 depthwise separable layer. compared to fully-connected layers of size 512 and 1024. The Global Average Layer is preferred as it b… view at source ↗
read the original abstract

We introduce a computationally-efficient CNN micro-architecture Slim Module to design a lightweight deep neural network Slim-Net for face attribute prediction. Slim Modules are constructed by assembling depthwise separable convolutions with pointwise convolution to produce a computationally efficient module. The problem of facial attribute prediction is challenging because of the large variations in pose, background, illumination, and dataset imbalance. We stack these Slim Modules to devise a compact CNN which still maintains very high accuracy. Additionally, the neural network has a very low memory footprint which makes it suitable for mobile and embedded applications. Experiments on the CelebA dataset show that Slim-Net achieves an accuracy of 91.24% with at least 25 times fewer parameters than comparably performing methods, which reduces the memory storage requirement of Slim-net by at least 87%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces a Slim Module micro-architecture assembled from depthwise separable convolutions and pointwise convolutions, which are stacked to form the lightweight Slim-Net CNN for face attribute prediction. The central empirical claim is that Slim-Net attains 91.24% accuracy on CelebA while using at least 25 times fewer parameters than comparably performing methods, thereby reducing memory storage by at least 87%.

Significance. If the reported accuracy and parameter counts are supported by reproducible experiments with standard baselines and proper controls, the work supplies a practical, low-footprint model suitable for mobile and embedded face-attribute applications. The architecture re-uses the established depthwise-separable pattern rather than introducing new theoretical machinery, so its value rests on the strength of the empirical comparison.

major comments (1)
  1. [Abstract] Abstract: the central performance numbers (91.24% accuracy, ≥25× parameter reduction) are stated without any accompanying experimental protocol, baseline list, dataset split, error bars, or ablation study. Because the load-bearing claim is purely empirical, this omission prevents verification that the result is not affected by post-hoc choices or non-standard evaluation.
minor comments (1)
  1. [Abstract] The description of how Slim Modules are assembled from depthwise separable and pointwise convolutions is repeated in nearly identical wording in the abstract and introduction; a single concise definition would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and constructive suggestion regarding the abstract. We address the comment point-by-point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central performance numbers (91.24% accuracy, ≥25× parameter reduction) are stated without any accompanying experimental protocol, baseline list, dataset split, error bars, or ablation study. Because the load-bearing claim is purely empirical, this omission prevents verification that the result is not affected by post-hoc choices or non-standard evaluation.

    Authors: We agree that the abstract, due to its length constraints, does not include the full experimental protocol. The manuscript body (Section 4) details the CelebA dataset, standard 80/20 train/test split used in prior work, the list of baselines (e.g., MobileNet, ResNet variants, and other face attribute models), mean accuracy across attributes, and direct parameter comparisons. No error bars or full ablation study appear in the current version. We will revise the abstract to briefly state the dataset, reference the experimental section for protocol and baselines, and note that results follow the standard CelebA evaluation protocol from prior literature. We will also add a short ablation paragraph in the experiments section if space permits. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical result only

full rationale

The paper introduces Slim Modules by assembling standard depthwise separable and pointwise convolutions, then reports experimental accuracy on CelebA. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear. The central claim (91.24% accuracy with 25x fewer parameters) is presented as a direct experimental outcome rather than a quantity derived from the architecture definition itself. The design re-uses an existing pattern without claiming uniqueness via prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review provides no equations, training details, or architectural hyperparameters; the Slim Module is the primary invented component whose independent evidence cannot be assessed.

invented entities (1)
  • Slim Module no independent evidence
    purpose: Computationally efficient CNN micro-architecture assembled from depthwise separable convolutions and pointwise convolution
    Presented in the abstract as the core new building block for Slim-Net.

pith-pipeline@v0.9.0 · 5658 in / 1322 out tokens · 43948 ms · 2026-05-25T09:52:17.021459+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 7 internal anchors

  1. [1]

    Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. CoRR, abs/1606.00915, 2016

  2. [2]

    Understanding the difficulty of training deep feed- forward neural networks

    Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feed- forward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010

  3. [3]

    Manuel Günther, Andras Rozsa, and Terrance E. Boult. AFFACT - alignment free facial attribute classification technique. CoRR, abs/1611.06158, 2016. URL http: //arxiv.org/abs/1611.06158

  4. [4]

    Heterogeneous face attribute estimation: A deep multi-task learning approach

    Hu Han, Anil K Jain, Fang Wang, Shiguang Shan, and Xilin Chen. Heterogeneous face attribute estimation: A deep multi-task learning approach. IEEE transactions on pattern analysis and machine intelligence, 40(11):2597–2609, 2018. 10 SLIM-CNN: LIGHT-WEIGHT CNN

  5. [5]

    Attributes for improved attributes: A multi-task network utilizing implicit and explicit relationships for facial attribute classification

    Emily M Hand and Rama Chellappa. Attributes for improved attributes: A multi-task network utilizing implicit and explicit relationships for facial attribute classification. In AAAI, pages 4068–4074, 2017

  6. [6]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  7. [7]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Com- puter Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017

  8. [8]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolu- tional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017

  9. [9]

    SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

    Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parame- ters and <1mb model size. CoRR, abs/1602.07360, 2016

  10. [10]

    Imagenet classification with deep convo- lutional neural

    Alex Krizhevsky, I Sutskever, and G Hinton. Imagenet classification with deep convo- lutional neural. In Neural Information Processing Systems, pages 1–9, 2014

  11. [11]

    Attribute and simile classifiers for face verification

    Neeraj Kumar, Alexander C Berg, Peter N Belhumeur, and Shree K Nayar. Attribute and simile classifiers for face verification. InComputer Vision, 2009 IEEE 12th Inter- national Conference on, pages 365–372. IEEE, 2009

  12. [12]

    Network In Network

    Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013

  13. [13]

    Reed, Cheng-Yang Fu, and Alexander C

    Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox detector. In Com- puter Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, pages 21–37, 2016

  14. [14]

    Deep learning face attributes in the wild

    Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015

  15. [15]

    Moon: A mixed objective optimization network for the recognition of facial attributes

    Ethan M Rudd, Manuel Günther, and Terrance E Boult. Moon: A mixed objective optimization network for the recognition of facial attributes. In European Conference on Computer Vision, pages 19–35. Springer, 2016

  16. [16]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large- scale image recognition. CoRR, abs/1409.1556, 2014

  17. [17]

    Training very deep networks

    Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Advances in Neural Information Processing Systems 28: Annual Confer- ence on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2377–2385, 2015. SLIM-CNN: LIGHT-WEIGHT CNN 11

  18. [18]

    Going deeper with convolutions

    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pat- tern recognition, pages 1–9, 2015

  19. [19]

    Aggregated residual transformations for deep neural networks

    Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017

  20. [20]

    Learning Face Representation from Scratch

    Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014

  21. [21]

    Shufflenet: An extremely efficient convolutional neural network for mobile devices

    Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6848–6856, 2018

  22. [22]

    Position-squeeze and excitation block for facial attribute analysis

    Yan Zhang, Wanxia Shen, Li Sun, and Qingli Li. Position-squeeze and excitation block for facial attribute analysis. In British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018, page 279, 2018

  23. [23]

    Leveraging mid-level deep representa- tions for predicting face attributes in the wild

    Yang Zhong, Josephine Sullivan, and Haibo Li. Leveraging mid-level deep representa- tions for predicting face attributes in the wild. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 3239–3243. IEEE, 2016