Slim-CNN: A Light-Weight CNN for Face Attribute Prediction
Pith reviewed 2026-05-25 09:52 UTC · model grok-4.3
The pith
Slim Modules built from depthwise separable and pointwise convolutions let Slim-Net reach 91.24 percent accuracy on face attribute prediction with at least 25 times fewer parameters than prior models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Slim Modules are constructed by assembling depthwise separable convolutions with pointwise convolution to produce a computationally efficient module, and stacking these modules yields Slim-Net which achieves an accuracy of 91.24 percent on the CelebA dataset with at least 25 times fewer parameters than comparably performing methods, reducing the memory storage requirement by at least 87 percent.
What carries the argument
Slim Module, a micro-architecture assembled from depthwise separable convolutions and pointwise convolutions that reduces computational cost while preserving the ability to extract features for face attributes.
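As a rough illustration of why this construction saves parameters, the sketch below compares the weight count of a standard convolution with that of a generic depthwise separable convolution (a depthwise k x k filter per input channel followed by a 1 x 1 pointwise convolution). The channel and kernel sizes are illustrative assumptions, and this is not the exact layout of a Slim Module, which the summary above does not specify.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise convolution followed by a
    1 x 1 pointwise convolution (bias omitted)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative sizes only; these are assumptions, not values from the paper.
c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(std, sep, round(std / sep, 2))  # 73728 8768 8.41
```

For a single layer of this size the saving is roughly 8x; the larger network-level factor reported for Slim-Net would have to come from the overall design, not from this one substitution alone.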
If this is right
- Slim-Net becomes suitable for mobile and embedded applications because of its low memory footprint.
- The stacked modules maintain very high accuracy even when input images show large variations in pose, background, illumination, and dataset imbalance.
- Memory storage drops by at least 87 percent relative to comparably accurate networks.
- Parameter count falls by a factor of at least 25 while accuracy stays at 91.24 percent on CelebA.
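The two efficiency figures above can be related with simple arithmetic. The sketch below converts between a reduction factor and a percentage saving, assuming that storage scales linearly with parameter count at a fixed numeric precision (an assumption, not something the summary states). The function names are hypothetical.

```python
def reduction_percent(factor):
    """Percentage saved when a quantity shrinks by `factor`."""
    return 100.0 * (1.0 - 1.0 / factor)


def factor_from_percent(percent):
    """Shrink factor that corresponds to a given percentage saving."""
    return 100.0 / (100.0 - percent)


# A 25x reduction corresponds to a saving of about 96 percent.
print(round(reduction_percent(25), 2))
# An 87 percent saving corresponds to a factor of about 7.7.
print(round(factor_from_percent(87), 2))
```

Under that assumption the 25x and 87 percent lower bounds do not describe the same ratio, so they may refer to different baselines or to storage formats that do not scale one-to-one with parameter count.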
Where Pith is reading between the lines
- The same module construction could be tested on related tasks such as facial landmark detection or expression recognition on edge hardware.
- Combining Slim Modules with further quantization might push the same accuracy onto even smaller microcontrollers.
- The efficiency margin suggests it is feasible to run several attribute predictors in parallel on a single mobile device without exceeding typical RAM limits.
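To make the quantization and RAM-budget points concrete, here is a minimal sketch of how weight storage changes with numeric precision. The parameter count is a placeholder chosen for illustration, not Slim-Net's reported size, and the estimate covers weights only (no activations or runtime overhead).

```python
# Bytes needed to store one weight at common precisions.
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def model_size_mib(num_params, dtype):
    """Approximate weight storage in MiB for a given precision."""
    return num_params * BYTES_PER_WEIGHT[dtype] / (1024 ** 2)


# Placeholder count for illustration; NOT the paper's reported figure.
hypothetical_params = 1_000_000
for dtype in BYTES_PER_WEIGHT:
    print(dtype, round(model_size_mib(hypothetical_params, dtype), 2))
```

Moving from float32 to int8 cuts weight storage by a factor of four in this estimate; whether accuracy survives that step is an empirical question the summary does not address.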
Load-bearing premise
The design of Slim Modules assembled from depthwise separable and pointwise convolutions will maintain very high accuracy on face attribute prediction despite large variations in pose, background, illumination, and dataset imbalance.
What would settle it
Running Slim-Net on the CelebA test set or an equivalent balanced evaluation and measuring accuracy below 85 percent at the reported parameter count would falsify the claim that the modules preserve high accuracy under the stated efficiency constraints.
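A check of this kind could be scored as follows. The sketch assumes that "accuracy" means per-attribute binary accuracy averaged over all attributes, which is a common convention for CelebA's 40 binary attributes; the function name and the toy data are hypothetical.

```python
def mean_attribute_accuracy(predictions, labels):
    """Return (mean accuracy, per-attribute accuracies) for binary
    multi-label data. Each argument is a list with one inner list of
    0/1 values per image, all of the same length."""
    num_images = len(labels)
    num_attrs = len(labels[0])
    per_attr = []
    for a in range(num_attrs):
        correct = sum(1 for p, y in zip(predictions, labels) if p[a] == y[a])
        per_attr.append(correct / num_images)
    return sum(per_attr) / num_attrs, per_attr


# Toy data: 4 images, 3 attributes (made up for illustration).
preds = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]]
mean_acc, per_attr = mean_attribute_accuracy(preds, truth)
print(mean_acc, per_attr)  # 0.75 [1.0, 0.75, 0.5]
print(mean_acc < 0.85)     # True: this toy result would fall below the threshold
```

Reporting the per-attribute values alongside the mean also shows whether a high average hides weak performance on rare, imbalanced attributes.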
Original abstract
We introduce a computationally efficient CNN micro-architecture, the Slim Module, to design a lightweight deep neural network, Slim-Net, for face attribute prediction. Slim Modules are constructed by assembling depthwise separable convolutions with pointwise convolution to produce a computationally efficient module. The problem of facial attribute prediction is challenging because of the large variations in pose, background, illumination, and dataset imbalance. We stack these Slim Modules to devise a compact CNN which still maintains very high accuracy. Additionally, the neural network has a very low memory footprint which makes it suitable for mobile and embedded applications. Experiments on the CelebA dataset show that Slim-Net achieves an accuracy of 91.24% with at least 25 times fewer parameters than comparably performing methods, which reduces the memory storage requirement of Slim-Net by at least 87%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a Slim Module micro-architecture assembled from depthwise separable convolutions and pointwise convolutions, which are stacked to form the lightweight Slim-Net CNN for face attribute prediction. The central empirical claim is that Slim-Net attains 91.24% accuracy on CelebA while using at least 25 times fewer parameters than comparably performing methods, thereby reducing memory storage by at least 87%.
Significance. If the reported accuracy and parameter counts are supported by reproducible experiments with standard baselines and proper controls, the work supplies a practical, low-footprint model suitable for mobile and embedded face-attribute applications. The architecture re-uses the established depthwise-separable pattern rather than introducing new theoretical machinery, so its value rests on the strength of the empirical comparison.
major comments (1)
- [Abstract] The central performance numbers (91.24% accuracy, ≥25× parameter reduction) are stated without any accompanying experimental protocol, baseline list, dataset split, error bars, or ablation study. Because the load-bearing claim is purely empirical, this omission prevents verification that the result is not affected by post-hoc choices or non-standard evaluation.
minor comments (1)
- [Abstract] The description of how Slim Modules are assembled from depthwise separable and pointwise convolutions is repeated in nearly identical wording in the abstract and introduction; a single concise definition would improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive suggestion regarding the abstract. We address the comment point-by-point below and will incorporate revisions to strengthen the manuscript.
- Referee: [Abstract] The central performance numbers (91.24% accuracy, ≥25× parameter reduction) are stated without any accompanying experimental protocol, baseline list, dataset split, error bars, or ablation study. Because the load-bearing claim is purely empirical, this omission prevents verification that the result is not affected by post-hoc choices or non-standard evaluation.
  Authors: We agree that the abstract, because of its length constraints, does not include the full experimental protocol. The manuscript body (Section 4) details the CelebA dataset, the standard 80/20 train/test split used in prior work, the list of baselines (e.g., MobileNet, ResNet variants, and other face attribute models), mean accuracy across attributes, and direct parameter comparisons. No error bars or full ablation study appear in the current version. We will revise the abstract to briefly state the dataset, reference the experimental section for protocol and baselines, and note that results follow the standard CelebA evaluation protocol from prior literature. We will also add a short ablation paragraph in the experiments section if space permits.
  Revision: yes
Circularity Check
No significant circularity; empirical result only
The paper introduces Slim Modules by assembling standard depthwise separable and pointwise convolutions, then reports experimental accuracy on CelebA. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear. The central claim (91.24% accuracy with 25x fewer parameters) is presented as a direct experimental outcome rather than a quantity derived from the architecture definition itself. The design re-uses an existing pattern without claiming uniqueness via prior self-work.
Axiom & Free-Parameter Ledger
invented entities (1)
- Slim Module: no independent evidence