A Deep Image Compression Framework for Face Recognition

Bo Lei; Feng Liang; Haisheng Fu; Nai Bian

arxiv: 1907.01714 · v1 · pith:6XQMBMS5new · submitted 2019-07-03 · 💻 cs.CV

Nai Bian, Feng Liang, Haisheng Fu, Bo Lei

Pith reviewed 2026-05-25 10:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords face image compression · deep convolutional autoencoder · face recognition · joint optimization · LFW dataset · image reconstruction · verification accuracy · compact representation

The pith

Jointly trained deep autoencoder compression yields higher face verification accuracy on LFW than JPEG or JPEG2000.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build a compression method for face images that supports accurate face recognition better than ordinary codecs. It uses a convolutional autoencoder to turn images into compact representations that existing codecs can store, then reconstructs them, with the autoencoder and a face recognition network trained together so the reconstruction keeps identity cues intact. A reader would care because large face recognition systems face huge data volumes and need compression that does not destroy performance. The evidence comes from tests on the LFW dataset showing the jointly trained system beats JPEG2000 and greatly exceeds JPEG in verification accuracy after compression.

Core claim

The paper claims that its deep convolutional autoencoder compression network, when jointly optimized with an existing face recognition network, produces reconstructed images whose face verification accuracy on the LFW dataset exceeds that of images compressed by JPEG2000 and is substantially higher than that of images compressed by JPEG.

What carries the argument

deep convolutional autoencoder compression network jointly optimized with a face recognition network, which extracts compact features for encoding and reconstructs images tuned to preserve recognition performance

If this is right

Face recognition pipelines can store and transmit more images with less accuracy loss than with standard codecs.
The compact representation produced by the autoencoder can be saved using ordinary codecs such as PNG.
Joint training makes the reconstructed images more suitable for recognition than images from separate compression steps.
The framework achieves measurable gains on a standard benchmark dataset after compression.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same joint-training idea could be tested on other recognition tasks such as object or scene classification to see if task-specific compression generalizes.
Storage and bandwidth savings in large biometric databases would follow directly if the accuracy advantage holds at scale.
Extending the approach to video sequences of faces would require checking whether temporal consistency is preserved under the same optimization.
Different recognition network architectures could be substituted to test whether the compression benefit depends on the particular recognition model used.

Load-bearing premise

Joint optimization of the autoencoder and face recognition network will keep identity-discriminating features in the reconstructed images without adding artifacts that reduce recognition accuracy.

What would settle it

Running the LFW verification test on images compressed by the framework and finding accuracy no higher than JPEG2000 would falsify the claimed advantage.

Figures

Figures reproduced from arXiv: 1907.01714 by Bo Lei, Feng Liang, Haisheng Fu, Nai Bian.

**Figure 2.** Blocking artifacts of images compressed by JPEG at low bit rates. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png]

**Figure 3.** Example JPEG compressed image with blocking artifacts and the restored image by AR-CNN [1]. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png]

**Figure 4.** The overall structure of the compression-reconstruction-recognition framework. [PITH_FULL_IMAGE:figures/full_fig_p003_4.png]

**Figure 5.** The structure of CompNet and RecNet.

2.2.2 Quantization and entropy coding. The values of the original image are integers in [0, 255]. To train the network better, all image data are normalized to floating-point numbers in [-1, 1] before being input into the network. The values of the compact map are still floating-point numbers in [-1, 1]. In order to encode the compact representation whose values ra…

**Figure 6.** The default structure of residual block [9] and the structure of the proposed residual block. [PITH_FULL_IMAGE:figures/full_fig_p005_6.png]

**Figure 7.** The sphere20 network of Cosface. [PITH_FULL_IMAGE:figures/full_fig_p006_7.png]

2.4 Joint training of the combined network. During training process, our combined network omits the quantization and entropy coding for compact representation, due to the same reason mentioned already. As shown in…

**Figure 8.** The training network of the overall framework. [PITH_FULL_IMAGE:figures/full_fig_p006_8.png]

**Figure 9.** The effect of images restored by JPEG, JPEG2000 and our proposed network. [PITH_FULL_IMAGE:figures/full_fig_p007_9.png]

**Figure 10.** Some of the images in LFW_112×96 dataset compressed by different methods. [PITH_FULL_IMAGE:figures/full_fig_p008_10.png]
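
The excerpt from Section 2.2.2 attached to the Figure 5 caption says that pixel values in [0, 255] are normalized to [-1, 1] and that the compact map, which is also in [-1, 1], has to be encoded afterwards. The excerpt is cut off before the exact rule is stated, so the following is only a minimal sketch that assumes uniform 8-bit rounding. The function names are hypothetical and are not taken from the paper.

```python
def normalize(pixel):
    """Map an 8-bit pixel value in [0, 255] to a float in [-1, 1]."""
    return pixel / 127.5 - 1.0


def quantize(value):
    """Map a float in [-1, 1] to an integer in [0, 255].

    Uniform 8-bit rounding is an ASSUMPTION; the source text is truncated
    before it states the rule actually used.
    """
    clipped = max(-1.0, min(1.0, value))
    return int(round((clipped + 1.0) * 127.5))


def dequantize(code):
    """Inverse mapping applied before reconstruction (same as normalize)."""
    return normalize(code)


# Every 8-bit value survives the round trip, and any float in [-1, 1]
# comes back within half a quantization step (1/255).
roundtrip_ok = all(quantize(normalize(p)) == p for p in range(256))
max_error = abs(dequantize(quantize(0.3)) - 0.3)
```

Under this assumption the quantization error is bounded by 1/255 per value, which is the kind of distortion the reconstruction network would have to tolerate at test time even though, per the Section 2.4 excerpt, quantization is omitted during training.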

Original abstract

Face recognition technology has advanced rapidly and has been widely used in various applications. Due to the extremely huge amount of data of face images and the large computing resources required correspondingly in large-scale face recognition tasks, there is a requirement for a face image compression approach that is highly suitable for face recognition tasks. In this paper, we propose a deep convolutional autoencoder compression network for face recognition tasks. In the compression process, deep features are extracted from the original image by the convolutional neural networks to produce a compact representation of the original image, which is then encoded and saved by existing codec such as PNG. This compact representation is utilized by the reconstruction network to generate a reconstructed image of the original one. In order to improve the face recognition accuracy when the compression framework is used in a face recognition system, we combine this compression framework with an existing face recognition network for joint optimization. We test the proposed scheme and find that after joint training, the Labeled Faces in the Wild (LFW) dataset compressed by our compression framework has higher face verification accuracy than that compressed by JPEG2000, and is much higher than that compressed by JPEG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The joint optimization idea makes sense for task-specific compression but the LFW accuracy claims are undercut by missing bitrate controls in the comparisons.

The paper's main move is to jointly train a convolutional autoencoder with a face recognition network so that the reconstructed images keep identity-discriminating features. That is the actual novelty: a compression pipeline tuned end-to-end for the downstream verification task rather than generic image quality. The approach of extracting compact features, encoding them with PNG, and then reconstructing is straightforward and directly targets the storage problem in large-scale face systems. Credit is due for framing the problem that way and for running the joint training experiment on LFW at all.

The results directionally suggest the method beats JPEG and JPEG2000 on verification accuracy after compression. That part is at least consistent with the goal. The evaluation is the clear weak point. The abstract and claim give no bitrates or bits-per-pixel numbers for any of the methods. JPEG and JPEG2000 quality can be dialed up or down, so an un-matched comparison leaves open the possibility that the reported accuracy edge simply reflects different average file sizes rather than better feature preservation. Without rate-distortion curves or explicit matching, the central result is hard to interpret.

The paper is aimed at people working on compression for recognition or other vision tasks. A reader already thinking about task-aware codecs might pick up the joint-training trick, but the lack of controlled rate comparisons limits how far the numbers can be taken. I would not send this to peer review until the bitrate issue is fixed; the idea is worth pursuing but the evidence as written does not yet support the claims at the level needed for referee time.
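
The rate-matching concern in the letter can be made concrete. Bits per pixel is the stored file size in bits divided by the number of pixels, and two codecs are only comparable when their bpp values are close. The sketch below assumes the 112×96 image size named in the Figure 10 caption; the file sizes and the 5% tolerance are hypothetical placeholders, not values reported in the paper.

```python
def bits_per_pixel(file_size_bytes, height, width):
    """Rate of one compressed image in bits per pixel."""
    return 8.0 * file_size_bytes / (height * width)


def rates_matched(bpp_a, bpp_b, rel_tol=0.05):
    """True when two rates differ by at most rel_tol of the larger one."""
    return abs(bpp_a - bpp_b) <= rel_tol * max(bpp_a, bpp_b)


# Hypothetical file sizes for a single 112x96 face crop.
learned_bpp = bits_per_pixel(672, 112, 96)
jpeg_bpp = bits_per_pixel(1344, 112, 96)

# An accuracy comparison is only interpretable when this check passes.
comparable = rates_matched(learned_bpp, jpeg_bpp)
```

With these placeholder sizes the two rates differ by a factor of two, so any accuracy gap between them would say little about which codec preserves identity features better.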

Referee Report

3 major / 1 minor

Summary. The paper proposes a deep convolutional autoencoder compression network for face images that extracts compact deep features, encodes them with an existing codec such as PNG, and reconstructs the image. The framework is jointly optimized with an existing face recognition network to preserve identity-discriminating features. Experiments claim that LFW images compressed by this method after joint training achieve higher face verification accuracy than the same images compressed by JPEG2000 and much higher accuracy than those compressed by JPEG.

Significance. If the central empirical claim holds under bitrate-matched conditions and the joint optimization demonstrably avoids recognition-harming artifacts, the work could contribute to task-specific learned compression for recognition pipelines. The manuscript provides no indication of released code, parameter-free derivations, or machine-checked proofs.

major comments (3)

[Abstract] Abstract: the central claim that the jointly trained framework yields higher LFW verification accuracy than JPEG2000 (and much higher than JPEG) is presented without any quantitative accuracy values, standard deviations, number of pairs tested, or verification protocol details; this prevents assessment of effect size and statistical reliability.
[Abstract] Abstract (and results): the accuracy comparison with JPEG and JPEG2000 reports no bitrates, bits-per-pixel, or file-size statistics for any method; without explicit rate matching or rate-distortion curves, observed accuracy gaps could arise from unequal compression ratios rather than superior feature preservation by the autoencoder or joint training.
[Abstract] Abstract: the joint-optimization procedure is described at a high level but no loss function, weighting between reconstruction and recognition losses, or training details (e.g., which layers are frozen) are supplied, leaving the mechanism that supposedly preserves identity features unexamined.
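
The third comment asks for the form of the combined loss. The abstract does not state it, so the expression below is only an illustrative assumption of what a weighted joint objective commonly looks like, not the authors' formulation. Here $C$ and $R$ stand for the compression and reconstruction networks (CompNet and RecNet in Figure 5), $F$ for the face recognition network, $y$ for the identity label, and $\lambda$ for a hypothetical weighting coefficient.

```latex
\hat{x} = R\bigl(C(x)\bigr), \qquad
\mathcal{L}_{\text{joint}}
  = \lVert x - \hat{x} \rVert_2^2
  + \lambda \, \mathcal{L}_{\text{id}}\bigl(F(\hat{x}),\, y\bigr)
```

In a form like this, setting $\lambda = 0$ gives a purely reconstruction-driven autoencoder, which is the natural baseline to report next to the jointly trained model.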

minor comments (1)

[Abstract] The abstract states that the compact representation is 'encoded and saved by existing codec such as PNG' while comparing against lossy codecs JPEG and JPEG2000; this choice of lossless PNG for the learned representation should be clarified with respect to rate.
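
The point about PNG and rate can be illustrated with a short sketch. PNG compresses losslessly with DEFLATE, so the rate of the learned codec is set by how many bytes the quantized compact map occupies after that lossless stage, measured against the pixel count of the original image. The code uses `zlib` (also DEFLATE) as a stand-in for a PNG writer, which is an assumption made to keep the example free of imaging libraries; the function names and the example map are hypothetical.

```python
import zlib


def store_compact_map(codes):
    """Losslessly pack quantized 8-bit codes (zlib stands in for PNG)."""
    return zlib.compress(bytes(codes), level=9)


def load_compact_map(blob):
    """Recover exactly the integer codes that were stored."""
    return list(zlib.decompress(blob))


def stored_bpp(blob, image_height, image_width):
    """Rate of the stored map relative to the ORIGINAL image's pixel count."""
    return 8.0 * len(blob) / (image_height * image_width)


# Hypothetical compact map: 28x24 codes for a 112x96 input image.
example_codes = [(7 * i) % 256 for i in range(28 * 24)]
example_blob = store_compact_map(example_codes)
example_rate = stored_bpp(example_blob, 112, 96)
```

Because the lossless stage reproduces the codes exactly, all information loss happens earlier, in the network and the quantizer, while the rate still depends on how compressible the compact map turns out to be.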

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness.

Referee: [Abstract] Abstract: the central claim that the jointly trained framework yields higher LFW verification accuracy than JPEG2000 (and much higher than JPEG) is presented without any quantitative accuracy values, standard deviations, number of pairs tested, or verification protocol details; this prevents assessment of effect size and statistical reliability.

Authors: We agree that the abstract would benefit from including the quantitative results. In the revised manuscript we will update the abstract to report the specific LFW verification accuracies achieved by each method, along with the standard LFW protocol details (6000 pairs, 10-fold cross validation) and any reported standard deviations from our experiments. revision: yes
Referee: [Abstract] Abstract (and results): the accuracy comparison with JPEG and JPEG2000 reports no bitrates, bits-per-pixel, or file-size statistics for any method; without explicit rate matching or rate-distortion curves, observed accuracy gaps could arise from unequal compression ratios rather than superior feature preservation by the autoencoder or joint training.

Authors: This observation is correct and highlights an important point for fair evaluation. While the experiments compare the methods under their respective typical operating points, we will add explicit bitrate (bpp) and file-size statistics for all codecs in the revised abstract and results section, and include a rate-distortion analysis to demonstrate that the accuracy advantage holds under matched rates where possible. revision: yes
Referee: [Abstract] Abstract: the joint-optimization procedure is described at a high level but no loss function, weighting between reconstruction and recognition losses, or training details (e.g., which layers are frozen) are supplied, leaving the mechanism that supposedly preserves identity features unexamined.

Authors: We acknowledge that additional detail on the joint training would strengthen the abstract. In the revision we will expand the abstract description to include the form of the combined loss, the weighting coefficients between reconstruction and recognition terms, and the training protocol (e.g., which layers remain trainable). These details are already present in the body of the paper and will now be summarized at the abstract level as well. revision: yes
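
The first response refers to the standard LFW protocol of 6000 pairs with 10-fold cross validation. As a rough sketch of how verification accuracy is typically computed in that setting: a similarity threshold is tuned on nine folds, accuracy is measured on the held-out fold, and the ten results are averaged. The scores below are synthetic, the fold assignment is a simple interleaving rather than the official LFW splits, only 600 pairs are used to keep the run short, and the helper names are hypothetical. Nothing here reproduces the paper's numbers.

```python
import random
from statistics import mean


def accuracy(scores, labels, threshold):
    """Fraction of pairs whose thresholded score agrees with the label."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(scores)


def best_threshold(scores, labels):
    """Candidate threshold (taken from the scores) with the best accuracy."""
    return max(sorted(set(scores)), key=lambda t: accuracy(scores, labels, t))


def ten_fold_accuracy(scores, labels, folds=10):
    """Mean held-out accuracy, tuning the threshold on the other folds."""
    n = len(scores)
    results = []
    for k in range(folds):
        test_idx = set(range(k, n, folds))
        train = [(scores[i], labels[i]) for i in range(n) if i not in test_idx]
        test = [(scores[i], labels[i]) for i in sorted(test_idx)]
        threshold = best_threshold(*zip(*train))
        results.append(accuracy(*zip(*test), threshold))
    return mean(results)


# Synthetic similarity scores: label 1 = same identity, 0 = different.
random.seed(0)
pairs = [(random.gauss(0.6, 0.15), 1) for _ in range(300)]
pairs += [(random.gauss(0.2, 0.15), 0) for _ in range(300)]
random.shuffle(pairs)
demo_scores, demo_labels = zip(*pairs)
print(f"mean 10-fold accuracy: {ten_fold_accuracy(demo_scores, demo_labels):.3f}")
```

Reporting the per-fold spread alongside the mean would give the standard deviation that the referee asks for.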

Circularity Check

0 steps flagged

No circularity; empirical accuracy claim rests on external LFW benchmark testing.

The paper trains a convolutional autoencoder jointly with a face recognition network and reports higher LFW verification accuracy versus JPEG/JPEG2000 baselines. This is a standard empirical procedure whose outcome is not forced by construction: the accuracy metric is computed on held-out external data, the joint loss does not redefine any quantity in terms of itself, and no fitted parameter is relabeled as a prediction. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the derivation chain. The result is therefore self-contained against the stated benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on abstract; no explicit free parameters, axioms, or invented entities are described in the provided text.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 2 internal anchors

[1] Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang. Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE International Conference on Computer Vision, pages 576–584, 2015.

[2] Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca Benini, and Luc Van Gool. Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems, pages 1141–1151, 2017.

[3] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Generative adversarial networks for extreme learned image compression. arXiv preprint arXiv:1804.02958, 2018.

[4] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395, 2017.

[5] George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell. Full resolution image compression with recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5306–5314, 2017.

[6] Feng Jiang, Wen Tao, Shaohui Liu, Jie Ren, Xun Guo, and Debin Zhao. An end-to-end compression framework based on convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology, 28(10):3007–3018, 2017.

[7] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5265–5274, 2018.

[8] Haojie Liu, Chen Tong, Shen Qiu, Yue Tao, and Ma Zhan. Deep image compression via end-to-end learning. arXiv preprint arXiv:1806.01496, 2018.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[10] Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. Normface: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1041–1049. ACM, 2017.

[11] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 212–220, 2017.

[12] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.

[13] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
