AugLabel: Exploiting Word Representations to Augment Labels for Face Attribute Classification

Binod Bhattarai; Rumeysa Bodur; Tae-Kyun Kim

arxiv: 1907.06757 · v1 · pith:M3VLOI6Cnew · submitted 2019-07-15 · 💻 cs.CV

Pith reviewed 2026-05-24 21:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords face attribute classification · label augmentation · word2vec · deep neural networks · CelebA · LFWA · multi-label learning · semantic embeddings

The pith

Appending word2vec vectors of attribute names to categorical labels improves face attribute classification while cutting annotated data needs by up to 50%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to generate continuous-valued fixed-dimensional labels by taking the word2vec embeddings of the original attribute names and concatenating them to the categorical labels before feeding them to a deep network. This label-space augmentation is tested on the CelebA and LFWA face attribute datasets against a competitive deep-learning baseline. Results indicate higher accuracy than the unaugmented baseline and performance comparable to existing state-of-the-art methods, even when only half the real annotated images are used for training. A reader would care because the approach offers a way to extract extra supervisory signal from the linguistic structure already implicit in label names without altering images or network architecture. The central move is therefore to treat label names as carriers of semantic geometry that can be injected directly into the training targets.

Core claim

By exploiting the word2vec representations of existing categorical labels to produce fixed-dimensional continuous values and appending these representations to the original labels, the network receives richer supervision that raises classification accuracy on face attributes and reduces the quantity of annotated real images required by up to 50 percent while matching prior state-of-the-art results on CelebA and LFWA.

What carries the argument

Label-space augmentation that concatenates fixed word2vec vectors of attribute names with the original categorical labels, thereby injecting semantic structure into the training targets.
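
As a minimal sketch of the concatenation described above, the Python snippet below builds augmented targets with NumPy. Several things here are assumptions rather than details taken from the paper: the function name `augment_labels`, the choice to pool the embeddings of an image's active attributes by averaging, and the random toy vectors that stand in for real word2vec embeddings of the attribute names.

```python
import numpy as np


def augment_labels(binary_labels, attr_embeddings):
    """Append a pooled attribute-name embedding to each binary label row.

    binary_labels:   (n_images, n_attrs) array of 0/1 attribute labels.
    attr_embeddings: (n_attrs, dim) array, one fixed vector per attribute name.
    Returns an (n_images, n_attrs + dim) array of augmented targets.
    """
    labels = np.asarray(binary_labels, dtype=np.float32)
    embeddings = np.asarray(attr_embeddings, dtype=np.float32)
    if labels.shape[1] != embeddings.shape[0]:
        raise ValueError("need exactly one embedding per attribute")

    # Assumed pooling rule: average the embeddings of the attributes that are
    # active for an image. Images with no active attribute get a zero vector.
    active_counts = labels.sum(axis=1, keepdims=True)
    pooled = (labels @ embeddings) / np.maximum(active_counts, 1.0)

    # The categorical part is kept unchanged; the continuous part is appended.
    return np.concatenate([labels, pooled], axis=1)


# Toy stand-ins: 4 attributes with 8-dimensional vectors (not real word2vec).
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 8))
toy_labels = np.array([[1, 0, 1, 0],
                       [0, 0, 0, 0],
                       [1, 1, 1, 1]])
augmented = augment_labels(toy_labels, toy_embeddings)
```

In a real run the toy matrix would be replaced by pre-trained vectors looked up for each attribute name, and the network's output layer would be widened to match the augmented target dimension during training.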

If this is right

Accuracy on CelebA and LFWA rises above the unaugmented deep-learning baseline.
The same training set size yields performance comparable to prior state-of-the-art methods.
Only half the annotated real images suffice to reach that performance level.
The augmentation requires no change to image preprocessing or network architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same concatenation trick could be tried on other multi-label vision tasks whose labels possess meaningful word embeddings.
Jointly fine-tuning the word vectors together with the network might yield additional gains beyond the fixed-embedding version tested.
If the method generalizes, label augmentation of this form could become a standard regularizer alongside image-space and dropout augmentations.
Domains with sparse annotations but rich label vocabularies would be natural next test beds.

Load-bearing premise

The fixed word2vec vectors of the attribute names already encode task-relevant semantic relations that a network can exploit when they are simply concatenated to the labels.

What would settle it

Retraining the same baseline architecture on CelebA with the augmented labels produces no accuracy lift and still requires the full original training set size to reach the reported performance level.

Figures

Figures reproduced from arXiv: 1907.06757 by Binod Bhattarai, Rumeysa Bodur, Tae-Kyun Kim.

**Figure 2.** Performance comparison on CelebA with different size of training examples. Using the …

**Figure 3.** The listed attributes are a subset of the attributes predicted by the corresponding model on …

read the original abstract

Augmenting data in image space (e.g., flipping, cropping) and activation space (e.g., dropout) is widely used to regularise deep neural networks and has been successfully applied to several computer vision tasks. Unlike previous works, which mostly focus on augmentation in the aforementioned domains, we propose to do augmentation in label space. In this paper, we present a novel method to generate fixed-dimensional labels with continuous values for images by exploiting the word2vec representations of the existing categorical labels. We then append these representations to the existing categorical labels and train the model. We validated our idea on two challenging face attribute classification datasets, namely CelebA and LFWA. Our extensive experiments show that the augmented labels improve the performance of the competitive deep learning baseline and reduce the need of annotated real data up to 50%, while attaining a performance similar to the state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Label augmentation by appending fixed word2vec vectors to binary attributes is a straightforward idea, but the experiments do not test whether the semantic content of those vectors drives any gains.

The paper appends pre-trained word2vec vectors of the 40 CelebA/LFWA attribute names to the existing binary label vector and trains a network to predict the combined target. This is presented as label-space augmentation distinct from image or activation methods. The approach is simple and requires no extra parameters at inference time, which is a practical plus for anyone already running standard supervised training on these datasets. It reports better accuracy than a competitive baseline and claims similar performance to prior SOTA while using up to 50% less labeled data. Those are the concrete positives.

The central assumption is that the geometry in the fixed word2vec vectors supplies useful structure the image encoder can exploit. The stress-test note is on point: because the vectors are constant per attribute, any observed improvement could stem from the added output dimensions, altered loss weighting, or implicit regularization rather than semantic relations. No control that replaces the word2vec entries with random vectors of matching dimension is described, so the attribution remains untested. The 50% data-reduction figure is also stated without the exact protocol, variance across runs, or ablation on how the reduced subsets were chosen. These gaps make the quantitative claims hard to evaluate from the given material.

The work is aimed at people doing multi-label face attribute classification who want cheap ways to stretch limited labels. It is coherent on its own terms and shows honest engagement with the standard benchmarks, but the missing controls are load-bearing for the main claim. I would send it for review only if the authors add the random-vector ablation and clarify the data-reduction protocol; otherwise it is not ready.

Referee Report

2 major / 0 minor

Summary. The paper proposes AugLabel, which augments the binary attribute labels for face images by concatenating fixed word2vec embeddings of the 40 attribute names (from CelebA/LFWA) to create continuous-valued targets. The central claim is that training a deep network on these augmented labels improves classification accuracy over a competitive baseline and achieves comparable performance to state-of-the-art methods while requiring up to 50% less annotated real data.

Significance. If the central claim holds after proper controls, the method offers a simple, annotation-free way to inject semantic structure from label names into multi-label classification losses, which could reduce data requirements in attribute prediction tasks. The approach relies on off-the-shelf word2vec and standard supervised training on public datasets, which supports reproducibility.

major comments (2)

[Abstract] The claim that augmented labels 'reduce the need of annotated real data up to 50%' is presented without quantitative baselines, error bars, ablation details on how the 50% figure was obtained, or description of the reduced-data training protocol.
[Method and Experiments] No control experiment replaces the fixed word2vec vectors with random vectors of identical dimension (or the zero vector) while preserving label dimensionality and training protocol; without this, gains in low-data regimes cannot be attributed to semantic structure in word2vec rather than extra output dimensions, changed loss weighting, or implicit regularization.
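
The control requested in the second comment could be set up along the following lines. This is a hedged sketch, not the authors' protocol: the function name `control_embeddings`, the Gaussian sampling, and the decision to rescale each random vector to the norm of the vector it replaces are all assumptions made for illustration.

```python
import numpy as np


def control_embeddings(attr_embeddings, kind, seed=0):
    """Build a control embedding matrix with the same shape as the original.

    kind="random": Gaussian vectors rescaled to the original per-row norms,
                   so target magnitudes stay comparable (an assumed choice).
    kind="zero":   all-zero vectors, which keep the extra output dimensions
                   but carry no information at all.
    """
    embeddings = np.asarray(attr_embeddings, dtype=np.float32)
    if kind == "zero":
        return np.zeros_like(embeddings)
    if kind == "random":
        rng = np.random.default_rng(seed)
        random_vecs = rng.normal(size=embeddings.shape).astype(np.float32)
        random_norms = np.linalg.norm(random_vecs, axis=1, keepdims=True)
        original_norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        return random_vecs / random_norms * original_norms
    raise ValueError(f"unknown control kind: {kind!r}")
```

Training the same network three times, with the real vectors, the random control, and the zero control, while holding label dimensionality and every other setting fixed, would separate the effect of semantic content from the effect of merely adding output dimensions.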

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, indicating planned revisions where appropriate.

Referee: [Abstract] The claim that augmented labels 'reduce the need of annotated real data up to 50%' is presented without quantitative baselines, error bars, ablation details on how the 50% figure was obtained, or description of the reduced-data training protocol.

Authors: We agree the abstract statement would be strengthened by additional detail. The 50% figure derives from controlled subsampling experiments (training on 50% of the labeled data while holding the test set fixed) reported in the experiments section; we will revise the abstract to briefly describe this protocol and reference the corresponding quantitative results.

revision: yes

Referee: [Method and Experiments] No control experiment replaces the fixed word2vec vectors with random vectors of identical dimension (or the zero vector) while preserving label dimensionality and training protocol; without this, gains in low-data regimes cannot be attributed to semantic structure in word2vec rather than extra output dimensions, changed loss weighting, or implicit regularization.

Authors: This is a fair criticism. A random-vector control would help isolate the contribution of semantic content. We will add this ablation (random vectors of matching dimension, plus a zero-vector baseline) to the revised experiments, keeping all other training details identical.

revision: yes
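
The subsampling protocol mentioned in the first response, together with the run-to-run variance the referee asks for, might be organised as in the sketch below. The helper names `subsample_indices` and `summarize_runs`, the uniform random selection, and the use of one seed per repeat are assumptions; the page does not state how the reduced subsets were actually chosen.

```python
import numpy as np


def subsample_indices(n_train, fraction, seed):
    """Pick a reproducible random subset of training indices.

    Only the training set is subsampled; the test set is left untouched so
    that accuracies at different fractions remain comparable.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = np.random.default_rng(seed)
    n_keep = int(round(n_train * fraction))
    return np.sort(rng.choice(n_train, size=n_keep, replace=False))


def summarize_runs(accuracies):
    """Return mean and sample standard deviation over repeated runs."""
    values = np.asarray(accuracies, dtype=np.float64)
    return values.mean(), values.std(ddof=1)
```

Each seed would yield a different half of the training data; training the baseline and the augmented model on every subset and reporting the mean and standard deviation from `summarize_runs` would supply the error bars that the abstract's 50% claim currently lacks.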

Circularity Check

0 steps flagged

No significant circularity; method is empirical and externally grounded.

The paper introduces an empirical augmentation technique that concatenates fixed, off-the-shelf word2vec vectors (pre-trained on external text corpora) to existing binary attribute labels and trains a standard CNN classifier. No equations derive new quantities from fitted parameters, no self-citation chain supports a uniqueness claim, and no prediction is defined by construction from the authors' own choices. The central performance claims rest on held-out test accuracy on CelebA and LFWA rather than on any internal redefinition or renaming of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that word2vec vectors encode useful semantic relations among face-attribute names; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption: word2vec embeddings of attribute names contain task-relevant semantic structure that can be usefully concatenated to categorical labels.
The entire augmentation strategy depends on this premise; if the vectors add only noise or redundancy the performance claim collapses.

