Efficient data augmentation using graph imputation neural networks

Aurelio Uncini; Indro Spinelli; Michele Scarpiniti; Simone Scardapane

arxiv: 1906.08502 · v1 · pith:NRQK7HAPnew · submitted 2019-06-20 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-25 19:25 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords data augmentation · semi-supervised learning · graph imputation · missing data · graph neural networks · benchmark evaluation

The pith

Graph imputation neural networks reconstruct severely damaged samples to expand semi-supervised datasets by up to 10 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an efficient data augmentation method for semi-supervised learning by leveraging both labeled and unlabeled data. A similarity graph is built across all points, after which selected nodes have up to 80 percent of their features removed and then reconstructed via a graph imputation neural network. The resulting synthetic samples are added to the training set, yielding measurable accuracy gains over models trained only on the original labeled data. A reader would care because the approach turns abundant unlabeled points into usable training material without requiring additional human labels.

Core claim

By constructing a similarity graph from the union of labeled and unlabeled data and then applying a graph imputation neural network to recover nodes that have had up to 80 percent of their features removed, the method generates new training examples that improve classification performance relative to a fully supervised baseline and permit dataset expansion by a factor of up to 10.

What carries the argument

Graph imputation neural network (GINN) operating on a similarity graph built from labeled plus unlabeled points, which performs the reconstruction of heavily damaged samples.
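
As a rough illustration of the first step, the sketch below builds a similarity graph over the union of labeled and unlabeled points. The text here does not say how the paper defines similarity, so the choice of a symmetric k-nearest-neighbor graph under Euclidean distance is an assumption, and `knn_graph` and the toy points are hypothetical names and values.

```python
import math

def knn_graph(points, k):
    """Return a symmetric adjacency dict: node index -> set of neighbor indices.

    Assumption: similarity is Euclidean k-nearest-neighbor; the paper's
    actual graph construction may differ.
    """
    n = len(points)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        # Rank every other point by its distance to point i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)  # keep the graph undirected
    return adj

# Toy data (hypothetical): two labeled points and three unlabeled ones.
labeled = [[0.0, 0.0], [5.0, 5.0]]
unlabeled = [[0.1, 0.2], [4.9, 5.1], [0.4, 0.0]]
graph = knn_graph(labeled + unlabeled, k=1)
```

The brute-force ranking costs quadratic time in the number of points, which is acceptable for a sketch but would likely need an approximate neighbor search on large unlabeled collections.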

If this is right

Datasets can be expanded by a factor of 10 while still improving accuracy over the original fully supervised baseline.
The same reconstruction process works across multiple standard benchmark datasets.
Heavy feature damage (up to 80 percent) remains recoverable when the underlying similarity graph is available.
Graph-based imputation directly supplies new labeled examples without external labelers.
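
The damage step mentioned above can be sketched as a random feature mask. This is a minimal illustration, assuming features are dropped uniformly at random and marked with `None`; the function name `damage` and the way missing entries are encoded are hypothetical, not taken from the paper.

```python
import random

def damage(sample, drop_fraction, rng):
    """Return (damaged, mask), where mask[i] is True if feature i is kept.

    Assumption: dropped features are chosen uniformly at random.
    """
    n_drop = int(round(drop_fraction * len(sample)))
    dropped = set(rng.sample(range(len(sample)), n_drop))
    mask = [i not in dropped for i in range(len(sample))]
    # Dropped features are replaced by None so an imputer can find them.
    damaged = [x if keep else None for x, keep in zip(sample, mask)]
    return damaged, mask

rng = random.Random(0)                   # fixed seed for repeatability
sample = [float(i) for i in range(10)]   # toy 10-feature sample
damaged, mask = damage(sample, 0.8, rng)
```

With 10 features and a drop fraction of 0.8, eight entries become `None` and two survive; an imputation model would then be asked to fill in the missing eight.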

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may generalize to other reconstruction tasks if a reliable similarity graph can be obtained.
Performance would likely degrade on data domains where Euclidean or simple similarity measures fail to capture manifold structure.
Combining the generated samples with conventional augmentation techniques could produce still larger effective training sets.
Computational cost scales with graph construction; sparse or approximate graphs might be needed for very large unlabeled collections.

Load-bearing premise

The similarity graph formed from labeled and unlabeled points must correctly encode the affinities that allow the imputation network to produce useful reconstructions even after 80 percent feature removal.

What would settle it

Running the augmentation pipeline on the reported benchmarks and finding that the augmented models show no accuracy improvement or outright degradation on held-out test sets would falsify the central claim.
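
One simple way to tabulate such a check is to count, trial by trial, which training set produced the better held-out accuracy. The sketch below assumes paired accuracies from matching trials; `count_wins` is a hypothetical helper and the accuracy values are placeholders chosen for illustration, not results from the paper.

```python
def count_wins(baseline_acc, augmented_acc):
    """Count paired trials won by each training set, plus ties."""
    wins = {"baseline": 0, "augmented": 0, "tie": 0}
    for b, a in zip(baseline_acc, augmented_acc):
        if a > b:
            wins["augmented"] += 1
        elif b > a:
            wins["baseline"] += 1
        else:
            wins["tie"] += 1
    return wins

# Placeholder accuracies for 5 trials (NOT taken from the paper).
baseline_acc = [0.70, 0.72, 0.68, 0.71, 0.69]
augmented_acc = [0.73, 0.71, 0.70, 0.71, 0.72]
wins = count_wins(baseline_acc, augmented_acc)
```

A win count alone says nothing about effect size, so a fuller check would also report the mean difference and a paired significance test.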

Figures

Figures reproduced from arXiv: 1906.08502 by Aurelio Uncini, Indro Spinelli, Michele Scarpiniti, Simone Scardapane.

**Figure 1.** Overall schema of the proposed framework for data augmentation. In green we show the GINN method which is optimized from the dataset.

**Figure 2.** Number of times the default and the GINN-augmented datasets achieved better classification performance over 5 different trials, considering all datasets and classifiers in the benchmark.

Original abstract

Recently, data augmentation in the semi-supervised regime, where unlabeled data vastly outnumbers labeled data, has received considerable attention. In this paper, we describe an efficient technique for this task, exploiting a recent framework we proposed for missing data imputation called graph imputation neural network (GINN). The key idea is to leverage both supervised and unsupervised data to build a graph of similarities between points in the dataset. Then, we augment the dataset by severely damaging a few of the nodes (up to 80% of their features), and reconstructing them using a variation of GINN. On several benchmark datasets, we show that our method can obtain significant improvements compared to a fully-supervised model, and we are able to augment the datasets up to a factor of 10x. This points to the power of graph-based neural networks to represent structural affinities in the samples for tasks of data reconstruction and augmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper repurposes the authors' prior GINN work for data augmentation by building a joint similarity graph then damaging and reconstructing samples, but the abstract's claims of significant gains and 10x expansion lack any numbers or protocol to evaluate them.

The actual new element is the specific workflow: construct a similarity graph over labeled plus unlabeled points, apply up to 80% feature damage to selected nodes, and feed them through a GINN variant to generate new training examples. This is a distinct downstream task from the original imputation paper and avoids circularity by relying on the earlier framework. The description of the pipeline is straightforward and the motivation for using graph structure to capture affinities makes sense on paper. The stress-test worry about the graph remaining informative after heavy damage is worth taking seriously; if edges are based on raw feature distances, high-dimensional or noisy data could produce unhelpful reconstructions that do not beat simpler baselines. The abstract asserts clear wins over fully-supervised models and 10x augmentation on benchmarks, yet supplies none of the quantitative results, baselines, or experimental details needed to check those claims. That gap makes it hard to know whether the method delivers in practice. This work is aimed at people already using graph-based imputation or looking for augmentation tricks in semi-supervised settings. A reader working on similar graph neural net applications might pick up the idea, but only the full experiments would show if it is worth adopting. It deserves peer review so referees can examine the actual results and test whether the graph construction survives the damage levels described.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes an efficient data augmentation method for the semi-supervised regime by constructing a similarity graph from the combined labeled and unlabeled feature matrix, then using a variant of the previously published graph imputation neural network (GINN) to reconstruct nodes whose features have been damaged by up to 80 %. The resulting synthetic samples are claimed to augment the original dataset by a factor of up to 10× and to yield significant accuracy gains over a fully-supervised baseline on several benchmark datasets.

Significance. If the reported gains are reproducible and the graph-based imputations are shown to be meaningfully better than simpler baselines, the work would provide a concrete, graph-centric route to exploiting unlabeled data for augmentation without requiring new labels. It also supplies a downstream application of the earlier GINN framework, illustrating how structural affinities encoded in a similarity graph can be leveraged for reconstruction tasks.

major comments (2)

[Method (graph construction and GINN variant)] The central empirical claim (significant improvement over fully-supervised models and 10× augmentation) rests on the untested premise that the similarity graph built from raw features remains class-informative after 80 % feature damage. No ablation, edge-quality metric, or comparison against mean/random imputation is described to substantiate that the GINN reconstructions carry label-relevant signal rather than noise.
[Experiments] The abstract states that “significant improvements” and “10× augmentation” are obtained, yet the experimental protocol, exact baselines, number of runs, statistical tests, and quantitative tables are not referenced in the provided text. Without these details the magnitude and reliability of the gains cannot be assessed.
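
The comparison requested in the first major comment can be sketched on toy data. Here a graph-aware imputer is stood in for by plain neighbor averaging, which is explicitly not GINN, and it is compared against column-mean imputation by reconstruction error. All function names and the data are hypothetical, and the toy set is built so that neighbors are informative.

```python
def column_means(rows):
    """Per-feature mean over all rows."""
    return [sum(col) / len(col) for col in zip(*rows)]

def mean_impute(damaged, means):
    """Baseline: fill each missing feature with the dataset-wide mean."""
    return [m if x is None else x for x, m in zip(damaged, means)]

def neighbor_impute(damaged, neighbor_rows, means):
    """Stand-in for a graph-aware imputer: average over graph neighbors,
    falling back to the column mean when a node has no neighbors."""
    filled = []
    for i, x in enumerate(damaged):
        if x is not None:
            filled.append(x)
        elif neighbor_rows:
            filled.append(sum(r[i] for r in neighbor_rows) / len(neighbor_rows))
        else:
            filled.append(means[i])
    return filled

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy data (hypothetical): two tight clusters of identical points.
data = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [9.0, 8.0, 7.0], [9.0, 8.0, 7.0]]
truth = [1.0, 2.0, 3.0]
damaged = [1.0, None, None]
means = column_means(data)
by_mean = mean_impute(damaged, means)
by_graph = neighbor_impute(damaged, [data[0], data[1]], means)
```

On this contrived example the neighbor average recovers the sample exactly while the column mean does not. On real data the gap depends on how well the graph's edges track class structure, which is exactly what the referee asks the authors to measure.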

minor comments (1)

[Method] Clarify whether the similarity graph is recomputed after each damage/reconstruction step or remains fixed; the current description leaves this ambiguous.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and will revise the manuscript to improve clarity and provide additional supporting evidence.

Referee: [Method (graph construction and GINN variant)] The central empirical claim (significant improvement over fully-supervised models and 10× augmentation) rests on the untested premise that the similarity graph built from raw features remains class-informative after 80 % feature damage. No ablation, edge-quality metric, or comparison against mean/random imputation is described to substantiate that the GINN reconstructions carry label-relevant signal rather than noise.

Authors: We agree that additional evidence would strengthen the central claim. The reported accuracy gains over fully-supervised baselines provide indirect support that the reconstructions carry label-relevant signal, but we will add direct ablations (GINN vs. mean and random imputation) and an analysis of post-damage graph properties (e.g., edge label consistency) in the revised manuscript. revision: yes
Referee: [Experiments] The abstract states that “significant improvements” and “10× augmentation” are obtained, yet the experimental protocol, exact baselines, number of runs, statistical tests, and quantitative tables are not referenced in the provided text. Without these details the magnitude and reliability of the gains cannot be assessed.

Authors: The full manuscript includes a dedicated Experiments section with the protocol, baselines, run counts, and quantitative tables. We will revise the abstract and introduction to add explicit cross-references to these sections and results. revision: yes
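
The rebuttal mentions "edge label consistency" without defining it. One plausible reading, offered here as an assumption, is the fraction of graph edges whose two endpoints carry the same class label, computed only over edges where both labels are known. The helper `edge_label_consistency` and the toy graph are hypothetical.

```python
def edge_label_consistency(adj, labels):
    """Fraction of edges joining two labeled nodes of the same class.

    adj: node -> set of neighbor nodes (undirected graph).
    labels: node -> class label; nodes absent from the dict are unlabeled.
    Returns NaN when no edge has both endpoints labeled.
    """
    same = total = 0
    for i, neighbors in adj.items():
        for j in neighbors:
            # Visit each undirected edge once and skip unlabeled endpoints.
            if i < j and labels.get(i) is not None and labels.get(j) is not None:
                total += 1
                same += labels[i] == labels[j]
    return same / total if total else float("nan")

# Toy graph (hypothetical): edges 0-1, 0-2 and 2-3.
adj = {0: {1, 2}, 1: {0}, 2: {0, 3}, 3: {2}}
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
consistency = edge_label_consistency(adj, labels)
```

Two of the three edges join same-class nodes, so the score is 2/3. Computing this before and after feature damage would show whether the graph stays class-informative.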

Circularity Check

0 steps flagged

No significant circularity; empirical application of prior GINN framework

The paper's central contribution is an empirical technique that builds a similarity graph from combined labeled/unlabeled data, damages up to 80% of node features, and reconstructs via a variation of the authors' previously published GINN imputation method. Claims of 10x augmentation and gains over fully-supervised baselines are presented as experimental results on benchmarks, not as first-principles derivations or predictions. The self-citation to prior GINN work supplies the imputation engine but does not create a load-bearing loop within this paper; no equations reduce by construction to fitted inputs, no uniqueness theorems are invoked from self-citations, and no ansatz is smuggled. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that a similarity graph built from mixed labeled and unlabeled data supports accurate imputation after heavy feature removal; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption A graph of similarities between data points captures structural affinities sufficient for neural imputation of up to 80% missing features.
Stated as the key idea enabling both reconstruction and augmentation.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 3 internal anchors

[1] Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: Proc. 34th International Conference on Machine Learning (ICML). vol. 70, pp. 214–223 (2017)

[2] Battaglia, P.W., Hamrick, J.B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al.: Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018)

[3] Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7(Nov), 2399–2434 (2006)

[4] Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.: Mixmatch: A holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249 (2019)

[5] Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. The MIT Press, 1st edn. (2010)

[6] Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501 (2018)

[7] Cui, X., Goel, V., Kingsbury, B.: Data augmentation for deep neural network acoustic modeling. IEEE/ACM Transactions on Audio, Speech and Language Processing 23(9), 1469–1477 (2015)

[8] Grandvalet, Y., Bengio, Y.: Semi-supervised learning by entropy minimization. In: Advances in Neural Information Processing Systems. pp. 529–536 (2005)

[9] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of wasserstein gans. In: Advances in Neural Information Processing Systems, pp. 5767–5777 (2017)

[10] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proc. 3rd International Conference for Learning Representations (ICLR) (2014)

[11] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: Proc. 2017 International Conference on Learning Representations (ICLR) (2017)

[12] Lee, D.H.: Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on Challenges in Representation Learning, ICML. vol. 3, p. 2 (2013)

[13] Rasmus, A., Berglund, M., Honkala, M., Valpola, H., Raiko, T.: Semi-supervised learning with ladder networks. In: Advances in Neural Information Processing Systems, pp. 3546–3554 (2015)

[14] Spinelli, I., Scardapane, S., Uncini, A.: Missing Data Imputation with Adversarially-trained Graph Convolutional Networks. arXiv preprint arXiv:1905.01907 (2019)

[15] Xie, Q., Dai, Z., Hovy, E., Luong, M.T., Le, Q.V.: Unsupervised Data Augmentation. arXiv preprint arXiv:1904.12848 (Apr 2019)

[16] Yoon, J., Jordon, J., van der Schaar, M.: Gain: Missing data imputation using generative adversarial nets. In: Proc. 35th International Conference of Machine Learning (ICML). pp. 1–10 (2018)

[17] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
