arxiv: 1906.08502 · v1 · pith:NRQK7HAPnew · submitted 2019-06-20 · 📊 stat.ML · cs.LG

Efficient data augmentation using graph imputation neural networks

Pith reviewed 2026-05-25 19:25 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords data augmentation · semi-supervised learning · graph imputation · missing data · graph neural networks · benchmark evaluation

The pith

Graph imputation neural networks reconstruct severely damaged samples to expand semi-supervised datasets by up to 10 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an efficient data augmentation method for semi-supervised learning by leveraging both labeled and unlabeled data. A similarity graph is built across all points, after which selected nodes have up to 80 percent of their features removed and then reconstructed via a graph imputation neural network. The resulting synthetic samples are added to the training set, yielding measurable accuracy gains over models trained only on the original labeled data. A reader would care because the approach turns abundant unlabeled points into usable training material without requiring additional human labels.

Core claim

By constructing a similarity graph from the union of labeled and unlabeled data and then applying a graph imputation neural network to recover nodes that have had up to 80 percent of their features removed, the method generates new training examples that improve classification performance relative to a fully supervised baseline and permit dataset expansion by a factor of up to 10.

What carries the argument

Graph imputation neural network (GINN) operating on a similarity graph built from labeled plus unlabeled points, which performs the reconstruction of heavily damaged samples.
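
The damage-and-reconstruct loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`knn_graph`, `damage`, `reconstruct`, `augment`) are hypothetical, the similarity graph is assumed to be a Euclidean k-nearest-neighbour graph, and a neighbour-mean fill stands in for GINN, which the paper describes as a trained graph neural network.

```python
import math
import random

def knn_graph(points, k):
    """Return {i: indices of the k nearest other points} by Euclidean distance.
    Assumption: the paper's similarity graph is approximated by a k-NN graph."""
    neighbors = {}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors[i] = [j for _, j in dists[:k]]
    return neighbors

def damage(sample, drop_fraction, rng):
    """Mark a random subset of features as missing (None)."""
    n_drop = int(round(drop_fraction * len(sample)))
    dropped = set(rng.sample(range(len(sample)), n_drop))
    return [None if f in dropped else v for f, v in enumerate(sample)]

def reconstruct(damaged, neighbor_rows):
    """Stand-in imputer: fill each missing feature with the neighbours' mean.
    GINN would use a trained graph network here instead of this average."""
    filled = []
    for f, v in enumerate(damaged):
        if v is None:
            v = sum(row[f] for row in neighbor_rows) / len(neighbor_rows)
        filled.append(v)
    return filled

def augment(points, k=3, drop_fraction=0.8, copies=1, seed=0):
    """Create `copies` reconstructed variants of every point."""
    rng = random.Random(seed)
    graph = knn_graph(points, k)  # built once, on the undamaged data
    synthetic = []
    for i, p in enumerate(points):
        rows = [points[j] for j in graph[i]]
        for _ in range(copies):
            synthetic.append(reconstruct(damage(p, drop_fraction, rng), rows))
    return synthetic
```

In this sketch the `copies` argument controls the expansion factor, so `copies=9` would yield nine synthetic points per original, i.e. a tenfold dataset once the originals are included. Building the graph once on the undamaged data is a design choice of the sketch; the summary does not say whether the paper does the same.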

If this is right

  • Datasets can be expanded by a factor of 10 while still improving accuracy over the original fully supervised baseline.
  • The same reconstruction process works across multiple standard benchmark datasets.
  • Heavy feature damage (up to 80 percent) remains recoverable when the underlying similarity graph is available.
  • Graph-based imputation directly supplies new labeled examples without external labelers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may generalize to other reconstruction tasks if a reliable similarity graph can be obtained.
  • Performance would likely degrade on data domains where Euclidean or simple similarity measures fail to capture manifold structure.
  • Combining the generated samples with conventional augmentation techniques could produce still larger effective training sets.
  • Computational cost scales with graph construction; sparse or approximate graphs might be needed for very large unlabeled collections.

Load-bearing premise

The similarity graph formed from labeled and unlabeled points must correctly encode the affinities that allow the imputation network to produce useful reconstructions even after 80 percent feature removal.

What would settle it

Running the augmentation pipeline on the reported benchmarks and finding that the augmented models show no accuracy improvement or outright degradation on held-out test sets would falsify the central claim.
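
A check of this kind could be tallied as below. This is a sketch only: `compare_runs` is a hypothetical helper, and it assumes one held-out accuracy per trial for each setting, paired by trial. It counts wins per setting across trials and reports the mean gain.

```python
from statistics import mean

def compare_runs(baseline_acc, augmented_acc):
    """Tally paired per-trial accuracies for baseline vs. augmented training."""
    if len(baseline_acc) != len(augmented_acc):
        raise ValueError("need one accuracy per trial for both settings")
    pairs = list(zip(augmented_acc, baseline_acc))
    wins = sum(a > b for a, b in pairs)
    losses = sum(a < b for a, b in pairs)
    return {
        "augmented_wins": wins,
        "baseline_wins": losses,
        "ties": len(pairs) - wins - losses,
        "mean_gain": mean(a - b for a, b in pairs),  # > 0 favours augmentation
    }
```

A mean gain at or below zero across the benchmarks, or more baseline wins than augmented wins, would be the outcome that counts against the central claim. A proper evaluation would also add a significance test, which this tally does not provide.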

Figures

Figures reproduced from arXiv: 1906.08502 by Aurelio Uncini, Indro Spinelli, Michele Scarpiniti, Simone Scardapane.

Figure 1: Overall schema of the proposed framework for data augmentation. In green we show the GINN method, which is optimized from the dataset.
Figure 2: Number of times the default and the GINN-augmented datasets achieved better classification performance over 5 different trials, considering all datasets and classifiers in the benchmark.
Original abstract

Recently, data augmentation in the semi-supervised regime, where unlabeled data vastly outnumbers labeled data, has received considerable attention. In this paper, we describe an efficient technique for this task, exploiting a recent framework we proposed for missing data imputation called graph imputation neural network (GINN). The key idea is to leverage both supervised and unsupervised data to build a graph of similarities between points in the dataset. Then, we augment the dataset by severely damaging a few of the nodes (up to 80% of their features), and reconstructing them using a variation of GINN. On several benchmark datasets, we show that our method can obtain significant improvements compared to a fully-supervised model, and we are able to augment the datasets up to a factor of 10x. This points to the power of graph-based neural networks to represent structural affinities in the samples for tasks of data reconstruction and augmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes an efficient data augmentation method for the semi-supervised regime by constructing a similarity graph from the combined labeled and unlabeled feature matrix, then using a variant of the previously published graph imputation neural network (GINN) to reconstruct nodes whose features have been damaged by up to 80 %. The resulting synthetic samples are claimed to augment the original dataset by a factor of up to 10× and to yield significant accuracy gains over a fully-supervised baseline on several benchmark datasets.

Significance. If the reported gains are reproducible and the graph-based imputations are shown to be meaningfully better than simpler baselines, the work would provide a concrete, graph-centric route to exploiting unlabeled data for augmentation without requiring new labels. It also supplies a downstream application of the earlier GINN framework, illustrating how structural affinities encoded in a similarity graph can be leveraged for reconstruction tasks.

major comments (2)
  1. [Method (graph construction and GINN variant)] The central empirical claim (significant improvement over fully-supervised models and 10× augmentation) rests on the untested premise that the similarity graph built from raw features remains class-informative after 80 % feature damage. No ablation, edge-quality metric, or comparison against mean/random imputation is described to substantiate that the GINN reconstructions carry label-relevant signal rather than noise.
  2. [Experiments] The abstract states that “significant improvements” and “10× augmentation” are obtained, yet the experimental protocol, exact baselines, number of runs, statistical tests, and quantitative tables are not referenced in the provided text. Without these details the magnitude and reliability of the gains cannot be assessed.
minor comments (1)
  1. [Method] Clarify whether the similarity graph is recomputed after each damage/reconstruction step or remains fixed; the current description leaves this ambiguous.
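
The mean and random imputation baselines requested in major comment 1 could be set up along these lines. This is a hedged sketch: the helper names are hypothetical, missing features are assumed to be marked with `None`, and the error is measured only over the removed features. A GINN reconstruction would be scored with the same `rmse` function for comparison.

```python
import random

def column_means(rows):
    """Per-feature mean over all rows of a complete dataset."""
    n = len(rows)
    return [sum(r[f] for r in rows) / n for f in range(len(rows[0]))]

def mean_impute(damaged, means):
    """Fill each missing feature with its column mean."""
    return [means[f] if v is None else v for f, v in enumerate(damaged)]

def random_impute(damaged, rows, rng):
    """Fill each missing feature with that feature's value from a random row."""
    return [rng.choice(rows)[f] if v is None else v for f, v in enumerate(damaged)]

def rmse(original, reconstructed, damaged):
    """Root-mean-square error over the features that were removed."""
    errs = [(o - r) ** 2
            for o, r, d in zip(original, reconstructed, damaged) if d is None]
    return (sum(errs) / len(errs)) ** 0.5 if errs else 0.0
```

If graph-based reconstructions do not beat these two baselines on reconstruction error and on downstream accuracy, the gains would be hard to attribute to the similarity graph.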

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and will revise the manuscript to improve clarity and provide additional supporting evidence.

  1. Referee: [Method (graph construction and GINN variant)] The central empirical claim (significant improvement over fully-supervised models and 10× augmentation) rests on the untested premise that the similarity graph built from raw features remains class-informative after 80 % feature damage. No ablation, edge-quality metric, or comparison against mean/random imputation is described to substantiate that the GINN reconstructions carry label-relevant signal rather than noise.

    Authors: We agree that additional evidence would strengthen the central claim. The reported accuracy gains over fully-supervised baselines provide indirect support that the reconstructions carry label-relevant signal, but we will add direct ablations (GINN vs. mean and random imputation) and an analysis of post-damage graph properties (e.g., edge label consistency) in the revised manuscript. revision: yes

  2. Referee: [Experiments] The abstract states that “significant improvements” and “10× augmentation” are obtained, yet the experimental protocol, exact baselines, number of runs, statistical tests, and quantitative tables are not referenced in the provided text. Without these details the magnitude and reliability of the gains cannot be assessed.

    Authors: The full manuscript includes a dedicated Experiments section with the protocol, baselines, run counts, and quantitative tables. We will revise the abstract and introduction to add explicit cross-references to these sections and results. revision: yes
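
The "edge label consistency" analysis promised in the first response is not defined in the text. One plausible reading, given here as an assumption, is the fraction of graph edges whose two endpoints carry the same class label, computed over edges where both labels are known. The function name and input format below are hypothetical.

```python
def edge_label_consistency(edges, labels):
    """Fraction of edges joining two labeled nodes that share a label.

    edges:  iterable of (u, v) node-index pairs
    labels: dict mapping labeled node index -> class label
            (unlabeled nodes are simply absent and their edges are skipped)
    """
    same = total = 0
    for u, v in edges:
        if u in labels and v in labels:
            total += 1
            same += labels[u] == labels[v]
    # NaN signals that no edge had both endpoints labeled.
    return same / total if total else float("nan")
```

A value close to the chance rate for the class distribution would suggest that the graph carries little class information, which is the concern raised in major comment 1.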

Circularity Check

0 steps flagged

No significant circularity; empirical application of prior GINN framework

The paper's central contribution is an empirical technique that builds a similarity graph from combined labeled/unlabeled data, damages up to 80% of node features, and reconstructs via a variation of the authors' previously published GINN imputation method. Claims of 10x augmentation and gains over fully-supervised baselines are presented as experimental results on benchmarks, not as first-principles derivations or predictions. The self-citation to prior GINN work supplies the imputation engine but does not create a load-bearing loop within this paper; no equations reduce by construction to fitted inputs, no uniqueness theorems are invoked from self-citations, and no ansatz is smuggled. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a similarity graph built from mixed labeled and unlabeled data supports accurate imputation after heavy feature removal; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: a graph of similarities between data points captures structural affinities sufficient for neural imputation of up to 80% missing features.
    Stated as the key idea enabling both reconstruction and augmentation.


