pith. sign in

arxiv: 1907.06800 · v1 · pith:WVUTI2QMnew · submitted 2019-07-16 · 💻 cs.LG · cs.NA· math.NA· stat.ML

Graph Interpolating Activation Improves Both Natural and Robust Accuracies in Data-Efficient Deep Learning

Pith reviewed 2026-05-24 21:10 UTC · model grok-4.3

classification 💻 cs.LG cs.NAmath.NAstat.ML
keywords graph interpolating activationdata-efficient learningadversarial robustnesssemi-supervised learningLaplace-Beltrami equationmanifold learningdeep neural networks
0
0 comments X

The pith

Replacing softmax with a graph Laplacian interpolator raises both natural accuracy and adversarial robustness for DNNs trained on limited data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replaces the usual softmax output layer in deep neural networks with a high-dimensional interpolating function built from the graph Laplacian. In the continuum this function solves a Laplace-Beltrami equation on the data manifold. The resulting networks are shown to train effectively with far fewer labeled examples than standard architectures. They also record higher accuracy on clean test images and higher accuracy against both white-box and black-box adversarial examples. The same change supplies a direct route to semi-supervised learning.

Core claim

The central claim is that a DNN whose final activation is the graph Laplacian interpolator, rather than softmax, integrates manifold geometry into the output layer and thereby improves both natural accuracy on clean images and robust accuracy on adversarially perturbed images, with the gains being largest when the training set is small.

What carries the argument

The graph Laplacian-based high-dimensional interpolating function that replaces softmax and converges to the solution of a Laplace-Beltrami equation on the data manifold.

If this is right

  • High-capacity networks become usable with training sets an order of magnitude smaller than current practice.
  • Robustness to both white-box and black-box attacks improves without extra adversarial training.
  • The architecture supplies a built-in mechanism for incorporating unlabeled data in semi-supervised regimes.
  • End-to-end training and inference algorithms remain essentially unchanged from standard DNN pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method effectively embeds a discrete manifold-learning step inside the final layer, offering a tighter coupling than typical manifold-regularization add-ons.
  • Because the interpolator is data-dependent, it may adapt automatically to distribution shift between training and test sets.
  • The same construction could be applied to intermediate layers to propagate geometric information deeper into the network.

Load-bearing premise

The graph Laplacian interpolator can be inserted as the output activation of a standard DNN and trained end-to-end without introducing instabilities or prohibitive extra cost.

What would settle it

Train identical DNNs on a small labeled subset of CIFAR-10 or SVHN, once with the new activation and once with softmax, then compare clean test accuracy and accuracy under FGSM or PGD attacks; if the graph version shows no consistent gain the claim fails.

Figures

Figures reproduced from arXiv: 1907.06800 by Bao Wang, Stanley J. Osher.

Figure 1
Figure 1. Figure 1: Illustration of training and testing procedures of the standard DNN with the [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of training and testing procedures of the DNN with the WNLL inter [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Plots of test errors when 1K (a) and 10K (b) training data are used to train the [PITH_FULL_IMAGE:figures/full_fig_p019_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Evolution of the generation accuracy over the training procedure. Charts (a) and [PITH_FULL_IMAGE:figures/full_fig_p020_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Adversarial images (left panel) selected from the MNIST dataset and the corre [PITH_FULL_IMAGE:figures/full_fig_p025_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Adversarial images (left panel) selected from the CIFAR10 dataset and the corre [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Epochs v.s. accuracy in training ResNet56 on the CIFAR10. (a): without the [PITH_FULL_IMAGE:figures/full_fig_p027_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of the features learned by ResNet56 with the softmax ((a), (b)) and [PITH_FULL_IMAGE:figures/full_fig_p028_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Visualization of the first two principal components of the adversarial images’ [PITH_FULL_IMAGE:figures/full_fig_p029_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: A randomly selected adversarial image and their top five nearest neighbors in [PITH_FULL_IMAGE:figures/full_fig_p029_10.png] view at source ↗
read the original abstract

Improving the accuracy and robustness of deep neural nets (DNNs) and adapting them to small training data are primary tasks in deep learning research. In this paper, we replace the output activation function of DNNs, typically the data-agnostic softmax function, with a graph Laplacian-based high dimensional interpolating function which, in the continuum limit, converges to the solution of a Laplace-Beltrami equation on a high dimensional manifold. Furthermore, we propose end-to-end training and testing algorithms for this new architecture. The proposed DNN with graph interpolating activation integrates the advantages of both deep learning and manifold learning. Compared to the conventional DNNs with the softmax function as output activation, the new framework demonstrates the following major advantages: First, it is better applicable to data-efficient learning in which we train high capacity DNNs without using a large number of training data. Second, it remarkably improves both natural accuracy on the clean images and robust accuracy on the adversarial images crafted by both white-box and black-box adversarial attacks. Third, it is a natural choice for semi-supervised learning. For reproducibility, the code is available at \url{https://github.com/BaoWangMath/DNN-DataDependentActivation}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes replacing the standard softmax output activation in DNNs with a graph Laplacian-based high-dimensional interpolating function that, in the continuum limit, solves a Laplace-Beltrami equation on the data manifold. It introduces end-to-end training and testing procedures for this architecture and claims three advantages over conventional DNNs: improved applicability to data-efficient regimes, higher natural accuracy on clean data and robust accuracy under white- and black-box adversarial attacks, and natural suitability for semi-supervised learning. Reproducible code is provided.

Significance. If the claimed accuracy and robustness gains are shown to be statistically significant, reproducible across architectures, and not artifacts of altered optimization dynamics, the work would provide a concrete mechanism for injecting manifold geometry into the output layer of deep networks. The explicit provision of code strengthens the contribution by enabling direct verification of the end-to-end differentiability claim.

major comments (3)
  1. [Section 3 (training algorithm)] The central claim that the graph interpolant can be stably integrated into the output layer and trained end-to-end with SGD-style optimizers rests on unverified assumptions about differentiability. The manuscript must supply the explicit back-propagation rule through the graph-Laplacian solve (or pseudoinverse) and demonstrate that the resulting gradients remain well-conditioned for standard mini-batch sizes; without this, reported gains could arise from an incidental change in the loss landscape rather than the manifold property itself.
  2. [Section 4 (experiments)] No quantitative results, error bars, or baseline comparisons appear in the abstract, and the full text must include tables that report natural and robust accuracy (with standard deviations over multiple runs) against at least ResNet- and VGG-style softmax baselines on CIFAR-10/100 and ImageNet subsets for the data-efficient regime. The absence of these numbers makes it impossible to assess whether the claimed improvements are load-bearing or marginal.
  3. [Section 2 (graph interpolating activation)] The construction of the graph Laplacian from high-dimensional features is described only at a high level; the paper must specify whether the Laplacian is recomputed every epoch from the current mini-batch embeddings or held fixed, and must quantify the additional per-iteration cost relative to standard softmax. If the cost scales with batch size squared, the data-efficiency advantage may be offset by computational overhead.
minor comments (2)
  1. Notation for the graph Laplacian matrix and its pseudoinverse should be introduced with an explicit equation number and kept consistent between the theoretical derivation and the algorithmic pseudocode.
  2. The abstract states that the method 'remarkably improves' both accuracies; the results section should replace this phrasing with precise percentage-point gains relative to the softmax baseline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and will revise the manuscript accordingly to improve clarity, rigor, and completeness.

read point-by-point responses
  1. Referee: [Section 3 (training algorithm)] The central claim that the graph interpolant can be stably integrated into the output layer and trained end-to-end with SGD-style optimizers rests on unverified assumptions about differentiability. The manuscript must supply the explicit back-propagation rule through the graph-Laplacian solve (or pseudoinverse) and demonstrate that the resulting gradients remain well-conditioned for standard mini-batch sizes; without this, reported gains could arise from an incidental change in the loss landscape rather than the manifold property itself.

    Authors: We agree that explicit details on differentiability are required. The accompanying code implements the graph-Laplacian solve (via pseudoinverse) and its backward pass using automatic differentiation. In the revised manuscript we will add an explicit derivation of the back-propagation rule through the linear solve and include numerical verification that gradient norms remain well-conditioned for the batch sizes employed in the experiments. This will confirm that the reported gains stem from the manifold geometry rather than incidental optimization effects. revision: yes

  2. Referee: [Section 4 (experiments)] No quantitative results, error bars, or baseline comparisons appear in the abstract, and the full text must include tables that report natural and robust accuracy (with standard deviations over multiple runs) against at least ResNet- and VGG-style softmax baselines on CIFAR-10/100 and ImageNet subsets for the data-efficient regime. The absence of these numbers makes it impossible to assess whether the claimed improvements are load-bearing or marginal.

    Authors: We will revise the abstract to include key quantitative highlights. In Section 4 we will add tables that report natural and robust accuracies together with standard deviations computed over multiple independent runs, and we will include direct comparisons against ResNet- and VGG-style softmax baselines on CIFAR-10/100 and ImageNet subsets in the data-efficient regime. These additions will enable a clear statistical assessment of the improvements. revision: yes

  3. Referee: [Section 2 (graph interpolating activation)] The construction of the graph Laplacian from high-dimensional features is described only at a high level; the paper must specify whether the Laplacian is recomputed every epoch from the current mini-batch embeddings or held fixed, and must quantify the additional per-iteration cost relative to standard softmax. If the cost scales with batch size squared, the data-efficiency advantage may be offset by computational overhead.

    Authors: We will expand Section 2 to state explicitly that the graph Laplacian is built from the current mini-batch embeddings and is recomputed at every training iteration. We will also add a complexity analysis together with empirical timing measurements that quantify the additional per-iteration cost relative to softmax; the dominant term is the linear solve whose size equals the batch size. These details will allow readers to evaluate the computational trade-off against the observed data-efficiency gains. revision: yes

Circularity Check

0 steps flagged

No circularity; architectural proposal with empirical validation

full rationale

The paper introduces a graph-Laplacian interpolating activation as a direct replacement for softmax, justified by its continuum limit to the Laplace-Beltrami equation and supported by proposed end-to-end algorithms. No derivation step equates a claimed prediction or result to its own fitted inputs or self-citations by construction. Advantages in accuracy and data efficiency are framed as empirical outcomes rather than tautological identities. The central claims rest on experimental comparisons, not on re-deriving inputs from outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5753 in / 1056 out tokens · 17907 ms · 2026-05-24T21:10:28.659469+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 19 internal anchors

  1. [1]

    Learning Activation Functions to Improve Deep Neural Networks

    F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi. Learning Activation Functions to Improve Deep Neural Networks. arXiv preprint arXiv:1412.6830 ,

  2. [2]

    Adversarial Machine Learning against Tesla’s Autopilot

    Anonymous. Adversarial Machine Learning against Tesla’s Autopilot. https://www. schneier.com/blog/archives/2019/04/adversarial_mac.html,

  3. [3]

    Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models

    W. Brendel, J. Rauber, and M. Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248 ,

  4. [4]

    X. Chen, C. Liu, B. Li, K. Liu, and D. Song. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. arXiv preprint arXiv:1712.05526 , 2017a. Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual Path Networks. In Advances in neural information processing systems, 2017b. J. Cohen, E. Rosenfeld, and J. Z. Kolter. Certified Adversarial ...

  5. [5]

    Z. Dou, S. J. Osher, and B. Wang. Mathematical Analysis of Adversarial Attacks. arXiv preprint arXiv:1811.06492,

  6. [6]

    Maxout Networks

    I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout Networks. arXiv preprint arXiv:1302.4389 ,

  7. [7]

    I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Exam- ples. arXiv preprint arXiv:1412.6275 ,

  8. [8]

    Improving neural networks by preventing co-adaptation of feature detectors

    G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improv- ing neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580,

  9. [9]

    Adam: A Method for Stochastic Optimization

    D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980,

  10. [10]

    Deep Residual Learning and PDEs on Manifold

    Z. Li and Z. Shi. Deep Residual Learning and PDEs on Manifold. arXiv preprint arXiv:1708.05115,

  11. [11]

    Y. Liu, X. Chen, C. Liu, and D. Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770 ,

  12. [12]

    S. J. Osher, B. Wang, P. Yin, X. Luo, M. Pham, and A. Lin. Laplacian Smoothing Gradient Descent. arXiv preprint arXiv:1806.06317 ,

  13. [13]

    Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples

    N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z.B. Celik, and A. Swami. The Limita- tions of Deep Learning in Adversarial Settings. IEEE European Symposium on Security and Privacy, pages 372–387, 2016a. N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks. IEEE Europe...

  14. [14]

    Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients

    32 A. Ross and F. Doshi-Velez. Improving the Adversarial Robustness and Interpretabil- ity of Deep Neural Networks by Regularizing Their Input Gradients. arXiv preprint arXiv:1711.09404,

  15. [15]

    URL https://openreview.net/forum?id=BkJ3ibb0-. Z. Shi, B. Wang, and S. Osher. Error Estimation of the Weighted Nonlocal Laplacian on Random Point Cloud. arXiv preprint arXiv:1809.08622 ,

  16. [16]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 ,

  17. [17]

    Intriguing properties of neural networks

    C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing Properties of Neural Networks. arXiv preprint arXiv:1312.6199 ,

  18. [18]

    Y. Tang. Deep Learning Using Linear Support Vector Machines. ArXiv:1306.0239,

  19. [19]

    URL https://openreview.net/forum?id= rkZvSe-RZ. V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitiagkas, A. Courville, D. Lopez-Paz, and Y. Bengio. Manifold Mixup: Better Representations by Interpolating Hidden States. arXiv preprint arXiv:1806.05236 ,

  20. [20]

    B. Wang, A. T. Lin, Z. Shi, W. Zhu, P. Yin, A. L. Bertozzi, and S. J. Osher. Adversar- ial Defense via Data Dependent Activation Function and Total Variation Minimization. arXiv preprint arXiv:1809.08516 , 2018a. B. Wang, X. Luo, Z. Li, W. Zhu, Z. Shi, and S. Osher. Deep Neural Nets with Interpolating Function as Output Activation. In Advances in Neural I...

  21. [21]

    Theoretically Principled Trade-off between Robustness and Accuracy

    H. Zhang, Y. Yu, J. Jiao, E. Xing, L. Ghaoui, and M. Jordan. Theoretically Principled Trade-off between Robustness and Accuracy. arXiv preprint arXiv:1901.08573 ,