
arXiv: 1906.10064 · v1 · pith:WECFS5OPnew · submitted 2019-06-24 · 💻 cs.LG · cs.AI · stat.ML

Variations on the Chebyshev-Lagrange Activation Function

Pith reviewed 2026-05-25 17:26 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords activation function · Chebyshev nodes · Lagrange interpolation · neural network · residual architecture · image classification · DementiaBank

The pith

Replacing ReLU or tanh with linearly extrapolated Chebyshev-Lagrange activations yields competitive performance on MNIST, CIFAR-10, and DementiaBank tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a new family of activation functions defined by the y-coordinates of Chebyshev nodes, interpolated via Lagrange polynomials on the interval from negative one to one. Linear extrapolation is applied to inputs outside this interval, which experiments on synthetic data show increases the model's expressive capacity. When these activations replace conventional ones in residual networks, they deliver competitive or leading results on standard image classification benchmarks and a task involving minimally correlated vectors, suggesting a path to greater data efficiency in neural networks.

Core claim

By parameterizing the y-coordinates at n+1 Chebyshev nodes per hidden unit and using Lagrangian interpolation to define the polynomial on [-1, 1], with linear extrapolation beyond that range, the activation functions exhibit improved interpolation accuracy on synthetic datasets. Substituting these for ReLU or tanh in deep residual architectures produces competitive or state-of-the-art classification performance on MNIST, CIFAR-10, and DementiaBank.

What carries the argument

The Chebyshev-Lagrange activation: each hidden unit learns the y-coordinates at n+1 Chebyshev nodes, Lagrangian interpolation through those points defines the activation on [-1, 1], and linear extrapolation extends it outside the interval.
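
As a rough illustration of this mechanism, the sketch below evaluates a single unit's activation in plain Python. It is not the authors' implementation: the node convention (Chebyshev nodes of the first kind), the tangent-line extrapolation at ±1, the finite-difference slope estimate, and all function names are assumptions made for the example.

```python
import math


def chebyshev_nodes(n):
    """Return n+1 Chebyshev nodes of the first kind on (-1, 1), ascending.

    Assumption: first-kind nodes; the paper may use a different convention.
    """
    m = n + 1
    return sorted(math.cos((2 * k + 1) * math.pi / (2 * m)) for k in range(m))


def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolant through the points (xs[j], ys[j]) at x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        basis = 1.0
        for k, xk in enumerate(xs):
            if k != j:
                basis *= (x - xk) / (xj - xk)
        total += yj * basis
    return total


def lagrange_slope(xs, ys, x, h=1e-6):
    """Central finite-difference estimate of the interpolant's slope at x."""
    return (lagrange_eval(xs, ys, x + h) - lagrange_eval(xs, ys, x - h)) / (2 * h)


def cl_activation(ys, x):
    """Hypothetical Chebyshev-Lagrange activation for one hidden unit.

    ys holds the n+1 learnable y-coordinates. Inside [-1, 1] the interpolating
    polynomial is used; outside, the tangent line at the nearest endpoint
    (an assumed form of linear extrapolation).
    """
    xs = chebyshev_nodes(len(ys) - 1)
    if x > 1.0:
        return lagrange_eval(xs, ys, 1.0) + lagrange_slope(xs, ys, 1.0) * (x - 1.0)
    if x < -1.0:
        return lagrange_eval(xs, ys, -1.0) + lagrange_slope(xs, ys, -1.0) * (x + 1.0)
    return lagrange_eval(xs, ys, x)
```

Setting the y-coordinates equal to the node positions recovers the identity function on the whole real line, which is a convenient sanity check for any implementation of this kind.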

Load-bearing premise

The assumption that the learned y-coordinates at Chebyshev nodes combined with linear extrapolation will produce stable training dynamics and genuine generalization gains rather than overfitting to the specific datasets or architectures tested.

What would settle it

Observing performance that falls below ReLU or tanh baselines on a new dataset or architecture outside the MNIST, CIFAR-10, and DementiaBank experiments.

Figures

Figures reproduced from arXiv: 1906.10064 by Frank Rudzicz, Jekaterina Novikova, Yuchen Li.

Figure 1: Chebyshev-Lagrange activations before (top row) and after (bottom row) receiving backpropagation for 100 epochs of training on CIFAR-10. We show the activations for the first (left column) and second (right column) elements of the last linear layer of a modified ResNet-32. This idea has been explored with piece-wise polynomial activations where the model learns the weights for the Lagrangian basis functio…
Figure 2: Sample plots of activations and histograms of their …
Figure 3: Sample plots of ReLU and histograms of their inputs for the first 5 hidden units at the …
Original abstract

We seek to improve the data efficiency of neural networks and present novel implementations of parameterized piece-wise polynomial activation functions. The parameters are the y-coordinates of n+1 Chebyshev nodes per hidden unit and Lagrangian interpolation between the nodes produces the polynomial on [-1, 1]. We show results for different methods of handling inputs outside [-1, 1] on synthetic datasets, finding significant improvements in capacity of expression and accuracy of interpolation in models that compute some form of linear extrapolation from either end. We demonstrate competitive or state-of-the-art performance on the classification of images (MNIST and CIFAR-10) and minimally-correlated vectors (DementiaBank) when we replace ReLU or tanh with linearly extrapolated Chebyshev-Lagrange activations in deep residual architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces parameterized piecewise polynomial activation functions based on Chebyshev nodes, where the y-coordinates are learnable parameters per hidden unit, using Lagrangian interpolation on [-1,1] and linear extrapolation outside this interval. It reports significant improvements in interpolation accuracy on synthetic data and competitive or state-of-the-art classification performance on MNIST, CIFAR-10, and DementiaBank datasets when used in deep residual networks in place of ReLU or tanh activations.

Significance. If the empirical results hold after controlling for parameter count, the work could be significant in demonstrating that learnable polynomial activations can enhance neural network performance and data efficiency. The approach of using Chebyshev-Lagrange with extrapolation provides a structured way to increase expressivity. However, the current presentation leaves open whether the gains are attributable to the activation design or simply to the added degrees of freedom.

major comments (2)
  1. [Abstract] The central performance claim on MNIST, CIFAR-10, and DementiaBank requires that the comparison isolates the effect of the Chebyshev-Lagrange activation rather than the added capacity from n+1 learnable y-coordinates per hidden unit. The manuscript does not indicate whether ReLU/tanh baselines received equivalent parameter budgets or width adjustments.
  2. [Abstract] The abstract reports positive empirical outcomes across three datasets without error bars, ablation details on the extrapolation methods, or statistical tests. This is load-bearing for the robustness of the claimed accuracy improvements.
minor comments (1)
  1. The description of handling inputs outside [-1,1] could benefit from more explicit equations for the extrapolation rule.
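
One candidate form of such a rule, written out for concreteness, is the tangent-line extension below. Here p denotes the Lagrange interpolant on [-1, 1] and f the resulting activation. Matching the endpoint slopes p'(-1) and p'(1) is an assumption of this sketch; the abstract only specifies "some form of linear extrapolation", so the paper's actual rule may differ.

```latex
f(x) =
\begin{cases}
  p(-1) + p'(-1)\,(x + 1), & x < -1, \\
  p(x),                    & -1 \le x \le 1, \\
  p(1) + p'(1)\,(x - 1),   & x > 1.
\end{cases}
```

Under this form f is continuous and has a continuous first derivative at both endpoints, and it grows at most linearly for large |x|, unlike the degree-n polynomial p itself.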

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where additional controls and reporting would strengthen the empirical claims. We address each major comment below and commit to revisions that directly respond to the concerns about parameter matching and statistical robustness.

point-by-point responses
  1. Referee: [Abstract] The central performance claim on MNIST, CIFAR-10, and DementiaBank requires that the comparison isolates the effect of the Chebyshev-Lagrange activation rather than the added capacity from n+1 learnable y-coordinates per hidden unit. The manuscript does not indicate whether ReLU/tanh baselines received equivalent parameter budgets or width adjustments.

    Authors: We agree that the original experiments compare networks of identical width and depth, so the Chebyshev-Lagrange activations add n+1 parameters per hidden unit relative to ReLU or tanh. This leaves open whether gains arise from the activation shape or from extra capacity. In the revised manuscript we will add matched-parameter experiments in which baseline ReLU and tanh networks are widened until total parameter count equals that of the Chebyshev-Lagrange models. These new results will be reported alongside the original architecture-matched comparisons. revision: yes

  2. Referee: [Abstract] The abstract reports positive empirical outcomes across three datasets without error bars, ablation details on the extrapolation methods, or statistical tests. This is load-bearing for the robustness of the claimed accuracy improvements.

    Authors: The abstract is intentionally concise, yet we accept that it should convey variability and supporting analyses. The body already contains ablation results on linear versus other extrapolation schemes using synthetic interpolation tasks. For the revision we will (i) append error bars obtained from at least five independent random seeds to the reported accuracies, (ii) reference the extrapolation ablations in the abstract, and (iii) include statistical significance tests (paired t-tests or Wilcoxon signed-rank tests) in the experimental sections. The abstract text will be updated accordingly. revision: yes
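
The parameter matching described in the first response can be sketched as follows. This is a minimal illustration for a fully connected network with uniform hidden width, not the residual convolutional architectures used in the paper; the helper names and the widening strategy are hypothetical.

```python
def mlp_params(widths):
    """Count weights plus biases of a fully connected net with these layer widths."""
    return sum(a * b + b for a, b in zip(widths, widths[1:]))


def cl_params(widths, n):
    """Same net, plus n+1 learnable y-coordinates for every hidden unit."""
    hidden_units = sum(widths[1:-1])
    return mlp_params(widths) + (n + 1) * hidden_units


def matched_width(d_in, d_out, depth, target):
    """Smallest uniform hidden width whose plain net has at least `target` parameters."""
    w = 1
    while mlp_params([d_in] + [w] * depth + [d_out]) < target:
        w += 1
    return w


# Example: widen a ReLU baseline until it matches a Chebyshev-Lagrange net.
cl_net = [784, 128, 128, 10]          # illustrative MNIST-sized layer widths
target = cl_params(cl_net, n=3)       # n is the polynomial degree hyperparameter
baseline_width = matched_width(784, 10, depth=2, target=target)
```

Because the activation adds only n+1 parameters per hidden unit while each unit's incoming weights scale with the layer width, the matched baseline in this toy setting is only marginally wider than the original network.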

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on direct experiments, not self-referential derivations

full rationale

The paper proposes a parameterized activation function (y-coordinates at Chebyshev nodes with Lagrange interpolation and linear extrapolation) and reports empirical accuracy on MNIST, CIFAR-10, and DementiaBank in residual networks. No derivation chain, uniqueness theorem, or first-principles prediction is claimed; results are obtained by training and measuring test accuracy. No step reduces a claimed output to a fitted input by construction, and any self-citations (if present) are not load-bearing for the central experimental claims. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach depends on the standard uniqueness property of Lagrange interpolation through n+1 distinct points and on the empirical choice of linear extrapolation; the y-coordinates themselves are learned parameters rather than fixed constants.

free parameters (1)
  • y-coordinates of n+1 Chebyshev nodes per hidden unit
    These heights are the trainable parameters that determine the shape of each unit's activation polynomial.
axioms (1)
  • [standard math] Lagrange interpolation through n+1 distinct points yields a unique polynomial of degree at most n
    This classical result from numerical analysis is invoked to guarantee that the interpolated function is well-defined on [-1,1].
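
For reference, the classical construction behind this axiom can be written as below. The symbols x_k (nodes), y_j (learned heights), and ℓ_j (basis polynomials) are notation chosen for this sketch, and the first-kind node formula is an assumption about the paper's convention.

```latex
% Chebyshev nodes of the first kind (assumed convention), k = 0, ..., n
x_k = \cos\!\left(\frac{(2k+1)\,\pi}{2(n+1)}\right)

% Lagrange form of the interpolant through (x_j, y_j)
p(x) = \sum_{j=0}^{n} y_j\,\ell_j(x),
\qquad
\ell_j(x) = \prod_{\substack{k=0 \\ k \neq j}}^{n} \frac{x - x_k}{x_j - x_k}

% Uniqueness: if p and q both have degree at most n and agree at the
% n+1 distinct nodes, then p - q has n+1 roots and degree at most n,
% hence p - q \equiv 0.
```

Since ℓ_j(x_k) equals 1 when j = k and 0 otherwise, p(x_j) = y_j, so each learned parameter is exactly the activation's value at its node.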


