Variations on the Chebyshev-Lagrange Activation Function

Frank Rudzicz; Jekaterina Novikova; Yuchen Li

arxiv: 1906.10064 · v1 · pith:WECFS5OPnew · submitted 2019-06-24 · 💻 cs.LG · cs.AI · stat.ML

Yuchen Li, Frank Rudzicz, Jekaterina Novikova

Pith reviewed 2026-05-25 17:26 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords activation function · Chebyshev nodes · Lagrange interpolation · neural network · residual architecture · image classification · DementiaBank

The pith

Replacing ReLU or tanh with linearly extrapolated Chebyshev-Lagrange activations yields competitive performance on MNIST, CIFAR-10, and DementiaBank tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a new family of activation functions defined by the y-coordinates of Chebyshev nodes, interpolated via Lagrange polynomials on the interval from negative one to one. Linear extrapolation is applied to inputs outside this interval, which experiments on synthetic data show increases the model's expressive capacity. When these activations replace conventional ones in residual networks, they deliver competitive or leading results on standard image classification benchmarks and a task involving minimally correlated vectors, suggesting a path to greater data efficiency in neural networks.

Core claim

By parameterizing the y-coordinates at n+1 Chebyshev nodes per hidden unit and using Lagrangian interpolation to define the polynomial on [-1, 1], with linear extrapolation beyond that range, the activation functions exhibit improved interpolation accuracy on synthetic datasets. Substituting these for ReLU or tanh in deep residual architectures produces competitive or state-of-the-art classification performance on MNIST, CIFAR-10, and DementiaBank.

What carries the argument

Chebyshev-Lagrange activation, which parameterizes y-coordinates at Chebyshev nodes for Lagrangian interpolation on [-1,1] with linear extrapolation outside the interval.
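The mechanism described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation, and it rests on two assumptions that this page does not confirm: the nodes are Chebyshev nodes of the first kind, and the linear extrapolation continues the polynomial along its tangent line at each endpoint. The function names (`chebyshev_nodes`, `lagrange_eval`, `lagrange_slope`, `cl_activation`) are hypothetical.

```python
import math


def chebyshev_nodes(n):
    """Return n+1 Chebyshev nodes of the first kind (assumed node choice)."""
    return [math.cos((2 * k + 1) * math.pi / (2 * (n + 1))) for k in range(n + 1)]


def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolant through the points (xs[j], ys[j]) at x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        basis = 1.0
        for m, xm in enumerate(xs):
            if m != j:
                basis *= (x - xm) / (xj - xm)
        total += yj * basis
    return total


def lagrange_slope(xs, ys, x):
    """Evaluate the derivative of the Lagrange interpolant at x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        dbasis = 0.0
        for i, xi in enumerate(xs):
            if i == j:
                continue
            # Product rule: differentiate the factor for node i, keep the rest.
            term = 1.0 / (xj - xi)
            for m, xm in enumerate(xs):
                if m != j and m != i:
                    term *= (x - xm) / (xj - xm)
            dbasis += term
        total += yj * dbasis
    return total


def cl_activation(ys, x):
    """Activation for one hidden unit with learnable node heights ys.

    Inside [-1, 1] the polynomial is used directly. Outside, the value is
    extended along the tangent at the nearest endpoint (an assumed rule).
    """
    xs = chebyshev_nodes(len(ys) - 1)
    if x > 1.0:
        return lagrange_eval(xs, ys, 1.0) + lagrange_slope(xs, ys, 1.0) * (x - 1.0)
    if x < -1.0:
        return lagrange_eval(xs, ys, -1.0) + lagrange_slope(xs, ys, -1.0) * (x + 1.0)
    return lagrange_eval(xs, ys, x)
```

In a real network the list `ys` would be a trainable parameter vector per hidden unit and the evaluation would be vectorized; the scalar form here only shows how the three regions fit together.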

Load-bearing premise

The assumption that the learned y-coordinates at Chebyshev nodes combined with linear extrapolation will produce stable training dynamics and genuine generalization gains rather than overfitting to the specific datasets or architectures tested.

What would settle it

Observing performance that falls below ReLU or tanh baselines on a new dataset or architecture outside the MNIST, CIFAR-10, and DementiaBank experiments.

Figures

Figures reproduced from arXiv: 1906.10064 by Frank Rudzicz, Jekaterina Novikova, Yuchen Li.

**Figure 1.** Chebyshev-Lagrange activations before (top row) and after (bottom row) receiving backpropagation for 100 epochs of training on CIFAR-10. We show the activations for the first (left column) and second (right column) elements of the last linear layer of a modified ResNet-32. This idea has been explored with piece-wise polynomial activations where the model learns the weights for the Lagrangian basis functio…

**Figure 2.** Sample plots of activations and histograms of their…

**Figure 3.** Sample plots of ReLU and histograms of their inputs for the first 5 hidden units at the…

Original abstract

We seek to improve the data efficiency of neural networks and present novel implementations of parameterized piece-wise polynomial activation functions. The parameters are the y-coordinates of n+1 Chebyshev nodes per hidden unit and Lagrangian interpolation between the nodes produces the polynomial on [-1, 1]. We show results for different methods of handling inputs outside [-1, 1] on synthetic datasets, finding significant improvements in capacity of expression and accuracy of interpolation in models that compute some form of linear extrapolation from either ends. We demonstrate competitive or state-of-the-art performance on the classification of images (MNIST and CIFAR-10) and minimally-correlated vectors (DementiaBank) when we replace ReLU or tanh with linearly extrapolated Chebyshev-Lagrange activations in deep residual architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The Chebyshev-Lagrange activation adds n+1 parameters per hidden unit, so any edge over ReLU or tanh is likely just extra capacity unless the baselines were matched.

The one thing to know is that this activation function carries n+1 extra learnable y-coordinates per hidden unit on top of the usual weights. That makes the performance numbers on MNIST, CIFAR-10, and DementiaBank hard to read without knowing whether the ReLU and tanh baselines got the same total parameter budget or width adjustments. The stress-test note is right on this point; the abstract gives no sign they controlled for it. The central claim therefore rests on an uncontrolled comparison. What the paper actually introduces is a concrete implementation: Chebyshev nodes inside [-1,1], Lagrange interpolation between them, and linear extrapolation outside the interval. They test several boundary rules on synthetic data and report that linear extrapolation improves both expressivity and interpolation accuracy. That combination and the extrapolation handling do not appear in the cited prior work, so the implementation detail is new. The synthetic results are the cleanest part of the paper. The real-data section is weaker. No error bars, no ablation on n, and no statistical tests are mentioned in the abstract, and the full text does not appear to supply them either. The work is aimed at people already tinkering with custom activations for modest data-efficiency gains in residual nets. A reader in that niche can extract the synthetic interpolation findings and the extrapolation rule without much trouble. It is coherent enough and has enough of a novel implementation to go out for peer review, though any referee will need to press on the capacity controls and the missing error analysis.
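The capacity concern in the letter can be made concrete with a little arithmetic. The sketch below counts parameters for a hypothetical one-hidden-layer MLP (not the residual networks used in the paper) and finds the baseline width needed to match the budget of the same network with n+1 extra values per hidden unit. All function names and the example sizes are illustrative assumptions.

```python
def mlp_params(d_in, hidden, d_out):
    """Weights plus biases of a one-hidden-layer MLP with a parameter-free activation."""
    return hidden * (d_in + 1) + d_out * (hidden + 1)


def cl_extra_params(hidden, n):
    """Extra learnable y-coordinates: n+1 per hidden unit."""
    return hidden * (n + 1)


def matched_baseline_width(d_in, hidden, d_out, n):
    """Smallest baseline width whose parameter count reaches the activation model's budget."""
    target = mlp_params(d_in, hidden, d_out) + cl_extra_params(hidden, n)
    width = hidden
    while mlp_params(d_in, width, d_out) < target:
        width += 1
    return width
```

For a hypothetical 784-100-10 network with n = 3, the activations add 400 values to 79,510 baseline parameters, and a baseline with 101 hidden units already exceeds that budget. In a dense layer the overhead is small relative to the weight matrix; whether that also holds for the convolutional residual models in the paper depends on how hidden units are counted there, which this page does not say.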

Referee Report

2 major / 1 minor

Summary. The manuscript introduces parameterized piecewise polynomial activation functions based on Chebyshev nodes, where the y-coordinates are learnable parameters per hidden unit, using Lagrangian interpolation on [-1,1] and linear extrapolation outside this interval. It reports significant improvements in interpolation accuracy on synthetic data and competitive or state-of-the-art classification performance on MNIST, CIFAR-10, and DementiaBank datasets when used in deep residual networks in place of ReLU or tanh activations.

Significance. If the empirical results hold after controlling for parameter count, the work could be significant in demonstrating that learnable polynomial activations can enhance neural network performance and data efficiency. The approach of using Chebyshev-Lagrange with extrapolation provides a structured way to increase expressivity. However, the current presentation leaves open whether the gains are attributable to the activation design or simply to the added degrees of freedom.

major comments (2)

[Abstract] The central performance claim on MNIST, CIFAR-10, and DementiaBank requires that the comparison isolates the effect of the Chebyshev-Lagrange activation rather than the added capacity from n+1 learnable y-coordinates per hidden unit. The manuscript does not indicate whether ReLU/tanh baselines received equivalent parameter budgets or width adjustments.
[Abstract] The abstract reports positive empirical outcomes across three datasets without error bars, ablation details on the extrapolation methods, or statistical tests. This is load-bearing for the robustness of the claimed accuracy improvements.

minor comments (1)

The description of handling inputs outside [-1,1] could benefit from more explicit equations for the extrapolation rule.
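For orientation, one form such equations could take is written out here. This is an assumed reading, not the paper's stated rule: the abstract only says that the better-performing models compute "some form of linear extrapolation", so the tangent-line continuation and the choice of first-kind nodes below are both assumptions, and the symbols x_k, y_j, ℓ_j, p and f are introduced only for this illustration.

```latex
% Assumed node choice: Chebyshev nodes of the first kind
x_k = \cos\!\left(\frac{(2k+1)\,\pi}{2(n+1)}\right), \qquad k = 0, \dots, n

% Lagrange interpolant through the learnable heights y_0, \dots, y_n
p(x) = \sum_{j=0}^{n} y_j \,\ell_j(x), \qquad
\ell_j(x) = \prod_{\substack{m=0 \\ m \neq j}}^{n} \frac{x - x_m}{x_j - x_m}

% Assumed extrapolation rule: tangent lines at the two endpoints
f(x) =
\begin{cases}
p(-1) + p'(-1)\,(x + 1), & x < -1, \\
p(x), & -1 \le x \le 1, \\
p(1) + p'(1)\,(x - 1), & x > 1.
\end{cases}
```

Under this reading f is continuous and has a continuous first derivative at x = ±1, which is one reason a tangent rule is a natural candidate; other linear rules, such as a fixed slope outside the interval, would also fit the abstract's wording.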

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where additional controls and reporting would strengthen the empirical claims. We address each major comment below and commit to revisions that directly respond to the concerns about parameter matching and statistical robustness.

Referee: [Abstract] The central performance claim on MNIST, CIFAR-10, and DementiaBank requires that the comparison isolates the effect of the Chebyshev-Lagrange activation rather than the added capacity from n+1 learnable y-coordinates per hidden unit. The manuscript does not indicate whether ReLU/tanh baselines received equivalent parameter budgets or width adjustments.

Authors: We agree that the original experiments compare networks of identical width and depth, so the Chebyshev-Lagrange activations add n+1 parameters per hidden unit relative to ReLU or tanh. This leaves open whether gains arise from the activation shape or from extra capacity. In the revised manuscript we will add matched-parameter experiments in which baseline ReLU and tanh networks are widened until total parameter count equals that of the Chebyshev-Lagrange models. These new results will be reported alongside the original architecture-matched comparisons. revision: yes

Referee: [Abstract] The abstract reports positive empirical outcomes across three datasets without error bars, ablation details on the extrapolation methods, or statistical tests. This is load-bearing for the robustness of the claimed accuracy improvements.

Authors: The abstract is intentionally concise, yet we accept that it should convey variability and supporting analyses. The body already contains ablation results on linear versus other extrapolation schemes using synthetic interpolation tasks. For the revision we will (i) append error bars obtained from at least five independent random seeds to the reported accuracies, (ii) reference the extrapolation ablations in the abstract, and (iii) include statistical significance tests (paired t-tests or Wilcoxon signed-rank tests) in the experimental sections. The abstract text will be updated accordingly. revision: yes
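The paired t-test mentioned in the response can be sketched with the Python standard library alone. The function below is a hypothetical helper, not code from the paper; it takes per-seed accuracies for two models trained with the same seeds and returns the t statistic and its degrees of freedom. No accuracies from the paper are used or implied.

```python
import math
import statistics


def paired_t_statistic(acc_a, acc_b):
    """Return (t, degrees of freedom) for paired per-seed accuracies.

    acc_a[i] and acc_b[i] must come from the same seed or data split.
    """
    if len(acc_a) != len(acc_b) or len(acc_a) < 2:
        raise ValueError("need two equally long lists with at least two runs")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

The standard library has no t-distribution CDF, so the statistic would be compared against a tabulated critical value; with five seeds (4 degrees of freedom) the two-sided 5% threshold is about 2.776. With so few runs the test has little power, which is one reason the Wilcoxon signed-rank alternative is named alongside it.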

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on direct experiments, not self-referential derivations

The paper proposes a parameterized activation function (y-coordinates at Chebyshev nodes with Lagrange interpolation and linear extrapolation) and reports empirical accuracy on MNIST, CIFAR-10, and DementiaBank in residual networks. No derivation chain, uniqueness theorem, or first-principles prediction is claimed; results are obtained by training and measuring test accuracy. No step reduces a claimed output to a fitted input by construction, and any self-citations (if present) are not load-bearing for the central experimental claims. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach depends on the standard uniqueness property of Lagrange interpolation through n+1 distinct points and on the empirical choice of linear extrapolation; the y-coordinates themselves are learned parameters rather than fixed constants.

free parameters (1)

y-coordinates of n+1 Chebyshev nodes per hidden unit
These heights are the trainable parameters that determine the shape of each unit's activation polynomial.

axioms (1)

standard math: Lagrange interpolation through n+1 distinct points yields a unique polynomial of degree at most n
This classical result from numerical analysis is invoked to guarantee that the interpolated function is well-defined on [-1,1].
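The uniqueness half of that result follows from a short standard argument, reproduced here for reference. The symbols p, q, r, x_k and y_k are notation chosen for this note.

```latex
% Suppose p and q both have degree at most n and interpolate the same data:
p(x_k) = q(x_k) = y_k, \qquad k = 0, \dots, n, \quad x_0, \dots, x_n \text{ distinct}.

% Their difference has degree at most n but vanishes at n+1 distinct points:
r(x) = p(x) - q(x), \qquad \deg r \le n, \qquad r(x_k) = 0 \ \text{for } k = 0, \dots, n.

% A nonzero polynomial of degree at most n has at most n roots, hence
r \equiv 0 \quad \Longrightarrow \quad p \equiv q.
```

Existence is given by the Lagrange form itself, so each setting of the n+1 learned heights corresponds to exactly one activation polynomial on [-1, 1].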

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 4 internal anchors

[1] A. Balagopalan, J. Novikova, F. Rudzicz, and M. Ghassemi. The effect of heterogeneous data for Alzheimer’s disease detection from speech. NeurIPS workshop on Machine Learning for Health, 2018.
[2] F. Boller and J. Becker. Dementiabank database guide. University of Pittsburgh, 2005.
[3] J. Chang, J. Gu, L. Wang, G. Meng, S. Xiang, and C. Pan. Structure-aware convolutional neural networks. NeurIPS, 2018.
[4] X. Cheng, B. Khomtchouk, N. Matloff, and P. Mohanty. Polynomial regression as an alternative to neural nets. arXiv preprint arXiv:1806.06850, 2019.
[5] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. NeurIPS, 2016.
[6] K. C. Fraser, J. A. Meltzer, and F. Rudzicz. Linguistic features identify Alzheimer’s disease in narrative speech. Journal of Alzheimer’s Disease, 49(3), 2018.
[7] X. Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
[8] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. ICML, 2013.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. ICCV, 2015.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016.
[11] S. Karlekar, T. Niu, and M. Bansal. Detecting linguistic characteristics of Alzheimer’s dementia by interpreting neural models. Proceedings of the 2018 conference of the North American Chapter of the Association for Computational Linguistics, 2018.
[12] B. Knyazev, G. W. Taylor, X. Lin, and M. R. Amer. Spectral multigraph networks for discovering and fusing relationships in molecules. NeurIPS, 2018.
[13] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech report, 2009.
[14] M. B. Kursa and W. R. Rudnicki. Feature selection with the Boruta package. Journal of Statistical Software, 2010.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. IEEE, 1998.
[16] T. T. Lee and J. T. Jeng. The Chebyshev-polynomials-based unified model neural networks for function approximation. IEEE, 1998.
[17] D. Levy. Introduction to numerical analysis. University of Maryland, 2010.
[18] Y. Li, S. Hossain, K. Jamali, and F. Rudzicz. DeepConsensus: using the consensus of features from multiple layers to attain robust image classification. arXiv preprint arXiv:1811.07266, 2018.
[19] I. Loshchilov and F. Hutter. SGDR: stochastic gradient descent with warm restarts. ICLR, 2017.
[20] J. Loverich. Discontinuous piecewise polynomial neural networks. arXiv preprint arXiv:1505.04211, 2016.
[21] Z. Noorian, C. Pou-Prom, and F. Rudzicz. On the importance of normative data in speech-based assessment. NeurIPS, 2017.
[22] C. Pou-Prom and F. Rudzicz. Learning multiview embeddings for assessing dementia. EMNLP, 2018.
[23] M. Vleck. Chebyshev polynomial approximation for activation sigmoid function. Neural Network World, 2012.
[24] Z. Zhu, J. Novikova, and F. Rudzicz. Semi-supervised classification by reaching consensus among modalities. NeurIPS workshop on Machine Learning for Health, 2018.

6 Appendix

6.1 Parameters

We first consider the case where the input to the activation function is a vector v of length d, with elements v_i, i = 1, ..., d. For the hyperparameter n, we wish to l...
