pith. machine review for the scientific record.

arxiv: 2605.02591 · v1 · submitted 2026-05-04 · 💻 cs.AI

Recognition: 3 theorem links · Lean Theorem

Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:01 UTC · model grok-4.3

classification 💻 cs.AI
keywords activation functions · Bernstein polynomials · smoothness · deep neural networks · gradient stability · BerLU · piecewise linear · constructive approximation

The pith

BerLU uses Bernstein polynomials to create a smooth quadratic transition in activation functions that guarantees continuous differentiability and a Lipschitz constant of one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a smoothing framework based on constructive approximation to replace non-differentiable points in activation functions. It introduces the Bernstein Linear Unit (BerLU), which builds a quadratic transition region using Bernstein polynomials while keeping the rest piecewise linear. This produces functions that are strictly continuously differentiable with a non-expansive Lipschitz constant of one. The design targets stable gradient flow in deep networks without the high cost of transcendental smooth activations. Experiments on Vision Transformers and CNNs show consistent gains on image classification benchmarks alongside better speed and memory use.

Core claim

Bernstein polynomials can construct a differentiable quadratic transition region for activation functions. The resulting BerLU is strictly continuously differentiable with a Lipschitz constant of one, which supports stable gradient propagation and avoids gradient explosion in deep architectures while retaining the efficiency of piecewise linear forms.

What carries the argument

The Bernstein Linear Unit (BerLU), which applies Bernstein polynomial approximation to build a quadratic transition segment that removes singularities at the origin in otherwise linear activations.
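
A minimal sketch of that machinery, under assumptions this review makes explicit: a ReLU base, a transition region [−ε, ε], and the quadratic segment implied by the boundary-matching argument. The name berlu and the default width are illustrative, not the paper's code.

    import numpy as np

    def berlu(x, eps=1e-2):
        """Bernstein-smoothed ReLU sketch (this review's assumed parameterization).

        Outside [-eps, eps] the function is exactly piecewise linear
        (zero on the left, identity on the right). Inside, a quadratic
        segment matches value and slope at both endpoints:
            f(x) = (x + eps)**2 / (4 * eps), so f'(x) = (x + eps) / (2 * eps),
        which runs linearly from 0 to 1 across the transition.
        """
        return np.where(x <= -eps, 0.0,
                        np.where(x >= eps, x, (x + eps) ** 2 / (4.0 * eps)))

    # Numerical spot-check of the C^1 and Lipschitz-1 claims for this construction:
    x = np.linspace(-1.0, 1.0, 200001)
    g = np.gradient(berlu(x), x)
    assert g.min() >= -1e-6 and g.max() <= 1.0 + 1e-6  # derivative stays in [0, 1]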

If this is right

  • Deep architectures can train stably without gradient explosion issues common in non-smooth activations.
  • Inference remains as fast as piecewise linear functions while avoiding their optimization instability.
  • The same Bernstein smoothing can be applied to other base activations beyond linear ones.
  • Memory and compute overhead stays lower than activations relying on exponentials or other transcendental operations (a rough cost probe follows this list).
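
On that last point, a crude cost probe, not the paper's benchmark: the piecewise-quadratic form needs only adds, multiplies, and selects per element, while a smooth baseline such as the common tanh approximation of GELU pays for a transcendental call. The berlu form below is this review's assumed construction (see the sketch above), and timings vary by environment.

    import time
    import numpy as np

    def berlu(x, eps=1e-2):
        # Piecewise quadratic: adds, multiplies, and selects only.
        return np.where(x <= -eps, 0.0,
                        np.where(x >= eps, x, (x + eps) ** 2 / (4.0 * eps)))

    def gelu_tanh(x):
        # Common tanh approximation of GELU; tanh is the transcendental op.
        return 0.5 * x * (1.0 + np.tanh(0.7978845608 * (x + 0.044715 * x ** 3)))

    x = np.random.randn(10_000_000).astype(np.float32)
    for name, fn in [("berlu", berlu), ("gelu_tanh", gelu_tanh)]:
        fn(x)  # warm-up
        t0 = time.perf_counter()
        for _ in range(10):
            fn(x)
        print(f"{name}: {(time.perf_counter() - t0) / 10:.4f} s per pass")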

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach might support even deeper networks than ReLU allows by removing a key source of training instability.
  • Higher-degree Bernstein polynomials could be swapped in to achieve higher-order differentiability if needed for specific models.
  • The framework could transfer to activation design in non-vision domains such as language models or reinforcement learning.

Load-bearing premise

The transition region's width and shape can be chosen so the smoothed function stays computationally cheap and at least as expressive as standard activations on typical tasks.
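
For the construction sketched above, this premise can be made quantitative. The smoothed unit differs from ReLU only inside the transition, and the gap peaks at the origin; assuming the [−ε, ε] parameterization,

    |f(x) − ReLU(x)| = |(x + ε)² / (4ε) − max(x, 0)| ≤ ε/4, with equality at x = 0,

so shrinking ε recovers ReLU pointwise while f′ stays in [0, 1]. What remains empirical is whether some nonzero ε also helps optimization, which is what the ε sweep in Figure 1 probes.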

What would settle it

A deep network trained with BerLU that exhibits exploding gradients or underperforms ReLU on standard image classification benchmarks would disprove the stability and performance claims.

Figures

Figures reproduced from arXiv: 2605.02591 by Wentao Mo, Wentao Zhang, Yifan Zhu, Yutong Zhang.

Figure 1: Impact of the Smoothing Parameter ϵ on ViT Classification Performance across CIFAR Datasets trained for 100 Epochs. The accuracy exhibits a rise-then-fall trend, peaking at ϵ = 10⁻² with top-1 accuracies of 78.5% on CIFAR-10 and 45.5% on CIFAR-100. Performance remains highly stable for small ϵ ∈ [10⁻⁴, 10⁻¹], where the accuracy fluctuation is negligible (within 1.5%), demonstrating the method’s robustnes…
Original abstract

The efficacy of deep neural networks is heavily reliant on the design of non-linear activation functions, yet existing approaches often struggle to balance optimization stability with computational efficiency. While piecewise linear functions offer inference speed, they suffer from optimization instability due to non-differentiability at the origin, whereas smooth counterparts typically incur significant computational overhead through their reliance on transcendental operations. To address these limitations, this paper proposes a general smoothing framework based on constructive approximation theory and introduces the Bernstein Linear Unit (BerLU). This novel activation function utilizes Bernstein polynomials to construct a differentiable quadratic transition region that effectively eliminates singularities while maintaining a piecewise linear structure. Theoretical analysis demonstrates that the proposed method guarantees strictly continuous differentiability and a non-expansive Lipschitz constant of one, which ensures stable gradient propagation and prevents the gradient explosion problems common in deep architectures. Comprehensive empirical evaluations across representative Vision Transformer and Convolutional Neural Network architectures confirm that this approach consistently outperforms state-of-the-art baselines on standard image classification benchmarks while delivering superior computational and memory efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript proposes the Bernstein Linear Unit (BerLU), a new activation function that applies Bernstein polynomials to construct a quadratic transition region smoothing a piecewise-linear target function. It claims this yields a C^1 continuous activation with a Lipschitz constant of exactly 1 (independent of transition width) via explicit matching of value and first-derivative boundary conditions, where the derivative is a convex combination of endpoint slopes 0 and 1. The paper presents a parameter-free construction, theoretical analysis of gradient stability, and empirical results showing consistent outperformance over GELU and Swish on ViT and CNN image-classification benchmarks with lower FLOPs.

Significance. If the central claims hold, the work provides a constructive, reproducible method for producing smooth, non-expansive activations without transcendental operations or per-dataset tuning. The Bernstein-polynomial approach directly enforces C^1 continuity and unit Lipschitz constant, addressing both optimization instability and computational overhead in deep networks. The parameter-free default and reported efficiency gains on standard architectures represent a practical contribution to activation design.

major comments (1)
  1. §3 (Theoretical Analysis): the derivation that max |f'| = 1 holds independently of transition width is load-bearing for the stability claim; the manuscript should include the explicit step showing that the quadratic Bernstein basis coefficients keep the derivative within [0,1] for arbitrary width parameters, rather than asserting it from the convex-combination property alone.
minor comments (3)
  1. §4 (Empirical Evaluation): the reported performance tables lack error bars, number of runs, or statistical tests; adding these would strengthen the claim of consistent outperformance.
  2. The transition-width hyperparameter is stated to be fixed in the default construction, but its concrete value and sensitivity analysis should appear in the main text rather than only in the appendix.
  3. Notation: the Bernstein polynomial degree and the explicit form of the quadratic transition (e.g., the three basis functions and their coefficients) should be written out in §2 before the boundary-matching argument.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work and the constructive comment on the theoretical section. We will incorporate the requested clarification to strengthen the presentation.

Point-by-point responses
  1. Referee: §3 (Theoretical Analysis): the derivation that max |f'| = 1 holds independently of transition width is load-bearing for the stability claim; the manuscript should include the explicit step showing that the quadratic Bernstein basis coefficients keep the derivative within [0,1] for arbitrary width parameters, rather than asserting it from the convex-combination property alone.

    Authors: We agree with the referee that an explicit derivation of the derivative bound would improve clarity and rigor. In the revised manuscript we will expand §3 to include the following steps: the transition region is realized by the quadratic Bernstein polynomial whose coefficients are set to match value and first-derivative continuity at the endpoints (yielding coefficients 0, ½, 1). Because the Bernstein basis functions are non-negative and form a partition of unity, the derivative is necessarily a convex combination of the endpoint slopes 0 and 1; consequently 0 ≤ f′(x) ≤ 1 holds for any positive transition width. This explicit verification will be inserted immediately after the convex-combination statement, leaving all claims and results unchanged.

    revision: yes
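
The verification the authors describe can be written out in one line. In this review's assumed parameterization (transition [−ε, ε], normalized coordinate t = (x + ε)/(2ε) ∈ [0, 1], value f(x) = ε t²), the derivative degree-elevated into the quadratic Bernstein basis is

    f′(x) = t = 0·(1 − t)² + 2·(½)·t(1 − t) + 1·t²,

with exactly the coefficients (0, ½, 1) the rebuttal cites. Since the basis functions are non-negative and sum to one, f′(x) is a convex combination of {0, ½, 1}, so 0 ≤ f′(x) ≤ 1 for every ε > 0 and max |f′| = 1 independently of the transition width.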

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

Full rationale

The paper defines BerLU via an explicit Bernstein-polynomial construction that matches value and first-derivative boundary conditions of the target piecewise-linear function, yielding a C^1 transition whose derivative is a convex combination of the endpoint slopes 0 and 1. The claimed Lipschitz constant of 1 and continuous differentiability therefore follow directly from the boundary-matching equations and Bernstein basis properties, without any parameter fitting to data, renaming of known results, or load-bearing self-citations. The theoretical guarantees are proven from the construction itself rather than asserted via external uniqueness theorems or prior author work; empirical benchmarks are reported separately and do not retroactively define the smoothness properties.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

An abstract-only review surfaces no explicit free parameters or background axioms; the only invented entity is the BerLU function itself.

invented entities (1)
  • Bernstein Linear Unit (BerLU) · no independent evidence
    purpose: Activation function with smooth quadratic transition via Bernstein polynomials
    The central new object introduced to solve the stability-efficiency trade-off.

pith-pipeline@v0.9.0 · 5476 in / 1235 out tokens · 76018 ms · 2026-05-08T18:01:12.646181+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    Activation functions in deep learning: A comprehensive survey and benchmark

    S. R. Dubey, S. K. Singh, and B. B. Chaudhuri, “Activation functions in deep learning: A comprehensive survey and benchmark,” Neurocomputing, vol. 503, pp. 92–108, 2022

  2. [2]

    Approximation capabilities of multilayer feedforward networks

    K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, 1991

  3. [3]

    Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit

    R. H. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, and H. S. Seung, “Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit,” Nature, vol. 405, no. 6789, pp. 947–951, 2000

  4. [4]

    What is the best multi-stage architecture for object recognition?

    K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?” in 2009 IEEE 12th International Conference on Computer Vision. IEEE, 2009, pp. 2146–2153

  5. [5]

    Rectified linear units improve restricted boltzmann machines

    V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814

  6. [6]

    Rectifier nonlinearities improve neural network acoustic models

    A. L. Maas, A. Y. Hannun, A. Y. Ng et al., “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, vol. 30, no. 1. Atlanta, GA, 2013, p. 3

  7. [7]

    Delving deep into rectifiers: Surpassing human-level performance on imagenet classification

    K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034

  8. [8]

    On the impact of the activation function on deep neural networks training

    S. Hayou, A. Doucet, and J. Rousseau, “On the impact of the activation function on deep neural networks training,” in International Conference on Machine Learning. PMLR, 2019, pp. 2672–2680

  9. [9]

    Smooth maximum unit: Smooth activation function for deep networks using smoothing maximum technique

    K. Biswas, S. Kumar, S. Banerjee, and A. K. Pandey, “Smooth maximum unit: Smooth activation function for deep networks using smoothing maximum technique,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 794–803

  10. [10]

    A general framework for activation function optimization based on mollification theory

    W. Zhang, Y. Zhang, Y. Zheng, and W. Mo, “A general framework for activation function optimization based on mollification theory,” Mathematics, vol. 14, no. 1, p. 72, 2025

  11. [11]

    Visualizing the loss landscape of neural nets

    H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” Advances in Neural Information Processing Systems, vol. 31, 2018

  12. [12]

    Gaussian Error Linear Units (GELUs)

    D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv preprint arXiv:1606.08415, 2016

  13. [13]

    Searching for Activation Functions

    P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” arXiv preprint arXiv:1710.05941, 2017

  14. [14]

    Mish: A self regularized non-monotonic activation function

    D. Misra, “Mish: A self regularized non-monotonic activation function,” arXiv preprint arXiv:1908.08681, 2019

  15. [15]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186

  16. [16]

    Adaptive online convex optimization: A survey of algorithms, theory, and modern applications

    Y. Zhang, W. Zhang, L. Zhang, H. Li, and W. Mo, “Adaptive online convex optimization: A survey of algorithms, theory, and modern applications,” Applied Sciences, vol. 16, no. 4, p. 1739, 2026

  17. [17]

    Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

    D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” arXiv preprint arXiv:1511.07289, vol. 4, no. 5, p. 11, 2015

  18. [18]

    Continuously differentiable exponential linear units

    J. T. Barron, “Continuously differentiable exponential linear units,” arXiv preprint arXiv:1704.07483, 2017

  19. [19]

    An image is worth 16x16 words: Transformers for image recognition at scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2020

  20. [20]

    Language models are unsupervised multitask learners

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, p. 9, 2019

  21. [21]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  22. [22]

    PaLM: Scaling language modeling with pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “PaLM: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023

  23. [23]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

  24. [24]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Deng, C. Ruan, D. Dai, D. Guo et al., “DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model,” arXiv preprint arXiv:2405.04434, 2024

  25. [25]

    From knowing to doing precisely: A general self-correction and termination framework for VLA models

    W. Zhang, A. Sun, W. Mo, X. Qu, Y. Zheng, and J. Wang, “From knowing to doing precisely: A general self-correction and termination framework for VLA models,” arXiv preprint arXiv:2602.01811, 2026

  26. [26]

    Lipschitz recurrent neural networks

    N. B. Erichson, O. Azencot, A. Queiruga, L. Hodgkinson, and M. W. Mahoney, “Lipschitz recurrent neural networks,” in International Conference on Learning Representations, 2021

  27. [27]

    Entropy-based activation function optimization: a method on searching better activation functions

    H. Sun, Z. Wu, B. Xia, P. Chang, Z. Dong, Y. Yuan, Y. Chang, and X. Wang, “Entropy-based activation function optimization: a method on searching better activation functions,” in The Thirteenth International Conference on Learning Representations, 2025

  28. [28]

    Lipschitz constant estimation of neural networks via sparse polynomial optimization

    F. Latorre, P. Rolland, and V. Cevher, “Lipschitz constant estimation of neural networks via sparse polynomial optimization,” in International Conference on Learning Representations, 2020

  29. [29]

    Learning multiple layers of features from tiny images

    A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009

  30. [30]

    Imagenet: A large-scale hierarchical image database

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255

  31. [31]

    Training data-efficient image transformers & distillation through attention

    H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International Conference on Machine Learning. PMLR, 2021, pp. 10347–10357

  32. [32]

    Transformer in transformer

    K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang, “Transformer in transformer,” Advances in Neural Information Processing Systems, vol. 34, pp. 15908–15919, 2021

  33. [33]

    A convnet for the 2020s

    Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A ConvNet for the 2020s,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11976–11986