arxiv: 2605.25704 · v1 · pith:C7AAK6BY · submitted 2026-05-25 · 💻 cs.CL · cs.LG

PowLU: An Activation Function for Stable Pre-Training of LLMs

Pith reviewed 2026-06-29 21:45 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords PowLU · activation function · LLM training · numerical stability · SwiGLU · scaling laws · pre-training

The pith

PowLU uses a rational power function to deliver stable nonlinearity in LLM activations, matching SwiGLU performance at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes PowLU, an activation function that uses a rational power function to provide adaptive nonlinearity in large language models. It targets the numerical instability of SwiGLU, which approximates a quadratic for large positive inputs and therefore widens the output range and amplifies outliers in low-precision training. Scaling-law experiments are reported to show consistent performance across model sizes. Tests on the Ling architecture at 7.9B and 124B total parameters are reported to show that PowLU is competitive with SwiGLU and its clipped variant, SwiGLU-Clip. If these results hold, they would support more reliable training of very large models.
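To make the low-precision concern concrete, here is a minimal numeric sketch. It assumes the worst case where the gate and value pre-activations of one hidden unit are equal, and it compares the result with the largest finite values of three common formats. The helper names and the choice of formats are illustrative assumptions, not taken from the paper.

```python
import math

# Largest finite values of some low-precision formats (standard constants).
# Which format the paper trains in is not stated on this page.
MAX_FINITE = {
    "fp8_e4m3": 448.0,
    "fp8_e5m2": 57344.0,
    "fp16": 65504.0,
}


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swiglu_scalar(gate: float, value: float) -> float:
    """Elementwise SwiGLU term for one hidden unit: Swish(gate) * value."""
    return gate * sigmoid(gate) * value


def exceeds_range(y: float, fmt: str) -> bool:
    """True if |y| is larger than the format's largest finite value."""
    return abs(y) > MAX_FINITE[fmt]


# Assumption: gate == value == x, so the term behaves like x^2 for large x.
for x in (4.0, 32.0, 256.0):
    y = swiglu_scalar(x, x)
    flags = {fmt: exceeds_range(y, fmt) for fmt in MAX_FINITE}
    print(f"x={x:6.1f}  swiglu={y:10.2f}  x^2={x * x:10.2f}  {flags}")
```

An input of a few hundred is already enough for the product to leave the fp16 range in this toy setting, which is the kind of outlier amplification the summary describes.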

Core claim

PowLU employs a rational power function to achieve adaptive nonlinearity, thereby improving representation ability and enabling stable training in spike regions. Theoretical justification is provided for several key properties. Scaling law experiments confirm that the performance is consistent across model sizes, and experimental results with the Ling architecture at 7.9B and 124B total parameters show that PowLU achieves competitive results against SwiGLU and SwiGLU-Clip.

What carries the argument

The rational power function, which supplies adaptive nonlinearity without quadratic amplification of large inputs.
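The exact definition is not given in this summary. The only explicit form on this page is a text fragment captured at the end of the reference graph, which describes the x > 0 branch as x^(1 + m/(√x + 1)) multiplied by a sigmoid. The sketch below implements only that branch under that assumption. The function name `powlu_positive`, the default `m = 1.0`, and the comparison helper are hypothetical, and the x ≤ 0 branch is deliberately left out because it is not specified here.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def powlu_positive(x: float, m: float = 1.0) -> float:
    """Assumed x > 0 branch: x ** (1 + m / (sqrt(x) + 1)) * sigmoid(x).

    `m` is a placeholder value; the paper's setting is not given on this page.
    """
    if x <= 0:
        raise ValueError("only the x > 0 branch is sketched here")
    exponent = 1.0 + m / (math.sqrt(x) + 1.0)
    return (x ** exponent) * sigmoid(x)


def swiglu_like(x: float) -> float:
    """SwiGLU term with gate == value == x, for comparison."""
    return x * x * sigmoid(x)


for x in (1.0, 10.0, 100.0, 1e4):
    p, s = powlu_positive(x), swiglu_like(x)
    print(f"x={x:8.1f}  powlu={p:14.3f}  swiglu_like={s:14.3f}  powlu/x={p / x:.4f}")
```

Under these assumptions the exponent decays toward 1 as x grows, so the ratio `powlu/x` shrinks toward 1 while the SwiGLU-like term keeps growing quadratically.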

If this is right

  • Performance stays consistent as models grow larger.
  • Training remains stable even with low precision and large inputs.
  • Scalability of LLM pre-training improves without additional clipping mechanisms.
  • Representation capacity matches or approaches that of SwiGLU in large architectures.
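For context on the third bullet, a clipping mechanism usually means clamping the pre-activations before the gated product so the output stays bounded. The sketch below is a generic, hypothetical variant: the name `clipped_swiglu`, the default `limit`, and the choice to clamp both inputs symmetrically are assumptions, and the paper's SwiGLU-Clip may be defined differently.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def clamp(x: float, limit: float) -> float:
    """Restrict x to the interval [-limit, limit]."""
    return max(-limit, min(x, limit))


def clipped_swiglu(gate: float, value: float, limit: float = 8.0) -> float:
    """Hypothetical clipped SwiGLU term: clamp both inputs, then gate.

    The output magnitude is bounded by limit ** 2, at the cost of an
    extra hyperparameter (`limit`) that has to be chosen.
    """
    g = clamp(gate, limit)
    v = clamp(value, limit)
    return g * sigmoid(g) * v


print(clipped_swiglu(1.0, 2.0))      # inside the limit: unchanged SwiGLU term
print(clipped_swiglu(1e3, 1e3))      # outlier: bounded by limit ** 2
```

The extra `limit` parameter is the kind of additional mechanism the bullet refers to; an activation that is stable by construction would not need it.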

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • PowLU might allow training without per-layer hyperparameter tuning for activations.
  • Similar power-based functions could be tested in other neural network domains prone to outlier issues.
  • The approach could extend to multimodal models where activation stability is critical.

Load-bearing premise

A rational power function can supply enough nonlinearity and representation capacity for LLMs without creating new instability or requiring retuning.

What would settle it

A training run at 100B parameters where PowLU produces higher validation loss or more divergence than SwiGLU-Clip under identical conditions.
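One way such a comparison could be scored is sketched below: count loss spikes in each run's log and compare final losses. The spike rule (loss above `ratio` times the mean of the previous `window` steps), the function names, and the toy numbers are all assumptions for illustration; none of them come from the paper.

```python
from statistics import mean


def count_spikes(losses, window=5, ratio=1.5):
    """Count steps whose loss exceeds `ratio` times the mean of the
    previous `window` steps (a simple, hypothetical divergence proxy)."""
    spikes = 0
    for i in range(window, len(losses)):
        baseline = mean(losses[i - window:i])
        if losses[i] > ratio * baseline:
            spikes += 1
    return spikes


def is_worse(candidate, reference, window=5, ratio=1.5):
    """True if `candidate` ends at a higher loss or shows more spikes
    than `reference` under the same spike rule."""
    higher_final = candidate[-1] > reference[-1]
    more_spikes = (count_spikes(candidate, window, ratio)
                   > count_spikes(reference, window, ratio))
    return higher_final or more_spikes


# Toy curves, made up for illustration only (not results from the paper).
smooth_run = [3.0, 2.9, 2.8, 2.7, 2.6, 2.5, 2.4, 2.3]
spiky_run = [3.0, 2.9, 2.8, 2.7, 2.6, 6.0, 2.5, 2.4]

print(count_spikes(smooth_run), count_spikes(spiky_run))
print(is_worse(spiky_run, smooth_run))
```

In a real comparison both runs would need identical data order, optimizer settings, and precision so that the activation function is the only difference.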

Original abstract

In contemporary large language models (LLMs), the swish-gated linear unit (SwiGLU) activation function is widely adopted to regulate the information flow and introduce non-linearity. For large positive inputs, SwiGLU approximates the quadratic function $x^2$, providing strong nonlinearity and expressive capacity. However, this property also causes numerical instability as the input or model scale increases, particularly in low-precision LLM training. The main reason is its approximate quadratic amplification, which enlarges the output range and exacerbates outliers. To address this issue, we propose a stable activation function, Power Linear Unit (PowLU), for large-scale LLM pre-training. Specifically, PowLU employs a rational power function to achieve adaptive nonlinearity, thereby improving representation ability and enabling stable training in spike regions. Moreover, we provide theoretical justification for several key properties of PowLU. Scaling law experiments confirm that the performance is consistent across model sizes, and further experimental results with the Ling architecture (7.9B and 124B total parameters) demonstrate that PowLU achieves competitive results against SwiGLU and SwiGLU-Clip in large-scale training of LLMs. In addition, the experimental results also show that PowLU effectively improves the scalability of the large-scale training of LLMs.
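The quadratic behavior mentioned in the abstract can be made explicit with a short worked derivation for a single hidden unit. The symbols $a$ and $b$ (gate and value pre-activations) are introduced here for illustration, following the standard SwiGLU form $\mathrm{Swish}(xW) \odot xV$; this is a sketch, not the paper's own analysis.

```latex
\begin{align*}
\mathrm{SwiGLU}(a, b) &= a\,\sigma(a)\,b, \qquad
  \sigma(a) = \frac{1}{1 + e^{-a}} = 1 - \frac{1}{1 + e^{a}}, \\
a\,\sigma(a)\,b - ab &= -\frac{ab}{1 + e^{a}} \;\longrightarrow\; 0
  \quad \text{as } a \to +\infty \text{ with } b = O(a), \\
a = b = x:\qquad x^{2}\,\sigma(x) &= x^{2} - \frac{x^{2}}{1 + e^{x}} \approx x^{2}
  \quad \text{for large positive } x.
\end{align*}
```

The correction term decays exponentially, so for aligned, large positive pre-activations the output scales as the square of the input.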

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Power Linear Unit (PowLU) activation function based on a rational power function to provide adaptive nonlinearity and address numerical instability in SwiGLU (which approximates x² for large positive inputs) during large-scale, low-precision LLM pre-training. It supplies theoretical justification for key properties of PowLU, reports scaling-law experiments showing performance consistency across model sizes, and presents results on the Ling architecture at 7.9B and 124B total parameters claiming competitive performance versus SwiGLU and SwiGLU-Clip along with improved scalability.

Significance. If the empirical claims hold with adequate controls, PowLU could meaningfully improve training stability for LLMs at and beyond current scales by mitigating outlier amplification without sacrificing representation capacity. The scaling-law consistency and large-model runs would be valuable if they include proper ablations; the rational-power approach is a plausible direction if its theoretical properties are shown to be robust rather than fitted.

major comments (2)
  1. [Abstract] Abstract: the central empirical claim that PowLU 'achieves competitive results against SwiGLU and SwiGLU-Clip' and 'effectively improves the scalability' on 7.9B/124B Ling models is asserted without any quantitative metrics, error bars, baseline details, data-exclusion rules, or per-layer retuning information, rendering the claim unevaluable.
  2. [Theoretical justification] Theoretical justification and experimental sections: the paper states that theoretical properties justify stability, but does not demonstrate that these properties bound behavior in spike regions or extreme regimes at scales >124B; without such a bound the scalability conclusion does not follow from the reported Ling-architecture runs.
minor comments (1)
  1. The definition and exact form of the rational power function should be given as an explicit equation early in the method section rather than described only in prose.
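Regarding the minor comment and major comment 2: the only explicit form visible on this page is the fragment captured at the end of the reference graph, which describes the $x > 0$ branch as $x^{1 + m/(\sqrt{x}+1)}$ times a sigmoid. Assuming that form, and assuming $m \ge 0$ (the value of $m$ and the $x \le 0$ branch are not given here), a rough growth estimate can be sketched as follows. It is an illustration, not a bound proved in the paper.

```latex
\begin{align*}
f(x) &= x^{\,1 + \frac{m}{\sqrt{x}+1}}\,\sigma(x), \qquad x > 0, \\
\frac{f(x)}{x\,\sigma(x)} &= \exp\!\left(\frac{m \ln x}{\sqrt{x}+1}\right)
  \;\longrightarrow\; 1 \quad (x \to \infty), \\
x \ge 1 \;\Rightarrow\; \sqrt{x}+1 \ge 2
  \;&\Rightarrow\; x\,\sigma(x) \;\le\; f(x) \;\le\; x^{\,1 + m/2}\,\sigma(x).
\end{align*}
```

Under these assumptions the positive branch grows like $x$ asymptotically and stays below $x^{2}$ for $x \ge 1$ whenever $m < 2$. Whether this matches the stability properties the paper actually proves cannot be checked from this page.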

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate where revisions will be made.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central empirical claim that PowLU 'achieves competitive results against SwiGLU and SwiGLU-Clip' and 'effectively improves the scalability' on 7.9B/124B Ling models is asserted without any quantitative metrics, error bars, baseline details, data-exclusion rules, or per-layer retuning information, rendering the claim unevaluable.

    Authors: We agree that the abstract would benefit from greater specificity. In the revised manuscript we will expand the abstract to report the key quantitative metrics (e.g., relative perplexity or downstream scores) from the 7.9B and 124B Ling runs, identify the exact SwiGLU and SwiGLU-Clip baselines, and note the training configuration details that were used.

    Revision: yes

  2. Referee: [Theoretical justification] Theoretical justification and experimental sections: the paper states that theoretical properties justify stability, but does not demonstrate that these properties bound behavior in spike regions or extreme regimes at scales >124B; without such a bound the scalability conclusion does not follow from the reported Ling-architecture runs.

    Authors: We acknowledge that the current theoretical analysis supplies properties that reduce outlier amplification but does not derive explicit bounds that guarantee behavior for scales strictly larger than 124B. The scaling-law consistency and the 124B results provide supporting empirical evidence; we will add a limitations paragraph that clarifies the scope of the theoretical claims and the degree to which the observed trends can be extrapolated.

    Revision: partial

standing simulated objections not resolved
  • Deriving explicit mathematical bounds on PowLU behavior in spike regions for model scales exceeding 124B parameters.

Circularity Check

0 steps flagged

No circularity in PowLU derivation or claims

Full rationale

The paper defines PowLU via a rational power function, states theoretical justification for its properties, and validates via scaling-law and Ling-architecture experiments at multiple scales. No equations or steps reduce by construction to fitted inputs, self-citations, or renamed known results. The derivation chain is self-contained against external benchmarks (empirical comparisons to SwiGLU variants) with no load-bearing self-referential reductions visible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies insufficient detail to enumerate concrete free parameters, axioms, or invented entities; the rational power itself appears to be the central modeling choice whose exponent or form is not specified here.

pith-pipeline@v0.9.1-grok · 5778 in / 1131 out tokens · 23445 ms · 2026-06-29T21:45:40.320160+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 22 canonical work pages · 14 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925.

  2. [2]

    Faster and lighter LLMs: A survey on current challenges and way forward

    Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, and Deepak Gupta. Faster and lighter LLMs: A survey on current challenges and way forward. arXiv preprint arXiv:2402.01799.

  3. [3]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

  4. [4]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.

  5. [5]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

  6. [6]

    Xeron Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixin Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi LI, Yunwen Li, dehua ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu ...

  7. [7]

    Low-precision training of large language models: Methods, challenges, and opportunities

    Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Guoxia Wang, Dianhai Yu, Yonggang Wen, and Dacheng Tao. Low-precision training of large language models: Methods, challenges, and opportunities. arXiv preprint arXiv:2505.01043.

  8. [8]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

  9. [9]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021a. https://openreview.net/forum?id=d7KBjmI3GmQ. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song...

  10. [10]

    Three decades of activations: A comprehensive survey of 400 activation functions for neural networks

    Vladimír Kunc and Jiří Kléma. Three decades of activations: A comprehensive survey of 400 activation functions for neural networks. arXiv preprint arXiv:2402.09092.

  11. [11]

    To FP8 and back again: Quantifying the effects of reducing precision on LLM training stability

    Joonhyung Lee, Jeongin Bae, Byeongwook Kim, Se Jung Kwon, and Dongsoo Lee. To FP8 and back again: Quantifying the effects of reducing precision on LLM training stability. arXiv preprint arXiv:2405.18710.

  12. [12]

    CMMLU: Measuring massive multitask language understanding in Chinese

    Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11260–11285.

  13. [13]

    Large Language Models: A Survey

    Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196.

  14. [14]

    ReLU strikes back: Exploiting activation sparsity in large language models

    Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, and Mehrdad Farajtabar. ReLU strikes back: Exploiting activation sparsity in large language models. arXiv preprint arXiv:2310.04564.

  15. [15]

    A theory on Adam instability in large-scale machine learning

    Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, et al. A theory on Adam instability in large-scale machine learning. arXiv preprint arXiv:2304.09871.

  16. [16]

    Searching for Activation Functions

    Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941.

  17. [17]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.

  18. [18]

    OpenAI GPT-5 System Card

    Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners. In International Conference on Learning Representations, 2023. https://openreview.net/forum?id=fR3wGCk-IXp. Aaditya Singh, Ada...

  19. [19]

    Challenging big-bench tasks and whether chain-of-thought can solve them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051.

  20. [20]

    Gemini: A Family of Highly Capable Multimodal Models

    Sho Takase, Shun Kiyono, Sosuke Kobayashi, and Jun Suzuki. Spike no more: Stabilizing the pre-training of large language models. In Second Conference on Language Modeling, 2025. https://openreview.net/forum?id=52YBEzcI0l. Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, K...

  21. [21]

    Attention Residuals

    Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, et al. Attention residuals. arXiv preprint arXiv:2603.15031.

  22. [22]

    Every activation boosted: Scaling general reasoner to 1 trillion open language foundation

    Ling Team, Ang Li, Ben Liu, Binbin Hu, Bing Li, Bingwei Zeng, Borui Ye, Caizhi Tang, Changxin Tian, Chao Huang, et al. Every activation boosted: Scaling general reasoner to 1 trillion open language foundation. arXiv preprint arXiv:2510.22115.

  23. [23]

    CMATH: Can your language model pass Chinese elementary school math test?

    Tianwen Wei, Jian Luan, Wei Liu, Shuang Dong, and Bin Wang. CMATH: Can your language model pass Chinese elementary school math test? arXiv preprint arXiv:2306.16636.

  24. [24]

    Qwen3 Technical Report

    Mitchell Wortsman, Peter J Liu, Lechao Xiao, Katie E Everett, Alexander A Alemi, Ben Adlam, John D Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, and Simon Kornblith. Small-scale proxies for large-scale transformer training instabilities. In International Conference on ...

  25. [25]

    AGIEval: A human-centric benchmark for evaluating foundation models

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2299–2314.

  26. [26]

    The sigmoid function is non-linear (Ramapuram et al., 2025)

    For the $x > 0$ part, the PowLU activation function is the product of $x^{1 + m/(\sqrt{x}+1)}$ and the sigmoid function. The sigmoid function is non-linear (Ramapuram et al., 2025). Furthermore, the exponent $1 + m/(\sqrt{x} +$ ...