More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations
Pith reviewed 2026-06-29 19:34 UTC · model grok-4.3
The pith
Mixture of Activations strictly increases the expressive power of feedforward layers by making nonlinearity selection depend on the input token.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mixture of Activations (MoA) mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. The paper claims strict finite-width expressive separations: fixed-activation FFNs are strictly contained in learnable activations (LA), the input-independent counterpart that learns a linear combination of activations, and LA is in turn strictly contained in MoA. The added expressivity is attributed to input-dependent nonlinear hybridization. Pretraining experiments report lower terminal loss and more favorable scaling than well-tuned baselines.
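The containment chain can be written out under one plausible formalization. The symbols below ($W_1$, $W_2$, $\sigma_k$, $\alpha_k$, $g_k$, $K$) are illustrative assumptions for a ReLU-type FFN; the paper's own notation and gate parameterization may differ.

```latex
\begin{align*}
\text{Fixed:} \quad & f(x) = W_2\,\sigma(W_1 x) \\
\text{LA:}    \quad & f(x) = W_2 \sum_{k=1}^{K} \alpha_k\,\sigma_k(W_1 x),
                      \qquad \alpha_k \text{ learned, independent of } x \\
\text{MoA:}   \quad & f(x) = W_2 \sum_{k=1}^{K} g_k(x)\,\sigma_k(W_1 x),
                      \qquad g_k(x) \text{ a lightweight gate evaluated on the token } x
\end{align*}
```

Under this reading, LA recovers a fixed activation when $\alpha$ is one-hot, and MoA recovers LA when every $g_k$ is constant in $x$. That gives the non-strict inclusions; the strictness of each inclusion is what the paper's theory section is said to prove.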
What carries the argument
Mixture of Activations (MoA) with input-dependent gates that select and mix from multiple activation functions per token
Load-bearing premise
The lightweight input-dependent gates realize genuine input-dependent nonlinear hybridization without optimization difficulties or capacity limits erasing the theoretical separation in practice.
What would settle it
A finite-width counterexample showing that some MoA network can be exactly reproduced by a fixed-activation FFN of comparable width, or a set of pretraining runs in which MoA fails to reach lower terminal loss than well-tuned fixed or LA baselines.
Original abstract
Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activation function, applying the same nonlinear transformation to all tokens. In this work, we propose Mixture of Activations (MoA), a token-adaptive FFN design that mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. As an input-independent counterpart, we also introduce learnable activations (LA), which form linear combinations of activation functions for both ReLU-type and SwiGLU-type FFNs. Theoretically, we establish strict finite-width expressive separations among fixed-activation FFNs, LA, and MoA: LA strictly contains fixed-activation FFNs, while MoA strictly contains LA, with the additional expressivity arising from input-dependent nonlinear hybridization. Empirically, we evaluate MoA through extensive pre-training experiments on dense and MoE language models ranging from 0.12B to 2B parameters under different token budgets, optimizers, and learning rate schedules. MoA consistently achieves lower terminal loss and exhibits more favorable scaling behavior than well-tuned baselines, with minimal parameter and computational overhead. These results suggest that token-adaptive activation mixing is a simple and effective mechanism for improving FFN expressivity in LLMs.
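The mechanism described in the abstract can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the dictionary `ACTIVATIONS`, the softmax gate, the gate matrix `Wg`, and the helper names `mix_ffn`, `la_ffn`, `moa_ffn`, and `gate_weights` are all hypothetical, and a ReLU-type (non-gated) FFN is assumed.

```python
import math

def relu(z):
    return max(z, 0.0)

def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def silu(z):
    # Fine for the small inputs used here; very negative z would overflow exp.
    return z / (1.0 + math.exp(-z))

ACTIVATIONS = [relu, gelu, silu]  # hypothetical activation dictionary

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_ffn(x, W1, W2, weights):
    """FFN with shared projections W1, W2 and one mixing weight per activation."""
    h = matvec(W1, x)
    mixed = [sum(w * act(z) for w, act in zip(weights, ACTIVATIONS)) for z in h]
    return matvec(W2, mixed)

def la_ffn(x, W1, W2, alpha):
    """Learnable activation: the mixing weights alpha do not depend on x."""
    return mix_ffn(x, W1, W2, alpha)

def gate_weights(x, Wg):
    """Assumed gate: softmax over a linear map of the token."""
    return softmax(matvec(Wg, x))

def moa_ffn(x, W1, W2, Wg):
    """Mixture of Activations: the mixing weights are recomputed for each token."""
    return mix_ffn(x, W1, W2, gate_weights(x, Wg))

W1 = [[1.0, -0.5], [0.3, 0.8], [-1.2, 0.4]]   # 2 -> 3
W2 = [[0.5, -1.0, 0.25], [1.5, 0.2, -0.7]]    # 3 -> 2
Wg = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]     # 2 -> 3 gate logits
print(moa_ffn([0.7, -1.3], W1, W2, Wg))
```

With an all-zero `Wg` the gate is uniform for every token, so `moa_ffn` coincides with `la_ffn` using equal coefficients. This is the degenerate case in which MoA behaves as LA.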
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Mixture of Activations (MoA) for FFN layers, which mixes a dictionary of activation functions via lightweight input-dependent gates while sharing linear projections; it also introduces Learnable Activations (LA) as the input-independent linear-combination counterpart. It claims strict finite-width expressive separations (fixed-activation FFNs ⊂ LA ⊂ MoA) arising from input-dependent nonlinear hybridization, and reports that MoA yields lower terminal loss and better scaling than tuned baselines in pre-training runs on 0.12B–2B dense and MoE models across token budgets, optimizers, and schedules, with negligible overhead.
Significance. If the finite-width separations are realized by non-degenerate gates in trained models and the observed loss improvements are causally linked to the extra expressivity (rather than optimization artifacts or capacity differences), the approach offers a low-overhead route to increasing FFN expressivity. The empirical scope across scales and setups is a strength, but significance hinges on confirming that the claimed hybridization occurs in practice.
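To give the "negligible overhead" claim a rough scale, the following back-of-the-envelope sketch counts parameters under stated assumptions: a bias-free two-projection FFN, a gate that is a single bias-free linear map from the model dimension to one logit per activation, and hypothetical sizes (`d_model = 768`, `d_ff = 3072`, `K = 4`). None of these values come from the paper, whose gate design may differ.

```python
def ffn_params(d_model, d_ff):
    # Two bias-free projections: d_model -> d_ff and d_ff -> d_model.
    return 2 * d_model * d_ff

def gate_params(d_model, num_activations):
    # Assumed gate: one bias-free linear map d_model -> num_activations.
    return d_model * num_activations

d_model, d_ff, K = 768, 3072, 4  # hypothetical sizes, not taken from the paper
base = ffn_params(d_model, d_ff)
extra = gate_params(d_model, K)
overhead = extra / base
print(f"{extra} gate parameters on top of {base} FFN parameters ({overhead:.4%})")
```

Under these assumptions the ratio simplifies to K / (2 * d_ff), so it shrinks as the hidden width grows. A SwiGLU-type FFN has three projections, which would make the relative overhead smaller still.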
major comments (2)
- [theoretical analysis] Theoretical separation claims (abstract and theory section): the strict containment MoA ⊃ LA is asserted to arise from input-dependent nonlinear hybridization. However, this separation is realized only if the learned gates vary meaningfully across tokens and the resulting hybrids lie outside the LA function class. The manuscript provides no post-training analysis of gate statistics, variance, or effective rank, leaving open the possibility that gradient dynamics cause gates to converge to near-constant values and collapse MoA to LA behavior. This directly undermines attribution of empirical gains to the theoretical separation.
- [experiments] Empirical evaluation (experiments section): the claim of consistent gains 'across scales, optimizers, and token budgets' is load-bearing for the practical contribution. Without ablations that isolate gate-induced hybridization (e.g., freezing gates to constants, measuring per-token activation diversity, or comparing against an LA baseline with matched parameter count), it is impossible to rule out that observed improvements stem from other factors such as implicit regularization or optimization landscape changes rather than the asserted extra expressivity.
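The diagnostics requested in the two major comments could be sketched as follows. This is a hypothetical illustration, not code from the paper: it assumes the gates for a batch of tokens are available as a list of probability vectors, and the helper names (`gate_mean`, `gate_variance`, `mean_entropy`, `freeze_to_mean`) are invented here. Effective rank is omitted to keep the sketch free of dependencies.

```python
import math

def gate_mean(gates):
    """Average gate vector over tokens; `gates` is a list of per-token probability vectors."""
    n = len(gates)
    return [sum(g[k] for g in gates) / n for k in range(len(gates[0]))]

def gate_variance(gates):
    """Population variance of each gate coordinate across tokens."""
    mean = gate_mean(gates)
    n = len(gates)
    return [sum((g[k] - mean[k]) ** 2 for g in gates) / n for k in range(len(mean))]

def mean_entropy(gates):
    """Mean per-token entropy in nats; values near zero mean near one-hot selection."""
    def entropy(g):
        return -sum(p * math.log(p) for p in g if p > 0.0)
    return sum(entropy(g) for g in gates) / len(gates)

def freeze_to_mean(gates):
    """Gate-freezing ablation: every token receives the same averaged gate,
    which turns the layer into an LA-style input-independent mixture."""
    mean = gate_mean(gates)
    return [list(mean) for _ in gates]

varied = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]      # gates that change per token
collapsed = [[0.6, 0.4], [0.6, 0.4], [0.6, 0.4]]   # gates that ignore the token

print("variance (varied):   ", gate_variance(varied))
print("variance (collapsed):", gate_variance(collapsed))
print("variance (frozen):   ", gate_variance(freeze_to_mean(varied)))
print("mean entropy (varied):", mean_entropy(varied))
```

Near-zero variance across tokens would indicate that a trained MoA layer has collapsed to LA-like behavior. Comparing loss with the original gates against loss with `freeze_to_mean` gates is one way to isolate how much of any gain depends on input-dependent mixing.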
minor comments (2)
- [methods] Notation for the gate functions and the dictionary of activations should be introduced with explicit equations early in the methods section to avoid ambiguity when comparing LA and MoA.
- The title states 'Part I', but the manuscript contains no forward reference to planned follow-up work and no statement of the limitations of the current scope.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and agree that targeted additions will strengthen the attribution of results to the claimed mechanism.
- Referee: Theoretical separation claims (abstract and theory section): the strict containment MoA ⊃ LA is asserted to arise from input-dependent nonlinear hybridization. However, this separation is realized only if the learned gates vary meaningfully across tokens and the resulting hybrids lie outside the LA function class. The manuscript provides no post-training analysis of gate statistics, variance, or effective rank, leaving open the possibility that gradient dynamics cause gates to converge to near-constant values and collapse MoA to LA behavior. This directly undermines attribution of empirical gains to the theoretical separation.
  Authors: The theory section proves strict finite-width containment (fixed ⊂ LA ⊂ MoA) via explicit constructions showing input-dependent hybridization can realize functions outside the LA class. We agree that confirming non-degenerate gate behavior in trained models would strengthen the link to empirical gains. In revision we will add post-training gate statistics (variance, per-token diversity, effective rank) on the reported runs.
  Revision: yes
- Referee: Empirical evaluation (experiments section): the claim of consistent gains 'across scales, optimizers, and token budgets' is load-bearing for the practical contribution. Without ablations that isolate gate-induced hybridization (e.g., freezing gates to constants, measuring per-token activation diversity, or comparing against an LA baseline with matched parameter count), it is impossible to rule out that observed improvements stem from other factors such as implicit regularization or optimization landscape changes rather than the asserted extra expressivity.
  Authors: The experiments show MoA outperforming tuned fixed-activation baselines across the stated range. LA is introduced primarily as the input-independent theoretical counterpart; direct matched-parameter LA comparisons and gate-freezing ablations are absent. We will add both in revision (LA baselines and controlled gate-freezing runs) to better isolate the contribution of input-dependent mixing.
  Revision: yes
Circularity Check
No significant circularity in theoretical separations or empirical results
The paper's central claims consist of a theoretical proof of strict finite-width expressive separations (LA contains fixed FFNs; MoA contains LA via input-dependent hybridization) and independent empirical observations of lower terminal loss and better scaling in pre-training runs. No equation, definition, or self-citation in the abstract or described content reduces any claimed separation or performance gain to a fitted quantity defined by the same experiment. The theoretical hierarchy is presented as an external mathematical result, and the empirical gains are reported as observations under varied training conditions, making the derivation self-contained.