LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
Pith reviewed 2026-05-25 04:28 UTC · model grok-4.3
The pith
LLM training obeys a Shannon-style capacity limit: when the signal-to-noise ratio is insufficient, further scaling stops improving loss and instead produces U-shaped degradation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes a Shannon Scaling Law that treats LLM training as noisy-channel transmission. By mapping parameters to bandwidth and tokens to signal power, it derives an explicit capacity bound. Scaling either quantity without preserving sufficient signal-to-noise ratio amplifies noise and produces a transition from monotonic gains to U-shaped degradation. The law is validated by superior fits on perturbed Pythia and OLMo2 runs and by accurate extrapolation from models up to 6.9B parameters and 180B tokens to a 12B model at 307B tokens.
What carries the argument
The Shannon Scaling Law obtained by applying the Shannon-Hartley theorem after equating model parameters with channel bandwidth and training tokens with signal power.
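For reference, the first line below is the textbook Shannon-Hartley theorem for a band-limited channel with additive white Gaussian noise, where $B$ is bandwidth, $S$ is signal power, and $N$ is noise power. The second line is only a schematic of the substitution described above. Its symbols are assumptions introduced here for illustration: $N_{\text{params}}$ for parameter count, $D$ for training tokens, $\sigma^2$ for an effective noise power, and placeholder functions $f$ and $g$. The paper's fitted expression is not reproduced in this review, so this should not be read as the authors' formula.

```latex
% Textbook Shannon-Hartley theorem (AWGN channel)
C = B \log_2\!\left(1 + \frac{S}{N}\right)

% Schematic substitution; f, g, and \sigma^2 are placeholders, not the paper's form
C_{\text{LLM}} \;\sim\; f(N_{\text{params}})\,\log_2\!\left(1 + \frac{g(D)}{\sigma^2}\right)
```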
If this is right
- Performance curves become U-shaped rather than monotonic power laws once signal-to-noise ratio falls below the capacity threshold.
- Quantization and added noise produce degradation exactly where the derived capacity predicts a loss basin.
- The law extrapolates loss on unseen larger models and longer training runs where prior monotonic laws diverge.
- Supervised fine-tuning on math, QA, and code tasks exhibits the same SNR-dependent non-monotonicity captured by the formulation.
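One way to build intuition for why adding bandwidth alone cannot help indefinitely is the textbook AWGN case, where noise power grows with bandwidth as $N_0 B$. The sketch below uses only that standard result, not the paper's fitted law; `awgn_capacity`, `capacity_ceiling`, and the numeric values are illustrative assumptions. It shows capacity saturating at $S / (N_0 \ln 2)$ as bandwidth grows with signal power held fixed. Saturation is weaker than the U-shaped degradation the paper reports, which would need additional noise terms that this sketch does not model.

```python
import math


def awgn_capacity(bandwidth: float, signal_power: float, noise_density: float) -> float:
    """Shannon-Hartley capacity (bits/s) of an AWGN channel.

    Noise power is modeled as noise_density * bandwidth (textbook assumption).
    """
    snr = signal_power / (noise_density * bandwidth)
    return bandwidth * math.log2(1.0 + snr)


def capacity_ceiling(signal_power: float, noise_density: float) -> float:
    """Limit of awgn_capacity as bandwidth grows without bound: S / (N0 * ln 2)."""
    return signal_power / (noise_density * math.log(2.0))


if __name__ == "__main__":
    S, N0 = 1.0, 1e-3  # arbitrary illustrative values, not taken from the paper
    for B in (1e1, 1e2, 1e3, 1e4, 1e5):
        print(f"B={B:>8.0f}  C={awgn_capacity(B, S, N0):10.2f}")
    print(f"ceiling     C={capacity_ceiling(S, N0):10.2f}")
```

Holding `S` fixed while growing `B` mirrors scaling parameters without scaling tokens under the paper's mapping; the per-unit-bandwidth SNR falls and the returns flatten.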
Where Pith is reading between the lines
- Designers may need to couple model growth to data growth so that effective bandwidth and signal power remain balanced throughout training.
- The same capacity view could be tested on non-text modalities by re-expressing parameter count and data volume in equivalent signal and bandwidth terms.
- Hardware or optimizer choices that reduce intrinsic noise would raise the effective capacity and delay the onset of degradation.
- An explicit SNR schedule could be derived and tested by varying data quality or regularization strength while scaling.
Load-bearing premise
The Shannon-Hartley theorem can be applied directly by mapping model parameters to channel bandwidth and training tokens to signal power.
What would settle it
A controlled scaling experiment that increases both model size and token count while holding the signal-to-noise ratio fixed and shows continued monotonic loss improvement without any U-shaped downturn.
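A minimal sketch of how the outcome of such an experiment might be scored, assuming a list of final losses ordered by increasing scale. The function name `loss_rises_after_minimum` and the tolerance `tol` are hypothetical choices, not taken from the paper. The check returns `True` when any loss exceeds the lowest loss seen at a smaller scale by more than `tol`.

```python
from typing import Sequence


def loss_rises_after_minimum(losses: Sequence[float], tol: float = 0.0) -> bool:
    """Return True if any loss exceeds the best earlier loss by more than tol.

    `losses` must be ordered by increasing scale (model size, tokens, or both).
    `tol` absorbs run-to-run noise; its value is an experimenter's choice.
    """
    if len(losses) < 2:
        return False
    running_min = losses[0]
    for loss in losses[1:]:
        if loss > running_min + tol:
            return True  # loss went back up: evidence of a downturn
        running_min = min(running_min, loss)
    return False


if __name__ == "__main__":
    monotonic = [3.0, 2.5, 2.2, 2.1]  # made-up numbers for illustration
    u_shaped = [3.0, 2.5, 2.2, 2.6]   # made-up numbers for illustration
    print(loss_rises_after_minimum(monotonic))
    print(loss_rises_after_minimum(u_shaped))
```

Under the premise above, a fixed-SNR sweep for which this check stays `False` across a wide scale range would be the result that supports the law; a sweep that trips it would count against it.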
Original abstract
Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong $R^2$ scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on $\leq$6.9B Pythia models with $\leq$180B tokens, it predicts the unseen 12B model up to 307B tokens at pooled $R^2{=}0.847$, while monotonic baselines collapse.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Shannon Scaling Law, a framework that models LLM training as information transmission over a noisy channel via the Shannon-Hartley theorem. Model parameters are mapped to channel bandwidth and training tokens to signal power; the resulting capacity is claimed to govern cross-entropy loss and to induce a transition from monotonic improvement to U-shaped degradation when the signal-to-noise ratio is insufficient. The law is fitted to Pythia and OLMo2 models under Gaussian noise, quantization, and fine-tuning perturbations, reported to achieve higher R² than classical and perturbation-aware baselines, and shown to extrapolate from models ≤6.9B (≤180B tokens) to an unseen 12B model (up to 307B tokens) at pooled R²=0.847.
Significance. If the mapping and functional form can be placed on a firmer footing, the approach would supply a unified information-theoretic account of both monotonic scaling and recently observed non-monotonic regimes (catastrophic overtraining, quantization degradation). The extrapolation experiment and the consistent outperformance on perturbed data are concrete strengths that distinguish the work from purely empirical fits.
major comments (3)
- [Theory section, immediately after the statement of the Shannon-Hartley mapping] The identification of model parameters with bandwidth B and training tokens with signal power S is asserted without deriving an effective channel model. No section specifies the transmitted message, the additive or multiplicative noise process during gradient updates, or the receiver that maps capacity to next-token cross-entropy loss; consequently the functional form used for fitting is an ansatz rather than a theorem-derived expression.
- [§4, extrapolation experiment] The law is fitted on Pythia models up to 6.9B and then used to predict the 12B model; without the explicit functional form, the fitting procedure, or any sensitivity analysis to the choice of capacity constants, it is impossible to determine whether the reported R²=0.847 reflects genuine predictive power or an overfit ansatz tuned to the observed U-shape.
- [Experimental results, Tables 2–4 and associated figures] No error bars, no cross-validation of the free capacity constants, and no ablation of the SNR threshold are reported. These omissions make it impossible to assess whether the claimed superiority over monotonic baselines is robust or an artifact of the particular fitting choices.
minor comments (2)
- [Notation] The symbol for the effective noise variance is introduced without a clear link to the earlier channel model; a single consistent definition would improve readability.
- [Figure 3] Axis labels and legend entries are too small for print; the U-shaped curves are difficult to read at the reported scale.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and indicate planned revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Theory section, immediately after the statement of the Shannon-Hartley mapping] The identification of model parameters with bandwidth B and training tokens with signal power S is asserted without deriving an effective channel model. No section specifies the transmitted message, the additive or multiplicative noise process during gradient updates, or the receiver that maps capacity to next-token cross-entropy loss; consequently the functional form used for fitting is an ansatz rather than a theorem-derived expression.
  Authors: We acknowledge that the parameter-to-bandwidth and tokens-to-power mapping is introduced as an effective analogy grounded in the Shannon-Hartley theorem rather than a microscopic derivation from gradient dynamics. The transmitted message is the information content of the training data, noise arises from stochastic optimization and finite-precision effects, and the receiver is instantiated via next-token cross-entropy loss. A complete first-principles channel model remains intractable given current understanding of training dynamics. In revision we will add an explicit subsection stating these modeling assumptions, the rationale for the ansatz, and its relation to capacity, while clarifying that the functional form is empirically motivated.
  Revision: partial
- Referee: [§4, extrapolation experiment] The law is fitted on Pythia models up to 6.9B and then used to predict the 12B model; without the explicit functional form, the fitting procedure, or any sensitivity analysis to the choice of capacity constants, it is impossible to determine whether the reported R²=0.847 reflects genuine predictive power or an overfit ansatz tuned to the observed U-shape.
  Authors: We will insert the explicit closed-form expression of the Shannon Scaling Law, the precise optimization procedure used to fit the capacity constants, and a sensitivity analysis that perturbs those constants over plausible ranges while recomputing the extrapolated R² on the 12B model. These additions will be placed in §4 and the appendix together with reproducibility code.
  Revision: yes
- Referee: [Experimental results, Tables 2–4 and associated figures] No error bars, no cross-validation of the free capacity constants, and no ablation of the SNR threshold are reported. These omissions make it impossible to assess whether the claimed superiority over monotonic baselines is robust or an artifact of the particular fitting choices.
  Authors: We will add error bars derived from multiple fitting initializations where feasible, report k-fold cross-validation results for the capacity constants on the Pythia and OLMo2 suites, and include an ablation that varies the SNR threshold while tracking both R² and the location of the performance minimum. These analyses will be added to the experimental section and supplementary material.
  Revision: yes
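The extrapolation, sensitivity, and cross-validation analyses discussed in these responses all come down to scoring predictions against held-out losses. The sketch below computes a pooled $R^2$, assuming "pooled" means that observations from all held-out checkpoints are concatenated before the statistic is computed; the review does not define the term, so that reading is an assumption. The name `pooled_r2` and the toy numbers are illustrative only.

```python
from typing import Sequence


def pooled_r2(
    groups_true: Sequence[Sequence[float]],
    groups_pred: Sequence[Sequence[float]],
) -> float:
    """Coefficient of determination over all groups concatenated.

    Each group might hold the losses of one held-out checkpoint series
    (an assumption about what "pooled" means, not the paper's definition).
    """
    y_true = [v for group in groups_true for v in group]
    y_pred = [v for group in groups_pred for v in group]
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal total length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0.0:
        raise ValueError("R^2 is undefined when the observed values are constant")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    observed = [[1.0, 2.0], [3.0]]   # toy held-out losses, not real data
    predicted = [[1.1, 1.9], [3.2]]  # toy predictions, not real data
    print(round(pooled_r2(observed, predicted), 3))
```

On held-out data this statistic can be negative, which means the predictions are worse than simply predicting the held-out mean. Repeating the computation across perturbed capacity constants or cross-validation folds would give the spread that the referee asks for.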
Circularity Check
No significant circularity; derivation applies external theorem via modeling choice and tests via extrapolation
The paper grounds its Shannon Scaling Law in the Shannon-Hartley theorem through an explicit mapping (parameters to bandwidth, tokens to signal power) and validates via fitting on Pythia/OLMo2 data up to 6.9B/180B tokens followed by extrapolation to the unseen 12B model (R²=0.847). This extrapolation constitutes an out-of-sample test rather than a fitted-input-called-prediction, as the target data is not used in parameter estimation. No equations reduce the claimed capacity-to-loss relation to a self-definition, no self-citations are load-bearing for the central premise, and the functional form is presented as following from the theorem rather than chosen to match the reported R². The derivation chain remains self-contained against the external theorem and the held-out larger model.
Axiom & Free-Parameter Ledger
free parameters (1)
- model-specific capacity constants
axioms (1)
- domain assumption: the Shannon-Hartley theorem governs the information rate achievable during LLM training
invented entities (1)
- Shannon capacity for LLMs (no independent evidence)