
arxiv: 2605.23901 · v1 · pith:IYW62XVBnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI · cs.IT · math.IT

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

Pith reviewed 2026-05-25 04:28 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.IT · math.IT
keywords scaling laws · shannon capacity · noisy channels · large language models · model capacity · u-shaped degradation · information theory · signal-to-noise ratio

The pith

LLM training follows a Shannon capacity limit where insufficient signal-to-noise ratio turns scaling from improvement to U-shaped degradation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models LLM training as information transmission over a noisy channel using the Shannon-Hartley theorem. Model parameters map to channel bandwidth and training tokens map to signal power, so the resulting capacity governs observed loss. This framing shows why adding more parameters or tokens without enough signal relative to noise produces a performance peak followed by decline. Experiments on Pythia and OLMo2 models under noise, quantization, and fine-tuning confirm the law fits data better than monotonic alternatives and extrapolates to larger scales. The core result is that classical power-law scaling holds only when the signal-to-noise ratio stays above a critical threshold.

Core claim

The paper establishes a Shannon Scaling Law that treats LLM training as noisy-channel transmission. By mapping parameters to bandwidth and tokens to signal power, it derives an explicit capacity bound. Scaling either quantity without preserving sufficient signal-to-noise ratio amplifies noise and produces a transition from monotonic gains to U-shaped degradation. The law is validated by superior fits on perturbed Pythia and OLMo2 runs and by accurate extrapolation from models up to 6.9B parameters and 180B tokens to a 12B model at 307B tokens.

What carries the argument

The Shannon Scaling Law obtained by applying the Shannon-Hartley theorem after equating model parameters with channel bandwidth and training tokens with signal power.
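
As a reference point, the classical Shannon-Hartley theorem for a band-limited channel with additive white Gaussian noise reads:

$$C = B \log_2\!\left(1 + \frac{S}{N}\right)$$

Here $C$ is capacity in bits per second, $B$ is bandwidth in hertz, $S$ is average received signal power, and $N$ is average noise power. Read schematically under the paper's mapping, $B$ would become some function of parameter count and $S$ some function of training tokens. The fitted functional form and its constants are not reproduced on this page, so the display above is the textbook theorem and should not be taken as the paper's Shannon Scaling Law.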

If this is right

  • Performance curves become U-shaped rather than monotonic power laws once signal-to-noise ratio falls below the capacity threshold.
  • Quantization and added noise produce degradation exactly where the derived capacity predicts a loss basin.
  • The law extrapolates loss on unseen larger models and longer training runs where prior monotonic laws diverge.
  • Supervised fine-tuning on math, QA, and code tasks exhibits the same SNR-dependent non-monotonicity captured by the formulation.
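
A toy numerical sketch may help show how a capacity-style loss can turn U-shaped. Everything here is hypothetical: the names `toy_capacity` and `toy_loss`, the noise exponent `gamma`, and all constants are illustrative assumptions, not the paper's fitted law. The key assumption is that noise power grows faster than linearly with bandwidth (`gamma > 1`). With `gamma = 1`, the classical white-noise case, capacity rises monotonically with bandwidth and saturates, so no U-shape appears.

```python
import numpy as np


def toy_capacity(n_params, n_tokens, sigma2=1.0, gamma=2.0):
    """Shannon-Hartley-style capacity B * log2(1 + S / N).

    Assumed mapping (illustrative only): B = n_params, S = n_tokens,
    and noise power N = sigma2 * n_params**gamma.
    """
    snr = n_tokens / (sigma2 * n_params ** gamma)
    return n_params * np.log2(1.0 + snr)


def toy_loss(n_params, n_tokens, irreducible=1.5, scale=10.0, **noise_kwargs):
    """Hypothetical loss: an irreducible floor plus a term inversely
    proportional to the toy capacity."""
    return irreducible + scale / toy_capacity(n_params, n_tokens, **noise_kwargs)


# Sweep the "bandwidth" axis at a fixed token budget.
n_grid = np.logspace(0, 3, 61)
losses = toy_loss(n_grid, n_tokens=1e3)
best = int(np.argmin(losses))

print(f"loss at smallest size: {losses[0]:.3f}")
print(f"minimum loss {losses[best]:.3f} at size {n_grid[best]:.1f}")
print(f"loss at largest size:  {losses[-1]:.3f}")
```

In this sketch the minimum falls strictly inside the sweep, so loss first improves and then degrades as size grows at a fixed token budget. Raising `n_tokens` moves the minimum toward larger sizes, which mirrors the qualitative claim that more signal is needed to support more bandwidth.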

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers may need to couple model growth to data growth so that effective bandwidth and signal power remain balanced throughout training.
  • The same capacity view could be tested on non-text modalities by re-expressing parameter count and data volume in equivalent signal and bandwidth terms.
  • Hardware or optimizer choices that reduce intrinsic noise would raise the effective capacity and delay the onset of degradation.
  • An explicit SNR schedule could be derived and tested by varying data quality or regularization strength while scaling.

Load-bearing premise

The Shannon-Hartley theorem can be applied directly by mapping model parameters to channel bandwidth and training tokens to signal power.

What would settle it

A controlled scaling experiment that increases both model size and token count while holding the signal-to-noise ratio fixed and shows continued monotonic loss improvement without any U-shaped downturn.

original abstract

Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong $R^2$ scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on $\leq$6.9B Pythia models with $\leq$180B tokens, it predicts the unseen 12B model up to 307B tokens at pooled $R^2{=}0.847$, while monotonic baselines collapse.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes the Shannon Scaling Law, a framework that models LLM training as information transmission over a noisy channel via the Shannon-Hartley theorem. Model parameters are mapped to channel bandwidth and training tokens to signal power; the resulting capacity is claimed to govern cross-entropy loss and to induce a transition from monotonic improvement to U-shaped degradation when the signal-to-noise ratio is insufficient. The law is fitted to Pythia and OLMo2 models under Gaussian noise, quantization, and fine-tuning perturbations, reported to achieve higher R² than classical and perturbation-aware baselines, and shown to extrapolate from models ≤6.9B (≤180B tokens) to an unseen 12B model (up to 307B tokens) at pooled R²=0.847.

Significance. If the mapping and functional form can be placed on a firmer footing, the approach would supply a unified information-theoretic account of both monotonic scaling and recently observed non-monotonic regimes (catastrophic overtraining, quantization degradation). The extrapolation experiment and the consistent outperformance on perturbed data are concrete strengths that distinguish the work from purely empirical fits.

major comments (3)
  1. [Theory section, immediately after the statement of the Shannon-Hartley mapping] The identification of model parameters with bandwidth B and training tokens with signal power S is asserted without deriving an effective channel model. No section specifies the transmitted message, the additive or multiplicative noise process during gradient updates, or the receiver that maps capacity to next-token cross-entropy loss; consequently the functional form used for fitting is an ansatz rather than a theorem-derived expression.
  2. [§4, extrapolation experiment] The law is fitted on Pythia models up to 6.9B and then used to predict the 12B model; without the explicit functional form, the fitting procedure, or any sensitivity analysis to the choice of capacity constants, it is impossible to determine whether the reported R²=0.847 reflects genuine predictive power or an overfit ansatz tuned to the observed U-shape.
  3. [Experimental results, Tables 2–4 and associated figures] No error bars, no cross-validation of the free capacity constants, and no ablation of the SNR threshold are reported. These omissions make it impossible to assess whether the claimed superiority over monotonic baselines is robust or an artifact of the particular fitting choices.
minor comments (2)
  1. [Notation] The symbol for the effective noise variance is introduced without a clear link to the earlier channel model; a single consistent definition would improve readability.
  2. [Figure 3] Axis labels and legend entries are too small for print; the U-shaped curves are difficult to read at the reported scale.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate planned revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: [Theory section, immediately after the statement of the Shannon-Hartley mapping] The identification of model parameters with bandwidth B and training tokens with signal power S is asserted without deriving an effective channel model. No section specifies the transmitted message, the additive or multiplicative noise process during gradient updates, or the receiver that maps capacity to next-token cross-entropy loss; consequently the functional form used for fitting is an ansatz rather than a theorem-derived expression.

    Authors: We acknowledge that the parameter-to-bandwidth and tokens-to-power mapping is introduced as an effective analogy grounded in the Shannon-Hartley theorem rather than a microscopic derivation from gradient dynamics. The transmitted message is the information content of the training data, noise arises from stochastic optimization and finite-precision effects, and the receiver is instantiated via next-token cross-entropy loss. A complete first-principles channel model remains intractable given current understanding of training dynamics. In revision we will add an explicit subsection stating these modeling assumptions, the rationale for the ansatz, and its relation to capacity, while clarifying that the functional form is empirically motivated. revision: partial

  2. Referee: [§4, extrapolation experiment] The law is fitted on Pythia models up to 6.9B and then used to predict the 12B model; without the explicit functional form, the fitting procedure, or any sensitivity analysis to the choice of capacity constants, it is impossible to determine whether the reported R²=0.847 reflects genuine predictive power or an overfit ansatz tuned to the observed U-shape.

    Authors: We will insert the explicit closed-form expression of the Shannon Scaling Law, the precise optimization procedure used to fit the capacity constants, and a sensitivity analysis that perturbs those constants over plausible ranges while recomputing the extrapolated R² on the 12B model. These additions will be placed in §4 and the appendix together with reproducibility code. revision: yes

  3. Referee: [Experimental results, Tables 2–4 and associated figures] No error bars, no cross-validation of the free capacity constants, and no ablation of the SNR threshold are reported. These omissions make it impossible to assess whether the claimed superiority over monotonic baselines is robust or an artifact of the particular fitting choices.

    Authors: We will add error bars derived from multiple fitting initializations where feasible, report k-fold cross-validation results for the capacity constants on the Pythia and OLMo2 suites, and include an ablation that varies the SNR threshold while tracking both R² and the location of the performance minimum. These analyses will be added to the experimental section and supplementary material. revision: yes
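
The robustness checks promised in these responses (held-out R², k-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' procedure. It assumes a plain log-log power-law baseline, and the helper names `r_squared`, `fit_power_law`, `predict_power_law`, and `kfold_r2` are hypothetical. "Pooled" R² is taken here to mean R² computed over all held-out points concatenated, which is an assumption about the paper's usage.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_power_law(tokens, loss):
    """Least-squares fit of log(loss) = a + b * log(tokens).
    A monotonic baseline with no irreducible-loss term."""
    b, a = np.polyfit(np.log(tokens), np.log(loss), deg=1)
    return a, b


def predict_power_law(params, tokens):
    a, b = params
    return np.exp(a + b * np.log(tokens))


def kfold_r2(tokens, loss, k=5, seed=0):
    """Held-out R² per fold; each fold is predicted by a fit on the rest."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(tokens))
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        params = fit_power_law(tokens[train], loss[train])
        scores.append(r_squared(loss[fold], predict_power_law(params, tokens[fold])))
    return np.array(scores)


# Synthetic stand-in for a loss curve (values are made up for illustration).
tokens = np.logspace(9, 11, 40)
noise = np.random.default_rng(1).normal(0.0, 0.01, size=tokens.size)
loss = 50.0 * tokens ** -0.1 * np.exp(noise)

scores = kfold_r2(tokens, loss)
print("per-fold held-out R²:", np.round(scores, 3))
print("mean ± std:", scores.mean().round(3), scores.std().round(3))
```

The spread of the per-fold scores gives one crude form of error bar on the fit quality. The same loop could wrap any candidate law, including a non-monotonic one, as long as it exposes a fit and a predict step.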

Circularity Check

0 steps flagged

No significant circularity; derivation applies external theorem via modeling choice and tests via extrapolation

full rationale

The paper grounds its Shannon Scaling Law in the Shannon-Hartley theorem through an explicit mapping (parameters to bandwidth, tokens to signal power) and validates via fitting on Pythia/OLMo2 data up to 6.9B/180B tokens followed by extrapolation to the unseen 12B model (R²=0.847). This extrapolation constitutes an out-of-sample test rather than a fitted-input-called-prediction, as the target data is not used in parameter estimation. No equations reduce the claimed capacity-to-loss relation to a self-definition, no self-citations are load-bearing for the central premise, and the functional form is presented as following from the theorem rather than chosen to match the reported R². The derivation chain remains self-contained against the external theorem and the held-out larger model.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on an unproven domain mapping from training dynamics to channel capacity plus several fitted constants whose values are not derived from first principles.

free parameters (1)
  • model-specific capacity constants
    Constants that convert parameter count and token count into effective bandwidth and power must be fitted to observed loss curves.
axioms (1)
  • domain assumption: the Shannon-Hartley theorem governs the information rate achievable during LLM training
    The entire derivation begins from this theorem applied to the training process.
invented entities (1)
  • Shannon capacity for LLMs (no independent evidence)
    purpose: Fundamental performance limit arising from the noisy-channel model
    Introduced as the quantity that cannot be exceeded without sufficient SNR; no independent falsifiable prediction is supplied.

pith-pipeline@v0.9.0 · 5803 in / 1362 out tokens · 43835 ms · 2026-05-25T04:28:30.735128+00:00 · methodology

