LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

Chen Zheng; Deyi Liu; Jing Liu; Thomas Hartvigsen; Xu Ouyang; Yiyuan Ma; Yuan Yang; Yuhang Cai

arXiv: 2605.23901 · v1 · pith:IYW62XVBnew · submitted 2026-05-22 · cs.LG · cs.AI · cs.IT · math.IT

Xu Ouyang, Deyi Liu, Yuhang Cai, Jing Liu, Yuan Yang, Chen Zheng, Thomas Hartvigsen, Yiyuan Ma

Pith reviewed 2026-05-25 04:28 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.IT · math.IT

keywords scaling laws, Shannon capacity, noisy channels, large language models, model capacity, U-shaped degradation, information theory, signal-to-noise ratio

The pith

LLM training follows a Shannon capacity limit where insufficient signal-to-noise ratio turns scaling from improvement to U-shaped degradation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models LLM training as information transmission over a noisy channel using the Shannon-Hartley theorem. Model parameters map to channel bandwidth and training tokens map to signal power, so the resulting capacity governs observed loss. This framing shows why adding more parameters or tokens without enough signal relative to noise produces a performance peak followed by decline. Experiments on Pythia and OLMo2 models under noise, quantization, and fine-tuning confirm the law fits data better than monotonic alternatives and extrapolates to larger scales. The core result is that classical power-law scaling holds only when the signal-to-noise ratio stays above a critical threshold.

Core claim

The paper establishes a Shannon Scaling Law that treats LLM training as noisy-channel transmission. By mapping parameters to bandwidth and tokens to signal power, it derives an explicit capacity bound. Scaling either quantity without preserving sufficient signal-to-noise ratio amplifies noise and produces a transition from monotonic gains to U-shaped degradation. The law is validated by superior fits on perturbed Pythia and OLMo2 runs and by accurate extrapolation from models up to 6.9B parameters and 180B tokens to a 12B model at 307B tokens.

What carries the argument

The Shannon Scaling Law obtained by applying the Shannon-Hartley theorem after equating model parameters with channel bandwidth and training tokens with signal power.
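
For reference, the classical Shannon-Hartley theorem that this mapping builds on can be written as below. The correspondence in the comments restates the mapping described in the text. The paper's own loss expression is not reproduced on this page, so this is background on the theorem rather than the authors' formula.

```latex
% Classical Shannon-Hartley capacity of a band-limited channel with additive Gaussian noise
C = B \log_2\!\left(1 + \frac{S}{N}\right)
% C: channel capacity (bits per second)
% B: bandwidth (Hz), S: average signal power, N: average noise power, S/N: signal-to-noise ratio
% Mapping described in the text (correspondence only, not the paper's notation):
%   B <-> model parameters,  S <-> training tokens
```

At fixed bandwidth the capacity falls as the ratio S/N falls, which is the classical fact the noisy-channel framing relies on.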

If this is right

Performance curves become U-shaped rather than monotonic power laws once signal-to-noise ratio falls below the capacity threshold.
Quantization and added noise produce degradation exactly where the derived capacity predicts a loss basin.
The law extrapolates loss on unseen larger models and longer training runs where prior monotonic laws diverge.
Supervised fine-tuning on math, QA, and code tasks exhibits the same SNR-dependent non-monotonicity captured by the formulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Designers may need to couple model growth to data growth so that effective bandwidth and signal power remain balanced throughout training.
The same capacity view could be tested on non-text modalities by re-expressing parameter count and data volume in equivalent signal and bandwidth terms.
Hardware or optimizer choices that reduce intrinsic noise would raise the effective capacity and delay the onset of degradation.
An explicit SNR schedule could be derived and tested by varying data quality or regularization strength while scaling.

Load-bearing premise

The Shannon-Hartley theorem can be applied directly by mapping model parameters to channel bandwidth and training tokens to signal power.

What would settle it

A controlled scaling experiment that increases both model size and token count while holding the signal-to-noise ratio fixed and shows continued monotonic loss improvement without any U-shaped downturn.
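
One way to make this test operational is to check a measured loss curve for a downturn. The following is a minimal sketch, assuming losses are recorded in order of increasing scale. The function name `has_u_shaped_downturn` and the `tolerance` parameter (a stand-in for run-to-run noise) are hypothetical and do not come from the paper.

```python
from typing import Sequence


def has_u_shaped_downturn(losses: Sequence[float], tolerance: float = 0.0) -> bool:
    """Return True if the loss first falls to a minimum and later rises again.

    `losses` is ordered by increasing scale (parameters or tokens).
    `tolerance` is a hypothetical noise margin: changes smaller than this
    are not counted as a fall or a rise.
    """
    if len(losses) < 3:
        return False  # a U needs a left arm, a bottom, and a right arm
    # Index of the lowest observed loss (first occurrence on ties).
    i_min = min(range(len(losses)), key=losses.__getitem__)
    lowest = losses[i_min]
    fell_before = any(x > lowest + tolerance for x in losses[:i_min])
    rose_after = any(x > lowest + tolerance for x in losses[i_min + 1:])
    return fell_before and rose_after


# Made-up curves for illustration only (not data from the paper).
print(has_u_shaped_downturn([3.0, 2.5, 2.2, 2.4]))  # falls, then rises
print(has_u_shaped_downturn([3.0, 2.5, 2.2, 2.0]))  # keeps falling
```

Under the experiment described above, a curve measured at fixed signal-to-noise ratio that never triggers this check would count as evidence against the U-shaped prediction, subject to the chosen tolerance.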

Original abstract

Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong $R^2$ scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on $\leq$6.9B Pythia models with $\leq$180B tokens, it predicts the unseen 12B model up to 307B tokens at pooled $R^2{=}0.847$, while monotonic baselines collapse.
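
The abstract reports fit quality as $R^2$. As a reminder of what that score measures, here is a minimal sketch of the standard coefficient of determination. The loss values are made up for illustration and are not taken from the paper, and reading "pooled" as scoring all held-out points together is an assumption.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: how far predictions are from observations.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: how far observations are from their own mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical losses, for illustration only (not data from the paper).
observed = [3.0, 2.5, 2.2, 2.4]
exact_fit = [3.0, 2.5, 2.2, 2.4]
flat_fit = [2.0, 2.0, 2.0, 2.0]  # worse than predicting the mean loss

print(r_squared(observed, exact_fit))
print(r_squared(observed, flat_fit))
```

The score is 1 for a perfect fit, 0 for a model that only predicts the mean of the observations, and negative for anything worse than that, so a baseline that extrapolates badly can score below zero.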

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The Shannon mapping gives a fresh way to think about U-shaped scaling but never derives how channel capacity produces the actual loss curves.

The paper's main move is to treat LLM training as transmission over a noisy channel, with parameters as bandwidth and tokens as signal power, so that dropping SNR eventually caps capacity and flips the loss curve into a U. That framing is new relative to the usual power-law scaling work and directly targets the overtraining and quantization cases that monotonic laws miss. On the experiments, they fit Pythia and OLMo2 runs under noise, quantization, and fine-tuning, report higher R² than the baselines, and show the law extrapolates from models up to 6.9B to an unseen 12B model at pooled R² 0.847 while the power laws collapse. That is concrete and better than just claiming another curve fit. The soft spot is the missing link: the abstract and stress-test note give no derivation that turns the Shannon-Hartley formula into a loss function or shows why parameters map to bandwidth and tokens to power in the first place. The functional form therefore looks like an ansatz chosen to produce the U rather than a consequence of modeling the actual gradient or data process. Without the explicit equation or error-bar details it is hard to tell how much the SNR term is doing versus added flexibility. The extrapolation is only one size step within the same model family, so it does not stress the idea very hard. This is for readers who track scaling-law papers and want alternatives that can handle observed non-monotonic behavior. The empirical side is sharp enough to merit referee time even if the theory needs more work; I would send it to review.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes the Shannon Scaling Law, a framework that models LLM training as information transmission over a noisy channel via the Shannon-Hartley theorem. Model parameters are mapped to channel bandwidth and training tokens to signal power; the resulting capacity is claimed to govern cross-entropy loss and to induce a transition from monotonic improvement to U-shaped degradation when the signal-to-noise ratio is insufficient. The law is fitted to Pythia and OLMo2 models under Gaussian noise, quantization, and fine-tuning perturbations, reported to achieve higher R² than classical and perturbation-aware baselines, and shown to extrapolate from models ≤6.9B (≤180B tokens) to an unseen 12B model (up to 307B tokens) at pooled R²=0.847.

Significance. If the mapping and functional form can be placed on a firmer footing, the approach would supply a unified information-theoretic account of both monotonic scaling and recently observed non-monotonic regimes (catastrophic overtraining, quantization degradation). The extrapolation experiment and the consistent outperformance on perturbed data are concrete strengths that distinguish the work from purely empirical fits.

major comments (3)

[Theory section, immediately after the statement of the Shannon-Hartley mapping] The identification of model parameters with bandwidth B and training tokens with signal power S is asserted without deriving an effective channel model. No section specifies the transmitted message, the additive or multiplicative noise process during gradient updates, or the receiver that maps capacity to next-token cross-entropy loss; consequently the functional form used for fitting is an ansatz rather than a theorem-derived expression.
[§4, extrapolation experiment] The law is fitted on Pythia models up to 6.9B and then used to predict the 12B model; without the explicit functional form, the fitting procedure, or any sensitivity analysis to the choice of capacity constants, it is impossible to determine whether the reported R²=0.847 reflects genuine predictive power or an overfit ansatz tuned to the observed U-shape.
[Experimental results, Tables 2–4 and associated figures] No error bars, no cross-validation of the free capacity constants, and no ablation of the SNR threshold are reported. These omissions make it impossible to assess whether the claimed superiority over monotonic baselines is robust or an artifact of the particular fitting choices.

minor comments (2)

[Notation] The symbol for the effective noise variance is introduced without a clear link to the earlier channel model; a single consistent definition would improve readability.
[Figure 3] Axis labels and legend entries are too small for print; the U-shaped curves are difficult to read at the reported scale.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Referee: [Theory section, immediately after the statement of the Shannon-Hartley mapping] The identification of model parameters with bandwidth B and training tokens with signal power S is asserted without deriving an effective channel model. No section specifies the transmitted message, the additive or multiplicative noise process during gradient updates, or the receiver that maps capacity to next-token cross-entropy loss; consequently the functional form used for fitting is an ansatz rather than a theorem-derived expression.

Authors: We acknowledge that the parameter-to-bandwidth and tokens-to-power mapping is introduced as an effective analogy grounded in the Shannon-Hartley theorem rather than a microscopic derivation from gradient dynamics. The transmitted message is the information content of the training data, noise arises from stochastic optimization and finite-precision effects, and the receiver is instantiated via next-token cross-entropy loss. A complete first-principles channel model remains intractable given current understanding of training dynamics. In revision we will add an explicit subsection stating these modeling assumptions, the rationale for the ansatz, and its relation to capacity, while clarifying that the functional form is empirically motivated.

Revision: partial

Referee: [§4, extrapolation experiment] The law is fitted on Pythia models up to 6.9B and then used to predict the 12B model; without the explicit functional form, the fitting procedure, or any sensitivity analysis to the choice of capacity constants, it is impossible to determine whether the reported R²=0.847 reflects genuine predictive power or an overfit ansatz tuned to the observed U-shape.

Authors: We will insert the explicit closed-form expression of the Shannon Scaling Law, the precise optimization procedure used to fit the capacity constants, and a sensitivity analysis that perturbs those constants over plausible ranges while recomputing the extrapolated R² on the 12B model. These additions will be placed in §4 and the appendix together with reproducibility code.

Revision: yes

Referee: [Experimental results, Tables 2–4 and associated figures] No error bars, no cross-validation of the free capacity constants, and no ablation of the SNR threshold are reported. These omissions make it impossible to assess whether the claimed superiority over monotonic baselines is robust or an artifact of the particular fitting choices.

Authors: We will add error bars derived from multiple fitting initializations where feasible, report k-fold cross-validation results for the capacity constants on the Pythia and OLMo2 suites, and include an ablation that varies the SNR threshold while tracking both R² and the location of the performance minimum. These analyses will be added to the experimental section and supplementary material.

Revision: yes
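
The k-fold cross-validation mentioned in the last response can be sketched generically as follows. This is an illustration under stated assumptions, not the authors' procedure: `fit_line` is a hypothetical stand-in model (an ordinary least-squares line) used only because the closed form of the Shannon Scaling Law is not given on this page, folds are interleaved by index, and held-out error is scored as mean squared error.

```python
from typing import Callable, List, Sequence

Model = Callable[[float], float]


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Stand-in model: least-squares line. Swap in the scaling law under test."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("training inputs are all identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def k_fold_mse(
    xs: Sequence[float],
    ys: Sequence[float],
    k: int,
    fit: Callable[[Sequence[float], Sequence[float]], Model],
) -> List[float]:
    """Held-out mean squared error for each of k interleaved folds."""
    if len(xs) != len(ys) or not 2 <= k <= len(xs):
        raise ValueError("need equal-length data and 2 <= k <= number of points")
    scores = []
    for fold in range(k):
        pairs = list(zip(xs, ys))
        # Point i is held out in fold (i mod k) and used for training otherwise.
        train = [p for i, p in enumerate(pairs) if i % k != fold]
        held_out = [p for i, p in enumerate(pairs) if i % k == fold]
        model = fit([x for x, _ in train], [y for _, y in train])
        scores.append(sum((model(x) - y) ** 2 for x, y in held_out) / len(held_out))
    return scores


# Synthetic, exactly linear data for illustration only (not from the paper).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]
print(k_fold_mse(xs, ys, k=3, fit=fit_line))
```

The spread of the per-fold scores gives a rough robustness signal for the fitted constants. A fit whose held-out error varies widely across folds is more likely to reflect the particular fitting choices than the underlying law.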

Circularity Check

0 steps flagged

No significant circularity; derivation applies external theorem via modeling choice and tests via extrapolation

The paper grounds its Shannon Scaling Law in the Shannon-Hartley theorem through an explicit mapping (parameters to bandwidth, tokens to signal power) and validates via fitting on Pythia/OLMo2 data up to 6.9B/180B tokens followed by extrapolation to the unseen 12B model (R²=0.847). This extrapolation constitutes an out-of-sample test rather than a fitted-input-called-prediction, as the target data is not used in parameter estimation. No equations reduce the claimed capacity-to-loss relation to a self-definition, no self-citations are load-bearing for the central premise, and the functional form is presented as following from the theorem rather than chosen to match the reported R². The derivation chain remains self-contained against the external theorem and the held-out larger model.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on an unproven domain mapping from training dynamics to channel capacity plus several fitted constants whose values are not derived from first principles.

free parameters (1)

model-specific capacity constants
Constants that convert parameter count and token count into effective bandwidth and power must be fitted to observed loss curves.

axioms (1)

domain assumption: Shannon-Hartley theorem governs the information rate achievable during LLM training
The entire derivation begins from this theorem applied to the training process.

invented entities (1)

Shannon capacity for LLMs (no independent evidence)
purpose: Fundamental performance limit arising from the noisy-channel model
Introduced as the quantity that cannot be exceeded without sufficient SNR; no independent falsifiable prediction is supplied.

pith-pipeline@v0.9.0 · 5803 in / 1362 out tokens · 43835 ms · 2026-05-25T04:28:30.735128+00:00 · methodology
