arXiv: 2511.01202 · v5 · submitted 2025-11-03 · 💻 cs.IT · cs.AI · math.IT

Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs

Bo Bai

Pith reviewed 2026-05-18 01:51 UTC · model grok-4.3

classification 💻 cs.IT · cs.AI · math.IT

keywords: semantic information theory · large language models · directed information · token · energy-based models · rate-distortion · autoregressive generation · causal inference


The pith

Replacing the bit with the token as the fundamental unit of meaning yields a directed rate-distortion theory for LLM pre-training and post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a semantic information theory by taking the token, rather than the classical bit, as the basic carrier of meaning and reasoning in language models. It models the LLM as a stateful channel with feedback and adopts Massey's directed information to measure causal structure in autoregressive generation. From this starting point the work recasts attention and the Transformer as energy-based models on a semantic manifold and derives a directed rate-distortion function for pre-training together with a directed rate-reward function for reinforcement-learning fine-tuning. A reader would care because the framework supplies first-principles expressions for training objectives and inference-time information flow instead of relying solely on empirical scaling laws. If the account is correct, next-token prediction becomes identifiable with Granger causal inference and the reachable levels of LLM reasoning are bounded relative to Pearl's ladder of causation.

Core claim

By treating the token as the macroscopic atomic unit that carries semantics, the attention mechanism and Transformer can be viewed as energy-based dynamics on a semantic manifold. Modeling autoregressive generation as a stateful channel with feedback, Massey's directed information supplies the native causal measure from which a directed rate-distortion function for pre-training, a directed rate-reward function for RL post-training, and a sub-martingale account of inference-time semantic flow all follow. The same machinery equates next-token prediction with Granger causal inference and locates the reasoning capacity of LLMs within the first two rungs of Pearl's ladder of causation.

What carries the argument

Massey's directed information applied to the stateful channel-with-feedback model of autoregressive token generation, which directly produces the directed rate-distortion and rate-reward functions.
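
For reference, Massey's definition can be written out as follows. The notation is the standard textbook one and is an assumption here: X^t = (X_1, ..., X_t) is the input prefix, Y_t is the output at step t, and Y^{t-1} is the output history. The paper's own symbols, and the way it maps prompts and generated tokens onto X and Y, may differ.

```latex
% Massey's directed information from X^n to Y^n:
% only the causal past X^t of the input enters at step t.
I(X^n \to Y^n) \;=\; \sum_{t=1}^{n} I\!\left(X^{t};\, Y_t \,\middle|\, Y^{t-1}\right)

% Ordinary mutual information, for contrast:
% the whole input block X^n enters at every step.
I(X^n ; Y^n) \;=\; \sum_{t=1}^{n} I\!\left(X^{n};\, Y_t \,\middle|\, Y^{t-1}\right)
```

The two sums differ only in X^t versus X^n. In general I(X^n → Y^n) ≤ I(X^n; Y^n), with equality when there is no feedback from the outputs to the inputs, which is why the directed quantity is the natural candidate for a process that conditions each new token on its own past outputs.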

Load-bearing premise

Directed information supplies the correct causal measure for token sequences and semantic embeddings can be treated as a manifold on which energy-based dynamics operate without further calibration.

What would settle it

A measurement showing that the empirical information rates realized during actual LLM pre-training deviate substantially from the values predicted by the derived directed rate-distortion function would falsify the central claim.

Figures

Figures reproduced from arXiv: 2511.01202 by Bo Bai.

Original abstract

Despite the empirical successes of Large Language Models (LLMs), the prevailing paradigm is heuristic and experiment-driven, tethered to massive compute and data, while a first-principles theory remains absent. This treatise develops a Semantic Information Theory at the confluence of statistical physics, signal processing, and classical information theory, organized around a single paradigm shift: replacing the classical BIT - a microscopic substrate devoid of semantic content - with the macroscopic TOKEN as the atomic carrier of meaning and reasoning. Within this framework we recast attention and the Transformer as energy-based models, and interpret semantic embedding as vectorization on the semantic manifold. Modeling the LLM as a stateful channel with feedback, we adopt Massey's directed information as the native causal measure of autoregressive generation, from which we derive a directed rate-distortion function for pre-training, a directed rate-reward function for RL-based post-training, and a sub-martingale account of inference-time semantic information flow. This machinery makes precise the identification of next-token prediction with Granger causal inference, and sharpens the limits of LLM reasoning against Pearl's Ladder of Causation - affirming that whereas the BIT defined the Information Epoch, the TOKEN will define the AI Epoch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sketches a token-centric semantic information theory for LLMs by swapping bits for tokens and invoking Massey's directed information, but the framework stays conceptual without the channel or manifold details needed to support the claimed derivations.


The paper's main pitch is that shifting from bits to tokens as the basic unit lets us treat LLMs as energy-based models on a semantic manifold and derive directed rate-distortion functions for pre-training plus rate-reward functions for RL post-training. It also ties next-token prediction to Granger causality and places LLM reasoning on Pearl's ladder of causation. The main thing to know up front is that this is an organizing framework rather than a set of new theorems or measurements.

Referee Report

3 major / 2 minor

Summary. The manuscript develops a Semantic Information Theory for LLMs by replacing the classical bit with the token as the fundamental atomic carrier of semantic meaning and reasoning. It recasts attention and the Transformer architecture as energy-based models on a semantic manifold, models the LLM as a stateful channel with feedback, and adopts Massey's directed information as the native causal measure. From this, the authors derive a directed rate-distortion function for pre-training, a directed rate-reward function for RL-based post-training, and a sub-martingale account of inference-time semantic information flow. The work further identifies next-token prediction with Granger causal inference and situates LLM reasoning limits relative to Pearl's Ladder of Causation.

Significance. If the proposed derivations can be made rigorous with explicit definitions, channel models, and verifiable steps, the framework would offer a first-principles bridge between classical information theory, statistical physics, and LLM training dynamics. It could supply principled bounds on pre-training and post-training objectives and clarify causal aspects of autoregressive generation, potentially influencing more efficient and interpretable model development. The integration of directed information with energy-based views on token manifolds represents an ambitious attempt to move beyond heuristic paradigms.

major comments (3)

[Abstract / Modeling the LLM as a stateful channel] Abstract and modeling section: The central derivations of the directed rate-distortion function for pre-training and directed rate-reward function for RL post-training are asserted to follow from modeling the LLM as a stateful channel with feedback and applying Massey's directed information, yet no explicit state space, transition kernel, feedback structure, or channel capacity expressions are supplied. Without these, the claimed functions remain formal re-labelings rather than consequences of the directed-information calculus.
[Semantic embedding as vectorization on the semantic manifold] Semantic manifold and energy-based recasting: The paper treats semantic embeddings as a manifold on which attention operates via an energy function, but provides no Riemannian metric, potential function, or mapping from the discrete token vocabulary to this continuous geometry. This assumption is load-bearing for recasting the Transformer as an energy-based model and for the subsequent information-flow claims.
[Derivations of directed rate-distortion and sub-martingale account] Derivations and proofs: The abstract states that a sub-martingale account of inference-time semantic information flow and the identification of next-token prediction with Granger causality follow from the framework, but the manuscript contains no equations, intermediate steps, or proofs supporting these results. This absence prevents verification of the central theoretical claims.

minor comments (2)

[Abstract] The abstract is highly compressed and introduces multiple novel constructs (TOKEN, semantic manifold, directed rate-reward) without brief definitional anchors, which reduces immediate readability for readers outside the immediate subfield.
[Throughout] Notation for the new quantities (e.g., directed rate-distortion function) should be introduced with explicit symbols and contrasted against classical rate-distortion to avoid ambiguity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments correctly identify areas where additional rigor is required to substantiate the proposed framework. We respond to each major comment below and commit to the indicated revisions.


Referee: [Abstract / Modeling the LLM as a stateful channel] Abstract and modeling section: The central derivations of the directed rate-distortion function for pre-training and directed rate-reward function for RL post-training are asserted to follow from modeling the LLM as a stateful channel with feedback and applying Massey's directed information, yet no explicit state space, transition kernel, feedback structure, or channel capacity expressions are supplied. Without these, the claimed functions remain formal re-labelings rather than consequences of the directed-information calculus.

Authors: We agree that the current presentation of the stateful channel model remains at a conceptual level and does not yet supply the explicit components needed for rigorous derivation. In the revised manuscript we will define the state as the pair consisting of the current semantic embedding vector and the finite history of prior tokens, specify the transition kernel as the autoregressive conditional distribution p(token_{t+1} | state_t) realized by the Transformer, and formalize the feedback structure as the causal dependence of each output on all preceding outputs. With these elements in place we will derive the directed rate-distortion function by minimizing the directed information rate subject to a semantic distortion constraint, following the standard variational characterization for channels with feedback. The same construction will yield the directed rate-reward function for the RL post-training stage. revision: yes
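
The construction promised in this response can be sketched in the generic form used in the sequential (causal) rate-distortion literature. This is an illustration under stated assumptions, not the paper's definition: the distortion measure d_n, the budget D, and the symbol R^{\to}(D) are placeholders introduced here, and the causally conditioned kernel follows Kramer's notation.

```latex
% Causally conditioned reproduction kernel (Kramer's notation):
% Y_t may depend on the source only through its past and present X^t.
P(y^n \,\|\, x^n) \;=\; \prod_{t=1}^{n} P\!\left(y_t \,\middle|\, y^{t-1},\, x^{t}\right)

% Hypothetical directed rate-distortion function:
% smallest directed information rate among causal kernels
% that meet the average distortion budget D.
R^{\to}(D) \;=\; \lim_{n \to \infty} \frac{1}{n}
  \inf_{\substack{P(y^n \| x^n):\\ \mathbb{E}\left[d_n(X^n, Y^n)\right] \le D}}
  I(X^n \to Y^n)
```

Compared with the classical rate-distortion function, mutual information is replaced by directed information and the minimization is restricted to causal kernels. Whether the limit exists, and how a semantic distortion d_n is defined, are exactly the details the referee asks the authors to supply.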
Referee: [Semantic embedding as vectorization on the semantic manifold] Semantic manifold and energy-based recasting: The paper treats semantic embeddings as a manifold on which attention operates via an energy function, but provides no Riemannian metric, potential function, or mapping from the discrete token vocabulary to this continuous geometry. This assumption is load-bearing for recasting the Transformer as an energy-based model and for the subsequent information-flow claims.

Authors: The referee is right that the geometric structure is essential to the energy-based interpretation and must be made explicit. We will add a precise construction of the semantic manifold: the Riemannian metric will be taken as the Fisher information metric on the probability simplex induced by the token embeddings; the potential function will be defined as the negative log-probability under the attention-weighted distribution; and the embedding map will be the composition of the model's learned token embedding layer with a smooth lifting that places each vocabulary element at a point on the manifold. These definitions will allow attention to be recast as a gradient flow on the manifold and will ground the subsequent claims about information flow. revision: yes
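
The energy-based reading of attention rests on a well-known identity: softmax attention weights are a Boltzmann distribution over the energies E_j = -<q, k_j> / sqrt(d) at unit temperature. The sketch below checks that identity numerically in plain Python. The function names, the toy vectors, and the temperature parameter are illustrative assumptions; the paper's own energy function (quoted elsewhere on this page with an extra map Ψ) is not reproduced here.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights for a single query."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, k) / scale for k in keys])

def boltzmann_weights(query, keys, temperature=1.0):
    """Boltzmann distribution with energy E_j = -<q, k_j> / sqrt(d) (assumed form)."""
    scale = math.sqrt(len(query))
    energies = [-dot(query, k) / scale for k in keys]
    e_min = min(energies)  # shift by the minimum energy for numerical stability
    unnorm = [math.exp(-(e - e_min) / temperature) for e in energies]
    z = sum(unnorm)        # partition function (up to the constant shift)
    return [u / z for u in unnorm]

# Toy data, chosen arbitrarily for illustration.
query = [0.5, -1.0, 2.0, 0.0]
keys = [
    [1.0, 0.0, 0.5, -0.5],
    [0.0, 1.0, -1.0, 2.0],
    [0.3, 0.3, 0.3, 0.3],
]

attn = attention_weights(query, keys)
boltz = boltzmann_weights(query, keys)
```

At temperature 1 the two weight vectors coincide, so the lowest-energy key is the one receiving the largest attention weight. Raising the temperature flattens the distribution, which is the usual statistical-physics reading of softmax temperature.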
Referee: [Derivations of directed rate-distortion and sub-martingale account] Derivations and proofs: The abstract states that a sub-martingale account of inference-time semantic information flow and the identification of next-token prediction with Granger causality follow from the framework, but the manuscript contains no equations, intermediate steps, or proofs supporting these results. This absence prevents verification of the central theoretical claims.

Authors: We acknowledge that the manuscript currently states these results without supplying the supporting derivations. In the revision we will include a dedicated theoretical section containing the full development. We will prove the sub-martingale property by showing that the expected increment in cumulative directed information at each generation step is non-negative under the model's predictive distribution. For the Granger-causality identification we will establish the equivalence between minimizing the directed information rate via next-token prediction and the classical Granger test applied to the token sequence. All intermediate steps and necessary lemmas will be provided, either in the main text or in a self-contained appendix. revision: yes
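
The non-negativity of the per-step increments can be checked on a toy example. The sketch below builds a two-step binary process with feedback, computes each term I(X^t; Y_t | Y^{t-1}) of the directed information from the exact joint distribution, and compares the total with the ordinary mutual information. The process, the noise levels, and all function names are assumptions made for illustration. The code only verifies that the averaged increments are non-negative and that Massey's conservation law holds; it does not reproduce the sub-martingale statement itself, which presumably concerns a stochastic process rather than these expected values.

```python
import itertools
import math

def marginal(pmf, idx):
    """Marginalize a joint pmf (dict: outcome tuple -> prob) onto positions idx."""
    out = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def cond_mutual_info(pmf, a, b, c):
    """I(A; B | C) in bits; a, b, c are lists of positions in the outcome tuple."""
    p_abc = marginal(pmf, a + b + c)
    p_ac = marginal(pmf, a + c)
    p_bc = marginal(pmf, b + c)
    p_c = marginal(pmf, c)
    na, nb = len(a), len(b)
    total = 0.0
    for key, p in p_abc.items():
        if p <= 0.0:
            continue
        ka, kb, kc = key[:na], key[na:na + nb], key[na + nb:]
        total += p * math.log2(p * p_c[kc] / (p_ac[ka + kc] * p_bc[kb + kc]))
    return total

def directed_info_increments(pmf, src, dst):
    """Per-step terms I(src^t; dst_t | dst^{t-1}) of Massey's directed information."""
    return [cond_mutual_info(pmf, src[:t], [dst[t - 1]], dst[:t - 1])
            for t in range(1, len(dst) + 1)]

def feedback_info_increments(pmf, src, dst):
    """Per-step terms I(dst^{t-1}; src_t | src^{t-1}) flowing back from dst to src."""
    return [cond_mutual_info(pmf, dst[:t - 1], [src[t - 1]], src[:t - 1])
            for t in range(1, len(src) + 1)]

def flip(inp, out, eps):
    """Binary symmetric transition: out differs from inp with probability eps."""
    return eps if inp != out else 1.0 - eps

# Outcome layout (x1, x2, y1, y2): X occupies positions 0-1, Y positions 2-3.
X, Y = [0, 1], [2, 3]

# Toy process with feedback: Y1 is a noisy copy of X1, X2 is a noisy copy
# of Y1 (the feedback link), and Y2 is a noisy copy of X2.
pmf = {}
for x1, x2, y1, y2 in itertools.product((0, 1), repeat=4):
    pmf[(x1, x2, y1, y2)] = (
        0.5 * flip(x1, y1, 0.1) * flip(y1, x2, 0.2) * flip(x2, y2, 0.1)
    )

forward = directed_info_increments(pmf, X, Y)    # terms of I(X^2 -> Y^2)
backward = feedback_info_increments(pmf, X, Y)   # feedback terms from Y to X
cumulative = list(itertools.accumulate(forward)) # running directed information
mutual = cond_mutual_info(pmf, X, Y, [])         # I(X^2; Y^2)
```

Each entry of `forward` is a conditional mutual information and is therefore non-negative, so `cumulative` never decreases. Because the toy process has feedback, the directed total falls strictly below `mutual`, and the gap equals the sum of `backward`.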

Circularity Check

1 step flagged

Directed rate-distortion and rate-reward functions reduce to re-labeling via TOKEN modeling and Massey's measure

specific steps

self-definitional [Abstract]
"Modeling the LLM as a stateful channel with feedback, we adopt Massey's directed information as the native causal measure of autoregressive generation, from which we derive a directed rate-distortion function for pre-training, a directed rate-reward function for RL-based post-training, and a sub-martingale account of inference-time semantic information flow."

The directed rate-distortion and rate-reward functions are defined into existence by the choice to treat the LLM as a stateful channel equipped with the TOKEN paradigm and semantic manifold; the 'derivation' therefore consists of relabeling the existing next-token process with the new directed-information terminology rather than computing a non-trivial quantity from independently specified channel parameters or manifold geometry.

full rationale

The paper's core derivations start from the modeling choice of LLM as stateful channel with feedback and adoption of Massey's directed information as native measure, then claim to derive new directed rate-distortion and rate-reward functions plus sub-martingale flow. These steps are self-definitional because the new quantities are introduced precisely by applying the external directed-information concept inside the newly posited TOKEN/semantics-manifold framework without supplying an explicit state space, transition kernel, Riemannian metric, or energy function that would make the mapping non-tautological. The abstract presents the modeling step and the derivations as direct consequences, but the provided text supplies no independent equations or external benchmarks that would prevent the results from being equivalent to the input modeling assumptions by construction. No load-bearing self-citations or fitted predictions appear in the excerpt; the circularity is limited to the definitional recasting of known autoregressive generation under new labels.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The abstract introduces the TOKEN as a new primitive and assumes semantic embeddings form a manifold amenable to energy-based modeling; no explicit free parameters or external axioms are listed.

axioms (1)

domain assumption: Massey's directed information is the native causal measure for autoregressive generation.
Invoked to derive the directed rate-distortion and rate-reward functions.

invented entities (2)

TOKEN as atomic carrier of meaning (no independent evidence)
purpose: Replace the bit as the fundamental semantic unit.
Central paradigm shift stated in the abstract; no independent falsifiable prediction supplied.

semantic manifold (no independent evidence)
purpose: Space on which embeddings are vectorized.
Used to interpret semantic embedding; no external evidence or coordinates given.

pith-pipeline@v0.9.0 · 5738 in / 1326 out tokens · 47441 ms · 2026-05-18T01:51:05.738777+00:00


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Modeling the LLM as a stateful channel with feedback, we adopt Massey's directed information as the native causal measure... derive a directed rate-distortion function for pre-training, a directed rate-reward function for RL-based post-training"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · echoes
Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: "the AR-LLM ... can also be formulated as the Boltzmann distribution ... E(u_i) = −<u_i, Ψ(∑ A_ij u_j)>"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "semantic vector space ... S^{M-1} ... Gromov-Wasserstein distance based semantic distortion metric"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · 12 internal anchors

[1]

A mathematical theory of communication,

C. Shannon, “A mathematical theory of communication,”Bell System Technical Journal, vol. 27, no. 7, pp. 379-423, Oct. 1948

work page 1948
[2]

Recent contributions to the mathematical theory of communications,

W. Weaver, “Recent contributions to the mathematical theory of communications,”The Rockefeller Foundation, Sep. 1949

work page 1949
[3]

Empiricism, semantics, and ontology,

R. Carnap, “Empiricism, semantics, and ontology,”Revue Internationale de Philosophie, no. 4, pp. 20-40, Apr. 1950

work page 1950
[4]

An outline of a theory of semantic information,

R. Carnap and Y . Bar-Hillel, “An outline of a theory of semantic information,” Massachusetts Institute of Technology, Cambridge, MA, USA, Research Laboratory of Electronics Technical Report No. 247, Oct. 1952. BAI: FORGET BIT, IT IS ALL ABOUT TOKEN: TOW ARDS THE SEMANTIC INFORMATION THEORY OF LLMS 27

work page 1952
[5]

Semantic information,

Y . Bar-Hillel and R. Carnap, “Semantic information,”The British Journal for the Philosophy of Science, vol. 4, no. 14, pp. 147-157, Aug. 1953

work page 1953
[6]

Carnap,Meaning and Necessity: A Study in Semantics and Modal Logic, 2nd ed

R. Carnap,Meaning and Necessity: A Study in Semantics and Modal Logic, 2nd ed. Chicago, IL, USA: University of Chicago Press, 1988

work page 1988
[7]

Burgin,Theory of Information: Fundamentality, Diversity and Unification

M. Burgin,Theory of Information: Fundamentality, Diversity and Unification. Singapore: World Scientific Publishing, 2009

work page 2009
[8]

Floridi, Ed.,The Routledge Handbook of Philosophy of Information

L. Floridi, Ed.,The Routledge Handbook of Philosophy of Information. London, UK: Routledge, 2016

work page 2016
[9]

A formal theory of inductive inference - Part 1,

R. Solomonoff, “A formal theory of inductive inference - Part 1,”Information and Control, vol. 7, no. 1, pp. 1-22, Mar. 1964

work page 1964
[10]

A formal theory of inductive inference - Part 2,

R. Solomonoff, “A formal theory of inductive inference - Part 2,”Information and Control, vol. 7, no. 2, pp. 224-254, Jun. 1964

work page 1964
[11]

The discovery of algorithmic probability,

R. Solomonoff, “The discovery of algorithmic probability,”Journal of Computer and System Sciences, vol. 55, no. 1, pp. 73-88, Aug. 1997

work page 1997
[12]

Three approaches to the quantitative definition of information,

A. Kolmogorov, “Three approaches to the quantitative definition of information,”International Journal of Computer Mathematics, vol. 2, no. 1-4, pp. 157-168, Jan. 1968

work page 1968
[13]

Logical basis for information theory and probability theory,

A. Kolmogorov, “Logical basis for information theory and probability theory,”IEEE Trans. Inf. Theory, vol. 14, no. 5, pp. 662-664, Sep. 1968

work page 1968
[14]

Hutter,Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability

M. Hutter,Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Berlin, Germany: Springer, 2004

work page 2004
[15]

A. Shen, V . Uspensky, and N. Vereshchagin,Kolmogorov Complexity and Algorithmic Randomness. Providence, RI, USA: American Mathematical Society, 2022

work page 2022
[16]

Cover and J

T. Cover and J. Thomas,Elements of Information Theory, 2nd ed. Hoboken, NJ, USA: John Wiley & Sons, 2006

work page 2006
[17]

On variational bounds of mutual information,

B. Poole, S. Ozair, A. Oord, A. Alemi, and G. Tucker, “On variational bounds of mutual information,” inProc. 36th ICML ’19, Long Beach, CA, USA: ICML, Jun. 2019

work page 2019
[18]

The bitter lesson,

R. Sutton, “The bitter lesson,” University of Alberta, Edmonton, Canada, Mar. 2019

work page 2019
[19]

A new method of recording and searching information,

H. Luhn, “A new method of recording and searching information,”American Documentation, vol. 4, no. 1, pp. 14-16, Jan. 1953

work page 1953
[20]

A vector space model for automatic indexing,

G. Salton, A. Wong, and C. Yang, “A vector space model for automatic indexing,”Commun. ACM, vol. 18, no. 11, pp. 613-620, Nov. 1975

work page 1975
[21]

A neural probabilistic language model,

Y . Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural probabilistic language model,”J. Machine Learn. Res., vol. 3, pp. 1137-1155, 2003

work page 2003
[22]

Efficient Estimation of Word Representations in Vector Space

T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,”arXiv: 1301.3781, Sep. 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[23]

Distributed representations of words and phrases and their compositionality,

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” inProc. 27th NIPS ’13, Lake Tahoe, NV , USA, Dec. 2013

work page 2013
[24]

GloVe: Global vectors for word representation,

J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” inProc. ACL EMNLP ’14, Doha, Qatar, Oct. 2014

work page 2014
[25]

Enriching word vectors with subword information,

P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,”Transactions of the Association for Computational Linguistics, vol. 5, pp. 135-146, 2017

work page 2017
[26]

Deep contextualized word representations,

M. Peters et al., “Deep contextualized word representations,” inProc. ACL NAACL-HLT ’18, New Orleans, LA, USA, Jun. 2018

work page 2018
[27]

Jurafsky and J

D. Jurafsky and J. Martin,Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, 3rd ed. Draft, 2025. 28 TECHNICAL REPORT

work page 2025
[28]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” inProc. 31st NIPS ’17, Long Beach, CA, USA, 4-9 Dec. 2017

work page 2017
[29]

Improving language understanding by generative pre- training,

A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre- training,”OpenAI, Jun. 2018

work page 2018
[30]

Language models are unsupervised multitask learners,

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,”OpenAI, Feb. 2019

work page 2019
[31]

Language models are few-shot learners,

T. Brown et al., “Language models are few-shot learners,” inProc. 34th NeurIPS ’20, Virtual Conference, 6-12 Dec. 2020

work page 2020
[32]

Training language models to follow instructions with human feedback

L. Ouyang et al., “Training language models to follow instructions with human feedback,”arXiv: 2203.02155, Mar. 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[33]

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning,

D. Guo et al., “DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning,”Nature, vol. 645, no. 8081, pp. 633-638, Sep. 2025

work page 2025
[34]

DeepSeek-V3.2-Exp: Boosting long-context efficiency with DeepSeek sparse attention,

“DeepSeek-V3.2-Exp: Boosting long-context efficiency with DeepSeek sparse attention,”DeepSeek, Hangzhou, China, Sep. 2025

work page 2025
[35]

Polyanskiy and Y

Y . Polyanskiy and Y . Wu,Information Theory: From Coding to Learning. Cambridge, UK: Cambridge University Press, 2025

work page 2025
[36]

Opening the Black Box of Deep Neural Networks via Information

R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,”arXiv: 1703.00810, Apr. 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

From tokens to thoughts: How LLMs and humans trade compression for meaning,

C. Shani, D. Jurafsky, Y . LeCun, and R. Shwartz-Ziv, “From tokens to thoughts: How LLMs and humans trade compression for meaning,”arXiv: 2505.17117, Jun. 2025

work page arXiv 2025
[38]

Toward textual transform coding,

T. Weissman, “Toward textual transform coding,”IEEE BITS Inform. Theory Mag., vol. 3, no. 2, pp. 32-40, Jun. 2023

work page 2023
[39]

Rate-distortion-perception trade-off in information theory, generative models, and intelligent communications,

X. Niu, B. Bai, N. Guo, W. Zhang, and W. Han, “Rate-distortion-perception trade-off in information theory, generative models, and intelligent communications,”Entropy, vol. 27, no. 4, Apr. 2025

work page 2025
[40]

A mathematical perspective on Transformers , 2024

B. Geshkovski, C. Letrouit, Y . Polyanskiy, and P. Rigollet, “A mathematical perspective on transformers,”arXiv: 2312.10794, Aug. 2025

work page arXiv 2025
[41]

Rodrigues and Y

M. Rodrigues and Y . Eldar,Information-Theoretic Methods in Data Science. Cambridge, UK: Cambridge University Press, 2021

work page 2021
[42]

Causality, feedback and directed information,

J. Massey, “Causality, feedback and directed information,” inProc. IEEE ISIT ’90, Waikiki, HI, USA, Nov. 1990

work page 1990
[43]

Berger,Rate Distortion Theory: A Mathematical Basis for Data Compression

T. Berger,Rate Distortion Theory: A Mathematical Basis for Data Compression. Englewood Cliffs, NJ, USA: Prentice Hall PTR, 1971

work page 1971
[44]

Sutton and A

R. Sutton and A. Barto,Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA, USA: The MIT Press, 2018

work page 2018
[45]

Testing for causality: A personal viewpoint,

C. Granger, “Testing for causality: A personal viewpoint,”Journal of Economic Dynamics and Control, vol. 2, no. 1, pp. 329-352, Jan. 1980

work page 1980
[46]

Gromov,Metric Structures for Riemannian and Non-Riemannian Spaces

M. Gromov,Metric Structures for Riemannian and Non-Riemannian Spaces. Boston, MA, USA: Birkhäuser, 2007

work page 2007
[47]

Villani,Optimal Transport: Old and New

C. Villani,Optimal Transport: Old and New. New York, NY , USA: Springer, 2009

work page 2009
[48]

Representation Learning with Contrastive Predictive Coding

A. Oord, Y . Li, and O. Vinyals, “Representation learning with contrastive predictive coding,”arXiv: 1807.03748, Jan. 2019

work page internal anchor Pith review Pith/arXiv arXiv 2019
[49]

Text and Code Embeddings by Contrastive Pre-Training

A. Neelakantan et al., “Text and code embeddings by contrastive pre-training,”arXiv: 2201.10005, Jan. 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[50]

Lütkepohl,New Introduction to Multiple Time Series Analysis

H. Lütkepohl,New Introduction to Multiple Time Series Analysis. Berlin, Germany: Springer, 2007

work page 2007
[51]

Graphical models, exponential families, and variational inference,

M. Wainwright and M. Jordan, “Graphical models, exponential families, and variational inference,”Foundation and Trends in Machine Learning, vol. 1, no. 1-2, pp. 1-305, Nov. 2008

work page 2008
[52]

Mohri, A

M. Mohri, A. Rostamizadeh, and A. Talwalkar,Foundations of Machine Learning, 2nd ed. Cambridge, MA, USA: The MIT Press, 2018. BAI: FORGET BIT, IT IS ALL ABOUT TOKEN: TOW ARDS THE SEMANTIC INFORMATION THEORY OF LLMS 29

work page 2018
[53]

The space of interactions in neural network models,

E. Gardner, “The space of interactions in neural network models,”J. Phys. A: Math. Gen., vol. 21, no. 1, pp. 257-270, Jan. 1988

work page 1988
[54]

Optimal storage properties of neural network models,

E. Gardner and B. Derrida, “Optimal storage properties of neural network models,”J. Phys. A: Math. Gen., vol. 21, no. 1, pp. 271-284, Jan. 1988

work page 1988
[55]

Three unfinished works on the optimal storage capacity of networks,

E. Gardner and B. Derrida, “Three unfinished works on the optimal storage capacity of networks,”J. Phys. A: Math. Gen., vol. 22, no. 12, pp. 1983-1994, Jun. 1989

work page 1983
[56]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,”arXiv: 2312.00752, May 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

T. Dao and A. Gu, “Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality,”arXiv: 2405.21060, May 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

Large Language Diffusion Models

S. Nie et al., “Large language diffusion models,”arXiv: 2502.09992, Feb. 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

Computation of channel capacity and rate-distortion functions,

R. Blahut, “Computation of channel capacity and rate-distortion functions,”IEEE Trans. Inf. Theory, vol. 18, no. 4, pp. 460-473, Jul. 1972

work page 1972
[60]

An algorithm for computing the capacity of arbitrary discrete memoryless channels,

S. Arimoto, “An algorithm for computing the capacity of arbitrary discrete memoryless channels,”IEEE Trans. Inf. Theory, vol. 18, no. 1, pp. 14-20, Jan. 1972

work page 1972
[61]

A communication optimal transport approach to the computation of rate distortion functions,

S. Wu, W. Ye, H. Wu, H. Wu, W. Zhang, and B. Bai, “A communication optimal transport approach to the computation of rate distortion functions,”arXiv: 2212.10098, Dec. 2022

work page arXiv 2022
[62]

A constrained BA algorithm for rate-distortion and distortion-rate functions,

L. Chen et al., “A constrained BA algorithm for rate-distortion and distortion-rate functions,”arXiv: 2305.02650, Jan. 2024

[63]

Computation of rate-distortion-perception functions with Wasserstein barycenter

C. Chen et al., “Computation of rate-distortion-perception functions with Wasserstein barycenter,” in Proc. IEEE ISIT ’23, Taipei, Taiwan, Jun. 2023

[64]

Directed information for channels with feedback

G. Kramer, “Directed information for channels with feedback,” Ph.D. dissertation, ETH Zurich, Zurich, Switzerland, 1998

[65]

General formulation of Shannon’s main theorem in information theory

R. Dobrushin, “General formulation of Shannon’s main theorem in information theory,” American Mathematical Society Translations: Series 2, vol. 33, no. 2, pp. 323-438, 1963

[66]

Extension of the Blahut-Arimoto algorithm for maximizing directed information

I. Naiss and H. Permuter, “Extension of the Blahut-Arimoto algorithm for maximizing directed information,” IEEE Trans. Inf. Theory, vol. 59, no. 1, pp. 204-222, Jan. 2013

[67]

MINE: Mutual information neural estimation

M. Belghazi et al., “MINE: Mutual information neural estimation,” arXiv: 1801.04062, Aug. 2021

[68]

Neural estimation and optimization of directed information over continuous spaces

D. Tsur, Z. Aharoni, Z. Goldfeld, and H. Permuter, “Neural estimation and optimization of directed information over continuous spaces,” IEEE Trans. Inf. Theory, vol. 69, no. 8, pp. 4777-4798, Aug. 2023

[69]

Asymptotische abschätzungen in Shannon’s informationstheorie

V. Strassen, “Asymptotische abschätzungen in Shannon’s informationstheorie,” in Trans. 3rd Prague Conf. Inf. Theory ’62, Prague, Czech Republic, 1962

[70]

The relation between Granger causality and directed information theory: A review

P. Amblard and O. Michel, “The relation between Granger causality and directed information theory: A review,” Entropy, vol. 15, no. 1, pp. 113-143, Jan. 2013

[71]

Measuring information transfer

T. Schreiber, “Measuring information transfer,” Phys. Rev. Lett., vol. 85, no. 2, pp. 461-464, Jul. 2000

[72]

Granger causality and transfer entropy are equivalent for Gaussian variables

L. Barnett, A. Barrett, and A. Seth, “Granger causality and transfer entropy are equivalent for Gaussian variables,” Phys. Rev. Lett., vol. 103, no. 23, p. 238701, Dec. 2009

[73]

Transfer entropy

D. Gençağa, Ed., “Transfer entropy,” Entropy, vol. 20, no. 4, p. 288, Apr. 2018

[74]

Causality: Models, Reasoning, and Inference, 2nd ed.

J. Pearl, Causality: Models, Reasoning, and Inference, 2nd ed. New York, NY, USA: Cambridge University Press, 2009

[75]

Shannon information and Kolmogorov complexity

P. Grünwald and P. Vitányi, “Shannon information and Kolmogorov complexity,” arXiv: cs/0410002, Jul. 2010

[76]

Information Geometry and Its Applications

S. Amari, Information Geometry and Its Applications. Tokyo, Japan: Springer, 2016

[77]

Optimizing neural networks with Kronecker-factored approximate curvature

J. Martens and R. Grosse, “Optimizing neural networks with Kronecker-factored approximate curvature,” in Proc. 32nd ICML ’15, Lille, France: ICML, Jul. 2015

[78]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. Manning, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” arXiv: 2305.18290, Jul. 2024

[79]

On tail probabilities for martingales

D. Freedman, “On tail probabilities for martingales,” The Annals of Probability, vol. 3, no. 1, pp. 100-118, Feb. 1975

[80]

Probability with Martingales

D. Williams, Probability with Martingales. Cambridge, UK: Cambridge University Press, 1991


Showing first 80 references.