pith. machine review for the scientific record.

arXiv: 2511.01202 · v5 · submitted 2025-11-03 · cs.IT · cs.AI · math.IT

Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs

Pith reviewed 2026-05-18 01:51 UTC · model grok-4.3

classification cs.IT · cs.AI · math.IT
keywords semantic information theory · large language models · directed information · token · energy-based models · rate-distortion · autoregressive generation · causal inference

The pith

Replacing the bit with the token as the fundamental unit of meaning yields a directed rate-distortion theory for LLM pre-training and post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a semantic information theory by taking the token, rather than the classical bit, as the basic carrier of meaning and reasoning in language models. It models the LLM as a stateful channel with feedback and adopts Massey's directed information to measure causal structure in autoregressive generation. From this starting point the work recasts attention and the Transformer as energy-based models on a semantic manifold and derives a directed rate-distortion function for pre-training together with a directed rate-reward function for reinforcement-learning fine-tuning. A reader would care because the framework supplies first-principles expressions for training objectives and inference-time information flow instead of relying solely on empirical scaling laws. If the account is correct, next-token prediction becomes identifiable with Granger causal inference and the reachable levels of LLM reasoning are bounded relative to Pearl's ladder of causation.
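For readers unfamiliar with the Granger notion invoked above, the following is a minimal sketch of the classical linear Granger comparison on two synthetic real-valued series. It is not the paper's construction: the series, the coefficients (0.5, 0.8), the lag of 1, and the function names `lagged`, `residual_variance`, and `granger_strength` are all illustrative assumptions, and token sequences are discrete rather than Gaussian. The sketch compares how well a target is predicted from its own past alone versus its own past plus the source's past.

```python
import numpy as np

def lagged(series, lag):
    """Columns series[t-1], ..., series[t-lag] for t = lag, ..., len(series)-1."""
    T = len(series)
    return np.column_stack([series[lag - k:T - k] for k in range(1, lag + 1)])

def residual_variance(target, regressors):
    """Least-squares fit with intercept; return the variance of the residuals."""
    design = np.column_stack([np.ones(len(target)), regressors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.var(target - design @ coef)

def granger_strength(source, target, lag=1):
    """Half the log ratio of restricted to full residual variance (in nats)."""
    y = target[lag:]
    own_past = lagged(target, lag)
    both_past = np.column_stack([own_past, lagged(source, lag)])
    restricted = residual_variance(y, own_past)   # target's own past only
    full = residual_variance(y, both_past)        # plus the source's past
    return 0.5 * np.log(restricted / full)

# Hypothetical data: x drives y with a one-step delay, y does not drive x.
rng = np.random.default_rng(0)
T = 2000
x = np.zeros(T)
y = np.zeros(T)
noise_x = rng.normal(size=T)
noise_y = rng.normal(size=T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + noise_x[t]
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + noise_y[t]

x_to_y = granger_strength(x, y)
y_to_x = granger_strength(y, x)
print(f"x -> y: {x_to_y:.4f} nats, y -> x: {y_to_x:.4f} nats")
```

For jointly Gaussian processes, half the log variance ratio coincides with the transfer entropy (a result due to Barnett, Barrett, and Seth), which is one reason Granger causality and directed-information measures are often discussed together. Whether and how this carries over to next-token prediction is exactly what the paper claims and the review questions.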

Core claim

By treating the token as the macroscopic atomic unit that carries semantics, the attention mechanism and Transformer can be viewed as energy-based dynamics on a semantic manifold. Modeling autoregressive generation as a stateful channel with feedback, Massey's directed information supplies the native causal measure from which a directed rate-distortion function for pre-training, a directed rate-reward function for RL post-training, and a sub-martingale account of inference-time semantic flow all follow. The same machinery equates next-token prediction with Granger causal inference and locates the reasoning capacity of LLMs within the first two rungs of Pearl's ladder of causation.
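The energy-based reading of attention mentioned in the core claim can be illustrated with a small sketch. It assumes the common interpretation in which the energy of key j under a query q is the negative scaled dot product, so that softmax attention weights form a Boltzmann distribution; the paper's own energy function is not reproduced on this page, and the names `attention_energies`, `boltzmann_weights`, and `free_energy` are hypothetical.

```python
import numpy as np

def attention_energies(query, keys):
    """Assumed energy of each key: E_j = -(q . k_j) / sqrt(d)."""
    d = query.shape[-1]
    return -(keys @ query) / np.sqrt(d)

def boltzmann_weights(energies, temperature=1.0):
    """Softmax of negative energies, i.e. a Boltzmann distribution over keys."""
    logits = -energies / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def free_energy(energies, temperature=1.0):
    """F = -T * log sum_j exp(-E_j / T), computed with the log-sum-exp trick."""
    logits = -energies / temperature
    m = logits.max()
    return -temperature * (m + np.log(np.exp(logits - m).sum()))

# Hypothetical toy sizes: 3 keys, embedding dimension 4, value dimension 2.
rng = np.random.default_rng(0)
query = rng.normal(size=4)
keys = rng.normal(size=(3, 4))
values = rng.normal(size=(3, 2))

energies = attention_energies(query, keys)
weights = boltzmann_weights(energies)
output = weights @ values  # attention output as an expectation under the weights
print(weights, output)
```

Under this assumption, the negative log of each attention weight equals the key's energy minus the free energy, which is the sense in which a "potential function defined as the negative log-probability" can be attached to an attention head.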

What carries the argument

Massey's directed information applied to the stateful channel-with-feedback model of autoregressive token generation, which directly produces the directed rate-distortion and rate-reward functions.
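For reference, Massey's directed information for two length-n sequences can be written as follows and contrasted with ordinary mutual information. The symbols X^n and Y^n are generic; how the paper assigns them to prompts, tokens, or states is not specified on this page, so that mapping is left open here.

```latex
% Directed information: at step t, only the inputs up to time t are used
I(X^n \to Y^n) = \sum_{t=1}^{n} I\left(X^{t};\, Y_t \mid Y^{t-1}\right)

% Mutual information by the chain rule: the full input sequence X^n appears in every term
I(X^n ; Y^n) = \sum_{t=1}^{n} I\left(X^{n};\, Y_t \mid Y^{t-1}\right)

% Directed information never exceeds mutual information
I(X^n \to Y^n) \le I(X^n ; Y^n)
```

The only difference between the two sums is the superscript on X: replacing X^n by X^t removes any dependence on future inputs, which is what makes the measure causal.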

Load-bearing premise

Directed information supplies the correct causal measure for token sequences, and semantic embeddings can be treated as a manifold on which energy-based dynamics operate without further calibration.

What would settle it

A measurement showing that the empirical information rates realized during actual LLM pre-training deviate substantially from the values predicted by the derived directed rate-distortion function would falsify the central claim.

Figures

Figures reproduced from arXiv: 2511.01202 by Bo Bai.

Figure 1: The probabilistic model of an LLM at time

Original abstract

Despite the empirical successes of Large Language Models (LLMs), the prevailing paradigm is heuristic and experiment-driven, tethered to massive compute and data, while a first-principles theory remains absent. This treatise develops a Semantic Information Theory at the confluence of statistical physics, signal processing, and classical information theory, organized around a single paradigm shift: replacing the classical BIT - a microscopic substrate devoid of semantic content - with the macroscopic TOKEN as the atomic carrier of meaning and reasoning. Within this framework we recast attention and the Transformer as energy-based models, and interpret semantic embedding as vectorization on the semantic manifold. Modeling the LLM as a stateful channel with feedback, we adopt Massey's directed information as the native causal measure of autoregressive generation, from which we derive a directed rate-distortion function for pre-training, a directed rate-reward function for RL-based post-training, and a sub-martingale account of inference-time semantic information flow. This machinery makes precise the identification of next-token prediction with Granger causal inference, and sharpens the limits of LLM reasoning against Pearl's Ladder of Causation - affirming that whereas the BIT defined the Information Epoch, the TOKEN will define the AI Epoch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript develops a Semantic Information Theory for LLMs by replacing the classical bit with the token as the fundamental atomic carrier of semantic meaning and reasoning. It recasts attention and the Transformer architecture as energy-based models on a semantic manifold, models the LLM as a stateful channel with feedback, and adopts Massey's directed information as the native causal measure. From this, the authors derive a directed rate-distortion function for pre-training, a directed rate-reward function for RL-based post-training, and a sub-martingale account of inference-time semantic information flow. The work further identifies next-token prediction with Granger causal inference and situates LLM reasoning limits relative to Pearl's Ladder of Causation.

Significance. If the proposed derivations can be made rigorous with explicit definitions, channel models, and verifiable steps, the framework would offer a first-principles bridge between classical information theory, statistical physics, and LLM training dynamics. It could supply principled bounds on pre-training and post-training objectives and clarify causal aspects of autoregressive generation, potentially influencing more efficient and interpretable model development. The integration of directed information with energy-based views on token manifolds represents an ambitious attempt to move beyond heuristic paradigms.

major comments (3)
  1. [Abstract / Modeling the LLM as a stateful channel] Abstract and modeling section: The central derivations of the directed rate-distortion function for pre-training and directed rate-reward function for RL post-training are asserted to follow from modeling the LLM as a stateful channel with feedback and applying Massey's directed information, yet no explicit state space, transition kernel, feedback structure, or channel capacity expressions are supplied. Without these, the claimed functions remain formal re-labelings rather than consequences of the directed-information calculus.
  2. [Semantic embedding as vectorization on the semantic manifold] Semantic manifold and energy-based recasting: The paper treats semantic embeddings as a manifold on which attention operates via an energy function, but provides no Riemannian metric, potential function, or mapping from the discrete token vocabulary to this continuous geometry. This assumption is load-bearing for recasting the Transformer as an energy-based model and for the subsequent information-flow claims.
  3. [Derivations of directed rate-distortion and sub-martingale account] Derivations and proofs: The abstract states that a sub-martingale account of inference-time semantic information flow and the identification of next-token prediction with Granger causality follow from the framework, but the manuscript contains no equations, intermediate steps, or proofs supporting these results. This absence prevents verification of the central theoretical claims.
minor comments (2)
  1. [Abstract] The abstract is highly compressed and introduces multiple novel constructs (TOKEN, semantic manifold, directed rate-reward) without brief definitional anchors, which reduces immediate readability for readers outside the immediate subfield.
  2. [Throughout] Notation for the new quantities (e.g., directed rate-distortion function) should be introduced with explicit symbols and contrasted against classical rate-distortion to avoid ambiguity.
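To make the notational contrast requested in the last minor comment concrete, one plausible way to set the two quantities side by side is sketched below. The classical definition is standard; the directed form is an assumption modeled on sequential (causal) rate-distortion theory, with a hypothetical symbol R^{\to}_n(D) and a sequence distortion d_n, and is not taken from the paper.

```latex
% Classical rate-distortion function for a source X and reproduction \hat{X}
R(D) = \min_{P_{\hat{X}\mid X}\,:\; \mathbb{E}[d(X,\hat{X})] \le D} I(X;\hat{X})

% One plausible directed analogue over length-n sequences (assumed form):
% the reproduction at time t may depend only on x^t and on past reproductions
R^{\to}_{n}(D) = \inf_{\{P(\hat{x}_t \mid \hat{x}^{t-1},\, x^{t})\}_{t=1}^{n}\,:\; \mathbb{E}[d_n(X^n,\hat{X}^n)] \le D} \frac{1}{n}\, I(X^n \to \hat{X}^n)
```

The two changes are the causal restriction on the test channel and the replacement of mutual information by directed information; any definition the authors adopt would need to state both explicitly.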

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments correctly identify areas where additional rigor is required to substantiate the proposed framework. We respond to each major comment below and commit to the indicated revisions.

Point-by-point responses
  1. Referee: [Abstract / Modeling the LLM as a stateful channel] Abstract and modeling section: The central derivations of the directed rate-distortion function for pre-training and directed rate-reward function for RL post-training are asserted to follow from modeling the LLM as a stateful channel with feedback and applying Massey's directed information, yet no explicit state space, transition kernel, feedback structure, or channel capacity expressions are supplied. Without these, the claimed functions remain formal re-labelings rather than consequences of the directed-information calculus.

    Authors: We agree that the current presentation of the stateful channel model remains at a conceptual level and does not yet supply the explicit components needed for rigorous derivation. In the revised manuscript we will define the state as the pair consisting of the current semantic embedding vector and the finite history of prior tokens, specify the transition kernel as the autoregressive conditional distribution p(token_{t+1} | state_t) realized by the Transformer, and formalize the feedback structure as the causal dependence of each output on all preceding outputs. With these elements in place we will derive the directed rate-distortion function by minimizing the directed information rate subject to a semantic distortion constraint, following the standard variational characterization for channels with feedback. The same construction will yield the directed rate-reward function for the RL post-training stage. revision: yes

  2. Referee: [Semantic embedding as vectorization on the semantic manifold] Semantic manifold and energy-based recasting: The paper treats semantic embeddings as a manifold on which attention operates via an energy function, but provides no Riemannian metric, potential function, or mapping from the discrete token vocabulary to this continuous geometry. This assumption is load-bearing for recasting the Transformer as an energy-based model and for the subsequent information-flow claims.

    Authors: The referee is right that the geometric structure is essential to the energy-based interpretation and must be made explicit. We will add a precise construction of the semantic manifold: the Riemannian metric will be taken as the Fisher information metric on the probability simplex induced by the token embeddings; the potential function will be defined as the negative log-probability under the attention-weighted distribution; and the embedding map will be the composition of the model's learned token embedding layer with a smooth lifting that places each vocabulary element at a point on the manifold. These definitions will allow attention to be recast as a gradient flow on the manifold and will ground the subsequent claims about information flow. revision: yes

  3. Referee: [Derivations of directed rate-distortion and sub-martingale account] Derivations and proofs: The abstract states that a sub-martingale account of inference-time semantic information flow and the identification of next-token prediction with Granger causality follow from the framework, but the manuscript contains no equations, intermediate steps, or proofs supporting these results. This absence prevents verification of the central theoretical claims.

    Authors: We acknowledge that the manuscript currently states these results without supplying the supporting derivations. In the revision we will include a dedicated theoretical section containing the full development. We will prove the sub-martingale property by showing that the expected increment in cumulative directed information at each generation step is non-negative under the model's predictive distribution. For the Granger-causality identification we will establish the equivalence between minimizing the directed information rate via next-token prediction and the classical Granger test applied to the token sequence. All intermediate steps and necessary lemmas will be provided, either in the main text or in a self-contained appendix. revision: yes
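The non-negativity argument in the third response rests on the fact that each per-step term of the directed information is a conditional mutual information. A brute-force sketch on a toy binary channel with feedback makes this visible. Everything specific here is an assumption for illustration: the channel (`noise`, `follow`), the sequence length, and the function names are hypothetical, and the cumulative sum computed below is a deterministic quantity, not the per-sample random process that a sub-martingale statement would concern.

```python
import itertools
import math
from collections import defaultdict

def build_joint(n, noise=0.1, follow=0.8):
    """Toy joint pmf over binary (x^n, y^n) with feedback.

    y_t equals x_t flipped with probability `noise`;
    x_{t+1} copies y_t with probability `follow` (x_1 is a fair coin).
    """
    joint = {}
    for xs in itertools.product((0, 1), repeat=n):
        for ys in itertools.product((0, 1), repeat=n):
            p = 1.0
            for t in range(n):
                if t == 0:
                    p *= 0.5
                else:
                    p *= follow if xs[t] == ys[t - 1] else 1.0 - follow
                p *= 1.0 - noise if ys[t] == xs[t] else noise
            joint[(xs, ys)] = p
    return joint

def entropy(joint, key):
    """Entropy in bits of the marginal obtained by applying `key` to each outcome."""
    marginal = defaultdict(float)
    for (xs, ys), p in joint.items():
        marginal[key(xs, ys)] += p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)

def directed_information_terms(joint, n):
    """Per-step terms I(X^t; Y_t | Y^{t-1}) for t = 1..n, via joint entropies."""
    terms = []
    for t in range(1, n + 1):
        h_x_yprev = entropy(joint, lambda xs, ys: (xs[:t], ys[:t - 1]))
        h_y = entropy(joint, lambda xs, ys: ys[:t])
        h_xy = entropy(joint, lambda xs, ys: (xs[:t], ys[:t]))
        h_yprev = entropy(joint, lambda xs, ys: ys[:t - 1])
        terms.append(h_x_yprev + h_y - h_xy - h_yprev)
    return terms

n = 3
joint = build_joint(n)
terms = directed_information_terms(joint, n)
cumulative = list(itertools.accumulate(terms))
mutual_information = (
    entropy(joint, lambda xs, ys: xs)
    + entropy(joint, lambda xs, ys: ys)
    - entropy(joint, lambda xs, ys: (xs, ys))
)
print("per-step terms:", terms)
print("cumulative directed information:", cumulative)
print("mutual information:", mutual_information)
```

Because every term is non-negative, the cumulative directed information cannot decrease as generation proceeds, and its final value is bounded above by the mutual information. A rigorous sub-martingale proof would additionally need a filtration and a per-sample information density, neither of which this sketch supplies.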

Circularity Check

1 step flagged

Directed rate-distortion and rate-reward functions reduce to re-labeling via TOKEN modeling and Massey's measure

specific steps
  1. self-definitional [Abstract]
    "Modeling the LLM as a stateful channel with feedback, we adopt Massey's directed information as the native causal measure of autoregressive generation, from which we derive a directed rate-distortion function for pre-training, a directed rate-reward function for RL-based post-training, and a sub-martingale account of inference-time semantic information flow."

    The directed rate-distortion and rate-reward functions are defined into existence by the choice to treat the LLM as a stateful channel equipped with the TOKEN paradigm and semantic manifold; the 'derivation' therefore consists of relabeling the existing next-token process with the new directed-information terminology rather than computing a non-trivial quantity from independently specified channel parameters or manifold geometry.

full rationale

The paper's core derivations start from the modeling choice of the LLM as a stateful channel with feedback and the adoption of Massey's directed information as the native measure, then claim to derive new directed rate-distortion and rate-reward functions plus a sub-martingale account of information flow. These steps are self-definitional because the new quantities are introduced precisely by applying the external directed-information concept inside the newly posited TOKEN/semantic-manifold framework without supplying an explicit state space, transition kernel, Riemannian metric, or energy function that would make the mapping non-tautological. The abstract presents the modeling step and the derivations as direct consequences, but the provided text supplies no independent equations or external benchmarks that would prevent the results from being equivalent to the input modeling assumptions by construction. No load-bearing self-citations or fitted predictions appear in the excerpt; the circularity is limited to the definitional recasting of known autoregressive generation under new labels.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The abstract introduces the TOKEN as a new primitive and assumes semantic embeddings form a manifold amenable to energy-based modeling; no explicit free parameters or external axioms are listed.

axioms (1)
  • [domain assumption] Massey's directed information is the native causal measure for autoregressive generation
    Invoked to derive directed rate-distortion and rate-reward functions
invented entities (2)
  • TOKEN as atomic carrier of meaning (no independent evidence)
    purpose: Replace bit as fundamental semantic unit
    Central paradigm shift stated in the abstract; no independent falsifiable prediction supplied
  • semantic manifold (no independent evidence)
    purpose: Space on which embeddings are vectorized
    Used to interpret semantic embedding; no external evidence or coordinates given

pith-pipeline@v0.9.0 · 5738 in / 1326 out tokens · 47441 ms · 2026-05-18T01:51:05.738777+00:00 · methodology


