Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs
Pith reviewed 2026-05-18 01:51 UTC · model grok-4.3
The pith
Replacing the bit with the token as the fundamental unit of meaning yields a directed rate-distortion theory for LLM pre-training and post-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating the token as the macroscopic atomic unit that carries semantics, the attention mechanism and Transformer can be viewed as energy-based dynamics on a semantic manifold. Modeling autoregressive generation as a stateful channel with feedback, Massey's directed information supplies the native causal measure from which a directed rate-distortion function for pre-training, a directed rate-reward function for RL post-training, and a sub-martingale account of inference-time semantic flow all follow. The same machinery equates next-token prediction with Granger causal inference and locates the reasoning capacity of LLMs within the first two rungs of Pearl's ladder of causation.
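The energy-based reading of attention can be made concrete with a minimal sketch. It assumes standard scaled dot-product attention and a hypothetical energy E_j = -<q, k_j> / sqrt(d); this is an illustrative choice, not necessarily the energy function the paper defines. Under that assumption the softmax attention weights coincide with a Boltzmann (Gibbs) distribution at unit temperature. The function names and the toy vectors are invented for illustration.

```python
import math

def attention_weights(query, keys):
    """Softmax attention weights for one query over a list of key vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def boltzmann_weights(query, keys, temperature=1.0):
    """Gibbs distribution p_j = exp(-E_j / T) / Z.

    Assumed energy (illustrative only): E_j = -<q, k_j> / sqrt(d).
    """
    d = len(query)
    energies = [-sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    e_min = min(energies)  # shifting all energies leaves the distribution unchanged
    unnorm = [math.exp(-(e - e_min) / temperature) for e in energies]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Toy query and keys (hypothetical values).
query = [1.0, 0.0, -1.0, 0.5]
keys = [[0.2, 0.1, -0.3, 0.0], [1.0, 0.0, -1.0, 0.5], [-0.5, 0.4, 0.9, 0.1]]
attn = attention_weights(query, keys)
gibbs = boltzmann_weights(query, keys)
```

With `temperature=1.0` the two lists agree entry by entry, and the key with the lowest energy (here the one equal to the query) receives the largest weight. Other temperatures would correspond to sharpened or flattened attention.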
What carries the argument
Massey's directed information applied to the stateful channel-with-feedback model of autoregressive token generation, which directly produces the directed rate-distortion and rate-reward functions.
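For readers unfamiliar with the quantity, the standard definition of Massey's directed information is reproduced below, next to the chain-rule expansion of ordinary mutual information. The symbols X^n and Y^n are generic input and output sequences; how the paper maps them onto prompts, tokens, or embeddings is not stated in this review and is left open here.

```latex
% Directed information (Massey, 1990) from X^n = (X_1,...,X_n) to Y^n
I(X^n \to Y^n) \;=\; \sum_{i=1}^{n} I\bigl(X^i ; Y_i \,\big|\, Y^{i-1}\bigr)

% Ordinary mutual information, expanded by the chain rule for comparison
I(X^n ; Y^n) \;=\; \sum_{i=1}^{n} I\bigl(X^n ; Y_i \,\big|\, Y^{i-1}\bigr)

% Massey's inequality; equality holds when the channel is used without feedback
I(X^n \to Y^n) \;\le\; I(X^n ; Y^n)
```

The only difference between the two sums is that the directed version conditions on the inputs up to time i rather than on the whole input sequence, which is what makes it a causal measure.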
Load-bearing premise
Directed information supplies the correct causal measure for token sequences, and semantic embeddings can be treated as a manifold on which energy-based dynamics operate without further calibration.
What would settle it
A measurement showing that the empirical information rates realized during actual LLM pre-training deviate substantially from the values predicted by the derived directed rate-distortion function would falsify the central claim.
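As a rough illustration of what measuring such a rate involves, the sketch below computes directed information exactly for a toy two-step binary symmetric channel, once without feedback and once with the second input copying the first output. It is not a measurement on an LLM; the channel, the crossover probability, and all function names are assumptions chosen so that the result can be checked by hand.

```python
import math
from collections import defaultdict
from itertools import product

def marginal(joint, x_len, y_len):
    """Marginal pmf of (x^x_len, y^y_len) from a joint pmf over (x^n, y^n)."""
    out = defaultdict(float)
    for (xs, ys), p in joint.items():
        out[(xs[:x_len], ys[:y_len])] += p
    return out

def directed_information(joint, n):
    """I(X^n -> Y^n) = sum_i I(X^i; Y_i | Y^{i-1}), in bits."""
    total = 0.0
    for i in range(1, n + 1):
        p_xy = marginal(joint, i, i)           # p(x^i, y^i)
        p_x_yprev = marginal(joint, i, i - 1)  # p(x^i, y^{i-1})
        p_y = marginal(joint, 0, i)            # p(y^i)
        p_yprev = marginal(joint, 0, i - 1)    # p(y^{i-1})
        for (xs, ys), p in p_xy.items():
            if p > 0:
                num = p * p_yprev[((), ys[:-1])]
                den = p_x_yprev[(xs, ys[:-1])] * p_y[((), ys)]
                total += p * math.log2(num / den)
    return total

def mutual_information(joint):
    """I(X^n; Y^n) in bits."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (xs, ys), p in joint.items():
        p_x[xs] += p
        p_y[ys] += p
    return sum(p * math.log2(p / (p_x[xs] * p_y[ys]))
               for (xs, ys), p in joint.items() if p > 0)

def bsc_joint(eps, feedback):
    """Two uses of a binary symmetric channel with crossover probability eps.

    Without feedback both inputs are i.i.d. uniform bits; with feedback the
    second input is a copy of the first output (a toy feedback rule).
    """
    joint = defaultdict(float)
    for x1, z1, u, z2 in product((0, 1), repeat=4):
        y1 = x1 ^ z1
        x2 = y1 if feedback else u
        y2 = x2 ^ z2
        p = 0.5 * (eps if z1 else 1 - eps) * 0.5 * (eps if z2 else 1 - eps)
        joint[((x1, x2), (y1, y2))] += p
    return dict(joint)

di_plain = directed_information(bsc_joint(0.1, feedback=False), 2)
mi_plain = mutual_information(bsc_joint(0.1, feedback=False))
di_fb = directed_information(bsc_joint(0.1, feedback=True), 2)
mi_fb = mutual_information(bsc_joint(0.1, feedback=True))
```

Without feedback the directed and mutual information coincide; with the copy-back rule the mutual information exceeds the directed information, because part of the statistical dependence flows from output to input. An empirical test of the paper's claim would need an estimator of this quantity that scales to token sequences, which this exact enumeration does not.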
Original abstract
Despite the empirical successes of Large Language Models (LLMs), the prevailing paradigm is heuristic and experiment-driven, tethered to massive compute and data, while a first-principles theory remains absent. This treatise develops a Semantic Information Theory at the confluence of statistical physics, signal processing, and classical information theory, organized around a single paradigm shift: replacing the classical BIT - a microscopic substrate devoid of semantic content - with the macroscopic TOKEN as the atomic carrier of meaning and reasoning. Within this framework we recast attention and the Transformer as energy-based models, and interpret semantic embedding as vectorization on the semantic manifold. Modeling the LLM as a stateful channel with feedback, we adopt Massey's directed information as the native causal measure of autoregressive generation, from which we derive a directed rate-distortion function for pre-training, a directed rate-reward function for RL-based post-training, and a sub-martingale account of inference-time semantic information flow. This machinery makes precise the identification of next-token prediction with Granger causal inference, and sharpens the limits of LLM reasoning against Pearl's Ladder of Causation - affirming that whereas the BIT defined the Information Epoch, the TOKEN will define the AI Epoch.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a Semantic Information Theory for LLMs by replacing the classical bit with the token as the fundamental atomic carrier of semantic meaning and reasoning. It recasts attention and the Transformer architecture as energy-based models on a semantic manifold, models the LLM as a stateful channel with feedback, and adopts Massey's directed information as the native causal measure. From this, the authors derive a directed rate-distortion function for pre-training, a directed rate-reward function for RL-based post-training, and a sub-martingale account of inference-time semantic information flow. The work further identifies next-token prediction with Granger causal inference and situates LLM reasoning limits relative to Pearl's Ladder of Causation.
Significance. If the proposed derivations can be made rigorous with explicit definitions, channel models, and verifiable steps, the framework would offer a first-principles bridge between classical information theory, statistical physics, and LLM training dynamics. It could supply principled bounds on pre-training and post-training objectives and clarify causal aspects of autoregressive generation, potentially influencing more efficient and interpretable model development. The integration of directed information with energy-based views on token manifolds represents an ambitious attempt to move beyond heuristic paradigms.
major comments (3)
- [Abstract / Modeling the LLM as a stateful channel] Abstract and modeling section: The central derivations of the directed rate-distortion function for pre-training and directed rate-reward function for RL post-training are asserted to follow from modeling the LLM as a stateful channel with feedback and applying Massey's directed information, yet no explicit state space, transition kernel, feedback structure, or channel capacity expressions are supplied. Without these, the claimed functions remain formal re-labelings rather than consequences of the directed-information calculus.
- [Semantic embedding as vectorization on the semantic manifold] Semantic manifold and energy-based recasting: The paper treats semantic embeddings as a manifold on which attention operates via an energy function, but provides no Riemannian metric, potential function, or mapping from the discrete token vocabulary to this continuous geometry. This assumption is load-bearing for recasting the Transformer as an energy-based model and for the subsequent information-flow claims.
- [Derivations of directed rate-distortion and sub-martingale account] Derivations and proofs: The abstract states that a sub-martingale account of inference-time semantic information flow and the identification of next-token prediction with Granger causality follow from the framework, but the manuscript contains no equations, intermediate steps, or proofs supporting these results. This absence prevents verification of the central theoretical claims.
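To make the first major comment concrete, the following sketch shows what an explicit state space, transition kernel, and feedback structure could look like for a toy autoregressive generator. Everything in it is hypothetical: the three-token vocabulary, the rule that favours repeating the last token, and the window length stand in for components the manuscript would have to specify for a real Transformer.

```python
import math
import random
from typing import Dict, List, Tuple

VOCAB = ("a", "b", "c")   # hypothetical toy vocabulary
State = Tuple[str, ...]   # state = finite window of past tokens

def transition_kernel(state: State) -> Dict[str, float]:
    """Toy p(token_{t+1} | state_t): a softmax that favours repeating the last token."""
    logits = {tok: (1.5 if state and tok == state[-1] else 0.0) for tok in VOCAB}
    z = sum(math.exp(v) for v in logits.values())
    return {tok: math.exp(v) / z for tok, v in logits.items()}

def update_state(state: State, token: str, window: int = 4) -> State:
    """Feedback: the emitted token re-enters the state, truncated to a window."""
    return (state + (token,))[-window:]

def generate(prompt: State, steps: int, seed: int = 0) -> Tuple[List[str], State]:
    """Run the channel for a number of steps, feeding each output back in."""
    rng = random.Random(seed)
    state, out = prompt, []
    for _ in range(steps):
        dist = transition_kernel(state)
        token = rng.choices(list(dist), weights=list(dist.values()))[0]
        out.append(token)
        state = update_state(state, token)
    return out, state

tokens, final_state = generate(("a",), steps=6)
```

Once the three pieces are written down this explicitly, the directed information between any chosen input process and the generated tokens becomes a well-defined quantity, which is the step the referee reports as missing.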
minor comments (2)
- [Abstract] The abstract is highly compressed and introduces multiple novel constructs (TOKEN, semantic manifold, directed rate-reward) without brief definitional anchors, which reduces readability for readers outside the subfield.
- [Throughout] Notation for the new quantities (e.g., directed rate-distortion function) should be introduced with explicit symbols and contrasted against classical rate-distortion to avoid ambiguity.
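One way the contrast requested in the last minor comment could be written is sketched below. The classical rate-distortion function is standard; the directed version is a hypothetical form modeled on sequential (nonanticipative) rate-distortion theory, with the symbols R^{\to}_n and d_n introduced here for illustration. The review does not confirm that the paper uses this definition.

```latex
% Classical rate-distortion function for a source X with reproduction \hat{X}
R(D) \;=\; \min_{p(\hat{x}\mid x)\,:\; \mathbb{E}[d(X,\hat{X})] \le D} I(X ; \hat{X})

% Hypothetical directed counterpart: optimise over causal kernels only
R^{\to}_{n}(D) \;=\; \inf_{\{p(y_i \mid y^{i-1}, x^{i})\}_{i=1}^{n}\,:\; \mathbb{E}[d_n(X^n, Y^n)] \le D}
  \frac{1}{n}\, I(X^n \to Y^n)
```

The two differences a reader should look for are the restriction of the feasible set to causally conditioned kernels and the replacement of mutual information by directed information in the objective.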
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments correctly identify areas where additional rigor is required to substantiate the proposed framework. We respond to each major comment below and commit to the indicated revisions.
Point-by-point responses
-
Referee: [Abstract / Modeling the LLM as a stateful channel] Abstract and modeling section: The central derivations of the directed rate-distortion function for pre-training and directed rate-reward function for RL post-training are asserted to follow from modeling the LLM as a stateful channel with feedback and applying Massey's directed information, yet no explicit state space, transition kernel, feedback structure, or channel capacity expressions are supplied. Without these, the claimed functions remain formal re-labelings rather than consequences of the directed-information calculus.
Authors: We agree that the current presentation of the stateful channel model remains at a conceptual level and does not yet supply the explicit components needed for rigorous derivation. In the revised manuscript we will define the state as the pair consisting of the current semantic embedding vector and the finite history of prior tokens, specify the transition kernel as the autoregressive conditional distribution p(token_{t+1} | state_t) realized by the Transformer, and formalize the feedback structure as the causal dependence of each output on all preceding outputs. With these elements in place we will derive the directed rate-distortion function by minimizing the directed information rate subject to a semantic distortion constraint, following the standard variational characterization for channels with feedback. The same construction will yield the directed rate-reward function for the RL post-training stage.
Revision: yes
-
Referee: [Semantic embedding as vectorization on the semantic manifold] Semantic manifold and energy-based recasting: The paper treats semantic embeddings as a manifold on which attention operates via an energy function, but provides no Riemannian metric, potential function, or mapping from the discrete token vocabulary to this continuous geometry. This assumption is load-bearing for recasting the Transformer as an energy-based model and for the subsequent information-flow claims.
Authors: The referee is right that the geometric structure is essential to the energy-based interpretation and must be made explicit. We will add a precise construction of the semantic manifold: the Riemannian metric will be taken as the Fisher information metric on the probability simplex induced by the token embeddings; the potential function will be defined as the negative log-probability under the attention-weighted distribution; and the embedding map will be the composition of the model's learned token embedding layer with a smooth lifting that places each vocabulary element at a point on the manifold. These definitions will allow attention to be recast as a gradient flow on the manifold and will ground the subsequent claims about information flow.
Revision: yes
-
Referee: [Derivations of directed rate-distortion and sub-martingale account] Derivations and proofs: The abstract states that a sub-martingale account of inference-time semantic information flow and the identification of next-token prediction with Granger causality follow from the framework, but the manuscript contains no equations, intermediate steps, or proofs supporting these results. This absence prevents verification of the central theoretical claims.
Authors: We acknowledge that the manuscript currently states these results without supplying the supporting derivations. In the revision we will include a dedicated theoretical section containing the full development. We will prove the sub-martingale property by showing that the expected increment in cumulative directed information at each generation step is non-negative under the model's predictive distribution. For the Granger-causality identification we will establish the equivalence between minimizing the directed information rate via next-token prediction and the classical Granger test applied to the token sequence. All intermediate steps and necessary lemmas will be provided, either in the main text or in a self-contained appendix.
Revision: yes
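The sub-martingale argument described in the last response can be sketched in a few lines. This is one possible reading, not the authors' proof: the information density \iota_i, the partial sum S_t, and the filtration \mathcal{G}_t are notation introduced here, and integrability of each \iota_i is assumed.

```latex
% Per-step information density and its running sum
\iota_i \;=\; \log \frac{p(Y_i \mid X^{i}, Y^{i-1})}{p(Y_i \mid Y^{i-1})},
\qquad
S_t \;=\; \sum_{i=1}^{t} \iota_i,
\qquad
\mathbb{E}[S_n] \;=\; I(X^n \to Y^n)

% Filtration: everything known just before Y_{t+1} is emitted
\mathcal{G}_t \;=\; \sigma\bigl(X^{t+1}, Y^{t}\bigr)

% Conditional expected increment is a KL divergence, hence non-negative
\mathbb{E}\bigl[S_{t+1} - S_t \,\big|\, \mathcal{G}_t\bigr]
  \;=\; D_{\mathrm{KL}}\!\bigl(p(\cdot \mid X^{t+1}, Y^{t}) \,\big\|\, p(\cdot \mid Y^{t})\bigr)
  \;\ge\; 0
```

Since S_t depends only on (X^t, Y^t), it is measurable with respect to \mathcal{G}_t, so under the stated assumptions (S_t) is a sub-martingale with respect to (\mathcal{G}_t). Taking expectations recovers the weaker statement that the partial sums of directed information are non-decreasing in t.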
Circularity Check
Directed rate-distortion and rate-reward functions reduce to re-labeling via TOKEN modeling and Massey's measure
specific steps
- self-definitional [Abstract]
"Modeling the LLM as a stateful channel with feedback, we adopt Massey's directed information as the native causal measure of autoregressive generation, from which we derive a directed rate-distortion function for pre-training, a directed rate-reward function for RL-based post-training, and a sub-martingale account of inference-time semantic information flow."
The directed rate-distortion and rate-reward functions are defined into existence by the choice to treat the LLM as a stateful channel equipped with the TOKEN paradigm and semantic manifold; the 'derivation' therefore consists of relabeling the existing next-token process with the new directed-information terminology rather than computing a non-trivial quantity from independently specified channel parameters or manifold geometry.
full rationale
The paper's core derivations start from the modeling choice of the LLM as a stateful channel with feedback and the adoption of Massey's directed information as the native measure, then claim to derive new directed rate-distortion and rate-reward functions plus a sub-martingale flow. These steps are self-definitional because the new quantities are introduced precisely by applying the external directed-information concept inside the newly posited TOKEN/semantic-manifold framework without supplying an explicit state space, transition kernel, Riemannian metric, or energy function that would make the mapping non-tautological. The abstract presents the modeling step and the derivations as direct consequences, but the provided text supplies no independent equations or external benchmarks that would prevent the results from being equivalent to the input modeling assumptions by construction. No load-bearing self-citations or fitted predictions appear in the excerpt; the circularity is limited to the definitional recasting of known autoregressive generation under new labels.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Massey's directed information is the native causal measure for autoregressive generation
invented entities (2)
- TOKEN as atomic carrier of meaning: no independent evidence
- semantic manifold: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Modeling the LLM as a stateful channel with feedback, we adopt Massey's directed information as the native causal measure... derive a directed rate-distortion function for pre-training, a directed rate-reward function for RL-based post-training
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · echoes
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  the AR-LLM ... can also be formulated as the Boltzmann distribution ... E(u_i) = −<u_i, Ψ(∑ A_ij u_j)>
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  semantic vector space ... S^{M-1} ... Gromov-Wasserstein distance based semantic distortion metric
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.