A Markov Categorical Framework for Language Modeling
Pith reviewed 2026-05-19 02:26 UTC · model grok-4.3
pith:ZK4WREY7
The pith
For linear-softmax heads, a quadratic surrogate to negative log-likelihood produces a generalized eigenproblem that aligns representations with predictive prototypes after whitening.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For a linear-softmax head with bounded output features, a calibrated quadratic upper-bound surrogate to NLL induces, after whitening or variance normalization, a generalized CCA/eigenproblem aligning representation directions with predictive prototypes.
What carries the argument
Markov category composition of the single-step generation process, which quantifies information surplus about future tokens and reduces the quadratic surrogate plus whitening step to a generalized eigenproblem.
If this is right
- The framework quantifies the information surplus a hidden state carries about tokens beyond the immediate next one, supplying an information-theoretic rationale for speculative decoding and other parallel drafting methods.
- Negative log-likelihood training is shown to learn the data's intrinsic conditional uncertainty in addition to the mode, formalized through categorical entropy.
- Likelihood training shapes internal geometry by aligning representation directions with predictive prototypes under the stated conditions on the output head and loss surrogate.
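The second point above rests on a standard identity: the expected NLL of a prediction q under the true distribution p equals the entropy of p plus KL(p || q), so it is minimized only when q reproduces the full conditional distribution, not just its mode. The following is a minimal numerical sketch of that identity, assuming Shannon entropy in nats as a stand-in for the paper's categorical entropy; the distributions `p_true` and `q_mode` are made-up illustrations, not data from the paper.

```python
import numpy as np

def expected_nll(p, q):
    """Expected NLL (cross-entropy, in nats) of predictions q under the true distribution p."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(-np.sum(p[mask] * np.log(q[mask])))

def entropy(p):
    """Shannon entropy of p in nats (cross-entropy of p with itself)."""
    return expected_nll(p, p)

# Hypothetical true next-token distribution for a single context.
p_true = np.array([0.6, 0.3, 0.1])

# A mode-only predictor: almost all mass on the most likely token.
q_mode = np.array([0.98, 0.01, 0.01])

nll_calibrated = expected_nll(p_true, p_true)  # equals the entropy of p_true
nll_mode_only = expected_nll(p_true, q_mode)   # entropy plus KL(p_true || q_mode)
kl_gap = nll_mode_only - nll_calibrated        # KL divergence, strictly positive here

print(f"entropy={entropy(p_true):.4f}  calibrated={nll_calibrated:.4f}  mode-only={nll_mode_only:.4f}")
```

The mode-only predictor names the correct most likely token yet pays a strictly positive KL penalty, which is the sense in which NLL rewards matching the conditional uncertainty as well.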
Where Pith is reading between the lines
- Models trained with the standard objective should exhibit measurable covariance alignment between whitened hidden states and next-token prototypes when the output head is approximately linear.
- The same compositional analysis could be applied to other autoregressive domains such as time-series forecasting to test whether analogous spectral alignments appear.
- Relaxing the bounded-feature or linear-head assumption would identify the regimes in which the eigenproblem reduction fails to hold.
Load-bearing premise
The derivation assumes a linear-softmax output head with bounded output features and relies on a calibrated quadratic upper-bound surrogate to the negative log-likelihood loss.
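For orientation, one classical quadratic upper bound on the softmax NLL is the Böhning-type bound below. Whether the paper's calibrated surrogate takes this exact form is not stated on this page, so the symbols (logits z and z_0, target index y, softmax probabilities p(z), linear head W, hidden state h) are illustrative assumptions.

```latex
% Illustrative only: a Boehning-type quadratic upper bound on the softmax NLL.
% z, z_0 \in \mathbb{R}^K are logits, y is the target index, p(z) = \mathrm{softmax}(z).
\begin{align}
\ell(z) &= -z_y + \log \sum_{k=1}^{K} e^{z_k}, \\
\nabla^2 \ell(z) &= \operatorname{diag}\bigl(p(z)\bigr) - p(z)\,p(z)^\top \preceq \tfrac{1}{2} I, \\
\ell(z) &\le \ell(z_0) + \bigl(p(z_0) - e_y\bigr)^\top (z - z_0) + \tfrac{1}{4}\,\lVert z - z_0 \rVert^2 .
\end{align}
% With a linear head z = W h, the last term becomes \tfrac{1}{4}\lVert W (h - h_0) \rVert^2,
% a quadratic form in the hidden state h.
```

Under a linear head the bound is quadratic in the hidden state, which is the kind of structure that a whitening step can turn into an eigenproblem.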
What would settle it
Train a linear-softmax language model, whiten its hidden representations, solve the induced generalized eigenproblem, and check whether the resulting principal directions match the predictive prototypes obtained from the quadratic surrogate.
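The check described above can be sketched on synthetic data. The snippet is a minimal illustration, not the paper's experiment: the "prototypes", the noise level, and the helper `whiten` are hypothetical stand-ins, and the generalized CCA eigenproblem is solved in its standard equivalent form, an SVD of the cross-covariance between whitened hidden states and whitened one-hot next-token targets.

```python
import numpy as np

def whiten(x, eps=1e-6):
    """Center x and apply symmetric (ZCA) whitening; eps regularizes small eigenvalues."""
    xc = x - x.mean(axis=0)
    cov = xc.T @ xc / len(xc)
    evals, evecs = np.linalg.eigh(cov)
    w = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return xc @ w, w

rng = np.random.default_rng(0)
n, d, vocab = 5000, 8, 4

# Synthetic assumptions: one random prototype per token, and hidden states that are
# noisy copies of the prototype of the token that follows them.
prototypes = rng.normal(size=(vocab, d))
tokens = rng.integers(0, vocab, size=n)
hidden = prototypes[tokens] + 0.5 * rng.normal(size=(n, d))
onehot = np.eye(vocab)[tokens]

h_white, _ = whiten(hidden)
y_white, _ = whiten(onehot)  # centered one-hot covariance is rank-deficient, hence eps

# With both sides whitened, the CCA eigenproblem reduces to an SVD of the cross-covariance.
cross = h_white.T @ y_white / n
u, canon_corr, vt = np.linalg.svd(cross, full_matrices=False)

# Columns of u are directions in whitened hidden space, ordered by canonical correlation.
print("canonical correlations:", np.round(canon_corr, 3))
```

On a trained model, `hidden` would be replaced by actual hidden states and the leading columns of `u` compared against the directions predicted by the surrogate analysis; that comparison is not attempted here.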
Original abstract
Autoregressive language models achieve remarkable performance, yet a unified theory explaining their internal mechanisms, how training shapes representations, and why these representations support complex behavior remains incomplete. We introduce an analytical framework that models the single-step generation process as a composition of information-processing stages using the language of Markov categories. This compositional perspective connects three aspects of language modeling that are often studied separately: the training objective, the geometry of the learned representation space, and practical model capabilities. First, our framework gives an information-theoretic rationale for parallel drafting methods such as speculative decoding by quantifying the information surplus a hidden state contains about future tokens beyond the immediate next one. Second, we clarify how the standard negative log-likelihood (NLL) objective learns not only a most likely next token, but also the data's intrinsic conditional uncertainty, formalized through categorical entropy. Our main spectral result is conditional: for a linear-softmax head with bounded output features, a calibrated quadratic upper-bound surrogate to NLL induces, after whitening or variance normalization, a generalized CCA/eigenproblem aligning representation directions with predictive prototypes. This gives a compositional lens for understanding how information flows through a model and how likelihood training can shape its internal geometry.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a Markov categorical framework modeling autoregressive language model generation as compositions of information-processing stages. It connects the NLL training objective, representation geometry, and capabilities, offering an information-theoretic rationale for speculative decoding via information surplus in hidden states and formalizing how NLL learns conditional uncertainty through categorical entropy. The central result is conditional: for a linear-softmax head with bounded output features, a calibrated quadratic upper-bound surrogate to NLL induces (after whitening or variance normalization) a generalized CCA/eigenproblem aligning representation directions with predictive prototypes.
Significance. If the result holds under the stated conditions, the framework supplies a compositional, information-theoretic account of how training shapes internal geometry and supports capabilities such as parallel drafting. The explicit use of Markov categories to link objective, geometry, and behavior is a constructive contribution. Credit is due for stating the spectral claim as conditional on the surrogate and normalization steps rather than claiming direct equivalence to standard NLL.
major comments (2)
- [Abstract] The main spectral result is presented without derivation steps, error bounds, or verification that the quadratic surrogate calibration constant is independent of the target eigenproblem. This leaves open whether the alignment is a genuine consequence of the modeling choices or partly a by-product of how the surrogate is constructed.
- [Main spectral result] (Conditional claim on the linear-softmax head.) Substituting the calibrated quadratic upper-bound surrogate for the true NLL is load-bearing for the broader claim that likelihood training shapes representation geometry. Without a tightness argument, or a demonstration that critical points coincide with those of standard NLL near the data manifold, the generalized CCA/eigenproblem does not directly explain geometry under conventional negative log-likelihood training.
minor comments (2)
- [Notation] Notation section: explicitly define the calibration constant for the quadratic surrogate and state whether it is fixed independently of the learned representation or allowed to depend on it.
- [Speculative decoding rationale] Discussion of speculative decoding: quantify the information surplus more explicitly with a concrete bound or example to strengthen the rationale for parallel drafting methods.
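One hypothetical way to make the requested example concrete is to measure the surplus as the conditional mutual information I(H; X2 | X1) between a hidden state H and the second future token X2, given the immediate next token X1. This formalization is an assumption made for illustration and may differ from the paper's definition; the toy distributions are likewise invented.

```python
import numpy as np

def conditional_mutual_information(joint):
    """I(H; X2 | X1) in nats for a joint probability table indexed as joint[h, x1, x2]."""
    joint = np.asarray(joint, dtype=float)
    p_x1 = joint.sum(axis=(0, 2), keepdims=True)   # shape (1, X1, 1)
    p_h_x1 = joint.sum(axis=2, keepdims=True)      # shape (H, X1, 1)
    p_x1_x2 = joint.sum(axis=0, keepdims=True)     # shape (1, X1, X2)
    numerator = joint * p_x1
    denominator = p_h_x1 * p_x1_x2
    mask = joint > 0  # zero-probability cells contribute nothing
    return float(np.sum(joint[mask] * np.log(numerator[mask] / denominator[mask])))

def noisy_copy(flip):
    """Binary channel: output equals input, flipped with probability `flip`."""
    return np.array([[1.0 - flip, flip], [flip, 1.0 - flip]])

p_h = np.array([0.5, 0.5])
channel = noisy_copy(0.2)

# Case A: both future tokens are independent noisy copies of the hidden state.
joint_surplus = p_h[:, None, None] * channel[:, :, None] * channel[:, None, :]

# Case B: the second token depends on the hidden state only through the first token.
joint_no_surplus = p_h[:, None, None] * channel[:, :, None] * channel[None, :, :]

surplus = conditional_mutual_information(joint_surplus)
no_surplus = conditional_mutual_information(joint_no_surplus)
print(f"surplus (case A): {surplus:.4f} nats, surplus (case B): {no_surplus:.2e} nats")
```

In case A the hidden state still carries information about the second token after the first is revealed, which is the situation in which drafting several tokens from one state can pay off; in case B that information is zero.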
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments help clarify the presentation of our conditional spectral result and its relationship to standard NLL training. We respond to each major comment below and indicate planned revisions.
Point-by-point responses
- Referee: [Abstract] The main spectral result is presented without derivation steps, error bounds, or verification that the quadratic surrogate calibration constant is independent of the target eigenproblem. This leaves open whether the alignment is a genuine consequence of the modeling choices or partly a by-product of how the surrogate is constructed.
  Authors: We agree that the abstract is highly condensed. In the revision we will add one sentence outlining the main derivation steps and explicitly state that the calibration constant is obtained by matching the surrogate value to NLL at points drawn from the empirical data distribution; this choice is made prior to, and independently of, solving the subsequent eigenproblem. Error bounds and the full derivation remain in the body of the paper (Section 4), but the abstract will now reference them.
  Revision: yes
- Referee: [Main spectral result] (Conditional claim on the linear-softmax head.) Substituting the calibrated quadratic upper-bound surrogate for the true NLL is load-bearing for the broader claim that likelihood training shapes representation geometry. Without a tightness argument, or a demonstration that critical points coincide with those of standard NLL near the data manifold, the generalized CCA/eigenproblem does not directly explain geometry under conventional negative log-likelihood training.
  Authors: The manuscript already qualifies the result as conditional on the calibrated quadratic surrogate. We will insert a new paragraph in Section 4.3 that supplies a first-order tightness argument: when model predictions are close to the observed conditional distributions (the regime after convergence), the surrogate and NLL differ by a term whose gradient vanishes at the data manifold. We do not claim that every critical point of the surrogate coincides with that of NLL under arbitrary model classes; such a statement would require additional landscape assumptions that lie outside the present scope. The revision will therefore emphasize the surrogate’s role while clarifying the approximation regime in which the geometric alignment remains informative for standard likelihood training.
  Revision: partial
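The kind of first-order tightness described in this response can be checked numerically for a tangent quadratic bound. The sketch below assumes a quadratic with curvature 1/2 anchored at reference logits `z0`; the anchor, the dimensions, and the function names are hypothetical and are not taken from the paper's Section 4.3. It verifies that the gap between surrogate and NLL is nonnegative, is zero at the anchor, and has a vanishing gradient there.

```python
import numpy as np

def nll(z, y):
    """Softmax negative log-likelihood of target index y for logits z."""
    z = z - z.max()  # shift for numerical stability
    return float(np.log(np.exp(z).sum()) - z[y])

def nll_grad(z, y):
    """Gradient of the NLL with respect to the logits: softmax(z) - one_hot(y)."""
    p = np.exp(z - z.max())
    p /= p.sum()
    p[y] -= 1.0
    return p

def surrogate(z, z0, y):
    """Quadratic upper bound tangent to the NLL at z0 (curvature bound 1/2)."""
    d = z - z0
    return nll(z0, y) + float(nll_grad(z0, y) @ d) + 0.25 * float(d @ d)

rng = np.random.default_rng(0)
k, y = 5, 2
z0 = rng.normal(size=k)  # hypothetical anchor, standing in for a converged prediction

def gap(z):
    return surrogate(z, z0, y) - nll(z, y)

# The gap is nonnegative everywhere (upper bound) and zero at the anchor.
gaps = np.array([gap(z0 + rng.normal(scale=3.0, size=k)) for _ in range(1000)])
gap_at_anchor = gap(z0)

# Central-difference gradient of the gap at the anchor: it should vanish.
step = 1e-5
gap_grad = np.array([(gap(z0 + step * e) - gap(z0 - step * e)) / (2 * step) for e in np.eye(k)])

print("min gap:", gaps.min(), "gap at anchor:", gap_at_anchor, "max |grad|:", np.abs(gap_grad).max())
```

Away from the anchor the gap grows, so the check says nothing about whether critical points of the surrogate and of NLL coincide globally, which matches the scope limitation the authors state.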
Circularity Check
No significant circularity; result is explicitly conditional on surrogate and whitening
Full rationale
The paper's main spectral claim is presented as conditional on a linear-softmax head with bounded features, a calibrated quadratic upper-bound surrogate to NLL, and subsequent whitening or variance normalization, which induces a generalized CCA/eigenproblem. This is not a reduction of the true NLL to a fitted quantity by construction, nor does it rely on self-citation for a uniqueness theorem or smuggle an ansatz. The derivation chain uses the Markov categorical framework to connect training objective, representation geometry, and capabilities, but the spectral alignment is derived under the stated modeling choices without tautological redefinition or renaming of known results as new predictions. The framework remains self-contained against external benchmarks for the conditional case.
Axiom & Free-Parameter Ledger
free parameters (1)
- calibration constant for quadratic upper-bound surrogate
axioms (2)
- [standard math] Markov categories compose information-processing stages with well-defined conditional distributions
- [domain assumption] Categorical entropy formalizes conditional uncertainty in the data distribution
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "for a linear-softmax head with bounded output features, a calibrated quadratic upper-bound surrogate to NLL induces, after whitening or variance normalization, a generalized CCA/eigenproblem"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, J_uniquely_calibrated_via_higher_derivative [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "NLL as Implicit Spectral Contrastive Learning"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.