
arXiv: 2507.19247 · v5 · pith:ZK4WREY7 · submitted 2025-07-25 · cs.LG · cs.AI · cs.CL

A Markov Categorical Framework for Language Modeling

Pith reviewed 2026-05-19 02:26 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL
keywords: Markov categories · language modeling · negative log-likelihood · representation geometry · speculative decoding · categorical entropy · generalized eigenproblem

The pith

For linear-softmax heads, a quadratic surrogate to negative log-likelihood produces a generalized eigenproblem that aligns representations with predictive prototypes after whitening.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Markov categorical framework that treats autoregressive generation as a sequence of information-processing stages. This view links the negative log-likelihood training objective to the geometry of learned representations and to capabilities such as parallel token drafting. It shows that the standard loss captures not only the most probable next token but also the data's conditional uncertainty through categorical entropy. The central result states that, for a linear-softmax head with bounded output features, a calibrated quadratic upper-bound surrogate to the loss, followed by whitening, reduces to a generalized CCA or eigenproblem whose solution aligns representation directions with class prototypes. The framework therefore supplies a compositional account of how likelihood training shapes internal model geometry.

Core claim

For a linear-softmax head with bounded output features, a calibrated quadratic upper-bound surrogate to NLL induces, after whitening or variance normalization, a generalized CCA/eigenproblem aligning representation directions with predictive prototypes.
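
For orientation, one standard quadratic upper bound on the softmax NLL is written out below. This is a sketch under stated assumptions, not the paper's derivation: the symbols $z$, $z_0$, $p_0$, $e_y$, $W$ and $h$ are introduced here for illustration, and the paper's calibrated surrogate may use a different constant or expansion point.

```latex
% Per-example NLL for logits z in R^K and target index y:
\ell(z) = -z_y + \log \sum_{k=1}^{K} \exp(z_k), \qquad p = \operatorname{softmax}(z).

% Its Hessian is a covariance under p, hence bounded in every direction v:
\nabla^2 \ell(z) = \operatorname{diag}(p) - p p^{\top}, \qquad
v^{\top} \nabla^2 \ell(z)\, v = \operatorname{Var}_{k \sim p}(v_k) \le \tfrac{1}{2} \lVert v \rVert_2^2 .

% Taylor's theorem then gives, for any expansion point z_0 with p_0 = softmax(z_0):
\ell(z) \le \ell(z_0) + (p_0 - e_y)^{\top} (z - z_0) + \tfrac{1}{4} \lVert z - z_0 \rVert_2^2 .
```

The bound matches $\ell$ in value and gradient at $z_0$. With a linear head $z = W h$ the right-hand side is quadratic in $h$, which is the kind of structure that can turn into an eigenproblem once the second moments of $h$ are normalized.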

What carries the argument

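Composition, in a Markov category, of the stages that make up the single-step generation process. The framework uses this composition to quantify the information surplus a hidden state carries about future tokens, and combines it with the quadratic surrogate and whitening step to arrive at a generalized eigenproblem.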
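
As a concrete picture of what composing stages means, finite Markov kernels can be represented as row-stochastic matrices, and composition is then a matrix product. The sketch below is a toy illustration only: the `encoder` and `head` stages and all numbers in them are hypothetical, and the paper works in a more general categorical setting.

```python
import numpy as np

def compose(f, g):
    """Compose finite Markov kernels: apply f (X -> Y), then g (Y -> Z).

    Each kernel is a row-stochastic matrix whose row i is the output
    distribution for input i, so composition is a matrix product.
    """
    return f @ g

# Hypothetical stage 1: three contexts mapped to two hidden states
# (deterministic here, so each row is a point mass).
encoder = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 1.0]])

# Hypothetical stage 2: two hidden states mapped to distributions
# over three next tokens.
head = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6]])

# Composite single-step kernel: context -> next-token distribution.
step = compose(encoder, head)
print(step)
```

Each row of `step` is again a probability distribution, which is the property that makes the composite a valid kernel.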

If this is right

  • The framework quantifies the information surplus a hidden state carries about tokens beyond the immediate next one, supplying an information-theoretic rationale for speculative decoding and other parallel drafting methods.
  • Negative log-likelihood training is shown to learn the data's intrinsic conditional uncertainty in addition to the mode, formalized through categorical entropy.
  • Likelihood training shapes internal geometry by aligning representation directions with predictive prototypes under the stated conditions on the output head and loss surrogate.
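
The second point can be illustrated with the textbook decomposition of expected NLL into entropy plus KL divergence. The sketch below uses ordinary Shannon entropy as a stand-in for the paper's categorical entropy, and the distributions `p_true` and `q_mode` are made-up examples.

```python
import numpy as np

def expected_nll(p, q):
    """Expected NLL (cross-entropy, in nats) of model q under data distribution p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(-np.sum(p[mask] * np.log(q[mask])))

def entropy(p):
    """Shannon entropy of p in nats."""
    return expected_nll(p, p)

def kl(p, q):
    """KL divergence KL(p || q) in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical conditional distribution over three next tokens.
p_true = np.array([0.5, 0.3, 0.2])
# A model that gets the mode right but is badly overconfident.
q_mode = np.array([0.98, 0.01, 0.01])

print("entropy of data:      ", entropy(p_true))
print("NLL of matching model:", expected_nll(p_true, p_true))
print("NLL of mode-only model:", expected_nll(p_true, q_mode))
```

Because expected NLL equals entropy plus KL, a model that predicts the right mode with the wrong spread still pays a penalty; the loss is minimized only when the full conditional distribution is matched.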

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models trained with the standard objective should exhibit measurable covariance alignment between whitened hidden states and next-token prototypes when the output head is approximately linear.
  • The same compositional analysis could be applied to other autoregressive domains such as time-series forecasting to test whether analogous spectral alignments appear.
  • Relaxing the bounded-feature or linear-head assumption would identify the regimes in which the eigenproblem reduction fails to hold.

Load-bearing premise

The derivation assumes a linear-softmax output head with bounded output features and relies on a calibrated quadratic upper-bound surrogate to the negative log-likelihood loss.

What would settle it

Train a linear-softmax language model, whiten its hidden representations, solve the induced generalized eigenproblem, and check whether the resulting principal directions match the predictive prototypes obtained from the quadratic surrogate.
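
The whitening and eigenproblem steps of such a check can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's procedure: the function names, the use of the cross-covariance between hidden states and one-hot targets, and the use of class means as stand-in prototypes are all assumptions made here.

```python
import numpy as np

def whitened_eigenproblem(H, y, num_classes, eps=1e-6):
    """Solve C C^T v = lam * Sigma v by whitening.

    Sigma is the (regularized) covariance of the hidden states H and C is
    the cross-covariance between H and the one-hot targets.
    Returns eigenvalues (descending), eigenvectors in whitened
    coordinates, and the whitening matrix Sigma^{-1/2}.
    """
    n, d = H.shape
    Hc = H - H.mean(axis=0)
    Y = np.eye(num_classes)[y]
    Yc = Y - Y.mean(axis=0)
    sigma = Hc.T @ Hc / n + eps * np.eye(d)
    C = Hc.T @ Yc / n
    # Sigma^{-1/2} from the symmetric eigendecomposition of Sigma.
    evals, evecs = np.linalg.eigh(sigma)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    M = W @ C @ C.T @ W
    lam, U = np.linalg.eigh(M)          # eigh returns ascending order
    order = np.argsort(lam)[::-1]
    return lam[order], U[:, order], W

def subspace_alignment(U_top, P):
    """Fraction of the energy of the columns of P lying in span(U_top).

    U_top must have orthonormal columns.
    """
    proj = U_top @ (U_top.T @ P)
    return float(np.linalg.norm(proj) ** 2 / np.linalg.norm(P) ** 2)

# Synthetic stand-in for hidden states of a trained model (assumption).
rng = np.random.default_rng(0)
K, d, n = 4, 16, 4000
means = rng.normal(size=(K, d))
y = rng.integers(0, K, size=n)
H = means[y] + 0.5 * rng.normal(size=(n, d))

lam, U, W = whitened_eigenproblem(H, y, K)

# Stand-in prototypes: whitened, centered empirical class means.
mu = np.stack([H[y == k].mean(axis=0) for k in range(K)]) - H.mean(axis=0)
P = W @ mu.T                             # shape (d, K)
score = subspace_alignment(U[:, :K - 1], P)
print("alignment with top K-1 directions:", score)
```

With empirical class means as prototypes the alignment is close to 1 by construction, since the cross-covariance columns are scaled class means. The informative version of the test replaces `P` with the whitened prototype vectors of a trained model's output head and asks whether the score stays high.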

Original abstract

Autoregressive language models achieve remarkable performance, yet a unified theory explaining their internal mechanisms, how training shapes representations, and why these representations support complex behavior remains incomplete. We introduce an analytical framework that models the single-step generation process as a composition of information-processing stages using the language of Markov categories. This compositional perspective connects three aspects of language modeling that are often studied separately: the training objective, the geometry of the learned representation space, and practical model capabilities. First, our framework gives an information-theoretic rationale for parallel drafting methods such as speculative decoding by quantifying the information surplus a hidden state contains about future tokens beyond the immediate next one. Second, we clarify how the standard negative log-likelihood (NLL) objective learns not only a most likely next token, but also the data's intrinsic conditional uncertainty, formalized through categorical entropy. Our main spectral result is conditional: for a linear-softmax head with bounded output features, a calibrated quadratic upper-bound surrogate to NLL induces, after whitening or variance normalization, a generalized CCA/eigenproblem aligning representation directions with predictive prototypes. This gives a compositional lens for understanding how information flows through a model and how likelihood training can shape its internal geometry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a Markov categorical framework modeling autoregressive language model generation as compositions of information-processing stages. It connects the NLL training objective, representation geometry, and capabilities, offering an information-theoretic rationale for speculative decoding via information surplus in hidden states and formalizing how NLL learns conditional uncertainty through categorical entropy. The central result is conditional: for a linear-softmax head with bounded output features, a calibrated quadratic upper-bound surrogate to NLL induces (after whitening or variance normalization) a generalized CCA/eigenproblem aligning representation directions with predictive prototypes.

Significance. If the result holds under the stated conditions, the framework supplies a compositional, information-theoretic account of how training shapes internal geometry and supports capabilities such as parallel drafting. The explicit use of Markov categories to link objective, geometry, and behavior is a constructive contribution. Credit is due for stating the spectral claim as conditional on the surrogate and normalization steps rather than claiming direct equivalence to standard NLL.

major comments (2)
  1. [Abstract] The main spectral result is presented without derivation steps, error bounds, or verification that the quadratic surrogate calibration constant is independent of the target eigenproblem. This leaves open whether the alignment is a genuine consequence of the modeling choices or partly holds by construction of the surrogate.
  2. [Main spectral result, conditional claim on linear-softmax head] The substitution of the calibrated quadratic upper-bound surrogate for true NLL is load-bearing for the broader claim that likelihood training shapes representation geometry. Without a tightness argument or a demonstration that critical points coincide with those of standard NLL near the data manifold, the generalized CCA/eigenproblem does not directly explain geometry under conventional negative log-likelihood training.
minor comments (2)
  1. [Notation] Explicitly define the calibration constant for the quadratic surrogate and state whether it is fixed independently of the learned representation or allowed to depend on it.
  2. [Speculative decoding rationale] Quantify the information surplus more explicitly with a concrete bound or example to strengthen the rationale for parallel drafting methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments help clarify the presentation of our conditional spectral result and its relationship to standard NLL training. We respond to each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract] The main spectral result is presented without derivation steps, error bounds, or verification that the quadratic surrogate calibration constant is independent of the target eigenproblem. This leaves open whether the alignment is a genuine consequence of the modeling choices or partly holds by construction of the surrogate.

    Authors: We agree that the abstract is highly condensed. In the revision we will add one sentence outlining the main derivation steps and explicitly state that the calibration constant is obtained by matching the surrogate value to NLL at points drawn from the empirical data distribution; this choice is made prior to and independently of solving the subsequent eigenproblem. Error bounds and the full derivation remain in the body of the paper (Section 4), but the abstract will now reference them. revision: yes

  2. Referee: [Main spectral result, conditional claim on linear-softmax head] The substitution of the calibrated quadratic upper-bound surrogate for true NLL is load-bearing for the broader claim that likelihood training shapes representation geometry. Without a tightness argument or a demonstration that critical points coincide with those of standard NLL near the data manifold, the generalized CCA/eigenproblem does not directly explain geometry under conventional negative log-likelihood training.

    Authors: The manuscript already qualifies the result as conditional on the calibrated quadratic surrogate. We will insert a new paragraph in Section 4.3 that supplies a first-order tightness argument: when model predictions are close to the observed conditional distributions (the regime after convergence), the surrogate and NLL differ by a term whose gradient vanishes at the data manifold. We do not claim that every critical point of the surrogate coincides with that of NLL under arbitrary model classes; such a statement would require additional landscape assumptions that lie outside the present scope. The revision will therefore emphasize the surrogate’s role while clarifying the approximation regime in which the geometric alignment remains informative for standard likelihood training. revision: partial

Circularity Check

0 steps flagged

No significant circularity; result is explicitly conditional on surrogate and whitening

Full rationale

The paper's main spectral claim is presented as conditional on a linear-softmax head with bounded features, a calibrated quadratic upper-bound surrogate to NLL, and subsequent whitening or variance normalization, which induces a generalized CCA/eigenproblem. This is not a reduction of the true NLL to a fitted quantity by construction, nor does it rely on self-citation for a uniqueness theorem or smuggle an ansatz. The derivation chain uses the Markov categorical framework to connect training objective, representation geometry, and capabilities, but the spectral alignment is derived under the stated modeling choices without tautological redefinition or renaming of known results as new predictions. The framework remains self-contained against external benchmarks for the conditional case.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework rests on standard properties of Markov categories for composing information channels, the definition of categorical entropy, and the modeling choice of a linear-softmax head; the quadratic surrogate introduces a calibration step whose independence from the final eigenproblem is not shown in the abstract.

free parameters (1)
  • calibration constant for quadratic upper-bound surrogate
    The surrogate is described as calibrated; this parameter is chosen or fitted to make the bound useful and directly affects the induced eigenproblem.
axioms (2)
  • [standard math] Markov categories compose information-processing stages with well-defined conditional distributions.
    Invoked to model single-step generation as a composition of stages.
  • [domain assumption] Categorical entropy formalizes conditional uncertainty in the data distribution.
    Used to interpret what the NLL objective learns beyond the mode.

pith-pipeline@v0.9.0 · 5726 in / 1465 out tokens · 58312 ms · 2026-05-19T02:26:42.148767+00:00 · methodology



