A Markov Categorical Framework for Language Modeling

arXiv: 2507.19247 · v5 · pith:ZK4WREY7 · submitted 2025-07-25 · cs.LG · cs.AI · cs.CL

Yifan Zhang

Pith reviewed 2026-05-19 02:26 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL

keywords Markov categories, language modeling, negative log-likelihood, representation geometry, speculative decoding, categorical entropy, generalized eigenproblem

The pith

For linear-softmax heads, a quadratic surrogate to negative log-likelihood produces a generalized eigenproblem that aligns representations with predictive prototypes after whitening.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Markov categorical framework that treats autoregressive generation as a sequence of information-processing stages. This view links the negative log-likelihood training objective to the geometry of learned representations and to capabilities such as parallel token drafting. It shows that the standard loss captures not only the most probable next token but also the data's conditional uncertainty through categorical entropy. The central result states that, for a linear-softmax head with bounded output features, a calibrated quadratic upper-bound surrogate to the loss, followed by whitening, reduces to a generalized CCA or eigenproblem whose solution aligns representation directions with class prototypes. The framework therefore supplies a compositional account of how likelihood training shapes internal model geometry.

Core claim

For a linear-softmax head with bounded output features, a calibrated quadratic upper-bound surrogate to NLL induces, after whitening or variance normalization, a generalized CCA/eigenproblem aligning representation directions with predictive prototypes.
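One way to make the phrase "quadratic upper-bound surrogate" concrete is the classical Böhning bound: the softmax log-partition function has curvature at most 1/2, so a second-order expansion of NLL around any reference logits `z0` with curvature 1/2 lies above NLL everywhere. The sketch below checks this numerically in plain Python. It is an illustration under that assumption only; the paper's calibrated surrogate, its constant, and its expansion point are not given on this page, and every function name here is hypothetical.

```python
import math
import random


def log_sum_exp(z):
    """Numerically stable log(sum(exp(z)))."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))


def softmax(z):
    lse = log_sum_exp(z)
    return [math.exp(v - lse) for v in z]


def nll(z, y):
    """Negative log-likelihood of class y under softmax(z)."""
    return log_sum_exp(z) - z[y]


def quadratic_surrogate(z, z0, y):
    """Second-order expansion of NLL around z0 with curvature 1/2 (assumed bound)."""
    p0 = softmax(z0)
    # Gradient of NLL with respect to the logits at z0: softmax(z0) - onehot(y).
    grad = [p0[k] - (1.0 if k == y else 0.0) for k in range(len(z0))]
    diff = [a - b for a, b in zip(z, z0)]
    linear = sum(g * d for g, d in zip(grad, diff))
    quad = 0.25 * sum(d * d for d in diff)  # (1/2) * curvature 1/2 * ||z - z0||^2
    return nll(z0, y) + linear + quad


def max_violation(num_trials=2000, num_classes=5, seed=0):
    """Largest observed value of NLL minus surrogate over random logit pairs."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(num_trials):
        z0 = [rng.uniform(-3.0, 3.0) for _ in range(num_classes)]
        z = [rng.uniform(-3.0, 3.0) for _ in range(num_classes)]
        y = rng.randrange(num_classes)
        worst = max(worst, nll(z, y) - quadratic_surrogate(z, z0, y))
    return worst
```

If the bound holds, `max_violation()` should never be positive beyond rounding error, and the surrogate should equal NLL exactly at `z = z0`.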

What carries the argument

Markov category composition of the single-step generation process, which quantifies information surplus about future tokens and reduces the quadratic surrogate plus whitening step to a generalized eigenproblem.
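The "information surplus" idea can be illustrated with a toy joint distribution. The sketch below assumes the surplus is measured as the conditional mutual information I(H; X2 | X1) between a hidden state H and the token after next, given the next token; the paper's exact definition is not reproduced on this page, so treat this as one plausible reading. The two example distributions and all names are hypothetical.

```python
import math
from collections import defaultdict


def conditional_mutual_information(joint):
    """I(H; X2 | X1) in bits for a dict {(h, x1, x2): probability}."""
    p_x1 = defaultdict(float)
    p_h_x1 = defaultdict(float)
    p_x1_x2 = defaultdict(float)
    for (h, x1, x2), p in joint.items():
        p_x1[x1] += p
        p_h_x1[(h, x1)] += p
        p_x1_x2[(x1, x2)] += p
    total = 0.0
    for (h, x1, x2), p in joint.items():
        if p > 0.0:
            # p(h, x2 | x1) / (p(h | x1) * p(x2 | x1)), written with joint marginals.
            ratio = p * p_x1[x1] / (p_h_x1[(h, x1)] * p_x1_x2[(x1, x2)])
            total += p * math.log2(ratio)
    return total


# Hidden state encodes both upcoming tokens: H = (a, b), X1 = a, X2 = b.
joint_rich = {((a, b), a, b): 0.25 for a in (0, 1) for b in (0, 1)}

# Hidden state encodes only the next token: H = a, X1 = a, X2 an independent fair coin.
joint_myopic = {(a, a, b): 0.25 for a in (0, 1) for b in (0, 1)}

surplus_rich = conditional_mutual_information(joint_rich)      # 1.0 bit
surplus_myopic = conditional_mutual_information(joint_myopic)  # 0.0 bits
```

In the first case a drafting head reading H could predict X2 without waiting for another forward pass; in the second case there is nothing extra to exploit.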

If this is right

The framework quantifies the information surplus a hidden state carries about tokens beyond the immediate next one, supplying an information-theoretic rationale for speculative decoding and other parallel drafting methods.
Negative log-likelihood training is shown to learn the data's intrinsic conditional uncertainty in addition to the mode, formalized through categorical entropy.
Likelihood training shapes internal geometry by aligning representation directions with predictive prototypes under the stated conditions on the output head and loss surrogate.
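The second point above, that NLL training recovers conditional uncertainty and not only the mode, can be seen in the standard decomposition of expected NLL into entropy plus KL divergence. The sketch below uses ordinary Shannon entropy as a stand-in; the paper formalizes this through categorical entropy in a Markov category, which this example does not attempt to reproduce. The distributions are made up for illustration.

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def expected_nll(p, q):
    """Expected NLL of model q when the next token is drawn from p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical true next-token distribution for one context.
p_true = [0.7, 0.2, 0.1]

# A model that captures only the mode, and one that matches the full distribution.
q_mode_only = [0.98, 0.01, 0.01]
q_matched = list(p_true)

loss_mode_only = expected_nll(p_true, q_mode_only)
loss_matched = expected_nll(p_true, q_matched)
```

Because expected NLL equals `entropy(p_true) + kl_divergence(p_true, q)`, the loss is minimized only when the model matches the whole conditional distribution, and the minimum value is the entropy of the data itself.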

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models trained with the standard objective should exhibit measurable covariance alignment between whitened hidden states and next-token prototypes when the output head is approximately linear.
The same compositional analysis could be applied to other autoregressive domains such as time-series forecasting to test whether analogous spectral alignments appear.
Relaxing the bounded-feature or linear-head assumption would identify the regimes in which the eigenproblem reduction fails to hold.

Load-bearing premise

The derivation assumes a linear-softmax output head with bounded output features and relies on a calibrated quadratic upper-bound surrogate to the negative log-likelihood loss.

What would settle it

Train a linear-softmax language model, whiten its hidden representations, solve the induced generalized eigenproblem, and check whether the resulting principal directions match the predictive prototypes obtained from the quadratic surrogate.
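The whiten-then-solve step of this check can be sketched on a tiny two-dimensional, two-class example in plain Python. This is a stand-in under strong assumptions: the "predictive prototypes" are taken to be class-mean directions, the operator is a rank-one between-class scatter matrix (an LDA-style choice, not necessarily the paper's), and the covariance, prototype gap, and priors are invented numbers. A real test would use hidden states from a trained model and a library eigensolver.

```python
import math


def sym_eig_2x2(m):
    """Eigenvalues (descending) and unit eigenvectors of a symmetric 2x2 matrix."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    mean, half_diff = (a + d) / 2.0, (a - d) / 2.0
    radius = math.hypot(half_diff, b)
    lam_hi, lam_lo = mean + radius, mean - radius
    # Two algebraically equivalent eigenvector formulas; keep the better-conditioned one.
    cand1, cand2 = (b, lam_hi - a), (lam_hi - d, b)
    v = cand1 if math.hypot(*cand1) >= math.hypot(*cand2) else cand2
    norm = math.hypot(*v)
    v_hi = (v[0] / norm, v[1] / norm) if norm > 0.0 else (1.0, 0.0)
    v_lo = (-v_hi[1], v_hi[0])
    return (lam_hi, lam_lo), (v_hi, v_lo)


def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def matvec(x, v):
    return (x[0][0] * v[0] + x[0][1] * v[1], x[1][0] * v[0] + x[1][1] * v[1])


def inverse_sqrt(cov):
    """Whitening matrix cov^(-1/2) for a symmetric positive-definite 2x2 matrix."""
    (l1, l2), (v1, v2) = sym_eig_2x2(cov)
    return [[v1[i] * v1[j] / math.sqrt(l1) + v2[i] * v2[j] / math.sqrt(l2)
             for j in range(2)] for i in range(2)]


def cosine(u, v):
    return (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))


# Hypothetical inputs: feature covariance B, gap between two class prototypes, class priors.
feature_cov = [[2.0, 0.6], [0.6, 1.0]]
prototype_gap = (1.0, -0.5)
prior0, prior1 = 0.4, 0.6

# Between-class scatter A for two classes (rank one).
between = [[prior0 * prior1 * prototype_gap[i] * prototype_gap[j] for j in range(2)]
           for i in range(2)]

# Whitening turns A v = lambda B v into an ordinary eigenproblem for W A W.
whitener = inverse_sqrt(feature_cov)
whitened_between = matmul(matmul(whitener, between), whitener)
(top_value, _), (top_direction, _) = sym_eig_2x2(whitened_between)

# Alignment between the top whitened direction and the whitened prototype gap.
alignment = abs(cosine(top_direction, matvec(whitener, prototype_gap)))

# Map back and check the generalized eigen-equation A v = lambda B v.
generalized_vec = matvec(whitener, top_direction)
lhs = matvec(between, generalized_vec)
rhs = tuple(top_value * c for c in matvec(feature_cov, generalized_vec))
residual = math.hypot(lhs[0] - rhs[0], lhs[1] - rhs[1])
```

In this rank-one toy case the alignment is exact by construction, so it only demonstrates the mechanics; the informative experiment is whether alignment stays high with many classes and learned representations.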

Original abstract

Autoregressive language models achieve remarkable performance, yet a unified theory explaining their internal mechanisms, how training shapes representations, and why these representations support complex behavior remains incomplete. We introduce an analytical framework that models the single-step generation process as a composition of information-processing stages using the language of Markov categories. This compositional perspective connects three aspects of language modeling that are often studied separately: the training objective, the geometry of the learned representation space, and practical model capabilities. First, our framework gives an information-theoretic rationale for parallel drafting methods such as speculative decoding by quantifying the information surplus a hidden state contains about future tokens beyond the immediate next one. Second, we clarify how the standard negative log-likelihood (NLL) objective learns not only a most likely next token, but also the data's intrinsic conditional uncertainty, formalized through categorical entropy. Our main spectral result is conditional: for a linear-softmax head with bounded output features, a calibrated quadratic upper-bound surrogate to NLL induces, after whitening or variance normalization, a generalized CCA/eigenproblem aligning representation directions with predictive prototypes. This gives a compositional lens for understanding how information flows through a model and how likelihood training can shape its internal geometry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a Markov category framing for single-step LM generation that ties an information surplus argument to speculative decoding and derives a spectral link from a quadratic NLL surrogate to generalized CCA after normalization.

The core new piece is the use of Markov categories to break single-step generation into compositional stages. This yields a clean information-theoretic account of why a hidden state can carry surplus information about tokens beyond the immediate next one, which directly motivates parallel drafting like speculative decoding. The spectral claim then shows that, for a linear-softmax head with bounded features, a calibrated quadratic upper-bound surrogate to NLL, after whitening, produces a generalized CCA eigenproblem whose directions align with predictive prototypes. That connection is not in the prior literature they cite and gives a compositional route from loss to geometry.

Referee Report

2 major / 2 minor

Summary. The paper introduces a Markov categorical framework modeling autoregressive language model generation as compositions of information-processing stages. It connects the NLL training objective, representation geometry, and capabilities, offering an information-theoretic rationale for speculative decoding via information surplus in hidden states and formalizing how NLL learns conditional uncertainty through categorical entropy. The central result is conditional: for a linear-softmax head with bounded output features, a calibrated quadratic upper-bound surrogate to NLL induces (after whitening or variance normalization) a generalized CCA/eigenproblem aligning representation directions with predictive prototypes.

Significance. If the result holds under the stated conditions, the framework supplies a compositional, information-theoretic account of how training shapes internal geometry and supports capabilities such as parallel drafting. The explicit use of Markov categories to link objective, geometry, and behavior is a constructive contribution. Credit is due for stating the spectral claim as conditional on the surrogate and normalization steps rather than claiming direct equivalence to standard NLL.

major comments (2)

[Abstract] Abstract: the main spectral result is presented without derivation steps, error bounds, or verification that the quadratic surrogate calibration constant is independent of the target eigenproblem; this leaves open whether the alignment is a genuine consequence of the modeling choices or partly by construction of the surrogate.
[Main spectral result] Main spectral result (conditional claim on linear-softmax head): the substitution of the calibrated quadratic upper-bound surrogate for true NLL is load-bearing for the broader claim that likelihood training shapes representation geometry; without a tightness argument or demonstration that critical points coincide with those of standard NLL near the data manifold, the generalized CCA/eigenproblem does not directly explain geometry under conventional negative log-likelihood training.

minor comments (2)

[Notation] Notation section: explicitly define the calibration constant for the quadratic surrogate and state whether it is fixed independently of the learned representation or allowed to depend on it.
[Speculative decoding rationale] Discussion of speculative decoding: quantify the information surplus more explicitly with a concrete bound or example to strengthen the rationale for parallel drafting methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments help clarify the presentation of our conditional spectral result and its relationship to standard NLL training. We respond to each major comment below and indicate planned revisions.

Referee: [Abstract] Abstract: the main spectral result is presented without derivation steps, error bounds, or verification that the quadratic surrogate calibration constant is independent of the target eigenproblem; this leaves open whether the alignment is a genuine consequence of the modeling choices or partly by construction of the surrogate.

Authors: We agree that the abstract is highly condensed. In the revision we will add one sentence outlining the main derivation steps and explicitly state that the calibration constant is obtained by matching the surrogate value to NLL at points drawn from the empirical data distribution; this choice is made prior to and independently of solving the subsequent eigenproblem. Error bounds and the full derivation remain in the body of the paper (Section 4), but the abstract will now reference them. revision: yes
Referee: [Main spectral result] Main spectral result (conditional claim on linear-softmax head): the substitution of the calibrated quadratic upper-bound surrogate for true NLL is load-bearing for the broader claim that likelihood training shapes representation geometry; without a tightness argument or demonstration that critical points coincide with those of standard NLL near the data manifold, the generalized CCA/eigenproblem does not directly explain geometry under conventional negative log-likelihood training.

Authors: The manuscript already qualifies the result as conditional on the calibrated quadratic surrogate. We will insert a new paragraph in Section 4.3 that supplies a first-order tightness argument: when model predictions are close to the observed conditional distributions (the regime after convergence), the surrogate and NLL differ by a term whose gradient vanishes at the data manifold. We do not claim that every critical point of the surrogate coincides with that of NLL under arbitrary model classes; such a statement would require additional landscape assumptions that lie outside the present scope. The revision will therefore emphasize the surrogate’s role while clarifying the approximation regime in which the geometric alignment remains informative for standard likelihood training. revision: partial
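The kind of first-order tightness statement described in the second response can be written down for a generic quadratic majorizer. The block below is a sketch under assumed notation (per-example loss on logits, expansion point z_0, curvature constant c at least 1/2); it concerns the expansion point in logit space and is not the paper's Section 4.3 argument, which is not shown on this page.

```latex
% Assumed notation: \ell_y(z) = \log\sum_k e^{z_k} - z_y, expansion point z_0, constant c \ge 1/2.
Q_y(z; z_0) = \ell_y(z_0) + \nabla \ell_y(z_0)^{\top}(z - z_0) + \frac{c}{2}\,\lVert z - z_0 \rVert^2,
\qquad
\Delta(z) = Q_y(z; z_0) - \ell_y(z).

% The gap and its gradient both vanish at the expansion point, and the gap is at most quadratic.
\Delta(z_0) = 0, \qquad \nabla \Delta(z_0) = 0, \qquad
0 \le \Delta(z) \le \frac{c}{2}\,\lVert z - z_0 \rVert^2 .
```

The lower bound uses the curvature bound on the softmax log-partition function, and the upper bound uses convexity of the loss in the logits. Neither says anything about critical points away from z_0, which matches the limited scope the response claims.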

Circularity Check

0 steps flagged

No significant circularity; result is explicitly conditional on surrogate and whitening

The paper's main spectral claim is presented as conditional on a linear-softmax head with bounded features, a calibrated quadratic upper-bound surrogate to NLL, and subsequent whitening or variance normalization, which induces a generalized CCA/eigenproblem. This is not a reduction of the true NLL to a fitted quantity by construction, nor does it rely on self-citation for a uniqueness theorem or smuggle an ansatz. The derivation chain uses the Markov categorical framework to connect training objective, representation geometry, and capabilities, but the spectral alignment is derived under the stated modeling choices without tautological redefinition or renaming of known results as new predictions. The framework remains self-contained against external benchmarks for the conditional case.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework rests on standard properties of Markov categories for composing information channels, the definition of categorical entropy, and the modeling choice of a linear-softmax head; the quadratic surrogate introduces a calibration step whose independence from the final eigenproblem is not shown in the abstract.

free parameters (1)

calibration constant for quadratic upper-bound surrogate
The surrogate is described as calibrated; this parameter is chosen or fitted to make the bound useful and directly affects the induced eigenproblem.

axioms (2)

[standard math] Markov categories compose information-processing stages with well-defined conditional distributions
Invoked to model single-step generation as a composition of stages.
[domain assumption] Categorical entropy formalizes conditional uncertainty in the data distribution
Used to interpret what the NLL objective learns beyond the mode.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: for a linear-softmax head with bounded output features, a calibrated quadratic upper-bound surrogate to NLL induces, after whitening or variance normalization, a generalized CCA/eigenproblem

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · tag: unclear
Paper passage: NLL as Implicit Spectral Contrastive Learning

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

