Probability Distributions Computed by Autoregressive Transformers

Andy Yang; Anej Svete; Anthony Widjaja Lin; David Chiang; Jiaoda Li; Jonathan Rawski; Ryan Cotterell

arxiv 2510.27118 v4 pith:NNEIJL4P submitted 2025-10-31 cs.CL

Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski, Ryan Cotterell, David Chiang

classification cs.CL

keywords language · models · transformers · autoregressive · case · distributions · expressivity · making

Most expressivity results for transformers treat them as language recognizers -- devices that accept or reject strings -- rather than as they are used in practice: as language models that generate strings autoregressively and probabilistically. We characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing in their most common use case as language models.
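
To make the abstract's contrast concrete, here is a minimal sketch of how an autoregressive language model assigns a probability to a whole string: it multiplies the next-symbol probabilities along the string and then the probability of stopping. The names `EOS`, `next_symbol_probs`, and `string_probability` are hypothetical, and the fixed toy distribution stands in for a transformer's output; none of this is taken from the paper.

```python
EOS = "<eos>"  # hypothetical end-of-string symbol (an assumption, not from the paper)


def next_symbol_probs(prefix):
    """Toy stand-in for a model's next-symbol distribution given a prefix.

    Assumption for illustration: the distribution ignores the prefix and
    always returns 'a' with 0.5, 'b' with 0.25, and EOS with 0.25.
    """
    return {"a": 0.5, "b": 0.25, EOS: 0.25}


def string_probability(string):
    """Chain rule: p(w) = prod_t p(w_t | w_<t) * p(EOS | w)."""
    prob = 1.0
    for i, symbol in enumerate(string):
        # Probability of the i-th symbol given everything before it.
        prob *= next_symbol_probs(string[:i])[symbol]
    # The string only counts as generated once the model emits EOS.
    return prob * next_symbol_probs(string)[EOS]


print(string_probability("ab"))  # 0.5 * 0.25 * 0.25
print(string_probability(""))    # probability of stopping immediately
```

A recognizer, by contrast, would map each string to a single accept or reject decision; the sketch above instead yields a number for every string, and under this toy distribution those numbers sum to 1 over all strings.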

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Surprisal Theory is Tautological (without Rational Grounding)
cs.CL 2026-07 conditional novelty 6.0

Unconstrained surprisal theory is a tautology: for any non-negative difficulty measure, a language model exists whose surprisal matches it affinely.