
arxiv: 2604.28147 · v1 · submitted 2026-04-30 · 💻 cs.CL

On the Proper Treatment of Units in Surprisal Theory

Pith reviewed 2026-05-07 04:54 UTC · model grok-4.3

classification 💻 cs.CL
keywords surprisal theory · tokenization · linguistic units · language models · psycholinguistics · cognitive modeling · predictability · information theory

The pith

Surprisal analyses must explicitly separate linguistic unit definition from model token choices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Surprisal theory links human processing effort to the predictability of upcoming linguistic units such as words. Yet models assign probabilities to tokens that usually do not match these units, so analyses rely on ad hoc alignments that blend separate decisions about what the unit is and which parts of the prediction to use. This paper supplies a unified framework to handle surprisal for any chosen units without those alignments dictating the outcome. Making tokenization a technical detail instead of a foundational assumption allows clearer tests of the theory. The result is that studies of predictability and effort can rest on linguistically motivated units rather than whatever token scheme a model happens to use.

Core claim

The central discovery is that surprisal-based predictors depend implicitly on ad hoc procedures that conflate the definition of the unit of analysis and the choice of regions of interest, owing to the mismatch between linguistically motivated units and model tokens. The authors disentangle these choices and introduce a unified framework for computing surprisal over arbitrary unit inventories. They conclude that tokenization should be treated as an implementation detail rather than a scientific primitive in surprisal theory.
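To make the mismatch concrete, here is a minimal sketch of the kind of ad hoc alignment the claim refers to, not of the paper's framework. It groups GPT-2-style tokens into whitespace-delimited words and sums their surprisals. The token probabilities are invented for illustration, and the helper names (`group_by_whitespace`, `word_surprisals`) are hypothetical.

```python
import math

# Hypothetical conditional probabilities p(token | preceding tokens) for the
# string "Tokens don't equal words." under a GPT-2-style split.
# The numbers are invented purely for illustration.
token_probs = [
    ("Tokens", 0.01),
    (" don", 0.20),
    ("'t", 0.90),
    (" equal", 0.05),
    (" words", 0.10),
    (".", 0.50),
]


def surprisal(p):
    """Surprisal in nats: -log p."""
    return -math.log(p)


def group_by_whitespace(tokens):
    """Ad hoc alignment: a token with a leading space opens a new word."""
    groups = []
    for text, p in tokens:
        if text.startswith(" ") or not groups:
            groups.append([(text, p)])
        else:
            groups[-1].append((text, p))
    return groups


def word_surprisals(tokens):
    """Sum token surprisals within each whitespace-delimited group."""
    result = []
    for group in group_by_whitespace(tokens):
        word = "".join(text for text, _ in group).strip()
        result.append((word, sum(surprisal(p) for _, p in group)))
    return result


for word, s in word_surprisals(token_probs):
    print(f"{word!r}: {s:.3f} nats")
```

Note that the grouping rule silently attaches the final period to `words.` and merges `don` with `'t`. Both are unit-definition decisions made by the alignment code rather than stated by the researcher, which is the conflation the paper objects to.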

What carries the argument

The argument rests on a unified framework for reasoning about surprisal over arbitrary unit inventories. The framework separates the definition of the unit of analysis from the selection of regions of interest for evaluation.

If this is right

  • Surprisal can be computed directly for any linguistic unit chosen by the researcher, without depending on a particular tokenization.
  • Comparisons between different language models become possible on equal footing for the same units.
  • The validity of surprisal as a predictor of processing effort is no longer tied to arbitrary alignment procedures.
  • Re-analyses of prior experiments can use consistent units across tokenizers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This framework could be used to standardize surprisal calculations in psycholinguistic experiments.
  • It opens the door to models that directly predict over linguistic units instead of subwords.
  • Analogous disentangling of measurement choices may apply to other information-theoretic measures in language science.

Load-bearing premise

Ad hoc procedures for aligning model tokens with linguistic units currently mix distinct modeling choices, and this mixing affects the validity of surprisal-based predictors.

What would settle it

Recompute surprisal values for a publicly available corpus of human reading times or eye-tracking data using both the proposed unified framework and standard ad hoc token-to-word alignments, then test whether the framework version shows a meaningfully stronger or weaker correlation with the human measures.
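The comparison described above could be prototyped roughly as follows. This is a sketch under stated assumptions: the reading times and both surprisal vectors are hypothetical toy values, `pearson` is a hand-written helper, and plain correlation is a simpler stand-in for the cross-validated ∆llh analysis that Figure 5 reports.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: one value per word in a short passage.
reading_times_ms = [210.0, 250.0, 190.0, 300.0, 230.0]
surprisal_framework = [2.1, 3.0, 1.8, 4.2, 2.6]  # unit-level surprisal (assumed)
surprisal_adhoc = [2.4, 2.7, 2.0, 3.5, 3.1]      # summed token surprisal (assumed)

r_framework = pearson(surprisal_framework, reading_times_ms)
r_adhoc = pearson(surprisal_adhoc, reading_times_ms)
print(f"framework r = {r_framework:.3f}, ad hoc r = {r_adhoc:.3f}")
```

A real test would replace the toy vectors with per-word measures from an eye-tracking corpus and would control for baseline predictors such as word length and frequency before comparing the two surprisal variants.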

Figures

Figures reproduced from arXiv: 2604.28147 by Ryan Cotterell, Samuel Kiegeland, Tim Vieira, Vésteinn Snæbjarnarson.

Figure 1: The string “Tokens don’t equal words.” at three levels: two alphabets of symbols Σ, two unit inventories U, and regions of interest (ROIs) derived from the sentence’s constituency parse: NP, VP (with a nested inner VP shown dashed), and punctuation. The contraction don’t is split three ways: GPT-2 yields don | ’t, Penn Treebank (PTB) yields DO | N’T, and the acontextual inventory keeps it as one unit DON’T.…
Figure 2: The unit parser ρ: Σ* → U* passes through ∆*: the transducer f maps symbol strings to SEP-annotated strings, and h⁻¹ splits on SEP and maps each segment to the unit it spells, recovering the unit string. Define f ≝ h ∘ ρ: Σ* → ∆*. Note that f maps between two finite alphabets, Σ and ∆, even when the unit inventory U is infinite, because U ⊆ Ξ*. If, in addition, f is rational, Snæbjarnarson e…
Figure 4: A rule from f_ptb showing contextual segmentation: a comma or colon is split off as its own unit (surrounded by SEPs) only when the following symbol is not a digit, e.g. “end, he” is split into three units, while “1,000” remains one. Adapted from Snæbjarnarson et al. (2026). … as a finite transducer and then compose them left to right to obtain f_ptb; see §C.3 for additional details. In contrast to the acontextual…
Figure 5: Per-observation ∆llh (×10⁻³ nats) for each unit inventory across reading-time measures (FF: first fixation, GD: gaze duration, TRT: total reading time). Points and whiskers show the mean and 95% trial-level bootstrap CI from leave-one-out cross-validation by trial. Significance is assessed via a paired permutation test (* p < 0.05; ** p < 0.01). Filled markers denote significant effects. Note that y-axis s…
Figure 6: A finite transducer for mapping a token-level LM to characters, illustrated with paths for …
Figure 7: Units and fixations for model tokens (BPE, leading delimiter). Reader 3, Text 1 (MECO English).
Figure 8: Units and fixations for acontextual (leading) words. Reader 3, Text 1 (MECO English). Looking at the whitespace-stripped differences: the 604 GPT-2 occurrences absent from the acontextual inventory are mainly punctuation marks that BPE splits off (126 commas, 96 periods) and BPE subword fragments (e.g., Jan, us from Janus); conversely, the 325 acontextual occurrences absent from GPT-2 are words that BPE spl…
Figure 9: Units and fixations for acontextual (trailing) words. Reader 3, Text 1 (MECO English).
Figure 10: Units and fixations for contextual words. Reader 3, Text 1 (MECO English).
Figure 11: Units and fixations for character-level units (leading delimiter). Reader 3, Text 1 (MECO English).
Figure 12: Approximate F-statistics for fixed-effect smooth terms from the full-data GAMM fit, grouped by predictor type.
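The SEP-annotation idea in the captions of Figures 2 and 4 can be sketched in a few lines. This is a simplified illustration, not the paper's transducer. It assumes whitespace is dropped and treated as a boundary, uses an arbitrary control character for SEP, and implements only the comma/colon rule described for Figure 4. The names `annotate`, `split_units`, and `parse_units` are hypothetical.

```python
SEP = "\x1f"  # hypothetical stand-in for the abstract SEP symbol


def annotate(text):
    """Map a symbol string to a SEP-annotated string (the role played by f).

    Simplifying assumptions: whitespace becomes SEP, and a comma or colon is
    surrounded by SEPs only when the following symbol is not a digit.
    """
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch.isspace():
            out.append(SEP)
        elif ch in ",:" and not nxt.isdigit():
            out.extend([SEP, ch, SEP])
        else:
            out.append(ch)
    return "".join(out)


def split_units(annotated):
    """Split on SEP and drop empty segments (the role played by h inverse)."""
    return [segment for segment in annotated.split(SEP) if segment]


def parse_units(text):
    """Compose the two steps to recover the unit string from raw symbols."""
    return split_units(annotate(text))


print(parse_units("end, he"))  # ['end', ',', 'he']
print(parse_units("1,000"))    # ['1,000']
```

The rule is contextual because the decision for a comma depends on the symbol after it, which is why an acontextual inventory cannot express it.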
Original abstract

Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental stimuli are segmented into linguistically motivated units (e.g., words), while pretrained language models assign probability mass to a fixed token alphabet that typically does not align with those units. As a result, surprisal-based predictors depend implicitly on ad hoc procedures that conflate two distinct modeling choices: the definition of the unit of analysis and the choice of regions of interest over which predictions are evaluated. In this paper, we disentangle these choices and give a unified framework for reasoning about surprisal over arbitrary unit inventories. We argue that surprisal-based analyses should make these choices explicit and treat tokenization as an implementation detail rather than a scientific primitive.
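For reference, the quantities the abstract talks about can be written out as follows. The first line is the standard definition of surprisal for a unit given its preceding units. The second is the chain-rule identity that underlies summing token surprisals. The symbols p_U, u_t, x_i, and c are notation chosen here for illustration and may differ from the paper's.

```latex
% Surprisal of the t-th unit under a unit-level model p_U
s(u_t) = -\log p_U(u_t \mid u_1, \dots, u_{t-1})

% Chain rule: summing the surprisals of tokens x_1, ..., x_k in context c
% gives the surprisal of that token sequence as a whole
\sum_{i=1}^{k} -\log p(x_i \mid c, x_1, \dots, x_{i-1})
  = -\log p(x_1, \dots, x_k \mid c)
```

The right-hand side of the second identity coincides with the unit surprisal only under further assumptions, for example that this token sequence is the only one spelling the unit and that the unit's boundary is accounted for. An alignment procedure makes such assumptions implicitly.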

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper identifies an underspecification in empirical applications of surprisal theory: experimental stimuli are segmented into linguistically motivated units (e.g., words), while pretrained LMs assign probabilities over a fixed token vocabulary that does not align with those units. The authors argue that current ad-hoc alignment procedures conflate two distinct choices—the definition of the unit of analysis and the selection of regions of interest over which surprisal is evaluated—and propose a unified framework that separates these choices, allowing surprisal to be defined over arbitrary unit inventories while treating tokenization strictly as an implementation detail rather than a scientific primitive.

Significance. If the framework is adopted, it would provide a clearer methodological foundation for surprisal-based predictors of human processing difficulty, improving transparency, reproducibility, and comparability across studies that use different tokenizers or unit definitions. The contribution is primarily conceptual: it formalizes distinctions without introducing free parameters, circular reductions, or ungrounded assumptions about probability distributions, and it supplies a coherent vocabulary for reasoning about units that does not depend on new empirical claims.

minor comments (3)
  1. [§2.2] The running example that contrasts word-level versus subword-level surprisal would be strengthened by an explicit numerical illustration showing how the same sequence yields different predictor values under the two regimes before and after the framework is applied.
  2. [§4] The discussion of implications for existing corpora and datasets is brief; adding a short table that maps common tokenizers (BPE, WordPiece, etc.) onto the framework’s parameters would help readers apply the proposal immediately.
  3. Notation: The symbols U (unit inventory) and R (region of interest) are introduced clearly, but a single consolidated table defining all symbols and their domains would reduce the need to cross-reference definitions across sections.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our manuscript and for recommending minor revision. The referee's description accurately reflects the paper's focus on disentangling the definition of linguistic units from tokenization choices in surprisal calculations. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper advances a methodological clarification for surprisal analyses by disentangling the definition of linguistic units from the choice of evaluation regions, treating tokenization as an implementation detail. No derivation chain is presented that reduces predictions or first-principles results to fitted inputs by construction, nor does the argument rely on self-citations for load-bearing uniqueness claims or ansatzes. The unified framework is introduced conceptually without equations that equate outputs to inputs tautologically, and the central recommendation for explicitness stands independently of any empirical fits or prior author work invoked as external authority. This is a self-contained proposal for better practice rather than a predictive model whose validity collapses into its own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper builds on established concepts in surprisal theory and NLP without introducing new free parameters or invented entities in the abstract.

axioms (2)
  • domain assumption: Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit.
    Core premise of the field referenced in the abstract.
  • domain assumption: Experimental stimuli are segmented into linguistically motivated units, while pretrained language models use a fixed token alphabet that typically does not align with those units.
    Stated as the practical mismatch causing ad hoc procedures.

pith-pipeline@v0.9.0 · 11300 in / 1178 out tokens · 161546 ms · 2026-05-07T04:54:12.927610+00:00 · methodology

