Why are language models less surprised than humans? Testing the Parse Multiplicity Mismatch Hypothesis
Pith reviewed 2026-05-19 14:45 UTC · model grok-4.3
The pith
Reducing the number of simultaneous parses in language models increases predicted garden path effects but not enough to match human reading difficulties.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The parse multiplicity mismatch hypothesis posits that language models are less surprised than humans by garden path sentences because they maintain a larger number of simultaneous active parses. Using word-synchronous beam search in RNNGs, the authors vary the beam size to control the number of parses and compute surprisals. They show that smaller beams increase the magnitude of predicted garden path effects, but these increases are insufficient to account for the full size of the effects observed in human reading time data.
What carries the argument
Word-synchronous beam search in Recurrent Neural Network Grammars (RNNGs), which limits the number of simultaneous parses used to compute next-word surprisal.
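To make the role of the beam concrete, here is a minimal toy sketch (not the authors' RNNG implementation) of how restricting the set of active parses can raise next-word surprisal. It assumes each parse carries a prior probability before the next word and a likelihood for that word; the function name `beam_surprisal` and all numbers are hypothetical, chosen only for illustration.

```python
import math

def beam_surprisal(parses, beam_width):
    """Surprisal (in bits) of the next word using only the top-k parses.

    `parses` is a list of (prior, likelihood) pairs: the probability of a
    parse before the next word, and the probability that parse assigns to
    the next word. This pairing is an illustrative assumption.
    """
    # Keep only the k most probable analyses before the next word arrives.
    beam = sorted(parses, key=lambda p: p[0], reverse=True)[:beam_width]
    mass_before = sum(prior for prior, _ in beam)
    mass_after = sum(prior * lik for prior, lik in beam)
    # Surprisal is the negative log ratio of beam mass after vs. before the word.
    return -math.log2(mass_after / mass_before)

# Hypothetical garden-path setup: the preferred parse (prior 0.9) almost rules
# out the disambiguating word, the dispreferred parse (prior 0.1) expects it.
toy_parses = [(0.9, 0.001), (0.1, 0.2)]
wide = beam_surprisal(toy_parses, beam_width=2)    # both parses contribute
narrow = beam_surprisal(toy_parses, beam_width=1)  # only the preferred parse
print(f"beam=2: {wide:.2f} bits, beam=1: {narrow:.2f} bits")
```

In this toy case the narrow beam has already discarded the parse that would have predicted the disambiguating word, so surprisal rises. A real word-synchronous search prunes over parser action sequences rather than over two fixed parses.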
If this is right
- Smaller numbers of active parses lead to larger predicted processing difficulty at points of syntactic disambiguation.
- Current LM surprisal measures, even with reduced parse multiplicity, still underpredict human garden path effects.
- Other differences between human and model parsing mechanisms must be responsible for the remaining mismatch in surprise magnitudes.
- Surprisal from models with constrained parse sets remains a partial but incomplete account of human sentence processing difficulty.
Where Pith is reading between the lines
- Future work could test whether incorporating human-like memory limitations or reanalysis costs into models would better match human data.
- Similar experiments with other types of ambiguity or in different languages might reveal if parse multiplicity plays a larger role elsewhere.
- If models cannot be made to match humans by limiting parses, researchers may need to focus on how humans integrate information across parses rather than how many they keep active.
Load-bearing premise
The premise is that controlling the beam size in word-synchronous search with RNNGs accurately represents the number of distinct interpretations a human sentence parser can maintain in parallel.
What would settle it
An experiment that directly measures or estimates how many sentence interpretations humans maintain during parsing of garden path sentences and compares that number to the beam size required to match human effect magnitudes.
Original abstract
Surprisal theory posits that the processing difficulty of a word is determined by its predictability in context, offering a potential link between human sentence processing and next-word predictions from language models. While language model (LM) surprisals successfully predict reading times in naturalistic text, they systematically underpredict the magnitude of difficulty observed in controlled studies of syntactic ambiguity, particularly in garden path sentences. This mismatch might arise from differences in the computational constraints between humans and LMs. Here we test one such hypothesis, specifically, that LMs may be able to simultaneously consider a greater number of distinct sentence interpretations at once, compared to humans. Using Recurrent Neural Network Grammars (RNNGs) with word-synchronous beam search, we systematically vary the number of simultaneous parses used to compute word surprisal, and then use these surprisals to predict human reading times. Reducing the number of simultaneous active parses indeed increases the magnitude of predicted garden path effects, but not nearly enough to capture the full magnitude of the effects in humans. This suggests that differences in the number of simultaneous parses available to LMs and humans cannot reconcile LM-based surprisal with human sentence processing.
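For reference, the standard definition of surprisal and its relation to parse probabilities can be written as follows. The notation ($w_t$ for the word at position $t$, $T$ for a partial parse compatible with the prefix) is an assumption for illustration and is not taken from the paper.

```latex
S(w_t) \;=\; -\log P(w_t \mid w_{1:t-1})
       \;=\; -\log \frac{\sum_{T} P(T, w_{1:t})}{\sum_{T'} P(T', w_{1:t-1})}
```

Under this reading, restricting the sums to a small set of parses changes which analyses contribute probability mass to the next word.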
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript tests whether the underprediction of garden-path effects by language-model surprisals relative to human reading times can be explained by a difference in parse multiplicity: LMs may maintain more simultaneous syntactic analyses than humans. The authors use Recurrent Neural Network Grammars (RNNGs) together with word-synchronous beam search, systematically vary beam width to control the number of active parses, recompute word surprisals, and regress these surprisals against human reading-time data from controlled garden-path experiments. They report that smaller beam sizes increase the magnitude of predicted garden-path effects, yet the increase remains substantially smaller than the effects measured in humans.
Significance. If the beam-width manipulation validly isolates the number of simultaneous parses, the result indicates that parse-multiplicity differences alone cannot reconcile LM surprisal with human processing difficulty and therefore directs attention to other computational distinctions (e.g., integration mechanisms or resource allocation). The work supplies a direct, controllable test of a specific hypothesis about parsing constraints and demonstrates that RNNG surprisals can be modulated in the predicted direction, which is a methodological strength.
major comments (1)
- [Methods] Methods section on word-synchronous beam search: the central claim that varying beam width cleanly manipulates the number of distinct active parses (and thereby tests the multiplicity hypothesis) rests on an unverified assumption. Beam pruning is probability-driven and can discard low-probability but structurally distinct continuations before they affect surprisal; simultaneously, narrower beams degrade next-word prediction quality on unambiguous material. No beam-diversity metrics or surprisal calibration checks on unambiguous control sentences at matched beam sizes are reported, leaving open the possibility that the observed increase in garden-path magnitude is partly an artifact of poorer global model calibration rather than a pure multiplicity effect.
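One form the requested calibration check could take is perplexity on unambiguous control items at each beam width. The sketch below is an assumption about how such a check might look, not a description of the authors' analysis; the surprisal values and beam widths are hypothetical.

```python
def perplexity(surprisals_bits):
    """Perplexity from per-word surprisals given in bits: 2 ** mean surprisal."""
    mean_surprisal = sum(surprisals_bits) / len(surprisals_bits)
    return 2.0 ** mean_surprisal

# Hypothetical per-word surprisals on unambiguous control sentences,
# keyed by beam width.
control_surprisals = {
    100: [4.0, 6.0, 5.0],
    1: [6.0, 8.0, 7.0],
}
ppl_by_beam = {k: perplexity(v) for k, v in control_surprisals.items()}
print(ppl_by_beam)
```

If perplexity on controls rises sharply at the narrow beams that also produce larger garden-path effects, the two explanations (reduced multiplicity versus degraded prediction) would be hard to separate.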
minor comments (2)
- [Results] Results: specify exactly how the magnitude comparison between model-predicted and human garden-path effects is quantified (e.g., ratio of regression coefficients, Cohen’s d, or raw millisecond differences).
- [Figures] Figure captions: ensure that error bars or confidence intervals are described and that the number of items per condition is stated.
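Two of the quantifications mentioned in the first minor comment, raw differences and Cohen's d, can be sketched as follows. The reading-time values are hypothetical, and the model values are assumed to have already been converted to predicted milliseconds by a regression; none of this is taken from the paper.

```python
import math
import statistics

def raw_effect(ambiguous, unambiguous):
    """Garden-path effect as a raw difference of condition means."""
    return statistics.mean(ambiguous) - statistics.mean(unambiguous)

def cohens_d(ambiguous, unambiguous):
    """Standardized effect using the pooled sample standard deviation."""
    n_a, n_u = len(ambiguous), len(unambiguous)
    var_a = statistics.variance(ambiguous)
    var_u = statistics.variance(unambiguous)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_u - 1) * var_u) / (n_a + n_u - 2))
    return raw_effect(ambiguous, unambiguous) / pooled_sd

# Hypothetical values in milliseconds at the disambiguating word.
human_ambig, human_unambig = [420, 460, 440], [380, 400, 390]
model_ambig, model_unambig = [395, 405, 400], [388, 392, 390]

human_effect = raw_effect(human_ambig, human_unambig)
model_effect = raw_effect(model_ambig, model_unambig)
coverage = model_effect / human_effect  # share of the human effect predicted
human_d = cohens_d(human_ambig, human_unambig)
print(human_effect, model_effect, coverage, round(human_d, 3))
```

The choice matters: a ratio of raw effects and a comparison of standardized effects can give different impressions of how far the model falls short.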
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the major comment below and indicate planned revisions to strengthen the methodological claims.
Referee: Methods section on word-synchronous beam search: the central claim that varying beam width cleanly manipulates the number of distinct active parses (and thereby tests the multiplicity hypothesis) rests on an unverified assumption. Beam pruning is probability-driven and can discard low-probability but structurally distinct continuations before they affect surprisal; simultaneously, narrower beams degrade next-word prediction quality on unambiguous material. No beam-diversity metrics or surprisal calibration checks on unambiguous control sentences at matched beam sizes are reported, leaving open the possibility that the observed increase in garden-path magnitude is partly an artifact of poorer global model calibration rather than a pure multiplicity effect.
Authors: We agree that explicit verification of beam composition would strengthen the interpretation. Word-synchronous beam search in RNNGs maintains the k highest-probability partial derivations at each word boundary, so beam width directly limits the number of active syntactic analyses used for marginalizing next-word probability. While probability-driven pruning can in principle drop low-probability but distinct structures, this is precisely the mechanism that reduces parse multiplicity, the quantity our hypothesis targets. To address calibration concerns, we will add (i) beam-diversity statistics (unique parse trees and their structural entropy) across beam widths and (ii) surprisal calibration plots and perplexity on unambiguous control items at the same beam sizes used in the garden-path regressions. These supplementary analyses will be included in the revised manuscript to demonstrate that the increase in garden-path magnitude is not solely an artifact of degraded global prediction quality.
revision: partial
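The beam-diversity statistics promised in the rebuttal could be computed along the following lines. This is a sketch under stated assumptions: the beam is represented as (bracketed tree string, probability) pairs, and "structural entropy" is read here as the Shannon entropy of the normalized probability mass over distinct trees. The rebuttal does not define the term, so this reading and the function name `beam_diversity` are hypothetical.

```python
import math
from collections import defaultdict

def beam_diversity(beam):
    """Return (number of distinct trees, entropy in bits) for a beam.

    `beam` is a list of (tree_string, probability) pairs; items with the
    same tree string are pooled before the entropy is computed.
    """
    mass = defaultdict(float)
    for tree, prob in beam:
        mass[tree] += prob
    total = sum(mass.values())
    entropy = -sum(
        (m / total) * math.log2(m / total) for m in mass.values() if m > 0
    )
    return len(mass), entropy

# Hypothetical beam with three items but only two distinct structures.
toy_beam = [
    ("(S (NP) (VP))", 0.5),
    ("(S (NP (RRC)) (VP))", 0.25),
    ("(S (NP) (VP))", 0.25),
]
n_unique, entropy_bits = beam_diversity(toy_beam)
print(n_unique, round(entropy_bits, 3))
```

A beam of width k whose items mostly share one structure would show low entropy, which is the case the referee's concern targets: nominal beam width would then overstate the number of distinct active parses.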
Circularity Check
No circularity: empirical manipulation and external comparison
The paper performs an empirical test by using word-synchronous beam search in RNNGs to vary the number of active parses, deriving surprisal values from the resulting distributions, and then comparing those values against independent human reading-time measurements in garden-path sentences. No derivation step equates a model output to its input by construction, renames a known result, or relies on a self-citation chain for a uniqueness claim; the central finding (reduced multiplicity increases effect size but remains insufficient) is obtained by direct measurement against external data and is therefore falsifiable without internal reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- beam width / number of simultaneous parses
axioms (1)
- domain assumption: surprisal theory (processing difficulty is determined by word predictability in context)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Using Recurrent Neural Network Grammars (RNNGs) with word-synchronous beam search, we systematically vary the number of simultaneous parses used to compute word surprisal... Reducing the number of simultaneous active parses indeed increases the magnitude of predicted garden path effects, but not nearly enough..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.