Recognition: unknown
An Existence Proof for Neural Language Models That Can Explain Garden-Path Effects via Surprisal
Pith reviewed 2026-05-10 05:32 UTC · model grok-4.3
The pith
Fine-tuning neural language models on garden-path sentences lets surprisal explain human reading slowdowns on held-out cases and naturalistic text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-tuning neural language models on garden-path sentences yields surprisal estimates that capture human reading slowdowns on held-out garden-path items. These models further improve predictive power for human reading times on naturalistic corpora and preserve their general language modeling capabilities. The findings constitute an existence proof for neural LMs capable of explaining both garden-path effects and typical processing difficulty via surprisal.
What carries the argument
Fine-tuning pre-trained neural language models on garden-path sentences to adjust next-word probabilities and thus their surprisal values.
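As a minimal illustration of the quantity being adjusted, the sketch below computes per-token surprisal from a causal LM. It assumes a Hugging Face-style GPT-2 interface; this is not the paper's own pipeline, which must additionally handle subword-to-word aggregation and tokenizer whitespace conventions.

```python
# Minimal sketch: per-token surprisal from a causal LM (assumed GPT-2 via
# Hugging Face transformers; not the paper's exact pipeline).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def token_surprisals(sentence: str):
    """Return (token, surprisal-in-bits) pairs for every token after the first."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                      # (1, seq_len, vocab)
    # Log-probability of each actual next token given its prefix.
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    next_ids = ids[0, 1:]
    nll = -log_probs[torch.arange(next_ids.size(0)), next_ids]   # nats
    surprisal_bits = nll / torch.log(torch.tensor(2.0))
    tokens = tokenizer.convert_ids_to_tokens(next_ids.tolist())
    return list(zip(tokens, surprisal_bits.tolist()))

# Garden-path example: surprisal should spike at the disambiguating word "fell".
for tok, s in token_surprisals("The horse raced past the barn fell."):
    print(f"{tok:>12s}  {s:6.2f} bits")
```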
If this is right
- Surprisal from appropriately tuned neural LMs can account for garden-path effects without invoking separate mechanisms.
- The improvement generalizes to sentences not seen during fine-tuning.
- Accuracy in predicting reading times on everyday text increases after this tuning.
- General language modeling performance does not degrade as a result of the fine-tuning.
Where Pith is reading between the lines
- Human sentence prediction may resemble the output of models exposed to a mix of ambiguous and unambiguous structures during learning.
- Discrepancies between off-the-shelf LMs and human data may stem more from training distributions than from model architecture.
- Testing surprisal theory may require evaluating models trained to match human prediction behavior rather than raw pre-trained LMs.
- Similar fine-tuning approaches could be applied to model other psycholinguistic effects beyond garden paths.
Load-bearing premise
Fine-tuning changes the model's word predictions in a way that better reflects how humans anticipate words rather than causing it to memorize the training sentences without generalizing.
What would settle it
The fine-tuned models failing to predict human reading slowdowns on a fresh set of garden-path sentences or showing reduced performance on standard language modeling evaluations.
Original abstract
Surprisal theory hypothesizes that the difficulty of human sentence processing increases linearly with surprisal, the negative log-probability of a word given its context. Computational psycholinguistics has tested this hypothesis using language models (LMs) as proxies for human prediction. While surprisal derived from recent neural LMs generally captures human processing difficulty on naturalistic corpora that predominantly consist of simple sentences, it severely underestimates processing difficulty on sentences that require syntactic disambiguation (garden-path effects). This leads to the claim that the processing difficulty of such sentences cannot be reduced to surprisal, although it remains possible that neural LMs simply differ from humans in next-word prediction. In this paper, we investigate whether it is truly impossible to construct a neural LM that can explain garden-path effects via surprisal. Specifically, instead of evaluating off-the-shelf neural LMs, we fine-tune these LMs on garden-path sentences so as to better align surprisal-based reading-time estimates with actual human reading times. Our results show that fine-tuned LMs do not overfit and successfully capture human reading slowdowns on held-out garden-path items; they even improve predictive power for human reading times on naturalistic corpora and preserve their general LM capabilities. These results provide an existence proof for a neural LM that can explain both garden-path effects and naturalistic reading times via surprisal, but also raise a theoretical question: what kind of evidence can truly falsify surprisal theory?
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to deliver an existence proof that neural language models can explain garden-path effects via surprisal. By fine-tuning pre-trained LMs on garden-path sentences specifically to align surprisal-derived reading-time predictions with human data, the resulting models capture human slowdowns on held-out garden-path items, improve regression fits to reading times on naturalistic corpora, and retain general language-modeling performance without overfitting.
Significance. If the fine-tuning procedure produces a predictive distribution that genuinely approximates human next-word probabilities rather than rote pattern matching, the result would show that garden-path difficulty is compatible with surprisal theory once the LM is brought into closer alignment with human expectations. The additional claims of improved naturalistic coverage and preserved LM capabilities would indicate that the adjustment does not trade off generality for the targeted effect.
major comments (3)
- [§3] Fine-tuning objective: The manuscript must specify the exact loss function and training objective used to 'align surprisal-based reading-time estimates with actual human reading times.' Standard next-token cross-entropy would increase probability (decrease surprisal) at the critical disambiguating region, which directly opposes the direction required to produce the observed human slowdowns. Without an explicit formulation showing how the objective raises surprisal where humans slow down while still preserving next-token prediction, it is impossible to evaluate whether the procedure yields a better model of human prediction or merely adjusts parameters to fit the target RTs.
- [§5.1–§5.2] Held-out garden-path evaluation and controls: The paper reports successful capture of slowdowns on held-out items and no overfitting, yet provides no quantitative description of how the held-out set differs from the fine-tuning set in lexical items, syntactic templates, or n-gram overlap. Without such controls or an ablation that compares against a model fine-tuned on matched non-garden-path controls, the generalization claim remains vulnerable to the possibility that improvements reflect memorization of shared structural patterns rather than an improved approximation of human next-word probabilities.
- [§6] Preservation of general LM capabilities: The claim that fine-tuned models 'preserve their general LM capabilities' requires reporting of the exact benchmarks, the magnitude of any change in perplexity or downstream-task scores, and statistical tests for equivalence or non-inferiority. A simple statement that performance is 'maintained' is insufficient to support the broader conclusion that the existence proof does not come at the cost of the model's original predictive distribution.
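To make the requested equivalence testing concrete, here is a minimal sketch of a paired two one-sided tests (TOST) procedure on per-passage mean surprisal before and after fine-tuning; the equivalence margin, alpha level, and data are illustrative placeholders, not values from the paper.

```python
# Minimal sketch: paired TOST equivalence test on per-passage mean surprisal
# (nats/token) before vs. after fine-tuning. Margin and alpha are illustrative
# placeholders, not values taken from the paper.
import numpy as np
from scipy import stats

def tost_paired(before: np.ndarray, after: np.ndarray,
                margin: float, alpha: float = 0.05) -> bool:
    """Return True if the mean paired difference lies statistically within ±margin."""
    diff = after - before
    # One-sided test 1: mean difference > -margin
    _, p_lower = stats.ttest_1samp(diff, -margin, alternative="greater")
    # One-sided test 2: mean difference < +margin
    _, p_upper = stats.ttest_1samp(diff, margin, alternative="less")
    return max(p_lower, p_upper) < alpha

# Hypothetical per-passage scores on a held-out general corpus.
rng = np.random.default_rng(0)
before = rng.normal(loc=4.20, scale=0.30, size=200)
after = before + rng.normal(loc=0.01, scale=0.05, size=200)
print("equivalent within ±0.05 nats/token:",
      tost_paired(before, after, margin=0.05))
```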
minor comments (2)
- [Abstract and §1] The abstract and introduction should explicitly state the training objective (e.g., whether it is a custom RT-regression loss, a modified LM loss, or a multi-task combination) rather than describing it only as 'fine-tune ... so as to better align.'
- [Figures and Tables] Figure captions and tables reporting regression coefficients or R² values should include the exact number of items, the precise definition of the surprisal predictor (word-level or region-level), and whether the model was evaluated with the same fine-tuning hyperparameters across all experiments.
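As an illustration of the requested reporting, here is a minimal sketch of a region-level surprisal predictor (word surprisals summed over the critical region) entered into an ordinary least-squares fit to reading times; the data are simulated placeholders, and the paper's actual analysis may instead use word-level predictors, mixed-effects models, and baseline covariates such as word length and frequency.

```python
# Minimal sketch: regressing region reading times on a region-level surprisal
# predictor (word surprisals summed over the critical region). Data are
# simulated placeholders; the paper's analysis may differ.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_items = 48

# Hypothetical per-item word surprisals (bits) in a two-word critical region.
word_surprisal = rng.gamma(shape=2.0, scale=3.0, size=(n_items, 2))
region_surprisal = word_surprisal.sum(axis=1)            # region-level predictor

# Hypothetical region reading times (ms) with a linear surprisal effect.
reading_time = 320 + 12.0 * region_surprisal + rng.normal(0, 40, size=n_items)

X = sm.add_constant(region_surprisal)                     # intercept + slope
fit = sm.OLS(reading_time, X).fit()
print(f"slope = {fit.params[1]:.1f} ms/bit, R^2 = {fit.rsquared:.3f}")
```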
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major point below and commit to revisions that strengthen the manuscript's clarity and rigor without altering its core claims.
Point-by-point responses
- Referee: [§3] Fine-tuning objective: The manuscript must specify the exact loss function and training objective used to 'align surprisal-based reading-time estimates with actual human reading times.' Standard next-token cross-entropy would increase probability (decrease surprisal) at the critical disambiguating region, which directly opposes the direction required to produce the observed human slowdowns. Without an explicit formulation showing how the objective raises surprisal where humans slow down while still preserving next-token prediction, it is impossible to evaluate whether the procedure yields a better model of human prediction or merely adjusts parameters to fit the target RTs.
Authors: We agree that the loss function requires explicit specification. Our procedure uses a custom composite objective: a primary term that minimizes the squared error between surprisal-derived reading-time predictions and human RTs at critical regions (thereby raising surprisal where humans slow down), combined with a weighted standard cross-entropy term that regularizes against degradation of next-token prediction on general text. This formulation is what enables the observed increase in surprisal while preserving LM capabilities. We will insert the full mathematical definition, weighting scheme, and optimization details into the revised §3; a schematic sketch of this composite objective is given after these responses. revision: yes
- Referee: [§5.1–§5.2] Held-out garden-path evaluation and controls: The paper reports successful capture of slowdowns on held-out items and no overfitting, yet provides no quantitative description of how the held-out set differs from the fine-tuning set in lexical items, syntactic templates, or n-gram overlap. Without such controls or an ablation that compares against a model fine-tuned on matched non-garden-path controls, the generalization claim remains vulnerable to the possibility that improvements reflect memorization of shared structural patterns rather than an improved approximation of human next-word probabilities.
Authors: We accept that additional quantitative controls are needed. In the revision we will report Jaccard similarity on lexical items, syntactic template overlap scores, and 3-gram/4-gram overlap statistics between the fine-tuning and held-out garden-path sets. We will also add an ablation in which an identical number of non-garden-path sentences matched for length, lexical frequency, and syntactic complexity are used for fine-tuning; the resulting model will be evaluated on the same held-out garden-path items to isolate the contribution of garden-path-specific alignment. These additions will appear in the revised §§5.1–5.2. revision: yes
- Referee: [§6] Preservation of general LM capabilities: The claim that fine-tuned models 'preserve their general LM capabilities' requires reporting of the exact benchmarks, the magnitude of any change in perplexity or downstream-task scores, and statistical tests for equivalence or non-inferiority. A simple statement that performance is 'maintained' is insufficient to support the broader conclusion that the existence proof does not come at the cost of the model's original predictive distribution.
Authors: We agree that the current statement is insufficiently quantitative. In the revised §6 we will list the precise evaluation benchmarks (WikiText-103 perplexity, a subset of GLUE tasks, and an additional held-out general-corpus perplexity measure), report the raw before/after scores with confidence intervals, and include equivalence tests (TOST procedure) and non-inferiority tests with pre-specified margins to demonstrate that performance changes are statistically compatible with no meaningful degradation. revision: yes
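The composite objective described in the first response above can be made concrete with a minimal PyTorch-style sketch. The linear linking function (alpha, beta), the critical-region masking, the weighting term lam, and the batch field names are illustrative assumptions, not the paper's verified formulation.

```python
# Minimal sketch of a composite objective: an RT-alignment term at critical
# regions plus a weighted cross-entropy regularizer on general text. The
# linking function, region masking, and weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def composite_loss(model, gp_batch, general_batch,
                   alpha: float, beta: float, lam: float = 1.0) -> torch.Tensor:
    # --- RT-alignment term on garden-path items -----------------------------
    # gp_batch: input_ids (B, T), critical_mask (B, T) marking the
    # disambiguating region, and human_rt (B,) in milliseconds.
    logits = model(gp_batch["input_ids"]).logits
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = gp_batch["input_ids"][:, 1:]
    surprisal = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T-1)
    mask = gp_batch["critical_mask"][:, 1:].float()
    region_surprisal = (surprisal * mask).sum(dim=1)          # (B,)
    predicted_rt = alpha + beta * region_surprisal            # linear linking
    rt_loss = F.mse_loss(predicted_rt, gp_batch["human_rt"])

    # --- Cross-entropy regularizer on general text --------------------------
    gen_logits = model(general_batch["input_ids"]).logits
    lm_loss = F.cross_entropy(
        gen_logits[:, :-1].reshape(-1, gen_logits.size(-1)),
        general_batch["input_ids"][:, 1:].reshape(-1),
    )
    return rt_loss + lam * lm_loss
```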
Circularity Check
No significant circularity; construction via fine-tuning with held-out validation is self-contained
full rationale
The paper's core argument is an existence proof obtained by fine-tuning off-the-shelf neural LMs on garden-path sentences to improve alignment between surprisal and human reading times, followed by explicit checks that the resulting models generalize to held-out garden-path items, improve regression fit on separate naturalistic corpora, and retain original LM capabilities. No step reduces by construction to its inputs: the fine-tuning objective is not shown to be identical to the evaluation metric on held-out data, no parameter is fitted directly to the target quantity and then relabeled a prediction, and no load-bearing premise rests on a self-citation chain. The derivation therefore remains an empirical construction with independent test sets rather than a definitional loop.
Axiom & Free-Parameter Ledger
free parameters (1)
- Fine-tuning hyperparameters (e.g., learning rate, epochs)
axioms (1)
- domain assumption: Sentence processing difficulty increases linearly with word surprisal
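In the notation of the abstract, this axiom is the standard linear linking hypothesis of surprisal theory (the paper's exact regression specification, e.g. any baseline covariates, is not restated here):

\mathrm{RT}(w_t) \;=\; \alpha + \beta\, s(w_t) + \varepsilon_t, \qquad s(w_t) \;=\; -\log p(w_t \mid w_{<t})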