Recognition: unknown
An Existence Proof for Neural Language Models That Can Explain Garden-Path Effects via Surprisal
Pith reviewed 2026-05-10 05:32 UTC · model grok-4.3
The pith
Fine-tuning neural language models on garden-path sentences lets surprisal explain human reading slowdowns on held-out cases and naturalistic text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-tuning neural language models on garden-path sentences yields surprisal estimates that capture human reading slowdowns on held-out garden-path items. These models further improve predictive power for human reading times on naturalistic corpora and preserve their general language modeling capabilities. The findings constitute an existence proof for neural LMs capable of explaining both garden-path effects and typical processing difficulty via surprisal.
What carries the argument
Fine-tuning pre-trained neural language models on garden-path sentences to adjust next-word probabilities and thus their surprisal values.
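As a minimal illustration of the quantity being adjusted, the sketch below computes per-token surprisal from a causal LM. It assumes a Hugging Face-style GPT-2 interface; this is not the paper's own pipeline, which must additionally handle subword-to-word aggregation and tokenizer whitespace conventions.

```python
# Minimal sketch: per-token surprisal from a causal LM (assumed GPT-2 via
# Hugging Face transformers; not the paper's exact pipeline).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def token_surprisals(sentence: str):
    """Return (token, surprisal-in-bits) pairs for every token after the first."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                      # (1, seq_len, vocab)
    # Log-probability of each actual next token given its prefix.
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    next_ids = ids[0, 1:]
    nll = -log_probs[torch.arange(next_ids.size(0)), next_ids]   # nats
    surprisal_bits = nll / torch.log(torch.tensor(2.0))
    tokens = tokenizer.convert_ids_to_tokens(next_ids.tolist())
    return list(zip(tokens, surprisal_bits.tolist()))

# Garden-path example: surprisal should spike at the disambiguating word "fell".
for tok, s in token_surprisals("The horse raced past the barn fell."):
    print(f"{tok:>12s}  {s:6.2f} bits")
```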
If this is right
- Surprisal from appropriately tuned neural LMs can account for garden-path effects without invoking separate mechanisms.
- The improvement generalizes to sentences not seen during fine-tuning.
- Accuracy in predicting reading times on everyday text increases after this tuning.
- General language modeling performance does not degrade as a result of the fine-tuning.
Where Pith is reading between the lines
- Human sentence prediction may resemble the output of models exposed to a mix of ambiguous and unambiguous structures during learning.
- Discrepancies between off-the-shelf LMs and human data may stem more from training distributions than from model architecture.
- Testing surprisal theory may require evaluating models trained to match human prediction behavior rather than raw pre-trained LMs.
- Similar fine-tuning approaches could be applied to model other psycholinguistic effects beyond garden paths.
Load-bearing premise
Fine-tuning changes the model's word predictions in a way that better reflects how humans anticipate words rather than causing it to memorize the training sentences without generalizing.
What would settle it
The fine-tuned models failing to predict human reading slowdowns on a fresh set of garden-path sentences or showing reduced performance on standard language modeling evaluations.
Original abstract
Surprisal theory hypothesizes that the difficulty of human sentence processing increases linearly with surprisal, the negative log-probability of a word given its context. Computational psycholinguistics has tested this hypothesis using language models (LMs) as proxies for human prediction. While surprisal derived from recent neural LMs generally captures human processing difficulty on naturalistic corpora that predominantly consist of simple sentences, it severely underestimates processing difficulty on sentences that require syntactic disambiguation (garden-path effects). This leads to the claim that the processing difficulty of such sentences cannot be reduced to surprisal, although it remains possible that neural LMs simply differ from humans in next-word prediction. In this paper, we investigate whether it is truly impossible to construct a neural LM that can explain garden-path effects via surprisal. Specifically, instead of evaluating off-the-shelf neural LMs, we fine-tune these LMs on garden-path sentences so as to better align surprisal-based reading-time estimates with actual human reading times. Our results show that fine-tuned LMs do not overfit and successfully capture human reading slowdowns on held-out garden-path items; they even improve predictive power for human reading times on naturalistic corpora and preserve their general LM capabilities. These results provide an existence proof for a neural LM that can explain both garden-path effects and naturalistic reading times via surprisal, but also raise a theoretical question: what kind of evidence can truly falsify surprisal theory?
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to deliver an existence proof that neural language models can explain garden-path effects via surprisal. By fine-tuning pre-trained LMs on garden-path sentences specifically to align surprisal-derived reading-time predictions with human data, the resulting models capture human slowdowns on held-out garden-path items, improve regression fits to reading times on naturalistic corpora, and retain general language-modeling performance without overfitting.
Significance. If the fine-tuning procedure produces a predictive distribution that genuinely approximates human next-word probabilities rather than rote pattern matching, the result would show that garden-path difficulty is compatible with surprisal theory once the LM is brought into closer alignment with human expectations. The additional claims of improved naturalistic coverage and preserved LM capabilities would indicate that the adjustment does not trade off generality for the targeted effect.
major comments (3)
- [§3] Fine-tuning objective: The manuscript must specify the exact loss function and training objective used to 'align surprisal-based reading-time estimates with actual human reading times.' Standard next-token cross-entropy would increase probability (decrease surprisal) at the critical disambiguating region, which directly opposes the direction required to produce the observed human slowdowns. Without an explicit formulation showing how the objective raises surprisal where humans slow down while still preserving next-token prediction, it is impossible to evaluate whether the procedure yields a better model of human prediction or merely adjusts parameters to fit the target RTs.
- [§5.1–§5.2] Held-out garden-path evaluation and controls: The paper reports successful capture of slowdowns on held-out items and no overfitting, yet provides no quantitative description of how the held-out set differs from the fine-tuning set in lexical items, syntactic templates, or n-gram overlap. Without such controls or an ablation that compares against a model fine-tuned on matched non-garden-path controls, the generalization claim remains vulnerable to the possibility that improvements reflect memorization of shared structural patterns rather than an improved approximation of human next-word probabilities.
- [§6] Preservation of general LM capabilities: The claim that fine-tuned models 'preserve their general LM capabilities' requires reporting of the exact benchmarks, the magnitude of any change in perplexity or downstream-task scores, and statistical tests for equivalence or non-inferiority. A simple statement that performance is 'maintained' is insufficient to support the broader conclusion that the existence proof does not come at the cost of the model's original predictive distribution.
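To make the requested equivalence testing concrete, here is a minimal sketch of a paired two one-sided tests (TOST) procedure on per-passage mean surprisal before and after fine-tuning; the equivalence margin, alpha level, and data are illustrative placeholders, not values from the paper.

```python
# Minimal sketch: paired TOST equivalence test on per-passage mean surprisal
# (nats/token) before vs. after fine-tuning. Margin and alpha are illustrative
# placeholders, not values taken from the paper.
import numpy as np
from scipy import stats

def tost_paired(before: np.ndarray, after: np.ndarray,
                margin: float, alpha: float = 0.05) -> bool:
    """Return True if the mean paired difference lies statistically within ±margin."""
    diff = after - before
    # One-sided test 1: mean difference > -margin
    _, p_lower = stats.ttest_1samp(diff, -margin, alternative="greater")
    # One-sided test 2: mean difference < +margin
    _, p_upper = stats.ttest_1samp(diff, margin, alternative="less")
    return max(p_lower, p_upper) < alpha

# Hypothetical per-passage scores on a held-out general corpus.
rng = np.random.default_rng(0)
before = rng.normal(loc=4.20, scale=0.30, size=200)
after = before + rng.normal(loc=0.01, scale=0.05, size=200)
print("equivalent within ±0.05 nats/token:",
      tost_paired(before, after, margin=0.05))
```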
minor comments (2)
- [Abstract and §1] The abstract and introduction should explicitly state the training objective (e.g., whether it is a custom RT-regression loss, a modified LM loss, or a multi-task combination) rather than describing it only as 'fine-tune ... so as to better align.'
- [Figures and Tables] Figure captions and tables reporting regression coefficients or R² values should include the exact number of items, the precise definition of the surprisal predictor (word-level or region-level), and whether the model was evaluated with the same fine-tuning hyperparameters across all experiments.
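As an illustration of the requested reporting, here is a minimal sketch of a region-level surprisal predictor (word surprisals summed over the critical region) entered into an ordinary least-squares fit to reading times; the data are simulated placeholders, and the paper's actual analysis may instead use word-level predictors, mixed-effects models, and baseline covariates such as word length and frequency.

```python
# Minimal sketch: regressing region reading times on a region-level surprisal
# predictor (word surprisals summed over the critical region). Data are
# simulated placeholders; the paper's analysis may differ.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_items = 48

# Hypothetical per-item word surprisals (bits) in a two-word critical region.
word_surprisal = rng.gamma(shape=2.0, scale=3.0, size=(n_items, 2))
region_surprisal = word_surprisal.sum(axis=1)            # region-level predictor

# Hypothetical region reading times (ms) with a linear surprisal effect.
reading_time = 320 + 12.0 * region_surprisal + rng.normal(0, 40, size=n_items)

X = sm.add_constant(region_surprisal)                     # intercept + slope
fit = sm.OLS(reading_time, X).fit()
print(f"slope = {fit.params[1]:.1f} ms/bit, R^2 = {fit.rsquared:.3f}")
```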
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major point below and commit to revisions that strengthen the manuscript's clarity and rigor without altering its core claims.
Point-by-point responses
- Referee: [§3] Fine-tuning objective: The manuscript must specify the exact loss function and training objective used to 'align surprisal-based reading-time estimates with actual human reading times.' Standard next-token cross-entropy would increase probability (decrease surprisal) at the critical disambiguating region, which directly opposes the direction required to produce the observed human slowdowns. Without an explicit formulation showing how the objective raises surprisal where humans slow down while still preserving next-token prediction, it is impossible to evaluate whether the procedure yields a better model of human prediction or merely adjusts parameters to fit the target RTs.
Authors: We agree that the loss function requires explicit specification. Our procedure uses a custom composite objective: a primary term that minimizes the squared error between surprisal-derived reading-time predictions and human RTs at critical regions (thereby raising surprisal where humans slow down), combined with a weighted standard cross-entropy term that regularizes against degradation of next-token prediction on general text. This formulation is what enables the observed increase in surprisal while preserving LM capabilities. We will insert the full mathematical definition, weighting scheme, and optimization details into the revised §3; a schematic sketch of this composite objective is given after these responses. revision: yes
- Referee: [§5.1–§5.2] Held-out garden-path evaluation and controls: The paper reports successful capture of slowdowns on held-out items and no overfitting, yet provides no quantitative description of how the held-out set differs from the fine-tuning set in lexical items, syntactic templates, or n-gram overlap. Without such controls or an ablation that compares against a model fine-tuned on matched non-garden-path controls, the generalization claim remains vulnerable to the possibility that improvements reflect memorization of shared structural patterns rather than an improved approximation of human next-word probabilities.
Authors: We accept that additional quantitative controls are needed. In the revision we will report Jaccard similarity on lexical items, syntactic template overlap scores, and 3-gram/4-gram overlap statistics between the fine-tuning and held-out garden-path sets. We will also add an ablation in which an identical number of non-garden-path sentences matched for length, lexical frequency, and syntactic complexity are used for fine-tuning; the resulting model will be evaluated on the same held-out garden-path items to isolate the contribution of garden-path-specific alignment. These additions will appear in the revised §§5.1–5.2. revision: yes
- Referee: [§6] Preservation of general LM capabilities: The claim that fine-tuned models 'preserve their general LM capabilities' requires reporting of the exact benchmarks, the magnitude of any change in perplexity or downstream-task scores, and statistical tests for equivalence or non-inferiority. A simple statement that performance is 'maintained' is insufficient to support the broader conclusion that the existence proof does not come at the cost of the model's original predictive distribution.
Authors: We agree that the current statement is insufficiently quantitative. In the revised §6 we will list the precise evaluation benchmarks (WikiText-103 perplexity, a subset of GLUE tasks, and an additional held-out general-corpus perplexity measure), report the raw before/after scores with confidence intervals, and include equivalence tests (TOST procedure) and non-inferiority tests with pre-specified margins to demonstrate that performance changes are statistically compatible with no meaningful degradation. revision: yes
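The composite objective described in the first response above can be made concrete with a minimal PyTorch-style sketch. The linear linking function (alpha, beta), the critical-region masking, the weighting term lam, and the batch field names are illustrative assumptions, not the paper's verified formulation.

```python
# Minimal sketch of a composite objective: an RT-alignment term at critical
# regions plus a weighted cross-entropy regularizer on general text. The
# linking function, region masking, and weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def composite_loss(model, gp_batch, general_batch,
                   alpha: float, beta: float, lam: float = 1.0) -> torch.Tensor:
    # --- RT-alignment term on garden-path items -----------------------------
    # gp_batch: input_ids (B, T), critical_mask (B, T) marking the
    # disambiguating region, and human_rt (B,) in milliseconds.
    logits = model(gp_batch["input_ids"]).logits
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = gp_batch["input_ids"][:, 1:]
    surprisal = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T-1)
    mask = gp_batch["critical_mask"][:, 1:].float()
    region_surprisal = (surprisal * mask).sum(dim=1)          # (B,)
    predicted_rt = alpha + beta * region_surprisal            # linear linking
    rt_loss = F.mse_loss(predicted_rt, gp_batch["human_rt"])

    # --- Cross-entropy regularizer on general text --------------------------
    gen_logits = model(general_batch["input_ids"]).logits
    lm_loss = F.cross_entropy(
        gen_logits[:, :-1].reshape(-1, gen_logits.size(-1)),
        general_batch["input_ids"][:, 1:].reshape(-1),
    )
    return rt_loss + lam * lm_loss
```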
Circularity Check
No significant circularity; construction via fine-tuning with held-out validation is self-contained
full rationale
The paper's core argument is an existence proof obtained by fine-tuning off-the-shelf neural LMs on garden-path sentences to improve alignment between surprisal and human reading times, followed by explicit checks that the resulting models generalize to held-out garden-path items, improve regression fit on separate naturalistic corpora, and retain original LM capabilities. No step reduces by construction to its inputs: the fine-tuning objective is not shown to be identical to the evaluation metric on held-out data, no parameter is fitted directly to the target quantity and then relabeled a prediction, and no load-bearing premise rests on a self-citation chain. The derivation therefore remains an empirical construction with independent test sets rather than a definitional loop.
Axiom & Free-Parameter Ledger
free parameters (1)
- Fine-tuning hyperparameters (e.g., learning rate, epochs)
axioms (1)
- domain assumption: Sentence processing difficulty increases linearly with word surprisal
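In the notation of the abstract, this axiom is the standard linear linking hypothesis of surprisal theory (the paper's exact regression specification, e.g. any baseline covariates, is not restated here):

\mathrm{RT}(w_t) \;=\; \alpha + \beta\, s(w_t) + \varepsilon_t, \qquad s(w_t) \;=\; -\log p(w_t \mid w_{<t})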