A Dual-Task Paradigm to Investigate Sentence Comprehension Strategies in Language Models

Rei Emura; Saku Sugawara

arXiv: 2604.26351 · v1 · submitted 2026-04-29 · 💻 cs.CL

Pith reviewed 2026-05-07 13:34 UTC · model grok-4.3

keywords: dual-task paradigm, sentence comprehension, language models, plausibility-based inference, cognitive constraints, rational inference, working memory, passive sentences

The pith

Language models shift toward plausibility-based sentence comprehension when placed under dual-task cognitive load.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a dual-task setup that pairs arithmetic computation with passive-sentence interpretation to constrain the balance between memory and processing resources. Under these conditions, models such as GPT-4o, o3-mini, and o4-mini display a reliably larger accuracy difference between plausible and implausible sentences than they do in single-task settings. This pattern matches the rational-inference strategy observed in humans when working memory is taxed. The result indicates that human-like comprehension behaviors can emerge in LMs once their resource budget is limited rather than unlimited.

Core claim

Imposing a dual-task load that interleaves arithmetic computation with sentence comprehension causes advanced language models to exhibit a greater accuracy gap between plausible sentences and their implausible counterparts, thereby demonstrating a shift to plausibility-based comprehension that parallels human rational inference under resource constraints.

What carries the argument

The dual-task paradigm that combines an arithmetic computation task with a sentence comprehension task to constrain the balance between memory storage and processing resources.

If this is right

Resource constraints on LMs promote greater reliance on plausibility during sentence interpretation.
This reliance produces accuracy patterns that mirror those of humans under working-memory load.
Human-like sentence comprehension strategies in LMs can arise from limited rather than abundant cognitive resources.
The balance between memory and processing capacity is a key driver of rational inference behaviors in both humans and models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Scaling model capacity without corresponding resource limits may reduce the emergence of human-like inference strategies on some tasks.
The same dual-task method could be applied to other comprehension phenomena such as garden-path recovery or pronoun resolution.
Similar resource bottlenecks might be engineered in multimodal or reasoning models to test whether plausibility-based shortcuts increase.

Load-bearing premise

Differences in accuracy between plausible and implausible sentences under dual-task load specifically reflect a shift in comprehension strategy rather than a general drop in performance or unrelated task interference.

What would settle it

Showing that the accuracy gap between plausible and implausible sentences remains unchanged after equating overall performance levels across single-task and dual-task conditions would falsify the claim.
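
One way to make this test concrete is to compare the change in the plausibility gap on two scales: raw accuracy and log-odds. The sketch below is illustrative only. The function names (`logit`, `gap_change`) and every accuracy value are made up for the example and are not taken from the paper. The point is that a gap can widen in raw accuracy simply because performance moves away from ceiling, so checking a second scale is a cheap first control.

```python
import math

def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def gap_change(single, dual, scale=lambda p: p):
    """Dual-task plausibility gap minus single-task gap, on the chosen scale."""
    gap_single = scale(single["plausible"]) - scale(single["implausible"])
    gap_dual = scale(dual["plausible"]) - scale(dual["implausible"])
    return gap_dual - gap_single

# Hypothetical accuracies for illustration only, NOT results from the paper.
near_ceiling_single = {"plausible": 0.98, "implausible": 0.95}
loaded_dual = {"plausible": 0.90, "implausible": 0.80}

raw_change = gap_change(near_ceiling_single, loaded_dual)
logit_change = gap_change(near_ceiling_single, loaded_dual, scale=logit)
print(f"raw gap change:      {raw_change:+.3f}")
print(f"log-odds gap change: {logit_change:+.3f}")
```

With these made-up numbers the raw gap widens while the log-odds gap narrows. If real data showed that pattern, a scale-dependent reading would be hard to rule out; a gap that widens on both scales would be more persuasive.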

Figures

Figures reproduced from arXiv: 2604.26351 by Rei Emura, Saku Sugawara.

**Figure 1.** Overview of hypothesis and tasks. This study…

**Figure 2.** Mean accuracy of comprehension tasks by plausibility, task, LM, and construction. GPT-4o is likely to…

**Figure 3.** Mean accuracy rates of comprehension questions by plausibility, task, LM, and correct answer (Yes or No). When the correct answer is “Yes,” all models show larger plausibility contrasts under the dual task condition than in the single or noisy single tasks. This effect is driven by a substantial decrease in accuracy for implausible sentences under the dual task. That is, the models often fa…

**Figure 4.** Proportion of implausible items answered correctly in the single and noisy single tasks but incorrectly in…

**Figure 5.** Example prompt of the single task.

**Figure 6.** Example prompt of the noisy single task.

**Figure 7.** Example prompt of the dual task.

**Figure 8.** Accuracy rate of comprehension tasks by plausibility, task, LM, and calculation. Some conditions are…

**Figure 9.** Instruction screens for the single task in the human experiment.

**Figure 10.** Instruction screens for the noisy single task in the human experiment.

**Figure 11.** Instruction screens for the dual task in the human experiment.

Original abstract

Language models (LMs) behave more like humans when their cognitive resources are restricted, particularly in predicting sentence processing costs such as reading times. However, it remains unclear whether such constraints similarly affect sentence comprehension strategies. Besides, existing methods do not directly target the balance between memory storage and sentence processing, which is central to human working memory. To address this issue, we propose a dual-task paradigm that combines an arithmetic computation task with a sentence comprehension task, such as "The 2 cocktail + blended 3 =..." Our experiments show that under dual-task conditions, GPT-4o, o3-mini, and o4-mini shift toward plausibility-based comprehension, mirroring humans' rational inference. Specifically, these models show a greater accuracy gap between plausible sentences (e.g., "The cocktail was blended by the bartender") and implausible sentences (e.g., "The bartender was blended by the cocktail") in the dual-task condition compared to the single-task conditions. These findings suggest that constraints on the balance between memory and processing resources promote rational inference in LMs. More broadly, they support the view that human-like sentence comprehension fundamentally arises from the allocation of limited cognitive resources.
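
To illustrate what an interleaved stimulus like the abstract's fragment could look like, here is a minimal sketch. It assumes a simple alternation of sentence tokens and arithmetic tokens; the helper name `interleave` and the alternation rule are hypothetical, and the authors' actual prompt template is not given in this summary.

```python
from itertools import zip_longest

def interleave(sentence_tokens, arithmetic_tokens):
    """Alternate sentence tokens and arithmetic tokens, starting with the sentence.

    Hypothetical scheme for illustration; not the paper's documented template.
    """
    merged = []
    for word, symbol in zip_longest(sentence_tokens, arithmetic_tokens):
        if word is not None:
            merged.append(word)
        if symbol is not None:
            merged.append(symbol)
    return " ".join(merged)

# Tokens chosen to reproduce the fragment quoted in the abstract.
stimulus = interleave(["The", "cocktail", "blended"], ["2", "+", "3", "="])
print(stimulus)
```

Under this assumed scheme the reader has to hold partial sums in memory while integrating each new word, which is the storage-versus-processing trade-off the paradigm is meant to tax.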

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The dual-task arithmetic setup produces a larger plausibility gap in GPT-4o and similar models, but the abstract supplies no controls or stats to show this is a targeted strategy shift rather than generic load.

The paper's core observation is that GPT-4o, o3-mini, and o4-mini show a bigger accuracy difference between plausible and implausible passives when they must also do arithmetic, compared with single-task conditions. This is framed as evidence that resource limits push the models toward human-style rational inference based on plausibility. The new element is the direct transfer of the dual-task paradigm from human sentence processing work to test whether the same memory-processing tradeoff produces similar behavior in LMs. That extension is clean and worth checking.

The examples are concrete and the claim is stated plainly. What the work does well is keep the design simple and tie it explicitly to working-memory ideas that already exist in the human literature.

The soft spots are more serious. The abstract gives no trial counts, no overall accuracy numbers, no statistical tests, and no description of prompt wording or lexical matching. Without those, it is impossible to tell whether the widened gap reflects a genuine change in comprehension strategy or just that implausible sentences are harder and therefore drop more under any added load. The stress-test concern about nonspecific effects therefore lands.

The paper is aimed at computational psycholinguists who want to model resource-rational behavior in LMs. A reader already working on dual-task or capacity-limit accounts would get something useful from the idea, even if the current data are preliminary. It deserves a serious referee because the paradigm itself is straightforward to run properly and the question is live; the current version would need methods and results sections that actually address the controls before it could be published.

Referee Report

2 major / 1 minor

Summary. The paper proposes a dual-task paradigm combining arithmetic computation with sentence comprehension (e.g., 'The 2 cocktail + blended 3 =...') to examine how resource constraints affect language models' sentence processing strategies. Experiments with GPT-4o, o3-mini, and o4-mini report a larger accuracy gap between plausible passives (e.g., 'The cocktail was blended by the bartender') and implausible passives (e.g., 'The bartender was blended by the cocktail') under dual-task versus single-task conditions, interpreted as a shift toward plausibility-based rational inference mirroring human behavior.
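
The plausible and implausible examples quoted in the summary differ only in which noun fills the subject slot and which fills the by-phrase. A minimal sketch of building such a pair follows; the helper `passive_pair` and the fixed template are assumptions for illustration, since the authors' item-construction procedure is not described here.

```python
def passive_pair(patient, verb_participle, agent):
    """Return (plausible, implausible) passives that swap the two role fillers.

    Hypothetical template; the paper's actual item set may be built differently.
    """
    plausible = f"The {patient} was {verb_participle} by the {agent}"
    implausible = f"The {agent} was {verb_participle} by the {patient}"
    return plausible, implausible

plausible, implausible = passive_pair("cocktail", "blended", "bartender")
print(plausible)
print(implausible)

# Both sentences use exactly the same words, so any accuracy difference
# cannot come from a difference in vocabulary between the two items.
same_words = sorted(plausible.lower().split()) == sorted(implausible.lower().split())
```

Swapping role fillers inside one template keeps the two versions lexically matched by construction, which addresses one part of the matching concern; it does not control for differences between items.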

Significance. If the result holds after appropriate controls, the work would provide empirical support for resource-rational accounts of sentence comprehension in LMs, showing that memory-processing trade-offs can induce more human-like inference strategies. The dual-task design offers a targeted method for probing strategy shifts beyond standard prompting, with potential implications for cognitive modeling of LMs.

major comments (2)

[Abstract] Abstract and Results: The central claim that the larger plausible-vs-implausible accuracy gap under dual-task load specifically indexes a shift to plausibility-based comprehension (rather than nonspecific load effects, altered attention, or differential interference) is load-bearing but currently unsupported; the manuscript must show that the interaction survives controls for main effects of task difficulty and uniform performance drops.
[Methods] Methods and Results: No details are supplied on trial counts, statistical tests for the critical interaction, prompt templates, lexical/structural matching between plausible and implausible items, or whether overall accuracy declines uniformly across conditions, all of which are required to evaluate the skeptic's concern about confounds.

minor comments (1)

[Abstract] The abstract would be strengthened by reporting quantitative details such as effect sizes or exact accuracy percentages for the key conditions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. These points highlight important areas where additional evidence and transparency will strengthen the manuscript. We address each major comment below and will revise the paper accordingly.

Referee: [Abstract] Abstract and Results: The central claim that the larger plausible-vs-implausible accuracy gap under dual-task load specifically indexes a shift to plausibility-based comprehension (rather than nonspecific load effects, altered attention, or differential interference) is load-bearing but currently unsupported; the manuscript must show that the interaction survives controls for main effects of task difficulty and uniform performance drops.

Authors: We agree that ruling out nonspecific load effects is essential for the central interpretation. In the revised manuscript we will add analyses that explicitly control for overall accuracy declines (e.g., by testing whether the plausible-implausible gap widens under dual-task load after accounting for main effects of condition on baseline performance). We will also report condition-wise accuracy means and include a supplementary check that the critical interaction remains significant when overall performance is partialled out. revision: yes

Referee: [Methods] Methods and Results: No details are supplied on trial counts, statistical tests for the critical interaction, prompt templates, lexical/structural matching between plausible and implausible items, or whether overall accuracy declines uniformly across conditions, all of which are required to evaluate the skeptic's concern about confounds.

Authors: We appreciate the referee's request for these methodological details, which are necessary for evaluating potential confounds. The revised version will include: (i) exact trial counts per condition, (ii) the statistical tests and models used for the critical interaction (including any mixed-effects specifications), (iii) the full prompt templates, (iv) the criteria and examples for lexical and structural matching between plausible and implausible items, and (v) an explicit report of whether accuracy declines are uniform or differential across sentence types. revision: yes
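
As a rough illustration of what a test of the critical interaction could look like, the sketch below bootstraps a confidence interval for the difference between the dual-task and single-task plausibility gaps. This is a swapped-in, simplified technique, not the mixed-effects analysis the authors mention: it treats trials as independent and ignores item and model grouping. All names (`interaction`, `bootstrap_ci`, `make_cell`) and all trial counts are hypothetical.

```python
import random

def accuracy(trials):
    """Proportion of correct responses in a list of 0/1 outcomes."""
    return sum(trials) / len(trials)

def interaction(cells):
    """Dual-task plausibility gap minus single-task plausibility gap."""
    gap_single = accuracy(cells["single", "plausible"]) - accuracy(cells["single", "implausible"])
    gap_dual = accuracy(cells["dual", "plausible"]) - accuracy(cells["dual", "implausible"])
    return gap_dual - gap_single

def bootstrap_ci(cells, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling trials within each cell."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resampled = {key: rng.choices(trials, k=len(trials)) for key, trials in cells.items()}
        estimates.append(interaction(resampled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def make_cell(n_correct, n_total):
    """Build a list of 0/1 outcomes with the given number of correct trials."""
    return [1] * n_correct + [0] * (n_total - n_correct)

# Hypothetical trial counts for illustration only, NOT data from the paper.
cells = {
    ("single", "plausible"): make_cell(190, 200),
    ("single", "implausible"): make_cell(180, 200),
    ("dual", "plausible"): make_cell(186, 200),
    ("dual", "implausible"): make_cell(120, 200),
}

point = interaction(cells)
low, high = bootstrap_ci(cells)
print(f"interaction estimate: {point:.3f}, 95% CI: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero would indicate that the gap widens under load for these toy counts. A real analysis would still need to model repeated items and models, for instance with the mixed-effects specifications listed in the response.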

Circularity Check

0 steps flagged

No circularity in empirical behavioral study

This paper reports results from a dual-task experiment (arithmetic + sentence comprehension) on GPT-4o, o3-mini, and o4-mini, measuring accuracy gaps between plausible and implausible passive sentences. No equations, derivations, fitted parameters, or load-bearing self-citations appear in the abstract or described methods. The central claim rests on direct empirical observations of accuracy differences under varying task loads, which are independently testable and not reduced to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about what the dual-task measures and how accuracy gaps should be interpreted; no free parameters are introduced because the work is experimental rather than modeling-based.

axioms (2)

Domain assumption: Accuracy differences between plausible and implausible sentences under dual-task conditions reflect a shift in comprehension strategy toward rational inference.
This interpretive step is required to link the behavioral result to the claim about human-like processing.

Domain assumption: The arithmetic-plus-sentence dual task constrains the balance between memory storage and processing resources in LMs analogously to human working memory.
Invoked to explain why the observed behavior mirrors human rational inference.
