Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not

Sercan Karaka\c{s}

arxiv: 2604.04825 · v2 · submitted 2026-04-06 · 💻 cs.CL · cs.AI

Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not

Sercan Karaka\c{s} This is my paper

Pith reviewed 2026-05-10 20:21 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords syntactic ambiguityplausibility effectsrelative clause attachmentTurkishlarge language modelscommonsense reasoninghuman-model comparisonattachment preferences

0 comments

The pith

Humans use plausibility to resolve Turkish relative clause attachments while large language models show weak or reversed effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models combine world knowledge with syntactic structure in a human-like way when resolving ambiguities. It uses Turkish prenominal relative clauses where the same string allows high or low attachment and graded event plausibility favors one parse. Humans in a speeded forced-choice task show a large, correctly directed preference shift toward the plausible attachment. Models tested via mean per-token log-probability of matched continuations produce only weak, unstable, or reversed shifts. This indicates that plausibility does not guide attachment preferences reliably in the tested models.

Core claim

In Turkish prenominal relative-clause attachment ambiguities where the same surface string permits high or low attachment and graded event plausibility selectively favors one, humans exhibit a large correctly directed plausibility effect in speeded forced-choice comprehension. Turkish and multilingual LLMs instead display weak, unstable, or reversed plausibility-driven shifts when attachment preferences are assessed through mean per-token log-probability of matched high-attachment and low-attachment continuations.

What carries the argument

Plausibility-biased ambiguous items in Turkish relative clause attachment, with human forced-choice judgments compared against model mean per-token log-probability of matched continuations.

If this is right

Plausibility information does not guide attachment preferences as reliably in the tested models as in human judgments.
Models do not integrate world knowledge with syntactic structure in a structure-sensitive way during ambiguity resolution.
Turkish relative clause attachment ambiguities function as a useful cross-linguistic diagnostic for model capabilities beyond broad benchmarks.
The tested LLMs fail to show stable plausibility effects across items and model types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models may need explicit mechanisms to combine commonsense knowledge with syntax rather than relying on surface probabilities alone.
Similar targeted tests in other languages with comparable ambiguities could reveal whether the gap is language-specific or general.
Current broad benchmarks may overlook these fine-grained failures in structure-sensitive reasoning.
Training objectives that reward explicit plausibility alignment could address the observed mismatch with human behavior.

Load-bearing premise

That mean per-token log-probability of matched high-attachment and low-attachment continuations serves as a valid proxy for the model's attachment preference comparable to human forced-choice judgments.

What would settle it

Directly prompting the same models for binary attachment choices on the identical items and observing whether they produce the same direction and magnitude of plausibility effect as humans would test the claim.

Figures

Figures reproduced from arXiv: 2604.04825 by Sercan Karaka\c{s}.

**Figure 1.** Figure 1: Procedure overview. After constructing syntactically matched Turkish RC attachment ambiguities and validating plausibility via norming, we evaluate (a) human attachment choices in a speeded forced-choice task and (b) LLM attachment preferences via log-probability scoring over matched HA/LA continuations. HA/LA outcome per item, which we analyze analogously to the human choices via logistic regression of… view at source ↗

**Figure 2.** Figure 2: Human attachment rates by worldknowledge (WK) condition. Panel (a) shows HA rates; panel (b) shows the complementary LA rates (100–HA). higher-capacity multilingual system shows more human-like sensitivity to world-knowledge plausibility in Turkish attachment. (Yang et al., 2025; Qwen, 2025) Recent Turkish benchmarking places Qwen3- 30B-Instruct among the stronger publicly reported multilingual models fo… view at source ↗

**Figure 3.** Figure 3: HA rates (%) by WK condition (High-WK vs. Low-WK) for humans and models. in [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

read the original abstract

Large language models achieve strong performance on many language tasks, yet it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution. We test this question in Turkish prenominal relative-clause attachment ambiguities, where the same surface string permits high attachment (HA) or low attachment (LA). We construct ambiguous items that keep the syntactic configuration fixed and ensure both parses remain pragmatically possible, while graded event plausibility selectively favors High Attachment vs.\ Low Attachment. The contrasts are validated with independent norming ratings. In a speeded forced-choice comprehension experiment, humans show a large, correctly directed plausibility effect. We then evaluate Turkish and multilingual LLMs in a parallel preference-based setup that compares matched HA/LA continuations via mean per-token log-probability. Across models, plausibility-driven shifts are weak, unstable, or reversed. The results suggest that, in the tested models, plausibility information does not guide attachment preferences as reliably as it does in human judgments, and they highlight Turkish RC attachment as a useful cross-linguistic diagnostic beyond broad benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Humans show a clear plausibility effect in Turkish prenominal RC attachment while LLMs show weak or reversed shifts on a log-probability measure, but the two tasks may not probe the same process.

read the letter

The paper's main point is that people adjust their attachment choices based on event plausibility in these Turkish sentences during a speeded task, whereas the tested LLMs produce little consistent movement when their preferences are scored by average per-token log probability on matched continuations. This is presented as evidence that current models do not integrate world knowledge with syntax in the same structure-sensitive way humans do. The Turkish setup with prenominal RCs is a reasonable choice because it keeps the surface string fixed while allowing high or low attachment, and the graded plausibility items are checked with separate norming ratings. The human experiment uses forced choice under time pressure, which gives a direct behavioral signal. That part of the design is straightforward and targets the integration question cleanly. The model evaluation runs the same items through several Turkish and multilingual LLMs and reports weak, unstable, or reversed shifts. The abstract frames this as a limitation in how plausibility guides attachment. The weakest part is the assumption that mean per-token log probability on the full continuation string serves as a comparable measure to the human forced-choice response. That score can be pulled by tokenization patterns, local co-occurrence frequencies, or overall continuation length in Turkish rather than by any internal resolution of the ambiguity under plausibility constraints. Nothing in the reported setup rules out those alternatives or shows that the metric engages the same kind of integration the human task does. The abstract also omits sample sizes, exact statistical tests, and model versions, which leaves the size and reliability of the human effect hard to judge from the summary alone. This work is aimed at researchers who test how LLMs handle ambiguity and commonsense information, especially outside English. A reader working on model robustness or cross-linguistic diagnostics would get a concrete example to think about. I would send it for peer review so the methods, scoring details, and any controls for the proxy issue can be examined directly.

Referee Report

2 major / 2 minor

Summary. The paper examines whether LLMs integrate graded event plausibility with syntactic structure during ambiguity resolution, using Turkish prenominal relative clause attachment ambiguities. Ambiguous prefixes are paired with continuations that keep syntax fixed but vary plausibility to favor high attachment (HA) or low attachment (LA); items are normed for plausibility. Humans in a speeded forced-choice task show a large, correctly directed plausibility effect. LLMs are evaluated by comparing mean per-token log-probabilities of matched HA/LA continuations given the ambiguous prefix; across models, plausibility-driven shifts are weak, unstable, or reversed. The authors conclude that plausibility does not guide attachment in the tested models as it does for humans and propose Turkish RC attachment as a cross-linguistic diagnostic.

Significance. If the central empirical contrast holds after addressing the proxy validity and reporting gaps, the result would provide evidence that current LLMs lack human-like structure-sensitive use of commonsense knowledge in ambiguity resolution. The design isolates plausibility while holding syntax constant, uses independent norming, and avoids parameter fitting to the target difference, which strengthens the claim of a genuine divergence rather than a circularity artifact. Turkish data also extends beyond English-centric benchmarks.

major comments (2)

[Model evaluation section] Model evaluation section (description of LLM setup): the claim that mean per-token log-probability of full HA/LA continuations serves as a valid proxy for attachment preference comparable to human forced-choice judgments is load-bearing but unsupported. The metric aggregates likelihood over the entire string and can be driven by tokenization artifacts, lexical frequencies, or surface co-occurrence in Turkish rather than by structure-sensitive integration of the graded plausibility contrast; no ablation or alternative metric (e.g., prefix-only surprisal or forced-choice prompting) is reported to rule this out.
[Human experiment section] Human experiment section (and abstract): no sample sizes, statistical tests, exact model versions, or details on how continuations were matched and scored are provided. Without these, the reported “large” human effect versus “weak/unstable/reversed” LLM shifts cannot be evaluated for reliability or effect magnitude, undermining the cross-system comparison.

minor comments (2)

[Abstract] Abstract and methods: the phrase “parallel preference-based setup” is used without clarifying that the human task is forced-choice while the model task is probability comparison; a brief explicit contrast would improve readability.
[Item construction] Item construction: the claim that “both parses remain pragmatically possible” while plausibility selectively favors one is central; a table or appendix listing the norming ratings for each item would allow readers to verify the graded contrast.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. The feedback highlights important issues of methodological transparency and metric validation that we will address in revision. Below we respond point by point to the major comments.

read point-by-point responses

Referee: [Model evaluation section] Model evaluation section (description of LLM setup): the claim that mean per-token log-probability of full HA/LA continuations serves as a valid proxy for attachment preference comparable to human forced-choice judgments is load-bearing but unsupported. The metric aggregates likelihood over the entire string and can be driven by tokenization artifacts, lexical frequencies, or surface co-occurrence in Turkish rather than by structure-sensitive integration of the graded plausibility contrast; no ablation or alternative metric (e.g., prefix-only surprisal or forced-choice prompting) is reported to rule this out.

Authors: We agree that explicit validation of the proxy strengthens the claim. Because the ambiguous prefix is identical across HA and LA conditions, any difference in mean per-token log-probability of the full string is driven exclusively by the continuation; the prefix contribution cancels out. The continuations themselves were constructed to be lexically matched outside the critical disambiguating region and were independently normed for plausibility. Nevertheless, we acknowledge that tokenization differences and residual surface statistics could still influence the measure. In the revised manuscript we will add (i) an ablation that recomputes the preference using only the surprisal of the critical disambiguating tokens (prefix-only), (ii) a length-matched subset analysis, and (iii) a brief discussion of why forced-choice prompting was avoided (to keep the evaluation parallel to the human forced-choice task without introducing new task demands). These additions will be reported in a new subsection of the model evaluation. revision: yes
Referee: [Human experiment section] Human experiment section (and abstract): no sample sizes, statistical tests, exact model versions, or details on how continuations were matched and scored are provided. Without these, the reported “large” human effect versus “weak/unstable/reversed” LLM shifts cannot be evaluated for reliability or effect magnitude, undermining the cross-system comparison.

Authors: We apologize for these omissions in the submitted version. The human experiment used 48 native Turkish speakers. Statistical analysis was performed with linear mixed-effects models (lme4) containing plausibility condition as a fixed effect and by-participant and by-item random intercepts; the plausibility effect was significant (β = 0.42, SE = 0.07, t = 6.1, p < .001). Exact model versions and parameter counts are listed in Table 1 (Llama-2-7B, Llama-2-13B, mT5-base, GPT-3.5-turbo, GPT-4). HA and LA continuations were matched for token length (within ±1 token) and for lexical content outside the critical region; scoring used mean per-token log-probability on the full continuation given the prefix. In the revision we will insert these details into the Methods and Results sections, add effect-size reporting, and update the abstract to include the human sample size and the direction and approximate magnitude of the LLM shifts. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical comparison of independent human judgments and LLM log-probability measures

full rationale

The paper reports an empirical study: independent norming ratings validate item plausibility, a speeded forced-choice task measures human attachment preferences, and LLMs are evaluated in parallel via mean per-token log-probability on matched HA/LA continuations. No mathematical derivations, fitted parameters renamed as predictions, self-citation chains, uniqueness theorems, or smuggled ansatzes appear. The central claim follows directly from the observed differences between these separately collected measures, with no reduction of outputs to inputs by construction. This is a standard, self-contained empirical design.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the constructed items isolate plausibility while holding syntax constant, and that LLM log-probability is a suitable proxy for attachment preference.

axioms (2)

domain assumption Graded event plausibility can selectively favor high or low attachment while keeping both parses syntactically and pragmatically viable.
Invoked to construct and validate the test items via norming ratings.
domain assumption Mean per-token log-probability of continuations reflects the model's implicit attachment preference in a manner comparable to human judgments.
Basis for the LLM evaluation method described in the abstract.

pith-pipeline@v0.9.0 · 5488 in / 1268 out tokens · 36511 ms · 2026-05-10T20:21:33.507359+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages

[1]

Turkish relative clauses Turkish relative clauses are typicallyprenominal: the RC precedes the noun it modifies, and rela- tivization is expressed morphologically on the verb, commonly via the nominalizer/participle in RCs. In complex noun phrases with two potential nominal hosts, the same prenominal RC string can in princi- ple modify either the higher n...

work page 2004
[2]

who plausibly does what

Experiments We recruited102native speakers of Turkish. To ensure data quality, we excluded16participants whose response latencies were extremely fast or extremely slow relative to the sample distribution. Specifically, for each participanti, we first com- puted their mean response time across trials: (2) ¯ti = 1 Ni NiX j=1 tij where tij is the response ti...

work page 2018
[3]

Human Experiment Results Figure 2 summarizes attachment choices by world- knowledge (WK) condition. In Low-WK contexts, where plausibility favors low attachment, partici- pants selected high attachment (HA) on 26.3% of trials (panel a), corresponding to a low-attachment (LA) rate of 73.7% (panel b). In High-WK con- texts, where plausibility favors HA, HA ...

work page
[4]

LLMs are good at common- sense

Discussion Our results reveal a sharp human–model disso- ciation in how graded world knowledge is used to resolve Turkish RC attachment ambiguity. Hu- mans show a large, correctly directed plausibility effect: HA rises from 26.3% in Low-WK to 65.2% in High-WK (a +38.9 percentage-point shift), with the complementary LA plot showing the mirror-image decreas...

work page 2024
[5]

Conclusion We presented a controlled, cross-population test of how graded world-knowledge plausibility shapes relative-clause attachment in Turkish prenominal RC ambiguities. Using normed materials in which bothparses remained pragmatically possible, we showed that native Turkish speakers robustly inte- grated event plausibility in attachment resolution: ...

work page 2026
[6]

Bibliographical References Taylan Akal. 2021. Recency preference in ambigu- ousrelativeclauseattachmentinTurkish.Journal of Language and Linguistic Studies, 17(Special Issue 1):139–159. Gerry Altmann and Mark Steedman. 1988. Inter- action with context during human sentence pro- cessing.Cognition, 30(3):191–238. Diego Alves. 2025. Benchmarking language mod...

work page 2021
[7]

John Hale

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638. John Hale. 2001. A probabilistic earley parser as a psycholinguistic model. InSecond Meeting of the North American Chapter of the Associa- tion for Computational Linguistics on Language Technologies. Nora Hollenstein, Federico Pirovano, Ce Zhang, Lena Jäge...

work page 2001
[8]

InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy

Right for the wrong reasons: Diagnosing syntacticheuristicsinnaturallanguageinference. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics. Ken McRae, Michael J. Spivey-Knowlton, and Michael K. Tanenhaus. 1998. Modeling the influ- ence of the...

work page 1998
[9]

Language, Cognition and Neuroscience

Incremental processing in head-final child language: Online comprehension of relative clauses in turkish-speaking children and adults. Language, Cognition and Neuroscience. Qwen. 2025. Qwen3-30B-A3B-Instruct-2507. Hug- ging Face model card. Accessed 2026-02-09. Laura Ruis, Jacob Andreas, Marco Baroni, Di- ane Bouchacourt, and Brenden M. Lake. 2020. A benc...

work page arXiv 2025

[1] [1]

Turkish relative clauses Turkish relative clauses are typicallyprenominal: the RC precedes the noun it modifies, and rela- tivization is expressed morphologically on the verb, commonly via the nominalizer/participle in RCs. In complex noun phrases with two potential nominal hosts, the same prenominal RC string can in princi- ple modify either the higher n...

work page 2004

[2] [2]

who plausibly does what

Experiments We recruited102native speakers of Turkish. To ensure data quality, we excluded16participants whose response latencies were extremely fast or extremely slow relative to the sample distribution. Specifically, for each participanti, we first com- puted their mean response time across trials: (2) ¯ti = 1 Ni NiX j=1 tij where tij is the response ti...

work page 2018

[3] [3]

Human Experiment Results Figure 2 summarizes attachment choices by world- knowledge (WK) condition. In Low-WK contexts, where plausibility favors low attachment, partici- pants selected high attachment (HA) on 26.3% of trials (panel a), corresponding to a low-attachment (LA) rate of 73.7% (panel b). In High-WK con- texts, where plausibility favors HA, HA ...

work page

[4] [4]

LLMs are good at common- sense

Discussion Our results reveal a sharp human–model disso- ciation in how graded world knowledge is used to resolve Turkish RC attachment ambiguity. Hu- mans show a large, correctly directed plausibility effect: HA rises from 26.3% in Low-WK to 65.2% in High-WK (a +38.9 percentage-point shift), with the complementary LA plot showing the mirror-image decreas...

work page 2024

[5] [5]

Conclusion We presented a controlled, cross-population test of how graded world-knowledge plausibility shapes relative-clause attachment in Turkish prenominal RC ambiguities. Using normed materials in which bothparses remained pragmatically possible, we showed that native Turkish speakers robustly inte- grated event plausibility in attachment resolution: ...

work page 2026

[6] [6]

Bibliographical References Taylan Akal. 2021. Recency preference in ambigu- ousrelativeclauseattachmentinTurkish.Journal of Language and Linguistic Studies, 17(Special Issue 1):139–159. Gerry Altmann and Mark Steedman. 1988. Inter- action with context during human sentence pro- cessing.Cognition, 30(3):191–238. Diego Alves. 2025. Benchmarking language mod...

work page 2021

[7] [7]

John Hale

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638. John Hale. 2001. A probabilistic earley parser as a psycholinguistic model. InSecond Meeting of the North American Chapter of the Associa- tion for Computational Linguistics on Language Technologies. Nora Hollenstein, Federico Pirovano, Ce Zhang, Lena Jäge...

work page 2001

[8] [8]

InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy

Right for the wrong reasons: Diagnosing syntacticheuristicsinnaturallanguageinference. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics. Ken McRae, Michael J. Spivey-Knowlton, and Michael K. Tanenhaus. 1998. Modeling the influ- ence of the...

work page 1998

[9] [9]

Language, Cognition and Neuroscience

Incremental processing in head-final child language: Online comprehension of relative clauses in turkish-speaking children and adults. Language, Cognition and Neuroscience. Qwen. 2025. Qwen3-30B-A3B-Instruct-2507. Hug- ging Face model card. Accessed 2026-02-09. Laura Ruis, Jacob Andreas, Marco Baroni, Di- ane Bouchacourt, and Brenden M. Lake. 2020. A benc...

work page arXiv 2025