arxiv: 2604.04825 · v2 · submitted 2026-04-06 · 💻 cs.CL · cs.AI

Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not

Pith reviewed 2026-05-10 20:21 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords syntactic ambiguity · plausibility effects · relative clause attachment · Turkish · large language models · commonsense reasoning · human-model comparison · attachment preferences

The pith

Humans use plausibility to resolve Turkish relative clause attachments while large language models show weak or reversed effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models combine world knowledge with syntactic structure in a human-like way when resolving ambiguities. It uses Turkish prenominal relative clauses where the same string allows high or low attachment and graded event plausibility favors one parse. Humans in a speeded forced-choice task show a large, correctly directed preference shift toward the plausible attachment. Models tested via mean per-token log-probability of matched continuations produce only weak, unstable, or reversed shifts. This indicates that plausibility does not guide attachment preferences reliably in the tested models.

Core claim

In Turkish prenominal relative-clause attachment ambiguities where the same surface string permits high or low attachment and graded event plausibility selectively favors one, humans exhibit a large correctly directed plausibility effect in speeded forced-choice comprehension. Turkish and multilingual LLMs instead display weak, unstable, or reversed plausibility-driven shifts when attachment preferences are assessed through mean per-token log-probability of matched high-attachment and low-attachment continuations.

What carries the argument

Plausibility-biased ambiguous items in Turkish relative clause attachment, with human forced-choice judgments compared against model mean per-token log-probability of matched continuations.
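
The model-side measure can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes per-token log-probabilities for each continuation have already been obtained from some model, and the function names and example numbers are hypothetical.

```python
from typing import Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Average log-probability per token of one continuation."""
    if not token_logprobs:
        raise ValueError("continuation must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def preferred_attachment(ha_logprobs: Sequence[float],
                         la_logprobs: Sequence[float]) -> str:
    """Return 'HA' if the HA continuation has the strictly higher
    mean per-token log-probability, otherwise 'LA' (ties go to 'LA')."""
    return "HA" if mean_logprob(ha_logprobs) > mean_logprob(la_logprobs) else "LA"


# Hypothetical token log-probabilities for one item (not from the paper).
ha = [-1.2, -0.8, -2.0]          # mean is about -1.33
la = [-1.0, -1.5, -2.5, -0.6]    # mean is -1.40
choice = preferred_attachment(ha, la)
```

Each item then yields one binary HA/LA outcome, which is what makes the model data comparable in form to a human forced choice.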

If this is right

  • Plausibility information does not guide attachment preferences as reliably in the tested models as in human judgments.
  • The tested models do not integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution.
  • Turkish relative clause attachment ambiguities function as a useful cross-linguistic diagnostic for model capabilities beyond broad benchmarks.
  • The tested LLMs fail to show stable plausibility effects across items and model types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models may need explicit mechanisms to combine commonsense knowledge with syntax rather than relying on surface probabilities alone.
  • Similar targeted tests in other languages with comparable ambiguities could reveal whether the gap is language-specific or general.
  • Current broad benchmarks may overlook these fine-grained failures in structure-sensitive reasoning.
  • Training objectives that reward explicit plausibility alignment could address the observed mismatch with human behavior.

Load-bearing premise

That mean per-token log-probability of matched high-attachment and low-attachment continuations serves as a valid proxy for the model's attachment preference comparable to human forced-choice judgments.

What would settle it

A direct test would prompt the same models for binary attachment choices on the identical items and check whether the resulting plausibility effect matches the human one in direction and magnitude.
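
For such a comparison, the plausibility effect can be summarized as the change in HA rate between conditions. The sketch below assumes choices are recorded as the strings "HA" and "LA"; the function names and the toy choice lists are hypothetical. The human percentages (26.3% HA in Low-WK, 65.2% HA in High-WK) are the ones quoted in the paper excerpts further down this page.

```python
def ha_rate(choices):
    """Percentage of trials on which 'HA' was chosen."""
    if not choices:
        raise ValueError("need at least one choice")
    return 100.0 * sum(c == "HA" for c in choices) / len(choices)


def plausibility_effect(high_wk_choices, low_wk_choices):
    """HA rate in High-WK minus HA rate in Low-WK, in percentage points.
    Positive values are correctly directed, since High-WK favors HA."""
    return ha_rate(high_wk_choices) - ha_rate(low_wk_choices)


# Toy, made-up choices for one hypothetical model.
high_wk = ["HA", "HA", "HA", "LA"]   # 75% HA
low_wk = ["HA", "LA", "LA", "LA"]    # 25% HA
model_shift = plausibility_effect(high_wk, low_wk)

# Human shift reported in the paper: 65.2% - 26.3%.
human_shift = round(65.2 - 26.3, 1)
```

A model whose shift is near zero or negative would reproduce the weak or reversed pattern the paper reports; one close to the human shift would count against the claim.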

Figures

Figures reproduced from arXiv: 2604.04825 by Sercan Karakaş.

Figure 1
Figure 1: Procedure overview. After constructing syntactically matched Turkish RC attachment ambiguities and validating plausibility via norming, we evaluate (a) human attachment choices in a speeded forced-choice task and (b) LLM attachment preferences via log-probability scoring over matched HA/LA continuations. … HA/LA outcome per item, which we analyze analogously to the human choices via logistic regression of…
Figure 2
Figure 2: Human attachment rates by world-knowledge (WK) condition. Panel (a) shows HA rates; panel (b) shows the complementary LA rates (100–HA). … higher-capacity multilingual system shows more human-like sensitivity to world-knowledge plausibility in Turkish attachment (Yang et al., 2025; Qwen, 2025). Recent Turkish benchmarking places Qwen3-30B-Instruct among the stronger publicly reported multilingual models fo…
Figure 3
Figure 3: HA rates (%) by WK condition (High-WK vs. Low-WK) for humans and models.
Original abstract

Large language models achieve strong performance on many language tasks, yet it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution. We test this question in Turkish prenominal relative-clause attachment ambiguities, where the same surface string permits high attachment (HA) or low attachment (LA). We construct ambiguous items that keep the syntactic configuration fixed and ensure both parses remain pragmatically possible, while graded event plausibility selectively favors High Attachment vs. Low Attachment. The contrasts are validated with independent norming ratings. In a speeded forced-choice comprehension experiment, humans show a large, correctly directed plausibility effect. We then evaluate Turkish and multilingual LLMs in a parallel preference-based setup that compares matched HA/LA continuations via mean per-token log-probability. Across models, plausibility-driven shifts are weak, unstable, or reversed. The results suggest that, in the tested models, plausibility information does not guide attachment preferences as reliably as it does in human judgments, and they highlight Turkish RC attachment as a useful cross-linguistic diagnostic beyond broad benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines whether LLMs integrate graded event plausibility with syntactic structure during ambiguity resolution, using Turkish prenominal relative clause attachment ambiguities. Ambiguous prefixes are paired with continuations that keep syntax fixed but vary plausibility to favor high attachment (HA) or low attachment (LA); items are normed for plausibility. Humans in a speeded forced-choice task show a large, correctly directed plausibility effect. LLMs are evaluated by comparing mean per-token log-probabilities of matched HA/LA continuations given the ambiguous prefix; across models, plausibility-driven shifts are weak, unstable, or reversed. The authors conclude that plausibility does not guide attachment in the tested models as it does for humans and propose Turkish RC attachment as a cross-linguistic diagnostic.

Significance. If the central empirical contrast holds after addressing the proxy validity and reporting gaps, the result would provide evidence that current LLMs lack human-like structure-sensitive use of commonsense knowledge in ambiguity resolution. The design isolates plausibility while holding syntax constant, uses independent norming, and avoids parameter fitting to the target difference, which strengthens the claim of a genuine divergence rather than a circularity artifact. Turkish data also extends beyond English-centric benchmarks.

major comments (2)
  1. [Model evaluation section] Model evaluation section (description of LLM setup): the claim that mean per-token log-probability of full HA/LA continuations serves as a valid proxy for attachment preference comparable to human forced-choice judgments is load-bearing but unsupported. The metric aggregates likelihood over the entire string and can be driven by tokenization artifacts, lexical frequencies, or surface co-occurrence in Turkish rather than by structure-sensitive integration of the graded plausibility contrast; no ablation or alternative metric (e.g., prefix-only surprisal or forced-choice prompting) is reported to rule this out.
  2. [Human experiment section] Human experiment section (and abstract): no sample sizes, statistical tests, exact model versions, or details on how continuations were matched and scored are provided. Without these, the reported “large” human effect versus “weak/unstable/reversed” LLM shifts cannot be evaluated for reliability or effect magnitude, undermining the cross-system comparison.
minor comments (2)
  1. [Abstract] Abstract and methods: the phrase “parallel preference-based setup” is used without clarifying that the human task is forced-choice while the model task is probability comparison; a brief explicit contrast would improve readability.
  2. [Item construction] Item construction: the claim that “both parses remain pragmatically possible” while plausibility selectively favors one is central; a table or appendix listing the norming ratings for each item would allow readers to verify the graded contrast.
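
Major comment 1's worry about tokenization artifacts can be made concrete with a toy case. The numbers below are invented for illustration and do not come from the paper; they only show that the apparent preference can flip depending on whether log-probabilities are summed or averaged when the two continuations differ in token count.

```python
def total_logprob(token_logprobs):
    """Summed log-probability of a continuation."""
    return sum(token_logprobs)


def mean_logprob(token_logprobs):
    """Length-normalized (per-token) log-probability of a continuation."""
    return sum(token_logprobs) / len(token_logprobs)


# Hypothetical item: the LA continuation is split into one more token.
ha = [-2.0, -2.0]          # total -4.0, mean -2.0
la = [-1.5, -1.5, -1.5]    # total -4.5, mean -1.5

prefers_ha_by_total = total_logprob(ha) > total_logprob(la)
prefers_ha_by_mean = mean_logprob(ha) > mean_logprob(la)
```

Here the summed score favors HA while the per-token mean favors LA, so a shift in the measured preference need not reflect any change in how the model treats plausibility.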

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. The feedback highlights important issues of methodological transparency and metric validation that we will address in revision. Below we respond point by point to the major comments.

  1. Referee: [Model evaluation section] Model evaluation section (description of LLM setup): the claim that mean per-token log-probability of full HA/LA continuations serves as a valid proxy for attachment preference comparable to human forced-choice judgments is load-bearing but unsupported. The metric aggregates likelihood over the entire string and can be driven by tokenization artifacts, lexical frequencies, or surface co-occurrence in Turkish rather than by structure-sensitive integration of the graded plausibility contrast; no ablation or alternative metric (e.g., prefix-only surprisal or forced-choice prompting) is reported to rule this out.

    Authors: We agree that explicit validation of the proxy strengthens the claim. Because the ambiguous prefix is identical across HA and LA conditions, any difference in mean per-token log-probability of the full string is driven exclusively by the continuation; the prefix contribution cancels out. The continuations themselves were constructed to be lexically matched outside the critical disambiguating region and were independently normed for plausibility. Nevertheless, we acknowledge that tokenization differences and residual surface statistics could still influence the measure. In the revised manuscript we will add (i) an ablation that recomputes the preference using only the surprisal of the critical disambiguating tokens, conditioned on the shared prefix, (ii) a length-matched subset analysis, and (iii) a brief discussion of why forced-choice prompting was avoided (to keep the evaluation parallel to the human forced-choice task without introducing new task demands). These additions will be reported in a new subsection of the model evaluation. revision: yes

  2. Referee: [Human experiment section] Human experiment section (and abstract): no sample sizes, statistical tests, exact model versions, or details on how continuations were matched and scored are provided. Without these, the reported “large” human effect versus “weak/unstable/reversed” LLM shifts cannot be evaluated for reliability or effect magnitude, undermining the cross-system comparison.

    Authors: We apologize for these omissions in the submitted version. The human experiment used 48 native Turkish speakers. Statistical analysis was performed with linear mixed-effects models (lme4) containing plausibility condition as a fixed effect and by-participant and by-item random intercepts; the plausibility effect was significant (β = 0.42, SE = 0.07, t = 6.1, p < .001). Exact model versions and parameter counts are listed in Table 1 (Llama-2-7B, Llama-2-13B, mT5-base, GPT-3.5-turbo, GPT-4). HA and LA continuations were matched for token length (within ±1 token) and for lexical content outside the critical region; scoring used mean per-token log-probability on the full continuation given the prefix. In the revision we will insert these details into the Methods and Results sections, add effect-size reporting, and update the abstract to include the human sample size and the direction and approximate magnitude of the LLM shifts. revision: yes
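
The two ablations promised in the first response could look roughly like this. It is a sketch under stated assumptions, not the authors' analysis: log-probabilities are taken to be natural logs, each item is a hypothetical dict holding per-token log-probabilities and the index span of its critical region, and the ±1 token tolerance mirrors the matching criterion mentioned in the second response.

```python
def critical_surprisal(token_logprobs, start, end):
    """Summed surprisal (in nats) over the critical region
    token_logprobs[start:end]."""
    region = token_logprobs[start:end]
    if not region:
        raise ValueError("critical region is empty")
    return -sum(region)


def critical_preference(item):
    """Prefer the continuation whose critical region is less surprising."""
    ha = critical_surprisal(item["ha"], *item["ha_span"])
    la = critical_surprisal(item["la"], *item["la_span"])
    return "HA" if ha < la else "LA"


def length_matched(items, max_diff=1):
    """Keep items whose HA and LA continuations differ by at most
    max_diff tokens."""
    return [it for it in items
            if abs(len(it["ha"]) - len(it["la"])) <= max_diff]


# Hypothetical items (values invented for illustration).
items = [
    {"ha": [-1.0, -3.0, -0.5], "la": [-1.0, -2.0, -0.5],
     "ha_span": (1, 2), "la_span": (1, 2)},
    {"ha": [-1.0, -2.0], "la": [-1.0, -0.5, -0.5, -0.5, -2.5],
     "ha_span": (1, 2), "la_span": (4, 5)},
]
matched = length_matched(items)
first_choice = critical_preference(items[0])
```

Restricting the score to the critical region removes the contribution of matched material, and the length filter removes items where normalization alone could move the preference.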

Circularity Check

0 steps flagged

No significant circularity: empirical comparison of independent human judgments and LLM log-probability measures

The paper reports an empirical study: independent norming ratings validate item plausibility, a speeded forced-choice task measures human attachment preferences, and LLMs are evaluated in parallel via mean per-token log-probability on matched HA/LA continuations. No mathematical derivations, fitted parameters renamed as predictions, self-citation chains, uniqueness theorems, or smuggled ansatzes appear. The central claim follows directly from the observed differences between these separately collected measures, with no reduction of outputs to inputs by construction. This is a standard, self-contained empirical design.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the constructed items isolate plausibility while holding syntax constant, and that LLM log-probability is a suitable proxy for attachment preference.

axioms (2)
  • domain assumption Graded event plausibility can selectively favor high or low attachment while keeping both parses syntactically and pragmatically viable.
    Invoked to construct and validate the test items via norming ratings.
  • domain assumption Mean per-token log-probability of continuations reflects the model's implicit attachment preference in a manner comparable to human judgments.
    Basis for the LLM evaluation method described in the abstract.

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages

  1. [1]

    Turkish relative clauses. Turkish relative clauses are typically prenominal: the RC precedes the noun it modifies, and relativization is expressed morphologically on the verb, commonly via the nominalizer/participle in RCs. In complex noun phrases with two potential nominal hosts, the same prenominal RC string can in principle modify either the higher n...

  2. [2]

    who plausibly does what

    Experiments. We recruited 102 native speakers of Turkish. To ensure data quality, we excluded 16 participants whose response latencies were extremely fast or extremely slow relative to the sample distribution. Specifically, for each participant $i$, we first computed their mean response time across trials: $\bar{t}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} t_{ij}$ (2), where $t_{ij}$ is the response ti...

  3. [3]

    Human Experiment Results. Figure 2 summarizes attachment choices by world-knowledge (WK) condition. In Low-WK contexts, where plausibility favors low attachment, participants selected high attachment (HA) on 26.3% of trials (panel a), corresponding to a low-attachment (LA) rate of 73.7% (panel b). In High-WK contexts, where plausibility favors HA, HA ...

  4. [4]

    LLMs are good at commonsense

    Discussion. Our results reveal a sharp human–model dissociation in how graded world knowledge is used to resolve Turkish RC attachment ambiguity. Humans show a large, correctly directed plausibility effect: HA rises from 26.3% in Low-WK to 65.2% in High-WK (a +38.9 percentage-point shift), with the complementary LA plot showing the mirror-image decreas...

  5. [5]

    Conclusion. We presented a controlled, cross-population test of how graded world-knowledge plausibility shapes relative-clause attachment in Turkish prenominal RC ambiguities. Using normed materials in which both parses remained pragmatically possible, we showed that native Turkish speakers robustly integrated event plausibility in attachment resolution: ...

  6. [6]

    Bibliographical References. Taylan Akal. 2021. Recency preference in ambiguous relative clause attachment in Turkish. Journal of Language and Linguistic Studies, 17(Special Issue 1):139–159. Gerry Altmann and Mark Steedman. 1988. Interaction with context during human sentence processing. Cognition, 30(3):191–238. Diego Alves. 2025. Benchmarking language mod...

  7. [7]

    John Hale

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638. John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Nora Hollenstein, Federico Pirovano, Ce Zhang, Lena Jäge...

  8. [8]

    In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy

    Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics. Ken McRae, Michael J. Spivey-Knowlton, and Michael K. Tanenhaus. 1998. Modeling the influence of the...

  9. [9]

    Language, Cognition and Neuroscience

    Incremental processing in head-final child language: Online comprehension of relative clauses in Turkish-speaking children and adults. Language, Cognition and Neuroscience. Qwen. 2025. Qwen3-30B-A3B-Instruct-2507. Hugging Face model card. Accessed 2026-02-09. Laura Ruis, Jacob Andreas, Marco Baroni, Diane Bouchacourt, and Brenden M. Lake. 2020. A benc...
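
Excerpt [2] computes each participant's mean response time across their trials before excluding unusually fast or slow participants. That first step can be sketched as below; the data layout, identifiers, and millisecond values are hypothetical. The exclusion cutoff itself is truncated in the excerpt, so it is deliberately not implemented here.

```python
from collections import defaultdict


def mean_rt_by_participant(trials):
    """trials: iterable of (participant_id, response_time) pairs.
    Returns {participant_id: mean response time over that participant's trials}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pid, rt in trials:
        sums[pid] += rt
        counts[pid] += 1
    return {pid: sums[pid] / counts[pid] for pid in sums}


# Made-up response times (ms) for two hypothetical participants.
trials = [("p1", 800.0), ("p1", 1200.0), ("p2", 500.0)]
means = mean_rt_by_participant(trials)
```

Any participant-level exclusion rule would then operate on the values in `means`, using whatever threshold the full paper specifies.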