arxiv: 2604.24665 · v1 · submitted 2026-04-27 · 💻 cs.CL · cs.AI

Recognition: unknown

Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation

Sercan Karaka\c{s} , Yusuf \c{S}im\c{s}ek

Authors on Pith no claims yet

Pith reviewed 2026-05-08 03:37 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords Turkish evidentialitysource trustworthinessevidential morphologyLLM benchmarkinghuman-AI comparisoncloze productiontrust manipulation

0 comments

The pith

Turkish speakers prefer -DI for high-trust sources and -mIs for low-trust ones, but LLMs show unstable or reversed patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether source trustworthiness affects Turkish evidential morphology in the past tense. Native speakers in cloze production tasks consistently choose more -DI under high-trust conditions and more -mIs under low-trust conditions, even after sensitivity checks. Ten LLMs given the same materials across open, explicit, and forced-choice prompts display model-specific and prompt-dependent behavior that is often weak, reversed, or dominated by base-rate suffix preferences and compliance issues. The contrast supplies evidence that Turkish evidential choice encodes trust and commitment levels while exposing a gap in how current models handle source-sensitive reasoning.

Core claim

Native Turkish speakers produce relatively more -DI in high-trust cloze contexts and relatively more -mIs in low-trust cloze contexts, with the trust effect stable across analyses. Ten evaluated LLMs exhibit highly variable, often unstable trust-consistent shifts that are frequently overshadowed by output-compliance problems and strong base-rate suffix preferences.

What carries the argument

The Turkish past-tense evidential contrast between -DI and -mIs, with perceived source reliability manipulated as the sole variable in controlled cloze production tasks.

If this is right

Human evidential production in Turkish tracks perceived source trustworthiness.
The human pattern supports a trust- and commitment-based account of Turkish evidentiality.
LLM responses remain prompt- and model-dependent and rarely match the human trust effect.
Output compliance and base-rate preferences often override any trust sensitivity in LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evidential morphology in other languages may similarly encode social trust or commitment information.
Targeted training on pragmatic inferences about source reliability could narrow the observed human-LLM gap.
The same trust manipulation could be tested in comprehension rather than production tasks to check robustness.
Base-rate suffix biases in LLMs may reflect training data distributions rather than evidential reasoning.

Load-bearing premise

The controlled cloze contexts isolate perceived source reliability as the only manipulated factor without confounds from sentence structure, lexical choice, or participant expectations.

What would settle it

A replication in which Turkish speakers show no reliable difference in -DI versus -mIs choice between high-trust and low-trust versions of the same cloze items would falsify the reported trust effect.

Figures

Figures reproduced from arXiv: 2604.24665 by Sercan Karaka\c{s}, Yusuf \c{S}im\c{s}ek.

**Figure 1.** Figure 1: Information-source marking, adapted from typological overviews of evidential systems (Willett, 1988; Aikhenvald, 2004). slake, 2005; Kornfilt, 1997). A well-known contrast is between the suffix -DI, traditionally associated with direct evidence or stronger speaker commitment, and the suffix -mIş, which is commonly associated with indirect evidence, such as inference or report (Aksu-Koç and Slobin, 1986; … view at source ↗

**Figure 2.** Figure 2: Human production (strict coding): within view at source ↗

**Figure 3.** Figure 3: Gemma-2-9B-IT-TR (Experiment II): high–low difference in trust-span attention mass (query token: last_prompt_token). The pairlevel effect is weak and inconsistent (∆pair mean = −0.0179, 95% bootstrap CI [−0.0513, 0.0170]; P(∆pair > 0) = 46%), so we treat this as a qualitative diagnostic rather than strong evidence of robust trust encoding view at source ↗

**Figure 4.** Figure 4: Gemma-3-27B-IT (Experiment II): high– low difference in trust-span attention mass (query token: last_prompt_token). The map shows localized modulation concentrated in a small cluster of mid-to-late layers and heads, consistent with the exploratory analysis in Section 7. 18 view at source ↗

read the original abstract

This paper investigates whether source trustworthiness shapes Turkish evidential morphology and whether large language models (LLMs) track this sensitivity. We study the past-domain contrast between -DI and -mIs in controlled cloze contexts where the information source is overtly external, while only its perceived reliability is manipulated (High-Trust vs. Low-Trust). In a human production experiment, native speakers of Turkish show a robust trust effect: High-Trust contexts yield relatively more -DI, whereas Low-Trust contexts yield relatively more -mIs, with the pattern remaining stable across sensitivity analyses. We then evaluate 10 LLMs in three prompting paradigms (open gap-fill, explicit past-tense gap-fill, and forced-choice A/B selection). LLM behavior is highly model- and prompt-dependent: some models show weak or local trust-consistent shifts, but effects are generally unstable, often reversed, and frequently overshadowed by output-compliance problems and strong base-rate suffix preferences. The results provide new evidence for a trust-/commitment-based account of Turkish evidentiality and reveal a clear human-LLM gap in source-sensitive evidential reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper investigates source trustworthiness effects on Turkish evidential morphology, specifically the -DI vs. -mIs contrast in past-tense contexts. In a human cloze production experiment with overtly external sources, native Turkish speakers produce relatively more -DI in High-Trust conditions and more -mIs in Low-Trust conditions; this pattern is reported as robust and stable across sensitivity analyses. The study then benchmarks 10 LLMs across three prompting paradigms (open gap-fill, explicit past-tense gap-fill, forced-choice A/B), finding highly model- and prompt-dependent behavior that is often inconsistent with the human pattern, frequently reversed, and dominated by base-rate suffix preferences and output-compliance failures. The work is framed as supporting a trust-/commitment-based account of Turkish evidentiality while highlighting a human-LLM gap in source-sensitive reasoning.

Significance. If the human trust effect survives rigorous controls for item-level confounds, the results would supply direct empirical support for commitment-sensitive accounts of evidential choice and offer a useful benchmark for evaluating whether LLMs can track source reliability in morphologically rich languages. The systematic comparison of multiple models and prompting regimes is a strength, as is the focus on a non-English language where evidential distinctions are grammatically encoded.

major comments (3)

[Human Experiment / Results] The headline human result (High-Trust favoring -DI, Low-Trust favoring -mIs) is load-bearing for the central claim, yet the manuscript supplies no sample sizes, statistical tests, effect sizes, or exclusion criteria. Without these, it is impossible to assess whether the reported robustness is statistically reliable or driven by unmentioned confounds or base-rate preferences.
[Methods (Cloze Task Construction)] The cloze-task design is the weakest link in the central empirical claim. High-Trust and Low-Trust items may differ systematically in lexical choice, verb semantics (perceptual vs. inferential), or syntactic complexity, any of which could independently bias -DI/-mIs selection. The paper must provide item-level matching statistics, full stimulus lists, or explicit covariate tests in the sensitivity analyses to demonstrate that trust is the sole manipulated variable.
[Results / Sensitivity Analyses] The abstract states that the human pattern 'remains stable across sensitivity analyses,' but neither the covariates tested nor the exact analytic procedures are described. This omission prevents readers from evaluating whether plausible alternative explanations (e.g., information-type or structural biases) have been ruled out.

minor comments (2)

[Abstract] The abstract would benefit from a concise statement of participant N, key statistical results, and the number of items per condition to allow immediate assessment of the human findings.
[LLM Evaluation] LLM evaluation sections should report exact prompt templates and output-compliance rates for each model-prompt combination, as these appear to be major drivers of the observed inconsistencies.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful and constructive review. The comments correctly identify areas where the current manuscript lacks sufficient detail on the human experiment, which limits readers' ability to evaluate the reliability of the reported trust effect. We will revise the manuscript to provide the requested information on sample characteristics, statistical procedures, stimulus properties, and sensitivity analyses. Our responses to each major comment are below.

read point-by-point responses

Referee: [Human Experiment / Results] The headline human result (High-Trust favoring -DI, Low-Trust favoring -mIs) is load-bearing for the central claim, yet the manuscript supplies no sample sizes, statistical tests, effect sizes, or exclusion criteria. Without these, it is impossible to assess whether the reported robustness is statistically reliable or driven by unmentioned confounds or base-rate preferences.

Authors: We agree that these details must be reported for the results to be properly evaluated. The revised manuscript will include a dedicated subsection on participants and analysis methods that reports the sample size, the statistical tests (mixed-effects logistic regression with appropriate random effects), effect sizes, and the pre-specified exclusion criteria. We will also add a data availability statement with access to the anonymized dataset and analysis code. revision: yes
Referee: [Methods (Cloze Task Construction)] The cloze-task design is the weakest link in the central empirical claim. High-Trust and Low-Trust items may differ systematically in lexical choice, verb semantics (perceptual vs. inferential), or syntactic complexity, any of which could independently bias -DI/-mIs selection. The paper must provide item-level matching statistics, full stimulus lists, or explicit covariate tests in the sensitivity analyses to demonstrate that trust is the sole manipulated variable.

Authors: This is a valid concern. Although the items were constructed with an attempt to balance verb types and structures, the revised version will append the complete stimulus list, report item-level matching statistics on lexical frequency, length, and semantic category, and include additional regression models that treat these properties as covariates to confirm the trust manipulation remains the primary predictor. revision: yes
Referee: [Results / Sensitivity Analyses] The abstract states that the human pattern 'remains stable across sensitivity analyses,' but neither the covariates tested nor the exact analytic procedures are described. This omission prevents readers from evaluating whether plausible alternative explanations (e.g., information-type or structural biases) have been ruled out.

Authors: We will expand the description of the sensitivity analyses in the revision. A new subsection will explicitly list the covariates examined (including lexical frequency, verb semantic type, and syntactic complexity), detail the analytic procedures (subset analyses and covariate-adjusted models), and report the outcomes showing that the trust effect remains stable. This will allow direct assessment of whether alternative explanations have been addressed. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of human and LLM experiments

full rationale

The paper presents results from a human production experiment using controlled cloze contexts and evaluations of 10 LLMs across prompting paradigms. No mathematical derivations, first-principles predictions, fitted parameters, or self-citation chains are invoked to support the central claims. The trust effect in Turkish evidential morphology is reported as an observed data pattern stable across sensitivity analyses, with no reduction of any result to its own inputs by construction. This is a standard empirical study self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The study relies on standard methods from experimental linguistics and NLP evaluation without introducing new theoretical axioms, free parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5504 in / 1199 out tokens · 83460 ms · 2026-05-08T03:37:24.256099+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 6 canonical work pages · 2 internal anchors

[1]

Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation

Introduction Evidentialityreferstothelinguisticencodingofinfor- mation source, that is, how a speaker indicates the basis on which a proposition is presented (e.g., di- rect perception, inference, or report) (Willett, 1988; Dendale and Tasmowski, 2001; de Haan, 2001; Plungian, 2001; Aikhenvald, 2004; Boye, 2012; Ünal and Papafragou, 2020). In many languag...

work page internal anchor Pith review Pith/arXiv arXiv 1988
[2]

directness

Related Work Turkish evidentiality: traditional descriptive and grammatical accounts.Turkish has long occupied a central place in the evidentiality lit- erature because its past-domain morphology is closely tied to distinctions in information source and speaker stance (Underhill, 1976; Aksu-Koç and Slobin, 1986; Johanson, 2003; Kornfilt, 1997; Göksel and ...

1976
[3]

First, we designed a con- trolled cloze-style task that can be administered to both human participants and LLMs

Methods To examine how Turkish evidential morphology re- sponds to contextualtrustworthiness, we used two complementary experimental paradigms with two corresponding datasets. First, we designed a con- trolled cloze-style task that can be administered to both human participants and LLMs. Second, we constructed a larger dataset for LLM-only evalua- tion in...
[4]

-DI" (7), and the other is the word conjugated with

Experiments 4.1. Human Experiment The human experiment received prior approval from the University of Chicago Institutional Review Board (IRB26-0198). All participants provided in- formed consent before participation. We recruited 75 unique participants who self-identified as native speakers of Turkish, were at least 18 years old, and were residing in Tur...

2018
[5]

Production: trust robustly shifts-DI vs.-mIş Table 1 shows strict-coding counts for critical tri- als (High+Low; fillers excluded)

Human experiment results 5.1. Production: trust robustly shifts-DI vs.-mIş Table 1 shows strict-coding counts for critical tri- als (High+Low; fillers excluded). Descriptively, High-trust contexts produce moreDIcompletions, whereasLow-trustcontextsproducemoremIşcom- pletions. TheOthercategory is similar across conditions (High: 28.4%; Low: 29.5%), suggest...
[6]

= 0.518andmIş= 0.482. 4 DI mIş 0 0.2 0.4 0.6 0.8 1 Proportion High trust DI mIş 0 0.2 0.4 0.6 0.8 1 Proportion Low trust Figure 2: Human production (strict coding): within- condition proportions of-DIvs.-mIşamong DI/mIş responses only. Two robustness checks yield the same qualita- tive pattern. (i) Under lenient last-token coding (al- lowing two-word VP c...
[7]

Each item was presented to the model three times, and the final label was determined by majority vote across the three outputs

The First Experiment with LLMs Experiment I evaluates LLM behavior in a prompting-based gap-fill task using 200 trust- manipulated Turkish cloze items. Each item was presented to the model three times, and the final label was determined by majority vote across the three outputs. Table 2 summarizes the model-level results on the prompted task, reporting (i...
[8]

Unlike Experiment I, the prompt here explicitly instructed models to produce apast-tensecompletion

The Second Experiment with LLMs Experiment II evaluates LLM behavior in anex- plicit past-tensegeneration setup using the same 200 trust-manipulated Turkish cloze items. Unlike Experiment I, the prompt here explicitly instructed models to produce apast-tensecompletion. As in Experiment I, each item was sampled three times. To obtain a single interpretable...
[9]

This design is more constrained than Experiments I–II because it removes open-ended lexical generation and directly targets the evidential choice itself

The Third Experiment with LLMs Experiment III evaluates LLM behavior in aforced- choiceprompting setup, in which models are ex- plicitly asked to choose between two candidate completions: a-DIform (option A) and a-mIşform (option B). This design is more constrained than Experiments I–II because it removes open-ended lexical generation and directly targets...
[10]

di- rectness

Discussion This study makes theoretical and methodological contributions to Turkish evidentiality and its evalu- ation in LLMs. Human results show a clear trust effect: more credible external sources favor-DI, while less credible sources favor-mIş(Section 5; Table 1; Figure 2). In contrast, current LLMs do not reliably reproduce this pattern, and their ap...

2016
[11]

Conclusion Thispaperinvestigatedhowsourcetrustworthiness shapes Turkish evidential choice in the-DI/-mIş contrast, and whether current LLMs track this sen- sitivity. In a controlled human cloze experiment, we found a robust and replicable trust effect: partic- ipants produced relatively more-DIin High-Trust contexts and relatively more-mIşin Low-Trust con...
[12]

doi:10.48550/arXiv.2504.17550 , abstract =

Bibliographical References Alexandra Y. Aikhenvald. 2004.Evidentiality. Ox- ford University Press, Oxford. Ayhan Aksu-Koç and Dan I. Slobin. 1986. A psy- chological account of the development and use of evidentials in Turkish. In Wallace Chafe and Johanna Nichols, editors,Evidentiality: The Lin- guistic Coding of Epistemology, pages 159–167. Ablex Publish...

work page arXiv 2004
[13]

InProceedingsofthe63rdAnnualMeetingofthe Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 22816–22838, Vi- enna, Austria

TUMLU: A unified and native language understanding benchmark for Turkic languages. InProceedingsofthe63rdAnnualMeetingofthe Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 22816–22838, Vi- enna, Austria. Association for Computational Lin- guistics. Mete Ismayilzada, Defne Circi, Jonne Sälevä, Hale Sirin, Abdullatif Köksal, Bhuwa...

work page arXiv 2025
[14]

AgentHallu: Benchmarking automated hallucination attribution of LLM-based agents.arXiv preprint arXiv:2601.06818, 2026

TruthfulQA: Measuring how models mimic human falsehoods. InProceedings of the 60th Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics. Xuannan Liu, Xiao Yang, Zekun Li, Peipei Li, and Ran He. 2026. Agenthallu: Benchmarking au- tomated halluc...

work page arXiv 2026
[15]

InProceedings of the 2023 Con- ference on Empirical Methods in Natural Lan- guage Processing, pages 9004–9017, Singa- pore

SelfCheckGPT: Zero-resource black-box hallucination detection for generative large lan- guage models. InProceedings of the 2023 Con- ference on Empirical Methods in Natural Lan- guage Processing, pages 9004–9017, Singa- pore. Association for Computational Linguistics. Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness an...

2023
[16]

WebGPT: Browser-assisted question-answering with human feedback

WebGPT: Browser-assisted question- answering with human feedback.arXiv preprint arXiv:2112.09332. neuralwork. n.d. gemma-2-9b-it-tr. Hugging Face model card. Accessed: 2026-01-20. NovusResearch. n.d. Novus-7b-tr_v1. Hugging Face model card. Accessed: 2026-01-20. ocaklisemih. n.d. gpt-oss-20b-Turkish-astrology- gguf. Hugging Face model card. Accessed: 2026...

work page internal anchor Pith review arXiv 2026
[17]

__________

Mecellem models: Turkish models trained from scratch and continually pre-trained for the legal domain.arXiv preprint arXiv:2601.16018. Elif Ecem Umutlu, Ayse Aysu Cengiz, Ahmet Kaan Sever, Seyma Erdem, Burak Aytan, Busra Tufan, Abdullah Topraksoy, Esra Darıcı, and Cagri Tora- man. 2025. Evaluating the quality of benchmark datasets for low-resource languag...

work page arXiv 2025