Recognition: unknown
Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation
Pith reviewed 2026-05-08 03:37 UTC · model grok-4.3
The pith
Turkish speakers prefer -DI for high-trust sources and -mIs for low-trust ones, but LLMs show unstable or reversed patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Native Turkish speakers produce relatively more -DI in high-trust cloze contexts and relatively more -mIs in low-trust cloze contexts, with the trust effect stable across analyses. Ten evaluated LLMs exhibit highly variable, often unstable trust-consistent shifts that are frequently overshadowed by output-compliance problems and strong base-rate suffix preferences.
What carries the argument
The Turkish past-tense evidential contrast between -DI and -mIs, with perceived source reliability manipulated as the sole variable in controlled cloze production tasks.
If this is right
- Human evidential production in Turkish tracks perceived source trustworthiness.
- The human pattern supports a trust- and commitment-based account of Turkish evidentiality.
- LLM responses remain prompt- and model-dependent and rarely match the human trust effect.
- Output compliance and base-rate preferences often override any trust sensitivity in LLMs.
Where Pith is reading between the lines
- Evidential morphology in other languages may similarly encode social trust or commitment information.
- Targeted training on pragmatic inferences about source reliability could narrow the observed human-LLM gap.
- The same trust manipulation could be tested in comprehension rather than production tasks to check robustness.
- Base-rate suffix biases in LLMs may reflect training data distributions rather than evidential reasoning.
Load-bearing premise
The controlled cloze contexts isolate perceived source reliability as the only manipulated factor without confounds from sentence structure, lexical choice, or participant expectations.
What would settle it
A replication in which Turkish speakers show no reliable difference in -DI versus -mIs choice between high-trust and low-trust versions of the same cloze items would falsify the reported trust effect.
Figures
read the original abstract
This paper investigates whether source trustworthiness shapes Turkish evidential morphology and whether large language models (LLMs) track this sensitivity. We study the past-domain contrast between -DI and -mIs in controlled cloze contexts where the information source is overtly external, while only its perceived reliability is manipulated (High-Trust vs. Low-Trust). In a human production experiment, native speakers of Turkish show a robust trust effect: High-Trust contexts yield relatively more -DI, whereas Low-Trust contexts yield relatively more -mIs, with the pattern remaining stable across sensitivity analyses. We then evaluate 10 LLMs in three prompting paradigms (open gap-fill, explicit past-tense gap-fill, and forced-choice A/B selection). LLM behavior is highly model- and prompt-dependent: some models show weak or local trust-consistent shifts, but effects are generally unstable, often reversed, and frequently overshadowed by output-compliance problems and strong base-rate suffix preferences. The results provide new evidence for a trust-/commitment-based account of Turkish evidentiality and reveal a clear human-LLM gap in source-sensitive evidential reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates source trustworthiness effects on Turkish evidential morphology, specifically the -DI vs. -mIs contrast in past-tense contexts. In a human cloze production experiment with overtly external sources, native Turkish speakers produce relatively more -DI in High-Trust conditions and more -mIs in Low-Trust conditions; this pattern is reported as robust and stable across sensitivity analyses. The study then benchmarks 10 LLMs across three prompting paradigms (open gap-fill, explicit past-tense gap-fill, forced-choice A/B), finding highly model- and prompt-dependent behavior that is often inconsistent with the human pattern, frequently reversed, and dominated by base-rate suffix preferences and output-compliance failures. The work is framed as supporting a trust-/commitment-based account of Turkish evidentiality while highlighting a human-LLM gap in source-sensitive reasoning.
Significance. If the human trust effect survives rigorous controls for item-level confounds, the results would supply direct empirical support for commitment-sensitive accounts of evidential choice and offer a useful benchmark for evaluating whether LLMs can track source reliability in morphologically rich languages. The systematic comparison of multiple models and prompting regimes is a strength, as is the focus on a non-English language where evidential distinctions are grammatically encoded.
major comments (3)
- [Human Experiment / Results] The headline human result (High-Trust favoring -DI, Low-Trust favoring -mIs) is load-bearing for the central claim, yet the manuscript supplies no sample sizes, statistical tests, effect sizes, or exclusion criteria. Without these, it is impossible to assess whether the reported robustness is statistically reliable or driven by unmentioned confounds or base-rate preferences.
- [Methods (Cloze Task Construction)] The cloze-task design is the weakest link in the central empirical claim. High-Trust and Low-Trust items may differ systematically in lexical choice, verb semantics (perceptual vs. inferential), or syntactic complexity, any of which could independently bias -DI/-mIs selection. The paper must provide item-level matching statistics, full stimulus lists, or explicit covariate tests in the sensitivity analyses to demonstrate that trust is the sole manipulated variable.
- [Results / Sensitivity Analyses] The abstract states that the human pattern 'remains stable across sensitivity analyses,' but neither the covariates tested nor the exact analytic procedures are described. This omission prevents readers from evaluating whether plausible alternative explanations (e.g., information-type or structural biases) have been ruled out.
minor comments (2)
- [Abstract] The abstract would benefit from a concise statement of participant N, key statistical results, and the number of items per condition to allow immediate assessment of the human findings.
- [LLM Evaluation] LLM evaluation sections should report exact prompt templates and output-compliance rates for each model-prompt combination, as these appear to be major drivers of the observed inconsistencies.
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review. The comments correctly identify areas where the current manuscript lacks sufficient detail on the human experiment, which limits readers' ability to evaluate the reliability of the reported trust effect. We will revise the manuscript to provide the requested information on sample characteristics, statistical procedures, stimulus properties, and sensitivity analyses. Our responses to each major comment are below.
read point-by-point responses
-
Referee: [Human Experiment / Results] The headline human result (High-Trust favoring -DI, Low-Trust favoring -mIs) is load-bearing for the central claim, yet the manuscript supplies no sample sizes, statistical tests, effect sizes, or exclusion criteria. Without these, it is impossible to assess whether the reported robustness is statistically reliable or driven by unmentioned confounds or base-rate preferences.
Authors: We agree that these details must be reported for the results to be properly evaluated. The revised manuscript will include a dedicated subsection on participants and analysis methods that reports the sample size, the statistical tests (mixed-effects logistic regression with appropriate random effects), effect sizes, and the pre-specified exclusion criteria. We will also add a data availability statement with access to the anonymized dataset and analysis code. revision: yes
-
Referee: [Methods (Cloze Task Construction)] The cloze-task design is the weakest link in the central empirical claim. High-Trust and Low-Trust items may differ systematically in lexical choice, verb semantics (perceptual vs. inferential), or syntactic complexity, any of which could independently bias -DI/-mIs selection. The paper must provide item-level matching statistics, full stimulus lists, or explicit covariate tests in the sensitivity analyses to demonstrate that trust is the sole manipulated variable.
Authors: This is a valid concern. Although the items were constructed with an attempt to balance verb types and structures, the revised version will append the complete stimulus list, report item-level matching statistics on lexical frequency, length, and semantic category, and include additional regression models that treat these properties as covariates to confirm the trust manipulation remains the primary predictor. revision: yes
-
Referee: [Results / Sensitivity Analyses] The abstract states that the human pattern 'remains stable across sensitivity analyses,' but neither the covariates tested nor the exact analytic procedures are described. This omission prevents readers from evaluating whether plausible alternative explanations (e.g., information-type or structural biases) have been ruled out.
Authors: We will expand the description of the sensitivity analyses in the revision. A new subsection will explicitly list the covariates examined (including lexical frequency, verb semantic type, and syntactic complexity), detail the analytic procedures (subset analyses and covariate-adjusted models), and report the outcomes showing that the trust effect remains stable. This will allow direct assessment of whether alternative explanations have been addressed. revision: yes
Circularity Check
No circularity: purely empirical reporting of human and LLM experiments
full rationale
The paper presents results from a human production experiment using controlled cloze contexts and evaluations of 10 LLMs across prompting paradigms. No mathematical derivations, first-principles predictions, fitted parameters, or self-citation chains are invoked to support the central claims. The trust effect in Turkish evidential morphology is reported as an observed data pattern stable across sensitivity analyses, with no reduction of any result to its own inputs by construction. This is a standard empirical study self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Introduction Evidentialityreferstothelinguisticencodingofinfor- mation source, that is, how a speaker indicates the basis on which a proposition is presented (e.g., di- rect perception, inference, or report) (Willett, 1988; Dendale and Tasmowski, 2001; de Haan, 2001; Plungian, 2001; Aikhenvald, 2004; Boye, 2012; Ünal and Papafragou, 2020). In many languag...
work page internal anchor Pith review Pith/arXiv arXiv 1988
-
[2]
directness
Related Work Turkish evidentiality: traditional descriptive and grammatical accounts.Turkish has long occupied a central place in the evidentiality lit- erature because its past-domain morphology is closely tied to distinctions in information source and speaker stance (Underhill, 1976; Aksu-Koç and Slobin, 1986; Johanson, 2003; Kornfilt, 1997; Göksel and ...
1976
-
[3]
First, we designed a con- trolled cloze-style task that can be administered to both human participants and LLMs
Methods To examine how Turkish evidential morphology re- sponds to contextualtrustworthiness, we used two complementary experimental paradigms with two corresponding datasets. First, we designed a con- trolled cloze-style task that can be administered to both human participants and LLMs. Second, we constructed a larger dataset for LLM-only evalua- tion in...
-
[4]
-DI" (7), and the other is the word conjugated with
Experiments 4.1. Human Experiment The human experiment received prior approval from the University of Chicago Institutional Review Board (IRB26-0198). All participants provided in- formed consent before participation. We recruited 75 unique participants who self-identified as native speakers of Turkish, were at least 18 years old, and were residing in Tur...
2018
-
[5]
Production: trust robustly shifts-DI vs.-mIş Table 1 shows strict-coding counts for critical tri- als (High+Low; fillers excluded)
Human experiment results 5.1. Production: trust robustly shifts-DI vs.-mIş Table 1 shows strict-coding counts for critical tri- als (High+Low; fillers excluded). Descriptively, High-trust contexts produce moreDIcompletions, whereasLow-trustcontextsproducemoremIşcom- pletions. TheOthercategory is similar across conditions (High: 28.4%; Low: 29.5%), suggest...
-
[6]
= 0.518andmIş= 0.482. 4 DI mIş 0 0.2 0.4 0.6 0.8 1 Proportion High trust DI mIş 0 0.2 0.4 0.6 0.8 1 Proportion Low trust Figure 2: Human production (strict coding): within- condition proportions of-DIvs.-mIşamong DI/mIş responses only. Two robustness checks yield the same qualita- tive pattern. (i) Under lenient last-token coding (al- lowing two-word VP c...
-
[7]
Each item was presented to the model three times, and the final label was determined by majority vote across the three outputs
The First Experiment with LLMs Experiment I evaluates LLM behavior in a prompting-based gap-fill task using 200 trust- manipulated Turkish cloze items. Each item was presented to the model three times, and the final label was determined by majority vote across the three outputs. Table 2 summarizes the model-level results on the prompted task, reporting (i...
-
[8]
Unlike Experiment I, the prompt here explicitly instructed models to produce apast-tensecompletion
The Second Experiment with LLMs Experiment II evaluates LLM behavior in anex- plicit past-tensegeneration setup using the same 200 trust-manipulated Turkish cloze items. Unlike Experiment I, the prompt here explicitly instructed models to produce apast-tensecompletion. As in Experiment I, each item was sampled three times. To obtain a single interpretable...
-
[9]
This design is more constrained than Experiments I–II because it removes open-ended lexical generation and directly targets the evidential choice itself
The Third Experiment with LLMs Experiment III evaluates LLM behavior in aforced- choiceprompting setup, in which models are ex- plicitly asked to choose between two candidate completions: a-DIform (option A) and a-mIşform (option B). This design is more constrained than Experiments I–II because it removes open-ended lexical generation and directly targets...
-
[10]
di- rectness
Discussion This study makes theoretical and methodological contributions to Turkish evidentiality and its evalu- ation in LLMs. Human results show a clear trust effect: more credible external sources favor-DI, while less credible sources favor-mIş(Section 5; Table 1; Figure 2). In contrast, current LLMs do not reliably reproduce this pattern, and their ap...
2016
-
[11]
Conclusion Thispaperinvestigatedhowsourcetrustworthiness shapes Turkish evidential choice in the-DI/-mIş contrast, and whether current LLMs track this sen- sitivity. In a controlled human cloze experiment, we found a robust and replicable trust effect: partic- ipants produced relatively more-DIin High-Trust contexts and relatively more-mIşin Low-Trust con...
-
[12]
doi:10.48550/arXiv.2504.17550 , abstract =
Bibliographical References Alexandra Y. Aikhenvald. 2004.Evidentiality. Ox- ford University Press, Oxford. Ayhan Aksu-Koç and Dan I. Slobin. 1986. A psy- chological account of the development and use of evidentials in Turkish. In Wallace Chafe and Johanna Nichols, editors,Evidentiality: The Lin- guistic Coding of Epistemology, pages 159–167. Ablex Publish...
-
[13]
TUMLU: A unified and native language understanding benchmark for Turkic languages. InProceedingsofthe63rdAnnualMeetingofthe Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 22816–22838, Vi- enna, Austria. Association for Computational Lin- guistics. Mete Ismayilzada, Defne Circi, Jonne Sälevä, Hale Sirin, Abdullatif Köksal, Bhuwa...
-
[14]
TruthfulQA: Measuring how models mimic human falsehoods. InProceedings of the 60th Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics. Xuannan Liu, Xiao Yang, Zekun Li, Peipei Li, and Ran He. 2026. Agenthallu: Benchmarking au- tomated halluc...
-
[15]
InProceedings of the 2023 Con- ference on Empirical Methods in Natural Lan- guage Processing, pages 9004–9017, Singa- pore
SelfCheckGPT: Zero-resource black-box hallucination detection for generative large lan- guage models. InProceedings of the 2023 Con- ference on Empirical Methods in Natural Lan- guage Processing, pages 9004–9017, Singa- pore. Association for Computational Linguistics. Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness an...
2023
-
[16]
WebGPT: Browser-assisted question-answering with human feedback
WebGPT: Browser-assisted question- answering with human feedback.arXiv preprint arXiv:2112.09332. neuralwork. n.d. gemma-2-9b-it-tr. Hugging Face model card. Accessed: 2026-01-20. NovusResearch. n.d. Novus-7b-tr_v1. Hugging Face model card. Accessed: 2026-01-20. ocaklisemih. n.d. gpt-oss-20b-Turkish-astrology- gguf. Hugging Face model card. Accessed: 2026...
work page internal anchor Pith review arXiv 2026
-
[17]
Mecellem models: Turkish models trained from scratch and continually pre-trained for the legal domain.arXiv preprint arXiv:2601.16018. Elif Ecem Umutlu, Ayse Aysu Cengiz, Ahmet Kaan Sever, Seyma Erdem, Burak Aytan, Busra Tufan, Abdullah Topraksoy, Esra Darıcı, and Cagri Tora- man. 2025. Evaluating the quality of benchmark datasets for low-resource languag...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.