
arxiv: 2604.19768 · v1 · submitted 2026-03-27 · 💻 cs.CL · cs.AI

Saying More Than They Know: A Framework for Quantifying Epistemic-Rhetorical Miscalibration in Large Language Models

Pith reviewed 2026-05-15 00:21 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords epistemic-rhetorical miscalibration · large language models · rhetorical devices · form-meaning divergence · AI text detection · pragmatics · argumentative writing

The pith

Large language models produce rhetorical patterns whose intensity exceeds their epistemic grounding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models generate argumentative texts in which rhetorical devices appear at levels not matched by the certainty or knowledge they convey. The authors introduce a triadic epistemic-rhetorical marker taxonomy and three composite metrics to quantify the resulting mismatch. When the framework is applied to 225 texts from expert humans, non-expert humans, and LLMs, it detects elevated form-meaning divergence and more uniform device placement in the model outputs. Specific patterns include nearly twice the expert rate of tricolon and twice the human density of performed hesitancy markers. The metrics are designed to be automatable and therefore usable as a screening layer for epistemic miscalibration in generated content.

Core claim

LLM-generated texts produce tricolon at nearly twice the expert rate while human authors produce erotema at more than twice the LLM rate. Performed hesitancy markers appear at twice the human density in LLM output. Form-meaning divergence is significantly elevated in LLM texts relative to both human groups, and rhetorical devices are distributed significantly more uniformly across LLM documents.

What carries the argument

Triadic epistemic-rhetorical marker (ERM) taxonomy operationalized through form-meaning divergence (FMD), genuine-to-performed epistemic ratio (GPR), and rhetorical device distribution entropy (RDDE) metrics.
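
The source does not give a formula for RDDE, so the following is only a minimal sketch of one plausible reading: normalized Shannon entropy over positional bins within a single document. The function name `positional_entropy`, the ten-bin default, and the use of relative positions in [0, 1] are all assumptions made for illustration, not the author's definition.

```python
import math
from typing import Sequence


def positional_entropy(positions: Sequence[float], n_bins: int = 10) -> float:
    """Normalized Shannon entropy of device positions in one document.

    ASSUMPTION: `positions` holds the relative position (0.0 = start,
    1.0 = end) of each annotated rhetorical device. The paper's actual
    RDDE definition is not stated in the source.
    Returns a value in [0, 1]; 1.0 means devices fill all bins evenly.
    """
    if n_bins < 2:
        raise ValueError("n_bins must be at least 2")
    if not positions:
        return 0.0
    counts = [0] * n_bins
    for p in positions:
        if not 0.0 <= p <= 1.0:
            raise ValueError("positions must lie in [0, 1]")
        # Clamp p == 1.0 into the last bin.
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    total = len(positions)
    entropy = sum((c / total) * math.log2(total / c) for c in counts if c)
    # Divide by the maximum possible entropy so documents are comparable.
    return entropy / math.log2(n_bins)
```

Under this reading, a document whose devices cluster in one passage scores near 0 and one whose devices are spread evenly scores near 1, which is the direction the summary reports for LLM texts.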

Load-bearing premise

The triadic ERM taxonomy and its operationalization via FMD, GPR, and RDDE metrics accurately isolate epistemic-rhetorical miscalibration without bias from corpus construction, annotation rules, or LLM generation artifacts.

What would settle it

An independent corpus of argumentative texts in which form-meaning divergence scores show no significant elevation for LLM outputs relative to human outputs would falsify the claim of systematic miscalibration.

Figures

Figures reproduced from arXiv: 2604.19768 by Asim D. Bakhshi.

Figure 1: Architecture of the ERM taxonomy showing linkages between the three theoretical anchors (left), the six ...
Figure 2: The ERM architecture pipeline. Stage 1 constructs the corpus and segments each document into sentence ...
Figure 3: Mean count per document of all Level 1 and Level 2b markers. Significance markers: ...
Figure 4: Distribution of documents in the genuine ( ...
Figure 5: Distribution of the three composite ERM metrics across sub-corpora. Each violin shows the full density ...
Figure 6: Corpus-level proportions of Level 3 discourse markers per sub-corpus. Cell values show the percentage ...
Figure 7: Distribution of composite ERM metrics within the LLM-generated sub-corpus by model. Box spans the ...
Original abstract

Large language models (LLMs) exhibit systematic miscalibration with rhetorical intensity not proportionate to epistemic grounding. This study tests this hypothesis and proposes a framework for quantifying this decoupling by designing a triadic epistemic-rhetorical marker (ERM) taxonomy. The taxonomy is operationalized through composite metrics of form-meaning divergence (FMD), genuine-to-performed epistemic ratio (GPR), and rhetorical device distribution entropy (RDDE). Applied to 225 argumentative texts spanning approximately 0.6 Million tokens across human expert, human non-expert, and LLM-generated sub-corpora, the framework identifies a consistent, model-agnostic LLM epistemic signature. LLM-generated texts produce tricolon at nearly twice the expert rate ($\Delta = 0.95$), while human authors produce erotema at more than twice the LLM rate. Performed hesitancy markers appear at twice the human density in LLM output. FMD is significantly elevated in LLM texts relative to both human groups ($p < 0.001, \Delta = 0.68$), and rhetorical devices are distributed significantly more uniformly across LLM documents. The findings are consistent with theoretical intuitions derived from Gricean pragmatics, Relevance Theory, and Brandomian inferentialism. The annotation pipeline is fully automatable, making it deployable as a lightweight screening tool for epistemic miscalibration in AI-generated content and as a theoretically motivated feature set for LLM-generated text detection pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs exhibit systematic epistemic-rhetorical miscalibration, with rhetorical intensity disproportionate to epistemic grounding. It introduces a triadic ERM taxonomy operationalized via composite metrics FMD, GPR, and RDDE. Applied to 225 argumentative texts (~0.6M tokens) across human expert, human non-expert, and LLM sub-corpora, the framework reports LLM texts using tricolon at nearly twice the expert rate (Δ=0.95), humans using erotema more than twice the LLM rate, performed hesitancy markers at twice human density in LLMs, significantly elevated FMD in LLMs (p<0.001, Δ=0.68), and more uniform rhetorical device distribution in LLM documents. Findings align with Gricean pragmatics, Relevance Theory, and Brandomian inferentialism, with the pipeline presented as automatable for AI content screening.

Significance. If the metrics validly isolate epistemic-rhetorical decoupling independent of stylistic artifacts, the work offers a novel, automatable framework grounded in linguistic theory for detecting miscalibration in LLM outputs. This could strengthen AI-generated text detection pipelines and provide empirical tests of pragmatic theories in computational settings. The model-agnostic results, large corpus size, and reported effect sizes add potential impact for NLP applications in content moderation and evaluation.

major comments (3)
  1. [Methods] Methods section: The operational definitions, formulas, and validation procedures for FMD, GPR, and RDDE are not provided. Without explicit computation details (e.g., how form-meaning divergence is scored from annotations or how the genuine-to-performed ratio is derived), it is impossible to confirm that the metrics measure epistemic-rhetorical miscalibration rather than LLM stylistic regularities such as repetition or formality patterns. This directly affects interpretation of the central results including FMD Δ=0.68 (p<0.001).
  2. [Corpus and Annotation] Corpus construction and annotation: No details are given on sampling criteria for the 225 texts, inter-annotator agreement for the ERM taxonomy, or controls for confounders like text length, topic, or LLM prompt design. These omissions leave open the possibility that differences in device rates (e.g., tricolon Δ=0.95) and RDDE uniformity arise from corpus or annotation biases rather than the hypothesized decoupling.
  3. [Results] Results section: While p-values and effect sizes are reported, the manuscript does not specify the exact statistical tests, sample sizes per comparison, or corrections for multiple testing. This weakens assessment of claims such as significantly more uniform rhetorical device distribution across LLM documents.
minor comments (2)
  1. [Abstract] Abstract: The token count is stated as 'approximately 0.6 Million tokens'; reporting the exact total would aid reproducibility.
  2. [Discussion] Discussion: Adding an explicit limitations paragraph addressing potential LLM generation artifacts (e.g., prompt-induced patterns) would strengthen the manuscript.
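
The inter-annotator agreement requested in major comment 2 is commonly reported as Cohen's kappa for two annotators. The sketch below shows that computation in plain Python; the function name `cohens_kappa` and the example label names in the usage note are hypothetical and do not come from the paper's annotation scheme.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, with hypothetical labels `["hedge", "hedge", "boost", "boost"]` and `["hedge", "boost", "boost", "boost"]`, observed agreement is 0.75 and chance agreement is 0.5, giving a kappa of 0.5.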

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify areas where the manuscript can be strengthened for clarity and reproducibility. We address each major comment below and will incorporate the suggested revisions in the next version of the paper.

point-by-point responses
  1. Referee: [Methods] Methods section: The operational definitions, formulas, and validation procedures for FMD, GPR, and RDDE are not provided. Without explicit computation details (e.g., how form-meaning divergence is scored from annotations or how the genuine-to-performed ratio is derived), it is impossible to confirm that the metrics measure epistemic-rhetorical miscalibration rather than LLM stylistic regularities such as repetition or formality patterns. This directly affects interpretation of the central results including FMD Δ=0.68 (p<0.001).

    Authors: We agree that the current Methods section lacks sufficient explicit detail on the metrics. In the revised manuscript, we will expand this section to include the full operational definitions of the ERM taxonomy, the precise mathematical formulas for computing FMD (form-meaning divergence), GPR (genuine-to-performed epistemic ratio), and RDDE (rhetorical device distribution entropy), along with step-by-step validation procedures, annotation guidelines, and pseudocode for the automatable pipeline. These additions will demonstrate how the metrics isolate epistemic-rhetorical decoupling from purely stylistic factors such as repetition or formality. revision: yes

  2. Referee: [Corpus and Annotation] Corpus construction and annotation: No details are given on sampling criteria for the 225 texts, inter-annotator agreement for the ERM taxonomy, or controls for confounders like text length, topic, or LLM prompt design. These omissions leave open the possibility that differences in device rates (e.g., tricolon Δ=0.95) and RDDE uniformity arise from corpus or annotation biases rather than the hypothesized decoupling.

    Authors: We acknowledge the importance of transparency in corpus construction. The revised manuscript will include a dedicated subsection detailing the sampling criteria for selecting the 225 argumentative texts (~0.6M tokens), report inter-annotator agreement statistics (e.g., Cohen's kappa or Fleiss' kappa) for the ERM taxonomy annotations, and describe the controls applied for potential confounders including text length normalization, topic balancing across sub-corpora, and standardization of LLM prompt designs. These additions will help rule out alternative explanations for the observed differences. revision: yes

  3. Referee: [Results] Results section: While p-values and effect sizes are reported, the manuscript does not specify the exact statistical tests, sample sizes per comparison, or corrections for multiple testing. This weakens assessment of claims such as significantly more uniform rhetorical device distribution across LLM documents.

    Authors: We agree that full statistical reporting is necessary for rigorous evaluation. In the revised Results section, we will explicitly specify the statistical tests used for each comparison (e.g., independent t-tests or Mann-Whitney U tests for FMD and device rates), the exact sample sizes per group (225 texts divided across the three sub-corpora), and the multiple-testing correction applied (e.g., Bonferroni or FDR). This will allow readers to fully assess the robustness of findings such as the elevated FMD (p<0.001, Δ=0.68) and RDDE uniformity. revision: yes
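
The reporting promised in response 3 can be illustrated with a small sketch. The source does not say which effect size Δ denotes or which correction was used, so Cliff's delta (a rank-based effect size that pairs naturally with the Mann-Whitney U test) and Bonferroni correction are assumptions chosen here for illustration; the function names are hypothetical.

```python
from typing import List, Sequence


def cliffs_delta(x: Sequence[float], y: Sequence[float]) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs.

    ASSUMPTION: used here as one candidate for the paper's Δ; the
    source does not name the effect-size statistic.
    """
    if not x or not y:
        raise ValueError("both groups must be non-empty")
    greater = sum(1 for xi in x for yj in y if xi > yj)
    less = sum(1 for xi in x for yj in y if xi < yj)
    return (greater - less) / (len(x) * len(y))


def bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Flag which tests stay significant after Bonferroni correction."""
    m = len(p_values)
    # Each p-value is compared against alpha divided by the number of tests.
    return [p <= alpha / m for p in p_values]
```

With three comparisons at alpha = 0.05, the per-test threshold becomes roughly 0.0167, so a raw p-value of 0.02 would no longer count as significant. This is why the referee asks for the correction to be stated alongside each reported p-value.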

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes a novel triadic ERM taxonomy and operationalizes it via author-defined composite metrics (FMD, GPR, RDDE) explicitly motivated by external linguistic theories (Gricean pragmatics, Relevance Theory, Brandomian inferentialism). These are then applied empirically to a 0.6M-token corpus spanning human expert, human non-expert, and LLM texts, producing statistical comparisons (e.g., tricolon rate Δ=0.95, FMD p<0.001 Δ=0.68). No equations, definitions, or self-citations are present that reduce any reported result to a fitted parameter, self-referential definition, or prior author work by construction. The derivation chain is an independent empirical measurement using a new framework and remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 4 invented entities

The central claim rests on the validity of newly introduced metrics and the assumption that observed differences reflect genuine miscalibration rather than artifacts of text generation or measurement.

axioms (1)
  • [domain assumption] Gricean pragmatics, Relevance Theory, and Brandomian inferentialism provide the theoretical foundation for interpreting the metrics as indicators of epistemic-rhetorical misalignment
    Findings are stated to be consistent with these theories.
invented entities (4)
  • Triadic epistemic-rhetorical marker (ERM) taxonomy [no independent evidence]
    purpose: To categorize and quantify the decoupling between epistemic grounding and rhetorical intensity
    Newly proposed framework operationalized in the study.
  • Form-meaning divergence (FMD) [no independent evidence]
    purpose: Composite metric capturing divergence between linguistic form and epistemic meaning
    Introduced as one of the three core metrics.
  • Genuine-to-performed epistemic ratio (GPR) [no independent evidence]
    purpose: Metric comparing genuine epistemic markers to performed hesitancy markers
    Introduced as one of the three core metrics.
  • Rhetorical device distribution entropy (RDDE) [no independent evidence]
    purpose: Measure of uniformity in rhetorical device usage across documents
    Introduced as one of the three core metrics.
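
As a rough illustration of the GPR entry above, a ratio of genuine to performed marker counts can be sketched as follows. The source gives no formula, so the additive smoothing term (which avoids division by zero when a document has no performed markers) and the function name `genuine_to_performed_ratio` are assumptions, not the author's operationalization.

```python
def genuine_to_performed_ratio(
    genuine: int, performed: int, smoothing: float = 1.0
) -> float:
    """Ratio of genuine epistemic markers to performed hesitancy markers.

    ASSUMPTION: additive smoothing is an illustrative choice; the paper's
    actual GPR definition is not given in the source.
    Values above 1.0 mean genuine markers dominate; below 1.0, performed.
    """
    if genuine < 0 or performed < 0:
        raise ValueError("marker counts must be non-negative")
    if smoothing <= 0:
        raise ValueError("smoothing must be positive")
    return (genuine + smoothing) / (performed + smoothing)
```

Under this sketch, a document with three genuine markers and one performed marker scores 2.0, while the reverse scores 0.5.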

pith-pipeline@v0.9.0 · 5564 in / 1603 out tokens · 60378 ms · 2026-05-15T00:21:00.955244+00:00 · methodology


