Saying More Than They Know: A Framework for Quantifying Epistemic-Rhetorical Miscalibration in Large Language Models

Asim D. Bakhshi

arXiv: 2604.19768 · v1 · submitted 2026-03-27 · cs.CL · cs.AI

Pith reviewed 2026-05-15 00:21 UTC · model grok-4.3

keywords: epistemic-rhetorical miscalibration · large language models · rhetorical devices · form-meaning divergence · AI text detection · pragmatics · argumentative writing

The pith

Large language models produce rhetorical patterns whose intensity exceeds their epistemic grounding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models generate argumentative texts in which rhetorical devices appear at levels not matched by the certainty or knowledge they convey. The author introduces a triadic epistemic-rhetorical marker taxonomy and three composite metrics to quantify the resulting mismatch. Applied to 225 texts from expert humans, non-expert humans, and LLMs, the framework detects elevated form-meaning divergence and more uniform device placement in the model outputs. Specific patterns include tricolon at nearly twice the expert rate and performed hesitancy markers at twice the human density. The metrics are designed to be automatable and therefore usable as a screening layer for epistemic miscalibration in generated content.

Core claim

LLM-generated texts produce tricolon at nearly twice the expert rate while human authors produce erotema at more than twice the LLM rate. Performed hesitancy markers appear at twice the human density in LLM output. Form-meaning divergence is significantly elevated in LLM texts relative to both human groups, and rhetorical devices are distributed significantly more uniformly across LLM documents.

What carries the argument

Triadic epistemic-rhetorical marker (ERM) taxonomy operationalized through form-meaning divergence (FMD), genuine-to-performed epistemic ratio (GPR), and rhetorical device distribution entropy (RDDE) metrics.
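
The formulas behind these metrics are not given on this page, so what follows is only a minimal sketch of one plausible reading of RDDE: a normalized Shannon entropy over where tagged devices fall within a document. The function name, the equal-width binning, and the default of 10 bins are assumptions made for illustration, not the paper's definition.

```python
import math
from collections import Counter


def normalized_position_entropy(positions, doc_length, n_bins=10):
    """Normalized Shannon entropy of device positions over equal-width bins.

    positions:  0-based sentence indices where a rhetorical device was tagged.
    doc_length: number of sentences in the document.
    Returns a value in [0, 1]; 1 means devices are spread evenly across bins.
    NOTE: hypothetical stand-in for RDDE, not the paper's formula.
    """
    if not positions or doc_length <= 0 or n_bins < 2:
        return 0.0
    # Map each sentence index to a bin; clamp so the last sentence stays in range.
    bins = Counter(min(p * n_bins // doc_length, n_bins - 1) for p in positions)
    total = sum(bins.values())
    # H = sum_i p_i * log2(1 / p_i), computed over non-empty bins only.
    entropy = sum((c / total) * math.log2(total / c) for c in bins.values())
    return entropy / math.log2(n_bins)


# Toy inputs (illustrative only): devices packed into the opening sentences
# versus one device in each tenth of a 100-sentence document.
clustered = normalized_position_entropy([0, 1, 2, 3], doc_length=100)
spread = normalized_position_entropy(list(range(5, 100, 10)), doc_length=100)
```

Under this reading, a score near 1 would correspond to the more uniform placement reported for LLM documents, and a score near 0 to devices clustered in one part of the text.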

Load-bearing premise

The triadic ERM taxonomy and its operationalization via FMD, GPR, and RDDE metrics accurately isolate epistemic-rhetorical miscalibration without bias from corpus construction, annotation rules, or LLM generation artifacts.

What would settle it

An independent corpus of argumentative texts in which form-meaning divergence scores show no significant elevation for LLM outputs relative to human outputs would falsify the claim of systematic miscalibration.
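
One way such a check could be run on a new corpus is a two-sided permutation test on the difference in mean FMD between groups. This is a sketch under assumptions: per-document FMD scores are taken as already computed, the function name and defaults are hypothetical, and the source does not state which test the paper itself uses.

```python
import random


def permutation_test_mean_diff(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means.

    xs, ys: per-document scores for two groups (e.g. LLM vs. human FMD).
    Returns (hits + 1) / (n_perm + 1), where hits counts random relabelings
    whose absolute mean difference is at least the observed one.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n_x, n_y = len(xs), len(ys)
    observed = abs(sum(xs) / n_x - sum(ys) / n_y)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_x]) / n_x - sum(pooled[n_x:]) / n_y)
        # Small tolerance guards against float rounding on exact ties.
        if diff >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

A large p-value on an independent corpus would be the kind of null result described above; a small one would be consistent with the reported elevation.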

Figures

Figures reproduced from arXiv: 2604.19768 by Asim D. Bakhshi.

**Figure 2.** The ERM architecture pipeline. Stage 1 constructs the corpus and segments each document into sentence …

**Figure 3.** Mean count per document of all Level 1 and Level 2b markers. Significance markers: …

**Figure 4.** Distribution of documents in the genuine ( …

**Figure 5.** Distribution of the three composite ERM metrics across sub-corpora. Each violin shows the full density …

**Figure 6.** Corpus-level proportions of Level 3 discourse markers per sub-corpus. Cell values show the percentage …

**Figure 7.** Distribution of composite ERM metrics within the LLM-generated sub-corpus by model. Box spans the …

Original abstract

Large language models (LLMs) exhibit systematic miscalibration with rhetorical intensity not proportionate to epistemic grounding. This study tests this hypothesis and proposes a framework for quantifying this decoupling by designing a triadic epistemic-rhetorical marker (ERM) taxonomy. The taxonomy is operationalized through composite metrics of form-meaning divergence (FMD), genuine-to-performed epistemic ratio (GPR), and rhetorical device distribution entropy (RDDE). Applied to 225 argumentative texts spanning approximately 0.6 Million tokens across human expert, human non-expert, and LLM-generated sub-corpora, the framework identifies a consistent, model-agnostic LLM epistemic signature. LLM-generated texts produce tricolon at nearly twice the expert rate ($\Delta = 0.95$), while human authors produce erotema at more than twice the LLM rate. Performed hesitancy markers appear at twice the human density in LLM output. FMD is significantly elevated in LLM texts relative to both human groups ($p < 0.001, \Delta = 0.68$), and rhetorical devices are distributed significantly more uniformly across LLM documents. The findings are consistent with theoretical intuitions derived from Gricean pragmatics, Relevance Theory, and Brandomian inferentialism. The annotation pipeline is fully automatable, making it deployable as a lightweight screening tool for epistemic miscalibration in AI-generated content and as a theoretically motivated feature set for LLM-generated text detection pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a new taxonomy and three metrics to track when LLMs use heavy rhetoric without matching epistemic grounding, and it finds consistent differences in a 225-text corpus.

The main point is that LLMs show higher form-meaning divergence in rhetorical choices than either expert or non-expert humans, measured through a triadic ERM taxonomy and the composites FMD, GPR, and RDDE. The corpus work turns up concrete patterns: tricolon nearly twice as common in LLM output, erotema more than twice as common in human writing, performed hesitancy markers doubled in LLMs, and more uniform device distribution overall, with FMD elevated at p<0.001 and effect size 0.68. The pipeline being fully automatable is a practical advantage for screening tools or detection features. The links to Gricean pragmatics, Relevance Theory, and inferentialism give the framing some theoretical anchor without overclaiming. The model-agnostic result across the LLM sub-corpus is also useful. The soft spot is that the abstract leaves the exact operationalization of FMD, GPR, and RDDE underspecified, so it is still possible the observed gaps partly reflect generation artifacts or annotation rules rather than pure epistemic-rhetorical decoupling. Corpus sampling criteria and any length or topic controls are not visible here either. If the full methods section supplies explicit formulas, inter-annotator checks, and validation against those confounds, the central claim strengthens; otherwise the differences stay suggestive rather than definitive. This is for NLP researchers working on calibration, generated-text detection, or AI safety applications. A reader who needs a lightweight, theory-motivated feature set for argumentative text would get direct value. It deserves peer review because the taxonomy is new, the empirical contrasts are sharp enough to test, and the automatable aspect makes it worth refining even if the current write-up needs tighter metric definitions.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs exhibit systematic epistemic-rhetorical miscalibration, with rhetorical intensity disproportionate to epistemic grounding. It introduces a triadic ERM taxonomy operationalized via composite metrics FMD, GPR, and RDDE. Applied to 225 argumentative texts (~0.6M tokens) across human expert, human non-expert, and LLM sub-corpora, the framework reports LLM texts using tricolon at nearly twice the expert rate (Δ=0.95), humans using erotema more than twice the LLM rate, performed hesitancy markers at twice human density in LLMs, significantly elevated FMD in LLMs (p<0.001, Δ=0.68), and more uniform rhetorical device distribution in LLM documents. Findings align with Gricean pragmatics, Relevance Theory, and Brandomian inferentialism, with the pipeline presented as automatable for AI content screening.

Significance. If the metrics validly isolate epistemic-rhetorical decoupling independent of stylistic artifacts, the work offers a novel, automatable framework grounded in linguistic theory for detecting miscalibration in LLM outputs. This could strengthen AI-generated text detection pipelines and provide empirical tests of pragmatic theories in computational settings. The model-agnostic results, large corpus size, and reported effect sizes add potential impact for NLP applications in content moderation and evaluation.

major comments (3)

[Methods] Methods section: The operational definitions, formulas, and validation procedures for FMD, GPR, and RDDE are not provided. Without explicit computation details (e.g., how form-meaning divergence is scored from annotations or how the genuine-to-performed ratio is derived), it is impossible to confirm that the metrics measure epistemic-rhetorical miscalibration rather than LLM stylistic regularities such as repetition or formality patterns. This directly affects interpretation of the central results including FMD Δ=0.68 (p<0.001).
[Corpus and Annotation] Corpus construction and annotation: No details are given on sampling criteria for the 225 texts, inter-annotator agreement for the ERM taxonomy, or controls for confounders like text length, topic, or LLM prompt design. These omissions leave open the possibility that differences in device rates (e.g., tricolon Δ=0.95) and RDDE uniformity arise from corpus or annotation biases rather than the hypothesized decoupling.
[Results] Results section: While p-values and effect sizes are reported, the manuscript does not specify the exact statistical tests, sample sizes per comparison, or corrections for multiple testing. This weakens assessment of claims such as significantly more uniform rhetorical device distribution across LLM documents.
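
To make the third comment concrete: the page reports Δ values without naming the statistic. The sketch below assumes Δ is a rank-based effect size such as Cliff's delta, paired with the Mann-Whitney U statistic. Both the choice of statistic and the function names are assumptions for illustration.

```python
def mann_whitney_u(xs, ys):
    """U statistic for xs: pairs with x > y count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs.

    Ranges from -1 (every x below every y) to +1 (every x above every y).
    """
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Toy scores, invented for illustration only (not data from the paper).
llm_scores = [0.62, 0.71, 0.55, 0.80]
human_scores = [0.40, 0.58, 0.35, 0.49]
delta = cliffs_delta(llm_scores, human_scores)
u_stat = mann_whitney_u(llm_scores, human_scores)
```

The two quantities are linked by delta = 2U / (n·m) − 1, which is one reason a report should state which statistic and which group sizes sit behind each Δ.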

minor comments (2)

[Abstract] Abstract: The token count is stated as 'approximately 0.6 Million tokens'; reporting the exact total would aid reproducibility.
[Discussion] Discussion: Adding an explicit limitations paragraph addressing potential LLM generation artifacts (e.g., prompt-induced patterns) would strengthen the manuscript.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify areas where the manuscript can be strengthened for clarity and reproducibility. We address each major comment below and will incorporate the suggested revisions in the next version of the paper.

Referee: [Methods] Methods section: The operational definitions, formulas, and validation procedures for FMD, GPR, and RDDE are not provided. Without explicit computation details (e.g., how form-meaning divergence is scored from annotations or how the genuine-to-performed ratio is derived), it is impossible to confirm that the metrics measure epistemic-rhetorical miscalibration rather than LLM stylistic regularities such as repetition or formality patterns. This directly affects interpretation of the central results including FMD Δ=0.68 (p<0.001).

Authors: We agree that the current Methods section lacks sufficient explicit detail on the metrics. In the revised manuscript, we will expand this section to include the full operational definitions of the ERM taxonomy, the precise mathematical formulas for computing FMD (form-meaning divergence), GPR (genuine-to-performed epistemic ratio), and RDDE (rhetorical device distribution entropy), along with step-by-step validation procedures, annotation guidelines, and pseudocode for the automatable pipeline. These additions will demonstrate how the metrics isolate epistemic-rhetorical decoupling from purely stylistic factors such as repetition or formality.
Revision: yes
Referee: [Corpus and Annotation] Corpus construction and annotation: No details are given on sampling criteria for the 225 texts, inter-annotator agreement for the ERM taxonomy, or controls for confounders like text length, topic, or LLM prompt design. These omissions leave open the possibility that differences in device rates (e.g., tricolon Δ=0.95) and RDDE uniformity arise from corpus or annotation biases rather than the hypothesized decoupling.

Authors: We acknowledge the importance of transparency in corpus construction. The revised manuscript will include a dedicated subsection detailing the sampling criteria for selecting the 225 argumentative texts (~0.6M tokens), report inter-annotator agreement statistics (e.g., Cohen's kappa or Fleiss' kappa) for the ERM taxonomy annotations, and describe the controls applied for potential confounders including text length normalization, topic balancing across sub-corpora, and standardization of LLM prompt designs. These additions will help rule out alternative explanations for the observed differences.
Revision: yes
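
For reference, the agreement statistic mentioned in this response can be computed as follows. This is a minimal sketch of Cohen's kappa for two annotators; the label names reuse the genuine/performed distinction from the summary above, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected from each annotator's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```
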
Referee: [Results] Results section: While p-values and effect sizes are reported, the manuscript does not specify the exact statistical tests, sample sizes per comparison, or corrections for multiple testing. This weakens assessment of claims such as significantly more uniform rhetorical device distribution across LLM documents.

Authors: We agree that full statistical reporting is necessary for rigorous evaluation. In the revised Results section, we will explicitly specify the statistical tests used for each comparison (e.g., independent t-tests or Mann-Whitney U tests for FMD and device rates), the exact sample sizes per group (225 texts divided across the three sub-corpora), and the multiple-testing correction applied (e.g., Bonferroni or FDR). This will allow readers to fully assess the robustness of findings such as the elevated FMD (p<0.001, Δ=0.68) and RDDE uniformity.
Revision: yes
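
The two corrections named in this response behave differently, which is why the choice should be reported. The sketch below shows Bonferroni and the Benjamini-Hochberg FDR procedure side by side; the function names and the default alpha of 0.05 are assumptions, and the p-values in the usage note are invented for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha.

    Finds the largest rank k with p_(k) <= k * alpha / m and rejects the
    k smallest p-values. Results are returned in the input order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected
```

With the toy inputs `[0.2, 0.001, 0.045, 0.012, 0.025]`, Bonferroni rejects only the second test, while Benjamini-Hochberg rejects the second, fourth, and fifth, so the same set of device-rate comparisons can yield different counts of "significant" results depending on the correction.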

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes a novel triadic ERM taxonomy and operationalizes it via author-defined composite metrics (FMD, GPR, RDDE) explicitly motivated by external linguistic theories (Gricean pragmatics, Relevance Theory, Brandomian inferentialism). These are then applied empirically to a 0.6M-token corpus spanning human expert, human non-expert, and LLM texts, producing statistical comparisons (e.g., tricolon rate Δ=0.95, FMD p<0.001 Δ=0.68). No equations, definitions, or self-citations are present that reduce any reported result to a fitted parameter, self-referential definition, or prior author work by construction. The derivation chain is an independent empirical measurement using a new framework and remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 4 invented entities

The central claim rests on the validity of newly introduced metrics and the assumption that observed differences reflect genuine miscalibration rather than artifacts of text generation or measurement.

axioms (1)

domain assumption: Gricean pragmatics, Relevance Theory, and Brandomian inferentialism provide the theoretical foundation for interpreting the metrics as indicators of epistemic-rhetorical misalignment.
Findings are stated to be consistent with these theories.

invented entities (4)

Triadic epistemic-rhetorical marker (ERM) taxonomy · no independent evidence
purpose: To categorize and quantify the decoupling between epistemic grounding and rhetorical intensity
Newly proposed framework operationalized in the study.

Form-meaning divergence (FMD) · no independent evidence
purpose: Composite metric capturing divergence between linguistic form and epistemic meaning
Introduced as one of the three core metrics.

Genuine-to-performed epistemic ratio (GPR) · no independent evidence
purpose: Metric comparing genuine epistemic markers to performed hesitancy markers
Introduced as one of the three core metrics.

Rhetorical device distribution entropy (RDDE) · no independent evidence
purpose: Measure of uniformity in rhetorical device usage across documents
Introduced as one of the three core metrics.
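
As an illustration of what the GPR entry above could look like in practice, here is a minimal sketch that assumes each document already has counts of markers annotated as genuine and as performed. The additive smoothing, the per-1,000-token normalization, and both function names are assumptions for illustration; the paper's actual formula is not given on this page.

```python
def genuine_to_performed_ratio(genuine_count, performed_count, smoothing=1.0):
    """Smoothed ratio (genuine + s) / (performed + s).

    The smoothing term s keeps the ratio finite for documents with no
    performed markers. NOTE: hypothetical form, not the paper's definition.
    """
    if genuine_count < 0 or performed_count < 0 or smoothing <= 0:
        raise ValueError("counts must be non-negative and smoothing positive")
    return (genuine_count + smoothing) / (performed_count + smoothing)


def density_per_1k(count, n_tokens):
    """Marker occurrences per 1,000 tokens, for length-normalized comparison."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return 1000.0 * count / n_tokens
```

Under this form, values below 1 would indicate that performed markers outnumber genuine ones in a document, and the density helper gives one simple way to address the text-length confound raised in the referee report.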

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "The taxonomy is operationalized through composite metrics of form-meaning divergence (FMD), genuine-to-performed epistemic ratio (GPR), and rhetorical device distribution entropy (RDDE)."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
Paper passage: "FMD is significantly elevated in LLM texts relative to both human groups (p < 0.001, Δ = 0.68)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

