When Certainty Is an Artifact: Keyword Lexicon Blindness and the (Mis)Measurement of Rhetorical Stance

Bo Chen

arxiv: 2606.26062 · v1 · pith:PPC4V5KVnew · submitted 2026-06-24 · 💻 cs.CL

When Certainty Is an Artifact: Keyword Lexicon Blindness and the (Mis)Measurement of Rhetorical Stance

Bo Chen This is my paper

Pith reviewed 2026-06-25 19:30 UTC · model grok-4.3

classification 💻 cs.CL

keywords keyword lexiconsrhetorical stanceepistemic certaintyhedgingnegative affectmeasurement artifactLLM classificationcomputational social science

0 comments

The pith

Keyword lexicons measure lexical co-occurrence patterns rather than actual rhetorical stance or epistemic certainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a prominent finding in computational social science—that negative affect strongly co-occurs with emphatic certainty—survives when the measurement instrument changes. Keyword scoring on 85 interviews produces robust positive correlations (r = 0.72–0.93) across four speakers, but zero-shot LLM classification on the full 32,625-sentence corpus drops those correlations sharply or reverses their sign while uncovering the expected negative-hedging pattern. The authors trace the discrepancy to three lexicon failures—syntactic blindness, polysemy blindness, and categorical absence—that can invert sentence meaning. A sympathetic reader would care because many existing claims about public figures’ psychology rest on the same keyword method.

Core claim

Keyword lexicon scoring detects a universal tendency for negative discourse to attract emphatic vocabulary, producing high negative-affect/emphatic-certainty correlations that largely disappear or invert when the same corpus is scored by LLM semantic classification; the paper shows this occurs because lexicons cannot distinguish syntactic scope, word sense, or missing categories, so counts of words like “never,” “absolutely,” and “confident” in the phrase “never absolutely totally confident” register as high certainty.

What carries the argument

Direct replacement of keyword lexicon counts with LLM zero-shot semantic classification on the complete diarized corpus, exposing the three structural failure modes (syntactic blindness, polysemy blindness, categorical absence) that make lexicon scores orthogonal to rhetorical stance.

If this is right

Many published correlations between affect and certainty derived from keyword lexicons may reflect word co-occurrence habits rather than speaker psychology.
The conventional view that pessimistic discourse attracts hedging receives support only after the measurement method is changed from keywords to semantic classification.
Treating raw keyword counts as direct indicators of epistemic stance constitutes a category error in computational social science.
Lexicon-based results can systematically invert the direction of relationships that semantic methods recover.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Existing studies that relied on the same lexicons for claims about public intellectuals or media figures may require re-analysis with context-sensitive methods.
Similar measurement artifacts could affect other lexicon-driven variables such as sentiment polarity or moral foundations when applied to spoken or transcribed text.
Hybrid pipelines that first apply lexicons and then filter or correct for the three blindness modes might reduce the discrepancy without full LLM replacement.

Load-bearing premise

That LLM zero-shot classification on full sentences supplies a valid ground-truth measure of certainty, hedging, and negative affect without its own systematic misclassifications that could produce the observed shifts.

What would settle it

Human annotation of a random sample of the same sentences for the three rhetorical dimensions, followed by a direct correlation comparison showing whether keyword scores or LLM scores align more closely with the human labels.

Figures

Figures reproduced from arXiv: 2606.26062 by Bo Chen.

**Figure 2.** Figure 2: Cross-dimensional time series: normalized negative affect vs. emphatic/certain language fre [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

read the original abstract

Can a statistically significant, large-effect-size finding in computational social science be entirely an artifact of the measurement instrument? We present a case where the answer appears to be yes. Analyzing 85 interviews across four public intellectuals (2016--2026), we find a robust negative-affect/emphatic-certainty lexical co-occurrence pattern under keyword-based scoring ($r = 0.72$--$0.93$, $p < 0.01$ for all four speakers). Replacing keyword counting with LLM-based zero-shot semantic classification on the complete diarized corpus (32,625 sentences) dramatically reduces this correlation: Dalio's $r = 0.851$ drops to $r = 0.206$, with two speakers showing negative $r(\text{neg}, \text{emphatic})$ and one showing null. In contrast, the LLM reveals a strong negative-hedging coupling across speakers -- Rogoff's $r(\text{neg}, \text{hedged}) = 0.875$ ($p = 0.001$) and Zeihan's $r(\text{neg}, \text{hedged}) = 0.722$ ($p = 0.008$) -- consistent with the conventional expectation that pessimistic discourse attracts hedging, not certainty. Sentence-level error analysis traces this discrepancy to three structural failure modes in keyword lexicons -- syntactic blindness, polysemy blindness, and categorical absence -- illustrated through cases where keyword counting inverts semantic meaning (e.g., ''never absolutely totally confident'' scored as high-certainty). We argue that keyword lexicons measure a universal lexical co-occurrence tendency -- negative discourse naturally attracts emphatic vocabulary -- that is orthogonal to, and can systematically invert, rhetorical stance. Treating keyword counts as measurements of epistemic certainty is a category error: a finding that appears to be about a speaker's psychology may be entirely about the counting of words.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Keyword lexicons can flip stance correlations on this interview set versus LLM labels, but the paper treats the LLM as ground truth without any validation, so the category-error claim stays unproven.

read the letter

The paper's main contribution is a concrete demonstration on 85 interviews from four speakers: keyword lexicons yield strong positive correlations between negative affect and emphatic certainty (r 0.72-0.93), while zero-shot LLM classification on the full 32k sentences drops those numbers sharply (Dalio from 0.851 to 0.206) and flips some to negative or null. It also surfaces the opposite pattern with hedging. The three named failure modes—syntactic blindness, polysemy, and missing categories—are backed by sentence examples like "never absolutely totally confident" scoring high on certainty keywords. That part is useful; it gives a quantified case rather than another general warning about lexicons.

The soft spot is exactly where the stress-test flags it. The argument that keyword results are artifacts and LLM results reflect true rhetorical stance requires treating the LLM output as the better measure. The abstract gives no human gold standard, no inter-rater numbers, no error analysis on the LLM labels, and no accuracy check on the inversion cases. Without that, the observed shifts could come from LLM tendencies (default hedging, context misreads) instead of lexicon flaws. The paper attributes everything to the three keyword problems and does not test the alternative.

This leaves the central claim—that treating keyword counts as epistemic stance is a category error—resting on an untested premise. The lexical co-occurrence point (negative talk draws emphatic words) is plausible on its own, but the inversion-to-true-stance step is not yet supported.

The work is aimed at computational social scientists who rely on off-the-shelf lexicons for stance or affect in discourse data. A reader already skeptical of lexicon methods will find the examples helpful; someone looking for a settled replacement method will not. It deserves peer review because the discrepancy is large enough to matter and the failure-mode examples are specific, but referees will need to see whether the full paper adds any validation for the LLM side or tests robustness to different prompts and models. I would bring it to a reading group for the method discussion but would not cite the conclusions as established.

Referee Report

2 major / 2 minor

Summary. The paper claims that keyword lexicons for epistemic certainty, hedging, and negative affect produce artifactual correlations (r=0.72–0.93) in 85 interviews with four speakers because they suffer from syntactic blindness, polysemy, and categorical absence; replacing them with LLM zero-shot semantic classification on the full 32,625-sentence diarized corpus reduces or reverses the certainty–negative-affect correlations (e.g., Dalio r drops from 0.851 to 0.206) while revealing the expected negative-affect–hedging pattern, demonstrating that keyword counts measure lexical co-occurrence rather than rhetorical stance.

Significance. If the LLM labels are shown to be valid, the result would be a substantive methodological warning for computational social science: many published correlations between lexical certainty markers and affect may be measurement artifacts rather than evidence about speakers’ psychology. The scale of the corpus and the concrete inversion examples (e.g., “never absolutely totally confident”) strengthen the demonstration that the discrepancy is structural rather than idiosyncratic.

major comments (2)

[Abstract and §3] Abstract and §3 (results): the central claim that LLM zero-shot classification reveals the “true” rhetorical stance rests on the untested premise that the LLM is systematically more valid than the keyword lexicons. No human-annotated gold standard, inter-rater reliability, accuracy metrics, or error analysis for the zero-shot labels is reported; without this, the observed correlation shifts (Dalio r=0.851→0.206, sign changes for two speakers) could equally reflect LLM-specific biases such as default hedging or context misreading, leaving the category-error conclusion unsupported.
[§4] §4 (error analysis): the three failure modes are illustrated with selected counter-examples but are not quantified across the 32,625 sentences. It is therefore unclear what fraction of the corpus is affected by syntactic blindness, polysemy, or categorical absence, and whether these modes account for the full magnitude of the reported correlation drops.

minor comments (2)

[Methods] The manuscript should report the exact LLM prompts, model version, temperature, and any post-processing rules used for zero-shot classification.
[Methods] Corpus construction details (interview selection criteria, diarization method, sentence segmentation) are referenced but not fully specified; these are needed to assess generalizability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (results): the central claim that LLM zero-shot classification reveals the “true” rhetorical stance rests on the untested premise that the LLM is systematically more valid than the keyword lexicons. No human-annotated gold standard, inter-rater reliability, accuracy metrics, or error analysis for the zero-shot labels is reported; without this, the observed correlation shifts (Dalio r=0.851→0.206, sign changes for two speakers) could equally reflect LLM-specific biases such as default hedging or context misreading, leaving the category-error conclusion unsupported.

Authors: We agree that the manuscript does not report a human-annotated gold standard or accuracy metrics for the LLM zero-shot labels, which leaves open the possibility of LLM-specific biases. Our primary argument is comparative: the keyword method produces results inconsistent with established expectations from the hedging and affect literature (negative affect should pair with hedging), while the LLM results are consistent with those expectations, and the lexicon method demonstrably inverts meaning in the provided counter-examples. We will add a limitations paragraph acknowledging the absence of direct validation and include a small-scale human annotation check on a subsample in the revision to assess LLM label quality. revision: yes
Referee: [§4] §4 (error analysis): the three failure modes are illustrated with selected counter-examples but are not quantified across the 32,625 sentences. It is therefore unclear what fraction of the corpus is affected by syntactic blindness, polysemy, or categorical absence, and whether these modes account for the full magnitude of the reported correlation drops.

Authors: The error analysis is intended to be illustrative, identifying the structural mechanisms (syntactic blindness, polysemy, categorical absence) that allow keyword counts to invert semantic meaning. We did not quantify the exact prevalence of each mode across the full corpus. The demonstration does not require that these modes explain the entire correlation drop, only that they are sufficient to produce artifactual results. We will revise §4 to explicitly state the illustrative purpose and add this as a limitation. revision: partial

Circularity Check

0 steps flagged

No circularity: independent measurement comparison on shared corpus

full rationale

The paper's core result is an empirical discrepancy between two distinct scoring procedures (keyword lexicon counts vs. LLM zero-shot labels) applied to the identical set of 32,625 sentences. No parameter is fitted to the target correlations and then re-used as a 'prediction'; no derivation reduces to a self-citation chain; no ansatz or uniqueness theorem is invoked. The reported r-value drops are direct outputs of the two independent classifiers rather than algebraic identities or fitted-input artifacts. The analysis therefore remains self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on treating LLM zero-shot classification as the more accurate ground truth for rhetorical stance. No free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption LLM zero-shot semantic classification accurately captures rhetorical stance (certainty, hedging, negative affect) without systematic errors that would explain the observed differences from keyword results.
The paper uses the LLM output to declare the keyword correlations artifactual; this premise is invoked when interpreting the drop from r=0.851 to r=0.206 and the emergence of hedging correlations.

pith-pipeline@v0.9.1-grok · 5884 in / 1328 out tokens · 35769 ms · 2026-06-25T19:30:30.303875+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 1 linked inside Pith

[1]

V ADER: A parsimonious rule-based model for sentiment analysis of social media text

C. J. Hutto and E. Gilbert. “V ADER: A parsimonious rule-based model for sentiment analysis of social media text.” InProc. ICWSM, 2014

2014
[2]

The development and psychometric prop- erties of LIWC2015

J. Pennebaker, R. Boyd, K. Jordan, and K. Blackburn. “The development and psychometric prop- erties of LIWC2015.” University of Texas at Austin, 2015

2015
[3]

When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks

T. Loughran and B. McDonald. “When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks.”Journal of Finance, 66(1):35–65, 2011

2011
[4]

BERT: Pre-training of deep bidirectional transformers for language understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. “BERT: Pre-training of deep bidirectional transformers for language understanding.” InProc. NAACL, 2019

2019
[5]

RoBERTa: A robustly optimized BERT pretraining approach

Y . Liu et al. “RoBERTa: A robustly optimized BERT pretraining approach.”arXiv:1907.11692, 2019

Pith/arXiv arXiv 1907
[6]

Text style transfer and sentiment analysis: A case study in confounding factors

K. Keith et al. “Text style transfer and sentiment analysis: A case study in confounding factors.” In Proc. ACL, 2020

2020
[7]

How context affects language models’ factual predictions

F. Petroni et al. “How context affects language models’ factual predictions.” InProc. AKBC, 2020

2020
[8]

F. R. Palmer.Mood and Modality. Cambridge University Press, 2001

2001
[9]

Stance and engagement: A model of interaction in academic discourse

K. Hyland. “Stance and engagement: A model of interaction in academic discourse.”Discourse Studies, 7(2):173–192, 2005

2005
[10]

R. P. Hart.Campaign Talk: Why Elections Are Good for Us. Princeton University Press, 2000

2000
[11]

Leading in an uncertain world: A rhetorical analysis of Federal Reserve communications

M. C. Bligh and J. C. Hess. “Leading in an uncertain world: A rhetorical analysis of Federal Reserve communications.”Leadership Quarterly, 2010

2010
[12]

Temporal dynamics of sentiment in online discourse

Y . Wang, J. Zhao, and M. Zhu. “Temporal dynamics of sentiment in online discourse.” InProc. ACL, 2019

2019
[13]

Time-aware sentiment analysis of political speeches

K. Larsen, L. Hansen, and P. Johansen. “Time-aware sentiment analysis of political speeches.” Journal of Computational Linguistics, 47(3):521–548, 2021. 15

2021
[14]

The Fed and the stock market: A textual analysis of FOMC minutes

W. Brady et al. “The Fed and the stock market: A textual analysis of FOMC minutes.”Journal of Financial Economics, 2022

2022
[15]

yt-dlp: A youtube-dl fork with additional features and fixes

yt-dlp Contributors. “yt-dlp: A youtube-dl fork with additional features and fixes.” https://github.com/yt-dlp/yt-dlp, 2024

2024
[16]

youtube-transcript-api: A Python API for retrieving YouTube transcripts

J. Depoix. “youtube-transcript-api: A Python API for retrieving YouTube transcripts.” https://github.com/jdepoix/youtube-transcript-api, 2020

2020
[17]

BERT rediscovers the classical NLP pipeline

I. Tenney, D. Das, and E. Pavlick. “BERT rediscovers the classical NLP pipeline.” InProc. ACL, 2019

2019
[18]

ChatGPT outperforms crowd workers for text-annotation tasks

F. Gilardi, M. Alizadeh, and M. Kubli. “ChatGPT outperforms crowd workers for text-annotation tasks.”Proceedings of the National Academy of Sciences, 120(30):e2305016120, 2023

2023
[19]

ChatGPT for text annotation? Mind the hype!

E. Ollion, Y . Shen, and A. Macanovic. “ChatGPT for text annotation? Mind the hype!” SocArXiv preprint, 2023

2023
[20]

Can large language models transform computational social science?

C. Ziems, W. Held, O. Shaikh, J. Chen, Z. Zhang, and D. Yang. “Can large language models transform computational social science?”Computational Linguistics, 50(1):237–272, 2024

2024
[21]

Can generative AI improve social science?

C. A. Bail. “Can generative AI improve social science?”Proceedings of the National Academy of Sciences, 121(21):e2314021121, 2024

2024
[22]

Financial statement analysis with large language models

A. Kim, M. Muhn, and V . Nikolaev. “Financial statement analysis with large language models.” SSRN Working Paper 4835311, 2024

2024
[23]

Hyland.Metadiscourse: Exploring Interaction in Writing

K. Hyland.Metadiscourse: Exploring Interaction in Writing. Continuum, 2005. 16

2005

[1] [1]

V ADER: A parsimonious rule-based model for sentiment analysis of social media text

C. J. Hutto and E. Gilbert. “V ADER: A parsimonious rule-based model for sentiment analysis of social media text.” InProc. ICWSM, 2014

2014

[2] [2]

The development and psychometric prop- erties of LIWC2015

J. Pennebaker, R. Boyd, K. Jordan, and K. Blackburn. “The development and psychometric prop- erties of LIWC2015.” University of Texas at Austin, 2015

2015

[3] [3]

When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks

T. Loughran and B. McDonald. “When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks.”Journal of Finance, 66(1):35–65, 2011

2011

[4] [4]

BERT: Pre-training of deep bidirectional transformers for language understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. “BERT: Pre-training of deep bidirectional transformers for language understanding.” InProc. NAACL, 2019

2019

[5] [5]

RoBERTa: A robustly optimized BERT pretraining approach

Y . Liu et al. “RoBERTa: A robustly optimized BERT pretraining approach.”arXiv:1907.11692, 2019

Pith/arXiv arXiv 1907

[6] [6]

Text style transfer and sentiment analysis: A case study in confounding factors

K. Keith et al. “Text style transfer and sentiment analysis: A case study in confounding factors.” In Proc. ACL, 2020

2020

[7] [7]

How context affects language models’ factual predictions

F. Petroni et al. “How context affects language models’ factual predictions.” InProc. AKBC, 2020

2020

[8] [8]

F. R. Palmer.Mood and Modality. Cambridge University Press, 2001

2001

[9] [9]

Stance and engagement: A model of interaction in academic discourse

K. Hyland. “Stance and engagement: A model of interaction in academic discourse.”Discourse Studies, 7(2):173–192, 2005

2005

[10] [10]

R. P. Hart.Campaign Talk: Why Elections Are Good for Us. Princeton University Press, 2000

2000

[11] [11]

Leading in an uncertain world: A rhetorical analysis of Federal Reserve communications

M. C. Bligh and J. C. Hess. “Leading in an uncertain world: A rhetorical analysis of Federal Reserve communications.”Leadership Quarterly, 2010

2010

[12] [12]

Temporal dynamics of sentiment in online discourse

Y . Wang, J. Zhao, and M. Zhu. “Temporal dynamics of sentiment in online discourse.” InProc. ACL, 2019

2019

[13] [13]

Time-aware sentiment analysis of political speeches

K. Larsen, L. Hansen, and P. Johansen. “Time-aware sentiment analysis of political speeches.” Journal of Computational Linguistics, 47(3):521–548, 2021. 15

2021

[14] [14]

The Fed and the stock market: A textual analysis of FOMC minutes

W. Brady et al. “The Fed and the stock market: A textual analysis of FOMC minutes.”Journal of Financial Economics, 2022

2022

[15] [15]

yt-dlp: A youtube-dl fork with additional features and fixes

yt-dlp Contributors. “yt-dlp: A youtube-dl fork with additional features and fixes.” https://github.com/yt-dlp/yt-dlp, 2024

2024

[16] [16]

youtube-transcript-api: A Python API for retrieving YouTube transcripts

J. Depoix. “youtube-transcript-api: A Python API for retrieving YouTube transcripts.” https://github.com/jdepoix/youtube-transcript-api, 2020

2020

[17] [17]

BERT rediscovers the classical NLP pipeline

I. Tenney, D. Das, and E. Pavlick. “BERT rediscovers the classical NLP pipeline.” InProc. ACL, 2019

2019

[18] [18]

ChatGPT outperforms crowd workers for text-annotation tasks

F. Gilardi, M. Alizadeh, and M. Kubli. “ChatGPT outperforms crowd workers for text-annotation tasks.”Proceedings of the National Academy of Sciences, 120(30):e2305016120, 2023

2023

[19] [19]

ChatGPT for text annotation? Mind the hype!

E. Ollion, Y . Shen, and A. Macanovic. “ChatGPT for text annotation? Mind the hype!” SocArXiv preprint, 2023

2023

[20] [20]

Can large language models transform computational social science?

C. Ziems, W. Held, O. Shaikh, J. Chen, Z. Zhang, and D. Yang. “Can large language models transform computational social science?”Computational Linguistics, 50(1):237–272, 2024

2024

[21] [21]

Can generative AI improve social science?

C. A. Bail. “Can generative AI improve social science?”Proceedings of the National Academy of Sciences, 121(21):e2314021121, 2024

2024

[22] [22]

Financial statement analysis with large language models

A. Kim, M. Muhn, and V . Nikolaev. “Financial statement analysis with large language models.” SSRN Working Paper 4835311, 2024

2024

[23] [23]

Hyland.Metadiscourse: Exploring Interaction in Writing

K. Hyland.Metadiscourse: Exploring Interaction in Writing. Continuum, 2005. 16

2005