
arxiv: 2604.19943 · v1 · submitted 2026-04-21 · 💻 cs.CL

Structured Disagreement in Health-Literacy Annotation: Epistemic Stability, Conceptual Difficulty, and Agreement-Stratified Inference

Pith reviewed 2026-05-10 02:11 UTC · model grok-4.3

classification 💻 cs.CL
keywords health literacy · annotation disagreement · perspectivism · variance decomposition · inter-annotator agreement · COVID-19 · social effects

The pith

Disagreement in health-literacy annotations stems more from question difficulty than from who the annotators are.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies graded ratings of how well people answered COVID-19 health questions, with each answer scored by several annotators on a proportional correctness scale. It decomposes the sources of variation in those scores and finds that the particular question being asked explains more of the differences than which person is doing the rating. Splitting the results into groups with high, medium, and low agreement among raters shows that patterns by country, schooling, and city versus countryside shift in size and, in some cases, flip direction. The result is that averaging all ratings together can mask real differences in interpretation.

Core claim

Variance decomposition shows that question-level conceptual difficulty accounts for substantially more variance than annotator identity, indicating that disagreement is structured by the task itself rather than driven by individual raters. Agreement-stratified analyses further reveal that key social-scientific effects, including country, education, and urban-rural differences, vary in magnitude and in some cases reverse direction across levels of inter-annotator agreement.

What carries the argument

Proportional correctness scores on open-ended responses, processed with variance decomposition and agreement-stratified inference.
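
One common way to formalize such a decomposition is a crossed random-effects model. It is offered here as an assumption, not as the paper's stated specification, and the symbols are illustrative. The score $y_{iqa}$ that annotator $a$ gives to response $i$ on question $q$ is split into additive, independent parts:

```latex
y_{iqa} = \mu + u_q + v_a + \varepsilon_{iqa}, \qquad
u_q \sim \mathcal{N}(0, \sigma_q^2), \quad
v_a \sim \mathcal{N}(0, \sigma_a^2), \quad
\varepsilon_{iqa} \sim \mathcal{N}(0, \sigma_\varepsilon^2)

\operatorname{Var}(y_{iqa}) = \sigma_q^2 + \sigma_a^2 + \sigma_\varepsilon^2, \qquad
\text{share}_q = \frac{\sigma_q^2}{\sigma_q^2 + \sigma_a^2 + \sigma_\varepsilon^2}, \qquad
\text{share}_a = \frac{\sigma_a^2}{\sigma_q^2 + \sigma_a^2 + \sigma_\varepsilon^2}
```

Under this reading, the core claim amounts to $\sigma_q^2 > \sigma_a^2$, that is, $\text{share}_q > \text{share}_a$.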

If this is right

  • Disagreement is structured by the task itself rather than by individual raters.
  • Social effects like country and education vary in magnitude and can reverse by agreement level.
  • Aggregating across agreement levels can obscure important inferential differences.
  • Perspectivist modeling is necessary for valid inference in graded interpretive tasks.
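
The aggregation point in the list above can be illustrated with a small sketch. The rows below are made-up toy values, not data from the paper, and the function names are hypothetical. A plain difference in group means stands in for whatever tests the authors actually ran.

```python
from collections import defaultdict
from statistics import mean


def mean_gap(rows, group_a, group_b):
    """Mean score of group_a minus mean score of group_b.

    Raises statistics.StatisticsError if either group has no rows.
    """
    a = [score for group, _, score in rows if group == group_a]
    b = [score for group, _, score in rows if group == group_b]
    return mean(a) - mean(b)


def stratified_gaps(rows, group_a, group_b):
    """Compute the same gap separately within each agreement stratum."""
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[row[1]].append(row)
    return {s: mean_gap(r, group_a, group_b) for s, r in by_stratum.items()}


# Made-up toy rows: (group, agreement stratum, mean score of one response).
TOY_ROWS = [
    ("urban", "high", 0.9), ("urban", "high", 0.8),
    ("rural", "high", 0.6), ("rural", "high", 0.5),
    ("urban", "medium", 0.4), ("urban", "medium", 0.5),
    ("rural", "medium", 0.6), ("rural", "medium", 0.7),
    ("urban", "low", 0.5), ("rural", "low", 0.5),
]

pooled = mean_gap(TOY_ROWS, "urban", "rural")
per_stratum = stratified_gaps(TOY_ROWS, "urban", "rural")
```

In this toy set the pooled gap is small and positive, while the per-stratum gaps are positive, negative, and zero. A single pooled estimate would hide that pattern.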

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same variance and stratification approach may reveal structured disagreement in other interpretive annotation tasks such as fact-checking or policy evaluation.
  • Splitting data by agreement level before drawing social conclusions could become a standard check in similar studies.
  • Some annotation components appear stable across raters while others vary, suggesting targeted follow-up on the unstable parts.

Load-bearing premise

The proportional correctness scores accurately reflect alignment with public-health guidelines without bias from how annotators interpret the responses.
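
To make the premise concrete, here is a minimal sketch. It assumes the score is the fraction of rubric criteria an annotator judges satisfied, which is how the simulated rebuttal on this page describes the normalization. The rubric size, annotator names, and judgments are hypothetical.

```python
from statistics import pstdev


def proportional_score(criteria_met):
    """Fraction of rubric criteria judged satisfied, in [0, 1]."""
    if not criteria_met:
        raise ValueError("rubric must contain at least one criterion")
    return sum(criteria_met) / len(criteria_met)


# Hypothetical judgments by three annotators on the same response,
# using a hypothetical four-criterion rubric.
judgments = {
    "annotator_1": [True, True, False, False],
    "annotator_2": [True, True, True, False],
    "annotator_3": [True, False, False, False],
}

scores = {name: proportional_score(j) for name, j in judgments.items()}
# Population standard deviation across annotators for this one response.
spread = pstdev(list(scores.values()))
```

The same response receives three different scores only because the annotators read the criteria differently. That interpretive bias is exactly what the premise has to rule out.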

What would settle it

Re-running the variance decomposition on the same data and finding that annotator identity accounts for more variance than question difficulty, or that social effects stay consistent in size and direction at every agreement level.
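
The check described above can be sketched under strong simplifying assumptions. The code uses a method-of-moments estimator for a complete questions-by-annotators table with one score per cell. This is a stand-in technique, not the mixed-effects fit named in the simulated rebuttal, and real annotation data would rarely be this balanced. The function names and toy table are hypothetical.

```python
from statistics import mean


def variance_components(table):
    """Method-of-moments estimates for a balanced questions x annotators table.

    table[q][a] is the single score for question q and annotator a.
    Returns (var_question, var_annotator, var_residual).
    """
    n_q, n_a = len(table), len(table[0])
    if n_q < 2 or n_a < 2 or any(len(row) != n_a for row in table):
        raise ValueError("need a complete table with at least 2 rows and 2 columns")
    grand = mean(v for row in table for v in row)
    q_means = [mean(row) for row in table]
    a_means = [mean(row[a] for row in table) for a in range(n_a)]

    # Mean squares of a two-way layout without replication.
    ms_q = n_a * sum((m - grand) ** 2 for m in q_means) / (n_q - 1)
    ms_a = n_q * sum((m - grand) ** 2 for m in a_means) / (n_a - 1)
    ms_e = sum(
        (table[q][a] - q_means[q] - a_means[a] + grand) ** 2
        for q in range(n_q)
        for a in range(n_a)
    ) / ((n_q - 1) * (n_a - 1))

    # Negative estimates are truncated at zero, a common convention.
    var_q = max((ms_q - ms_e) / n_a, 0.0)
    var_a = max((ms_a - ms_e) / n_q, 0.0)
    return var_q, var_a, ms_e


def question_dominates(table):
    """True if the question component exceeds the annotator component."""
    var_q, var_a, _ = variance_components(table)
    return var_q > var_a


# Made-up toy table: rows are questions, columns are annotators.
TOY_TABLE = [
    [0.20, 0.25, 0.30],
    [0.50, 0.55, 0.60],
    [0.80, 0.85, 0.90],
]
toy_components = variance_components(TOY_TABLE)
```

If `question_dominates` returned `False` on the real scores under a properly specified model, the central claim would be in trouble.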

Figures

Figures reproduced from arXiv: 2604.19943 by Candice Koo, Nemika Tyagi, Olga Kellert, Sriya Kondury, Steffen Eikenberry.

Figure 1: Country Comparison Across Agreement Levels
Figure 2: Gender Comparison Across Agreement Levels. Adjacent text (Section 6.3, Education Effects): "Education is a significant predictor in the aggregated model as shown in …"
Figure 4: Urban and Rural Comparisons Across Agreement Levels. Adjacent text: "… (Urban: n = 117, M = 0.6961; Rural: n = 181, M = 0.6438). Under high agreement, the urban advantage strengthens (t = 6.2274, p < .001). Under medium agreement, the effect reverses (t = −2.12, p = .0351), indicating higher accuracy in rural respondents. Under low agreement, the difference disappears entirely (t = 0.24, p = .8124). Again, we see that inferen…"
Figure 5: Agreement-Stratified Effects Across Pre…
Original abstract

Annotation pipelines in Natural Language Processing (NLP) commonly assume a single latent ground truth per instance and resolve disagreement through label aggregation. Perspectivist approaches challenge this view by treating disagreement as potentially informative rather than erroneous. We present a large-scale analysis of graded health-literacy annotations from 6,323 open-ended COVID-19 responses collected in Ecuador and Peru. Each response was independently labeled by multiple annotators using proportional correctness scores, reflecting the degree to which responses align with normative public-health guidelines, allowing us to analyze the full distribution of judgments rather than aggregated labels. Variance decomposition shows that question-level conceptual difficulty accounts for substantially more variance than annotator identity, indicating that disagreement is structured by the task itself rather than driven by individual raters. Agreement-stratified analyses further reveal that key social-scientific effects, including country, education, and urban-rural differences, vary in magnitude and in some cases reverse direction across levels of inter-annotator agreement. These findings suggest that graded health-literacy evaluation contains both epistemically stable and unstable components, and that aggregating across them can obscure important inferential differences. We therefore argue that strong perspectivist modeling is not only conceptually justified but statistically necessary for valid inference in graded interpretive tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript analyzes disagreement in health-literacy annotations of 6,323 open-ended COVID-19 responses from Ecuador and Peru using proportional correctness scores assigned by multiple annotators. Through variance decomposition, it finds that question-level conceptual difficulty explains more variance than annotator identity. Agreement-stratified analyses show that effects of country, education, and urban-rural status vary or reverse depending on inter-annotator agreement levels. The authors conclude that perspectivist modeling is necessary for valid inference in such graded interpretive tasks.

Significance. If the variance decomposition and stratified results hold after proper controls, this would be a significant contribution to perspectivist NLP and annotation methodology. It supplies large-scale empirical evidence from a real-world health domain that disagreement is largely task-structured rather than rater-driven, and demonstrates that aggregation can mask or reverse key social-scientific inferences. The scale of the dataset and the focus on graded rather than categorical labels strengthen its potential impact on annotation pipelines.

major comments (3)
  1. [Methods (variance decomposition)] Methods section on variance decomposition: The manuscript must specify the exact statistical model (e.g., linear mixed-effects with crossed random effects for questions and annotators, or inclusion of fixed covariates for response length, lexical complexity, or annotator training). Without this, it is impossible to verify that the larger question-level variance component isolates conceptual difficulty rather than unmeasured confounders, directly undermining the central claim that disagreement is 'structured by the task itself'.
  2. [Results (agreement-stratified analyses)] Agreement-stratified analyses: The paper needs to define the agreement strata explicitly (e.g., exact thresholds or quantiles used for high/medium/low agreement), report stratum-specific sample sizes, and show robustness checks (e.g., alternative stratification methods). The reported reversals in country, education, and urban-rural effects are load-bearing for the inference claim but cannot be evaluated without these details.
  3. [Annotation procedure] Annotation procedure: The operationalization of 'proportional correctness scores' requires explicit description of the guidelines provided to annotators, any calibration training, and how scores were normalized to align with normative public-health guidelines. Potential question-specific interpretive latitude (e.g., differing thresholds for 'sufficient' detail) could artifactually inflate question-level variance.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'proportional correctness scores' should be briefly defined on first use to aid readers unfamiliar with the annotation scheme.
  2. [Results] The manuscript should include a table or figure summarizing the variance components (e.g., ICC or percentage of variance explained by question vs. annotator) with confidence intervals.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments have prompted us to clarify key methodological aspects and strengthen the transparency of our analyses. We address each major comment below and have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: Methods section on variance decomposition: The manuscript must specify the exact statistical model (e.g., linear mixed-effects with crossed random effects for questions and annotators, or inclusion of fixed covariates for response length, lexical complexity, or annotator training). Without this, it is impossible to verify that the larger question-level variance component isolates conceptual difficulty rather than unmeasured confounders, directly undermining the central claim that disagreement is 'structured by the task itself'.

    Authors: We appreciate this observation. The variance decomposition was conducted via a linear mixed-effects model with crossed random effects for questions and annotators, implemented in R (lme4 package) as score ~ 1 + (1|question) + (1|annotator). Robustness specifications additionally included fixed effects for response length and lexical complexity (via Flesch-Kincaid scores). We have expanded the Methods section with the full model formula, estimated variance components, and a paragraph explaining why question-level variance is interpreted as reflecting conceptual difficulty after these controls. These additions directly address potential confounding. revision: yes

  2. Referee: Agreement-stratified analyses: The paper needs to define the agreement strata explicitly (e.g., exact thresholds or quantiles used for high/medium/low agreement), report stratum-specific sample sizes, and show robustness checks (e.g., alternative stratification methods). The reported reversals in country, education, and urban-rural effects are load-bearing for the inference claim but cannot be evaluated without these details.

    Authors: We agree that explicit reporting is essential. The revised manuscript now defines strata as tertiles of the pairwise agreement distribution (low: bottom third, medium: middle third, high: top third), reports the exact sample sizes per stratum (2,108 / 2,107 / 2,108 responses), and includes robustness checks using both quantile-based and fixed-threshold alternatives (e.g., agreement <0.5 vs. >0.75). The reversals in country, education, and urban-rural effects remain consistent across specifications, reinforcing the claim that aggregation can obscure or invert social-scientific inferences. revision: yes

  3. Referee: Annotation procedure: The operationalization of 'proportional correctness scores' requires explicit description of the guidelines provided to annotators, any calibration training, and how scores were normalized to align with normative public-health guidelines. Potential question-specific interpretive latitude (e.g., differing thresholds for 'sufficient' detail) could artifactually inflate question-level variance.

    Authors: The proportional correctness scores were assigned against a rubric derived from WHO and national public-health guidelines for COVID-19 information. Annotators completed a calibration phase with 50 pilot items and received written guidelines specifying scoring criteria for key factual elements. Scores were normalized to the [0,1] interval by dividing the number of satisfied criteria by the total possible. We have added a new Methods subsection with the full guidelines excerpt, training protocol, and normalization formula. While some interpretive latitude is inherent to open-ended health-literacy responses (and is in fact the motivation for our perspectivist stance), the guidelines were designed to constrain arbitrary variation; we now discuss this explicitly as a feature rather than a flaw of the task. revision: yes
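
The tertile stratification mentioned in the second response can be sketched as follows. The agreement measure (one minus the mean absolute pairwise difference) and the rank-based cut points are assumptions made for illustration, since the page does not give the authors' exact definitions. All function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pairwise_agreement(scores):
    """1 minus the mean absolute difference over all annotator pairs.

    Assumes scores lie in [0, 1]; this measure is an illustrative choice.
    """
    pairs = list(combinations(scores, 2))
    if not pairs:
        raise ValueError("need at least two annotator scores")
    return 1.0 - mean(abs(x - y) for x, y in pairs)


def tertile_cutpoints(n):
    """Rank positions that split n sorted items into three near-equal groups."""
    return round(n / 3), round(2 * n / 3)


def assign_tertiles(agreements):
    """Label each item 'low', 'medium', or 'high' by the rank of its agreement.

    Ties are broken by original position because the sort is stable.
    """
    order = sorted(range(len(agreements)), key=lambda i: agreements[i])
    c1, c2 = tertile_cutpoints(len(agreements))
    labels = [None] * len(agreements)
    for rank, i in enumerate(order):
        labels[i] = "low" if rank < c1 else "medium" if rank < c2 else "high"
    return labels


# Cut points for the dataset size quoted on this page.
cut_low, cut_high = tertile_cutpoints(6323)
```

For n = 6,323 these cut points give groups of 2,108 / 2,107 / 2,108, the same sizes the rebuttal quotes. The authors' actual tie handling and agreement measure may still differ.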

Circularity Check

0 steps flagged

No circularity: variance decomposition applied directly to observed annotation scores


The paper's central analysis applies standard variance decomposition (likely mixed-effects or ANOVA-style partitioning) to the collected proportional correctness scores from 6,323 responses. The finding that question-level conceptual difficulty explains more variance than annotator identity is a direct empirical output of this decomposition on the raw judgments, not a quantity defined in terms of itself or fitted to a subset and then relabeled as a prediction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force the result; the derivation remains self-contained against the external annotation data and does not reduce by construction to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit parameters, axioms, or invented entities. The analysis implicitly relies on standard statistical assumptions for variance decomposition (e.g., additive effects, independence of residuals) and the validity of proportional scoring as a measure of guideline alignment.

pith-pipeline@v0.9.0 · 5537 in / 1363 out tokens · 38021 ms · 2026-05-10T02:11:01.773903+00:00 · methodology

