arxiv: 2605.07084 · v1 · submitted 2026-05-08 · 💻 cs.CL

Beyond Single Ground Truth: Reference Monism as Epistemic Injustice in ASR Evaluation

Pith reviewed 2026-05-11 01:12 UTC · model grok-4.3

classification 💻 cs.CL
keywords automatic speech recognition · epistemic injustice · aphasia · word error rate · reference monism · transcription conventions · ground truth

The pith

Enforcing a single transcription convention as ground truth in ASR evaluation commits epistemic injustice against speakers with aphasia.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that automatic speech recognition evaluation relies on one chosen transcript as the sole correct reference, yet multiple valid conventions for transcribing identical speech exist and produce different judgments of system output. For speakers with aphasia, whose disfluencies often hold clinical significance, a clean reference convention marks those features as errors and thereby disadvantages their speech in performance scores. The core issue is not just uneven numbers but an evaluative system that lacks resources to treat such speech variations as legitimate rather than mistakes. A sympathetic reader would care because this shows how technical benchmarks can embed assumptions that marginalize certain ways of speaking. If the argument holds, evaluation must shift from assuming one right answer to reporting performance across multiple legitimate conventions.

Core claim

Reference monism, the enforcement of one transcription convention as the definitive ground truth, produces epistemic injustice: it creates a hermeneutical gap that prevents clinically meaningful disfluencies in aphasic speech from being recognized as valid contributions. When clean references are used, this systematically disadvantages such speech in Word Error Rate (WER) calculations. The authors quantify the cost with Epistemic Injustice Distance (EID), demonstrate it on AphasiaBank data, and advocate WER-Range, which reports results across conventions instead of under a single one.

What carries the argument

Reference monism: the practice of treating one transcription convention as the only legitimate ground truth. It carries the argument because enforcing a single convention generates a hermeneutical gap that withholds interpretive resources from legitimate speech variations.

If this is right

  • WER scores for the same ASR output on aphasic speech will vary depending on whether the reference convention treats disfluencies as errors or preserves them.
  • A single WER number conceals the performance cost imposed by reference monism.
  • Reporting WER-Range across legitimate conventions would give a more accurate picture of system capability.
  • The evaluative infrastructure itself must gain interpretive resources to avoid treating meaningful speech features as noise.
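
The first bullet can be made concrete with a minimal sketch. It assumes whitespace tokenization and the standard word-level edit-distance definition of WER (substitutions, deletions, and insertions divided by reference length). The utterance, both reference transcripts, and the ASR hypothesis are invented for illustration and do not come from AphasiaBank or the paper.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER of a hypothesis string against one reference string."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    return word_errors(ref_tokens, hyp_tokens) / len(ref_tokens)


# Invented example: two references for the same (hypothetical) utterance.
references = {
    "verbatim": "i uh i want want the the water",  # disfluencies preserved
    "clean": "i want the water",                   # disfluencies removed
}
hypothesis = "i uh i want the the water"  # hypothetical ASR output

scores = {name: wer(ref, hypothesis) for name, ref in references.items()}
print(scores)
```

In this toy case the same output scores 0.125 against the verbatim reference and 0.75 against the clean one, because every preserved disfluency counts as an insertion under the clean convention.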

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same logic could apply to ASR evaluation on accented, dialectal, or other non-standard speech varieties that deviate from dominant transcription norms.
  • Training data collection for ASR might also need to incorporate multiple reference conventions rather than one canonical version.
  • Similar reference-monism problems could arise in other AI evaluation settings that rely on single human annotations for variable or subjective phenomena.

Load-bearing premise

Multiple transcription conventions represent equally legitimate ways to capture the same speech, and the absence of resources to accommodate them in evaluation amounts to epistemic injustice rather than a purely technical issue.

What would settle it

An experiment showing that ASR systems receive identical rankings and no differential disadvantage for aphasic speech when evaluated against any of several legitimate transcription conventions instead of one clean reference.
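
A rough sketch of how that check might be run, assuming per-convention WER values are already available. The function names, system names, group labels, and all numbers are hypothetical placeholders, and the sketch leaves out the significance testing a real experiment would need.

```python
def ranking(wer_by_system):
    """Order system names from lowest to highest WER (ties keep input order)."""
    return sorted(wer_by_system, key=wer_by_system.get)


def rankings_agree(wer_by_convention):
    """True when every convention yields the same system ordering."""
    orders = [ranking(scores) for scores in wer_by_convention.values()]
    return all(order == orders[0] for order in orders)


def group_gaps(wer_by_convention_and_group, group, baseline):
    """Per-convention WER difference between a speaker group and a baseline group."""
    return {
        convention: scores[group] - scores[baseline]
        for convention, scores in wer_by_convention_and_group.items()
    }


# Hypothetical numbers for illustration only; none come from the paper.
system_wer = {
    "verbatim": {"system_a": 0.20, "system_b": 0.30},
    "non_verbatim": {"system_a": 0.35, "system_b": 0.25},
}
group_wer = {
    "verbatim": {"aphasia": 0.375, "control": 0.25},
    "non_verbatim": {"aphasia": 0.5, "control": 0.25},
}

print(rankings_agree(system_wer))
print(group_gaps(group_wer, "aphasia", "control"))
```

With these made-up values the rankings flip between conventions and the aphasia gap doubles under the non-verbatim reference. The settling result described above would be the opposite: identical rankings and equal gaps under every convention.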

Figures

Figures reproduced from arXiv: 2605.07084 by Anna Seo Gyeong Choi, Corey Miller, Hoon Choi, James Caverlee, Maria Teleki, Miguel del Rio.

Figure 1: Reference monism versus pluralism, and their metric consequences.
Figure 2: EID decomposition by speaker group for Rev AI v2 (verbatim) under enforced non...
Original abstract

Automatic speech recognition (ASR) evaluation compares system output to ground truth transcripts, with Word Error Rate (WER) quantifying the distance between them. But ground truth transcripts are not discovered - they are produced by human annotators following conventions that encode normative assumptions about which speech features matter. Different conventions (verbatim, non-verbatim, legal) produce different transcripts of identical speech and judge the same ASR output differently. This paper argues that reference monism - enforcing a single transcription convention as ground truth - commits epistemic injustice. Speakers with aphasia, whose speech includes clinically meaningful disfluencies, are systematically disadvantaged when evaluated against "clean" references that treat those disfluencies as errors. The harm is not merely differential performance, but that evaluative infrastructure lacks interpretive resources to recognize their contributions as legitimate. We develop a philosophical framework introducing the hermeneutical gap, formalize Epistemic Injustice Distance (EID) to measure reference monism's cost, and demonstrate empirically using AphasiaBank that WER varies depending on which convention defines ground truth. We propose WER-Range: reporting performance across legitimate conventions rather than assuming a single correct answer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that reference monism—enforcing a single transcription convention as ground truth in ASR evaluation—commits epistemic injustice by systematically disadvantaging speakers with aphasia, whose clinically meaningful disfluencies are treated as errors in 'clean' references. It introduces the hermeneutical gap and Epistemic Injustice Distance (EID) as a formal measure, empirically demonstrates WER variation across conventions on AphasiaBank data, and proposes WER-Range as an alternative to single-reference reporting.

Significance. If the argument holds, the work could meaningfully shift ASR evaluation practices by highlighting how reference conventions embed normative assumptions that marginalize non-standard speech patterns. The empirical observation of WER sensitivity to transcription choices provides a concrete, falsifiable basis for the critique, and the proposal of WER-Range offers a practical alternative. Strengths include the explicit linkage of philosophical concepts to an existing dataset and the introduction of EID as a quantifiable metric; these elements make the contribution more than purely conceptual.

major comments (3)
  1. [Empirical evaluation] Empirical section (AphasiaBank experiments): The manuscript reports WER variation across conventions but provides insufficient detail on data processing, how verbatim vs. non-verbatim conventions were operationalized, inter-annotator reliability, and any statistical controls or significance tests. Without these, the empirical support for the claim that reference monism produces systematic disadvantage remains partial and hard to replicate.
  2. [Philosophical framework] Framework section introducing EID and hermeneutical gap: The formalization treats disfluency-preserving transcripts as legitimate alternative ground truths for ASR evaluation, yet offers no independent argument or evidence that such conventions better serve ASR's core goal of recovering intended propositional content (as opposed to clinical documentation). This premise is load-bearing for moving from 'WER changes with convention' to 'monism enacts epistemic injustice.'
  3. [Discussion and proposal] Discussion of WER-Range proposal: The recommendation to report performance across 'legitimate' conventions does not specify criteria for determining legitimacy or weighting, leaving the method underspecified for practical use and risking inconsistent application across papers.
minor comments (2)
  1. Define all acronyms (EID, WER-Range, etc.) on first use and ensure consistent terminology between the abstract and body.
  2. Add a small table or example transcriptions from AphasiaBank showing specific differences between conventions to ground the abstract claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. These have helped us identify areas where the manuscript requires greater clarity, detail, and specification. We respond to each major comment below and indicate the revisions we will incorporate.

  1. Referee: [Empirical evaluation] Empirical section (AphasiaBank experiments): The manuscript reports WER variation across conventions but provides insufficient detail on data processing, how verbatim vs. non-verbatim conventions were operationalized, inter-annotator reliability, and any statistical controls or significance tests. Without these, the empirical support for the claim that reference monism produces systematic disadvantage remains partial and hard to replicate.

    Authors: We agree that the empirical section requires substantially more detail to support replicability and strengthen the claims. In the revised manuscript, we will expand the section to include: a complete description of data selection and preprocessing from AphasiaBank; explicit operational definitions and examples for verbatim versus non-verbatim conventions; inter-annotator reliability statistics (e.g., Cohen's kappa); and statistical tests (e.g., paired t-tests or Wilcoxon signed-rank tests) with effect sizes and p-values for WER differences. These additions will make the empirical evidence more robust and transparent.
    revision: yes

  2. Referee: [Philosophical framework] Framework section introducing EID and hermeneutical gap: The formalization treats disfluency-preserving transcripts as legitimate alternative ground truths for ASR evaluation, yet offers no independent argument or evidence that such conventions better serve ASR's core goal of recovering intended propositional content (as opposed to clinical documentation). This premise is load-bearing for moving from 'WER changes with convention' to 'monism enacts epistemic injustice.'

    Authors: We appreciate the referee highlighting the need to clarify the relationship between our framework and ASR's primary objectives. The manuscript does not argue that disfluency-preserving conventions are superior for recovering propositional content in general ASR use cases. Rather, it contends that reference monism imposes a single normative convention that erases features meaningful to speakers with aphasia, thereby creating a hermeneutical gap and enacting epistemic injustice. We will revise the framework section to explicitly distinguish between general ASR goals and the specific harms of monism for non-standard speech, drawing on additional clinical linguistics literature to justify the legitimacy of alternative conventions without claiming universal superiority.
    revision: partial

  3. Referee: [Discussion and proposal] Discussion of WER-Range proposal: The recommendation to report performance across 'legitimate' conventions does not specify criteria for determining legitimacy or weighting, leaving the method underspecified for practical use and risking inconsistent application across papers.

    Authors: We agree that the WER-Range proposal needs clearer operational guidance to be practically useful. In the revised discussion, we will specify criteria for legitimacy, including conventions that are (i) established in clinical or linguistic practice for the relevant population, (ii) supported by peer-reviewed research on the speech variety, or (iii) developed in consultation with speaker communities. We will also outline implementation options, such as reporting the full range of WER values, minimum and maximum scores, and context-dependent weighting schemes, with examples of how these might be applied in different evaluation settings.
    revision: yes
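
One possible reading of the "minimum and maximum scores" option, as a hedged sketch: the paper's exact WER-Range definition is not reproduced on this page, so the `WerRange` class, the `wer_range` function, and the input values below are assumptions made for illustration. Which conventions count as legitimate is decided by the caller; the code applies no legitimacy criteria and no weighting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WerRange:
    """Summary of one system's WER across several reference conventions."""
    lowest: float
    highest: float
    best_convention: str
    worst_convention: str

    @property
    def width(self):
        """Spread between the most and least favourable convention."""
        return self.highest - self.lowest


def wer_range(wer_by_convention):
    """Build a WerRange from a mapping of convention name to WER."""
    if not wer_by_convention:
        raise ValueError("at least one convention is required")
    best = min(wer_by_convention, key=wer_by_convention.get)
    worst = max(wer_by_convention, key=wer_by_convention.get)
    return WerRange(
        lowest=wer_by_convention[best],
        highest=wer_by_convention[worst],
        best_convention=best,
        worst_convention=worst,
    )


# Hypothetical per-convention WER values for a single system.
report = wer_range({"verbatim": 0.125, "non_verbatim": 0.5, "legal": 0.25})
print(f"WER-Range: [{report.lowest}, {report.highest}], width {report.width}")
```

Reporting the interval together with the conventions that produce its endpoints keeps the single-number WER recoverable while making the dependence on the reference convention visible.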

Circularity Check

0 steps flagged

No circularity; derivation relies on independent empirical observation and new interpretive framework

The paper's chain proceeds from the observation that different transcription conventions produce different WER scores on the same audio (demonstrated on AphasiaBank), through a newly introduced philosophical framework that labels single-convention enforcement as epistemic injustice via the hermeneutical gap concept, to the formalization of EID as a distance metric and the proposal of WER-Range reporting. None of these steps reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations; the empirical WER variation is an external measurement, and the injustice framing is an interpretive overlay rather than a tautological renaming or closed loop. The central claim therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on domain assumptions about the legitimacy of multiple transcription conventions and the normative encoding in those conventions; it introduces new conceptual entities without external independent validation beyond the stated empirical variation.

axioms (2)
  • [domain assumption] Transcription conventions encode normative assumptions about which speech features matter.
    Invoked to argue that ground truth transcripts are produced rather than discovered.
  • [domain assumption] Multiple transcription conventions (verbatim, non-verbatim, legal) are legitimate representations of identical speech.
    Underpins the claim that enforcing one convention as ground truth is monism and injustice.
invented entities (3)
  • Epistemic Injustice Distance (EID) [no independent evidence]
    purpose: To measure the cost of reference monism in ASR evaluation.
    Newly formalized in the paper.
  • hermeneutical gap [no independent evidence]
    purpose: To describe the lack of interpretive resources in evaluative infrastructure for recognizing certain speech contributions as legitimate.
    Introduced as part of the new philosophical framework.
  • WER-Range [no independent evidence]
    purpose: To report ASR performance across multiple legitimate conventions rather than a single WER value.
    Proposed as an alternative evaluation approach.

