Beyond Single Ground Truth: Reference Monism as Epistemic Injustice in ASR Evaluation
Pith reviewed 2026-05-11 01:12 UTC · model grok-4.3
The pith
Enforcing a single transcription convention as ground truth in ASR evaluation commits epistemic injustice against speakers with aphasia.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reference monism, the enforcement of one transcription convention as the definitive ground truth, produces epistemic injustice: it creates a hermeneutical gap that prevents clinically meaningful disfluencies in aphasic speech from being recognized as valid contributions. When clean references are used, the result is a systematic disadvantage in Word Error Rate (WER) calculations. The authors quantify this disadvantage with Epistemic Injustice Distance (EID), demonstrate it on AphasiaBank data, and advocate WER-Range, which reports results across conventions instead of under a single one.
What carries the argument
Reference monism is the practice of treating one transcription convention as the only legitimate ground truth. It carries the argument by exposing how that practice generates a hermeneutical gap, one that withholds interpretive resources from legitimate speech variations.
If this is right
- WER scores for the same ASR output on aphasic speech will vary depending on whether the reference convention treats disfluencies as errors or preserves them.
- A single WER number conceals the performance cost imposed by reference monism.
- Reporting WER-Range across legitimate conventions would give a more accurate picture of system capability.
- The evaluative infrastructure itself must gain interpretive resources to avoid treating meaningful speech features as noise.
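The first two consequences above can be illustrated with a minimal sketch. It assumes the standard definition of WER (word-level edit distance divided by the number of reference words) and uses an invented toy utterance, not AphasiaBank data. The function name `wer` and all strings are hypothetical, and the paper's own tooling and normalization steps are not described in this summary.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented toy example (assumption): the same speech under two conventions.
verbatim_ref = "i um i went to the the store"  # disfluencies preserved
clean_ref = "i went to the store"              # disfluencies removed
asr_output = "i um i went to the the store"    # ASR transcribes what was said

wer_verbatim = wer(verbatim_ref, asr_output)
wer_clean = wer(clean_ref, asr_output)
print(f"verbatim reference: {wer_verbatim:.3f}")
print(f"clean reference:    {wer_clean:.3f}")
```

In this toy case the identical ASR output scores 0.000 against the verbatim reference and 0.600 against the clean one, because the three disfluent words count as insertions against five reference words. A single reported number would hide that the gap comes from the reference, not from the recognizer.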
Where Pith is reading between the lines
- The same logic could apply to ASR evaluation on accented, dialectal, or other non-standard speech varieties that deviate from dominant transcription norms.
- Training data collection for ASR might also need to incorporate multiple reference conventions rather than one canonical version.
- Similar reference-monism problems could arise in other AI evaluation settings that rely on single human annotations for variable or subjective phenomena.
Load-bearing premise
Multiple transcription conventions represent equally legitimate ways to capture the same speech, and the absence of resources to accommodate them in evaluation amounts to epistemic injustice rather than a purely technical issue.
What would settle it
An experiment showing that ASR systems receive identical rankings and no differential disadvantage for aphasic speech when evaluated against any of several legitimate transcription conventions instead of one clean reference.
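One way to picture the ranking half of such an experiment is the sketch below. It assumes per-convention WER scores have already been computed for each system. The system names, convention labels, and numbers are invented for illustration, and the helper names `rank_systems` and `rankings_agree` are hypothetical. It checks ranking agreement only, not differential disadvantage.

```python
def rank_systems(wer_by_system: dict[str, float]) -> list[str]:
    """Order system names from lowest (best) to highest WER.

    Ties keep their insertion order because sorted() is stable.
    """
    return sorted(wer_by_system, key=lambda name: wer_by_system[name])


def rankings_agree(wer_table: dict[str, dict[str, float]]) -> bool:
    """Return True when every convention yields the same system ordering."""
    orderings = {tuple(rank_systems(scores)) for scores in wer_table.values()}
    return len(orderings) == 1


# Invented scores (assumption): convention -> system -> WER.
wer_table = {
    "verbatim": {"system_a": 0.20, "system_b": 0.35},
    "clean": {"system_a": 0.45, "system_b": 0.30},
}

for convention, scores in wer_table.items():
    print(convention, rank_systems(scores))
print("rankings agree:", rankings_agree(wer_table))
```

With these made-up numbers the two conventions reverse the ordering of the systems, so the check returns False. Agreement on real data across every legitimate convention would be the kind of result that counts against the paper's claim.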
Original abstract
Automatic speech recognition (ASR) evaluation compares system output to ground truth transcripts, with Word Error Rate (WER) quantifying the distance between them. But ground truth transcripts are not discovered - they are produced by human annotators following conventions that encode normative assumptions about which speech features matter. Different conventions (verbatim, non-verbatim, legal) produce different transcripts of identical speech and judge the same ASR output differently. This paper argues that reference monism - enforcing a single transcription convention as ground truth - commits epistemic injustice. Speakers with aphasia, whose speech includes clinically meaningful disfluencies, are systematically disadvantaged when evaluated against "clean" references that treat those disfluencies as errors. The harm is not merely differential performance, but that evaluative infrastructure lacks interpretive resources to recognize their contributions as legitimate. We develop a philosophical framework introducing the hermeneutical gap, formalize Epistemic Injustice Distance (EID) to measure reference monism's cost, and demonstrate empirically using AphasiaBank that WER varies depending on which convention defines ground truth. We propose WER-Range: reporting performance across legitimate conventions rather than assuming a single correct answer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that reference monism, the enforcement of a single transcription convention as ground truth in ASR evaluation, commits epistemic injustice by systematically disadvantaging speakers with aphasia, whose clinically meaningful disfluencies are treated as errors in 'clean' references. It introduces the hermeneutical gap and Epistemic Injustice Distance (EID) as a formal measure, empirically demonstrates WER variation across conventions on AphasiaBank data, and proposes WER-Range as an alternative to single-reference reporting.
Significance. If the argument holds, the work could meaningfully shift ASR evaluation practices by highlighting how reference conventions embed normative assumptions that marginalize non-standard speech patterns. The empirical observation of WER sensitivity to transcription choices provides a concrete, falsifiable basis for the critique, and the proposal of WER-Range offers a practical alternative. Strengths include the explicit linkage of philosophical concepts to an existing dataset and the introduction of EID as a quantifiable metric; these elements make the contribution more than purely conceptual.
major comments (3)
- [Empirical evaluation] Empirical section (AphasiaBank experiments): The manuscript reports WER variation across conventions but provides insufficient detail on data processing, how verbatim vs. non-verbatim conventions were operationalized, inter-annotator reliability, and any statistical controls or significance tests. Without these, the empirical support for the claim that reference monism produces systematic disadvantage remains partial and hard to replicate.
- [Philosophical framework] Framework section introducing EID and hermeneutical gap: The formalization treats disfluency-preserving transcripts as legitimate alternative ground truths for ASR evaluation, yet offers no independent argument or evidence that such conventions better serve ASR's core goal of recovering intended propositional content (as opposed to clinical documentation). This premise is load-bearing for moving from 'WER changes with convention' to 'monism enacts epistemic injustice.'
- [Discussion and proposal] Discussion of WER-Range proposal: The recommendation to report performance across 'legitimate' conventions does not specify criteria for determining legitimacy or weighting, leaving the method underspecified for practical use and risking inconsistent application across papers.
minor comments (2)
- Define all acronyms (EID, WER-Range, etc.) on first use and ensure consistent terminology between the abstract and body.
- Add a small table or example transcriptions from AphasiaBank showing specific differences between conventions to ground the abstract claims.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. These have helped us identify areas where the manuscript requires greater clarity, detail, and specification. We respond to each major comment below and indicate the revisions we will incorporate.
-
Referee: [Empirical evaluation] Empirical section (AphasiaBank experiments): The manuscript reports WER variation across conventions but provides insufficient detail on data processing, how verbatim vs. non-verbatim conventions were operationalized, inter-annotator reliability, and any statistical controls or significance tests. Without these, the empirical support for the claim that reference monism produces systematic disadvantage remains partial and hard to replicate.
Authors: We agree that the empirical section requires substantially more detail to support replicability and strengthen the claims. In the revised manuscript, we will expand the section to include: a complete description of data selection and preprocessing from AphasiaBank; explicit operational definitions and examples for verbatim versus non-verbatim conventions; inter-annotator reliability statistics (e.g., Cohen's kappa); and statistical tests (e.g., paired t-tests or Wilcoxon signed-rank tests) with effect sizes and p-values for WER differences. These additions will make the empirical evidence more robust and transparent.
Revision: yes
-
Referee: [Philosophical framework] Framework section introducing EID and hermeneutical gap: The formalization treats disfluency-preserving transcripts as legitimate alternative ground truths for ASR evaluation, yet offers no independent argument or evidence that such conventions better serve ASR's core goal of recovering intended propositional content (as opposed to clinical documentation). This premise is load-bearing for moving from 'WER changes with convention' to 'monism enacts epistemic injustice.'
Authors: We appreciate the referee highlighting the need to clarify the relationship between our framework and ASR's primary objectives. The manuscript does not argue that disfluency-preserving conventions are superior for recovering propositional content in general ASR use cases. Rather, it contends that reference monism imposes a single normative convention that erases features meaningful to speakers with aphasia, thereby creating a hermeneutical gap and enacting epistemic injustice. We will revise the framework section to explicitly distinguish between general ASR goals and the specific harms of monism for non-standard speech, drawing on additional clinical linguistics literature to justify the legitimacy of alternative conventions without claiming universal superiority.
Revision: partial
-
Referee: [Discussion and proposal] Discussion of WER-Range proposal: The recommendation to report performance across 'legitimate' conventions does not specify criteria for determining legitimacy or weighting, leaving the method underspecified for practical use and risking inconsistent application across papers.
Authors: We agree that the WER-Range proposal needs clearer operational guidance to be practically useful. In the revised discussion, we will specify criteria for legitimacy, including conventions that are (i) established in clinical or linguistic practice for the relevant population, (ii) supported by peer-reviewed research on the speech variety, or (iii) developed in consultation with speaker communities. We will also outline implementation options, such as reporting the full range of WER values, minimum and maximum scores, and context-dependent weighting schemes, with examples of how these might be applied in different evaluation settings.
Revision: yes
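The simplest reporting option named in the last response, minimum and maximum WER across conventions, might look like the sketch below. This is one possible reading, not the authors' specification: the `WerRange` class, the `wer_range` function, the convention labels, and the scores are all hypothetical, and the weighting schemes mentioned in the response are not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WerRange:
    """Summary of one system's WER across several reference conventions."""
    minimum: float
    maximum: float
    best_convention: str   # convention giving the lowest WER
    worst_convention: str  # convention giving the highest WER

    @property
    def spread(self) -> float:
        """Width of the interval between the lowest and highest WER."""
        return self.maximum - self.minimum


def wer_range(wer_by_convention: dict[str, float]) -> WerRange:
    """Collapse per-convention WER scores into a min/max report."""
    if not wer_by_convention:
        raise ValueError("at least one convention is required")
    best = min(wer_by_convention, key=lambda name: wer_by_convention[name])
    worst = max(wer_by_convention, key=lambda name: wer_by_convention[name])
    return WerRange(
        minimum=wer_by_convention[best],
        maximum=wer_by_convention[worst],
        best_convention=best,
        worst_convention=worst,
    )


# Invented scores (assumption) for a single ASR system.
scores = {"verbatim": 0.25, "non_verbatim": 0.5, "legal": 0.375}
report = wer_range(scores)
print(f"WER-Range: [{report.minimum:.3f}, {report.maximum:.3f}] "
      f"(lowest under {report.best_convention}, "
      f"highest under {report.worst_convention})")
```

Keeping the convention names next to the extremes lets a reader see which reference choice drives each end of the interval, which a bare pair of numbers would not show.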
Circularity Check
No circularity; the derivation relies on independent empirical observation and a new interpretive framework.
The paper's chain proceeds from the observation that different transcription conventions produce different WER scores on the same audio (demonstrated on AphasiaBank), through a newly introduced philosophical framework that labels single-convention enforcement as epistemic injustice via the hermeneutical gap concept, to the formalization of EID as a distance metric and the proposal of WER-Range reporting. None of these steps reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations; the empirical WER variation is an external measurement, and the injustice framing is an interpretive overlay rather than a tautological renaming or closed loop. The central claim therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Transcription conventions encode normative assumptions about which speech features matter.
- Domain assumption: Multiple transcription conventions (verbatim, non-verbatim, legal) are legitimate representations of identical speech.
invented entities (3)
- Epistemic Injustice Distance (EID): no independent evidence
- hermeneutical gap: no independent evidence
- WER-Range: no independent evidence