Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy

Bohan Li; Hankun Wang; Kai Yu; Xie Chen; Yiwei Guo; Zhihan Li

arxiv: 2606.04680 · v1 · pith:FKND4C3Ynew · submitted 2026-06-03 · 📡 eess.AS · cs.CL · cs.SD

Zhihan Li, Hankun Wang, Yiwei Guo, Bohan Li, Xie Chen, Kai Yu

Pith reviewed 2026-06-28 04:34 UTC · model grok-4.3

classification 📡 eess.AS · cs.CL · cs.SD

keywords: reference-free evaluation · ASR hypothesis scoring · acoustic discrepancy · TTS conditional likelihood · speech recognition · hypothesis refinement · noisy ASR

The pith

A pretrained TTS model scores ASR hypotheses by acoustic fit to enable reference-free evaluation and refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes READ, a metric that evaluates automatic speech recognition hypotheses directly from the speech signal by computing how likely the audio is under a text hypothesis. It does this without any reference transcriptions or extra training, using only an off-the-shelf auto-regressive TTS model to measure acoustic discrepancy. The approach is shown to correlate with recognition errors and can refine hypotheses to lower word error rates. Gains appear strongest when the input speech is noisy.

Core claim

READ measures fine-grained acoustic discrepancy between a speech signal and a text hypothesis by computing the conditional likelihood of speech tokens given the hypothesis inside a pretrained auto-regressive TTS model. This score serves as a reference-free evaluation metric that can also be used to select or refine ASR outputs, producing up to 20% relative error-rate reduction with larger benefits under noisy conditions.

What carries the argument

Conditional likelihood of speech tokens given a text hypothesis, computed by a pretrained auto-regressive TTS model to quantify acoustic discrepancy.
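
As a rough illustration of this mechanism, the sketch below scores a hypothesis by the per-token negative log-likelihood of the observed speech tokens. It assumes the standard auto-regressive factorization p(Y | X) = ∏ p(Y_t | Y_<t, X) and treats the discrepancy as -log p(Y_t | Y_<t, X); the paper's exact definition is not reproduced on this page, so that choice, the function names, and the toy logits are all hypothetical. A real setup would obtain the logits from a teacher-forced pass of the TTS model.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def token_discrepancy(step_logits, speech_tokens):
    """Per-token negative log-likelihood, -log p(Y_t | Y_<t, X).

    step_logits[t] stands in for the TTS logits over the speech-token
    vocabulary at step t, conditioned on the text hypothesis X and the
    observed speech prefix Y_<t (teacher forcing). Hypothetical input.
    """
    if len(step_logits) != len(speech_tokens):
        raise ValueError("need one logit vector per speech token")
    return [-log_softmax(logits)[tok]
            for logits, tok in zip(step_logits, speech_tokens)]


def utterance_discrepancy(step_logits, speech_tokens):
    """Length-normalised discrepancy; lower means a better acoustic fit."""
    per_token = token_discrepancy(step_logits, speech_tokens)
    return sum(per_token) / len(per_token)


# Toy data: 3 speech tokens, a vocabulary of 4, two competing hypotheses.
speech_tokens = [2, 0, 3]
logits_hyp_a = [[0.1, 0.2, 3.0, 0.0], [2.5, 0.1, 0.3, 0.2], [0.0, 0.1, 0.2, 2.8]]
logits_hyp_b = [[0.1, 0.2, 3.0, 0.0], [0.2, 2.5, 0.3, 0.1], [0.0, 0.1, 0.2, 2.8]]

score_a = utterance_discrepancy(logits_hyp_a, speech_tokens)
score_b = utterance_discrepancy(logits_hyp_b, speech_tokens)
best = "a" if score_a < score_b else "b"
```

In the toy data the two hypotheses differ only at the second speech token, where hypothesis b puts most of its probability mass on the wrong token, so its discrepancy is higher and hypothesis a is preferred.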

If this is right

READ scores correlate with specific types of ASR recognition errors.
READ can refine ASR outputs to achieve up to 20% relative word error rate reduction.
Performance gains from READ are larger under noisy input conditions than in clean conditions.
The method requires no additional training or domain-specific fine-tuning of the TTS model.
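
To make the "relative reduction" figure concrete, here is a minimal sketch of how word error rate and its relative reduction are usually computed, together with a selection step that picks the candidate with the lowest discrepancy. The candidate scores are made up for illustration, and `select_by_discrepancy` is a hypothetical helper, not the paper's refinement procedure. The reference transcript is used only to measure WER afterwards; the selection itself never sees it.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(wer_before, wer_after):
    """Fraction of the original error removed, e.g. 0.10 -> 0.08 gives 0.2."""
    return (wer_before - wer_after) / wer_before


def select_by_discrepancy(candidates):
    """candidates: list of (hypothesis, discrepancy); lowest discrepancy wins."""
    return min(candidates, key=lambda c: c[1])[0]


reference = "the cat sat on the mat"
first_pass = "the cat sat on a mat"
# Discrepancy values below are invented for the example.
candidates = [(first_pass, 1.9), ("the cat sat on the mat", 1.4)]

refined = select_by_discrepancy(candidates)
wer_before = word_error_rate(reference, first_pass)
wer_after = word_error_rate(reference, refined)
```

A drop from 10% to 8% WER is a 20% relative reduction under this definition, which is the sense in which the headline number should be read.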

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same acoustic-discrepancy signal could be used to flag low-confidence segments in real-time ASR pipelines.
If the TTS model is swapped for one trained on a different domain, the correlation with errors may change, providing a way to test domain robustness.
READ could be combined with language-model rescoring to separate acoustic from linguistic sources of error.
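
One way to picture the last extension is a weighted sum of an acoustic discrepancy and a language-model negative log-likelihood, with the two terms kept separate so they can be inspected. This is a sketch of the editorial idea only: the field names, the weight, and all numbers are hypothetical, and nothing here comes from the paper.

```python
def rank_hypotheses(candidates, lm_weight=0.5):
    """Sort candidates by acoustic discrepancy plus a weighted LM penalty.

    Each candidate is a dict with 'text', 'acoustic' (a READ-like score,
    lower is better) and 'lm_nll' (LM negative log-likelihood, lower is
    better). All three fields are assumptions made for this example.
    """
    def combined(c):
        return c["acoustic"] + lm_weight * c["lm_nll"]
    return sorted(candidates, key=combined)


candidates = [
    {"text": "recognize speech", "acoustic": 1.2, "lm_nll": 3.0},
    {"text": "wreck a nice beach", "acoustic": 1.1, "lm_nll": 6.0},
]

with_lm = rank_hypotheses(candidates, lm_weight=0.5)[0]["text"]
acoustic_only = rank_hypotheses(candidates, lm_weight=0.0)[0]["text"]
```

When the acoustic-only and combined rankings disagree, as they do here, the disagreement itself suggests that the competing hypotheses are acoustically close and that the decision is being made on linguistic grounds.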

Load-bearing premise

An off-the-shelf pretrained TTS model produces a conditional likelihood that reliably measures acoustic discrepancy between any speech signal and any text hypothesis.

What would settle it

Running READ-based hypothesis refinement on a standard noisy ASR test set and finding no reduction in word error rate compared with the original decoder outputs.

Figures

Figures reproduced from arXiv: 2606.04680 by Bohan Li, Hankun Wang, Kai Yu, Xie Chen, Yiwei Guo, Zhihan Li.

**Figure 1.** We propose READ for hypothesis evaluation. … Models (LLMs) has introduced a generative paradigm, where the model directly produces final results without explicit evaluation [9, 10, 11]. While generative approaches demonstrate impressive performance, we argue that refinement cannot replace evaluation. Beyond pursuing a better output, we need evaluation metrics that provide interpretable and diagnostic insi…

**Figure 2.** Segment-level combination with READ Evaluation. … obtained via dynamic programming [29] as $\pi^{*} = \arg\max_{\pi \in \mathcal{M}} \sum_{t=1}^{T} A_{t,\pi(t)}$ (3), where $\mathcal{M}$ denotes the set of all possible mappings satisfying the above constraint. Given the optimal mapping $\pi^{*}$, we aggregate discrepancy from speech-token-level to any text segment: $\mathrm{READ}^{\text{text}}_{[n_1,n_2]} = \sum_{t:\, n_1 \le \pi^{*}(t) \le n_2} \mathrm{READ}_t$ (4). This yields discrepancy scores for any text segment…
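
The alignment and aggregation in Eqs. (3) and (4) can be sketched as a small dynamic program. The excerpt does not show the constraint that defines the admissible mappings, so this example assumes a monotonic one: the mapping starts at the first text token, ends at the last, and advances by at most one text token per speech token. The affinity matrix (for instance attention weights) and the per-token discrepancies are toy values, and the function names are hypothetical.

```python
def best_monotonic_alignment(affinity):
    """Return pi*, the monotonic mapping that maximises sum_t A[t][pi(t)].

    affinity is a T x N matrix (speech tokens x text tokens). Assumed
    constraint: pi(0) = 0, pi(T-1) = N-1, and pi(t+1) - pi(t) is 0 or 1.
    """
    T, N = len(affinity), len(affinity[0])
    if T < N:
        raise ValueError("need at least one speech token per text token")
    neg_inf = float("-inf")
    score = [[neg_inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = affinity[0][0]
    for t in range(1, T):
        for n in range(min(t, N - 1) + 1):
            stay = score[t - 1][n]
            move = score[t - 1][n - 1] if n > 0 else neg_inf
            if move > stay:
                score[t][n], back[t][n] = move + affinity[t][n], n - 1
            else:
                score[t][n], back[t][n] = stay + affinity[t][n], n
    # Backtrack from the last speech token aligned to the last text token.
    path, n = [0] * T, N - 1
    for t in range(T - 1, -1, -1):
        path[t] = n
        n = back[t][n]
    return path


def segment_discrepancy(per_token, path, n1, n2):
    """Eq. (4): sum per-speech-token scores whose text index lies in [n1, n2]."""
    return sum(d for d, n in zip(per_token, path) if n1 <= n <= n2)


affinity = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.3, 0.7],
    [0.0, 0.1, 0.9],
]
per_token = [0.2, 0.3, 1.5, 0.4, 0.1]  # toy per-speech-token discrepancies

path = best_monotonic_alignment(affinity)
middle_word = segment_discrepancy(per_token, path, 1, 1)
```

Because every speech token is assigned to exactly one text token, segment scores over a partition of the text add up to the utterance total, which is what makes it meaningful to compare the same segment across two hypotheses.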

**Figure 4.** Case study of READ’s locality in reference-free evaluation. When two hypotheses differ, their READ values exhibit clear peaks of difference in the corresponding temporal range. …sequently, the correlation between READ and WER becomes stronger, reflecting READ’s bias to acoustic modeling. 4.2.2. Case Study: On the Locality of READ Evaluation. We illustrate the reference-free evaluation capability of READ and…

Original abstract

Automatic speech recognition systems commonly rely on reference transcriptions for evaluation, while reference-free approaches often depend on internal confidence estimation or auxiliary language models. We propose READ (Reference-free Hypothesis Evaluation with Acoustic Discrepancy), a novel metric that evaluates ASR hypotheses directly from the speech signal. READ emphasizes the acoustic grounding of hypotheses. It uses a pretrained auto-regressive TTS model to compute the conditional likelihood of speech tokens given a text hypothesis, to measure fine-grained acoustic discrepancy between speech and text. Without additional training, READ can be applied for hypothesis refinement. Experiments show that READ correlates with specific recognition errors and improves ASR outputs, achieving up to 20\% relative error rate reduction, with particularly strong gains under noisy conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

READ uses a pretrained TTS to score ASR hypotheses via conditional likelihood for reference-free acoustic checks, but the 20% gains rest on thin experimental detail.

The main thing here is that the paper defines READ as a metric that runs ASR text hypotheses through an off-the-shelf auto-regressive TTS and uses the resulting likelihood to quantify acoustic mismatch with the input speech. This lets it do reference-free evaluation and even hypothesis refinement, with reported gains up to 20% relative error reduction that are strongest in noisy conditions.

What is actually new is the direct use of TTS conditional likelihood as the acoustic grounding signal rather than internal ASR confidence or a separate language model. The construction is clean and requires no additional training on the target data, which is a practical plus for unlabeled or noisy settings.

The soft spot is the evidence. The abstract states correlation with specific errors and the error-rate improvement, yet supplies no dataset descriptions, baseline comparisons, hypothesis generation details, or statistical tests. That makes it hard to assess whether the TTS likelihood truly tracks acoustic discrepancy or whether it is picking up TTS artifacts such as prosody or speaker mismatch, especially under the noisy conditions where the gains are claimed. The assumption that an unmodified pretrained TTS serves as a general acoustic proxy without domain validation looks like the weakest link.

This paper is for ASR researchers who need reference-free tools in noisy or low-resource conditions. A reader already working on hypothesis rescoring or evaluation metrics would find the idea worth testing.

It deserves a serious referee because the framing is distinct enough from prior reference-free work to merit closer examination, even if the current results are preliminary.

Referee Report

2 major / 1 minor

Summary. The paper proposes READ, a reference-free metric for ASR hypothesis evaluation that computes the conditional likelihood of speech tokens given a text hypothesis via a pretrained auto-regressive TTS model to quantify acoustic discrepancy. Without additional training, READ is applied to correlate with recognition errors and refine hypotheses, with claimed results of up to 20% relative error rate reduction (stronger in noisy conditions).

Significance. If the empirical claims hold after validation, the approach provides a parameter-free (no fine-tuning) acoustic-grounding method for reference-free ASR evaluation using existing TTS models, which could aid hypothesis selection in low-resource or noisy settings. The direct use of pretrained TTS likelihoods without self-referential fitting is a conceptual strength.

major comments (2)

[Abstract] The central quantitative claim of up to 20% relative error rate reduction (and correlation with specific errors) is presented without any description of hypothesis generation, ASR systems, datasets, noise conditions, baselines, or statistical tests; this leaves the support for the claim limited to the summary statement and undermines assessment of the result.
[Method and Experiments] The core assumption that p(speech tokens | text) from an off-the-shelf pretrained TTS model provides a faithful, domain-agnostic measure of acoustic discrepancy is load-bearing for all claims yet receives no validation or auxiliary experiments on the target ASR corpora or noisy conditions; differences could reflect TTS artifacts (prosody, speaker mismatch) rather than acoustic match.

minor comments (1)

[Abstract] The abstract uses '20\%' notation that should be rendered consistently as 20% in the final version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and outline revisions to improve clarity and rigor.

Referee: [Abstract] The central quantitative claim of up to 20% relative error rate reduction (and correlation with specific errors) is presented without any description of hypothesis generation, ASR systems, datasets, noise conditions, baselines, or statistical tests; this leaves the support for the claim limited to the summary statement and undermines assessment of the result.

Authors: We agree that the abstract is too concise and does not provide sufficient context for the central claim. In the revised manuscript we will expand the abstract to briefly specify the evaluation datasets (including clean and noisy conditions), the ASR systems used for hypothesis generation, the refinement procedure, and the statistical testing applied. Full experimental details remain in the body of the paper, but the abstract will become self-contained.

Revision: yes

Referee: [Method and Experiments] The core assumption that p(speech tokens | text) from an off-the-shelf pretrained TTS model provides a faithful, domain-agnostic measure of acoustic discrepancy is load-bearing for all claims yet receives no validation or auxiliary experiments on the target ASR corpora or noisy conditions; differences could reflect TTS artifacts (prosody, speaker mismatch) rather than acoustic match.

Authors: The referee correctly notes that the paper provides no dedicated auxiliary experiments isolating the TTS likelihood from potential artifacts such as prosody or speaker mismatch. The current manuscript treats empirical gains on the target corpora as supporting evidence. We will add an explicit limitations paragraph in the revised version that acknowledges this gap, discusses the possibility of TTS-specific effects, and outlines directions for future targeted validation experiments. No new experiments will be added at this stage.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines READ directly as the conditional likelihood from an external pretrained auto-regressive TTS model applied to speech tokens given a text hypothesis. This is an independent external component, not derived from or fitted to the paper's own data, hypotheses, or ASR outputs. No equations, self-citations, or ansatzes are shown that reduce the metric or its claimed correlations/error reductions back to the paper's inputs by construction. The reported empirical gains (e.g., 20% relative WER reduction) are presented as experimental outcomes rather than forced predictions. This meets the criteria for a self-contained, non-circular definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that an off-the-shelf TTS model's token likelihood faithfully captures acoustic mismatch relevant to ASR errors; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: A pretrained auto-regressive TTS model produces conditional likelihoods that serve as a reliable proxy for acoustic discrepancy between speech and arbitrary text hypotheses.
Rationale: The method is presented as training-free and directly applicable, so the paper treats the TTS likelihood as a ready-made acoustic measure without further justification or validation steps described in the abstract.

Reference graph

Works this paper leans on

46 extracted references · 10 canonical work pages · 6 internal anchors

[1]

Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy

Introduction. Automatic Speech Recognition (ASR) has achieved remarkable progress with the advent of large-scale pre-training and end-to-end architectures. However, hypothesis evaluation of ASR systems remains a non-trivial challenge, particularly in real-world scenarios where ground-truth transcripts are often unavailable. Traditional evaluation relies ...

[2]

ASR Hypothesis Evaluation concerns the problem of assessing how well the output of an ASR system explains the original speech signal

Background. 2.1. ASR Hypothesis Evaluation. ASR Hypothesis Evaluation concerns the problem of assessing how well the output of an ASR system explains the original speech signal. Reference-based Evaluation. Reference-based methods directly compare the generated hypothesis with a given ground-truth transcript. Among these, the edit-distance-based Word Error...

[3]

What I cannot create I

READ: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy. 3.1. Deriving Acoustic Discrepancy from AR TTS Systems. Let $X = (X_1, \dots, X_N)$ denote a sequence of text tokens and $Y = (Y_1, \dots, Y_T)$ denote a sequence of speech tokens. A trained auto-regressive TTS system parameterized by $\theta$ models the sequence causally, defining a conditional probabil...

[4]

R. ”, “Sen

Experiments. 4.1. Experimental Setup. We conduct experiments mainly on CosyVoice2 [19], a discrete auto-regressive TTS system. We adopt the official checkpoints, which have not been subjected to any additional training on the involved datasets. This setup is to make sure that the evaluation capability originates purely from the TTS model itself. For candid...

[5]

READ is reference-free and requires only the original speech signal to assess hypotheses, indicating fine-grained acoustic discrepancy

Conclusion. We propose READ, a hypothesis evaluation method based on the conditional likelihood modeled by a TTS system. READ is reference-free and requires only the original speech signal to assess hypotheses, indicating fine-grained acoustic discrepancy. Without any additional training tailored to specific ASR models or datasets, our approach leverages t...

[6]

The authors independently developed the research framework and experimental methodology

Generative AI Use Disclosure. Generative AI was utilized for manuscript editing and technical troubleshooting. The authors independently developed the research framework and experimental methodology. We take full responsibility for the content and consent to this submission.
[7] F. Wessel, R. Schluter, K. Macherey, and H. Ney, “Confidence measures for large vocabulary continuous speech recognition,” IEEE Transactions on speech and audio processing, vol. 9, no. 3, pp. 288–298, 2001.

[8] H. Jiang, “Confidence measures for speech recognition: A survey,” Speech communication, vol. 45, no. 4, pp. 455–470, 2005.

[9] R. Zhang and A. I. Rudnicky, “Word level confidence annotation using combinations of features,” in Proc. Eurospeech 2001, 2001, pp. 2105–2108.

[10] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in International conference on machine learning. PMLR, 2017, pp. 1321–1330.

[11] C. Chelba, D. Bikel, M. Shugrina, P. Nguyen, and S. Kumar, “Large scale language modeling in automatic speech recognition,” arXiv preprint arXiv:1210.8440, 2012.

[12] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in Interspeech, vol. 2, no. 3. Makuhari, 2010, pp. 1045–1048.

[13] A. Ali and S. Renals, “Word error rate estimation for speech recognition: e-WER,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018, pp. 20–24.

[14] J. Fiscus, “A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (rover),” in 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, 1997, pp. 347–354.

[15] R. Ma, M. Qian, P. Manakul, M. Gales, and K. Knill, “Can generative large language models perform asr error correction?” arXiv preprint arXiv:2307.04172, 2023.

[16] C. Chen, Y. Hu, C.-H. H. Yang, S. M. Siniscalchi, P.-Y. Chen, and E.-S. Chng, “Hyporadise: An open baseline for generative speech recognition with large language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 31665–31688, 2023.

[17] A. D. Tur, A. Moumen, and M. Ravanelli, “Progres: Prompted generative rescoring on asr n-best,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 600–607.

[18] H. Xu, Z. Zhu, S. Zhang, D. Ma, S. Fan, L. Chen, and K. Yu, “Rejection improves reliability: Training LLMs to refuse unknown questions using RL from knowledge feedback,” in First Conference on Language Modeling, 2024.
[19] A. Ali and S. Renals, “Word Error Rate Estimation Without ASR Output: e-WER2,” in Interspeech 2020, 2020, pp. 616–620.

[20] C. Park, C. Lu, M. Chen, and T. Hain, “Fast word error rate estimation using self-supervised representations for speech and text,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.

[21] A. Waheed, H. Atwany, R. Singh, and B. Raj, “On the robust approximation of asr metrics,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 23119–23146.

[22] Y. Hu, C. Chen, C.-H. H. Yang, R. Li, C. Zhang, P.-Y. Chen, and E. Chng, “Large language models are efficient learners of noise-robust speech recognition,” in The Twelfth International Conference on Learning Representations, 2024.

[23] Y. Hu, C. Chen, C. Qin, Q. Zhu, E. Chng, and R. Li, “Listen again and choose the right answer: A new paradigm for automatic speech recognition with large language models,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 666–679.

[24] M. Halle and K. Stevens, “Speech recognition: A model and a program for research,” IRE transactions on information theory, vol. 8, no. 2, pp. 155–159, 2003.

[25] Z. Du, Y. Wang, Q. Chen, X. Shi, X. Lv, T. Zhao, Z. Gao et al., “CosyVoice 2: Scalable streaming speech synthesis with large language models,” arXiv preprint arXiv:2412.10117, 2024.

[26] S. Kim, A. Arora, D. Le, C.-F. Yeh, C. Fuegen, O. Kalinli, and M. L. Seltzer, “Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding,” in Interspeech 2021, 2021, pp. 1977–1981.

[27] H. Futami, H. Inaguma, M. Mimura, S. Sakai, and T. Kawahara, “Asr rescoring and confidence estimation with electra,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 380–387.

[28] D. Oneață, A. Caranica, A. Stan, and H. Cucu, “An evaluation of word-level confidence estimation for end-to-end automatic speech recognition,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 258–265.

[29] S. Chen, C. Wang, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen et al., “Neural codec language models are zero-shot text to speech synthesizers,” IEEE Trans. ASLP, vol. 33, pp. 705–718, 2025.

[30] Z. Du, C. Gao, Y. Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shi et al., “CosyVoice 3: Towards in-the-wild speech generation via scaling-up and post-training,” arXiv preprint arXiv:2505.17589, 2025.
[31] W. Deng, S. Zhou, J. Shu, J. Wang, and L. Wang, “IndexTTS: An industrial-level controllable and efficient zero-shot text-to-speech system,” arXiv preprint arXiv:2502.05512, 2025.

[32] P. J. Moreno, C. F. Joerg, J.-M. Van Thong, and O. Glickman, “A recursive algorithm for the forced alignment of very long audio segments,” in ICSLP, vol. 98, 1998, pp. 2711–2714.

[33] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011.

[34] X. Shi, X. Wang, Z. Guo, Y. Wang, P. Zhang, X. Zhang, Z. Guo, H. Hao, Y. Xi, B. Yang et al., “Qwen3-ASR technical report,” arXiv preprint arXiv:2601.21337, 2026.

[35] H. Wang, C. Du, Y. Guo, S. Wang, X. Chen, and K. Yu, “Attention-constrained inference for robust decoder-only text-to-speech,” in 2024 IEEE SLT. IEEE, 2024, pp. 630–637.

[36] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th ICML, vol. 202. PMLR, 23–29 Jul 2023, pp. 28492–28518.

[37] D. Rekesh, N. R. Koluguri, S. Kriman, S. Majumdar, V. Noroozi et al., “Fast conformer with linearly scalable attention for efficient speech recognition,” in 2023 IEEE ASRU. IEEE, 2023, pp. 1–8.

[38] J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen et al., “Qwen2.5-Omni technical report,” arXiv preprint arXiv:2503.20215, 2025.

[39] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE ICASSP, 2015, pp. 5206–5210.

[40] P. K. O’Neill, V. Lavrukhin, S. Majumdar, V. Noroozi, Y. Zhang, O. Kuchaiev et al., “SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition,” in Interspeech 2021, 2021, pp. 1434–1438.

[41] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “Switchboard: Telephone speech corpus for research and development,” in IEEE ICASSP 1992, vol. 1. IEEE, 1992, pp. 517–520.

[42] F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Esteve, “TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation,” in International conference on speech and computer. Springer, 2018, pp. 198–208.

[43] C. Valentini-Botinhao, “Noisy speech database for training speech enhancement algorithms and TTS models, 2016,” University of Edinburgh. School of Informatics. Centre for Speech Technology Research (CSTR), 2017, [sound]. [Online]. Available: https://doi.org/10.7488/ds/2117

[44] X. Shi, Q. Feng, and L. Xie, “The ASRU 2019 mandarin-english code-switching speech recognition challenge: Open datasets, tracks, methods and results,” in Proceedings of the First Workshop on Speech Technologies for Code-switching in Multilingual Communities (WSTCSMC 2020), 2020, pp. 71–75. [Online]. Available: http://festvox.org/cedar/WSTCSMC2020.pdf

[45] C. Li, S. Deng, Y. Wang, G. Wang, Y. Gong, C. Chen, and J. Bai, “TALCS: An open-source Mandarin-English code-switching corpus and a speech recognition baseline,” in Interspeech 2022, 2022, pp. 1741–1745.

[46] G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn et al., “WHAM!: Extending Speech Separation to Noisy Environments,” in Interspeech 2019, 2019, pp. 1368–1372.

[1] [1]

Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy

Introduction Automatic Speech Recognition (ASR) has achieved remarkable progress with the advent of large-scale pre-training and end- to-end architectures. However,hypothesis evaluationof ASR systems remains a non-trivial challenge, particularly in real- world scenarios where ground-truth transcripts are often un- available. Traditional evaluation relies ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

ASR Hypothesis Evaluation ASR Hypothesis Evaluation concerns the problem of assessing how well the output of an ASR system explains the original speech signal

Background 2.1. ASR Hypothesis Evaluation ASR Hypothesis Evaluation concerns the problem of assessing how well the output of an ASR system explains the original speech signal. Reference-based Evaluation.Reference-based methods directly compare the generated hypothesis with a given ground- truth transcript. Among these, the edit-distance-based Word Er- ror...

[3] [3]

What I cannot create I

READ: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy 3.1. Deriving Acoustic Discrepancy from AR TTS Systems LetX= (X 1, . . . , XN )denote a sequence of text tokens and Y= (Y 1, . . . , YT )denote a sequence of speech tokens. A trained auto-regressive TTS system parameterized byθmodels the sequence causally, defining a conditional probabil...

[4] [4]

R. ”, “Sen

Experiments 4.1. Experimental Setup We conduct experiments mainly on CosyV oice2 [19], a discrete auto-regressive TTS system. We adopt the official checkpoints, which have not been subjected to any additional training on the involved datasets. This setup is to make sure that the evaluation capability originates purely from the TTS model itself. For candid...

work page arXiv

[5] [5]

READ is reference-free and requires only the original speech signal to assess hypotheses, indicating fine-grained acoustic discrepancy

Conclusion We propose READ, a hypothesis evaluation method based on the conditional likelihood modeled by a TTS system. READ is reference-free and requires only the original speech signal to assess hypotheses, indicating fine-grained acoustic discrepancy. Without any additional training tailored to specific ASR models or datasets, our approach leverages t...

[6] [6]

The authors independently developed the re- search framework and experimental methodology

Generative AI Use Disclosure Generative AI was utilized for manuscript editing and technical troubleshooting. The authors independently developed the re- search framework and experimental methodology. We take full responsibility for the content and consent to this submission

[7] [7]

Confidence measures for large vocabulary continuous speech recognition,

F. Wessel, R. Schluter, K. Macherey, and H. Ney, “Confidence measures for large vocabulary continuous speech recognition,” IEEE Transactions on speech and audio processing, vol. 9, no. 3, pp. 288–298, 2001

2001

[8] [8]

Confidence measures for speech recognition: A sur- vey,

H. Jiang, “Confidence measures for speech recognition: A sur- vey,”Speech communication, vol. 45, no. 4, pp. 455–470, 2005

2005

[9] [9]

Word level confidence annotation using combinations of features,

R. Zhang and A. I. Rudnicky, “Word level confidence annotation using combinations of features,” inProc. Eurospeech 2001, 2001, pp. 2105–2108

2001

[10] [10]

On calibra- tion of modern neural networks,

C. Guo, G. Pleiss, Y . Sun, and K. Q. Weinberger, “On calibra- tion of modern neural networks,” inInternational conference on machine learning. PMLR, 2017, pp. 1321–1330

2017

[11] [11]

Large Scale Language Modeling in Automatic Speech Recognition

C. Chelba, D. Bikel, M. Shugrina, P. Nguyen, and S. Kumar, “Large scale language modeling in automatic speech recognition,” arXiv preprint arXiv:1210.8440, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012

[12] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in Interspeech, vol. 2, no. 3. Makuhari, 2010, pp. 1045–1048.

[13] A. Ali and S. Renals, “Word error rate estimation for speech recognition: e-WER,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018, pp. 20–24.

[14] J. Fiscus, “A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER),” in 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, 1997, pp. 347–354.

[15] R. Ma, M. Qian, P. Manakul, M. Gales, and K. Knill, “Can generative large language models perform ASR error correction?” arXiv preprint arXiv:2307.04172, 2023.

[16] C. Chen, Y. Hu, C.-H. H. Yang, S. M. Siniscalchi, P.-Y. Chen, and E.-S. Chng, “Hyporadise: An open baseline for generative speech recognition with large language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 31 665–31 688, 2023.

[17] A. D. Tur, A. Moumen, and M. Ravanelli, “Progres: Prompted generative rescoring on ASR n-best,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 600–607.

[18] H. Xu, Z. Zhu, S. Zhang, D. Ma, S. Fan, L. Chen, and K. Yu, “Rejection improves reliability: Training LLMs to refuse unknown questions using RL from knowledge feedback,” in First Conference on Language Modeling, 2024.

[19] A. Ali and S. Renals, “Word Error Rate Estimation Without ASR Output: e-WER2,” in Interspeech 2020, 2020, pp. 616–620.

[20] C. Park, C. Lu, M. Chen, and T. Hain, “Fast word error rate estimation using self-supervised representations for speech and text,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.

[21] A. Waheed, H. Atwany, R. Singh, and B. Raj, “On the robust approximation of ASR metrics,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 23 119–23 146.

[22] Y. Hu, C. Chen, C.-H. H. Yang, R. Li, C. Zhang, P.-Y. Chen, and E. Chng, “Large language models are efficient learners of noise-robust speech recognition,” in The Twelfth International Conference on Learning Representations, 2024.

[23] Y. Hu, C. Chen, C. Qin, Q. Zhu, E. Chng, and R. Li, “Listen again and choose the right answer: A new paradigm for automatic speech recognition with large language models,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 666–679.

[24] M. Halle and K. Stevens, “Speech recognition: A model and a program for research,” IRE Transactions on Information Theory, vol. 8, no. 2, pp. 155–159, 2003.

[25] Z. Du, Y. Wang, Q. Chen, X. Shi, X. Lv, T. Zhao, Z. Gao et al., “CosyVoice 2: Scalable streaming speech synthesis with large language models,” arXiv preprint arXiv:2412.10117, 2024.

[26] S. Kim, A. Arora, D. Le, C.-F. Yeh, C. Fuegen, O. Kalinli, and M. L. Seltzer, “Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding,” in Interspeech 2021, 2021, pp. 1977–1981.

[27] H. Futami, H. Inaguma, M. Mimura, S. Sakai, and T. Kawahara, “ASR rescoring and confidence estimation with ELECTRA,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 380–387.

[28] D. Oneață, A. Caranica, A. Stan, and H. Cucu, “An evaluation of word-level confidence estimation for end-to-end automatic speech recognition,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 258–265.

[29] S. Chen, C. Wang, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen et al., “Neural codec language models are zero-shot text to speech synthesizers,” IEEE Trans. ASLP, vol. 33, pp. 705–718, 2025.

[30] Z. Du, C. Gao, Y. Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shi et al., “CosyVoice 3: Towards in-the-wild speech generation via scaling-up and post-training,” arXiv preprint arXiv:2505.17589, 2025.

[31] W. Deng, S. Zhou, J. Shu, J. Wang, and L. Wang, “IndexTTS: An industrial-level controllable and efficient zero-shot text-to-speech system,” arXiv preprint arXiv:2502.05512, 2025.

[32] P. J. Moreno, C. F. Joerg, J.-M. Van Thong, and O. Glickman, “A recursive algorithm for the forced alignment of very long audio segments,” in ICSLP, vol. 98, 1998, pp. 2711–2714.

[33] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.

[34] X. Shi, X. Wang, Z. Guo, Y. Wang, P. Zhang, X. Zhang, Z. Guo, H. Hao, Y. Xi, B. Yang et al., “Qwen3-ASR technical report,” arXiv preprint arXiv:2601.21337, 2026.

[35] H. Wang, C. Du, Y. Guo, S. Wang, X. Chen, and K. Yu, “Attention-constrained inference for robust decoder-only text-to-speech,” in 2024 IEEE SLT. IEEE, 2024, pp. 630–637.

[36] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th ICML, vol. 202. PMLR, 23–29 Jul 2023, pp. 28 492–28 518.

[37] D. Rekesh, N. R. Koluguri, S. Kriman, S. Majumdar, V. Noroozi et al., “Fast conformer with linearly scalable attention for efficient speech recognition,” in 2023 IEEE ASRU. IEEE, 2023, pp. 1–8.

[38] J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen et al., “Qwen2.5-Omni technical report,” arXiv preprint arXiv:2503.20215, 2025.

[39] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE ICASSP, 2015, pp. 5206–5210.

[40] P. K. O’Neill, V. Lavrukhin, S. Majumdar, V. Noroozi, Y. Zhang, O. Kuchaiev et al., “SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition,” in Interspeech 2021, 2021, pp. 1434–1438.

[41] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “Switchboard: Telephone speech corpus for research and development,” in IEEE ICASSP 1992, vol. 1. IEEE, 1992, pp. 517–520.

[42] F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Esteve, “TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation,” in International Conference on Speech and Computer. Springer, 2018, pp. 198–208.

[43] C. Valentini-Botinhao, “Noisy speech database for training speech enhancement algorithms and TTS models, 2016,” University of Edinburgh. School of Informatics. Centre for Speech Technology Research (CSTR), 2017, [sound]. [Online]. Available: https://doi.org/10.7488/ds/2117

[44] X. Shi, Q. Feng, and L. Xie, “The ASRU 2019 Mandarin-English code-switching speech recognition challenge: Open datasets, tracks, methods and results,” in Proceedings of the First Workshop on Speech Technologies for Code-switching in Multilingual Communities (WSTCSMC 2020), 2020, pp. 71–75. [Online]. Available: http://festvox.org/cedar/WSTCSMC2020.pdf

[45] C. Li, S. Deng, Y. Wang, G. Wang, Y. Gong, C. Chen, and J. Bai, “TALCS: An open-source Mandarin-English code-switching corpus and a speech recognition baseline,” in Interspeech 2022, 2022, pp. 1741–1745.

[46] G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn et al., “WHAM!: Extending Speech Separation to Noisy Environments,” in Interspeech 2019, 2019, pp. 1368–1372.