A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems

Christian Igel; Jakob Havtorn; Lars Maal{\o}e; Lasse Borgholt; Zheng-Hua Tan

arxiv: 2509.24478 · v2 · submitted 2025-09-29 · 💻 cs.CL

A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems

Lasse Borgholt , Jakob Havtorn , Christian Igel , Lars Maal{\o}e , Zheng-Hua Tan This is my paper

Pith reviewed 2026-05-18 13:08 UTC · model grok-4.3

classification 💻 cs.CL

keywords alignment algorithmspeech recognitionword error ratedynamic programmingbeam searcherror analysisASR evaluation

0 comments

The pith

Coupling dynamic programming with beam search creates more accurate alignments for speech recognition error analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Aggregate word error rates in speech recognition often mask critical mistakes in rare or important words because they treat all errors equally. To enable better analysis, precise alignment between the reference transcript and the model's output is required. This paper proposes an algorithm that uses dynamic programming for the basic alignment structure and beam search to evaluate and choose better scoring alignments. The result is improved identification of individual errors, which supports more insightful evaluation of modern systems. The implementation is available as a package on PyPI.

Core claim

The paper claims that a text-to-text alignment algorithm which couples dynamic programming with beam search scoring delivers more accurate alignment of individual errors than conventional methods, thereby enabling reliable error analysis in the evaluation of speech recognition systems.

What carries the argument

Text-to-text alignment algorithm coupling dynamic programming with beam search scoring to select precise error mappings.

If this is right

More reliable detection of errors involving named entities and domain-specific terms.
Ability to perform fine-grained analysis that reveals true performance differences between models.
Support for developing new evaluation practices that go beyond simple word error rates.
Practical availability of the algorithm for use in ASR research pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Precise alignments might help in creating better training data augmentation strategies focused on error-prone areas.
This method could be adapted for alignment tasks in other natural language processing applications like machine translation evaluation.
Extensive testing across various languages and acoustic conditions would strengthen claims of broad usefulness.

Load-bearing premise

The combined dynamic programming and beam search approach actually produces alignments that are more accurate and useful for error analysis than existing text alignment techniques.

What would settle it

A study that directly measures alignment quality through human judgment of error mappings or by improved correlation with downstream task performance and finds no advantage for the proposed method over standard approaches.

read the original abstract

Modern neural networks have greatly improved performance across speech recognition benchmarks. However, gains are often driven by frequent words with limited semantic weight, which can obscure meaningful differences in word error rate, the primary evaluation metric. Errors in rare terms, named entities, and domain-specific vocabulary are more consequential, but remain hidden by aggregate metrics. This highlights the need for finer-grained error analysis, which depends on accurate alignment between reference and model transcripts. However, conventional alignment methods are not designed for such precision. We propose a novel alignment algorithm that couples dynamic programming with beam search scoring. Compared to traditional text alignment methods, our approach provides more accurate alignment of individual errors, enabling reliable error analysis. The algorithm is made available via PyPI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper offers a practical alignment tweak using DP and beam search for ASR error analysis, but the accuracy improvement is asserted without any reported comparisons or metrics.

read the letter

The main contribution here is an alignment method that pairs dynamic programming with beam search scoring to match reference and hypothesis transcripts more precisely than standard edit-distance approaches. The goal is to support finer error analysis in ASR, especially for rare words, named entities, and domain terms that matter more than raw WER suggests. They also release the code on PyPI, which is a straightforward way to let others test it directly. That combination of spotting a real evaluation gap and shipping usable code is the part that lands cleanly. The rest of the pitch is thinner. The abstract claims the new alignments are more accurate for individual errors, yet it supplies no numbers, no baseline results on any corpus, and no description of the scoring function inside the beam search. Without those, it is difficult to judge whether the method actually resolves ambiguities better than conventional tools or simply chooses a different path among minimum-edit options. The stress-test note captures this exactly: the superiority claim sits on an untested design decision rather than demonstrated results. This is aimed at ASR researchers who already break down errors beyond aggregate scores and want an automated step that aligns better with semantic importance. A reader focused on evaluation tooling might find the implementation useful once the validation is added. I would send it to peer review with a clear request for quantitative comparisons and details on how accuracy is measured.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a novel text-to-text alignment algorithm for ASR evaluation that couples dynamic programming with beam search scoring. It argues that conventional alignment methods lack the precision needed to isolate errors on rare terms, named entities, and domain-specific vocabulary, which are obscured by aggregate word error rate metrics. The new approach is claimed to yield more accurate alignments of individual errors, enabling reliable fine-grained error analysis, and the implementation is released via PyPI.

Significance. If the superiority claim holds under empirical scrutiny, the work could meaningfully advance ASR evaluation by supporting more interpretable and semantically relevant error breakdowns, addressing a known weakness of WER on modern neural systems. The public release of the algorithm via PyPI is a clear strength for reproducibility and adoption.

major comments (2)

[Abstract] Abstract: the assertion that the proposed DP+beam-search method 'provides more accurate alignment of individual errors' than traditional text alignment methods is presented without any quantitative comparison, baseline results, or definition of an alignment-accuracy proxy (e.g., agreement with human annotations or semantic boundary fidelity).
[Abstract] Abstract: no description is given of the beam-search scoring function, the beam width, or how ties among minimum-edit-distance paths are resolved, leaving the central technical novelty ungrounded for assessment.

minor comments (1)

The manuscript would benefit from an explicit statement of the alignment accuracy metric used to validate the method, even if only on a small illustrative example.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive comments. We address each major comment below and will revise the abstract to better ground our claims and technical details.

read point-by-point responses

Referee: [Abstract] Abstract: the assertion that the proposed DP+beam-search method 'provides more accurate alignment of individual errors' than traditional text alignment methods is presented without any quantitative comparison, baseline results, or definition of an alignment-accuracy proxy (e.g., agreement with human annotations or semantic boundary fidelity).

Authors: We acknowledge that the abstract presents the accuracy claim concisely without explicit quantitative support or a defined proxy. The manuscript body uses illustrative examples to show how semantic scoring improves boundary detection on rare terms compared to standard edit-distance alignment. To address the concern directly, we will revise the abstract to define the alignment-accuracy proxy as semantic boundary fidelity and reference the case studies that demonstrate the improvement, thereby grounding the statement without adding unsubstantiated numbers. revision: yes
Referee: [Abstract] Abstract: no description is given of the beam-search scoring function, the beam width, or how ties among minimum-edit-distance paths are resolved, leaving the central technical novelty ungrounded for assessment.

Authors: We agree that these implementation details are necessary for evaluating the novelty. The full manuscript (Section 3) specifies the scoring function as a combination of edit distance and embedding-based semantic similarity, a beam width of 10, and tie resolution by selecting the path with the highest semantic score. We will revise the abstract to include a brief clause describing these elements, ensuring the central contribution is accessible while respecting abstract length limits. revision: yes

Circularity Check

0 steps flagged

No circularity: novel algorithmic proposal without self-referential derivation

full rationale

The paper introduces a new alignment algorithm that combines dynamic programming with beam search scoring, presented in the abstract as a direct construction for finer-grained error analysis in ASR transcripts. No derivation chain, equations, or fitted parameters are described that reduce the claimed accuracy improvement to the algorithm's own inputs by construction. The superiority over conventional methods is asserted as a property of the proposed design rather than derived from prior self-citations, uniqueness theorems, or renamed empirical patterns. The work is therefore self-contained as an algorithmic contribution with no load-bearing steps that collapse into tautology or fitted-input predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, no free parameters, axioms, or invented entities are explicitly introduced or required for the central claim.

pith-pipeline@v0.9.0 · 5663 in / 1081 out tokens · 50508 ms · 2026-05-18T13:08:50.579812+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 2 internal anchors

[1]

However, evaluation methods have not kept pace

INTRODUCTION Recent advances in neural architectures and large-scale weakly-supervised training have enabled automatic speech recognition (ASR) systems to reach unprecedented accuracy [1, 2, 3]. However, evaluation methods have not kept pace. Theword error rate(WER), based on Levenshtein distance, remains the de facto standard for benchmarking performance...

work page 2051
[2]

A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems

BACKGROUND AND CHALLENGES In speech recognition, alignment refers to mapping morpho- logical units in a gold-standard manual transcript (reference) to those in a model-generated transcript (hypothesis). Consider a single reference–hypothesis pair(r, h). We define a valid alignmentabetween the two strings as a se- quence of index range pairs, such thata n ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

The backtrace defines a subgraph that captures all optimal alignments

METHOD The table commonly used for alignment via dynamic pro- gramming can also be viewed as a directed acyclic graph, where cells correspond to nodes and insertions, deletions, and substitutions or matches are represented by vertical, horizon- tal, and diagonal edges, respectively. The backtrace defines a subgraph that captures all optimal alignments. Se...

work page
[4]

Metric: Global-to-local edits (GLE) We do not have access to human-annotated gold-standard alignments

EV ALUA TION PROCEDURE 4.1. Metric: Global-to-local edits (GLE) We do not have access to human-annotated gold-standard alignments. Instead, we propose a distance measure between aligned text segments and apply reciprocal normalization us- ing a theoretical lower bound on the full text. The distance measure is defined as d(r, h) =d ID(r, h) +abs(|r| − |h|)...

work page
[5]

Although the Power aligner is explicitly optimized for phonetic similarity, our approach achieves higher phoneme-level scores across every dataset and model

RESULTS As shown in Table 3, our method (beam size = 100) con- sistently outperforms all baselines at both the character and phoneme levels. Although the Power aligner is explicitly optimized for phonetic similarity, our approach achieves higher phoneme-level scores across every dataset and model. This indicates that our alignments capture more robust cro...

work page
[6]

The implementation is publicly released to support the community in developing and evaluating speech recognition systems

CONCLUSION We proposed a new alignment algorithm that significantly outperforms conventional methods across models, domains, and languages. The implementation is publicly released to support the community in developing and evaluating speech recognition systems. †All results are significant (p≪0.01, paired approx. permutation test)

work page
[7]

Robust speech recognition via large-scale weak supervision,

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock- man, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning (ICML). PMLR, 2023, pp. 28492–28518

work page 2023
[8]

Less is more: Accu- rate speech recognition & translation without web-scale data,

Krishna C Puvvada, Piotr ˙Zelasko, He Huang, Olek- sii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, et al., “Less is more: Accu- rate speech recognition & translation without web-scale data,” inProceedings of the Interspeech Conference, 2024, pp. 3964–3968

work page 2024
[9]

Granite-speech: open-source speech-aware llms with strong english asr capabilities,

George Saon, Avihu Dekel, Alexander Brooks, Tohru Nagano, Abraham Daniels, Aharon Satt, Ashish Mit- tal, Brian Kingsbury, David Haws, Edmilson Morais, et al., “Granite-speech: open-source speech-aware llms with strong english asr capabilities,”arXiv preprint arXiv:2505.08699, 2025

work page arXiv 2025
[10]

Binary codes capable or cor- recting deletions, insertions, and reversals,

Vladimir Levenshtein, “Binary codes capable or cor- recting deletions, insertions, and reversals,” inSoviet Physics-Doklady, 1966, vol. 10

work page 1966
[11]

The string-to- string correction problem,

Robert A Wagner and Michael J Fischer, “The string-to- string correction problem,”Journal of the ACM (JACM), vol. 21, no. 1, pp. 168–173, 1974

work page 1974
[12]

RapidFuzz: Rapid fuzzy string matching in python and c++ using the lev- enshtein distance,

Max Bachmann and contributors, “RapidFuzz: Rapid fuzzy string matching in python and c++ using the lev- enshtein distance,” 2025, Python and C++, MIT Li- cense

work page 2025
[13]

Phonetically- oriented word error alignment for speech recognition er- ror analysis in speech translation,

Nicholas Ruiz and Marcello Federico, “Phonetically- oriented word error alignment for speech recognition er- ror analysis in speech translation,” inIEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 296–302

work page 2015
[14]

Common voice: A massively-multilingual speech cor- pus,

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber, “Common voice: A massively-multilingual speech cor- pus,” inProceedings of the 12th Conference on Lan- guage Resources and Evaluation (LREC 2020), 2020, pp. 4211–4215

work page 2020
[15]

Ted-lium 3: Twice as much data and corpus repartition for experi- ments on speaker adaptation,

Franc ¸ois Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Est`eve, “Ted-lium 3: Twice as much data and corpus repartition for experi- ments on speaker adaptation,” inSpeech and Computer. 2018, pp. 198–208, Springer International Publishing

work page 2018
[16]

Primock57: A dataset of primary care mock consultations,

Alex Papadopoulos Korfiatis, Francesco Moramarco, Radmila Sarac, and Aleksandar Savkov, “Primock57: A dataset of primary care mock consultations,” inPro- ceedings of the 60th Annual Meeting of the Associa- tion for Computational Linguistics (V olume 2: Short Pa- pers), 2022, pp. 588–598

work page 2022
[17]

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Con- gcong Chen, et al., “Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,”arXiv preprint arXiv:2503.01743, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Fast conformer with linearly scalable attention for efficient speech recognition,

Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Olek- sii Hrinchuk, Krishna Puvvada, Ankur Kumar, Ja- gadeesh Balam, et al., “Fast conformer with linearly scalable attention for efficient speech recognition,” in IEEE Automatic Speech Recognition and Understand- ing Workshop (ASRU). IEEE, 2023, pp. 1–8

work page 2023
[19]

Efficient se- quence transduction by jointly predicting tokens and durations,

Hainan Xu, Fei Jia, Somshubra Majumdar, He Huang, Shinji Watanabe, and Boris Ginsburg, “Efficient se- quence transduction by jointly predicting tokens and durations,” inInternational Conference on Machine Learning (ICML). PMLR, 2023, pp. 38462–38484

work page 2023

[1] [1]

However, evaluation methods have not kept pace

INTRODUCTION Recent advances in neural architectures and large-scale weakly-supervised training have enabled automatic speech recognition (ASR) systems to reach unprecedented accuracy [1, 2, 3]. However, evaluation methods have not kept pace. Theword error rate(WER), based on Levenshtein distance, remains the de facto standard for benchmarking performance...

work page 2051

[2] [2]

A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems

BACKGROUND AND CHALLENGES In speech recognition, alignment refers to mapping morpho- logical units in a gold-standard manual transcript (reference) to those in a model-generated transcript (hypothesis). Consider a single reference–hypothesis pair(r, h). We define a valid alignmentabetween the two strings as a se- quence of index range pairs, such thata n ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

The backtrace defines a subgraph that captures all optimal alignments

METHOD The table commonly used for alignment via dynamic pro- gramming can also be viewed as a directed acyclic graph, where cells correspond to nodes and insertions, deletions, and substitutions or matches are represented by vertical, horizon- tal, and diagonal edges, respectively. The backtrace defines a subgraph that captures all optimal alignments. Se...

work page

[4] [4]

Metric: Global-to-local edits (GLE) We do not have access to human-annotated gold-standard alignments

EV ALUA TION PROCEDURE 4.1. Metric: Global-to-local edits (GLE) We do not have access to human-annotated gold-standard alignments. Instead, we propose a distance measure between aligned text segments and apply reciprocal normalization us- ing a theoretical lower bound on the full text. The distance measure is defined as d(r, h) =d ID(r, h) +abs(|r| − |h|)...

work page

[5] [5]

Although the Power aligner is explicitly optimized for phonetic similarity, our approach achieves higher phoneme-level scores across every dataset and model

RESULTS As shown in Table 3, our method (beam size = 100) con- sistently outperforms all baselines at both the character and phoneme levels. Although the Power aligner is explicitly optimized for phonetic similarity, our approach achieves higher phoneme-level scores across every dataset and model. This indicates that our alignments capture more robust cro...

work page

[6] [6]

The implementation is publicly released to support the community in developing and evaluating speech recognition systems

CONCLUSION We proposed a new alignment algorithm that significantly outperforms conventional methods across models, domains, and languages. The implementation is publicly released to support the community in developing and evaluating speech recognition systems. †All results are significant (p≪0.01, paired approx. permutation test)

work page

[7] [7]

Robust speech recognition via large-scale weak supervision,

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock- man, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning (ICML). PMLR, 2023, pp. 28492–28518

work page 2023

[8] [8]

Less is more: Accu- rate speech recognition & translation without web-scale data,

Krishna C Puvvada, Piotr ˙Zelasko, He Huang, Olek- sii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, et al., “Less is more: Accu- rate speech recognition & translation without web-scale data,” inProceedings of the Interspeech Conference, 2024, pp. 3964–3968

work page 2024

[9] [9]

Granite-speech: open-source speech-aware llms with strong english asr capabilities,

George Saon, Avihu Dekel, Alexander Brooks, Tohru Nagano, Abraham Daniels, Aharon Satt, Ashish Mit- tal, Brian Kingsbury, David Haws, Edmilson Morais, et al., “Granite-speech: open-source speech-aware llms with strong english asr capabilities,”arXiv preprint arXiv:2505.08699, 2025

work page arXiv 2025

[10] [10]

Binary codes capable or cor- recting deletions, insertions, and reversals,

Vladimir Levenshtein, “Binary codes capable or cor- recting deletions, insertions, and reversals,” inSoviet Physics-Doklady, 1966, vol. 10

work page 1966

[11] [11]

The string-to- string correction problem,

Robert A Wagner and Michael J Fischer, “The string-to- string correction problem,”Journal of the ACM (JACM), vol. 21, no. 1, pp. 168–173, 1974

work page 1974

[12] [12]

RapidFuzz: Rapid fuzzy string matching in python and c++ using the lev- enshtein distance,

Max Bachmann and contributors, “RapidFuzz: Rapid fuzzy string matching in python and c++ using the lev- enshtein distance,” 2025, Python and C++, MIT Li- cense

work page 2025

[13] [13]

Phonetically- oriented word error alignment for speech recognition er- ror analysis in speech translation,

Nicholas Ruiz and Marcello Federico, “Phonetically- oriented word error alignment for speech recognition er- ror analysis in speech translation,” inIEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 296–302

work page 2015

[14] [14]

Common voice: A massively-multilingual speech cor- pus,

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber, “Common voice: A massively-multilingual speech cor- pus,” inProceedings of the 12th Conference on Lan- guage Resources and Evaluation (LREC 2020), 2020, pp. 4211–4215

work page 2020

[15] [15]

Ted-lium 3: Twice as much data and corpus repartition for experi- ments on speaker adaptation,

Franc ¸ois Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Est`eve, “Ted-lium 3: Twice as much data and corpus repartition for experi- ments on speaker adaptation,” inSpeech and Computer. 2018, pp. 198–208, Springer International Publishing

work page 2018

[16] [16]

Primock57: A dataset of primary care mock consultations,

Alex Papadopoulos Korfiatis, Francesco Moramarco, Radmila Sarac, and Aleksandar Savkov, “Primock57: A dataset of primary care mock consultations,” inPro- ceedings of the 60th Annual Meeting of the Associa- tion for Computational Linguistics (V olume 2: Short Pa- pers), 2022, pp. 588–598

work page 2022

[17] [17]

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Con- gcong Chen, et al., “Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,”arXiv preprint arXiv:2503.01743, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

Fast conformer with linearly scalable attention for efficient speech recognition,

Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Olek- sii Hrinchuk, Krishna Puvvada, Ankur Kumar, Ja- gadeesh Balam, et al., “Fast conformer with linearly scalable attention for efficient speech recognition,” in IEEE Automatic Speech Recognition and Understand- ing Workshop (ASRU). IEEE, 2023, pp. 1–8

work page 2023

[19] [19]

Efficient se- quence transduction by jointly predicting tokens and durations,

Hainan Xu, Fei Jia, Somshubra Majumdar, He Huang, Shinji Watanabe, and Boris Ginsburg, “Efficient se- quence transduction by jointly predicting tokens and durations,” inInternational Conference on Machine Learning (ICML). PMLR, 2023, pp. 38462–38484

work page 2023