pith. sign in

arxiv: 2606.18466 · v1 · pith:LM2LIKFCnew · submitted 2026-06-16 · 💻 cs.CL

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

Pith reviewed 2026-06-27 00:24 UTC · model grok-4.3

classification 💻 cs.CL
keywords forced alignmentMFAspeech-to-text alignmentphonetic boundariesmodel adaptationcross-language remappingpronunciation modeling
0
0 comments X

The pith

MFA 3.0 reaches state-of-the-art alignment accuracy with mean boundary errors below 15 ms on four benchmark datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that version 3.0 of the Montreal Forced Aligner matches or exceeds other tools on standard tests for aligning speech audio to text transcripts. It reports this performance level across English, Japanese, and Korean data while also showing that model adaptation and cross-language remapping work well for languages outside the original training set. Pronunciation probability modeling and phonological rules add further gains under specific conditions. A sympathetic reader would care because forced alignment supplies the timing labels that many downstream speech and language studies depend on, so lower error rates translate directly into more reliable measurements in those studies.

Core claim

MFA 3.0 achieves state-of-the-art or near state-of-the-art performance across all four benchmark datasets with mean boundary errors below 15 ms. Adaptation and cross-language remapping are effective for languages outside MFA's training distribution, and pronunciation probability modeling and phonological rules provide gains in specific conditions. The paper documents MFA 3.0's developments since version 1.0, including expanded language coverage from larger open-source datasets and harmonized IPA dictionaries.

What carries the argument

The Montreal Forced Aligner 3.0, whose performance is driven by expanded training data, model adaptation, cross-language phone remapping, and optional pronunciation probability modeling.

If this is right

  • Researchers gain a single open tool that delivers near-top accuracy on English, Japanese, and Korean without needing separate aligners for each language.
  • Model adaptation and cross-language remapping extend usable accuracy to languages outside the original training distribution.
  • Pronunciation probability modeling and phonological rules can be added selectively to improve results in variable speech conditions.
  • Mean boundary errors below 15 ms become a new practical target for forced alignment systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reported performance level may reduce the need for manual correction of alignments in large speech corpora.
  • The same adaptation methods could be tested on additional low-resource languages to measure how far the gains generalize.
  • Downstream tasks such as automatic prosody labeling or phonetic analysis may achieve higher reliability once input alignments carry smaller timing errors.

Load-bearing premise

The four benchmark datasets are representative enough to support claims of state-of-the-art status across languages and conditions.

What would settle it

A new benchmark dataset or language in which a competing forced aligner produces lower mean boundary errors than MFA 3.0 when both are evaluated under matched training and testing conditions.

read the original abstract

The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded coverage across more languages and dialects using larger open-source datasets, harmonized IPA dictionaries, model adaptation, cross-language phone remapping, and support utilities. This paper documents MFA 3.0's developments since version 1.0 and evaluates MFA's performance across English, Japanese, and Korean, benchmarked against classic and neural forced aligners. MFA 3.0 achieves state-of-the-art or near state-of-the-art performance across all four benchmark datasets with mean boundary errors below 15 ms. Adaptation and cross-language remapping are effective for languages outside MFA's training distribution, and pronunciation probability modeling and phonological rules provide gains in specific conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper documents developments in the Montreal Forced Aligner (MFA) from v1.0 to v3.0, including expanded language coverage via larger datasets, harmonized IPA dictionaries, model adaptation, cross-language phone remapping, and support utilities. It evaluates MFA 3.0 on four benchmark datasets spanning English, Japanese, and Korean against classic and neural forced aligners, claiming SOTA or near-SOTA performance with mean boundary errors below 15 ms. Additional claims address the effectiveness of adaptation/remapping for out-of-distribution languages and gains from pronunciation probability modeling and phonological rules under specific conditions.

Significance. If the empirical claims hold under transparent and equivalent evaluation conditions, the work would be significant for the speech alignment community: MFA is already the most widely used tool, so an updated performance characterization with practical adaptation methods for additional languages would provide a useful reference point and baseline for 2026-era forced alignment research.

major comments (2)
  1. [Abstract and evaluation description] Abstract and evaluation description: the SOTA/near-SOTA claim with mean boundary error <15 ms across all four datasets rests on unspecified benchmark datasets (only languages named, not the actual corpora or their sizes/splits), unnamed neural aligners and versions from the 2025-2026 literature, and no statement of identical data splits, evaluation protocols, or statistical testing. This directly undermines verification of the central performance claim.
  2. [Results section] Results section (implied by abstract claims): equivalence of comparisons to neural baselines is not demonstrated; without explicit confirmation that all systems were run under matched conditions (same audio, same evaluation metric, same train/test partitions), the reported superiority cannot be assessed as load-bearing evidence.
minor comments (2)
  1. [Abstract] Clarify the fourth dataset (abstract names only three languages).
  2. Provide at least one table or figure summarizing the exact mean boundary errors, standard deviations, and baseline comparisons rather than summary statements alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting issues with the verifiability of our central performance claims. We agree that additional explicit details are needed and will revise the manuscript to address both major comments.

read point-by-point responses
  1. Referee: [Abstract and evaluation description] Abstract and evaluation description: the SOTA/near-SOTA claim with mean boundary error <15 ms across all four datasets rests on unspecified benchmark datasets (only languages named, not the actual corpora or their sizes/splits), unnamed neural aligners and versions from the 2025-2026 literature, and no statement of identical data splits, evaluation protocols, or statistical testing. This directly undermines verification of the central performance claim.

    Authors: We agree this is a substantive gap. The current manuscript names only the languages and does not list the specific corpora, sizes, splits, exact neural aligner versions, or confirm identical protocols and testing. We will add a dedicated evaluation subsection that enumerates the four benchmark datasets with sizes and splits, names the neural baselines with citations and versions, states that all systems used the same boundary-error metric, and reports any statistical testing performed. This revision will make the SOTA claims directly verifiable. revision: yes

  2. Referee: [Results section] Results section (implied by abstract claims): equivalence of comparisons to neural baselines is not demonstrated; without explicit confirmation that all systems were run under matched conditions (same audio, same evaluation metric, same train/test partitions), the reported superiority cannot be assessed as load-bearing evidence.

    Authors: The manuscript does not contain an explicit statement confirming matched conditions across systems. We will revise the results section to add a clear paragraph stating that all aligners (classic and neural) were evaluated on identical audio files using the same mean boundary error metric and the same train/test partitions. If any neural results were taken from published tables rather than re-run, we will note that limitation and its implications for direct comparison. revision: yes

Circularity Check

0 steps flagged

Empirical benchmarking paper contains no derivation chain or self-referential predictions

full rationale

The paper evaluates MFA 3.0 performance via reported boundary errors on four benchmark datasets and compares against other aligners under stated conditions. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. All claims rest on experimental results rather than any reduction to inputs by construction, satisfying the default expectation of no significant circularity for an empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities as this is an empirical report on tool performance rather than a theoretical derivation.

pith-pipeline@v0.9.1-grok · 5688 in / 1138 out tokens · 31548 ms · 2026-06-27T00:24:38.653481+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

79 extracted references · 2 linked inside Pith

  1. [1]

    Many forced aligners have been developed over the past 20 years (e.g., [1, 2, 3, 4]), and the field has a healthy ecosystem of tools for different use cases

    Introduction Forced alignment, the automatic temporal alignment of words and phonemes to a speech recording given its orthographic tran- scription, has become a standard first step in language science research across (socio)phonetics, language documentation, and psycholinguistics. Many forced aligners have been developed over the past 20 years (e.g., [1, ...

  2. [2]

    We briefly cover each to motivate the features of MFA 3.0 described in Sec

    Background Development of MFA from 1.0 to 3.0 has been driven by rapid expansion in theuse casesof forced aligners in scientific re- search and thedata and toolsavailable. We briefly cover each to motivate the features of MFA 3.0 described in Sec. 3. 2.1. Use cases A forced aligner consists minimally of anacoustic modeland pronunciation dictionary; ten ye...

  3. [3]

    Pronunciation probability

    Montreal Forced Aligner 3.0 MFA is an open-source command line utility with prebuilt ex- ecutables for Windows, Mac OSX, and Linux [5]. MFA 3.0 extends version 1.0 in four main areas. First, it leverages the increase in available data to provide an expanded set of pre- trained acoustic models with greater coverage of languages and linguistic/social variat...

  4. [4]

    T” maps to TIMIT “t

    Evaluation 4.1. Datasets Benchmark datasets were created from four corpora with man- ually corrected phone-level boundaries across three languages (Table 3). The two English datasets are TIMIT [63], a corpus of read speech, and the Buckeye Corpus of spontaneous speech [64]. The other two datasets represent Japanese and Korean, which are not typically incl...

  5. [5]

    Word alignment results Table 4 shows word-level alignment accuracy on TIMIT and Buckeye

    Results 5.1. Word alignment results Table 4 shows word-level alignment accuracy on TIMIT and Buckeye. MFA 3.0 substantially outperforms all three neural ASR-based aligners on both datasets, extending the findings of

  6. [6]

    -PP” and “+rules

    to a larger comparison class. The gap is greatest for small thresholds (10, 25 ms) that are most relevant for speech re- search. In comparison to other aligners, MFA 3.0 markedly outperforms MFA 1.0 on both datasets and ranks near the top compared to most other aligners. For Buckeye, MFA ARPA 3.0 and MFA Global 3.0 show the best performance, and all other...

  7. [7]

    for” pronounced as “F AO1 R

    Discussion The goal of this paper was two-fold: 1) to evaluate MFA 3.0 per- formance against MFA 1.0 and other current aligners, and 2) to demonstrate available modeling utilities in MFA 3.0 that feed into end-to-end forced alignment for speech research. MFA 3.0 pretrained models demonstrate substantial improvements on benchmarking datasets compared to MF...

  8. [8]

    MFA 3.0 demonstrates state of the art performance against existing aligners and provides users additional function- ality to tailor MFA to their own data

    Conclusion We have presented key updates to the Montreal Forced Aligner that have improved performance on benchmarks across three languages, along with summarizing new utilities included in MFA 3.0. MFA 3.0 demonstrates state of the art performance against existing aligners and provides users additional function- ality to tailor MFA to their own data. Fut...

  9. [9]

    Acknowledgments We acknowledge funding from SSHRC #430-2014-00018, FRQSC #183356, CFI #32451 and SSHRC CRC program to Morgan Sonderegger; SSHRC #435-2014-1504 and the SSHRC CRC program to Michael Wagner; and NIH #R01DC019645-03 awarded to Katherine C. Hustad

  10. [10]

    Generative AI Use Disclosure No generative AI was used in preparing this manuscript

  11. [11]

    Speaker identification on the SCO- TUS corpus,

    J. Yuan, M. Libermanet al., “Speaker identification on the SCO- TUS corpus,”Journal of the Acoustical Society of America, vol. 123, no. 5, p. 3878, 2008

  12. [12]

    FA VE (Forced Alignment and V owel Extraction) Program Suite,

    I. Rosenfelder, J. Fruehwald, K. Evanini, and J. Yuan, “FA VE (Forced Alignment and V owel Extraction) Program Suite,” 2011, available at http://fave.ling.upenn.edu

  13. [13]

    Signal processing via web services: the use case WebMAUS,

    T. Kisler, F. Schiel, and H. Sloetjes, “Signal processing via web services: the use case WebMAUS,” inDigital Humanities Confer- ence, 2012

  14. [14]

    Prosodylab-Aligner: A tool for forced alignment of laboratory speech,

    K. Gorman, J. Howell, and M. Wagner, “Prosodylab-Aligner: A tool for forced alignment of laboratory speech,”Canadian Acous- tics, vol. 39, no. 3, pp. 192–193, 2011

  15. [15]

    Montreal Forced Aligner: Trainable text-speech align- ment using Kaldi,

    M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Son- deregger, “Montreal Forced Aligner: Trainable text-speech align- ment using Kaldi,” inInterspeech, 2017, pp. 498–502

  16. [16]

    Performance of forced-alignment algorithms on children’s speech,

    T. J. Mahr, V . Berisha, K. Kawabata, J. Liss, and K. C. Hus- tad, “Performance of forced-alignment algorithms on children’s speech,”JSLHR, vol. 64, no. 6S, pp. 2213–2222, 2021

  17. [17]

    Tradition or innovation: A comparison of modern ASR methods for forced alignment,

    R. Rousso, E. Cohen, J. Keshet, and E. Chodroff, “Tradition or innovation: A comparison of modern ASR methods for forced alignment,”arXiv preprint arXiv:2406.19363, 2024

  18. [18]

    The Mason-Alberta Phonetic Segmenter: a forced alignment system based on deep neural networks and interpolation,

    M. C. Kelley, S. J. Perry, and B. V . Tucker, “The Mason-Alberta Phonetic Segmenter: a forced alignment system based on deep neural networks and interpolation,”Phonetica, vol. 81, no. 5, pp. 451–508, 2024

  19. [19]

    The variation in con- versation (ViC) project: Creation of the Buckeye Corpus of Con- versational Speech,

    S. Kiesling, L. Dilley, and W. D. Raymond, “The variation in con- versation (ViC) project: Creation of the Buckeye Corpus of Con- versational Speech,”Language V ariation and Change, pp. 55–97, 2006

  20. [20]

    Using automatic alignment to ana- lyze endangered language data: Testing the viability of untrained alignment,

    C. DiCanio, H. Nam, D. H. Whalen, H. Timothy Bunnell, J. D. Amith, and R. C. Garc ´ıa, “Using automatic alignment to ana- lyze endangered language data: Testing the viability of untrained alignment,”JASA, vol. 134, no. 3, pp. 2235–2246, 2013

  21. [21]

    Forced alignment for understudied language varieties: Testing Prosodylab-Aligner with Tongan data,

    L. M. Johnson, M. Di Paolo, and A. Bell, “Forced alignment for understudied language varieties: Testing Prosodylab-Aligner with Tongan data,”Language Documentation & Conservation, vol. 12, pp. 80–123, 2018

  22. [22]

    A Robin Hood approach to forced align- ment: English-trained algorithms and their use on Australian lan- guages,

    S. Babinski, R. Dockum, J. H. Craft, A. Fergus, D. Golden- berg, and C. Bowern, “A Robin Hood approach to forced align- ment: English-trained algorithms and their use on Australian lan- guages,”LSA, vol. 4, pp. 3–1, 2019

  23. [23]

    The use of phone categories and cross-language modeling for phone align- ment of Panara,

    E. P. Ahn, E. Chodroff, M. Lapierre, and G.-A. Levow, “The use of phone categories and cross-language modeling for phone align- ment of Panara,” inInterspeech, 2024, pp. 1505–1509

  24. [24]

    Multilingual MFA: Forced Align- ment on Low-Resource Related Languages,

    A. Tosolini and C. Bowern, “Multilingual MFA: Forced Align- ment on Low-Resource Related Languages,” inEighth Compu- tEL, 2025, pp. 100–109

  25. [25]

    Assessing the accuracy of existing forced alignment software on varieties of British English,

    L. MacKenzie and D. Turton, “Assessing the accuracy of existing forced alignment software on varieties of British English,”Lin- guistics V anguard, vol. 6, no. s1, p. 20180061, 2020

  26. [26]

    Maximiz- ing accuracy of forced alignment for spontaneous child speech,

    R. Fromont, L. Clark, J. W. Black, and M. Blackwood, “Maximiz- ing accuracy of forced alignment for spontaneous child speech,” Language Development Research, vol. 3, no. 1, 2023

  27. [27]

    Examining fac- tors influencing the viability of automatic acoustic analysis of child speech,

    T. Knowles, M. Clayards, and M. Sonderegger, “Examining fac- tors influencing the viability of automatic acoustic analysis of child speech,”JSLHR, vol. 61, no. 10, pp. 2487–2501, 2018

  28. [28]

    A semi-automatic pipeline for transcribing and segmenting child speech,

    P. Christodoulidou, J. Tanner, J. Stuart-Smith, M. McAuliffe, M. Murali, A. Smith, L. Taylor, J. Cleland, and A. Kuschmann, “A semi-automatic pipeline for transcribing and segmenting child speech,” inInterspeech, 2025

  29. [29]

    Analysis of forced aligner performance on L2 English speech,

    S. Williams, P. Foulkes, and V . Hughes, “Analysis of forced aligner performance on L2 English speech,”Speech Communi- cation, vol. 158, p. 103042, 2024

  30. [30]

    SPPAS: a tool for the phonetic segmentations of speech,

    B. Bigi, “SPPAS: a tool for the phonetic segmentations of speech,” inLREC, 2012, pp. 1748–1755

  31. [31]

    BFA: Real-time Multilingual Text-to-speech Forced Alignment,

    A. Rehman, J. Cai, J.-J. Zhang, and X. Yang, “BFA: Real-time Multilingual Text-to-speech Forced Alignment,”arXiv preprint arXiv:2509.23147, 2025

  32. [32]

    Recent ad- vances in technologies for resource creation and mobilization in language documentation,

    A. L. Berez-Kroeker, S. Gabber, and A. Slayton, “Recent ad- vances in technologies for resource creation and mobilization in language documentation,”Annual Review of Linguistics, vol. 9, no. 1, pp. 195–214, 2023

  33. [33]

    Computational sociophonetics using automatic speech recognition,

    R. Coto-Solano, “Computational sociophonetics using automatic speech recognition,”Language and Linguistics Compass, vol. 16, no. 9, p. e12474, 2022

  34. [34]

    Comparing language- specific and cross-language acoustic models for low-resource phonetic forced alignment,

    E. Chodroff, E. P. Ahn, and H. Dolatian, “Comparing language- specific and cross-language acoustic models for low-resource phonetic forced alignment,”Language Documentation & Conser- vation, vol. 19, pp. 201 – 223, 2025

  35. [35]

    Probabilistic analysis of pronunciation with ’MAUS’,

    F. Schiel and A. Kipp, “Probabilistic analysis of pronunciation with ’MAUS’,”ZAS Papers in Linguistics, vol. 11, pp. 51–60, 1998

  36. [36]

    Investigating /l/ variation in English through forced alignment,

    J. Yuan and M. Y . Liberman, “Investigating /l/ variation in English through forced alignment,” inInterspeech, 2009

  37. [37]

    Acoustic reduction in conversational Dutch: A quantitative anal- ysis based on automatically generated segmental transcriptions,

    B. Schuppler, M. Ernestus, O. Scharenborg, and L. Boves, “Acoustic reduction in conversational Dutch: A quantitative anal- ysis based on automatically generated segmental transcriptions,” Journal of Phonetics, vol. 39, no. 1, pp. 96–109, 2011

  38. [38]

    Large-scale analysis of Spanish/s/- lenition using audiobooks,

    N. Ryant and M. Liberman, “Large-scale analysis of Spanish/s/- lenition using audiobooks,” inMeetings on Acoustics, vol. 28, no. 1, 2016, p. 060005

  39. [39]

    Extracting linguistic knowledge from speech: A study of stop realization in 5 Romance languages,

    Y . Wu, M. Hutin, I. Vasilescu, L. Lamel, and M. Adda-Decker, “Extracting linguistic knowledge from speech: A study of stop realization in 5 Romance languages,” inLREC, 2022, pp. 3257– 3263

  40. [40]

    Common voice: A massively-multilingual speech corpus,

    R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Hen- retty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” inLREC, 2020, pp. 4218–4222

  41. [41]

    MLS: A large-scale multilingual dataset for speech research,

    V . Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “MLS: A large-scale multilingual dataset for speech research,” ArXiv, vol. abs/2012.03411, 2020

  42. [42]

    Massively Multi- lingual Pronunciation Modeling with WikiPron,

    J. L. Lee, L. F. Ashby, M. E. Garza, Y . Lee-Sikka, S. Miller, A. Wong, A. D. McCarthy, and K. Gorman, “Massively Multi- lingual Pronunciation Modeling with WikiPron,” inLREC, 2020, pp. 4223–4228

  43. [43]

    Pynini: A Python library for weighted finite-state grammar compilation,

    K. Gorman, “Pynini: A Python library for weighted finite-state grammar compilation,” inSIGFSM Workshop on Statistical NLP and Weighted Automata, 2016, pp. 75–80

  44. [44]

    Epitran: Precision G2P for many languages,

    D. R. Mortensen, S. Dalmia, and P. Littell, “Epitran: Precision G2P for many languages,” inLREC, 2018

  45. [45]

    The cross- linguistic phonological frequencies (XPF) corpus,

    U. C. Priva, E. Strand, S. Yang, W. Mizgerd, A. Creighton, J. Bai, R. Mathew, A. Shao, J. Schuster, and D. Wiepert, “The cross- linguistic phonological frequencies (XPF) corpus,” 2021

  46. [46]

    LaBB-CAT: An annotation store,

    R. Fromont and J. Hay, “LaBB-CAT: An annotation store,” inAus- tralasian language technology association workshop, 2012, pp. 113–117

  47. [47]

    EMU-SDMS: Ad- vanced speech database management and analysis in R,

    R. Winkelmann, J. Harrington, and K. J ¨ansch, “EMU-SDMS: Ad- vanced speech database management and analysis in R,”Com- puter Speech & Language, vol. 45, pp. 392–410, 2017

  48. [48]

    Pyannote.audio: neural building blocks for speaker diarization,

    H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, “Pyannote.audio: neural building blocks for speaker diarization,” inICASSP. IEEE, 2020, pp. 7124–7128

  49. [49]

    Whisperx: Time- accurate speech transcription of long-form audio,

    M. Bain, J. Huh, T. Han, and A. Zisserman, “Whisperx: Time- accurate speech transcription of long-form audio,”arXiv preprint arXiv:2303.00747, 2023

  50. [50]

    SpeechBrain: A general-purpose speech toolkit,

    M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, and others, “SpeechBrain: A general-purpose speech toolkit,”arXiv preprint arXiv:2106.04624, 2021

  51. [51]

    V oxcommunis: A corpus for cross- linguistic phonetic analysis,

    E. Ahn and E. Chodroff, “V oxcommunis: A corpus for cross- linguistic phonetic analysis,” inLREC, 2022, pp. 5286–5294

  52. [52]

    A pipeline for the large-scale acoustic analysis of streamed content,

    S. Coats, “A pipeline for the large-scale acoustic analysis of streamed content,” inCMC-Corpora 2023, 2023, pp. 51–54

  53. [53]

    The HTK hidden Markov model toolkit: Design and philosophy,

    S. J. Young, “The HTK hidden Markov model toolkit: Design and philosophy,” 1993

  54. [54]

    Julius—an open source real-time large vocabulary recognition engine,

    A. Lee, T. Kawahara, and K. Shikano, “Julius—an open source real-time large vocabulary recognition engine,” 2001

  55. [55]

    The Kaldi Speech Recognition Toolkit,

    D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y . Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi Speech Recognition Toolkit,” inIEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011, iEEE Catalog No.: CFP11SRW-USB

  56. [56]

    NeMo Forced Aligner and its application to word alignment for subtitle genera- tion,

    E. Rastorgueva, V . Lavrukhin, and B. Ginsburg, “NeMo Forced Aligner and its application to word alignment for subtitle genera- tion,” inInterspeech 2023, 2023, pp. 5257–5258

  57. [57]

    Phone-to-audio alignment without text: A semi-supervised approach,

    J. Zhu, C. Zhang, and D. Jurgens, “Phone-to-audio alignment without text: A semi-supervised approach,”ICASSP, 2022

  58. [58]

    Globalphone: A multilin- gual text & speech database in 20 languages,

    T. Schultz, N. T. Vu, and T. Schlippe, “Globalphone: A multilin- gual text & speech database in 20 languages,” inICASSP. IEEE, 2013, pp. 8126–8130

  59. [59]

    Lib- rispeech: an ASR corpus based on public domain audio books,

    V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib- rispeech: an ASR corpus based on public domain audio books,” inICASSP. IEEE, 2015, pp. 5206–5210

  60. [60]

    GlobalPhone: Pronunciation Dictio- naries in 20 Languages

    T. Schultz and T. Schlippe, “GlobalPhone: Pronunciation Dictio- naries in 20 Languages.” inLREC, 2014, pp. 337–341

  61. [61]

    Phonetisaurus: Ex- ploring grapheme-to-phoneme conversion with joint n-gram mod- els in the WFST framework,

    J. R. Novak, N. Minematsu, and K. Hirose, “Phonetisaurus: Ex- ploring grapheme-to-phoneme conversion with joint n-gram mod- els in the WFST framework,”Natural Language Engineering, vol. 22, no. 6, pp. 907–938, 2016

  62. [62]

    Pronunci- ation and silence probability modeling for ASR,

    G. Chen, H. Xu, M. Wu, D. Povey, and S. Khudanpur, “Pronunci- ation and silence probability modeling for ASR,” inInterspeech, 2015, pp. 533–537

  63. [63]

    Multilingual tedx corpus for speech recognition and translation,

    E. Salesky, M. Wiesner, J. Bremerman, R. Cattoni, M. Negri, M. Turchi, D. W. Oard, and M. Post, “Multilingual tedx corpus for speech recognition and translation,” inInterspeech, 2021

  64. [64]

    One size does not fit all: Adapt- ing the Montreal Forced Aligner (MFA) to your data,

    M. McAuliffe and K. Gunter, “One size does not fit all: Adapt- ing the Montreal Forced Aligner (MFA) to your data,” 2025, workshop at the 2025 LSA Summer Institute. Available at https://github.com/mmcauliffe/mfa-adaptation

  65. [65]

    Model cards for model reporting,

    M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, “Model cards for model reporting,” inF AccT, 2019, pp. 220–229

  66. [66]

    Pkuseg: A toolkit for multi-domain chinese word segmentation

    R. Luo, J. Xu, Y . Zhang, X. Ren, and X. Sun, “Pkuseg: A toolkit for multi-domain chinese word segmentation.”CoRR, vol. abs/1906.11455, 2019. [Online]. Available: https://arxiv.org/abs/ 1906.11455

  67. [67]

    Sudachi: A japanese tokenizer for business,

    K. Takaoka, S. Hisamoto, N. Kawahara, M. Sakamoto, Y . Uchida, and Y . Matsumoto, “Sudachi: A japanese tokenizer for business,” inLREC, 2018

  68. [68]

    PyThaiNLP: Thai natural language processing in Python,

    W. Phatthiyaphaibun, K. Chaovavanich, C. Polpanu- mas, A. Suriyawongkul, L. Lowphansirikul, P. Chormai, P. Limkonchotiwat, T. Suntorntip, and C. Udomcharoenchaikit, “PyThaiNLP: Thai natural language processing in Python,” in NLP-OSS, L. Tan, D. Milajevs, G. Chauhan, J. Gwinnup, and E. Rippeth, Eds. Singapore, Singapore: Empirical Methods in Natural Langua...

  69. [69]

    Polyglot and Speech Corpus Tools: A System for Repre- senting, Integrating, and Querying Speech Corpora,

    M. McAuliffe, E. Stengel-Eskin, M. Socolof, and M. Sondereg- ger, “Polyglot and Speech Corpus Tools: A System for Repre- senting, Integrating, and Querying Speech Corpora,” inINTER- SPEECH, 2017, pp. 3887–3891

  70. [70]

    Montreal Forced Aligner: Speech-to-text alignment in 2025,

    M. McAuliffe and K. Gunter, “Montreal Forced Aligner: Speech-to-text alignment in 2025,” 2025, workshop at the 2025 Montreal Open Tools Symposium. Avail- able at https://colab.research.google.com/drive/1kqaSSyx- DEV AxrSmoWhJTNXtEsVI15yf

  71. [71]

    Phonetic forced alignment with the Montreal Forced Aligner,

    E. Chodroff, “Phonetic forced alignment with the Montreal Forced Aligner,” 2021, available at https://www.youtube.com/watch?v=Zhj-ccMDj w and https://eleanorchodroff.com/tutorial/montreal-forced- aligner.html

  72. [72]

    A Gentle Guide to Montreal Forced Aligner,

    C. Xu, “A Gentle Guide to Montreal Forced Aligner,” 2024, avail- able at https://chenzixu.rbind.io/resources/1forcedalignment/fa6/

  73. [73]

    DARPA TIMIT acoustic-phonetic continous speech cor- pus,

    J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continous speech cor- pus,”NASA STI/Recon technical report, vol. 93, p. 27403, 1993

  74. [74]

    Buckeye Corpus of Conversa- tional Speech,

    M. Pitt, L. Dilley, K. Johnson, S. Kiesling, W. Raymond, E. Hume, and E. Fosler-Lussier, “Buckeye Corpus of Conversa- tional Speech,” 2007, available at www.buckeyecorpus.osu.edu

  75. [75]

    Corpus of Spontaneous Japanese: Its design and evaluation,

    K. Maekawa, “Corpus of Spontaneous Japanese: Its design and evaluation,” inISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003

  76. [76]

    The Korean Corpus of Spontaneous Speech,

    W. Yun, K. Yoon, S. Park, J. Lee, S. Cho, D. Kang, K. Byun, H. Hahn, and J. Kim, “The Korean Corpus of Spontaneous Speech,”Journal of the Korean Society of Speech Sciences, vol. 7(2), pp. 103–109, 2015

  77. [77]

    A forced-alignment-based study of declarative sentence-ending,

    T.-J. Yoon and Y . Kang, “A forced-alignment-based study of declarative sentence-ending,” inSpeechProsody 2012, 2012, pp. 559–562

  78. [78]

    wav2vec 2.0: A framework for self-supervised learning of speech repre- sentations,

    A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech repre- sentations,”Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020

  79. [79]

    Scal- ing speech technology to 1,000+ languages,

    V . Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, and others, “Scal- ing speech technology to 1,000+ languages,”Journal of Machine Learning Research, vol. 25, no. 97, pp. 1–52, 2024