Montreal Forced Aligner and the state of speech-to-text alignment in 2026

Kaylynn Gunter; Michael McAuliffe; Michael Wagner; Morgan Sonderegger

arxiv: 2606.18466 · v1 · pith:LM2LIKFCnew · submitted 2026-06-16 · 💻 cs.CL

Michael McAuliffe, Kaylynn Gunter, Michael Wagner, Morgan Sonderegger

Pith reviewed 2026-06-27 00:24 UTC · model grok-4.3

keywords forced alignment · MFA · speech-to-text alignment · phonetic boundaries · model adaptation · cross-language remapping · pronunciation modeling

The pith

MFA 3.0 reaches state-of-the-art alignment accuracy with mean boundary errors below 15 ms on four benchmark datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that version 3.0 of the Montreal Forced Aligner matches or exceeds other tools on standard tests for aligning speech audio to text transcripts. It reports this performance level across English, Japanese, and Korean data while also showing that model adaptation and cross-language remapping work well for languages outside the original training set. Pronunciation probability modeling and phonological rules add further gains under specific conditions. A sympathetic reader would care because forced alignment supplies the timing labels that many downstream speech and language studies depend on, so lower error rates translate directly into more reliable measurements in those studies.

Core claim

MFA 3.0 achieves state-of-the-art or near state-of-the-art performance across all four benchmark datasets with mean boundary errors below 15 ms. Adaptation and cross-language remapping are effective for languages outside MFA's training distribution, and pronunciation probability modeling and phonological rules provide gains in specific conditions. The paper documents MFA 3.0's developments since version 1.0, including expanded language coverage from larger open-source datasets and harmonized IPA dictionaries.
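To make the headline metric concrete, here is a minimal sketch of a mean boundary error calculation. It assumes the metric is the mean absolute difference between paired reference and predicted boundary times; the function name and the toy values are illustrative and are not taken from the paper, whose exact scoring protocol may differ.

```python
from statistics import mean


def mean_boundary_error_ms(reference, predicted):
    """Mean absolute difference between paired boundary times, in ms.

    Both arguments are sequences of boundary times in seconds. The i-th
    predicted boundary is assumed to correspond to the i-th reference one.
    """
    if len(reference) != len(predicted):
        raise ValueError("boundary lists must be the same length")
    if not reference:
        raise ValueError("need at least one boundary")
    return mean(abs(r - p) * 1000.0 for r, p in zip(reference, predicted))


# Toy values for illustration only (not from the paper).
ref = [0.100, 0.250, 0.400]
hyp = [0.110, 0.245, 0.420]
error = mean_boundary_error_ms(ref, hyp)
print(f"mean boundary error: {error:.2f} ms")
```

Under this definition, a system meets the paper's reported level on a dataset when the returned value is below 15.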

What carries the argument

The Montreal Forced Aligner 3.0, whose performance is driven by expanded training data, model adaptation, cross-language phone remapping, and optional pronunciation probability modeling.

If this is right

Researchers gain a single open tool that delivers near-top accuracy on English, Japanese, and Korean without needing separate aligners for each language.
Model adaptation and cross-language remapping extend usable accuracy to languages outside the original training distribution.
Pronunciation probability modeling and phonological rules can be added selectively to improve results in variable speech conditions.
Mean boundary errors below 15 ms become a new practical target for forced alignment systems.
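As an illustration of what cross-language phone remapping involves, the sketch below rewrites a pronunciation through a lookup table so that a target language's phones are expressed in the phone set of an existing acoustic model. This is an assumption about the general idea, not MFA's actual interface; the function name and the mapping entries are hypothetical.

```python
def remap_pronunciation(phones, phone_map):
    """Rewrite each phone through phone_map; fail loudly on unmapped phones."""
    missing = sorted({p for p in phones if p not in phone_map})
    if missing:
        raise KeyError(f"no mapping for phones: {missing}")
    return [phone_map[p] for p in phones]


# Hypothetical mapping from a target language's phones to a source
# model's phones (illustrative values only).
phone_map = {"ʈ": "t", "a": "ɑ", "o": "oʊ"}
remapped = remap_pronunciation(["ʈ", "a", "o"], phone_map)
print(remapped)
```

Raising on unmapped phones, rather than passing them through, keeps gaps in the mapping visible before alignment is attempted.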

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reported performance level may reduce the need for manual correction of alignments in large speech corpora.
The same adaptation methods could be tested on additional low-resource languages to measure how far the gains generalize.
Downstream tasks such as automatic prosody labeling or phonetic analysis may achieve higher reliability once input alignments carry smaller timing errors.

Load-bearing premise

The four benchmark datasets are representative enough to support claims of state-of-the-art status across languages and conditions.

What would settle it

A new benchmark dataset or language in which a competing forced aligner produces lower mean boundary errors than MFA 3.0 when both are evaluated under matched training and testing conditions.

Original abstract

The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded coverage across more languages and dialects using larger open-source datasets, harmonized IPA dictionaries, model adaptation, cross-language phone remapping, and support utilities. This paper documents MFA 3.0's developments since version 1.0 and evaluates MFA's performance across English, Japanese, and Korean, benchmarked against classic and neural forced aligners. MFA 3.0 achieves state-of-the-art or near state-of-the-art performance across all four benchmark datasets with mean boundary errors below 15 ms. Adaptation and cross-language remapping are effective for languages outside MFA's training distribution, and pronunciation probability modeling and phonological rules provide gains in specific conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The MFA 3.0 update documents useful features and reports low error rates, but the SOTA claim depends on details the abstract leaves out.

The main point is that this is a straightforward update paper on the Montreal Forced Aligner. It covers what changed since version 1.0, including expanded language coverage, adaptation methods, cross-language phone remapping, and pronunciation probability modeling, then gives benchmark numbers on English, Japanese, and Korean.

It does the practical job well. MFA is already the default tool for a lot of speech work, so spelling out how adaptation and remapping perform outside the training languages, and when phonological rules add value, gives users something they can actually apply. The reported mean boundary errors below 15 ms on the four datasets are the kind of concrete result that matters for dataset building.

The soft spot is exactly what the stress-test flags: the abstract does not name the four datasets or the specific neural aligners from 2025-2026 that were compared, nor does it confirm that baselines ran under identical splits and protocols. Without those details the state-of-the-art claim is hard to evaluate. If the full paper has clear tables and methods, this is minor; if the comparisons stay underspecified, it becomes the central weakness.

This is for researchers who already use forced alignment and want current numbers on MFA for English or the two Asian languages tested. It is incremental rather than foundational, but the tool's wide use means the benchmarks are worth having in the record.

I would send it to peer review. The work is honest and the topic has a ready audience even if the claims need tighter grounding.

Referee Report

2 major / 2 minor

Summary. The paper documents developments in the Montreal Forced Aligner (MFA) from v1.0 to v3.0, including expanded language coverage via larger datasets, harmonized IPA dictionaries, model adaptation, cross-language phone remapping, and support utilities. It evaluates MFA 3.0 on four benchmark datasets spanning English, Japanese, and Korean against classic and neural forced aligners, claiming SOTA or near-SOTA performance with mean boundary errors below 15 ms. Additional claims address the effectiveness of adaptation/remapping for out-of-distribution languages and gains from pronunciation probability modeling and phonological rules under specific conditions.

Significance. If the empirical claims hold under transparent and equivalent evaluation conditions, the work would be significant for the speech alignment community: MFA is already the most widely used tool, so an updated performance characterization with practical adaptation methods for additional languages would provide a useful reference point and baseline for 2026-era forced alignment research.

major comments (2)

[Abstract and evaluation description] The SOTA/near-SOTA claim with mean boundary error <15 ms across all four datasets rests on unspecified benchmark datasets (only the languages are named, not the corpora or their sizes and splits), unnamed neural aligners and versions from the 2025-2026 literature, and no statement of identical data splits, evaluation protocols, or statistical testing. This directly undermines verification of the central performance claim.
[Results section, implied by abstract claims] Equivalence of comparisons to neural baselines is not demonstrated. Without explicit confirmation that all systems were run under matched conditions (same audio, same evaluation metric, same train/test partitions), the reported superiority cannot be assessed as load-bearing evidence.

minor comments (2)

[Abstract] Clarify the fourth dataset (abstract names only three languages).
Provide at least one table or figure summarizing the exact mean boundary errors, standard deviations, and baseline comparisons rather than summary statements alone.
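A summary of the kind requested could be produced along the following lines. This is a sketch under stated assumptions: the per-boundary errors are toy numbers, the aligner names are placeholders, and the 10 ms and 25 ms tolerances are the thresholds mentioned in the paper's results excerpt; nothing here reproduces the paper's actual figures.

```python
from statistics import mean, stdev


def summarize_errors(errors_ms, thresholds_ms=(10, 25)):
    """Return mean, sample SD, and share of boundaries within each threshold."""
    abs_errors = [abs(e) for e in errors_ms]
    summary = {
        "mean_ms": mean(abs_errors),
        "sd_ms": stdev(abs_errors) if len(abs_errors) > 1 else 0.0,
    }
    for t in thresholds_ms:
        within = sum(e <= t for e in abs_errors)
        summary[f"within_{t}ms"] = within / len(abs_errors)
    return summary


# Toy per-boundary errors in ms for two placeholder systems.
toy = {
    "aligner_a": [4.0, 12.0, 8.0, 30.0],
    "aligner_b": [6.0, 9.0, 22.0, 40.0],
}
table = {name: summarize_errors(errs) for name, errs in toy.items()}
for name, row in table.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Reporting the standard deviation and the within-threshold shares alongside the mean shows whether a low mean hides a long tail of large errors.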

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting issues with the verifiability of our central performance claims. We agree that additional explicit details are needed and will revise the manuscript to address both major comments.

Referee: [Abstract and evaluation description] The SOTA/near-SOTA claim with mean boundary error <15 ms across all four datasets rests on unspecified benchmark datasets (only the languages are named, not the corpora or their sizes and splits), unnamed neural aligners and versions from the 2025-2026 literature, and no statement of identical data splits, evaluation protocols, or statistical testing. This directly undermines verification of the central performance claim.

Authors: We agree this is a substantive gap. The current manuscript names only the languages and does not list the specific corpora, sizes, splits, or exact neural aligner versions, nor does it confirm identical protocols and testing. We will add a dedicated evaluation subsection that enumerates the four benchmark datasets with sizes and splits, names the neural baselines with citations and versions, states that all systems used the same boundary-error metric, and reports any statistical testing performed. This revision will make the SOTA claims directly verifiable. Revision: yes.

Referee: [Results section, implied by abstract claims] Equivalence of comparisons to neural baselines is not demonstrated. Without explicit confirmation that all systems were run under matched conditions (same audio, same evaluation metric, same train/test partitions), the reported superiority cannot be assessed as load-bearing evidence.

Authors: The manuscript does not contain an explicit statement confirming matched conditions across systems. We will revise the results section to add a clear paragraph stating that all aligners (classic and neural) were evaluated on identical audio files using the same mean boundary error metric and the same train/test partitions. If any neural results were taken from published tables rather than re-run, we will note that limitation and its implications for direct comparison. Revision: yes.
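One simple way to enforce the "identical audio files" condition is to restrict every system's scores to the utterances that all systems were actually scored on. The sketch below is illustrative only: the function name, the system names, and the per-utterance numbers are hypothetical and do not describe the authors' evaluation code.

```python
def matched_subset(results):
    """Restrict every system's scores to utterances that all systems scored."""
    if not results:
        return {}
    shared = set.intersection(*(set(scores) for scores in results.values()))
    return {
        name: {utt: scores[utt] for utt in sorted(shared)}
        for name, scores in results.items()
    }


# Hypothetical per-utterance mean boundary errors in ms, keyed by utterance ID.
results = {
    "aligner_a": {"utt1": 9.0, "utt2": 14.0, "utt3": 11.0},
    "aligner_b": {"utt1": 12.0, "utt3": 10.0},
}
matched = matched_subset(results)
print(matched)
```

Numbers copied from published tables cannot be filtered this way, which is why the rebuttal's promise to flag such cases matters for the comparison.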

Circularity Check

0 steps flagged

Empirical benchmarking paper contains no derivation chain or self-referential predictions

The paper evaluates MFA 3.0 performance via reported boundary errors on four benchmark datasets and compares against other aligners under stated conditions. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. All claims rest on experimental results rather than any reduction to inputs by construction, satisfying the default expectation of no significant circularity for an empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities as this is an empirical report on tool performance rather than a theoretical derivation.

pith-pipeline@v0.9.1-grok · 5688 in / 1138 out tokens · 31548 ms · 2026-06-27T00:24:38.653481+00:00 · methodology

Reference graph

Works this paper leans on

79 extracted references · 2 linked inside Pith

[1]

Many forced aligners have been developed over the past 20 years (e.g., [1, 2, 3, 4]), and the field has a healthy ecosystem of tools for different use cases

Introduction. Forced alignment, the automatic temporal alignment of words and phonemes to a speech recording given its orthographic transcription, has become a standard first step in language science research across (socio)phonetics, language documentation, and psycholinguistics. Many forced aligners have been developed over the past 20 years (e.g., [1, ...

2016
[2]

We briefly cover each to motivate the features of MFA 3.0 described in Sec

Background. Development of MFA from 1.0 to 3.0 has been driven by rapid expansion in the use cases of forced aligners in scientific research and the data and tools available. We briefly cover each to motivate the features of MFA 3.0 described in Sec. 3. 2.1. Use cases. A forced aligner consists minimally of an acoustic model and pronunciation dictionary; ten ye...

Pith/arXiv arXiv 2012
[3]

Pronunciation probability

Montreal Forced Aligner 3.0. MFA is an open-source command line utility with prebuilt executables for Windows, Mac OSX, and Linux [5]. MFA 3.0 extends version 1.0 in four main areas. First, it leverages the increase in available data to provide an expanded set of pretrained acoustic models with greater coverage of languages and linguistic/social variat...
[4]

T” maps to TIMIT “t

Evaluation. 4.1. Datasets. Benchmark datasets were created from four corpora with manually corrected phone-level boundaries across three languages (Table 3). The two English datasets are TIMIT [63], a corpus of read speech, and the Buckeye Corpus of spontaneous speech [64]. The other two datasets represent Japanese and Korean, which are not typically incl...
[5]

Word alignment results Table 4 shows word-level alignment accuracy on TIMIT and Buckeye

Results. 5.1. Word alignment results. Table 4 shows word-level alignment accuracy on TIMIT and Buckeye. MFA 3.0 substantially outperforms all three neural ASR-based aligners on both datasets, extending the findings of
[6]

-PP” and “+rules

to a larger comparison class. The gap is greatest for small thresholds (10, 25 ms) that are most relevant for speech research. In comparison to other aligners, MFA 3.0 markedly outperforms MFA 1.0 on both datasets and ranks near the top compared to most other aligners. For Buckeye, MFA ARPA 3.0 and MFA Global 3.0 show the best performance, and all other...

arXiv
[7]

for” pronounced as “F AO1 R

Discussion. The goal of this paper was two-fold: 1) to evaluate MFA 3.0 performance against MFA 1.0 and other current aligners, and 2) to demonstrate available modeling utilities in MFA 3.0 that feed into end-to-end forced alignment for speech research. MFA 3.0 pretrained models demonstrate substantial improvements on benchmarking datasets compared to MF...

[8]

MFA 3.0 demonstrates state of the art performance against existing aligners and provides users additional functionality to tailor MFA to their own data

Conclusion. We have presented key updates to the Montreal Forced Aligner that have improved performance on benchmarks across three languages, along with summarizing new utilities included in MFA 3.0. MFA 3.0 demonstrates state of the art performance against existing aligners and provides users additional functionality to tailor MFA to their own data. Fut...
[9]

Acknowledgments We acknowledge funding from SSHRC #430-2014-00018, FRQSC #183356, CFI #32451 and SSHRC CRC program to Morgan Sonderegger; SSHRC #435-2014-1504 and the SSHRC CRC program to Michael Wagner; and NIH #R01DC019645-03 awarded to Katherine C. Hustad

2014
[10]

Generative AI Use Disclosure No generative AI was used in preparing this manuscript
[11]

Speaker identification on the SCOTUS corpus,

J. Yuan, M. Liberman et al., “Speaker identification on the SCOTUS corpus,” Journal of the Acoustical Society of America, vol. 123, no. 5, p. 3878, 2008

2008
[12]

FAVE (Forced Alignment and Vowel Extraction) Program Suite,

I. Rosenfelder, J. Fruehwald, K. Evanini, and J. Yuan, “FAVE (Forced Alignment and Vowel Extraction) Program Suite,” 2011, available at http://fave.ling.upenn.edu

2011
[13]

Signal processing via web services: the use case WebMAUS,

T. Kisler, F. Schiel, and H. Sloetjes, “Signal processing via web services: the use case WebMAUS,” in Digital Humanities Conference, 2012

2012
[14]

Prosodylab-Aligner: A tool for forced alignment of laboratory speech,

K. Gorman, J. Howell, and M. Wagner, “Prosodylab-Aligner: A tool for forced alignment of laboratory speech,” Canadian Acoustics, vol. 39, no. 3, pp. 192–193, 2011

2011
[15]

Montreal Forced Aligner: Trainable text-speech alignment using Kaldi,

M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal Forced Aligner: Trainable text-speech alignment using Kaldi,” in Interspeech, 2017, pp. 498–502

2017
[16]

Performance of forced-alignment algorithms on children’s speech,

T. J. Mahr, V. Berisha, K. Kawabata, J. Liss, and K. C. Hustad, “Performance of forced-alignment algorithms on children’s speech,” JSLHR, vol. 64, no. 6S, pp. 2213–2222, 2021

2021
[17]

Tradition or innovation: A comparison of modern ASR methods for forced alignment,

R. Rousso, E. Cohen, J. Keshet, and E. Chodroff, “Tradition or innovation: A comparison of modern ASR methods for forced alignment,” arXiv preprint arXiv:2406.19363, 2024

arXiv 2024
[18]

The Mason-Alberta Phonetic Segmenter: a forced alignment system based on deep neural networks and interpolation,

M. C. Kelley, S. J. Perry, and B. V. Tucker, “The Mason-Alberta Phonetic Segmenter: a forced alignment system based on deep neural networks and interpolation,” Phonetica, vol. 81, no. 5, pp. 451–508, 2024

2024
[19]

The variation in conversation (ViC) project: Creation of the Buckeye Corpus of Conversational Speech,

S. Kiesling, L. Dilley, and W. D. Raymond, “The variation in conversation (ViC) project: Creation of the Buckeye Corpus of Conversational Speech,” Language Variation and Change, pp. 55–97, 2006

2006
[20]

Using automatic alignment to analyze endangered language data: Testing the viability of untrained alignment,

C. DiCanio, H. Nam, D. H. Whalen, H. Timothy Bunnell, J. D. Amith, and R. C. García, “Using automatic alignment to analyze endangered language data: Testing the viability of untrained alignment,” JASA, vol. 134, no. 3, pp. 2235–2246, 2013

2013
[21]

Forced alignment for understudied language varieties: Testing Prosodylab-Aligner with Tongan data,

L. M. Johnson, M. Di Paolo, and A. Bell, “Forced alignment for understudied language varieties: Testing Prosodylab-Aligner with Tongan data,” Language Documentation & Conservation, vol. 12, pp. 80–123, 2018

2018
[22]

A Robin Hood approach to forced alignment: English-trained algorithms and their use on Australian languages,

S. Babinski, R. Dockum, J. H. Craft, A. Fergus, D. Goldenberg, and C. Bowern, “A Robin Hood approach to forced alignment: English-trained algorithms and their use on Australian languages,” LSA, vol. 4, pp. 3–1, 2019

2019
[23]

The use of phone categories and cross-language modeling for phone alignment of Panara,

E. P. Ahn, E. Chodroff, M. Lapierre, and G.-A. Levow, “The use of phone categories and cross-language modeling for phone alignment of Panara,” in Interspeech, 2024, pp. 1505–1509

2024
[24]

Multilingual MFA: Forced Alignment on Low-Resource Related Languages,

A. Tosolini and C. Bowern, “Multilingual MFA: Forced Alignment on Low-Resource Related Languages,” in Eighth ComputEL, 2025, pp. 100–109

2025
[25]

Assessing the accuracy of existing forced alignment software on varieties of British English,

L. MacKenzie and D. Turton, “Assessing the accuracy of existing forced alignment software on varieties of British English,” Linguistics Vanguard, vol. 6, no. s1, p. 20180061, 2020

2020
[26]

Maximizing accuracy of forced alignment for spontaneous child speech,

R. Fromont, L. Clark, J. W. Black, and M. Blackwood, “Maximizing accuracy of forced alignment for spontaneous child speech,” Language Development Research, vol. 3, no. 1, 2023

2023
[27]

Examining factors influencing the viability of automatic acoustic analysis of child speech,

T. Knowles, M. Clayards, and M. Sonderegger, “Examining factors influencing the viability of automatic acoustic analysis of child speech,” JSLHR, vol. 61, no. 10, pp. 2487–2501, 2018

2018
[28]

A semi-automatic pipeline for transcribing and segmenting child speech,

P. Christodoulidou, J. Tanner, J. Stuart-Smith, M. McAuliffe, M. Murali, A. Smith, L. Taylor, J. Cleland, and A. Kuschmann, “A semi-automatic pipeline for transcribing and segmenting child speech,” in Interspeech, 2025

2025
[29]

Analysis of forced aligner performance on L2 English speech,

S. Williams, P. Foulkes, and V. Hughes, “Analysis of forced aligner performance on L2 English speech,” Speech Communication, vol. 158, p. 103042, 2024

2024
[30]

SPPAS: a tool for the phonetic segmentations of speech,

B. Bigi, “SPPAS: a tool for the phonetic segmentations of speech,” in LREC, 2012, pp. 1748–1755

2012
[31]

BFA: Real-time Multilingual Text-to-speech Forced Alignment,

A. Rehman, J. Cai, J.-J. Zhang, and X. Yang, “BFA: Real-time Multilingual Text-to-speech Forced Alignment,” arXiv preprint arXiv:2509.23147, 2025

arXiv 2025
[32]

Recent advances in technologies for resource creation and mobilization in language documentation,

A. L. Berez-Kroeker, S. Gabber, and A. Slayton, “Recent advances in technologies for resource creation and mobilization in language documentation,” Annual Review of Linguistics, vol. 9, no. 1, pp. 195–214, 2023

2023
[33]

Computational sociophonetics using automatic speech recognition,

R. Coto-Solano, “Computational sociophonetics using automatic speech recognition,” Language and Linguistics Compass, vol. 16, no. 9, p. e12474, 2022

2022
[34]

Comparing language-specific and cross-language acoustic models for low-resource phonetic forced alignment,

E. Chodroff, E. P. Ahn, and H. Dolatian, “Comparing language-specific and cross-language acoustic models for low-resource phonetic forced alignment,” Language Documentation & Conservation, vol. 19, pp. 201–223, 2025

2025
[35]

Probabilistic analysis of pronunciation with ’MAUS’,

F. Schiel and A. Kipp, “Probabilistic analysis of pronunciation with ’MAUS’,” ZAS Papers in Linguistics, vol. 11, pp. 51–60, 1998

1998
[36]

Investigating /l/ variation in English through forced alignment,

J. Yuan and M. Y. Liberman, “Investigating /l/ variation in English through forced alignment,” in Interspeech, 2009

2009
[37]

Acoustic reduction in conversational Dutch: A quantitative analysis based on automatically generated segmental transcriptions,

B. Schuppler, M. Ernestus, O. Scharenborg, and L. Boves, “Acoustic reduction in conversational Dutch: A quantitative analysis based on automatically generated segmental transcriptions,” Journal of Phonetics, vol. 39, no. 1, pp. 96–109, 2011

2011
[38]

Large-scale analysis of Spanish /s/-lenition using audiobooks,

N. Ryant and M. Liberman, “Large-scale analysis of Spanish /s/-lenition using audiobooks,” in Meetings on Acoustics, vol. 28, no. 1, 2016, p. 060005

2016
[39]

Extracting linguistic knowledge from speech: A study of stop realization in 5 Romance languages,

Y. Wu, M. Hutin, I. Vasilescu, L. Lamel, and M. Adda-Decker, “Extracting linguistic knowledge from speech: A study of stop realization in 5 Romance languages,” in LREC, 2022, pp. 3257–3263

2022
[40]

Common voice: A massively-multilingual speech corpus,

R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in LREC, 2020, pp. 4218–4222

2020
[41]

MLS: A large-scale multilingual dataset for speech research,

V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “MLS: A large-scale multilingual dataset for speech research,” ArXiv, vol. abs/2012.03411, 2020

Pith/arXiv arXiv 2012
[42]

Massively Multilingual Pronunciation Modeling with WikiPron,

J. L. Lee, L. F. Ashby, M. E. Garza, Y. Lee-Sikka, S. Miller, A. Wong, A. D. McCarthy, and K. Gorman, “Massively Multilingual Pronunciation Modeling with WikiPron,” in LREC, 2020, pp. 4223–4228

2020
[43]

Pynini: A Python library for weighted finite-state grammar compilation,

K. Gorman, “Pynini: A Python library for weighted finite-state grammar compilation,” in SIGFSM Workshop on Statistical NLP and Weighted Automata, 2016, pp. 75–80

2016
[44]

Epitran: Precision G2P for many languages,

D. R. Mortensen, S. Dalmia, and P. Littell, “Epitran: Precision G2P for many languages,” in LREC, 2018

2018
[45]

The cross-linguistic phonological frequencies (XPF) corpus,

U. C. Priva, E. Strand, S. Yang, W. Mizgerd, A. Creighton, J. Bai, R. Mathew, A. Shao, J. Schuster, and D. Wiepert, “The cross-linguistic phonological frequencies (XPF) corpus,” 2021

2021
[46]

LaBB-CAT: An annotation store,

R. Fromont and J. Hay, “LaBB-CAT: An annotation store,” in Australasian language technology association workshop, 2012, pp. 113–117

2012
[47]

EMU-SDMS: Advanced speech database management and analysis in R,

R. Winkelmann, J. Harrington, and K. Jänsch, “EMU-SDMS: Advanced speech database management and analysis in R,” Computer Speech & Language, vol. 45, pp. 392–410, 2017

2017
[48]

Pyannote.audio: neural building blocks for speaker diarization,

H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, “Pyannote.audio: neural building blocks for speaker diarization,” in ICASSP. IEEE, 2020, pp. 7124–7128

2020
[49]

Whisperx: Time-accurate speech transcription of long-form audio,

M. Bain, J. Huh, T. Han, and A. Zisserman, “Whisperx: Time-accurate speech transcription of long-form audio,” arXiv preprint arXiv:2303.00747, 2023

arXiv 2023
[50]

SpeechBrain: A general-purpose speech toolkit,

M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, and others, “SpeechBrain: A general-purpose speech toolkit,” arXiv preprint arXiv:2106.04624, 2021

arXiv 2021
[51]

Voxcommunis: A corpus for cross-linguistic phonetic analysis,

E. Ahn and E. Chodroff, “Voxcommunis: A corpus for cross-linguistic phonetic analysis,” in LREC, 2022, pp. 5286–5294

2022
[52]

A pipeline for the large-scale acoustic analysis of streamed content,

S. Coats, “A pipeline for the large-scale acoustic analysis of streamed content,” in CMC-Corpora 2023, 2023, pp. 51–54

2023
[53]

The HTK hidden Markov model toolkit: Design and philosophy,

S. J. Young, “The HTK hidden Markov model toolkit: Design and philosophy,” 1993

1993
[54]

Julius—an open source real-time large vocabulary recognition engine,

A. Lee, T. Kawahara, and K. Shikano, “Julius—an open source real-time large vocabulary recognition engine,” 2001

2001
[55]

The Kaldi Speech Recognition Toolkit,

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi Speech Recognition Toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011, IEEE Catalog No.: CFP11SRW-USB

2011
[56]

NeMo Forced Aligner and its application to word alignment for subtitle generation,

E. Rastorgueva, V. Lavrukhin, and B. Ginsburg, “NeMo Forced Aligner and its application to word alignment for subtitle generation,” in Interspeech 2023, 2023, pp. 5257–5258

2023
[57]

Phone-to-audio alignment without text: A semi-supervised approach,

J. Zhu, C. Zhang, and D. Jurgens, “Phone-to-audio alignment without text: A semi-supervised approach,” ICASSP, 2022

2022
[58]

Globalphone: A multilingual text & speech database in 20 languages,

T. Schultz, N. T. Vu, and T. Schlippe, “Globalphone: A multilingual text & speech database in 20 languages,” in ICASSP. IEEE, 2013, pp. 8126–8130

2013
[59]

Librispeech: an ASR corpus based on public domain audio books,

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in ICASSP. IEEE, 2015, pp. 5206–5210

2015
[60]

GlobalPhone: Pronunciation Dictionaries in 20 Languages

T. Schultz and T. Schlippe, “GlobalPhone: Pronunciation Dictionaries in 20 Languages.” in LREC, 2014, pp. 337–341

2014
[61]

Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework,

J. R. Novak, N. Minematsu, and K. Hirose, “Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework,” Natural Language Engineering, vol. 22, no. 6, pp. 907–938, 2016

2016
[62]

Pronunciation and silence probability modeling for ASR,

G. Chen, H. Xu, M. Wu, D. Povey, and S. Khudanpur, “Pronunciation and silence probability modeling for ASR,” in Interspeech, 2015, pp. 533–537

2015
[63]

Multilingual tedx corpus for speech recognition and translation,

E. Salesky, M. Wiesner, J. Bremerman, R. Cattoni, M. Negri, M. Turchi, D. W. Oard, and M. Post, “Multilingual tedx corpus for speech recognition and translation,” in Interspeech, 2021

2021
[64]

One size does not fit all: Adapting the Montreal Forced Aligner (MFA) to your data,

M. McAuliffe and K. Gunter, “One size does not fit all: Adapting the Montreal Forced Aligner (MFA) to your data,” 2025, workshop at the 2025 LSA Summer Institute. Available at https://github.com/mmcauliffe/mfa-adaptation

2025
[65]

Model cards for model reporting,

M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, “Model cards for model reporting,” in FAccT, 2019, pp. 220–229

2019
[66]

Pkuseg: A toolkit for multi-domain chinese word segmentation

R. Luo, J. Xu, Y. Zhang, X. Ren, and X. Sun, “Pkuseg: A toolkit for multi-domain chinese word segmentation.” CoRR, vol. abs/1906.11455, 2019. [Online]. Available: https://arxiv.org/abs/1906.11455

arXiv 1906
[67]

Sudachi: A japanese tokenizer for business,

K. Takaoka, S. Hisamoto, N. Kawahara, M. Sakamoto, Y. Uchida, and Y. Matsumoto, “Sudachi: A japanese tokenizer for business,” in LREC, 2018

2018
[68]

PyThaiNLP: Thai natural language processing in Python,

W. Phatthiyaphaibun, K. Chaovavanich, C. Polpanumas, A. Suriyawongkul, L. Lowphansirikul, P. Chormai, P. Limkonchotiwat, T. Suntorntip, and C. Udomcharoenchaikit, “PyThaiNLP: Thai natural language processing in Python,” in NLP-OSS, L. Tan, D. Milajevs, G. Chauhan, J. Gwinnup, and E. Rippeth, Eds. Singapore, Singapore: Empirical Methods in Natural Langua...

2023
[69]

Polyglot and Speech Corpus Tools: A System for Representing, Integrating, and Querying Speech Corpora,

M. McAuliffe, E. Stengel-Eskin, M. Socolof, and M. Sonderegger, “Polyglot and Speech Corpus Tools: A System for Representing, Integrating, and Querying Speech Corpora,” in INTERSPEECH, 2017, pp. 3887–3891

2017
[70]

Montreal Forced Aligner: Speech-to-text alignment in 2025,

M. McAuliffe and K. Gunter, “Montreal Forced Aligner: Speech-to-text alignment in 2025,” 2025, workshop at the 2025 Montreal Open Tools Symposium. Available at https://colab.research.google.com/drive/1kqaSSyx- DEV AxrSmoWhJTNXtEsVI15yf

2025
[71]

Phonetic forced alignment with the Montreal Forced Aligner,

E. Chodroff, “Phonetic forced alignment with the Montreal Forced Aligner,” 2021, available at https://www.youtube.com/watch?v=Zhj-ccMDj w and https://eleanorchodroff.com/tutorial/montreal-forced-aligner.html

2021
[72]

A Gentle Guide to Montreal Forced Aligner,

C. Xu, “A Gentle Guide to Montreal Forced Aligner,” 2024, available at https://chenzixu.rbind.io/resources/1forcedalignment/fa6/

2024
[73]

DARPA TIMIT acoustic-phonetic continuous speech corpus,

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus,” NASA STI/Recon technical report, vol. 93, p. 27403, 1993

1993
[74]

Buckeye Corpus of Conversational Speech,

M. Pitt, L. Dilley, K. Johnson, S. Kiesling, W. Raymond, E. Hume, and E. Fosler-Lussier, “Buckeye Corpus of Conversational Speech,” 2007, available at www.buckeyecorpus.osu.edu

2007
[75]

Corpus of Spontaneous Japanese: Its design and evaluation,

K. Maekawa, “Corpus of Spontaneous Japanese: Its design and evaluation,” in ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003

2003
[76]

The Korean Corpus of Spontaneous Speech,

W. Yun, K. Yoon, S. Park, J. Lee, S. Cho, D. Kang, K. Byun, H. Hahn, and J. Kim, “The Korean Corpus of Spontaneous Speech,” Journal of the Korean Society of Speech Sciences, vol. 7(2), pp. 103–109, 2015

2015
[77]

A forced-alignment-based study of declarative sentence-ending,

T.-J. Yoon and Y. Kang, “A forced-alignment-based study of declarative sentence-ending,” in SpeechProsody 2012, 2012, pp. 559–562

2012
[78]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12449–12460, 2020

2020
[79]

Scaling speech technology to 1,000+ languages,

V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, and others, “Scaling speech technology to 1,000+ languages,” Journal of Machine Learning Research, vol. 25, no. 97, pp. 1–52, 2024

2024

[1] [1]

Many forced aligners have been developed over the past 20 years (e.g., [1, 2, 3, 4]), and the field has a healthy ecosystem of tools for different use cases

Introduction. Forced alignment, the automatic temporal alignment of words and phonemes to a speech recording given its orthographic transcription, has become a standard first step in language science research across (socio)phonetics, language documentation, and psycholinguistics. Many forced aligners have been developed over the past 20 years (e.g., [1, ...


[2] [2]

We briefly cover each to motivate the features of MFA 3.0 described in Sec

Background. Development of MFA from 1.0 to 3.0 has been driven by rapid expansion in the use cases of forced aligners in scientific research and the data and tools available. We briefly cover each to motivate the features of MFA 3.0 described in Sec. 3. 2.1. Use cases. A forced aligner consists minimally of an acoustic model and pronunciation dictionary; ten ye...


[3] [3]

Pronunciation probability

Montreal Forced Aligner 3.0. MFA is an open-source command line utility with prebuilt executables for Windows, Mac OSX, and Linux [5]. MFA 3.0 extends version 1.0 in four main areas. First, it leverages the increase in available data to provide an expanded set of pretrained acoustic models with greater coverage of languages and linguistic/social variat...

[4] [4]

T” maps to TIMIT “t

Evaluation. 4.1. Datasets. Benchmark datasets were created from four corpora with manually corrected phone-level boundaries across three languages (Table 3). The two English datasets are TIMIT [63], a corpus of read speech, and the Buckeye Corpus of spontaneous speech [64]. The other two datasets represent Japanese and Korean, which are not typically incl...

[5] [5]

Word alignment results Table 4 shows word-level alignment accuracy on TIMIT and Buckeye

Results. 5.1. Word alignment results. Table 4 shows word-level alignment accuracy on TIMIT and Buckeye. MFA 3.0 substantially outperforms all three neural ASR-based aligners on both datasets, extending the findings of

[6] [6]

-PP” and “+rules

to a larger comparison class. The gap is greatest for small thresholds (10, 25 ms) that are most relevant for speech research. In comparison to other aligners, MFA 3.0 markedly outperforms MFA 1.0 on both datasets and ranks near the top compared to most other aligners. For Buckeye, MFA ARPA 3.0 and MFA Global 3.0 show the best performance, and all other...


[7] [7]

for” pronounced as “F AO1 R

Discussion. The goal of this paper was two-fold: 1) to evaluate MFA 3.0 performance against MFA 1.0 and other current aligners, and 2) to demonstrate available modeling utilities in MFA 3.0 that feed into end-to-end forced alignment for speech research. MFA 3.0 pretrained models demonstrate substantial improvements on benchmarking datasets compared to MF...


[8] [8]

MFA 3.0 demonstrates state-of-the-art performance against existing aligners and provides users additional functionality to tailor MFA to their own data

Conclusion. We have presented key updates to the Montreal Forced Aligner that have improved performance on benchmarks across three languages, along with summarizing new utilities included in MFA 3.0. MFA 3.0 demonstrates state-of-the-art performance against existing aligners and provides users additional functionality to tailor MFA to their own data. Fut...

[9] [9]

Acknowledgments. We acknowledge funding from SSHRC #430-2014-00018, FRQSC #183356, CFI #32451 and SSHRC CRC program to Morgan Sonderegger; SSHRC #435-2014-1504 and the SSHRC CRC program to Michael Wagner; and NIH #R01DC019645-03 awarded to Katherine C. Hustad


[10] [10]

Generative AI Use Disclosure. No generative AI was used in preparing this manuscript

[11] [11]

Speaker identification on the SCOTUS corpus,

J. Yuan, M. Liberman et al., “Speaker identification on the SCOTUS corpus,” Journal of the Acoustical Society of America, vol. 123, no. 5, p. 3878, 2008

2008

[12] [12]

FAVE (Forced Alignment and Vowel Extraction) Program Suite,

I. Rosenfelder, J. Fruehwald, K. Evanini, and J. Yuan, “FAVE (Forced Alignment and Vowel Extraction) Program Suite,” 2011, available at http://fave.ling.upenn.edu

2011

[13] [13]

Signal processing via web services: the use case WebMAUS,

T. Kisler, F. Schiel, and H. Sloetjes, “Signal processing via web services: the use case WebMAUS,” in Digital Humanities Conference, 2012

2012

[14] [14]

Prosodylab-Aligner: A tool for forced alignment of laboratory speech,

K. Gorman, J. Howell, and M. Wagner, “Prosodylab-Aligner: A tool for forced alignment of laboratory speech,” Canadian Acoustics, vol. 39, no. 3, pp. 192–193, 2011

2011

[15] [15]

Montreal Forced Aligner: Trainable text-speech alignment using Kaldi,

M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal Forced Aligner: Trainable text-speech alignment using Kaldi,” in Interspeech, 2017, pp. 498–502

2017

[16] [16]

Performance of forced-alignment algorithms on children’s speech,

T. J. Mahr, V. Berisha, K. Kawabata, J. Liss, and K. C. Hustad, “Performance of forced-alignment algorithms on children’s speech,” JSLHR, vol. 64, no. 6S, pp. 2213–2222, 2021

2021

[17] [17]

Tradition or innovation: A comparison of modern ASR methods for forced alignment,

R. Rousso, E. Cohen, J. Keshet, and E. Chodroff, “Tradition or innovation: A comparison of modern ASR methods for forced alignment,” arXiv preprint arXiv:2406.19363, 2024

arXiv 2024

[18] [18]

The Mason-Alberta Phonetic Segmenter: a forced alignment system based on deep neural networks and interpolation,

M. C. Kelley, S. J. Perry, and B. V. Tucker, “The Mason-Alberta Phonetic Segmenter: a forced alignment system based on deep neural networks and interpolation,” Phonetica, vol. 81, no. 5, pp. 451–508, 2024

2024

[19] [19]

The variation in conversation (ViC) project: Creation of the Buckeye Corpus of Conversational Speech,

S. Kiesling, L. Dilley, and W. D. Raymond, “The variation in conversation (ViC) project: Creation of the Buckeye Corpus of Conversational Speech,” Language Variation and Change, pp. 55–97, 2006

2006

[20] [20]

Using automatic alignment to analyze endangered language data: Testing the viability of untrained alignment,

C. DiCanio, H. Nam, D. H. Whalen, H. Timothy Bunnell, J. D. Amith, and R. C. García, “Using automatic alignment to analyze endangered language data: Testing the viability of untrained alignment,” JASA, vol. 134, no. 3, pp. 2235–2246, 2013

2013

[21] [21]

Forced alignment for understudied language varieties: Testing Prosodylab-Aligner with Tongan data,

L. M. Johnson, M. Di Paolo, and A. Bell, “Forced alignment for understudied language varieties: Testing Prosodylab-Aligner with Tongan data,” Language Documentation & Conservation, vol. 12, pp. 80–123, 2018

2018

[22] [22]

A Robin Hood approach to forced alignment: English-trained algorithms and their use on Australian languages,

S. Babinski, R. Dockum, J. H. Craft, A. Fergus, D. Goldenberg, and C. Bowern, “A Robin Hood approach to forced alignment: English-trained algorithms and their use on Australian languages,” LSA, vol. 4, pp. 3–1, 2019

2019

[23] [23]

The use of phone categories and cross-language modeling for phone alignment of Panara,

E. P. Ahn, E. Chodroff, M. Lapierre, and G.-A. Levow, “The use of phone categories and cross-language modeling for phone alignment of Panara,” in Interspeech, 2024, pp. 1505–1509

2024

[24] [24]

Multilingual MFA: Forced Alignment on Low-Resource Related Languages,

A. Tosolini and C. Bowern, “Multilingual MFA: Forced Alignment on Low-Resource Related Languages,” in Eighth ComputEL, 2025, pp. 100–109

2025

[25] [25]

Assessing the accuracy of existing forced alignment software on varieties of British English,

L. MacKenzie and D. Turton, “Assessing the accuracy of existing forced alignment software on varieties of British English,” Linguistics Vanguard, vol. 6, no. s1, p. 20180061, 2020

2020

[26] [26]

Maximizing accuracy of forced alignment for spontaneous child speech,

R. Fromont, L. Clark, J. W. Black, and M. Blackwood, “Maximizing accuracy of forced alignment for spontaneous child speech,” Language Development Research, vol. 3, no. 1, 2023

2023

[27] [27]

Examining factors influencing the viability of automatic acoustic analysis of child speech,

T. Knowles, M. Clayards, and M. Sonderegger, “Examining factors influencing the viability of automatic acoustic analysis of child speech,” JSLHR, vol. 61, no. 10, pp. 2487–2501, 2018

2018

[28] [28]

A semi-automatic pipeline for transcribing and segmenting child speech,

P. Christodoulidou, J. Tanner, J. Stuart-Smith, M. McAuliffe, M. Murali, A. Smith, L. Taylor, J. Cleland, and A. Kuschmann, “A semi-automatic pipeline for transcribing and segmenting child speech,” in Interspeech, 2025

2025

[29] [29]

Analysis of forced aligner performance on L2 English speech,

S. Williams, P. Foulkes, and V. Hughes, “Analysis of forced aligner performance on L2 English speech,” Speech Communication, vol. 158, p. 103042, 2024

2024

[30] [30]

SPPAS: a tool for the phonetic segmentations of speech,

B. Bigi, “SPPAS: a tool for the phonetic segmentations of speech,” in LREC, 2012, pp. 1748–1755

2012

[31] [31]

BFA: Real-time Multilingual Text-to-speech Forced Alignment,

A. Rehman, J. Cai, J.-J. Zhang, and X. Yang, “BFA: Real-time Multilingual Text-to-speech Forced Alignment,” arXiv preprint arXiv:2509.23147, 2025

arXiv 2025

[32] [32]

Recent advances in technologies for resource creation and mobilization in language documentation,

A. L. Berez-Kroeker, S. Gabber, and A. Slayton, “Recent advances in technologies for resource creation and mobilization in language documentation,” Annual Review of Linguistics, vol. 9, no. 1, pp. 195–214, 2023

2023

[33] [33]

Computational sociophonetics using automatic speech recognition,

R. Coto-Solano, “Computational sociophonetics using automatic speech recognition,” Language and Linguistics Compass, vol. 16, no. 9, p. e12474, 2022

2022

[34] [34]

Comparing language-specific and cross-language acoustic models for low-resource phonetic forced alignment,

E. Chodroff, E. P. Ahn, and H. Dolatian, “Comparing language-specific and cross-language acoustic models for low-resource phonetic forced alignment,” Language Documentation & Conservation, vol. 19, pp. 201–223, 2025

2025

[35] [35]

Probabilistic analysis of pronunciation with ’MAUS’,

F. Schiel and A. Kipp, “Probabilistic analysis of pronunciation with ’MAUS’,” ZAS Papers in Linguistics, vol. 11, pp. 51–60, 1998

1998

[36] [36]

Investigating /l/ variation in English through forced alignment,

J. Yuan and M. Y. Liberman, “Investigating /l/ variation in English through forced alignment,” in Interspeech, 2009

2009

[37] [37]

Acoustic reduction in conversational Dutch: A quantitative analysis based on automatically generated segmental transcriptions,

B. Schuppler, M. Ernestus, O. Scharenborg, and L. Boves, “Acoustic reduction in conversational Dutch: A quantitative analysis based on automatically generated segmental transcriptions,” Journal of Phonetics, vol. 39, no. 1, pp. 96–109, 2011

2011

[38] [38]

Large-scale analysis of Spanish /s/-lenition using audiobooks,

N. Ryant and M. Liberman, “Large-scale analysis of Spanish /s/-lenition using audiobooks,” in Meetings on Acoustics, vol. 28, no. 1, 2016, p. 060005

2016

[39] [39]

Extracting linguistic knowledge from speech: A study of stop realization in 5 Romance languages,

Y. Wu, M. Hutin, I. Vasilescu, L. Lamel, and M. Adda-Decker, “Extracting linguistic knowledge from speech: A study of stop realization in 5 Romance languages,” in LREC, 2022, pp. 3257–3263

2022

[40] [40]

Common voice: A massively-multilingual speech corpus,

R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in LREC, 2020, pp. 4218–4222

2020

[41] [41]

MLS: A large-scale multilingual dataset for speech research,

V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “MLS: A large-scale multilingual dataset for speech research,” ArXiv, vol. abs/2012.03411, 2020

Pith/arXiv arXiv 2012

[42] [42]

Massively Multilingual Pronunciation Modeling with WikiPron,

J. L. Lee, L. F. Ashby, M. E. Garza, Y. Lee-Sikka, S. Miller, A. Wong, A. D. McCarthy, and K. Gorman, “Massively Multilingual Pronunciation Modeling with WikiPron,” in LREC, 2020, pp. 4223–4228

2020

[43] [43]

Pynini: A Python library for weighted finite-state grammar compilation,

K. Gorman, “Pynini: A Python library for weighted finite-state grammar compilation,” in SIGFSM Workshop on Statistical NLP and Weighted Automata, 2016, pp. 75–80

2016

[44] [44]

Epitran: Precision G2P for many languages,

D. R. Mortensen, S. Dalmia, and P. Littell, “Epitran: Precision G2P for many languages,” in LREC, 2018

2018

[45] [45]

The cross-linguistic phonological frequencies (XPF) corpus,

U. C. Priva, E. Strand, S. Yang, W. Mizgerd, A. Creighton, J. Bai, R. Mathew, A. Shao, J. Schuster, and D. Wiepert, “The cross-linguistic phonological frequencies (XPF) corpus,” 2021

2021

[46] [46]

LaBB-CAT: An annotation store,

R. Fromont and J. Hay, “LaBB-CAT: An annotation store,” in Australasian language technology association workshop, 2012, pp. 113–117

2012

[47] [47]

EMU-SDMS: Advanced speech database management and analysis in R,

R. Winkelmann, J. Harrington, and K. Jänsch, “EMU-SDMS: Advanced speech database management and analysis in R,” Computer Speech & Language, vol. 45, pp. 392–410, 2017

2017

[48] [48]

Pyannote.audio: neural building blocks for speaker diarization,

H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, “Pyannote.audio: neural building blocks for speaker diarization,” in ICASSP. IEEE, 2020, pp. 7124–7128

2020

[49] [49]

Whisperx: Time-accurate speech transcription of long-form audio,

M. Bain, J. Huh, T. Han, and A. Zisserman, “Whisperx: Time-accurate speech transcription of long-form audio,” arXiv preprint arXiv:2303.00747, 2023

arXiv 2023

[50] [50]

SpeechBrain: A general-purpose speech toolkit,

M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, and others, “SpeechBrain: A general-purpose speech toolkit,” arXiv preprint arXiv:2106.04624, 2021

arXiv 2021

[51] [51]

Voxcommunis: A corpus for cross-linguistic phonetic analysis,

E. Ahn and E. Chodroff, “Voxcommunis: A corpus for cross-linguistic phonetic analysis,” in LREC, 2022, pp. 5286–5294

2022

[52] [52]

A pipeline for the large-scale acoustic analysis of streamed content,

S. Coats, “A pipeline for the large-scale acoustic analysis of streamed content,” in CMC-Corpora 2023, 2023, pp. 51–54

2023

[53] [53]

The HTK hidden Markov model toolkit: Design and philosophy,

S. J. Young, “The HTK hidden Markov model toolkit: Design and philosophy,” 1993

1993

[54] [54]

Julius—an open source real-time large vocabulary recognition engine,

A. Lee, T. Kawahara, and K. Shikano, “Julius—an open source real-time large vocabulary recognition engine,” 2001

2001

[55] [55]

The Kaldi Speech Recognition Toolkit,

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi Speech Recognition Toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011, IEEE Catalog No.: CFP11SRW-USB

2011

[56] [56]

NeMo Forced Aligner and its application to word alignment for subtitle generation,

E. Rastorgueva, V. Lavrukhin, and B. Ginsburg, “NeMo Forced Aligner and its application to word alignment for subtitle generation,” in Interspeech 2023, 2023, pp. 5257–5258

2023

[57] [57]

Phone-to-audio alignment without text: A semi-supervised approach,

J. Zhu, C. Zhang, and D. Jurgens, “Phone-to-audio alignment without text: A semi-supervised approach,” ICASSP, 2022

2022

[58] [58]

Globalphone: A multilingual text & speech database in 20 languages,

T. Schultz, N. T. Vu, and T. Schlippe, “Globalphone: A multilingual text & speech database in 20 languages,” in ICASSP. IEEE, 2013, pp. 8126–8130

2013

[59] [59]

Librispeech: an ASR corpus based on public domain audio books,

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in ICASSP. IEEE, 2015, pp. 5206–5210

2015

[60] [60]

GlobalPhone: Pronunciation Dictionaries in 20 Languages

T. Schultz and T. Schlippe, “GlobalPhone: Pronunciation Dictionaries in 20 Languages,” in LREC, 2014, pp. 337–341

2014

[61] [61]

Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework,

J. R. Novak, N. Minematsu, and K. Hirose, “Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework,” Natural Language Engineering, vol. 22, no. 6, pp. 907–938, 2016

2016

[62] [62]

Pronunciation and silence probability modeling for ASR,

G. Chen, H. Xu, M. Wu, D. Povey, and S. Khudanpur, “Pronunciation and silence probability modeling for ASR,” in Interspeech, 2015, pp. 533–537

2015

[63] [63]

Multilingual tedx corpus for speech recognition and translation,

E. Salesky, M. Wiesner, J. Bremerman, R. Cattoni, M. Negri, M. Turchi, D. W. Oard, and M. Post, “Multilingual tedx corpus for speech recognition and translation,” in Interspeech, 2021

2021

[64] [64]

One size does not fit all: Adapting the Montreal Forced Aligner (MFA) to your data,

M. McAuliffe and K. Gunter, “One size does not fit all: Adapting the Montreal Forced Aligner (MFA) to your data,” 2025, workshop at the 2025 LSA Summer Institute. Available at https://github.com/mmcauliffe/mfa-adaptation

2025
