Montreal Forced Aligner and the state of speech-to-text alignment in 2026
Pith reviewed 2026-06-27 00:24 UTC · model grok-4.3
The pith
MFA 3.0 reaches state-of-the-art alignment accuracy with mean boundary errors below 15 ms on four benchmark datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MFA 3.0 achieves state-of-the-art or near state-of-the-art performance across all four benchmark datasets with mean boundary errors below 15 ms. Adaptation and cross-language remapping are effective for languages outside MFA's training distribution, and pronunciation probability modeling and phonological rules provide gains in specific conditions. The paper documents MFA 3.0's developments since version 1.0, including expanded language coverage from larger open-source datasets and harmonized IPA dictionaries.
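To make the headline metric concrete, here is a minimal sketch of a mean boundary error computation. It assumes the metric is the mean absolute difference between predicted and manually corrected boundary times, with boundaries already paired one-to-one. The function name and the toy timestamps are hypothetical, and the paper's exact protocol (for example, how inserted or deleted phones are handled) may differ.

```python
from statistics import mean


def mean_boundary_error_ms(reference_s, predicted_s):
    """Mean absolute boundary error in milliseconds.

    Assumes both lists hold boundary times in seconds and are already
    paired one-to-one (same phone sequence in both alignments).
    """
    if len(reference_s) != len(predicted_s):
        raise ValueError("boundary lists must have the same length")
    if not reference_s:
        raise ValueError("no boundaries to compare")
    return mean(abs(r - p) * 1000.0 for r, p in zip(reference_s, predicted_s))


# Hypothetical toy data, not taken from the paper.
reference = [0.100, 0.250, 0.400]
predicted = [0.110, 0.245, 0.420]
error_ms = mean_boundary_error_ms(reference, predicted)
print(f"mean boundary error: {error_ms:.2f} ms")
```

Under this definition, a system meets the 15 ms figure on a dataset when the value returned over all scored boundaries is below 15.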
What carries the argument
The Montreal Forced Aligner 3.0, whose performance is driven by expanded training data, model adaptation, cross-language phone remapping, and optional pronunciation probability modeling.
If this is right
- Researchers gain a single open tool that delivers near-top accuracy on English, Japanese, and Korean without needing separate aligners for each language.
- Model adaptation and cross-language remapping extend usable accuracy to languages outside the original training distribution.
- Pronunciation probability modeling and phonological rules can be added selectively to improve results in variable speech conditions.
- Mean boundary errors below 15 ms become a new practical target for forced alignment systems.
Where Pith is reading between the lines
- The reported performance level may reduce the need for manual correction of alignments in large speech corpora.
- The same adaptation methods could be tested on additional low-resource languages to measure how far the gains generalize.
- Downstream tasks such as automatic prosody labeling or phonetic analysis may achieve higher reliability once input alignments carry smaller timing errors.
Load-bearing premise
The four benchmark datasets are representative enough to support claims of state-of-the-art status across languages and conditions.
What would settle it
A new benchmark dataset or language in which a competing forced aligner produces lower mean boundary errors than MFA 3.0 when both are evaluated under matched training and testing conditions.
Original abstract
The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded coverage across more languages and dialects using larger open-source datasets, harmonized IPA dictionaries, model adaptation, cross-language phone remapping, and support utilities. This paper documents MFA 3.0's developments since version 1.0 and evaluates MFA's performance across English, Japanese, and Korean, benchmarked against classic and neural forced aligners. MFA 3.0 achieves state-of-the-art or near state-of-the-art performance across all four benchmark datasets with mean boundary errors below 15 ms. Adaptation and cross-language remapping are effective for languages outside MFA's training distribution, and pronunciation probability modeling and phonological rules provide gains in specific conditions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper documents developments in the Montreal Forced Aligner (MFA) from v1.0 to v3.0, including expanded language coverage via larger datasets, harmonized IPA dictionaries, model adaptation, cross-language phone remapping, and support utilities. It evaluates MFA 3.0 on four benchmark datasets spanning English, Japanese, and Korean against classic and neural forced aligners, claiming SOTA or near-SOTA performance with mean boundary errors below 15 ms. Additional claims address the effectiveness of adaptation/remapping for out-of-distribution languages and gains from pronunciation probability modeling and phonological rules under specific conditions.
Significance. If the empirical claims hold under transparent and equivalent evaluation conditions, the work would be significant for the speech alignment community: MFA is already the most widely used tool, so an updated performance characterization with practical adaptation methods for additional languages would provide a useful reference point and baseline for 2026-era forced alignment research.
major comments (2)
- [Abstract and evaluation description] The SOTA/near-SOTA claim with mean boundary error <15 ms across all four datasets rests on unspecified benchmark datasets (only languages named, not the actual corpora or their sizes/splits), unnamed neural aligners and versions from the 2025-2026 literature, and no statement of identical data splits, evaluation protocols, or statistical testing. This directly undermines verification of the central performance claim.
- [Results section, implied by abstract claims] Equivalence of comparisons to neural baselines is not demonstrated. Without explicit confirmation that all systems were run under matched conditions (same audio, same evaluation metric, same train/test partitions), the reported superiority cannot be assessed as load-bearing evidence.
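One way to supply the statistical testing the first comment asks for is a paired bootstrap over boundaries scored by two aligners. The sketch below is an illustration under stated assumptions, not the paper's procedure: the function name, the resample count, and the toy error values are hypothetical, and it resamples individual boundaries, which ignores clustering by utterance or speaker.

```python
import random
from statistics import mean


def paired_bootstrap_ci(errors_a, errors_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (a - b).

    errors_a and errors_b are per-boundary absolute errors in ms for the
    same boundaries, in the same order.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both systems must be scored on the same boundaries")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    low = resampled[int((alpha / 2) * n_resamples)]
    high = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), low, high


# Hypothetical per-boundary errors in ms; not results from the paper.
errors_a = [12, 8, 15, 20, 9, 11, 14, 7]
errors_b = [10, 9, 12, 16, 9, 8, 13, 6]
observed, low, high = paired_bootstrap_ci(errors_a, errors_b)
print(f"mean difference {observed:.3f} ms, 95% CI [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the difference between the two systems is unlikely to be a resampling artifact under these assumptions; a hierarchical resampling scheme would be more appropriate when boundaries within an utterance are correlated.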
minor comments (2)
- [Abstract] Clarify the fourth dataset (abstract names only three languages).
- Provide at least one table or figure summarizing the exact mean boundary errors, standard deviations, and baseline comparisons rather than summary statements alone.
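The summary table requested in the last comment could be produced along these lines. This is a sketch only: the aligner names and error values are hypothetical placeholders, not results from the paper, and the standard deviation is the sample standard deviation.

```python
from statistics import mean, stdev


def summarize(errors_by_aligner):
    """Return (name, n, mean_ms, sd_ms) rows sorted by ascending mean error."""
    rows = []
    for name, errors in errors_by_aligner.items():
        rows.append((name, len(errors), mean(errors), stdev(errors)))
    return sorted(rows, key=lambda row: row[2])


# Hypothetical per-boundary absolute errors in ms; not results from the paper.
toy_errors = {
    "aligner_a": [10.0, 12.0, 14.0],
    "aligner_b": [20.0, 25.0, 30.0],
}
rows = summarize(toy_errors)
for name, n, mean_ms, sd_ms in rows:
    print(f"{name:<12} n={n:<4} mean={mean_ms:6.2f} ms  sd={sd_ms:6.2f} ms")
```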
Simulated Author's Rebuttal
We thank the referee for highlighting issues with the verifiability of our central performance claims. We agree that additional explicit details are needed and will revise the manuscript to address both major comments.
Point-by-point responses
- Referee: [Abstract and evaluation description] The SOTA/near-SOTA claim with mean boundary error <15 ms across all four datasets rests on unspecified benchmark datasets (only languages named, not the actual corpora or their sizes/splits), unnamed neural aligners and versions from the 2025-2026 literature, and no statement of identical data splits, evaluation protocols, or statistical testing. This directly undermines verification of the central performance claim.
Authors: We agree this is a substantive gap. The current manuscript names only the languages and does not list the specific corpora, sizes, splits, exact neural aligner versions, or confirm identical protocols and testing. We will add a dedicated evaluation subsection that enumerates the four benchmark datasets with sizes and splits, names the neural baselines with citations and versions, states that all systems used the same boundary-error metric, and reports any statistical testing performed. This revision will make the SOTA claims directly verifiable.
Revision: yes
- Referee: [Results section, implied by abstract claims] Equivalence of comparisons to neural baselines is not demonstrated. Without explicit confirmation that all systems were run under matched conditions (same audio, same evaluation metric, same train/test partitions), the reported superiority cannot be assessed as load-bearing evidence.
Authors: The manuscript does not contain an explicit statement confirming matched conditions across systems. We will revise the results section to add a clear paragraph stating that all aligners (classic and neural) were evaluated on identical audio files using the same mean boundary error metric and the same train/test partitions. If any neural results were taken from published tables rather than re-run, we will note that limitation and its implications for direct comparison.
Revision: yes
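A matched-conditions statement of the kind promised here can be backed by a simple coverage check. The sketch below assumes each system's results are stored as a mapping from utterance ID to a score; the function name and IDs are hypothetical. It only verifies that every aligner was scored on the same utterances and says nothing about whether train/test partitions matched.

```python
def find_unmatched_utterances(results_by_aligner):
    """Return {aligner: [utterance IDs it is missing]}.

    results_by_aligner maps aligner name -> {utterance_id: error_ms}.
    An empty result means all systems were scored on the same utterances.
    """
    id_sets = {name: set(results) for name, results in results_by_aligner.items()}
    all_ids = set().union(*id_sets.values())
    missing = {}
    for name, ids in id_sets.items():
        gap = all_ids - ids
        if gap:
            missing[name] = sorted(gap)
    return missing


# Hypothetical results; aligner_b was not scored on utt2.
toy_results = {
    "aligner_a": {"utt1": 11.0, "utt2": 9.5},
    "aligner_b": {"utt1": 14.0},
}
unmatched = find_unmatched_utterances(toy_results)
print(unmatched or "all aligners scored on identical utterances")
```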
Circularity Check
Empirical benchmarking paper contains no derivation chain or self-referential predictions
Full rationale
The paper evaluates MFA 3.0 performance via reported boundary errors on four benchmark datasets and compares against other aligners under stated conditions. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. All claims rest on experimental results rather than any reduction to inputs by construction, satisfying the default expectation of no significant circularity for an empirical study.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Many forced aligners have been developed over the past 20 years (e.g., [1, 2, 3, 4]), and the field has a healthy ecosystem of tools for different use cases
Introduction Forced alignment, the automatic temporal alignment of words and phonemes to a speech recording given its orthographic transcription, has become a standard first step in language science research across (socio)phonetics, language documentation, and psycholinguistics. Many forced aligners have been developed over the past 20 years (e.g., [1, ...
2016
-
[2]
We briefly cover each to motivate the features of MFA 3.0 described in Sec
Background Development of MFA from 1.0 to 3.0 has been driven by rapid expansion in the use cases of forced aligners in scientific research and the data and tools available. We briefly cover each to motivate the features of MFA 3.0 described in Sec. 3. 2.1. Use cases A forced aligner consists minimally of an acoustic model and pronunciation dictionary; ten ye...
Pith/arXiv arXiv 2012
-
[3]
Pronunciation probability
Montreal Forced Aligner 3.0 MFA is an open-source command line utility with prebuilt executables for Windows, Mac OSX, and Linux [5]. MFA 3.0 extends version 1.0 in four main areas. First, it leverages the increase in available data to provide an expanded set of pretrained acoustic models with greater coverage of languages and linguistic/social variat...
-
[4]
T” maps to TIMIT “t
Evaluation 4.1. Datasets Benchmark datasets were created from four corpora with manually corrected phone-level boundaries across three languages (Table 3). The two English datasets are TIMIT [63], a corpus of read speech, and the Buckeye Corpus of spontaneous speech [64]. The other two datasets represent Japanese and Korean, which are not typically incl...
-
[5]
Word alignment results Table 4 shows word-level alignment accuracy on TIMIT and Buckeye
Results 5.1. Word alignment results Table 4 shows word-level alignment accuracy on TIMIT and Buckeye. MFA 3.0 substantially outperforms all three neural ASR-based aligners on both datasets, extending the findings of
-
[6]
to a larger comparison class. The gap is greatest for small thresholds (10, 25 ms) that are most relevant for speech research. In comparison to other aligners, MFA 3.0 markedly outperforms MFA 1.0 on both datasets and ranks near the top compared to most other aligners. For Buckeye, MFA ARPA 3.0 and MFA Global 3.0 show the best performance, and all other...
-
[7]
Discussion The goal of this paper was two-fold: 1) to evaluate MFA 3.0 performance against MFA 1.0 and other current aligners, and 2) to demonstrate available modeling utilities in MFA 3.0 that feed into end-to-end forced alignment for speech research. MFA 3.0 pretrained models demonstrate substantial improvements on benchmarking datasets compared to MF...
-
[8]
MFA 3.0 demonstrates state of the art performance against existing aligners and provides users additional functionality to tailor MFA to their own data
Conclusion We have presented key updates to the Montreal Forced Aligner that have improved performance on benchmarks across three languages, along with summarizing new utilities included in MFA 3.0. MFA 3.0 demonstrates state of the art performance against existing aligners and provides users additional functionality to tailor MFA to their own data. Fut...
-
[9]
Acknowledgments We acknowledge funding from SSHRC #430-2014-00018, FRQSC #183356, CFI #32451 and SSHRC CRC program to Morgan Sonderegger; SSHRC #435-2014-1504 and the SSHRC CRC program to Michael Wagner; and NIH #R01DC019645-03 awarded to Katherine C. Hustad
2014
-
[10]
Generative AI Use Disclosure No generative AI was used in preparing this manuscript
-
[11]
Speaker identification on the SCOTUS corpus,
J. Yuan, M. Liberman et al., “Speaker identification on the SCOTUS corpus,” Journal of the Acoustical Society of America, vol. 123, no. 5, p. 3878, 2008
2008
-
[12]
FAVE (Forced Alignment and Vowel Extraction) Program Suite,
I. Rosenfelder, J. Fruehwald, K. Evanini, and J. Yuan, “FAVE (Forced Alignment and Vowel Extraction) Program Suite,” 2011, available at http://fave.ling.upenn.edu
2011
-
[13]
Signal processing via web services: the use case WebMAUS,
T. Kisler, F. Schiel, and H. Sloetjes, “Signal processing via web services: the use case WebMAUS,” in Digital Humanities Conference, 2012
2012
-
[14]
Prosodylab-Aligner: A tool for forced alignment of laboratory speech,
K. Gorman, J. Howell, and M. Wagner, “Prosodylab-Aligner: A tool for forced alignment of laboratory speech,” Canadian Acoustics, vol. 39, no. 3, pp. 192–193, 2011
2011
-
[15]
Montreal Forced Aligner: Trainable text-speech alignment using Kaldi,
M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal Forced Aligner: Trainable text-speech alignment using Kaldi,” in Interspeech, 2017, pp. 498–502
2017
-
[16]
Performance of forced-alignment algorithms on children’s speech,
T. J. Mahr, V. Berisha, K. Kawabata, J. Liss, and K. C. Hustad, “Performance of forced-alignment algorithms on children’s speech,” JSLHR, vol. 64, no. 6S, pp. 2213–2222, 2021
2021
-
[17]
Tradition or innovation: A comparison of modern ASR methods for forced alignment,
R. Rousso, E. Cohen, J. Keshet, and E. Chodroff, “Tradition or innovation: A comparison of modern ASR methods for forced alignment,” arXiv preprint arXiv:2406.19363, 2024
arXiv 2024
-
[18]
The Mason-Alberta Phonetic Segmenter: a forced alignment system based on deep neural networks and interpolation,
M. C. Kelley, S. J. Perry, and B. V. Tucker, “The Mason-Alberta Phonetic Segmenter: a forced alignment system based on deep neural networks and interpolation,” Phonetica, vol. 81, no. 5, pp. 451–508, 2024
2024
-
[19]
The variation in conversation (ViC) project: Creation of the Buckeye Corpus of Conversational Speech,
S. Kiesling, L. Dilley, and W. D. Raymond, “The variation in conversation (ViC) project: Creation of the Buckeye Corpus of Conversational Speech,” Language Variation and Change, pp. 55–97, 2006
2006
-
[20]
Using automatic alignment to analyze endangered language data: Testing the viability of untrained alignment,
C. DiCanio, H. Nam, D. H. Whalen, H. Timothy Bunnell, J. D. Amith, and R. C. García, “Using automatic alignment to analyze endangered language data: Testing the viability of untrained alignment,” JASA, vol. 134, no. 3, pp. 2235–2246, 2013
2013
-
[21]
Forced alignment for understudied language varieties: Testing Prosodylab-Aligner with Tongan data,
L. M. Johnson, M. Di Paolo, and A. Bell, “Forced alignment for understudied language varieties: Testing Prosodylab-Aligner with Tongan data,” Language Documentation & Conservation, vol. 12, pp. 80–123, 2018
2018
-
[22]
A Robin Hood approach to forced alignment: English-trained algorithms and their use on Australian languages,
S. Babinski, R. Dockum, J. H. Craft, A. Fergus, D. Goldenberg, and C. Bowern, “A Robin Hood approach to forced alignment: English-trained algorithms and their use on Australian languages,” LSA, vol. 4, pp. 3–1, 2019
2019
-
[23]
The use of phone categories and cross-language modeling for phone alignment of Panara,
E. P. Ahn, E. Chodroff, M. Lapierre, and G.-A. Levow, “The use of phone categories and cross-language modeling for phone alignment of Panara,” in Interspeech, 2024, pp. 1505–1509
2024
-
[24]
Multilingual MFA: Forced Alignment on Low-Resource Related Languages,
A. Tosolini and C. Bowern, “Multilingual MFA: Forced Alignment on Low-Resource Related Languages,” in Eighth ComputEL, 2025, pp. 100–109
2025
-
[25]
Assessing the accuracy of existing forced alignment software on varieties of British English,
L. MacKenzie and D. Turton, “Assessing the accuracy of existing forced alignment software on varieties of British English,” Linguistics Vanguard, vol. 6, no. s1, p. 20180061, 2020
2020
-
[26]
Maximizing accuracy of forced alignment for spontaneous child speech,
R. Fromont, L. Clark, J. W. Black, and M. Blackwood, “Maximizing accuracy of forced alignment for spontaneous child speech,” Language Development Research, vol. 3, no. 1, 2023
2023
-
[27]
Examining factors influencing the viability of automatic acoustic analysis of child speech,
T. Knowles, M. Clayards, and M. Sonderegger, “Examining factors influencing the viability of automatic acoustic analysis of child speech,” JSLHR, vol. 61, no. 10, pp. 2487–2501, 2018
2018
-
[28]
A semi-automatic pipeline for transcribing and segmenting child speech,
P. Christodoulidou, J. Tanner, J. Stuart-Smith, M. McAuliffe, M. Murali, A. Smith, L. Taylor, J. Cleland, and A. Kuschmann, “A semi-automatic pipeline for transcribing and segmenting child speech,” in Interspeech, 2025
2025
-
[29]
Analysis of forced aligner performance on L2 English speech,
S. Williams, P. Foulkes, and V. Hughes, “Analysis of forced aligner performance on L2 English speech,” Speech Communication, vol. 158, p. 103042, 2024
2024
-
[30]
SPPAS: a tool for the phonetic segmentations of speech,
B. Bigi, “SPPAS: a tool for the phonetic segmentations of speech,” in LREC, 2012, pp. 1748–1755
2012
-
[31]
BFA: Real-time Multilingual Text-to-speech Forced Alignment,
A. Rehman, J. Cai, J.-J. Zhang, and X. Yang, “BFA: Real-time Multilingual Text-to-speech Forced Alignment,” arXiv preprint arXiv:2509.23147, 2025
arXiv 2025
-
[32]
Recent advances in technologies for resource creation and mobilization in language documentation,
A. L. Berez-Kroeker, S. Gabber, and A. Slayton, “Recent advances in technologies for resource creation and mobilization in language documentation,” Annual Review of Linguistics, vol. 9, no. 1, pp. 195–214, 2023
2023
-
[33]
Computational sociophonetics using automatic speech recognition,
R. Coto-Solano, “Computational sociophonetics using automatic speech recognition,” Language and Linguistics Compass, vol. 16, no. 9, p. e12474, 2022
2022
-
[34]
Comparing language-specific and cross-language acoustic models for low-resource phonetic forced alignment,
E. Chodroff, E. P. Ahn, and H. Dolatian, “Comparing language-specific and cross-language acoustic models for low-resource phonetic forced alignment,” Language Documentation & Conservation, vol. 19, pp. 201–223, 2025
2025
-
[35]
Probabilistic analysis of pronunciation with ‘MAUS’,
F. Schiel and A. Kipp, “Probabilistic analysis of pronunciation with ‘MAUS’,” ZAS Papers in Linguistics, vol. 11, pp. 51–60, 1998
1998
-
[36]
Investigating /l/ variation in English through forced alignment,
J. Yuan and M. Y. Liberman, “Investigating /l/ variation in English through forced alignment,” in Interspeech, 2009
2009
-
[37]
Acoustic reduction in conversational Dutch: A quantitative analysis based on automatically generated segmental transcriptions,
B. Schuppler, M. Ernestus, O. Scharenborg, and L. Boves, “Acoustic reduction in conversational Dutch: A quantitative analysis based on automatically generated segmental transcriptions,” Journal of Phonetics, vol. 39, no. 1, pp. 96–109, 2011
2011
-
[38]
Large-scale analysis of Spanish /s/-lenition using audiobooks,
N. Ryant and M. Liberman, “Large-scale analysis of Spanish /s/-lenition using audiobooks,” in Meetings on Acoustics, vol. 28, no. 1, 2016, p. 060005
2016
-
[39]
Extracting linguistic knowledge from speech: A study of stop realization in 5 Romance languages,
Y. Wu, M. Hutin, I. Vasilescu, L. Lamel, and M. Adda-Decker, “Extracting linguistic knowledge from speech: A study of stop realization in 5 Romance languages,” in LREC, 2022, pp. 3257–3263
2022
-
[40]
Common voice: A massively-multilingual speech corpus,
R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in LREC, 2020, pp. 4218–4222
2020
-
[41]
MLS: A large-scale multilingual dataset for speech research,
V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “MLS: A large-scale multilingual dataset for speech research,” ArXiv, vol. abs/2012.03411, 2020
Pith/arXiv arXiv 2012
-
[42]
Massively Multilingual Pronunciation Modeling with WikiPron,
J. L. Lee, L. F. Ashby, M. E. Garza, Y. Lee-Sikka, S. Miller, A. Wong, A. D. McCarthy, and K. Gorman, “Massively Multilingual Pronunciation Modeling with WikiPron,” in LREC, 2020, pp. 4223–4228
2020
-
[43]
Pynini: A Python library for weighted finite-state grammar compilation,
K. Gorman, “Pynini: A Python library for weighted finite-state grammar compilation,” in SIGFSM Workshop on Statistical NLP and Weighted Automata, 2016, pp. 75–80
2016
-
[44]
Epitran: Precision G2P for many languages,
D. R. Mortensen, S. Dalmia, and P. Littell, “Epitran: Precision G2P for many languages,” in LREC, 2018
2018
-
[45]
The cross-linguistic phonological frequencies (XPF) corpus,
U. C. Priva, E. Strand, S. Yang, W. Mizgerd, A. Creighton, J. Bai, R. Mathew, A. Shao, J. Schuster, and D. Wiepert, “The cross-linguistic phonological frequencies (XPF) corpus,” 2021
2021
-
[46]
LaBB-CAT: An annotation store,
R. Fromont and J. Hay, “LaBB-CAT: An annotation store,” in Australasian language technology association workshop, 2012, pp. 113–117
2012
-
[47]
EMU-SDMS: Advanced speech database management and analysis in R,
R. Winkelmann, J. Harrington, and K. Jänsch, “EMU-SDMS: Advanced speech database management and analysis in R,” Computer Speech & Language, vol. 45, pp. 392–410, 2017
2017
-
[48]
Pyannote.audio: neural building blocks for speaker diarization,
H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, “Pyannote.audio: neural building blocks for speaker diarization,” in ICASSP. IEEE, 2020, pp. 7124–7128
2020
-
[49]
Whisperx: Time-accurate speech transcription of long-form audio,
M. Bain, J. Huh, T. Han, and A. Zisserman, “Whisperx: Time-accurate speech transcription of long-form audio,” arXiv preprint arXiv:2303.00747, 2023
arXiv 2023
-
[50]
SpeechBrain: A general-purpose speech toolkit,
M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, and others, “SpeechBrain: A general-purpose speech toolkit,” arXiv preprint arXiv:2106.04624, 2021
arXiv 2021
-
[51]
Voxcommunis: A corpus for cross-linguistic phonetic analysis,
E. Ahn and E. Chodroff, “Voxcommunis: A corpus for cross-linguistic phonetic analysis,” in LREC, 2022, pp. 5286–5294
2022
-
[52]
A pipeline for the large-scale acoustic analysis of streamed content,
S. Coats, “A pipeline for the large-scale acoustic analysis of streamed content,” in CMC-Corpora 2023, 2023, pp. 51–54
2023
-
[53]
The HTK hidden Markov model toolkit: Design and philosophy,
S. J. Young, “The HTK hidden Markov model toolkit: Design and philosophy,” 1993
1993
-
[54]
Julius—an open source real-time large vocabulary recognition engine,
A. Lee, T. Kawahara, and K. Shikano, “Julius—an open source real-time large vocabulary recognition engine,” 2001
2001
-
[55]
The Kaldi Speech Recognition Toolkit,
D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi Speech Recognition Toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011, IEEE Catalog No.: CFP11SRW-USB
2011
-
[56]
NeMo Forced Aligner and its application to word alignment for subtitle generation,
E. Rastorgueva, V. Lavrukhin, and B. Ginsburg, “NeMo Forced Aligner and its application to word alignment for subtitle generation,” in Interspeech 2023, 2023, pp. 5257–5258
2023
-
[57]
Phone-to-audio alignment without text: A semi-supervised approach,
J. Zhu, C. Zhang, and D. Jurgens, “Phone-to-audio alignment without text: A semi-supervised approach,” ICASSP, 2022
2022
-
[58]
Globalphone: A multilingual text & speech database in 20 languages,
T. Schultz, N. T. Vu, and T. Schlippe, “Globalphone: A multilingual text & speech database in 20 languages,” in ICASSP. IEEE, 2013, pp. 8126–8130
2013
-
[59]
Librispeech: an ASR corpus based on public domain audio books,
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in ICASSP. IEEE, 2015, pp. 5206–5210
2015
-
[60]
GlobalPhone: Pronunciation Dictionaries in 20 Languages
T. Schultz and T. Schlippe, “GlobalPhone: Pronunciation Dictionaries in 20 Languages,” in LREC, 2014, pp. 337–341
2014
-
[61]
Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework,
J. R. Novak, N. Minematsu, and K. Hirose, “Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework,” Natural Language Engineering, vol. 22, no. 6, pp. 907–938, 2016
2016
-
[62]
Pronunciation and silence probability modeling for ASR,
G. Chen, H. Xu, M. Wu, D. Povey, and S. Khudanpur, “Pronunciation and silence probability modeling for ASR,” in Interspeech, 2015, pp. 533–537
2015
-
[63]
Multilingual tedx corpus for speech recognition and translation,
E. Salesky, M. Wiesner, J. Bremerman, R. Cattoni, M. Negri, M. Turchi, D. W. Oard, and M. Post, “Multilingual tedx corpus for speech recognition and translation,” in Interspeech, 2021
2021
-
[64]
One size does not fit all: Adapting the Montreal Forced Aligner (MFA) to your data,
M. McAuliffe and K. Gunter, “One size does not fit all: Adapting the Montreal Forced Aligner (MFA) to your data,” 2025, workshop at the 2025 LSA Summer Institute. Available at https://github.com/mmcauliffe/mfa-adaptation
2025
-
[65]
Model cards for model reporting,
M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, “Model cards for model reporting,” in FAccT, 2019, pp. 220–229
2019
-
[66]
Pkuseg: A toolkit for multi-domain chinese word segmentation
R. Luo, J. Xu, Y. Zhang, X. Ren, and X. Sun, “Pkuseg: A toolkit for multi-domain chinese word segmentation,” CoRR, vol. abs/1906.11455, 2019. [Online]. Available: https://arxiv.org/abs/1906.11455
arXiv 1906
-
[67]
Sudachi: A japanese tokenizer for business,
K. Takaoka, S. Hisamoto, N. Kawahara, M. Sakamoto, Y. Uchida, and Y. Matsumoto, “Sudachi: A japanese tokenizer for business,” in LREC, 2018
2018
-
[68]
PyThaiNLP: Thai natural language processing in Python,
W. Phatthiyaphaibun, K. Chaovavanich, C. Polpanumas, A. Suriyawongkul, L. Lowphansirikul, P. Chormai, P. Limkonchotiwat, T. Suntorntip, and C. Udomcharoenchaikit, “PyThaiNLP: Thai natural language processing in Python,” in NLP-OSS, L. Tan, D. Milajevs, G. Chauhan, J. Gwinnup, and E. Rippeth, Eds. Singapore, Singapore: Empirical Methods in Natural Langua...
2023
-
[69]
Polyglot and Speech Corpus Tools: A System for Representing, Integrating, and Querying Speech Corpora,
M. McAuliffe, E. Stengel-Eskin, M. Socolof, and M. Sonderegger, “Polyglot and Speech Corpus Tools: A System for Representing, Integrating, and Querying Speech Corpora,” in INTERSPEECH, 2017, pp. 3887–3891
2017
-
[70]
Montreal Forced Aligner: Speech-to-text alignment in 2025,
M. McAuliffe and K. Gunter, “Montreal Forced Aligner: Speech-to-text alignment in 2025,” 2025, workshop at the 2025 Montreal Open Tools Symposium. Available at https://colab.research.google.com/drive/1kqaSSyx- DEV AxrSmoWhJTNXtEsVI15yf
2025
-
[71]
Phonetic forced alignment with the Montreal Forced Aligner,
E. Chodroff, “Phonetic forced alignment with the Montreal Forced Aligner,” 2021, available at https://www.youtube.com/watch?v=Zhj-ccMDj w and https://eleanorchodroff.com/tutorial/montreal-forced-aligner.html
2021
-
[72]
A Gentle Guide to Montreal Forced Aligner,
C. Xu, “A Gentle Guide to Montreal Forced Aligner,” 2024, available at https://chenzixu.rbind.io/resources/1forcedalignment/fa6/
2024
-
[73]
DARPA TIMIT acoustic-phonetic continous speech corpus,
J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continous speech corpus,” NASA STI/Recon technical report, vol. 93, p. 27403, 1993
1993
-
[74]
Buckeye Corpus of Conversational Speech,
M. Pitt, L. Dilley, K. Johnson, S. Kiesling, W. Raymond, E. Hume, and E. Fosler-Lussier, “Buckeye Corpus of Conversational Speech,” 2007, available at www.buckeyecorpus.osu.edu
2007
-
[75]
Corpus of Spontaneous Japanese: Its design and evaluation,
K. Maekawa, “Corpus of Spontaneous Japanese: Its design and evaluation,” in ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003
2003
-
[76]
The Korean Corpus of Spontaneous Speech,
W. Yun, K. Yoon, S. Park, J. Lee, S. Cho, D. Kang, K. Byun, H. Hahn, and J. Kim, “The Korean Corpus of Spontaneous Speech,” Journal of the Korean Society of Speech Sciences, vol. 7(2), pp. 103–109, 2015
2015
-
[77]
A forced-alignment-based study of declarative sentence-ending,
T.-J. Yoon and Y. Kang, “A forced-alignment-based study of declarative sentence-ending,” in SpeechProsody 2012, 2012, pp. 559–562
2012
-
[78]
wav2vec 2.0: A framework for self-supervised learning of speech representations,
A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12449–12460, 2020
2020
-
[79]
Scaling speech technology to 1,000+ languages,
V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, and others, “Scaling speech technology to 1,000+ languages,” Journal of Machine Learning Research, vol. 25, no. 97, pp. 1–52, 2024
2024