PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation

Dung D. Le; Giang Son Nguyen; Hieu Minh Truong; Nhu Vo; Tung X. Nguyen; Wray Buntine

arxiv: 2606.12911 · v1 · pith:ZPK4HX4Onew · submitted 2026-06-11 · 💻 cs.CL

PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation

Giang Son Nguyen , Tung X. Nguyen , Hieu Minh Truong , Nhu Vo , Wray Buntine , Dung D. Le This is my paper

Pith reviewed 2026-06-27 06:45 UTC · model grok-4.3

classification 💻 cs.CL

keywords Vietnamese speech translationASR error analysisphonetic data augmentationneural machine translationcascaded speech translationerror robustnessphonetic embeddings

0 comments

The pith

Fine-tuning on phonetically corrupted data improves Vietnamese speech translation robustness to ASR errors by up to 2.04 BLEU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first classifies Vietnamese ASR substitution errors by phonetic cause and uses statistical modeling to show these errors, not random noise, drive most downstream translation damage in cascaded systems. It then introduces a data-augmentation technique that replaces words with phonetically close alternatives drawn from embeddings to create training examples that mimic real ASR mistakes. Fine-tuning an NMT model on the resulting augmented FLEURS Vietnamese-English set raises performance on erroneous ASR inputs while also lifting results on clean text. A sympathetic reader would care because the approach directly targets error propagation without requiring changes to the ASR or NMT architecture itself.

Core claim

Most ASR substitution errors in Vietnamese arise from phonetic confusions; generating similar corruptions via phonetic word embeddings for data augmentation and then fine-tuning NMT on the augmented set produces models that translate both erroneous ASR outputs and clean text more accurately than standard fine-tuning.

What carries the argument

Phonetically-Informed Data Augmentation (PiDA), which creates ASR-like training examples by substituting words with phonetically similar alternatives drawn from phonetic word embeddings.

If this is right

Fine-tuning on the PiDA-augmented FLEURS set raises BLEU on erroneous ASR inputs by as much as 2.04 points over standard fine-tuning.
The same augmented training also produces a small improvement on clean-text inputs.
Phonetic confusions account for the majority of Vietnamese ASR substitution errors.
These phonetic errors measurably degrade cascaded speech-translation quality according to linear mixed-effects models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same phonetic-substitution approach could be tested on other tonal or phonetically dense languages where ASR errors follow similar patterns.
PiDA might be combined with other augmentation strategies such as noise injection or back-translation to compound robustness gains.
The error categorization step could be reused to diagnose weaknesses in existing ASR systems for Vietnamese.

Load-bearing premise

Substitutions produced by phonetic word embeddings match the distribution and downstream effect of actual ASR errors closely enough to improve real-world translation performance.

What would settle it

An experiment in which fine-tuning on PiDA-augmented data yields no BLEU gain or a loss on held-out ASR outputs compared with standard fine-tuning on the same base corpus.

Figures

Figures reproduced from arXiv: 2606.12911 by Dung D. Le, Giang Son Nguyen, Hieu Minh Truong, Nhu Vo, Tung X. Nguyen, Wray Buntine.

**Figure 1.** Figure 1: Qualitative examples: (1) A sentence from the training set corrupted by PiDA; (2) A sentence from the test set where a model finetuned with PiDA data translates correctly despite ASR errors. dom noise and that phonetic ASR errors meaningfully impact NMT performance. This finding directly motivates our approach: since ASR errors are predominantly phonetic, synthetic augmentation should also be phonetic. We… view at source ↗

read the original abstract

Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST, classifying substitution errors by phonetic cause and quantifying their impact on downstream Neural Machine Translation (NMT) performance using Linear Mixed-Effects Modelling. We confirm that most ASR substitution errors arise from phonetic confusions rather than random noise, and that these phonetic errors significantly degrade ST quality. Motivated by this finding, we propose Phonetically-Informed Data Augmentation (PiDA), which generates ASR-like corruptions by substituting words with phonetically similar alternatives using phonetic word embeddings. Fine-tuning on a PiDA-augmented version of FLEURS Vietnamese-English improves translation of erroneous ASR outputs (up to +2.04 BLEU over standard fine-tuning) while also slightly improving clean-text performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PiDA gives a modest BLEU gain on noisy Vietnamese ST inputs via phonetic augmentation, but the paper needs to show that the substitutions actually track real ASR confusion patterns.

read the letter

The main takeaway is that the authors first categorize Vietnamese ASR substitution errors with linear mixed-effects modeling, conclude most are phonetic, and then use phonetic word embeddings to create training data that improves NMT robustness to those errors. They report up to +2.04 BLEU on erroneous ASR outputs and a small lift on clean text when fine-tuning on the augmented FLEURS Vietnamese-English set.

What stands out is the error categorization step for this language pair and the direct link from that analysis to the augmentation method. The work targets a practical pain point in cascaded speech translation for a low-resource tonal language, and the empirical result on both noisy and clean performance is worth noting.

The soft spot is the missing verification that the embedding-derived substitutions match the observed error distribution. The stress-test concern holds: without an explicit comparison to the empirical confusion matrix (especially tone and vowel patterns), it is unclear whether PiDA is doing targeted robustness training or just generic noise injection. The abstract also gives no information on baseline systems, statistical significance, or controls for data volume, so the reported gains are hard to assess from the summary alone.

This paper is for researchers working on cascaded ST pipelines or robustness techniques for tonal or low-resource languages. A reader already familiar with phonetic embeddings and error analysis in ASR would find the Vietnamese-specific categorization and the augmentation recipe useful.

I would send it to peer review. The core idea is grounded in a real problem and produces measurable results, even if the alignment between augmentations and actual errors needs tighter evidence in the full manuscript.

Referee Report

3 major / 2 minor

Summary. The paper systematically categorizes ASR substitution errors in Vietnamese using Linear Mixed-Effects Modelling on FLEURS data, concludes that most errors are phonetic in origin and degrade downstream NMT, and proposes PiDA: data augmentation that substitutes words with phonetically similar alternatives drawn from phonetic word embeddings. Fine-tuning an NMT model on the PiDA-augmented FLEURS Vietnamese-English corpus yields up to +2.04 BLEU on ASR-error-containing inputs relative to standard fine-tuning, with a modest gain on clean text as well.

Significance. If the central empirical result holds after verification, the work supplies a concrete, language-specific robustness technique for cascaded speech translation that directly targets the dominant error type identified by the LME analysis. The explicit use of phonetic embeddings to simulate substitutions, combined with the error categorization step, offers a reproducible template that could be tested on other tonal or phonetically complex languages; the modest clean-text improvement is an additional practical benefit.

major comments (3)

[§3 and §4] §3 (error categorization) and §4 (PiDA construction): the claim that PiDA substitutions accurately simulate real ASR error distributions rests on the assertion that phonetic embeddings capture the observed confusion patterns, yet no quantitative alignment check (e.g., cosine similarity between embedding-derived substitution probabilities and the empirical confusion matrix, or separate reporting of tone/vowel error overlap) is provided. Without this, the +2.04 BLEU gain could arise from generic noise rather than targeted phonetic robustness.
[§5] §5 (experiments): the reported BLEU improvements are presented without the number of random seeds, standard deviations, or statistical significance tests (paired bootstrap or approximate randomization), and without explicit controls that hold total training data size constant between the baseline and PiDA-augmented conditions. These omissions make it impossible to judge whether the gains are robust or merely an artifact of increased data volume.
[Table 2 / §5.2] Table 2 / §5.2: the comparison is limited to standard fine-tuning; no ablation against other augmentation baselines (random word substitution, back-translation, or non-phonetic noise injection) is reported, so it remains unclear whether the phonetic embedding component is necessary for the observed robustness gain.

minor comments (2)

[Abstract] The abstract states a concrete BLEU number but the main text should cross-reference the exact table/row that produces the +2.04 figure.
[§4] Notation for phonetic embeddings (e.g., how the similarity threshold or top-k selection is chosen) should be formalized in an equation or algorithm box for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the empirical claims regarding PiDA's alignment with ASR errors, statistical robustness, and comparative evaluation. We address each point below and commit to revisions where appropriate.

read point-by-point responses

Referee: [§3 and §4] §3 (error categorization) and §4 (PiDA construction): the claim that PiDA substitutions accurately simulate real ASR error distributions rests on the assertion that phonetic embeddings capture the observed confusion patterns, yet no quantitative alignment check (e.g., cosine similarity between embedding-derived substitution probabilities and the empirical confusion matrix, or separate reporting of tone/vowel error overlap) is provided. Without this, the +2.04 BLEU gain could arise from generic noise rather than targeted phonetic robustness.

Authors: We agree that an explicit quantitative alignment between the phonetic embedding substitutions and the empirical ASR confusion matrix would provide stronger support for the targeted nature of PiDA. In the revised manuscript we will add this analysis, including cosine similarity between embedding-derived substitution probabilities and the observed confusion patterns from the FLEURS ASR outputs, as well as separate reporting of tone versus vowel error overlap. This will help demonstrate that the gains derive from phonetic robustness rather than generic noise. The existing LME analysis already establishes the phonetic origin of most substitutions, but the additional check will directly address the concern. revision: yes
Referee: [§5] §5 (experiments): the reported BLEU improvements are presented without the number of random seeds, standard deviations, or statistical significance tests (paired bootstrap or approximate randomization), and without explicit controls that hold total training data size constant between the baseline and PiDA-augmented conditions. These omissions make it impossible to judge whether the gains are robust or merely an artifact of increased data volume.

Authors: We acknowledge these omissions weaken the ability to assess robustness. In the revision we will report results across multiple random seeds with standard deviations and include statistical significance testing via paired bootstrap resampling. For data volume, we will clarify the exact construction of the PiDA-augmented corpus and add an explicit control experiment that matches total training tokens between conditions (e.g., by applying random substitutions to reach equivalent size). If the original experiments did not hold size constant, this control will be newly run and reported. revision: yes
Referee: [Table 2 / §5.2] Table 2 / §5.2: the comparison is limited to standard fine-tuning; no ablation against other augmentation baselines (random word substitution, back-translation, or non-phonetic noise injection) is reported, so it remains unclear whether the phonetic embedding component is necessary for the observed robustness gain.

Authors: We agree that ablations against non-phonetic augmentation methods are necessary to isolate the contribution of the phonetic embeddings. In the revised manuscript we will add these baselines, including random word substitution at the same rate as PiDA and a non-phonetic noise injection method, evaluated on both clean and ASR-error inputs. This will allow direct comparison to confirm that the phonetic component drives the robustness improvement. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical augmentation result is measured on held-out data

full rationale

The paper's central claim is an empirical performance gain (+2.04 BLEU) obtained by fine-tuning an NMT model on PiDA-augmented data and evaluating on ASR-error and clean test sets. No equations, fitted parameters, or self-citations are invoked to derive the improvement by construction; the result is presented as the measured outcome of the augmentation procedure applied to FLEURS. The categorization of ASR errors via Linear Mixed-Effects Modelling and the use of phonetic embeddings are independent methodological steps whose validity is assessed externally by downstream translation metrics, not by internal reduction to the input data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based solely on abstract; full paper may contain additional parameters or assumptions not visible here.

axioms (1)

domain assumption Most ASR substitution errors in Vietnamese arise from phonetic confusions rather than random noise.
Stated as confirmed by Linear Mixed-Effects Modelling in the abstract.

pith-pipeline@v0.9.1-grok · 5696 in / 1080 out tokens · 17760 ms · 2026-06-27T06:45:09.990723+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 2 canonical work pages

[1]

Introduction Speech translation (ST) converts spoken language directly into text in another language. T wo paradigms dominate: end-to-end ST (E2E ST), which maps audio directly to translated text, and cascaded ST, which combines an Automatic Speech Recogni- tion (ASR) system with a Neural Machine Translation (NMT) model. Despite progress in E2E ST, recent...
[2]

This lack of high-quality datasets motivates our use of cascaded ST where ASR [13, 14, 15] and NMT [ 16, 17, 18] modules can be tuned separately with more abundant resources

Related Work Vietnamese Speech Translation.Vietnamese ST remains un- derrepresented in the literature, with only two data sources pro- viding (audio, transcripts, translation) triplets: the Vietnamese split of multilingual datasets MultiMed-ST [ 4] (16 hours, 9.1k arXiv:2606.12911v1 [cs.CL] 11 Jun 2026 samples) and FLEURS [ 10] (12 hours, 4.2k samples). T...

Pith/arXiv arXiv 2026
[3]

[5] show that NMT robustness improves through synthetic noise augmentation when noise type and amount are carefully calibrated

fine-tune the NMT component on paired clean and noisy tran- scripts to improve cascaded ST. [5] show that NMT robustness improves through synthetic noise augmentation when noise type and amount are carefully calibrated. [ 7] use adversarial train- ing to improve robustness. Recently, MEDSAGE [ 9] extracts error statistics from real ASR outputs and uses a ...
[4]

We instead adopt the pretrained XPhoneBERT em- beddings to model phonetic similarity for data augmentation

to provide contextualized phoneme representations for Text- to-Speech. We instead adopt the pretrained XPhoneBERT em- beddings to model phonetic similarity for data augmentation
[5]

ASR Error Analysis 3.1. Error Extraction and Alignment We use PhoWhisper-large ( PhoWhisper -large) [ 24] and wav2vec2-base ( wav2vec2-base-vietnamese-250h) [25], Vietnamese ASR models based on Whisper [ 26] and Wav2vec 2.0 [27], to transcribe the Vietnamese training split of FLEURS Vietnamese-English (3k samples). ASR outputs are word-aligned with refere...
[6]

Precomputation Phase

Phonetically-Informed Data Augmentation We propose thePhonetically-InformedDataAugmentation (PiDA) pipeline, consisting of six components across two phases: 4.1. Precomputation Phase
[7]

We retrieve the top 50,000 most frequent Vietnamese words and fil- ter to entries matching valid Vietnamese orthographic patterns, yielding approximately 9,400 unique syllables

Syllable Inventory Construction.We extract the Vietnamese syllable inventory using the wordfreq library [31], which pro- vides frequency statistics derived from large web corpora 2. We retrieve the top 50,000 most frequent Vietnamese words and fil- ter to entries matching valid Vietnamese orthographic patterns, yielding approximately 9,400 unique syllables
[8]

Phoneme Conversion.Each syllable is converted to International Phonetic Alphabet (IPA) using CharsiuG2P [ 32], the grapheme-to-phoneme system used by XPhoneBERT
[9]

1No erroris excluded because its coefficient is statistically insignifi- cant in a preliminary model, and it does not represent semantic errors

Embedding Extraction.We pass each phoneme se- quence through XPhoneBERT (xphonebert-base) [22] and mean-pool the output hidden states across all phoneme posi- tions (excluding the [CLS] and [SEP] tokens) to obtain a 768-dimensional vector per syllable. 1No erroris excluded because its coefficient is statistically insignifi- cant in a preliminary model, an...
[10]

For each syllable, we precompute its top-50 phonetically similar neighbors and their cosine similarity scores

Similarity Index Construction.We L2-normalize all syllable embeddings and build a FAISS index [ 33] using inner product search for efficient approximate nearest-neighbor re- trieval. For each syllable, we precompute its top-50 phonetically similar neighbors and their cosine similarity scores. 4.2. Augmentation Phase
[11]

For each word wi, we sample from a Bernoulli distribution with probability p=WER train (the observed word error rate on the training set)

Error Annotation.We follow the error annotation procedure of [ 9]: Given an input sentence to corrupt, we annotate indi- vidual words with error markers based on ASR error statistics computed from the training set. For each word wi, we sample from a Bernoulli distribution with probability p=WER train (the observed word error rate on the training set). If ...
[12]

For deletion markers, the word is simply removed

Corruption.We then process the annotated text to gen- erate the corrupted output. For deletion markers, the word is simply removed. For substitution markers, we perform phonetic corruption with the following procedure. For each syllable si marked for substitution, we sample a replacement from its top- k (k= 5 ) precomputed phonetic neighbors using tempera...
[13]

Experimental Setup Data.We use PiDA to corrupt the training split of FLEURS, then evaluate the fine-tuned models on 0.9k test samples

Experiments 5.1. Experimental Setup Data.We use PiDA to corrupt the training split of FLEURS, then evaluate the fine-tuned models on 0.9k test samples. Models.We use PhoWhisper-large and wav2vec2-base as the ASR systems, and VinAI-Translate as the NMT. Training protocol.We fine-tune VinAI-Translate with: 3 epochs, batch size 8, learning rate 3×10 −5, maxi...
[14]

This motivatedPhonetically-InformedDataAugmentation using XPhoneBERT embeddings (PiDA), which generates syn- thetic errors by substituting phonetically similar syllables

Conclusion & Future Work We presented the first systematic categorization of ASR errors for Vietnamese ST, showing that most substitution errors stem from structured phonetic confusions, and that these errors sub- stantially affect downstream NMT performance. This motivatedPhonetically-InformedDataAugmentation using XPhoneBERT embeddings (PiDA), which gen...
[15]

This research was funded under Project ID VUNI.2324.CC06

Acknowledgments The research results are a part of the outputs of the Cross-College projectRobust Vietnamese–English Clinical and Educational Medical Translation, a collaboration between the College of Engineering & Computer Science (CECS) and the College of Health Sciences (CHS), VinUniversity. This research was funded under Project ID VUNI.2324.CC06. Gi...
[16]

No generative AI tools were used to produce any scientific content, experimental results, data analysis, or conclusions

Generative AI Use Disclosure The authors used generative AI tools solely to assist with minor language editing and readability improvements. No generative AI tools were used to produce any scientific content, experimental results, data analysis, or conclusions
[17]

NICT’s Cascaded and End-To-End Speech Translation Systems using Whisper and IndicTrans2 for the Indic Task,

R. Dabre and H. Song, “NICT’s Cascaded and End-To-End Speech Translation Systems using Whisper and IndicTrans2 for the Indic Task,” inProceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), E. Salesky, M. Federico, and M. Carpuat, Eds. Bangkok, Thailand (in-person and online): Association for Computational Linguistics...

2024
[18]

CMU’s IWSLT 2024 offline speech translation system: A cascaded approach for long-form robustness,

B. Y an, P . Fernandes, J. Tian, S. Ouyang, W . Chen, K. Livescu, L. Li, G. Neubig, and S. Watanabe, “CMU’s IWSLT 2024 offline speech translation system: A cascaded approach for long-form robustness,” inProceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), E. Salesky, M. Federico, and M. Carpuat, Eds. Bangkok, Thail...

2024
[19]

End-to-End Speech Translation Guided by Robust Translation Capability of Large Language Model,

Y . Higuchi, T. Ogawa, and T. Kobayashi, “End-to-End Speech Translation Guided by Robust Translation Capability of Large Language Model,” inInterspeech 2025, 2025, pp. 21–25

2025
[20]

MultiMed-ST: Large-scale Many- to-many Multilingual Medical Speech Translation,

K. Le-Duc, T. Tran, B. P . Tat, N. K. H. Bui, Q. D. Anh, H.-P . Tran, T. T. Nguyen, L. Nguyen, T. M. Phan, T. T. P . Tran, C. Ngo, K. X. Nguyen, and T. Nguyen-Tang, “MultiMed-ST: Large-scale Many- to-many Multilingual Medical Speech Translation,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Assoc...

2025
[21]

Toward Robust Neural Machine Translation for Noisy Input Sequences,

M. Sperber, J. Niehues, and A. Waibel, “Toward Robust Neural Machine Translation for Noisy Input Sequences,” inProceedings of the 14th International Conference on Spoken Language Translation, S. Sakti and M. Utiyama, Eds. Tokyo, Japan: International Workshop on Spoken Language Translation, Dec. 14-15 2017, pp. 90–96. [Online]. Available: https://aclanthol...

2017
[22]

Synthetic and Natural Noise Both Break Neural Machine Translation,

Y . Belinkov and Y . Bisk, “Synthetic and Natural Noise Both Break Neural Machine Translation,” inInternational Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=BJ8vJebC-

2018
[23]

Towards Robust Neural Machine Translation,

Y . Cheng, Z. Tu, F. Meng, J. Zhai, and Y . Liu, “Towards Robust Neural Machine Translation,” inProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), I. Gurevych and Y . Miyao, Eds. Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 1756–1766. [Online]. Available: http...

2018
[24]

Robust Neural Machine Translation for Clean and Noisy Speech Transcripts,

M. Di Gangi, R. Enyedi, A. Brusadin, and M. Federico, “Robust Neural Machine Translation for Clean and Noisy Speech Transcripts,” inProceedings of the 16th International Conference on Spoken Language Translation, J. Niehues, R. Cattoni, S. St ¨uker, M. Negri, M. T urchi, T.-L. Ha, E. Salesky, R. Sanabria, L. Barrault, L. Specia, and M. Federico, Eds. Hong...

2019
[25]

MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues,

K. Binici, A. R. Kashyap, V . Schlegel, A. T. Liu, V . P . Dwivedi, T.-T. Nguyen, X. Gao, N. F. Chen, and S. Winkler, “MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues,” in AI4X 2025 International Conference, 2025. [Online]. Available: https://openreview.net/forum?id=rWOaUq6UBS

2025
[26]

FLEURS: FEW-Shot Learning Evaluation of Universal Representations of Speech,

A. Conneau, M. Ma, S. Khanuja, Y . Zhang, V . Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “FLEURS: FEW-Shot Learning Evaluation of Universal Representations of Speech,” in2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 798–805

2023
[27]

PWESuite: Phonetic word embeddings and tasks they facilitate,

V . Zouhar, K. Chang, C. Cui, N. B. Carlson, N. R. Robinson, M. Sachan, and D. R. Mortensen, “PWESuite: Phonetic word embeddings and tasks they facilitate,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.- Y . Kan, V . Hoste, A. Lenci, S. Sakti, ...

2024
[28]

PSET: a phonetics-semantics evaluation testbed,

G. Sperduti and D. Nguyen, “PSET: a phonetics-semantics evaluation testbed,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 7346–7356. [Online]. Available: https: //aclanthol...

2025
[29]

ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark,

T. X. Nguyen, N. Vo, G.-S. Nguyen, D. M. Hoang, C. D. Huynh, I. J. Unanue, M. Piccardi, W. Buntine, and D. D. Le, “ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark,” 2026. [Online]. Available: https://arxiv.org/abs/2602.12911

Pith/arXiv arXiv 2026
[30]

Common Voice: A Massively-Multilingual Speech Corpus,

R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. T yers, and G. Weber, “Common Voice: A Massively-Multilingual Speech Corpus,” in Proceedings of the Twelfth Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, May 2020, pp. 4218–4222. [Online]. Available: htt...

2020
[31]

A non-expert Kaldi recipe for Vietnamese Speech Recognition System,

H.-T. Luong and H.-Q. Vu, “A non-expert Kaldi recipe for Vietnamese Speech Recognition System,” inProceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI/OIAF4HLT2016). Osaka, Japan: The COLING 2016 Organizing Comm...

2016
[32]

Improving Vietnamese-English Medical Machine Translation,

N. Vo, D. Q. Nguyen, D. D. Le, M. Piccardi, and W. Buntine, “Improving Vietnamese-English Medical Machine Translation,” inProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.- Y . Kan, V . Hoste, A. Lenci, S. Sakti, and N. Xue, Eds. Torino, Italia: ELRA ...

2024
[33]

MTet: Multi-domain Translation for English and Vietnamese,

C. Ngo, T. H. Trinh, L. Phan, H. Tran, T. Dang, H. Nguyen, M. Nguyen, and M.-T. Luong, “MTet: Multi-domain Translation for English and Vietnamese,” 2022. [Online]. Available: https://arxiv.org/abs/2210.05610

arXiv 2022
[34]

PhoMT: A high-quality and large-scale benchmark dataset for Vietnamese-English machine translation,

L. Doan, L. T. Nguyen, N. L. Tran, T. Hoang, and D. Q. Nguyen, “PhoMT: A high-quality and large-scale benchmark dataset for Vietnamese-English machine translation,” inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 4...

2021
[35]

Assessing the impact of speech recognition errors on machine translation quality,

N. Ruiz and M. Federico, “Assessing the impact of speech recognition errors on machine translation quality,” inProceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track, Y . Al-Onaizan and M. Simard, Eds. Vancouver, Canada: Association for Machine Translation in the Americas, Oct. 22-26 2014, pp. 2...

2014
[36]

PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors,

D. R. Mortensen, P . Littell, A. Bharadwaj, K. Goyal, C. Dyer, and L. Levin, “PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors,” inProceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Y . Matsumoto and R. Prasad, Eds. Osaka, Japan: The COLING 2016 Organizing Committee, De...

2016
[37]

2020 , isbn =

A. Fang, S. Filice, N. Limsopatham, and O. Rokhlenko, “Using Phoneme Representations to Build Predictive Models Robust to ASR Errors,” inProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’20. New Y ork, NY , USA: Association for Computing Machinery, 2020, p. 699–708. [Online]. Avail...

work page doi:10.1145/3397271.3401050 2020
[38]

XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech,

L. The Nguyen, T. Pham, and D. Q. Nguyen, “XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech,” inInterspeech 2023, 2023, pp. 5506–5510

2023
[39]

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to- Speech,

J. Kim, J. Kong, and J. Son, “Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to- Speech,” inProceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR, 18–24 Jul 2021, pp. 5530–5540. [Online]. Available: https: //proceed...

2021
[40]

Phowhisper: Automatic speech recognition for vietnamese,

T.-T. Le, L. T. Nguyen, and D. Q. Nguyen, “Phowhisper: Automatic speech recognition for vietnamese,” inThe Second Tiny Papers Track at ICLR 2024, 2024. [Online]. Available: https://openreview.net/forum?id=x3c3MkJfpG

2024
[41]

Vietnamese end-to-end speech recognition using wav2vec 2.0,

T. B. Nguyen, “Vietnamese end-to-end speech recognition using wav2vec 2.0,” 09 2021. [Online]. Available: https: //github.com/vietai/ASR

2021
[42]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inProceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–2...

2023
[43]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,”
[44]

Available: https://arxiv.org/abs/2006.11477

[Online]. Available: https://arxiv.org/abs/2006.11477

arXiv 2006
[45]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,

Gemini Team, Google, “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” 2025. [Online]. Available: https://arxiv.org/abs/2507.06261

Pith/arXiv arXiv 2025
[46]

A Vietnamese-English Neural Machine Trans- lation System,

Tuan-Duy H. Nguyen, Duy Phung, Duy Tran-Cong Nguyen, Hieu Minh Tran, Manh Luong, Tin Duy Vo, Hung Hai Bui, Dinh Phung, Dat Quoc Nguyen, “A Vietnamese-English Neural Machine Trans- lation System,” inInterspeech 2022, 2022, pp. 5543–5544

2022
[47]

Multilingual denoising pre-training for neural machine translation,

Y . Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer, “Multilingual denoising pre-training for neural machine translation,”Transactions of the Association for Computational Linguistics, vol. 8, pp. 726–742, 2020. [Online]. Available: https://aclanthology.org/2020.tacl-1.47/

2020
[48]

2022 , publisher =

R. Speer, “rspeer/wordfreq: v3.0,” Sep. 2022. [Online]. Available: https://doi.org/10.5281/zenodo.7199437

work page doi:10.5281/zenodo.7199437 2022
[49]

ByT5 model for massively multilingual grapheme-to-phoneme conversion,

Jian Zhu, Cong Zhang, David Jurgens, “ByT5 model for massively multilingual grapheme-to-phoneme conversion,” inInterspeech 2022, 2022, pp. 446–450

2022
[50]

Billion-scale similarity search with GPUs,

J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,”IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2019

2019
[51]

The llama 3 herd of models,

Llama Team, AI @ Meta, “The llama 3 herd of models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

Pith/arXiv arXiv 2024
[52]

Mistral 7b,

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P . Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,”
[53]

Available: https://arxiv.org/abs/2310.06825

[Online]. Available: https://arxiv.org/abs/2310.06825

Pith/arXiv arXiv
[54]

Llama-SEA-LION-v3.5-8B-R,

SEA-LION Team, “Llama-SEA-LION-v3.5-8B-R,” https:// huggingface.co/aisingapore/Llama-SEA-LION-v3.5-8B-R, 2024, Hugging Face model release

2024
[55]

Vistral- 7B-Chat - Towards a State-of-the-Art Large Language Model for Vietnamese,

C. V . Nguyen, T. Nguyen, Q. Nguyen, H. Nguyen, B. Pl ¨uster, N. Pham, H. Nguyen, P . Schramowski, and T. Nguyen, “Vistral- 7B-Chat - Towards a State-of-the-Art Large Language Model for Vietnamese,” 2023

2023
[56]

Bleu: a method for automatic evaluation of machine translation,

K. Papineni, S. Roukos, T. Ward, and W .-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” inProceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, Jul. 2002, pp. 311–318. [Online]. Available: https://aclanthology.org/P02-1040/

2002
[57]

COMET- 22: Unbabel-IST 2022 Submission for the Metrics Shared Task,

R. Rei, J. G. C. de Souza, D. Alves, C. Zerva, A. C. Farinha, T. Glushkova, A. Lavie, L. Coheur, and A. F. T. Martins, “COMET- 22: Unbabel-IST 2022 Submission for the Metrics Shared Task,” inProceedings of the Seventh Conference on Machine Translation (WMT). Abu Dhabi, United Arab Emirates (Hybrid): Association for Computational Linguistics, Dec. 2022, pp...

2022

[1] [1]

Introduction Speech translation (ST) converts spoken language directly into text in another language. T wo paradigms dominate: end-to-end ST (E2E ST), which maps audio directly to translated text, and cascaded ST, which combines an Automatic Speech Recogni- tion (ASR) system with a Neural Machine Translation (NMT) model. Despite progress in E2E ST, recent...

[2] [2]

This lack of high-quality datasets motivates our use of cascaded ST where ASR [13, 14, 15] and NMT [ 16, 17, 18] modules can be tuned separately with more abundant resources

Related Work Vietnamese Speech Translation.Vietnamese ST remains un- derrepresented in the literature, with only two data sources pro- viding (audio, transcripts, translation) triplets: the Vietnamese split of multilingual datasets MultiMed-ST [ 4] (16 hours, 9.1k arXiv:2606.12911v1 [cs.CL] 11 Jun 2026 samples) and FLEURS [ 10] (12 hours, 4.2k samples). T...

Pith/arXiv arXiv 2026

[3] [3]

[5] show that NMT robustness improves through synthetic noise augmentation when noise type and amount are carefully calibrated

fine-tune the NMT component on paired clean and noisy tran- scripts to improve cascaded ST. [5] show that NMT robustness improves through synthetic noise augmentation when noise type and amount are carefully calibrated. [ 7] use adversarial train- ing to improve robustness. Recently, MEDSAGE [ 9] extracts error statistics from real ASR outputs and uses a ...

[4] [4]

We instead adopt the pretrained XPhoneBERT em- beddings to model phonetic similarity for data augmentation

to provide contextualized phoneme representations for Text- to-Speech. We instead adopt the pretrained XPhoneBERT em- beddings to model phonetic similarity for data augmentation

[5] [5]

ASR Error Analysis 3.1. Error Extraction and Alignment We use PhoWhisper-large ( PhoWhisper -large) [ 24] and wav2vec2-base ( wav2vec2-base-vietnamese-250h) [25], Vietnamese ASR models based on Whisper [ 26] and Wav2vec 2.0 [27], to transcribe the Vietnamese training split of FLEURS Vietnamese-English (3k samples). ASR outputs are word-aligned with refere...

[6] [6]

Precomputation Phase

Phonetically-Informed Data Augmentation We propose thePhonetically-InformedDataAugmentation (PiDA) pipeline, consisting of six components across two phases: 4.1. Precomputation Phase

[7] [7]

We retrieve the top 50,000 most frequent Vietnamese words and fil- ter to entries matching valid Vietnamese orthographic patterns, yielding approximately 9,400 unique syllables

Syllable Inventory Construction.We extract the Vietnamese syllable inventory using the wordfreq library [31], which pro- vides frequency statistics derived from large web corpora 2. We retrieve the top 50,000 most frequent Vietnamese words and fil- ter to entries matching valid Vietnamese orthographic patterns, yielding approximately 9,400 unique syllables

[8] [8]

Phoneme Conversion.Each syllable is converted to International Phonetic Alphabet (IPA) using CharsiuG2P [ 32], the grapheme-to-phoneme system used by XPhoneBERT

[9] [9]

1No erroris excluded because its coefficient is statistically insignifi- cant in a preliminary model, and it does not represent semantic errors

Embedding Extraction.We pass each phoneme se- quence through XPhoneBERT (xphonebert-base) [22] and mean-pool the output hidden states across all phoneme posi- tions (excluding the [CLS] and [SEP] tokens) to obtain a 768-dimensional vector per syllable. 1No erroris excluded because its coefficient is statistically insignifi- cant in a preliminary model, an...

[10] [10]

For each syllable, we precompute its top-50 phonetically similar neighbors and their cosine similarity scores

Similarity Index Construction.We L2-normalize all syllable embeddings and build a FAISS index [ 33] using inner product search for efficient approximate nearest-neighbor re- trieval. For each syllable, we precompute its top-50 phonetically similar neighbors and their cosine similarity scores. 4.2. Augmentation Phase

[11] [11]

For each word wi, we sample from a Bernoulli distribution with probability p=WER train (the observed word error rate on the training set)

Error Annotation.We follow the error annotation procedure of [ 9]: Given an input sentence to corrupt, we annotate indi- vidual words with error markers based on ASR error statistics computed from the training set. For each word wi, we sample from a Bernoulli distribution with probability p=WER train (the observed word error rate on the training set). If ...

[12] [12]

For deletion markers, the word is simply removed

Corruption.We then process the annotated text to gen- erate the corrupted output. For deletion markers, the word is simply removed. For substitution markers, we perform phonetic corruption with the following procedure. For each syllable si marked for substitution, we sample a replacement from its top- k (k= 5 ) precomputed phonetic neighbors using tempera...

[13] [13]

Experimental Setup Data.We use PiDA to corrupt the training split of FLEURS, then evaluate the fine-tuned models on 0.9k test samples

Experiments 5.1. Experimental Setup Data.We use PiDA to corrupt the training split of FLEURS, then evaluate the fine-tuned models on 0.9k test samples. Models.We use PhoWhisper-large and wav2vec2-base as the ASR systems, and VinAI-Translate as the NMT. Training protocol.We fine-tune VinAI-Translate with: 3 epochs, batch size 8, learning rate 3×10 −5, maxi...

[14] [14]

This motivatedPhonetically-InformedDataAugmentation using XPhoneBERT embeddings (PiDA), which generates syn- thetic errors by substituting phonetically similar syllables

Conclusion & Future Work We presented the first systematic categorization of ASR errors for Vietnamese ST, showing that most substitution errors stem from structured phonetic confusions, and that these errors sub- stantially affect downstream NMT performance. This motivatedPhonetically-InformedDataAugmentation using XPhoneBERT embeddings (PiDA), which gen...

[15] [15]

This research was funded under Project ID VUNI.2324.CC06

Acknowledgments The research results are a part of the outputs of the Cross-College projectRobust Vietnamese–English Clinical and Educational Medical Translation, a collaboration between the College of Engineering & Computer Science (CECS) and the College of Health Sciences (CHS), VinUniversity. This research was funded under Project ID VUNI.2324.CC06. Gi...

[16] [16]

No generative AI tools were used to produce any scientific content, experimental results, data analysis, or conclusions

Generative AI Use Disclosure The authors used generative AI tools solely to assist with minor language editing and readability improvements. No generative AI tools were used to produce any scientific content, experimental results, data analysis, or conclusions

[17] [17]

NICT’s Cascaded and End-To-End Speech Translation Systems using Whisper and IndicTrans2 for the Indic Task,

R. Dabre and H. Song, “NICT’s Cascaded and End-To-End Speech Translation Systems using Whisper and IndicTrans2 for the Indic Task,” inProceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), E. Salesky, M. Federico, and M. Carpuat, Eds. Bangkok, Thailand (in-person and online): Association for Computational Linguistics...

2024

[18] [18]

CMU’s IWSLT 2024 offline speech translation system: A cascaded approach for long-form robustness,

B. Y an, P . Fernandes, J. Tian, S. Ouyang, W . Chen, K. Livescu, L. Li, G. Neubig, and S. Watanabe, “CMU’s IWSLT 2024 offline speech translation system: A cascaded approach for long-form robustness,” inProceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), E. Salesky, M. Federico, and M. Carpuat, Eds. Bangkok, Thail...

2024

[19] [19]

End-to-End Speech Translation Guided by Robust Translation Capability of Large Language Model,

Y . Higuchi, T. Ogawa, and T. Kobayashi, “End-to-End Speech Translation Guided by Robust Translation Capability of Large Language Model,” inInterspeech 2025, 2025, pp. 21–25

2025

[20] [20]

MultiMed-ST: Large-scale Many- to-many Multilingual Medical Speech Translation,

K. Le-Duc, T. Tran, B. P . Tat, N. K. H. Bui, Q. D. Anh, H.-P . Tran, T. T. Nguyen, L. Nguyen, T. M. Phan, T. T. P . Tran, C. Ngo, K. X. Nguyen, and T. Nguyen-Tang, “MultiMed-ST: Large-scale Many- to-many Multilingual Medical Speech Translation,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Assoc...

2025

[21] [21]

Toward Robust Neural Machine Translation for Noisy Input Sequences,

M. Sperber, J. Niehues, and A. Waibel, “Toward Robust Neural Machine Translation for Noisy Input Sequences,” inProceedings of the 14th International Conference on Spoken Language Translation, S. Sakti and M. Utiyama, Eds. Tokyo, Japan: International Workshop on Spoken Language Translation, Dec. 14-15 2017, pp. 90–96. [Online]. Available: https://aclanthol...

2017

[22] [22]

Synthetic and Natural Noise Both Break Neural Machine Translation,

Y . Belinkov and Y . Bisk, “Synthetic and Natural Noise Both Break Neural Machine Translation,” inInternational Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=BJ8vJebC-

2018

[23] [23]

Towards Robust Neural Machine Translation,

Y . Cheng, Z. Tu, F. Meng, J. Zhai, and Y . Liu, “Towards Robust Neural Machine Translation,” inProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), I. Gurevych and Y . Miyao, Eds. Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 1756–1766. [Online]. Available: http...

2018

[24] [24]

Robust Neural Machine Translation for Clean and Noisy Speech Transcripts,

M. Di Gangi, R. Enyedi, A. Brusadin, and M. Federico, “Robust Neural Machine Translation for Clean and Noisy Speech Transcripts,” inProceedings of the 16th International Conference on Spoken Language Translation, J. Niehues, R. Cattoni, S. St ¨uker, M. Negri, M. T urchi, T.-L. Ha, E. Salesky, R. Sanabria, L. Barrault, L. Specia, and M. Federico, Eds. Hong...

2019

[25] [25]

MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues,

K. Binici, A. R. Kashyap, V . Schlegel, A. T. Liu, V . P . Dwivedi, T.-T. Nguyen, X. Gao, N. F. Chen, and S. Winkler, “MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues,” in AI4X 2025 International Conference, 2025. [Online]. Available: https://openreview.net/forum?id=rWOaUq6UBS

2025

[26] [26]

FLEURS: FEW-Shot Learning Evaluation of Universal Representations of Speech,

A. Conneau, M. Ma, S. Khanuja, Y . Zhang, V . Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “FLEURS: FEW-Shot Learning Evaluation of Universal Representations of Speech,” in2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 798–805

2023

[27] [27]

PWESuite: Phonetic word embeddings and tasks they facilitate,

V . Zouhar, K. Chang, C. Cui, N. B. Carlson, N. R. Robinson, M. Sachan, and D. R. Mortensen, “PWESuite: Phonetic word embeddings and tasks they facilitate,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.- Y . Kan, V . Hoste, A. Lenci, S. Sakti, ...

2024

[28] [28]

PSET: a phonetics-semantics evaluation testbed,

G. Sperduti and D. Nguyen, “PSET: a phonetics-semantics evaluation testbed,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 7346–7356. [Online]. Available: https: //aclanthol...

2025

[29] [29]

ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark,

T. X. Nguyen, N. Vo, G.-S. Nguyen, D. M. Hoang, C. D. Huynh, I. J. Unanue, M. Piccardi, W. Buntine, and D. D. Le, “ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark,” 2026. [Online]. Available: https://arxiv.org/abs/2602.12911

Pith/arXiv arXiv 2026

[30] [30]

Common Voice: A Massively-Multilingual Speech Corpus,

R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. T yers, and G. Weber, “Common Voice: A Massively-Multilingual Speech Corpus,” in Proceedings of the Twelfth Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, May 2020, pp. 4218–4222. [Online]. Available: htt...

2020

[31] [31]

A non-expert Kaldi recipe for Vietnamese Speech Recognition System,

H.-T. Luong and H.-Q. Vu, “A non-expert Kaldi recipe for Vietnamese Speech Recognition System,” inProceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI/OIAF4HLT2016). Osaka, Japan: The COLING 2016 Organizing Comm...

2016

[32] [32]

Improving Vietnamese-English Medical Machine Translation,

N. Vo, D. Q. Nguyen, D. D. Le, M. Piccardi, and W. Buntine, “Improving Vietnamese-English Medical Machine Translation,” inProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.- Y . Kan, V . Hoste, A. Lenci, S. Sakti, and N. Xue, Eds. Torino, Italia: ELRA ...

2024

[33] [33]

MTet: Multi-domain Translation for English and Vietnamese,

C. Ngo, T. H. Trinh, L. Phan, H. Tran, T. Dang, H. Nguyen, M. Nguyen, and M.-T. Luong, “MTet: Multi-domain Translation for English and Vietnamese,” 2022. [Online]. Available: https://arxiv.org/abs/2210.05610

arXiv 2022

[34] [34]

PhoMT: A high-quality and large-scale benchmark dataset for Vietnamese-English machine translation,

L. Doan, L. T. Nguyen, N. L. Tran, T. Hoang, and D. Q. Nguyen, “PhoMT: A high-quality and large-scale benchmark dataset for Vietnamese-English machine translation,” inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 4...

2021

[35] [35]

Assessing the impact of speech recognition errors on machine translation quality,

N. Ruiz and M. Federico, “Assessing the impact of speech recognition errors on machine translation quality,” inProceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track, Y . Al-Onaizan and M. Simard, Eds. Vancouver, Canada: Association for Machine Translation in the Americas, Oct. 22-26 2014, pp. 2...

2014

[36] [36]

PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors,

D. R. Mortensen, P . Littell, A. Bharadwaj, K. Goyal, C. Dyer, and L. Levin, “PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors,” inProceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Y . Matsumoto and R. Prasad, Eds. Osaka, Japan: The COLING 2016 Organizing Committee, De...

2016

[37] [37]

2020 , isbn =

A. Fang, S. Filice, N. Limsopatham, and O. Rokhlenko, “Using Phoneme Representations to Build Predictive Models Robust to ASR Errors,” inProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’20. New Y ork, NY , USA: Association for Computing Machinery, 2020, p. 699–708. [Online]. Avail...

work page doi:10.1145/3397271.3401050 2020

[38] [38]

XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech,

L. The Nguyen, T. Pham, and D. Q. Nguyen, “XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech,” inInterspeech 2023, 2023, pp. 5506–5510

2023

[39] [39]

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to- Speech,

J. Kim, J. Kong, and J. Son, “Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to- Speech,” inProceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR, 18–24 Jul 2021, pp. 5530–5540. [Online]. Available: https: //proceed...

2021

[40] [40]

Phowhisper: Automatic speech recognition for vietnamese,

T.-T. Le, L. T. Nguyen, and D. Q. Nguyen, “Phowhisper: Automatic speech recognition for vietnamese,” inThe Second Tiny Papers Track at ICLR 2024, 2024. [Online]. Available: https://openreview.net/forum?id=x3c3MkJfpG

2024

[41] [41]

Vietnamese end-to-end speech recognition using wav2vec 2.0,

T. B. Nguyen, “Vietnamese end-to-end speech recognition using wav2vec 2.0,” 09 2021. [Online]. Available: https: //github.com/vietai/ASR

2021

[42] [42]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inProceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–2...

2023

[43] [43]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,”

[44] [44]

Available: https://arxiv.org/abs/2006.11477

[Online]. Available: https://arxiv.org/abs/2006.11477

arXiv 2006

[45] [45]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,

Gemini Team, Google, “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” 2025. [Online]. Available: https://arxiv.org/abs/2507.06261

Pith/arXiv arXiv 2025

[46] [46]

A Vietnamese-English Neural Machine Trans- lation System,

Tuan-Duy H. Nguyen, Duy Phung, Duy Tran-Cong Nguyen, Hieu Minh Tran, Manh Luong, Tin Duy Vo, Hung Hai Bui, Dinh Phung, Dat Quoc Nguyen, “A Vietnamese-English Neural Machine Trans- lation System,” inInterspeech 2022, 2022, pp. 5543–5544

2022

[47] [47]

Multilingual denoising pre-training for neural machine translation,

Y . Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer, “Multilingual denoising pre-training for neural machine translation,”Transactions of the Association for Computational Linguistics, vol. 8, pp. 726–742, 2020. [Online]. Available: https://aclanthology.org/2020.tacl-1.47/

2020

[48] [48]

2022 , publisher =

R. Speer, “rspeer/wordfreq: v3.0,” Sep. 2022. [Online]. Available: https://doi.org/10.5281/zenodo.7199437

work page doi:10.5281/zenodo.7199437 2022

[49] [49]

ByT5 model for massively multilingual grapheme-to-phoneme conversion,

Jian Zhu, Cong Zhang, David Jurgens, “ByT5 model for massively multilingual grapheme-to-phoneme conversion,” inInterspeech 2022, 2022, pp. 446–450

2022

[50] [50]

Billion-scale similarity search with GPUs,

J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,”IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2019

2019

[51] [51]

The llama 3 herd of models,

Llama Team, AI @ Meta, “The llama 3 herd of models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

Pith/arXiv arXiv 2024

[52] [52]

Mistral 7b,

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P . Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,”

[53] [53]

Available: https://arxiv.org/abs/2310.06825

[Online]. Available: https://arxiv.org/abs/2310.06825

Pith/arXiv arXiv

[54] [54]

Llama-SEA-LION-v3.5-8B-R,

SEA-LION Team, “Llama-SEA-LION-v3.5-8B-R,” https:// huggingface.co/aisingapore/Llama-SEA-LION-v3.5-8B-R, 2024, Hugging Face model release

2024

[55] [55]

Vistral- 7B-Chat - Towards a State-of-the-Art Large Language Model for Vietnamese,

C. V . Nguyen, T. Nguyen, Q. Nguyen, H. Nguyen, B. Pl ¨uster, N. Pham, H. Nguyen, P . Schramowski, and T. Nguyen, “Vistral- 7B-Chat - Towards a State-of-the-Art Large Language Model for Vietnamese,” 2023

2023

[56] [56]

Bleu: a method for automatic evaluation of machine translation,

K. Papineni, S. Roukos, T. Ward, and W .-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” inProceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, Jul. 2002, pp. 311–318. [Online]. Available: https://aclanthology.org/P02-1040/

2002

[57] [57]

COMET- 22: Unbabel-IST 2022 Submission for the Metrics Shared Task,

R. Rei, J. G. C. de Souza, D. Alves, C. Zerva, A. C. Farinha, T. Glushkova, A. Lavie, L. Coheur, and A. F. T. Martins, “COMET- 22: Unbabel-IST 2022 Submission for the Metrics Shared Task,” inProceedings of the Seventh Conference on Machine Translation (WMT). Abu Dhabi, United Arab Emirates (Hybrid): Association for Computational Linguistics, Dec. 2022, pp...

2022