PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation
Pith reviewed 2026-06-27 06:45 UTC · model grok-4.3
The pith
Fine-tuning on phonetically corrupted data improves Vietnamese speech translation robustness to ASR errors by up to 2.04 BLEU.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Most ASR substitution errors in Vietnamese arise from phonetic confusions; generating similar corruptions via phonetic word embeddings for data augmentation and then fine-tuning NMT on the augmented set produces models that translate both erroneous ASR outputs and clean text more accurately than standard fine-tuning.
What carries the argument
Phonetically-Informed Data Augmentation (PiDA), which creates ASR-like training examples by substituting words with phonetically similar alternatives drawn from phonetic word embeddings.
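The substitution step can be pictured with a minimal sketch. The toy 3-dimensional vectors, the helper names, and the temperature-weighted sampling are illustrative assumptions: the paper excerpts further down mention 768-dimensional XPhoneBERT embeddings, a FAISS index, and top-k sampling, but the exact sampling formula is truncated in this page, and brute-force cosine search stands in for FAISS here.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_neighbors(embeddings, k):
    """Map each syllable to its k most similar other syllables, best first."""
    table = {}
    for syllable, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != syllable
        ]
        scored.sort(reverse=True)
        table[syllable] = [(other, sim) for sim, other in scored[:k]]
    return table


def sample_replacement(neighbors, temperature, rng):
    """Pick one neighbor, weighting by exp(similarity / temperature).

    The softmax-style weighting is an assumption, not the paper's stated rule.
    """
    candidates = [syllable for syllable, _ in neighbors]
    weights = [math.exp(sim / temperature) for _, sim in neighbors]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Made-up vectors for illustration only; real embeddings would come from a
# phonetic encoder.
TOY_EMBEDDINGS = {
    "ba": [1.0, 0.1, 0.0],
    "bà": [0.9, 0.2, 0.0],
    "pa": [0.8, 0.0, 0.3],
    "mưa": [0.0, 1.0, 0.2],
}

neighbors = top_k_neighbors(TOY_EMBEDDINGS, k=2)
replacement = sample_replacement(neighbors["ba"], temperature=0.1, rng=random.Random(0))
print(neighbors["ba"], replacement)
```

A lower temperature concentrates the samples on the nearest neighbor, while a higher one spreads them across the whole top-k set.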
If this is right
- Fine-tuning on the PiDA-augmented FLEURS set raises BLEU on erroneous ASR inputs by as much as 2.04 points over standard fine-tuning.
- The same augmented training also produces a small improvement on clean-text inputs.
- Phonetic confusions account for the majority of Vietnamese ASR substitution errors.
- These phonetic errors measurably degrade cascaded speech-translation quality according to linear mixed-effects models.
Where Pith is reading between the lines
- The same phonetic-substitution approach could be tested on other tonal or phonetically dense languages where ASR errors follow similar patterns.
- PiDA might be combined with other augmentation strategies such as noise injection or back-translation to compound robustness gains.
- The error categorization step could be reused to diagnose weaknesses in existing ASR systems for Vietnamese.
Load-bearing premise
Substitutions produced by phonetic word embeddings match the distribution and downstream effect of actual ASR errors closely enough to improve real-world translation performance.
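One way to probe this premise is to compare, for a given reference syllable, the distribution of substitutions the augmentation produces with the distribution of confusions an ASR system actually makes. The sketch below uses cosine similarity between the two distributions; the counts and function names are hypothetical, and nothing here reproduces an analysis from the paper.

```python
import math


def normalize(counts):
    """Turn raw substitution counts into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def cosine_between(p, q):
    """Cosine similarity of two sparse distributions keyed by target syllable."""
    keys = set(p) | set(q)
    dot = sum(p.get(key, 0.0) * q.get(key, 0.0) for key in keys)
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical counts for a single reference syllable: what the ASR system
# produced instead, and what the augmentation sampled instead.
asr_confusions = {"bà": 6, "pa": 3, "ma": 1}
augmented_substitutions = {"bà": 5, "pa": 4, "bá": 1}

score = cosine_between(normalize(asr_confusions), normalize(augmented_substitutions))
print(round(score, 3))
```

Averaging such a score over all syllables with enough observed confusions would give a single alignment figure; a value near that of random substitution would undercut the premise.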
What would settle it
An experiment in which fine-tuning on PiDA-augmented data yields no BLEU gain or a loss on held-out ASR outputs compared with standard fine-tuning on the same base corpus.
Original abstract
Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST, classifying substitution errors by phonetic cause and quantifying their impact on downstream Neural Machine Translation (NMT) performance using Linear Mixed-Effects Modelling. We confirm that most ASR substitution errors arise from phonetic confusions rather than random noise, and that these phonetic errors significantly degrade ST quality. Motivated by this finding, we propose Phonetically-Informed Data Augmentation (PiDA), which generates ASR-like corruptions by substituting words with phonetically similar alternatives using phonetic word embeddings. Fine-tuning on a PiDA-augmented version of FLEURS Vietnamese-English improves translation of erroneous ASR outputs (up to +2.04 BLEU over standard fine-tuning) while also slightly improving clean-text performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper categorizes ASR substitution errors in Vietnamese by phonetic cause on FLEURS data, quantifies their effect on downstream NMT with Linear Mixed-Effects Modelling, concludes that most errors are phonetic in origin and degrade translation quality, and proposes PiDA: data augmentation that substitutes words with phonetically similar alternatives drawn from phonetic word embeddings. Fine-tuning an NMT model on the PiDA-augmented FLEURS Vietnamese-English corpus yields up to +2.04 BLEU on ASR-error-containing inputs relative to standard fine-tuning, with a modest gain on clean text as well.
Significance. If the central empirical result holds after verification, the work supplies a concrete, language-specific robustness technique for cascaded speech translation that directly targets the dominant error type identified by the error categorization. The explicit use of phonetic embeddings to simulate substitutions, combined with the error categorization step, offers a reproducible template that could be tested on other tonal or phonetically complex languages; the modest clean-text improvement is an additional practical benefit.
major comments (3)
- [§3 and §4] §3 (error categorization) and §4 (PiDA construction): the claim that PiDA substitutions accurately simulate real ASR error distributions rests on the assertion that phonetic embeddings capture the observed confusion patterns, yet no quantitative alignment check (e.g., cosine similarity between embedding-derived substitution probabilities and the empirical confusion matrix, or separate reporting of tone/vowel error overlap) is provided. Without this, the +2.04 BLEU gain could arise from generic noise rather than targeted phonetic robustness.
- [§5] §5 (experiments): the reported BLEU improvements are presented without the number of random seeds, standard deviations, or statistical significance tests (paired bootstrap or approximate randomization), and without explicit controls that hold total training data size constant between the baseline and PiDA-augmented conditions. These omissions make it impossible to judge whether the gains are robust or merely an artifact of increased data volume.
- [Table 2 / §5.2] Table 2 / §5.2: the comparison is limited to standard fine-tuning; no ablation against other augmentation baselines (random word substitution, back-translation, or non-phonetic noise injection) is reported, so it remains unclear whether the phonetic embedding component is necessary for the observed robustness gain.
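The significance test named in the second comment can be sketched as follows. This is a simplified paired bootstrap under stated assumptions: it compares the mean of made-up sentence-level scores, whereas a BLEU-based test would recompute corpus-level BLEU on every resampled set. The function name and all numbers are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores system B.

    Both lists hold one score per test sentence, in the same sentence order,
    so each resample draws the same sentence indices for both systems.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        indices = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in indices) / n
        mean_b = sum(scores_b[i] for i in indices) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples


# Made-up sentence-level scores for two systems on eight test sentences.
augmented = [0.42, 0.38, 0.51, 0.47, 0.33, 0.45, 0.40, 0.49]
baseline = [0.40, 0.35, 0.50, 0.44, 0.34, 0.41, 0.39, 0.46]

win_rate = paired_bootstrap(augmented, baseline)
print(f"augmented system wins on {win_rate:.1%} of resamples")
```

A win rate above a chosen threshold such as 95% is the usual reading of "significant" in this setup; repeating the whole procedure per random seed would address the variance concern separately.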
minor comments (2)
- [Abstract] The abstract states a concrete BLEU number but the main text should cross-reference the exact table/row that produces the +2.04 figure.
- [§4] Notation for phonetic embeddings (e.g., how the similarity threshold or top-k selection is chosen) should be formalized in an equation or algorithm box for reproducibility.
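A formalization of the kind requested in the second minor comment might look like the following. The Bernoulli annotation with p equal to the training-set WER and the top-k neighbor set (k = 5) come from the paper excerpts quoted further down; the softmax form and the temperature symbol τ are assumptions, since the excerpt is truncated before the sampling rule is stated.

```latex
% Error annotation: word w_i is marked as an error with probability p.
z_i \sim \mathrm{Bernoulli}(p), \qquad p = \mathrm{WER}_{\text{train}}

% Substitution (assumed form): sample s' from the top-k neighbors N_k(s) of s,
% where e_s is the phonetic embedding of syllable s and \tau a temperature.
P(s' \mid s) =
  \frac{\exp\big(\cos(\mathbf{e}_s, \mathbf{e}_{s'}) / \tau\big)}
       {\sum_{u \in \mathcal{N}_k(s)} \exp\big(\cos(\mathbf{e}_s, \mathbf{e}_u) / \tau\big)},
  \qquad s' \in \mathcal{N}_k(s)
```

Under this reading, τ → 0 always selects the nearest neighbor and large τ approaches uniform sampling over the k neighbors.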
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the empirical claims regarding PiDA's alignment with ASR errors, statistical robustness, and comparative evaluation. We address each point below and commit to revisions where appropriate.
- Referee: [§3 and §4] §3 (error categorization) and §4 (PiDA construction): the claim that PiDA substitutions accurately simulate real ASR error distributions rests on the assertion that phonetic embeddings capture the observed confusion patterns, yet no quantitative alignment check (e.g., cosine similarity between embedding-derived substitution probabilities and the empirical confusion matrix, or separate reporting of tone/vowel error overlap) is provided. Without this, the +2.04 BLEU gain could arise from generic noise rather than targeted phonetic robustness.
  Authors: We agree that an explicit quantitative alignment between the phonetic embedding substitutions and the empirical ASR confusion matrix would provide stronger support for the targeted nature of PiDA. In the revised manuscript we will add this analysis, including cosine similarity between embedding-derived substitution probabilities and the observed confusion patterns from the FLEURS ASR outputs, as well as separate reporting of tone versus vowel error overlap. This will help demonstrate that the gains derive from phonetic robustness rather than generic noise. The existing error categorization already establishes the phonetic origin of most substitutions, and the LME analysis establishes their downstream cost, but the additional check will directly address the concern.
  Revision: yes
- Referee: [§5] §5 (experiments): the reported BLEU improvements are presented without the number of random seeds, standard deviations, or statistical significance tests (paired bootstrap or approximate randomization), and without explicit controls that hold total training data size constant between the baseline and PiDA-augmented conditions. These omissions make it impossible to judge whether the gains are robust or merely an artifact of increased data volume.
  Authors: We acknowledge these omissions weaken the ability to assess robustness. In the revision we will report results across multiple random seeds with standard deviations and include statistical significance testing via paired bootstrap resampling. For data volume, we will clarify the exact construction of the PiDA-augmented corpus and add an explicit control experiment that matches total training tokens between conditions (e.g., by applying random substitutions to reach equivalent size). If the original experiments did not hold size constant, this control will be newly run and reported.
  Revision: yes
- Referee: [Table 2 / §5.2] Table 2 / §5.2: the comparison is limited to standard fine-tuning; no ablation against other augmentation baselines (random word substitution, back-translation, or non-phonetic noise injection) is reported, so it remains unclear whether the phonetic embedding component is necessary for the observed robustness gain.
  Authors: We agree that ablations against non-phonetic augmentation methods are necessary to isolate the contribution of the phonetic embeddings. In the revised manuscript we will add these baselines, including random word substitution at the same rate as PiDA and a non-phonetic noise injection method, evaluated on both clean and ASR-error inputs. This will allow direct comparison to confirm that the phonetic component drives the robustness improvement.
  Revision: yes
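The random-substitution baseline and the size-matched control mentioned in the responses could be built along these lines. The sketch assumes whitespace-tokenized sentences and a hypothetical substitution rate; because it only replaces tokens and never inserts or deletes them, the control corpus has exactly as many sentences and tokens as the clean corpus.

```python
import random


def random_substitute(tokens, vocab, rate, rng):
    """Replace each token with a uniformly drawn vocabulary item with probability `rate`."""
    corrupted = []
    for token in tokens:
        if rng.random() < rate:
            # The draw may return the original token; a stricter variant
            # would resample until the replacement differs.
            corrupted.append(rng.choice(vocab))
        else:
            corrupted.append(token)
    return corrupted


def build_control(corpus, vocab, rate, seed=0):
    """Corrupt every sentence independently, keeping corpus size unchanged."""
    rng = random.Random(seed)
    return [random_substitute(sentence, vocab, rate, rng) for sentence in corpus]


def count_tokens(corpus):
    return sum(len(sentence) for sentence in corpus)


# Toy corpus; the 0.2 rate is a placeholder for the rate used by the
# phonetic augmentation being compared against.
corpus = [["tôi", "đi", "học"], ["trời", "mưa", "to"]]
vocab = sorted({token for sentence in corpus for token in sentence})
control = build_control(corpus, vocab, rate=0.2)

print(control, count_tokens(control) == count_tokens(corpus))
```

Training one model on this control and one on the phonetic augmentation, with everything else fixed, separates the effect of phonetic targeting from the effect of seeing noisy input at all.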
Circularity Check
No circularity; empirical augmentation result is measured on held-out data
The paper's central claim is an empirical performance gain (+2.04 BLEU) obtained by fine-tuning an NMT model on PiDA-augmented data and evaluating on ASR-error and clean test sets. No equations, fitted parameters, or self-citations are invoked to derive the improvement by construction; the result is presented as the measured outcome of the augmentation procedure applied to FLEURS. The categorization of ASR errors, the Linear Mixed-Effects Modelling of their impact, and the use of phonetic embeddings are independent methodological steps whose validity is assessed externally by downstream translation metrics, not by internal reduction to the input data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Most ASR substitution errors in Vietnamese arise from phonetic confusions rather than random noise.
Reference graph
Works this paper leans on
-
[1]
Introduction. Speech translation (ST) converts spoken language directly into text in another language. Two paradigms dominate: end-to-end ST (E2E ST), which maps audio directly to translated text, and cascaded ST, which combines an Automatic Speech Recognition (ASR) system with a Neural Machine Translation (NMT) model. Despite progress in E2E ST, recent...
-
[2]
Related Work. Vietnamese Speech Translation. Vietnamese ST remains underrepresented in the literature, with only two data sources providing (audio, transcripts, translation) triplets: the Vietnamese split of multilingual datasets MultiMed-ST [4] (16 hours, 9.1k samples) and FLEURS [10] (12 hours, 4.2k samples). T... [page stamp: arXiv:2606.12911v1 [cs.CL] 11 Jun 2026]
Pith/arXiv arXiv 2026
-
[3]
[5] show that NMT robustness improves through synthetic noise augmentation when noise type and amount are carefully calibrated
fine-tune the NMT component on paired clean and noisy transcripts to improve cascaded ST. [5] show that NMT robustness improves through synthetic noise augmentation when noise type and amount are carefully calibrated. [7] use adversarial training to improve robustness. Recently, MEDSAGE [9] extracts error statistics from real ASR outputs and uses a ...
-
[4]
We instead adopt the pretrained XPhoneBERT embeddings to model phonetic similarity for data augmentation
to provide contextualized phoneme representations for Text-to-Speech. We instead adopt the pretrained XPhoneBERT embeddings to model phonetic similarity for data augmentation
-
[5]
ASR Error Analysis. 3.1. Error Extraction and Alignment. We use PhoWhisper-large (PhoWhisper-large) [24] and wav2vec2-base (wav2vec2-base-vietnamese-250h) [25], Vietnamese ASR models based on Whisper [26] and Wav2vec 2.0 [27], to transcribe the Vietnamese training split of FLEURS Vietnamese-English (3k samples). ASR outputs are word-aligned with refere...
-
[6]
Precomputation Phase
Phonetically-Informed Data Augmentation. We propose the Phonetically-Informed Data Augmentation (PiDA) pipeline, consisting of six components across two phases. 4.1. Precomputation Phase
-
[7]
We retrieve the top 50,000 most frequent Vietnamese words and filter to entries matching valid Vietnamese orthographic patterns, yielding approximately 9,400 unique syllables
Syllable Inventory Construction. We extract the Vietnamese syllable inventory using the wordfreq library [31], which provides frequency statistics derived from large web corpora (footnote 2). We retrieve the top 50,000 most frequent Vietnamese words and filter to entries matching valid Vietnamese orthographic patterns, yielding approximately 9,400 unique syllables
-
[8]
Phoneme Conversion. Each syllable is converted to International Phonetic Alphabet (IPA) using CharsiuG2P [32], the grapheme-to-phoneme system used by XPhoneBERT
-
[9]
Footnote 1: No error is excluded because its coefficient is statistically insignificant in a preliminary model, and it does not represent semantic errors
Embedding Extraction. We pass each phoneme sequence through XPhoneBERT (xphonebert-base) [22] and mean-pool the output hidden states across all phoneme positions (excluding the [CLS] and [SEP] tokens) to obtain a 768-dimensional vector per syllable. Footnote 1: No error is excluded because its coefficient is statistically insignificant in a preliminary model, an...
-
[10]
For each syllable, we precompute its top-50 phonetically similar neighbors and their cosine similarity scores
Similarity Index Construction. We L2-normalize all syllable embeddings and build a FAISS index [33] using inner product search for efficient approximate nearest-neighbor retrieval. For each syllable, we precompute its top-50 phonetically similar neighbors and their cosine similarity scores. 4.2. Augmentation Phase
-
[11]
For each word w_i, we sample from a Bernoulli distribution with probability p = WER_train (the observed word error rate on the training set)
Error Annotation. We follow the error annotation procedure of [9]: Given an input sentence to corrupt, we annotate individual words with error markers based on ASR error statistics computed from the training set. For each word w_i, we sample from a Bernoulli distribution with probability p = WER_train (the observed word error rate on the training set). If ...
-
[12]
For deletion markers, the word is simply removed
Corruption. We then process the annotated text to generate the corrupted output. For deletion markers, the word is simply removed. For substitution markers, we perform phonetic corruption with the following procedure. For each syllable s_i marked for substitution, we sample a replacement from its top-k (k = 5) precomputed phonetic neighbors using tempera...
-
[13]
Experimental Setup. Data. We use PiDA to corrupt the training split of FLEURS, then evaluate the fine-tuned models on 0.9k test samples
Experiments. 5.1. Experimental Setup. Data. We use PiDA to corrupt the training split of FLEURS, then evaluate the fine-tuned models on 0.9k test samples. Models. We use PhoWhisper-large and wav2vec2-base as the ASR systems, and VinAI-Translate as the NMT. Training protocol. We fine-tune VinAI-Translate with: 3 epochs, batch size 8, learning rate 3×10^-5, maxi...
-
[14]
This motivated Phonetically-Informed Data Augmentation using XPhoneBERT embeddings (PiDA), which generates synthetic errors by substituting phonetically similar syllables
Conclusion & Future Work. We presented the first systematic categorization of ASR errors for Vietnamese ST, showing that most substitution errors stem from structured phonetic confusions, and that these errors substantially affect downstream NMT performance. This motivated Phonetically-Informed Data Augmentation using XPhoneBERT embeddings (PiDA), which gen...
-
[15]
This research was funded under Project ID VUNI.2324.CC06
Acknowledgments. The research results are a part of the outputs of the Cross-College project Robust Vietnamese–English Clinical and Educational Medical Translation, a collaboration between the College of Engineering & Computer Science (CECS) and the College of Health Sciences (CHS), VinUniversity. This research was funded under Project ID VUNI.2324.CC06. Gi...
-
[16]
No generative AI tools were used to produce any scientific content, experimental results, data analysis, or conclusions
Generative AI Use Disclosure. The authors used generative AI tools solely to assist with minor language editing and readability improvements. No generative AI tools were used to produce any scientific content, experimental results, data analysis, or conclusions
-
[17]
NICT’s Cascaded and End-To-End Speech Translation Systems using Whisper and IndicTrans2 for the Indic Task,
R. Dabre and H. Song, “NICT’s Cascaded and End-To-End Speech Translation Systems using Whisper and IndicTrans2 for the Indic Task,” in Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), E. Salesky, M. Federico, and M. Carpuat, Eds. Bangkok, Thailand (in-person and online): Association for Computational Linguistics...
2024
-
[18]
CMU’s IWSLT 2024 offline speech translation system: A cascaded approach for long-form robustness,
B. Yan, P. Fernandes, J. Tian, S. Ouyang, W. Chen, K. Livescu, L. Li, G. Neubig, and S. Watanabe, “CMU’s IWSLT 2024 offline speech translation system: A cascaded approach for long-form robustness,” in Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), E. Salesky, M. Federico, and M. Carpuat, Eds. Bangkok, Thail...
2024
-
[19]
End-to-End Speech Translation Guided by Robust Translation Capability of Large Language Model,
Y. Higuchi, T. Ogawa, and T. Kobayashi, “End-to-End Speech Translation Guided by Robust Translation Capability of Large Language Model,” in Interspeech 2025, 2025, pp. 21–25
2025
-
[20]
MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation,
K. Le-Duc, T. Tran, B. P. Tat, N. K. H. Bui, Q. D. Anh, H.-P. Tran, T. T. Nguyen, L. Nguyen, T. M. Phan, T. T. P. Tran, C. Ngo, K. X. Nguyen, and T. Nguyen-Tang, “MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Assoc...
2025
-
[21]
Toward Robust Neural Machine Translation for Noisy Input Sequences,
M. Sperber, J. Niehues, and A. Waibel, “Toward Robust Neural Machine Translation for Noisy Input Sequences,” in Proceedings of the 14th International Conference on Spoken Language Translation, S. Sakti and M. Utiyama, Eds. Tokyo, Japan: International Workshop on Spoken Language Translation, Dec. 14-15 2017, pp. 90–96. [Online]. Available: https://aclanthol...
2017
-
[22]
Synthetic and Natural Noise Both Break Neural Machine Translation,
Y. Belinkov and Y. Bisk, “Synthetic and Natural Noise Both Break Neural Machine Translation,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=BJ8vJebC-
2018
-
[23]
Towards Robust Neural Machine Translation,
Y. Cheng, Z. Tu, F. Meng, J. Zhai, and Y. Liu, “Towards Robust Neural Machine Translation,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), I. Gurevych and Y. Miyao, Eds. Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 1756–1766. [Online]. Available: http...
2018
-
[24]
Robust Neural Machine Translation for Clean and Noisy Speech Transcripts,
M. Di Gangi, R. Enyedi, A. Brusadin, and M. Federico, “Robust Neural Machine Translation for Clean and Noisy Speech Transcripts,” in Proceedings of the 16th International Conference on Spoken Language Translation, J. Niehues, R. Cattoni, S. Stüker, M. Negri, M. Turchi, T.-L. Ha, E. Salesky, R. Sanabria, L. Barrault, L. Specia, and M. Federico, Eds. Hong...
2019
-
[25]
MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues,
K. Binici, A. R. Kashyap, V. Schlegel, A. T. Liu, V. P. Dwivedi, T.-T. Nguyen, X. Gao, N. F. Chen, and S. Winkler, “MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues,” in AI4X 2025 International Conference, 2025. [Online]. Available: https://openreview.net/forum?id=rWOaUq6UBS
2025
-
[26]
FLEURS: FEW-Shot Learning Evaluation of Universal Representations of Speech,
A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “FLEURS: FEW-Shot Learning Evaluation of Universal Representations of Speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 798–805
2023
-
[27]
PWESuite: Phonetic word embeddings and tasks they facilitate,
V. Zouhar, K. Chang, C. Cui, N. B. Carlson, N. R. Robinson, M. Sachan, and D. R. Mortensen, “PWESuite: Phonetic word embeddings and tasks they facilitate,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, ...
2024
-
[28]
PSET: a phonetics-semantics evaluation testbed,
G. Sperduti and D. Nguyen, “PSET: a phonetics-semantics evaluation testbed,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 7346–7356. [Online]. Available: https://aclanthol...
2025
-
[29]
ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark,
T. X. Nguyen, N. Vo, G.-S. Nguyen, D. M. Hoang, C. D. Huynh, I. J. Unanue, M. Piccardi, W. Buntine, and D. D. Le, “ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark,” 2026. [Online]. Available: https://arxiv.org/abs/2602.12911
Pith/arXiv arXiv 2026
-
[30]
Common Voice: A Massively-Multilingual Speech Corpus,
R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common Voice: A Massively-Multilingual Speech Corpus,” in Proceedings of the Twelfth Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, May 2020, pp. 4218–4222. [Online]. Available: htt...
2020
-
[31]
A non-expert Kaldi recipe for Vietnamese Speech Recognition System,
H.-T. Luong and H.-Q. Vu, “A non-expert Kaldi recipe for Vietnamese Speech Recognition System,” in Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI/OIAF4HLT2016). Osaka, Japan: The COLING 2016 Organizing Comm...
2016
-
[32]
Improving Vietnamese-English Medical Machine Translation,
N. Vo, D. Q. Nguyen, D. D. Le, M. Piccardi, and W. Buntine, “Improving Vietnamese-English Medical Machine Translation,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue, Eds. Torino, Italia: ELRA ...
2024
-
[33]
MTet: Multi-domain Translation for English and Vietnamese,
C. Ngo, T. H. Trinh, L. Phan, H. Tran, T. Dang, H. Nguyen, M. Nguyen, and M.-T. Luong, “MTet: Multi-domain Translation for English and Vietnamese,” 2022. [Online]. Available: https://arxiv.org/abs/2210.05610
arXiv 2022
-
[34]
PhoMT: A high-quality and large-scale benchmark dataset for Vietnamese-English machine translation,
L. Doan, L. T. Nguyen, N. L. Tran, T. Hoang, and D. Q. Nguyen, “PhoMT: A high-quality and large-scale benchmark dataset for Vietnamese-English machine translation,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 4...
2021
-
[35]
Assessing the impact of speech recognition errors on machine translation quality,
N. Ruiz and M. Federico, “Assessing the impact of speech recognition errors on machine translation quality,” in Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track, Y. Al-Onaizan and M. Simard, Eds. Vancouver, Canada: Association for Machine Translation in the Americas, Oct. 22-26 2014, pp. 2...
2014
-
[36]
PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors,
D. R. Mortensen, P. Littell, A. Bharadwaj, K. Goyal, C. Dyer, and L. Levin, “PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors,” in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Y. Matsumoto and R. Prasad, Eds. Osaka, Japan: The COLING 2016 Organizing Committee, De...
2016
-
[37]
A. Fang, S. Filice, N. Limsopatham, and O. Rokhlenko, “Using Phoneme Representations to Build Predictive Models Robust to ASR Errors,” in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’20. New York, NY, USA: Association for Computing Machinery, 2020, p. 699–708. [Online]. Avail...
-
[38]
XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech,
L. The Nguyen, T. Pham, and D. Q. Nguyen, “XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech,” in Interspeech 2023, 2023, pp. 5506–5510
2023
-
[39]
Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech,
J. Kim, J. Kong, and J. Son, “Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR, 18–24 Jul 2021, pp. 5530–5540. [Online]. Available: https://proceed...
2021
-
[40]
Phowhisper: Automatic speech recognition for vietnamese,
T.-T. Le, L. T. Nguyen, and D. Q. Nguyen, “Phowhisper: Automatic speech recognition for vietnamese,” in The Second Tiny Papers Track at ICLR 2024, 2024. [Online]. Available: https://openreview.net/forum?id=x3c3MkJfpG
2024
-
[41]
Vietnamese end-to-end speech recognition using wav2vec 2.0,
T. B. Nguyen, “Vietnamese end-to-end speech recognition using wav2vec 2.0,” 09 2021. [Online]. Available: https://github.com/vietai/ASR
2021
-
[42]
Robust speech recognition via large-scale weak supervision,
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–2...
2023
-
[43]
wav2vec 2.0: A framework for self-supervised learning of speech representations,
A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” [Online]. Available: https://arxiv.org/abs/2006.11477
arXiv 2020
-
[45]
Gemini Team, Google, “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” 2025. [Online]. Available: https://arxiv.org/abs/2507.06261
Pith/arXiv arXiv 2025
-
[46]
A Vietnamese-English Neural Machine Translation System,
Tuan-Duy H. Nguyen, Duy Phung, Duy Tran-Cong Nguyen, Hieu Minh Tran, Manh Luong, Tin Duy Vo, Hung Hai Bui, Dinh Phung, Dat Quoc Nguyen, “A Vietnamese-English Neural Machine Translation System,” in Interspeech 2022, 2022, pp. 5543–5544
2022
-
[47]
Multilingual denoising pre-training for neural machine translation,
Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer, “Multilingual denoising pre-training for neural machine translation,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 726–742, 2020. [Online]. Available: https://aclanthology.org/2020.tacl-1.47/
2020
-
[48]
R. Speer, “rspeer/wordfreq: v3.0,” Sep. 2022. [Online]. Available: https://doi.org/10.5281/zenodo.7199437
-
[49]
ByT5 model for massively multilingual grapheme-to-phoneme conversion,
Jian Zhu, Cong Zhang, David Jurgens, “ByT5 model for massively multilingual grapheme-to-phoneme conversion,” in Interspeech 2022, 2022, pp. 446–450
2022
-
[50]
Billion-scale similarity search with GPUs,
J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,” IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2019
2019
-
[51]
Llama Team, AI @ Meta, “The llama 3 herd of models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.21783
Pith/arXiv arXiv 2024
-
[52]
Mistral 7b,
A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,” [Online]. Available: https://arxiv.org/abs/2310.06825
-
[54]
Llama-SEA-LION-v3.5-8B-R,
SEA-LION Team, “Llama-SEA-LION-v3.5-8B-R,” https://huggingface.co/aisingapore/Llama-SEA-LION-v3.5-8B-R, 2024, Hugging Face model release
2024
-
[55]
Vistral-7B-Chat - Towards a State-of-the-Art Large Language Model for Vietnamese,
C. V. Nguyen, T. Nguyen, Q. Nguyen, H. Nguyen, B. Plüster, N. Pham, H. Nguyen, P. Schramowski, and T. Nguyen, “Vistral-7B-Chat - Towards a State-of-the-Art Large Language Model for Vietnamese,” 2023
2023
-
[56]
Bleu: a method for automatic evaluation of machine translation,
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, Jul. 2002, pp. 311–318. [Online]. Available: https://aclanthology.org/P02-1040/
2002
-
[57]
COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task,
R. Rei, J. G. C. de Souza, D. Alves, C. Zerva, A. C. Farinha, T. Glushkova, A. Lavie, L. Coheur, and A. F. T. Martins, “COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task,” in Proceedings of the Seventh Conference on Machine Translation (WMT). Abu Dhabi, United Arab Emirates (Hybrid): Association for Computational Linguistics, Dec. 2022, pp...
2022