PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

Chaewoon Bang; Changi Hong; Dayeon Ku; Do Hyun Lee; Hong Kook Kim; Hwayoung Park; Yoonah Song

arXiv: 2604.09111 · v4 · submitted 2026-04-10 · 📡 eess.AS · cs.AI


Pith reviewed 2026-05-10 17:16 UTC · model grok-4.3


keywords phonetic synchronization · automated dubbing · text-to-speech · lip synchronization · isochrony · dynamic time warping · cross-lingual dubbing


The pith

Phonetic synchronization paraphrases translated text to match source speech duration and lip movements in automated dubbing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a synchronization method for AI automated dubbing that addresses timing and lip alignment issues when converting speech across languages. It first uses a language model to paraphrase the translated text so the target speech lasts as long as the source. It then applies phonetic synchronization to select target vowels with pronunciations close to the source vowels. A sympathetic reader cares because mismatched timing or mouth movements in dubbed video disrupt natural viewing, and the method claims measurable gains over ordinary text-to-speech and sometimes over professional voice actors.
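As a rough illustration of the isochrony step described above, the sketch below picks, from a list of candidate paraphrases, the one whose estimated speech duration is closest to the source duration. The syllable-counting duration proxy, the 0.2 s per syllable rate, and the function names are assumptions made for this example. In the paper the paraphrases come from a language model, and nothing here reflects its actual duration model.

```python
def estimate_duration(text, seconds_per_syllable=0.2):
    """Crude duration proxy (assumption): one syllable per run of vowel letters."""
    vowels = set("aeiouy")
    syllables = 0
    previous_was_vowel = False
    for ch in text.lower():
        is_vowel = ch in vowels
        if is_vowel and not previous_was_vowel:
            syllables += 1
        previous_was_vowel = is_vowel
    return syllables * seconds_per_syllable


def pick_isochronous(candidates, source_duration):
    """Return the candidate whose estimated duration is closest to the source."""
    return min(candidates, key=lambda c: abs(estimate_duration(c) - source_duration))


if __name__ == "__main__":
    # Hypothetical paraphrases of one translated line.
    paraphrases = [
        "Going home",
        "I am going home",
        "I am heading back to my house right now",
    ]
    print(pick_isochronous(paraphrases, source_duration=1.1))
```

A real system would replace `estimate_duration` with a TTS duration predictor or a forced-alignment measurement; the selection logic stays the same.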

Core claim

The authors show that phonetic synchronization, which applies dynamic time warping to vowel distances taken from lip-reading training data, yields target text whose vowels sound similar to the source vowels, while isochrony paraphrasing preserves duration. Extending this to PS-Comet, which jointly weighs semantic and phonetic similarity, produces PS-TTS and PS-Comet TTS systems that outperform TTS without PS on several objective metrics and outperform voice actors on Korean-to-English and English-to-Korean dubbing, with consistent gains across all pairs among Korean, English, and French.

What carries the argument

Phonetic synchronization (PS) using dynamic time warping (DTW) with local costs based on vowel distances from lip-reading data to align target vowel pronunciations with the source.
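A minimal sketch of how DTW with a vowel-distance local cost might score a candidate, assuming the vowel sequences have already been extracted from the source and target text. The distance values are invented for illustration (the paper measures them from lip-reading data), and `dtw_cost`, `vowel_distance`, and `TOY_DISTANCE` are hypothetical names.

```python
# Toy symmetric vowel distances (assumption: made-up values, not the paper's).
TOY_DISTANCE = {
    frozenset(["a", "o"]): 0.4,
    frozenset(["a", "i"]): 0.9,
    frozenset(["i", "e"]): 0.3,
    frozenset(["o", "u"]): 0.2,
}


def vowel_distance(a, b):
    """Local cost: 0 for identical vowels, table lookup otherwise, 1.0 if unknown."""
    if a == b:
        return 0.0
    return TOY_DISTANCE.get(frozenset([a, b]), 1.0)


def dtw_cost(source, target, local_cost):
    """Classic DTW: cumulative cost of the best monotonic alignment."""
    n, m = len(source), len(target)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = local_cost(source[i - 1], target[j - 1])
            acc[i][j] = step + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


if __name__ == "__main__":
    source_vowels = ["a", "i", "o"]
    for candidate in (["a", "e", "u"], ["i", "a", "i"]):
        print(candidate, dtw_cost(source_vowels, candidate, vowel_distance))
```

Under this sketch, the candidate with the lower cumulative cost would be preferred as the closer phonetic match to the source.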

If this is right

PS-TTS produces higher lip-sync accuracy and better duration matching than TTS without the phonetic step on Korean and English lip-reading datasets.
PS-Comet improves semantic preservation while retaining the lip-sync benefit, performing best overall across tested language pairs.
The systems outperform voice-actor dubbing in Korean-to-English and English-to-Korean directions on the voice-actor dataset.
The approach extends directly to additional languages such as French, confirming cross-linguistic applicability.
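The trade-off between lip-sync and semantic preservation noted for PS-Comet can be sketched as a weighted ranking of candidate paraphrases. The joint objective is not spelled out in the material above, so the weighted sum, the `weight` parameter, the score ranges, and every name below are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    semantic_similarity: float  # higher is better, assumed to lie in [0, 1]
    phonetic_cost: float        # lower is better, assumed normalized to [0, 1]


def joint_score(candidate, weight=0.5):
    """Assumed objective: convex combination of meaning and phonetic closeness."""
    phonetic_similarity = 1.0 - candidate.phonetic_cost
    return weight * candidate.semantic_similarity + (1.0 - weight) * phonetic_similarity


def rank(candidates, weight=0.5):
    """Sort candidates from best to worst under the assumed joint score."""
    return sorted(candidates, key=lambda c: joint_score(c, weight), reverse=True)


if __name__ == "__main__":
    pool = [
        Candidate("faithful but poorly lip-synced", 0.9, 0.8),
        Candidate("balanced", 0.7, 0.2),
        Candidate("well lip-synced but loose", 0.4, 0.1),
    ]
    for w in (0.0, 0.5, 1.0):
        print(w, rank(pool, weight=w)[0].text)
```

Setting `weight` to 0 reduces this to a purely phonetic selection and setting it to 1 ignores phonetics entirely, which makes the balance between the two criteria explicit.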

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same vowel-alignment step could be inserted into existing TTS pipelines for any video localization workflow that already has lip-reading data available.
If vowel distance tables can be built quickly for additional languages, the method would lower the cost of producing synchronized dubs for large video libraries.
Real-time variants might become feasible for live events once the paraphrasing and DTW steps are optimized for low latency.

Load-bearing premise

Vowel distances measured from lip-reading training data will reliably predict and produce accurate lip synchronization when applied to new translated text across languages.

What would settle it

Running the PS-TTS system on a held-out set of dubbed videos from a new language pair and finding no statistically significant gain in lip-sync accuracy metrics compared with baseline TTS would falsify the central effectiveness claim.
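One way to run the significance check described above is a paired sign-flip permutation test on per-clip lip-sync scores. This is a sketch under stated assumptions: the page does not say which test or metric the paper uses, the scores are placeholders, and `paired_permutation_test` is a hypothetical helper.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided p-value for the mean paired difference under random sign flips."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, each paired difference is equally likely
        # to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Placeholder per-clip scores for the proposed system and the baseline.
    proposed = [float(i) + 1.0 for i in range(20)]
    baseline = [float(i) for i in range(20)]
    print(paired_permutation_test(proposed, baseline))
```

A large p-value on a held-out language pair would be the "no statistically significant gain" outcome that the paragraph above treats as falsifying.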

Original abstract

Recently, artificial intelligence-based dubbing technology has advanced, enabling automated dubbing (AD) to convert the source speech of a video into target speech in different languages. However, natural AD still faces synchronization challenges such as duration and lip-synchronization (lip-sync), which are crucial for preserving the viewer experience. Therefore, this paper proposes a synchronization method for AD processes that paraphrases translated text, comprising two steps: isochrony for timing constraints and phonetic synchronization (PS) to preserve lip-sync. First, we achieve isochrony by paraphrasing the translated text with a language model, ensuring the target speech duration matches that of the source speech. Second, we introduce PS, which employs dynamic time warping (DTW) with local costs of vowel distances measured from training data so that the target text composes vowels with pronunciations similar to source vowels. Third, we extend this approach to PSComet, which jointly considers semantic and phonetic similarity to preserve meaning better. The proposed methods are incorporated into text-to-speech systems, PS-TTS and PS-Comet TTS. The performance evaluation using Korean and English lip-reading datasets and a voice-actor dubbing dataset demonstrates that both systems outperform TTS without PS on several objective metrics and outperform voice actors in Korean-to-English and English-to-Korean dubbing. We extend the experiments to French, testing all pairs among these languages to evaluate cross-linguistic applicability. Across all language pairs, PS-Comet performed best, balancing lip-sync accuracy with semantic preservation, confirming that PS-Comet achieves more accurate lip-sync with semantic preservation than PS alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper adds a DTW-based phonetic matching step on top of isochrony paraphrasing for dubbing TTS, but the claims of beating voice actors lack the metrics and ablations needed to confirm the vowel-distance term is doing real work.

The new piece here is the phonetic synchronization step: they run DTW on precomputed vowel distances taken from lip-reading corpora to pick paraphrases whose target vowels are visually close to the source. PS-Comet then balances that against semantic similarity from the language model. That combination is a clean, practical extension of standard isochrony work and they test it on Korean-English, English-Korean, and French pairs using both lip-reading sets and a voice-actor dubbing set. The method description is straightforward and the cross-language results are at least directionally consistent with PS-Comet coming out on top for the lip-sync versus meaning trade-off. That is useful to see even if the absolute gains are modest.

The soft spot is exactly the one the stress-test flags. The vowel-distance matrix is derived from training data on lip-reading corpora, then applied to new translated sentences after LM paraphrasing. The abstract says the systems beat both plain TTS and voice actors on objective metrics, yet supplies no numbers, no statistical tests, no ablation that isolates the DTW cost from the isochrony paraphrasing alone, and no check on whether the distance matrix still predicts well on held-out text. Without those, it is impossible to know whether the reported lip-sync gains come from the phonetic term or simply from better duration matching. The assumption that visual vowel distances transfer across languages and to new content is plausible but remains untested in the reported experiments.

This paper is aimed at people working on automated dubbing and lip-sync TTS. A reader who wants concrete implementation ideas for the DTW cost or the joint semantic-phonetic objective will find something to try. It does not yet have the evidential detail for a strong citation or for claiming a clear advance over voice-actor baselines.

I would send it to peer review if the full manuscript includes the missing metrics, ablations, and cross-validation of the distance matrix; the core idea is worth a careful look once the numbers are on the table.

Referee Report

2 major / 2 minor

Summary. The paper proposes PS-TTS for automated dubbing, which paraphrases translated text in two steps—isochrony via language-model paraphrasing to match source duration, and phonetic synchronization (PS) via DTW alignment using precomputed vowel distances from lip-reading training data to match source vowels—then extends it to PS-Comet TTS that jointly optimizes semantic and phonetic similarity. The systems are evaluated on Korean/English lip-reading corpora and a voice-actor dubbing dataset, with extension to French, claiming superior objective metrics over baseline TTS and outperformance versus human voice actors on Korean-English and English-Korean pairs.

Significance. If the reported gains hold under rigorous verification, the work would offer a practical, language-agnostic technique for improving lip synchronization in automated dubbing without retraining the underlying TTS model, addressing a persistent barrier to natural multilingual video localization. The explicit use of DTW on visual-feature-derived vowel costs and the PS-Comet semantic extension represent a clear engineering contribution over purely duration-based or unconstrained translation pipelines.

major comments (2)

[Evaluation] Evaluation section: the central claim that PS-TTS and PS-Comet TTS 'outperform voice actors in Korean-to-English and English-to-Korean dubbing' and 'outperform TTS without PS on several objective metrics' is unsupported by any reported numerical values, tables, confidence intervals, or statistical tests; without these, the generalization from training-set vowel distances to held-out translated sentences cannot be assessed.
[Method] Method description of PS (around the DTW step): the local cost matrix for vowel distances is stated to be 'measured from training data' on lip-reading datasets, yet no equation, feature extraction procedure, or cross-validation on unseen text is supplied; this is load-bearing because the skeptic concern (failure of the distance matrix to transfer after LM paraphrasing) directly affects whether the reported dubbing gains can be attributed to PS rather than isochrony alone.

minor comments (2)

[Abstract] Abstract: 'several objective metrics' and 'cross-linguistic applicability' are mentioned without naming the metrics (e.g., lip-sync error, duration error, semantic similarity scores) or the exact language-pair results.
[Method] Notation: 'PSComet' is introduced without a hyphen or consistent capitalization relative to 'PS-TTS'; clarify the exact joint objective function.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your thorough review and constructive feedback on our manuscript. We appreciate the referee's identification of areas where additional clarity and detail will strengthen the presentation of our results. We address each major comment below and commit to revisions that enhance the rigor and transparency of the work without altering the core contributions.

Referee: [Evaluation] Evaluation section: the central claim that PS-TTS and PS-Comet TTS 'outperform voice actors in Korean-to-English and English-to-Korean dubbing' and 'outperform TTS without PS on several objective metrics' is unsupported by any reported numerical values, tables, confidence intervals, or statistical tests; without these, the generalization from training-set vowel distances to held-out translated sentences cannot be assessed.

Authors: We acknowledge that the evaluation section would be strengthened by more explicit presentation of the supporting data. While the manuscript reports that experiments were conducted on the Korean/English lip-reading corpora and voice-actor dubbing dataset (with extension to French), the specific numerical values, comparative tables, confidence intervals, and statistical tests were not included with sufficient detail. In the revised manuscript, we will add comprehensive tables listing all objective metrics (lip-sync accuracy via DTW alignment scores, semantic similarity via embedding distances, duration matching ratios) for PS-TTS, PS-Comet TTS, baseline TTS, and human voice actors, along with confidence intervals and results of paired statistical tests (e.g., t-tests or Wilcoxon tests) to substantiate the outperformance claims and allow assessment of generalization to held-out paraphrased sentences.

revision: yes

Referee: [Method] Method description of PS (around the DTW step): the local cost matrix for vowel distances is stated to be 'measured from training data' on lip-reading datasets, yet no equation, feature extraction procedure, or cross-validation on unseen text is supplied; this is load-bearing because the skeptic concern (failure of the distance matrix to transfer after LM paraphrasing) directly affects whether the reported dubbing gains can be attributed to PS rather than isochrony alone.

Authors: We agree that the DTW component requires a more precise and self-contained description to address concerns about transferability. The local cost matrix is derived by extracting visual lip-shape embeddings from the lip-reading training data for each vowel class and computing Euclidean distances between these embeddings. In the revision, we will insert a formal equation for the cost matrix, C(v_s, v_t) = ||f(v_s) - f(v_t)||, where f denotes the feature extractor trained on the lip-reading corpus, along with a step-by-step description of the feature extraction pipeline. We will also add a cross-validation subsection reporting alignment performance on held-out translated and paraphrased sentences to demonstrate that the matrix generalizes beyond the training distribution and that phonetic synchronization contributes gains beyond isochrony alone.

revision: yes
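The cost matrix formula quoted in the simulated response above, C(v_s, v_t) = ||f(v_s) - f(v_t)||, can be sketched in a few lines. This assumes one embedding vector per vowel class is already available. The embedding values and vowel labels are invented, `vowel_cost_matrix` is a hypothetical helper, and the formula itself comes from the simulated rebuttal rather than from the paper.

```python
import math


def vowel_cost_matrix(source_embeddings, target_embeddings):
    """Map each (source vowel, target vowel) pair to the Euclidean distance
    between their embeddings, i.e. C(v_s, v_t) = ||f(v_s) - f(v_t)||."""
    return {
        (v_s, v_t): math.dist(f_s, f_t)
        for v_s, f_s in source_embeddings.items()
        for v_t, f_t in target_embeddings.items()
    }


if __name__ == "__main__":
    # Placeholder 2-D lip-shape embeddings; real ones would come from a
    # feature extractor trained on lip-reading data.
    korean = {"ko_a": [0.0, 0.0], "ko_i": [3.0, 4.0]}
    english = {"en_a": [0.0, 1.0], "en_i": [3.0, 4.0]}
    for pair, cost in vowel_cost_matrix(korean, english).items():
        print(pair, cost)
```

The resulting dictionary could serve directly as the local cost lookup inside a DTW alignment over vowel sequences.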

Circularity Check

0 steps flagged

No circularity in the PS-TTS derivation chain

The paper precomputes vowel distance costs empirically from lip-reading training corpora, then feeds these fixed costs into standard DTW to select phonetically similar target paraphrases (after LM-driven isochrony adjustment). The resulting texts are synthesized via PS-TTS or PS-Comet TTS and scored on separate voice-actor dubbing datasets plus cross-lingual lip-reading tests. No equation or claim reduces the reported lip-sync gains to a quantity defined by the method's own fitted outputs, nor does any load-bearing premise rest on a self-citation whose content is itself unverified; the pipeline remains externally falsifiable against held-out data and human baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on standard signal-processing and language-model techniques plus one domain assumption about vowel distances approximating lip movements; no free parameters or new physical entities are introduced beyond the procedural combination itself.

axioms (1)

domain assumption: Vowel pronunciations can be compared via distances derived from training data to approximate lip movements
This assumption underpins the local cost function in the DTW step for phonetic synchronization.

invented entities (2)

Phonetic Synchronization (PS): no independent evidence
purpose: Adjust target text vowels to match source pronunciation for lip-sync preservation
New procedural step introduced for the dubbing task
PSComet: no independent evidence
purpose: Jointly optimize semantic and phonetic similarity during text adjustment
Extension of the base PS method



Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 2 internal anchors

[1] Radford, A., et al.: Robust speech recognition via large-scale weak supervision. In: Proc. ICML, pp. 28492–28518. Honolulu, HI, USA (2023)
[2] Peng, K., et al.: Towards making the most of ChatGPT for machine translation. arXiv:2303.13780 (2023)
[3] Vyas, A., et al.: Audiobox: Unified audio generation with natural language prompts. arXiv:2312.15821 (2023)
[4] Oktem, A., Farrús, M., Bonafonte, A.: Prosodic phrase alignment for machine dubbing. In: Proc. INTERSPEECH, pp. 4215–4219. Graz, Austria (2019)
[5] Yang, Y., et al.: Large-scale multilingual audio-visual dubbing. arXiv:2011.03530 (2020)
[6] Hu, C., et al.: Neural dubber: Dubbing for videos according to scripts. In: Proc. NeurIPS, pp. 16582–16595. New Orleans, LA, USA (2021)
[7] Brannon, W., Virkar, Y., Thompson, B.: Dubbing in practice: A large-scale study of human localization with insights for automatic dubbing. Trans. Association for Computational Linguistics 11, 419–435 (2023)
[8] Virkar, Y., et al.: Improvements to prosodic alignment for automatic dubbing. In: Proc. ICASSP, pp. 7543–7574. Toronto, ON, Canada (2021)
[9] Chaume, F.: Synchronization in dubbing: A translation approach. In: Orero, P. (ed.) Topics in Audiovisual Translation, pp. 35–52. John Benjamins B.V., Amsterdam, Netherlands (2004)
[10] Federico, M., et al.: From speech-to-speech translation to automatic dubbing. In: Proc. 17th Int. Conf. Spoken Language Translation, pp. 257–264. Virtual (2020)
[11] Wu, Y., et al.: VideoDubber: Machine translation with speech-aware length control for video dubbing. In: Proc. AAAI, vol. 37, pp. 13772–13779. Washington, DC, USA (2023)
[12] Kim, H., et al.: Neural style-preserving visual dubbing. ACM Transactions on Graphics 38(6), 1–13 (2019)
[13] Bigioi, D., Corcoran, P.: Multilingual video dubbing—A technology review and current challenges. Frontiers in Signal Processing 3 (2023). DOI: 10.3389/frsip.2023.1230755
[14] Dras, M., Han, C.: Korean-English MT and S-tag. In: Proc. Sixth Int. Workshop on Tree Adjoining Grammar and Related Frameworks, pp. 206–215. Venice, Italy (2002)
[15] González-Iglesias, J.D., Toda, F.: Dubbing or subtitling interculturalism: Choices and constraints. Journal of Intercultural Communication 11(1), 1–9 (2011)
[16] Fenghour, S., et al.: An effective conversion of visemes to words for high-performance automatic lipreading. Sensors 21(23), art. no. 7890 (2021). DOI: 10.3390/s21237890
[17] Abel, A., et al.: Maximising audio-visual correlation with automatic lip tracking and vowel-based segmentation. In: Proc. Biometric ID Management and Multimodal Communication, pp. 65–72. Berlin, Germany (2009)
[18] Casanova, E., et al.: YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone. In: Proc. ICML, pp. 2709–2720. Baltimore, MD, USA (2022)
[19] Harte, N., Gillen, E.: TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia 17(5), 603–615 (2015)
[20] Prajwal, K.R., et al.: A lip sync expert is all you need for speech to lip generation in the wild. In: Proc. 28th ACM Int. Conf. Multimedia, pp. 484–492. Seattle, WA, USA (2020)
[21] Saeki, T., et al.: UTMOS: UTokyo-SaruLab system for VoiceMOS challenge 2022. In: Proc. Interspeech, pp. 4521–4525. Incheon, Korea (2022)
[22] Kim, M., et al.: KUIELab-MDX-Net: A two-stream neural network for music demixing. arXiv:2111.12203 (2021)
[23] Lee, G.W., Kim, H.K., Kong, D.-J.: Knowledge distillation-based training of speech enhancement for noise-robust automatic speech recognition. IEEE Access 12, 72707–72720 (2024). DOI: 10.1109/ACCESS.2024.3403761
[24] Park, D., et al.: GIST-AiTeR system for the diarization task of the 2022 VoxCeleb speaker recognition challenge. arXiv:2209.10357 (2022)
[25] Tiedemann, J.: The Tatoeba translation challenge—Realistic data sets for low resource and multilingual MT. In: Proc. 5th Conf. Machine Translation, pp. 1174–1182. Virtual (2020)
[26] Lakew, S.M., et al.: Machine translation verbosity control for automatic dubbing. In: Proc. ICASSP, pp. 7538–7542. Toronto, Canada (2021)
[27] Lakew, S.M., et al.: Isometric MT: Neural machine translation for automatic dubbing. In: Proc. ICASSP, pp. 6242–6246. Singapore (2022)
[28] Tam, D., et al.: Isochrony-aware neural machine translation for automatic dubbing. arXiv:2112.08548 (2021)
[29] Swiatkowski, J., et al.: Expressive machine dubbing through phrase-level cross-lingual prosody transfer. In: Proc. Interspeech, pp. 5015–5019. Dublin, Ireland (2023)
[30] Kong, J., Kim, J., Bae, J.: HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. In: Proc. NeurIPS, vol. 33, pp. 17022–17033. Virtual (2020)
[31] Casanova, E., et al.: XTTS: A massively multilingual zero-shot text-to-speech model. arXiv:2406.04904 (2024)
[32] G2P module GitHub, https://github.com/Kyubyong/g2pk, last accessed 2025/12/29
[33] Phonemizer GitHub, https://github.com/bootphon/phonemizer, last accessed 2025/12/29
[34] Shih, K.J., et al.: RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis. In: Proc. ICML Workshop. Vienna, Austria (2021)
[35] Kim, J., Kong, J., Bae, J.: Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. In: Proc. NeurIPS, vol. 33, pp. 8067–8077. Vancouver, Canada (2020)
[36] Sharma, M., et al.: Intra-sentential speaking rate control in neural text-to-speech for automatic dubbing. In: Proc. Interspeech, pp. 3151–3155. Brno, Czech Republic (2021)
[37] Graves, A., et al.: Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In: Proc. 23rd ICML, pp. 369–376. Honolulu, USA (2006)
[38] CTC forced aligner GitHub, https://github.com/MahmoudAshraf97/ctc-forced-aligner, last accessed 2025/12/29
[39] Elias, I., et al.: Parallel Tacotron 2: A non-autoregressive neural TTS model with differentiable duration modeling. In: Proc. Interspeech, pp. 141–145. Brno, Czech Republic (2021)
[40] Feng, F., et al.: Language-agnostic BERT sentence embedding. In: Proc. 60th ACL. (2022)
[41] OpenAI: GPT-4 technical report. arXiv:2303.08774 (2024)
[42] Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008)
[43] Tavenard, R., et al.: Tslearn: A machine learning toolkit for time series data. Journal of Machine Learning Research 21(118), 1–6 (2020)
[44] Tslearn GitHub, https://github.com/tslearn-team/tslearn, last accessed 2025/12/29
[45] Guzmán, F., et al.: The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In: Proc. ACL, pp. 6098–6111. Florence, Italy (2019)
[46] Ricardo, R., et al.: COMET: A neural framework for MT evaluation. arXiv:2009.09025 (2020)
[47] Chin, Y.: ROUGE: A package for automatic evaluation of summaries. In: Proc. Text Summarization Branches Out, pp. 74–81. Barcelona, Spain (2004)
[48] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv:1908.10084 (2019)
[49] Zen, H., et al.: LibriTTS: A corpus derived from LibriSpeech for text-to-speech. In: Proc. Interspeech, pp. 1526–1530. Graz, Austria (2019)
[50] KMSSS Data, https://aihub.or.kr, last accessed 2025/12/29
[51] Chung, J.S., Zisserman, A.: Out of time: Automated lip sync in the wild. In: Proc. Asian Conf. Computer Vision, pp. 251–263. Taipei, Taiwan (2016)
[52] Syncnet GitHub, https://github.com/joonson/syncnet_python, last accessed 2025/12/29
[53] Huang, et al.: The VoiceMOS challenge 2022. arXiv:2203.11389 (2022)
[54] SpeechMOS GitHub, https://github.com/tarepan/SpeechMOS, last accessed 2025/12/29
[55] Deepfake Homepage, https://app.vozo.ai, last accessed 2025/12/29
[56] Rassool, R.: VMAF reproducibility: Validating a perceptual practical video quality metric. In: Proc. IEEE Int. Symp. BMSB, pp. 1–2. Cagliari, Italy (2017)
[57] Artetxe, M., Schwenk, H.: Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. arXiv:1812.10464 (2019)
[58] Cong, G., et al.: FlowDubber: Movie dubbing with LLM-based semantic-aware learning and flow matching based voice enhancing. arXiv:2505.01263 (2025)
[59] Chen, Q., et al.: Improving few-shot learning for talking face system with TTS data augmentation. arXiv:2303.05322 (2023)

[1] [1]

In: Proc

Radford, A., et al.: Robust speech recognition via large -scale weak supervision. In: Proc. ICML, pp. 28492–28518. Honolulu, HI, USA (2023)

work page 2023

[2] [2]

arXiv:2303.13780 (2023)

Peng, K., et al.: Towards making the most of ChatGPT for machine translation. arXiv:2303.13780 (2023)

work page arXiv 2023

[3] [3]

Audiobox: Unified audio generation with natural language prompts.arXiv preprint arXiv:2312.15821, 2023

Vyas, A., et al.: Audiobox: Unified audio generation with natural language prompts. arXiv:2312.15821 (2023)

work page arXiv 2023

[4] [4]

In: Proc

Oktem, A., Farrús, M., Bonafonte, A.: Prosodic phrase alignment for machine dubbing. In: Proc. INTERSPEECH, pp. 4215–4219. Graz, Austria (2019)

work page 2019

[5] [5]

arXiv:2011.03530 (2020)

Yang, Y., et al.: Large-scale multilingual audio-visual dubbing. arXiv:2011.03530 (2020)

work page arXiv 2011

[6] [6]

In: Proc

Hu, C., et al.: Neural dubber: Dubbing for videos according to scripts. In: Proc. NeurIPS, pp. 16582–16595. New Orleans, LA, USA (2021)

work page 2021

[7] [7]

Brannon, W., Virkar, Y., Thompson, B.: Dubbing in practice: A l arge-scale study of hu- man localization with insights for automatic dubbing. Trans. Association for Computation- al Linguistics 11, 419–435 (2023)

work page 2023

[8] [8]

In: Proc

Virkar, Y., et al.: Improvements to prosodic alignment for automatic dubbing. In: Proc. ICASSP, pp. 7543–7574. Toronto, ON, Canada (2021) 14

work page 2021

[9] [9]

In: Orero, P

Chaume, F.: Synchronization in dubbing: A translation approach. In: Orero, P. (ed.) Topics in Audiovisual Translation, pp. 35 –52. John Benjamins B.V., Amsterdam, Netherlands (2004)

work page 2004

[10] [10]

In: Proc

Federico, M., et al.: From speech -to-speech trans lation to automatic dubbing. In: Proc. 17th Int. Conf. Spoken Language Translation, pp. 257–264. Virtual (2020)

work page 2020

[11] [11]

In: Proc

Wu, Y., et al.: VideoDubber: Machine translation with speech -aware length control for video dubbing. In: Proc. AAAI, vol. 37, pp. 13772–13779. Washington, DC, USA (2023)

work page 2023

[12] [12]

ACM Transactions on Graphics 38(6), 1–13 (2019)

Kim, H., et al.: Neural style -preserving visual dubbing. ACM Transactions on Graphics 38(6), 1–13 (2019)

work page 2019

[13] [13]

Frontiers in Signal Processing 3 (2023)

Bigioi, D., Corcoran, P.: Multilingual video dubbing —A technology review and current challenges. Frontiers in Signal Processing 3 (2023). DOI: 10.3389/frsip.2023.1230755

work page doi:10.3389/frsip.2023.1230755 2023

[14] [14]

In: Proc

Dras, M., Han, C.: Korean -English MT and S -tag. In: Proc. Sixth Int. Workshop on Tree Adjoining Grammar and Related Frameworks, pp. 206–215. Venice, Italy (2002)

work page 2002

[15] [15]

Journal of Intercultural Communication 11(1), 1–9 (2011)

González-Iglesias, J.D., Toda, F.: Dubbing or subtitling interculturalism: Choices and con- straints. Journal of Intercultural Communication 11(1), 1–9 (2011)

work page 2011

[16] [16]

Sensors 21(23), art

Fenghour, S., et al.: An effective conversion of visemes to words for high -performance au- tomatic lipreading. Sensors 21(23), art. no. 7890 (2021). DOI: 10.3390/s21237890

work page doi:10.3390/s21237890 2021

[17] [17]

In: Proc

Abel, A., et al.: Maximising audio -visual correlation with automatic lip tracking and vow- el-based segmentation. In: Proc. Biometric ID Management and Multimodal Communica- tion, pp. 65–72. Berlin, Germany (2009)

work page 2009

[18] [18]

In: Proc

Casanova, E., et al.: YourTTS: Towards zero -shot multi-speaker TTS and zero -shot voice conversion for everyone. In: Proc. ICML, pp. 2709–2720. Baltimore, MD, USA (2022)

work page 2022

[19] [19]

IEEE Transactions on Multimedia 17(5), 603–615 (2015)

Harte, N., Gillen, E.: TCD -TIMIT: An audio -visual corpus of continuous speech. IEEE Transactions on Multimedia 17(5), 603–615 (2015)

work page 2015

[20] [20]

In: Proc

Prajwal, K.R., et al.: A lip sync expert is all you need for speech to lip generation in the wild. In: Proc. 28th ACM Int. Conf. Multimedia, pp. 484–492. Seattle, WA, USA (2020)

work page 2020

[21] [21]

In: Proc

Saeki, T., et al.: UTMOS: UTokyo -SaruLab s ystem for VoiceMOS challenge 2022. In: Proc. Interspeech, pp. 4521–4525. Incheon, Korea (2022)

work page 2022

[22] [22]

Kuielab-mdx- net: A two-stream neural network for music demixing,

Kim, M., et al.: KUIELab -MDX-Net: A two -stream neural network for music demixing. arXiv:2111.12203 (2021)

work page arXiv 2021

[23] [23]

-J.: Knowledge distillation-based training of speech en- hancement for noise -robust automatic speech recognition

Lee, G.W., Kim, H.K., Kong, D. -J.: Knowledge distillation-based training of speech en- hancement for noise -robust automatic speech recognition. IEEE Access 12, 72707 –72720 (2024). DOI: 10.1109/ACCESS.2024.3403761

work page doi:10.1109/access.2024.3403761 2024

[24] Park, D., et al.: GIST-AiTeR system for the diarization task of the 2022 VoxCeleb speaker recognition challenge. arXiv:2209.10357 (2022)

[25] Tiedemann, J.: The Tatoeba translation challenge – Realistic data sets for low resource and multilingual MT. In: Proc. 5th Conf. Machine Translation, pp. 1174–1182. Virtual (2020)

[26] Lakew, S.M., et al.: Machine translation verbosity control for automatic dubbing. In: Proc. ICASSP, pp. 7538–7542. Toronto, Canada (2021)

[27] Lakew, S.M., et al.: Isometric MT: Neural machine translation for automatic dubbing. In: Proc. ICASSP, pp. 6242–6246. Singapore (2022)

[28] Tam, D., et al.: Isochrony-aware neural machine translation for automatic dubbing. arXiv:2112.08548 (2021)

[29] Swiatkowski, J., et al.: Expressive machine dubbing through phrase-level cross-lingual prosody transfer. In: Proc. Interspeech, pp. 5015–5019. Dublin, Ireland (2023)

[30] Kong, J., Kim, J., Bae, J.: HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. In: Proc. NeurIPS, vol. 33, pp. 17022–17033. Virtual (2020)

[31] Casanova, E., et al.: XTTS: A massively multilingual zero-shot text-to-speech model. arXiv:2406.04904 (2024)

[32] G2P module GitHub, https://github.com/Kyubyong/g2pk, last accessed 2025/12/29

[33] Phonemizer GitHub, https://github.com/bootphon/phonemizer, last accessed 2025/12/29

[34] Shih, K.J., et al.: RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis. In: Proc. ICML Workshop. Vienna, Austria (2021)

[35] Kim, J., Kong, J., Bae, J.: Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. In: Proc. NeurIPS, vol. 33, pp. 8067–8077. Vancouver, Canada (2020)

[36] Sharma, M., et al.: Intra-sentential speaking rate control in neural text-to-speech for automatic dubbing. In: Proc. Interspeech, pp. 3151–3155. Brno, Czech Republic (2021)

[37] Graves, A., et al.: Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In: Proc. 23rd ICML, pp. 369–376. Honolulu, USA (2006)

[38] CTC forced aligner GitHub, https://github.com/MahmoudAshraf97/ctc-forced-aligner, last accessed 2025/12/29

[39] Elias, I., et al.: Parallel Tacotron 2: A non-autoregressive neural TTS model with differentiable duration modeling. In: Proc. Interspeech, pp. 141–145. Brno, Czech Republic (2021)

[40] Feng, F., et al.: Language-agnostic BERT sentence embedding. In: Proc. 60th ACL (2022)

[41] OpenAI: GPT-4 technical report. arXiv:2303.08774 (2024)

[42] Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008)

[43] Tavenard, R., et al.: Tslearn: A machine learning toolkit for time series data. Journal of Machine Learning Research 21(118), 1–6 (2020)

[44] Tslearn GitHub, https://github.com/tslearn-team/tslearn, last accessed 2025/12/29

[45] Guzmán, F., et al.: The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In: Proc. ACL, pp. 6098–6111. Florence, Italy (2019)

[46] Rei, R., et al.: COMET: A neural framework for MT evaluation. arXiv:2009.09025 (2020)

[47] Lin, C.-Y.: ROUGE: A package for automatic evaluation of summaries. In: Proc. Text Summarization Branches Out, pp. 74–81. Barcelona, Spain (2004)

[48] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv:1908.10084 (2019)

[49] Zen, H., et al.: LibriTTS: A corpus derived from LibriSpeech for text-to-speech. In: Proc. Interspeech, pp. 1526–1530. Graz, Austria (2019)

[50] KMSSS Data, https://aihub.or.kr, last accessed 2025/12/29

[51] Chung, J.S., Zisserman, A.: Out of time: Automated lip sync in the wild. In: Proc. Asian Conf. Computer Vision, pp. 251–263. Taipei, Taiwan (2016)

[52] Syncnet GitHub, https://github.com/joonson/syncnet_python, last accessed 2025/12/29

[53] Huang, et al.: The VoiceMOS challenge 2022. arXiv:2203.11389 (2022)

[54] SpeechMOS GitHub, https://github.com/tarepan/SpeechMOS, last accessed 2025/12/29

[55] Deepfake Homepage, https://app.vozo.ai, last accessed 2025/12/29

[56] Rassool, R.: VMAF reproducibility: Validating a perceptual practical video quality metric. In: Proc. IEEE Int. Symp. BMSB, pp. 1–2. Cagliari, Italy (2017)

[57] Artetxe, M., Schwenk, H.: Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. arXiv:1812.10464 (2019)

[58] Cong, G., et al.: FlowDubber: Movie dubbing with LLM-based semantic-aware learning and flow matching based voice enhancing. arXiv:2505.01263 (2025)

[59] Chen, Q., et al.: Improving few-shot learning for talking face system with TTS data augmentation. arXiv:2303.05322 (2023)