
arXiv: 2411.17690 · v3 · submitted 2024-11-26 · cs.MM · cs.CV · cs.SD · eess.AS

Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis

Pith reviewed 2026-05-23 17:19 UTC · model grok-4.3

keywords: multimodal synchronization · decoder-only transformer · video-text-to-speech · positional encoding · modality ordering · temporal alignment · phoneme-level metrics

The pith

Both global sequential indexing and co-temporal ordered indexing enable strong synchronization of video, text, and speech in a unified decoder-only transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how decoder-only transformers coordinate modalities sampled at different rates by training on video-text-to-speech synthesis. It finds that two positional encoding approaches—assigning unique IDs across all tokens or reusing identical IDs for tokens that occur at the same time—both produce accurate temporal alignment without separate timestamp inputs. Text supplies the content needed for intelligible speech while video supplies timing and expressive cues, and the sequence in which modalities are presented creates a trade-off between strong results on training-like data and better handling of new domains. A new phoneme-level metric is introduced to measure timing precision more finely than standard frame-level scores.

Core claim

In the Visatronic decoder-only transformer trained on VoxCeleb2, global sequential indexing (unique position IDs across modalities) and co-temporal ordered indexing (identical IDs for temporally corresponding tokens) both achieve strong synchronization performance. Text ensures intelligibility while video supplies temporal cues and emotional expressiveness. Video-first ordering yields stronger in-domain performance, whereas text-first ordering generalizes more robustly to unseen domains. Diverse large-scale training supports transferable synchronization strategies, and the introduced TimeSync metric exposes per-phoneme timing errors missed by coarser measures.

What carries the argument

Positional encoding strategies of global sequential indexing and co-temporal ordered indexing that align tokens from heterogeneous modalities inside a single decoder-only transformer.
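
The two indexing schemes can be pictured with a small sketch. It is illustrative only: the function names, the token counts, the 25 Hz and 50 Hz rates, and the choice to give text tokens their own leading IDs in the co-temporal variant are assumptions made here, not details taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

PositionedToken = Tuple[str, int]  # (modality name, position ID)


def global_sequential_ids(counts: Dict[str, int],
                          order: Sequence[str]) -> List[PositionedToken]:
    """Give every token a unique, increasing ID across all modalities."""
    ids: List[PositionedToken] = []
    pos = 0
    for name in order:
        for _ in range(counts[name]):
            ids.append((name, pos))
            pos += 1
    return ids


def co_temporal_ids(n_text: int, n_video: int, n_speech: int,
                    video_hz: int, speech_hz: int) -> List[PositionedToken]:
    """Reuse one ID for video and speech tokens that cover the same time slot.

    Assumption: text carries no timestamps, so it keeps IDs 0..n_text-1 and
    the time-aligned modalities start right after it.
    """
    offset = n_text
    ids: List[PositionedToken] = [("text", i) for i in range(n_text)]
    ids += [("video", offset + k) for k in range(n_video)]
    # Speech token j overlaps video frame floor(j * video_hz / speech_hz).
    ids += [("speech", offset + (j * video_hz) // speech_hz)
            for j in range(n_speech)]
    return ids


if __name__ == "__main__":
    counts = {"text": 2, "video": 2, "speech": 4}
    print(global_sequential_ids(counts, ("text", "video", "speech")))
    print(co_temporal_ids(2, 2, 4, video_hz=25, speech_hz=50))
```

With two video frames at 25 Hz and four speech tokens at 50 Hz, the co-temporal variant gives each pair of speech tokens the ID of the video frame they overlap, while the global variant simply counts from 0 to 7. Passing a different `order` to the global variant is one way to express video-first versus text-first sequences.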

If this is right

  • Text and video supply complementary signals that together improve intelligibility, timing, and expressiveness in generated speech.
  • Modality ordering produces a consistent trade-off between in-domain accuracy and cross-domain robustness.
  • Large-scale diverse training enables synchronization strategies to transfer across domains.
  • Phoneme-level metrics such as TimeSync diagnose timing misalignments that frame-level metrics overlook.
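
As a rough illustration of a phoneme-level timing score, the sketch below follows the description in the Figure 5 caption: silence ("sp") segments are dropped and the absolute difference between segment centers is taken, in seconds. The segment tuple layout, the function names, the averaging step, and the decision to drop silence without re-timing the remaining segments are assumptions for illustration, not the paper's implementation.

```python
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (phoneme label, start in s, end in s)


def drop_silence(segments: List[Segment], silence_label: str = "sp") -> List[Segment]:
    """Remove silence segments so only spoken phonemes are compared."""
    return [seg for seg in segments if seg[0] != silence_label]


def centers(segments: List[Segment]) -> List[float]:
    """Midpoint of each segment, in seconds."""
    return [(start + end) / 2.0 for _, start, end in segments]


def timesync_errors(ground_truth: List[Segment],
                    generated: List[Segment]) -> List[float]:
    """Per-phoneme absolute center offset between two alignments."""
    gt = drop_silence(ground_truth)
    gen = drop_silence(generated)
    if [p for p, _, _ in gt] != [p for p, _, _ in gen]:
        raise ValueError("alignments must contain the same phoneme sequence")
    return [abs(a - b) for a, b in zip(centers(gt), centers(gen))]


def mean_timesync(ground_truth: List[Segment],
                  generated: List[Segment]) -> float:
    """Average the per-phoneme errors (the averaging is an assumption here)."""
    errors = timesync_errors(ground_truth, generated)
    return sum(errors) / len(errors)


if __name__ == "__main__":
    gt = [("sp", 0.0, 0.5), ("HH", 0.5, 1.0), ("AY", 1.0, 2.0)]
    gen = [("HH", 0.25, 0.75), ("AY", 1.0, 1.5)]
    print(timesync_errors(gt, gen))
    print(mean_timesync(gt, gen))
```

Keeping the per-phoneme list alongside the mean is what lets a score like this point at which phonemes drift, rather than reporting a single frame-level number.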

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same indexing methods could simplify alignment in other multimodal decoder tasks such as video-conditioned captioning.
  • Choosing modality order at training time offers a practical lever for controlling generalization without changing the architecture.
  • If co-temporal indexing works without explicit timestamps, it reduces the need for additional metadata pipelines in deployed multimodal systems.

Load-bearing premise

The synchronization behaviors and modality-ordering trade-offs observed on the VoxCeleb2-trained VTTS task represent general multimodal synchronization mechanisms in decoder-only transformers.

What would settle it

Repeating the same experiments on a different multimodal generation task or dataset and observing that one indexing method collapses while the other remains effective would falsify the claim of general applicability.

Figures

Figures reproduced from arXiv: 2411.17690 by Akshita Gupta, Karren Dai Yang, Navdeep Jaitly, Richard He Bai, Tatiana Likhomanenko, Zakaria Aldeneh.

Figure 1: Visatronic overview. In addition to existing text-to-speech (leftmost) and lips-to-speech tasks (middle), we address a multimodal generative task (rightmost), video-text to speech (VTTS), where the model is conditioned on the video of talking people and corresponding text transcriptions in order to generate speech. Visatronic is a unified decoder-only transformer that processes video v (grey), text t (grey),…
Figure 2: Video representation. Each video frame at time t is encoded via a VQ-VAE [54] into a downsampled spatial grid in R^{H′×W′×D}. Each vector at location (h, w) is quantized to a discrete token using the learned codebook C_v via ℓ2 similarity. These discrete tokens are embedded into R^{D′} and aggregated across the spatial grid to produce the final frame-level embedding input to the transformer. See Section 2.2 for…
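
The quantization step in the Figure 2 caption can be sketched as a nearest-codebook lookup followed by an aggregation over the grid. This is a minimal pure-Python sketch under stated assumptions: the function names are hypothetical, the codebook and embedding table are toy values, and summation is only one possible reading of "aggregated", which the caption does not specify.

```python
from typing import List

Vector = List[float]


def nearest_code(vec: Vector, codebook: List[Vector]) -> int:
    """Index of the codebook entry with the smallest squared L2 distance."""
    def sqdist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(vec, codebook[k]))


def quantize_grid(grid: List[List[Vector]],
                  codebook: List[Vector]) -> List[List[int]]:
    """Map every (h, w) vector of an encoded frame to a discrete token."""
    return [[nearest_code(vec, codebook) for vec in row] for row in grid]


def frame_embedding(token_grid: List[List[int]],
                    embed_table: List[Vector]) -> Vector:
    """Embed each token and sum over the grid (summation is an assumption)."""
    out = [0.0] * len(embed_table[0])
    for row in token_grid:
        for tok in row:
            for d, value in enumerate(embed_table[tok]):
                out[d] += value
    return out


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0]]
    grid = [[[0.1, 0.2], [0.9, 0.8]],
            [[0.6, 0.7], [0.2, 0.1]]]
    tokens = quantize_grid(grid, codebook)
    print(tokens)
    print(frame_embedding(tokens, [[1.0, 0.0], [0.0, 2.0]]))
```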
Figure 3: Speech representation. We follow the speech discretization process from dMel [4]: each continuous mel-filterbank at time t extracted from the raw audio is mapped into discrete values using a codebook of evenly spaced values. Afterwards, each discretized log mel-filterbank at time t is mapped through a learnable embedding layer, all representations for log mel-filterbanks at time t are stacked together an…
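
The "codebook of evenly spaced values" in the Figure 3 caption amounts to uniform scalar quantization of each filterbank channel. The sketch below shows that idea only; the value range, the number of bins, and the function names are hypothetical choices for illustration and are not taken from dMel or from the paper.

```python
from typing import List


def make_codebook(lo: float, hi: float, n_bins: int) -> List[float]:
    """Evenly spaced levels from lo to hi, inclusive."""
    step = (hi - lo) / (n_bins - 1)
    return [lo + k * step for k in range(n_bins)]


def discretize(value: float, lo: float, hi: float, n_bins: int) -> int:
    """Index of the nearest evenly spaced level, after clipping to [lo, hi]."""
    step = (hi - lo) / (n_bins - 1)
    clipped = min(max(value, lo), hi)
    # Adding 0.5 before truncation rounds to the nearest level.
    return int((clipped - lo) / step + 0.5)


def discretize_frame(frame: List[float], lo: float, hi: float,
                     n_bins: int) -> List[int]:
    """Quantize every channel of one log mel-filterbank frame independently."""
    return [discretize(v, lo, hi, n_bins) for v in frame]


if __name__ == "__main__":
    print(make_codebook(0.0, 4.0, 5))
    print(discretize_frame([0.2, 1.6, 3.9, 9.0, -1.0], 0.0, 4.0, 5))
```

Each channel yields one small integer per frame, which is what lets a learnable embedding layer consume the result as discrete tokens.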
Figure 4: Input sequence for Visatronic. We encode all modalities into a discrete token space (see Figures 2 and 3), which is directly consumed by the decoder-only transformer. Each modality’s discrete representation is indicated by a colored square. Each row illustrates a different strategy for combining multimodal information to learn temporal alignment across modalities: (top) text precedes video, which is follow…
Figure 5: TimeSync. Visualization of phoneme-level alignment used for computing TimeSync. Left: alignment in ground truth audio before (blue) and after (green) removing silence (“sp”) segments. Right: aligned phoneme positions between ground truth (green) and generated (red) audio, where TimeSync is computed as the absolute difference between segment centers (measured in seconds).
Figure 6: Qualitative alignment and spectrogram analysis. Left: Log mel-spectrogram comparison for TTS (top), GT (middle), and VTTS (VT-ordered, bottom). VTTS (VT-ordered) better matches GT’s timing (393 frames) and energy patterns, unlike TTS which overextends (445 frames). Right: TimeSync visualization of phoneme alignment for the same example. Ground truth and generated phoneme segment centers are plotted on x- a…
Figure 7: Human evaluation. Task description for the crowd-sourced raters to evaluate intelligibility, naturalness and synchronization of the ground truth or generated speech: speech is overlaid with the video and they are played together for the raters.
Figure 8: Human evaluation. Task description for the crowd-sourced raters to evaluate correspondence between facial expressions and emotions in speech for ground truth and generated speech: speech is overlaid with the video and they are played together for the raters.
Figure 9: Human evaluation. Task description for the crowd-sourced raters to evaluate how closely emotions in generated speech follow the ground truth.
Figure 10: Qualitative comparison of log mel-spectrograms. Visualization of generated log mel-spectrograms: Text-to-Speech (TTS, top), Ground Truth (GT, middle), and our Video-Text-to-Speech (VTTS, bottom). VTTS (VT-ordered) demonstrates better temporal alignment with GT (367 frames) compared to TTS (419 frames), showing the benefit of video conditioning for maintaining correct speech duration. The spectral patte…
Figure 12: Qualitative comparison of log mel-spectrograms. Visualization of generated log mel-spectrograms from different methods: Text-to-Speech (TTS, top), Ground Truth (GT, middle), and our Video-Text-to-Speech (VTTS, bottom). VTTS (VT-ordered) demonstrates better temporal alignment with GT (208 frames) compared to TTS (261 frames), showing the benefit of video conditioning for maintaining correct speech durat…
Figure 15: Alignment between phonemes. Temporal alignment visualization for failure case corresponding to…
Figure 16: Distribution for TimeSync. We show the difference (left) and absolute difference (right) between ground truth and generated speech phoneme locations (location of the center of the phoneme segment) in time measured in seconds. The ground truth text is used to align it to both ground truth and generated speech. For generated speech we use the TTS, VTTS (VT-ordered), and VTTS (TV-ordered) models.
Original abstract

Unified decoder-only transformers have shown promise for multimodal generation, yet the mechanisms by which they synchronize modalities with heterogeneous sampling rates remain underexplored. We investigate these mechanisms through video-text-to-speech (VTTS) synthesis, a controlled task requiring fine-grained temporal alignment between sparse text, video, and continuous speech. Using a unified decoder-only transformer, dubbed Visatronic, trained on VoxCeleb2, we study: (i) how modalities contribute complementary information, (ii) how positional encoding strategies enable synchronization across heterogeneous rates, (iii) how modality ordering shapes the trade-off between in-domain performance and cross-domain transfer, and (iv) how phoneme-level synchronization metrics provide diagnostic insight into per-phoneme timing errors. Our findings reveal that both "global sequential indexing" (unique position IDs across modalities) and "co-temporal ordered indexing" (identical IDs for temporally corresponding tokens) achieve strong synchronization performance, with co-temporal ordered indexing providing a simple mechanism without explicit timestamp metadata. Both text and video contribute complementary signals: text ensures intelligibility while video provides temporal cues and emotional expressiveness. Modality ordering reveals a consistent trade-off: video-first ordering achieves stronger in-domain performance while text-first ordering generalizes more robustly to unseen domains. Our findings also reveal that diverse large-scale training enables transferable synchronization strategies. To enable fine-grained analysis, we also introduce TimeSync, a phoneme-level metric that reveals temporal misalignments overlooked by frame-level metrics. These insights establish VTTS as a valuable testbed for understanding temporal synchronization in unified multimodal decoders.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates mechanisms of multimodal synchronization in unified decoder-only transformers via a video-text-to-speech (VTTS) task using the Visatronic model trained on VoxCeleb2. It examines how modalities contribute complementary information (text for intelligibility, video for temporal/emotional cues), compares positional encoding strategies (global sequential indexing vs. co-temporal ordered indexing), analyzes modality ordering trade-offs (video-first for in-domain performance vs. text-first for cross-domain generalization), and introduces the TimeSync phoneme-level metric to diagnose temporal misalignments. The central claims are that both indexing approaches enable strong synchronization without explicit timestamps and that diverse training yields transferable strategies.

Significance. If the empirical findings on indexing and ordering hold under broader validation, the work offers concrete design insights for handling heterogeneous sampling rates in decoder-only multimodal models and introduces a useful fine-grained diagnostic (TimeSync) that improves on frame-level metrics. The paper's strength lies in its controlled VTTS testbed and direct measurement of synchronization behaviors rather than derived claims.

major comments (2)
  1. [Abstract] The assertion that the reported synchronization behaviors and modality-ordering trade-offs reveal 'general mechanisms' for decoder-only multimodal transformers is load-bearing for the paper's contribution but rests on experiments limited to VTTS on VoxCeleb2 (with speech-centric held-out domains); no ablations on alternative tasks, architectures, or non-speech modalities are described, leaving open whether results are specific to this setup's sampling rates and causal attention.
  2. [Abstract] The abstract states clear experimental findings on indexing performance and modality contributions, yet provides no quantitative results, statistical tests, ablation controls, or details on architecture/training procedure, making it impossible to assess whether reported differences are robust or influenced by post-hoc metric/data choices.
minor comments (2)
  1. The description of 'global sequential indexing' and 'co-temporal ordered indexing' would benefit from explicit pseudocode or a small diagram to clarify token-to-ID mapping across modalities.
  2. Clarify whether TimeSync is evaluated with statistical significance testing across phonemes or speakers, as this would strengthen the diagnostic claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the scope of our claims and the abstract's level of detail. We address each point below with proposed revisions to the manuscript.

point-by-point responses
  1. Referee: [Abstract] The assertion that the reported synchronization behaviors and modality-ordering trade-offs reveal 'general mechanisms' for decoder-only multimodal transformers is load-bearing for the paper's contribution but rests on experiments limited to VTTS on VoxCeleb2 (with speech-centric held-out domains); no ablations on alternative tasks, architectures, or non-speech modalities are described, leaving open whether results are specific to this setup's sampling rates and causal attention.

    Authors: We agree that the experiments are limited to the VTTS task on VoxCeleb2 and that broader validation would be required to claim fully general mechanisms across decoder-only multimodal models. The manuscript uses VTTS as a controlled testbed precisely because of its heterogeneous sampling rates and fine-grained alignment demands, but we will revise the abstract to qualify the language. We will change the final sentence from 'These insights establish VTTS as a valuable testbed for understanding temporal synchronization in unified multimodal decoders' to 'These insights, demonstrated in the VTTS setting, provide concrete design considerations for handling heterogeneous sampling rates in decoder-only multimodal models.' This removes the load-bearing generality claim while preserving the contribution.

    revision: yes

  2. Referee: [Abstract] The abstract states clear experimental findings on indexing performance and modality contributions, yet provides no quantitative results, statistical tests, ablation controls, or details on architecture/training procedure, making it impossible to assess whether reported differences are robust or influenced by post-hoc metric/data choices.

    Authors: Abstracts are conventionally high-level, but we accept that including key quantitative anchors would improve evaluability. In revision we will add concise references to core results (e.g., 'co-temporal ordered indexing matches global sequential indexing on TimeSync while improving cross-domain generalization under text-first ordering') and note that full architecture, training, and statistical details appear in Sections 3–5. Because space constraints prevent exhaustive ablation descriptions in the abstract itself, we treat this as a partial revision focused on the most salient metrics.

    revision: partial

Circularity Check

0 steps flagged

No circularity; all claims are direct empirical measurements from trained models on VoxCeleb2 VTTS task

full rationale

The paper reports experimental results from training a unified decoder-only transformer (Visatronic) on VoxCeleb2 for video-text-to-speech synthesis. It compares positional encoding strategies (global sequential vs. co-temporal ordered indexing), modality orderings (video-first vs. text-first), and measures contributions via metrics including a new phoneme-level TimeSync. All stated findings (synchronization performance, complementarity of text/video, in-domain vs. generalization trade-offs) are presented as outcomes of these trained-model evaluations rather than any derivation, fitted-parameter prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked that reduce to the inputs by construction. The work is self-contained against external benchmarks as a set of controlled ablation experiments.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claims depend on the empirical outcomes of training a large neural network whose weights are fitted to the VoxCeleb2 dataset; the work assumes the decoder-only transformer architecture possesses sufficient capacity to learn cross-modal temporal alignments from data alone.

free parameters (2)
  • neural network weights
    All model parameters are optimized during training on the dataset and directly determine the observed synchronization behavior.
  • training hyperparameters
    Learning rate, batch size, and optimization choices are selected to produce the reported performance.
axioms (1)
  • domain assumption: A decoder-only transformer can learn to align heterogeneous modalities when trained on paired video-text-speech data.
    This assumption underpins the entire experimental program and is not derived within the paper.


Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 10 internal anchors

  1. [1]

    LRS3-TED: a large-scale dataset for visual speech recognition

    Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. Lrs3-ted: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496, 2018

  2. [2]

    Lip2audspec: Speech reconstruction from silent lip movements video

    Hassan Akbari, Himani Arora, Liangliang Cao, and Nima Mesgarani. Lip2audspec: Speech reconstruction from silent lip movements video. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 2516–2520. IEEE, 2018

  3. [3]

    A³T: Alignment-aware acoustic and text pretraining for speech synthesis and editing

    He Bai, Renjie Zheng, Junkun Chen, Mingbo Ma, Xintong Li, and Liang Huang. A³T: Alignment-aware acoustic and text pretraining for speech synthesis and editing. In Proceedings of the 39th International Conference on Machine Learning, pages 1399–1411. PMLR, 2022

  4. [4]

    dmel: Speech tokenization made simple

    He Bai, Tatiana Likhomanenko, Ruixiang Zhang, Zijin Gu, Zakaria Aldeneh, and Navdeep Jaitly. dmel: Speech tokenization made simple. arXiv preprint arXiv:2407.15835, 2024

  5. [5]

    Audiolm: a language modeling approach to audio generation

    Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al. Audiolm: a language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023

  6. [6]

    A Short Note about Kinetics-600

    Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018

  7. [7]

    Adaspeech: Adaptive text to speech for custom voice

    Mingjian Chen, Xu Tan, Bohan Li, Yanqing Liu, Tao Qin, Sheng Zhao, and Tie-Yan Liu. Adaspeech: Adaptive text to speech for custom voice. arXiv preprint arXiv:2103.00993, 2021

  8. [8]

    V2c: Visual voice cloning

    Qi Chen, Mingkui Tan, Yuankai Qi, Jiaqiu Zhou, Yuanqing Li, and Qi Wu. V2c: Visual voice cloning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21242–21251, 2022

  9. [9]

    Diffv2s: Diffusion-based video-to-speech synthesis with vision-guided speaker embedding

    Jeongsoo Choi, Joanna Hong, and Yong Man Ro. Diffv2s: Diffusion-based video-to-speech synthesis with vision-guided speaker embedding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7812–7821, 2023

  10. [10]

    VoxCeleb2: Deep Speaker Recognition

    Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. VoxCeleb2: Deep Speaker Recognition. In Proc. Interspeech 2018, pages 1086–1090, 2018

  11. [11]

    Learning to dub movies via hierarchical prosody models

    Gaoxiang Cong, Liang Li, Yuankai Qi, Zheng-Jun Zha, Qi Wu, Wenyu Wang, Bin Jiang, Ming-Hsuan Yang, and Qingming Huang. Learning to dub movies via hierarchical prosody models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14687–14697, 2023

  12. [12]

    Styledubber: towards multi-scale style learning for movie dubbing

    Gaoxiang Cong, Yuankai Qi, Liang Li, Amin Beheshti, Zhedong Zhang, Anton van den Hengel, Ming-Hsuan Yang, Chenggang Yan, and Qingming Huang. Styledubber: towards multi-scale style learning for movie dubbing. arXiv preprint arXiv:2402.12636, 2024

  13. [13]

    Real time speech enhancement in the waveform domain

    Alexandre Defossez, Gabriel Synnaeve, and Yossi Adi. Real time speech enhancement in the waveform domain. In Interspeech, 2020

  14. [14]

    High Fidelity Neural Audio Compression

    Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022

  15. [15]

    Vid2speech: speech reconstruction from silent video

    Ariel Ephrat and Shmuel Peleg. Vid2speech: speech reconstruction from silent video. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5095–5099. IEEE, 2017

  16. [16]

    Face2speech: Towards multi-speaker text-to-speech synthesis using an embedding vector predicted from a face image

    Shunsuke Goto, Kotaro Onishi, Yuki Saito, Kentaro Tachibana, and Koichiro Mori. Face2speech: Towards multi-speaker text-to-speech synthesis using an embedding vector predicted from a face image. In INTERSPEECH, pages 1321–1325, 2020

  17. [17]

    Revise: Self-supervised speech resynthesis with visual input for universal and generalized speech regeneration

    Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, and Yossi Adi. Revise: Self-supervised speech resynthesis with visual input for universal and generalized speech regeneration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18795–18805, 2023

  18. [18]

    Neural dubber: Dubbing for videos according to scripts

    Chenxu Hu, Qiao Tian, Tingle Li, Wang Yuping, Yuxuan Wang, and Hang Zhao. Neural dubber: Dubbing for videos according to scripts. Advances in neural information processing systems, 34:16582–16595, 2021

  19. [19]

    Generspeech: Towards style transfer for generalizable out-of-domain text-to-speech

    Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, and Zhou Zhao. Generspeech: Towards style transfer for generalizable out-of-domain text-to-speech. Advances in Neural Information Processing Systems, 35:10970–10983, 2022

  20. [20]

    Transfer learning from speaker verification to multispeaker text-to-speech synthesis

    Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Advances in neural information processing systems, 31, 2018

  21. [21]

    Neural voice cloning with a few low-quality samples

    Sunghee Jung and Hoirin Kim. Neural voice cloning with a few low-quality samples. arXiv preprint arXiv:2006.06940, 2020

  22. [22]

    Glow-tts: A generative flow for text-to-speech via monotonic alignment search

    Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. Glow-tts: A generative flow for text-to-speech via monotonic alignment search. Advances in Neural Information Processing Systems, 33:8067–8077, 2020

  23. [23]

    Lip to speech synthesis with visual context attentional gan

    Minsu Kim, Joanna Hong, and Yong Man Ro. Lip to speech synthesis with visual context attentional gan. Advances in Neural Information Processing Systems, 34:2758–2770, 2021

  24. [24]

    Lip-to-speech synthesis in the wild with multi-task learning

    Minsu Kim, Joanna Hong, and Yong Man Ro. Lip-to-speech synthesis in the wild with multi-task learning. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023

  25. [25]

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Joshua V. Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig ...

  26. [26]

    DiffWave: A Versatile Diffusion Model for Audio Synthesis

    Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020

  27. [27]

    Imaginary voice: Face-styled diffusion model for text-to-speech

    Jiyoung Lee, Joon Son Chung, and Soo-Whan Chung. Imaginary voice: Face-styled diffusion model for text-to-speech. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023

  28. [28]

    Pvae-tts: Adaptive text-to-speech via progressive style adaptation

    Ji-Hyun Lee, Sang-Hoon Lee, Ji-Hoon Kim, and Seong-Whan Lee. Pvae-tts: Adaptive text-to-speech via progressive style adaptation. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6312–6316. IEEE, 2022

  29. [29]

    Priorgrad: Improving conditional denoising diffusion models with data-dependent adaptive prior

    Sang-gil Lee, Heeseung Kim, Chaehun Shin, Xu Tan, Chang Liu, Qi Meng, Tao Qin, Wei Chen, Sungroh Yoon, and Tie-Yan Liu. Priorgrad: Improving conditional denoising diffusion models with data-dependent adaptive prior. arXiv preprint arXiv:2106.06406, 2021

  30. [30]

    Multi-spectrogan: High-diversity and high-fidelity spectrogram generation with adversarial style combination for speech synthesis

    Sang-Hoon Lee, Hyun-Wook Yoon, Hyeong-Rae Noh, Ji-Hoon Kim, and Seong-Whan Lee. Multi-spectrogan: High-diversity and high-fidelity spectrogram generation with adversarial style combination for speech synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021

  31. [31]

    M3tts: Multi-modal text-to-speech of multi-scale style control for dubbing

    Yan Liu, Li-Fang Wei, Xinyuan Qian, Tian-Hao Zhang, Song-Lu Chen, and Xu-Cheng Yin. M3tts: Multi-modal text-to-speech of multi-scale style control for dubbing. Pattern Recognition Letters, 179:158–164, 2024

  32. [32]

    Voxtlm: Unified decoder-only models for consolidating speech recognition, synthesis and speech, text continuation tasks

    Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, and Shinji Watanabe. Voxtlm: Unified decoder-only models for consolidating speech recognition, synthesis and speech, text continuation tasks. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 13326–13330. IEEE, 2024

  33. [33]

    Mm1: Methods, analysis & insights from multimodal llm pre-training, 2024

    Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier Biard, Sam Dodge, Philipp Dufter, Bowen Zhang, Dhruti Shah, Xianzhi Du, Futang Peng, Haotian Zhang, Floris Weers, Anton Belyi, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu He, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Mark Lee, Zirui Wang...

  34. [34]

    Matcha-tts: A fast tts architecture with conditional flow matching

    Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter. Matcha-tts: A fast tts architecture with conditional flow matching. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 11341–11345. IEEE, 2024

  35. [35]

    Meta-stylespeech: Multi-speaker adaptive text-to-speech generation

    Dongchan Min, Dong Bok Lee, Eunho Yang, and Sung Ju Hwang. Meta-stylespeech: Multi-speaker adaptive text-to-speech generation. In International Conference on Machine Learning, pages 7748–7759. PMLR, 2021

  36. [36]

    Svts: scalable video-to-speech synthesis

    Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Björn W Schuller, and Maja Pantic. Svts: scalable video-to-speech synthesis. arXiv preprint arXiv:2205.02058, 2022

  37. [37]

    End-to-end video-to-speech synthesis using generative adversarial networks

    Rodrigo Mira, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Björn W Schuller, and Maja Pantic. End-to-end video-to-speech synthesis using generative adversarial networks. IEEE transactions on cybernetics, 53(6):3454–3466, 2022

  38. [38]

    Grad-tts: A diffusion probabilistic model for text-to-speech

    Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. Grad-tts: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning, pages 8599–8608. PMLR, 2021

  39. [39]

    Learning individual speaking styles for accurate lip to speech synthesis

    KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. Learning individual speaking styles for accurate lip to speech synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13796–13805, 2020

  40. [40]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR, 2023

  41. [41]

    Fastspeech: Fast, robust and controllable text to speech

    Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech. Advances in neural information processing systems, 32, 2019

  42. [42]

    MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

    S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. Mmau: A massive multi-task audio understanding and reasoning benchmark. arXiv preprint arXiv:2410.19168, 2024

  43. [43]

    Natural tts synthesis by conditioning wavenet on mel spectrogram predictions

    Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 4779–4783. IEEE, 2018

  44. [44]

    Learning lip-based audio-visual speaker embeddings with av-hubert

    Bowen Shi, Abdelrahman Mohamed, and Wei-Ning Hsu. Learning lip-based audio-visual speaker embeddings with av-hubert. arXiv preprint arXiv:2205.07180, 2022

  45. [45]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  46. [46]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  47. [47]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  48. [48]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  49. [49]

    Deep neural networks for small footprint text-dependent speaker verification

    Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez. Deep neural networks for small footprint text-dependent speaker verification. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 4052–4056. IEEE, 2014

  50. [50]

    Residual-guided personalized speech synthesis based on face image

    Jianrong Wang, Zixuan Wang, Xiaosheng Hu, Xuewei Li, Qiang Fang, and Li Liu. Residual-guided personalized speech synthesis based on face image. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4743–

  51. [51]

    VioLA: Unified codec language models for speech recognition, synthesis, and translation

    Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, and Furu Wei. Viola: Unified codec language models for speech recognition, synthesis, and translation. arXiv preprint arXiv:2305.16107, 2023

  52. [52]

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215, 2025

  53. [53]

    Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram

    Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6199–6203. IEEE, 2020

  54. [54]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021

  55. [55]

    Lipvoicer: Generating speech from silent videos guided by lip reading

    Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot, and Ethan Fetaya. Lipvoicer: Generating speech from silent videos guided by lip reading. In The Twelfth International Conference on Learning Representations, 2024

  56. [56]

    The htk book

    Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, et al. The htk book. Cambridge university engineering department, 3(175):12, 2002

  57. [57]

    Scaling autoregressive models for content-rich text-to-image generation, 2022

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022

  58. [58]

    Statistical parametric speech synthesis

    Heiga Zen, Keiichi Tokuda, and Alan W Black. Statistical parametric speech synthesis. speech communication, 51(11):1039–1064, 2009

  59. [59]

    From speaker to dubber: movie dubbing with prosody and duration consistency learning

    Zhedong Zhang, Liang Li, Gaoxiang Cong, Haibing Yin, Yuhan Gao, Chenggang Yan, Anton van den Hengel, and Yuankai Qi. From speaker to dubber: movie dubbing with prosody and duration consistency learning. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 7523–7532, 2024

    A Ethics Discussion. The advancement of speech technologie...

  60. [60]

    It consists of over 1M face-cropped YouTube videos from more than 6k distinct identities, resulting in 1.6k hours of speech w/o paired transcription

    VoxCeleb2 [10] is a large-scale audio-visual dataset primarily designed for the speaker recognition task but applicable to various audio-visual processing domains. It consists of over 1M face-cropped YouTube videos from more than 6k distinct identities, resulting in 1.6k hours of speech w/o paired transcription. The dataset is characterized by high variability...