Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis

Akshita Gupta; Karren Dai Yang; Navdeep Jaitly; Richard He Bai; Tatiana Likhomanenko; Zakaria Aldeneh

arXiv: 2411.17690 · v3 · submitted 2024-11-26 · cs.MM · cs.CV · cs.SD · eess.AS

Pith reviewed 2026-05-23 17:19 UTC · model grok-4.3

keywords multimodal synchronization · decoder-only transformer · video-text-to-speech · positional encoding · modality ordering · temporal alignment · phoneme-level metrics

The pith

Both global sequential indexing and co-temporal ordered indexing enable strong synchronization of video, text, and speech in a unified decoder-only transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how decoder-only transformers coordinate modalities sampled at different rates by training on video-text-to-speech synthesis. It finds that two positional encoding approaches—assigning unique IDs across all tokens or reusing identical IDs for tokens that occur at the same time—both produce accurate temporal alignment without separate timestamp inputs. Text supplies the content needed for intelligible speech while video supplies timing and expressive cues, and the sequence in which modalities are presented creates a trade-off between strong results on training-like data and better handling of new domains. A new phoneme-level metric is introduced to measure timing precision more finely than standard frame-level scores.

Core claim

In the Visatronic decoder-only transformer trained on VoxCeleb2, global sequential indexing (unique position IDs across modalities) and co-temporal ordered indexing (identical IDs for temporally corresponding tokens) both achieve strong synchronization performance. Text ensures intelligibility while video supplies temporal cues and emotional expressiveness. Video-first ordering yields stronger in-domain performance, whereas text-first ordering generalizes more robustly to unseen domains. Diverse large-scale training supports transferable synchronization strategies, and the introduced TimeSync metric exposes per-phoneme timing errors missed by coarser measures.

What carries the argument

Positional encoding strategies of global sequential indexing and co-temporal ordered indexing that align tokens from heterogeneous modalities inside a single decoder-only transformer.
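The two indexing schemes can be illustrated with a minimal sketch. It assumes a text, video, speech ordering, integer frame rates (`video_fps`, `speech_fps`), a shared offset after the text tokens, and a rule that maps each speech frame to the video frame covering the same instant. These names, rates, and the mapping rule are illustrative assumptions, not the paper's implementation.

```python
from typing import List


def global_sequential_ids(n_text: int, n_video: int, n_speech: int) -> List[int]:
    """Global sequential indexing: every token gets its own position ID."""
    return list(range(n_text + n_video + n_speech))


def co_temporal_ids(
    n_text: int,
    n_video: int,
    n_speech: int,
    video_fps: int,
    speech_fps: int,
) -> List[int]:
    """Co-temporal ordered indexing: tokens at the same time share an ID.

    Assumption (hypothetical rule): speech frame j starts at j / speech_fps
    seconds and reuses the ID of the video frame that covers that instant.
    """
    if n_video <= 0:
        raise ValueError("at least one video frame is needed to anchor speech IDs")
    text_ids = list(range(n_text))
    offset = n_text  # temporal streams start counting after the text tokens
    video_ids = [offset + i for i in range(n_video)]
    speech_ids = []
    for j in range(n_speech):
        # Integer arithmetic avoids floating-point drift on long sequences.
        frame = min((j * video_fps) // speech_fps, n_video - 1)
        speech_ids.append(offset + frame)
    return text_ids + video_ids + speech_ids
```

With three text tokens, two video frames at 25 fps, and four speech frames at 50 fps, the global variant counts straight through from 0 to 8, while the co-temporal variant gives each pair of speech frames the ID of the video frame they overlap.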

If this is right

Text and video supply complementary signals that together improve intelligibility, timing, and expressiveness in generated speech.
Modality ordering produces a consistent trade-off between in-domain accuracy and cross-domain robustness.
Large-scale diverse training enables synchronization strategies to transfer across domains.
Phoneme-level metrics such as TimeSync diagnose timing misalignments that frame-level metrics overlook.
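The phoneme-level measurement in the last point can be sketched from the description in the Figure 5 caption: absolute difference between phoneme segment centers, in seconds, with silence ("sp") segments removed. The sketch assumes a forced aligner has already produced `(label, start, end)` segments for both signals, that dropping "sp" entries is enough (timestamps are not shifted), and that a plain mean is used to aggregate. All three are assumptions for illustration, and the function names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (phoneme label, start in s, end in s)


def phoneme_centers(segments: List[Segment], silence_label: str = "sp") -> List[Tuple[str, float]]:
    """Drop silence segments and return (label, center time) per phoneme."""
    return [
        (label, 0.5 * (start + end))
        for label, start, end in segments
        if label != silence_label
    ]


def timesync_errors(reference: List[Segment], generated: List[Segment]) -> List[float]:
    """Per-phoneme absolute difference between segment centers, in seconds."""
    ref = phoneme_centers(reference)
    gen = phoneme_centers(generated)
    if [label for label, _ in ref] != [label for label, _ in gen]:
        # Both alignments come from the same transcript, so labels should match.
        raise ValueError("phoneme sequences differ between reference and generated")
    return [abs(r - g) for (_, r), (_, g) in zip(ref, gen)]


def mean_timesync(reference: List[Segment], generated: List[Segment]) -> float:
    """Mean of the per-phoneme errors (the aggregation is an assumption)."""
    errors = timesync_errors(reference, generated)
    return sum(errors) / len(errors) if errors else 0.0
```

Keeping the per-phoneme list separate from the aggregate makes it possible to inspect where in an utterance the timing drifts, which is the diagnostic use the review attributes to the metric.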

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same indexing methods could simplify alignment in other multimodal decoder tasks such as video-conditioned captioning.
Choosing modality order at training time offers a practical lever for controlling generalization without changing the architecture.
If co-temporal indexing works without explicit timestamps, it reduces the need for additional metadata pipelines in deployed multimodal systems.

Load-bearing premise

The synchronization behaviors and modality-ordering trade-offs observed on the VoxCeleb2-trained VTTS task represent general multimodal synchronization mechanisms in decoder-only transformers.

What would settle it

Repeating the same experiments on a different multimodal generation task or dataset and observing that one indexing method collapses while the other remains effective would falsify the claim of general applicability.

Figures

Figures reproduced from arXiv: 2411.17690 by Akshita Gupta, Karren Dai Yang, Navdeep Jaitly, Richard He Bai, Tatiana Likhomanenko, Zakaria Aldeneh.

**Figure 1.** Visatronic overview. In addition to the existing text-to-speech (leftmost) and lips-to-speech (middle) tasks, we address a multimodal generative task (rightmost), video-text-to-speech (VTTS), where the model is conditioned on the video of talking people and the corresponding text transcriptions in order to generate speech. Visatronic is a unified decoder-only transformer that processes video v (grey), text t (grey),…

**Figure 2.** Video representation. Each video frame at time t is encoded via a VQ-VAE [54] into a downsampled spatial grid in $\mathbb{R}^{H' \times W' \times D}$. Each vector at location (h, w) is quantized to a discrete token using the learned codebook $C_v$ via $\ell_2$ similarity. These discrete tokens are embedded into $\mathbb{R}^{D'}$ and aggregated across the spatial grid to produce the final frame-level embedding input to the transformer. See Section 2.2 for…

**Figure 3.** Speech representation. We follow the speech discretization process from dMel [4]: each continuous mel-filterbank at time t extracted from the raw audio is mapped to a discrete value using a codebook of evenly spaced values. Afterwards, each discretized log mel-filterbank at time t is mapped through a learnable embedding layer, all representations for log mel-filterbanks at time t are stacked together an…

**Figure 4.** Input sequence for Visatronic. We encode all modalities into a discrete token space (see Figures 2 and 3), which is directly consumed by the decoder-only transformer. Each modality’s discrete representation is indicated by a colored square. Each row illustrates a different strategy for combining multimodal information to learn temporal alignment across modalities: (top) text precedes video, which is follow…

**Figure 5.** TimeSync. Visualization of the phoneme-level alignment used for computing TimeSync. Left: alignment in ground truth audio before (blue) and after (green) removing silence (“sp”) segments. Right: aligned phoneme positions between ground truth (green) and generated (red) audio, where TimeSync is computed as the absolute difference between segment centers (measured in seconds).

**Figure 6.** Qualitative alignment and spectrogram analysis. Left: Log mel-spectrogram comparison for TTS (top), GT (middle), and VTTS (VT-ordered, bottom). VTTS (VT-ordered) better matches GT’s timing (393 frames) and energy patterns, unlike TTS which overextends (445 frames). Right: TimeSync visualization of phoneme alignment for the same example. Ground truth and generated phoneme segment centers are plotted on x- a…

**Figure 7.** Human evaluation. Task description for the crowd-sourced raters to evaluate intelligibility, naturalness and synchronization of the ground truth or generated speech: speech is overlaid with the video and they are played together for the raters.

**Figure 8.** Human evaluation. Task description for the crowd-sourced raters to evaluate correspondence between facial expressions and emotions in speech for ground truth and generated speech: speech is overlaid with the video and they are played together for the raters.

**Figure 9.** Human evaluation. Task description for the crowd-sourced raters to evaluate how closely emotions in generated speech follow the ground truth.

**Figure 10.** Qualitative comparison of log mel-spectrograms. Visualization of generated log mel-spectrograms: Text-to-Speech (TTS, top), Ground Truth (GT, middle), and our Video-Text-to-Speech (VTTS, bottom). VTTS (VT-ordered) demonstrates better temporal alignment with GT (367 frames) compared to TTS (419 frames), showing the benefit of video conditioning for maintaining correct speech duration. The spectral patte…

**Figure 12.** Qualitative comparison of log mel-spectrograms. Visualization of generated log mel-spectrograms from different methods: Text-to-Speech (TTS, top), Ground Truth (GT, middle), and our Video-Text-to-Speech (VTTS, bottom). VTTS (VT-ordered) demonstrates better temporal alignment with GT (208 frames) compared to TTS (261 frames), showing the benefit of video conditioning for maintaining correct speech durat…

**Figure 15.** Alignment between phonemes. Temporal alignment visualization for the failure case corresponding to…

**Figure 16.** Distribution for TimeSync. We show the difference (left) and absolute difference (right) between ground truth and generated speech phoneme locations (location of the center of the phoneme segment) in time, measured in seconds. The ground truth text is used to align it to both ground truth and generated speech. For generated speech we use the models TTS, VTTS (VT-ordered) and VTTS (TV-ordered).
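The speech discretization summarized in the Figure 3 caption (each log mel-filterbank value mapped to a discrete value using a codebook of evenly spaced values) can be sketched as below. The value range, the number of bins, the clipping, and the nearest-value rule are assumptions chosen for illustration; the function names are hypothetical and do not come from dMel or the paper.

```python
from typing import List


def make_codebook(lo: float, hi: float, n_bins: int) -> List[float]:
    """Evenly spaced codebook values covering [lo, hi] (assumed range)."""
    if n_bins < 2 or hi <= lo:
        raise ValueError("need n_bins >= 2 and hi > lo")
    step = (hi - lo) / (n_bins - 1)
    return [lo + i * step for i in range(n_bins)]


def quantize_frame(log_mel_frame: List[float], codebook: List[float]) -> List[int]:
    """Map each filterbank channel to the index of the nearest codebook value.

    Values outside the codebook range are clipped to its ends first.
    """
    lo, hi = codebook[0], codebook[-1]
    step = (hi - lo) / (len(codebook) - 1)
    indices = []
    for value in log_mel_frame:
        clipped = min(max(value, lo), hi)
        indices.append(int(round((clipped - lo) / step)))
    return indices
```

Because the codebook is evenly spaced, the nearest index follows from one division and a rounding step, with no learned quantizer involved. Each resulting index would then be looked up in an embedding table, as the caption describes.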

Original abstract

Unified decoder-only transformers have shown promise for multimodal generation, yet the mechanisms by which they synchronize modalities with heterogeneous sampling rates remain underexplored. We investigate these mechanisms through video-text-to-speech (VTTS) synthesis, a controlled task requiring fine-grained temporal alignment between sparse text, video, and continuous speech. Using a unified decoder-only transformer, dubbed Visatronic, trained on VoxCeleb2, we study: (i) how modalities contribute complementary information, (ii) how positional encoding strategies enable synchronization across heterogeneous rates, (iii) how modality ordering shapes the trade-off between in-domain performance and cross-domain transfer, and (iv) how phoneme-level synchronization metrics provide diagnostic insight into per-phoneme timing errors. Our findings reveal that both "global sequential indexing" (unique position IDs across modalities) and "co-temporal ordered indexing" (identical IDs for temporally corresponding tokens) achieve strong synchronization performance, with co-temporal ordered indexing providing a simple mechanism without explicit timestamp metadata. Both text and video contribute complementary signals: text ensures intelligibility while video provides temporal cues and emotional expressiveness. Modality ordering reveals a consistent trade-off: video-first ordering achieves stronger in-domain performance while text-first ordering generalizes more robustly to unseen domains. Our findings also reveal that diverse large-scale training enables transferable synchronization strategies. To enable fine-grained analysis, we also introduce TimeSync, a phoneme-level metric that reveals temporal misalignments overlooked by frame-level metrics. These insights establish VTTS as a valuable testbed for understanding temporal synchronization in unified multimodal decoders.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows both global and co-temporal positional indexing work for synchronization in their VTTS decoder model, plus a clear ordering trade-off, but the results stay tied to this one speech task.

The main things to know are that both global sequential indexing and co-temporal ordered indexing achieve strong synchronization in the Visatronic decoder-only model, and that video-first ordering improves in-domain performance while text-first ordering generalizes better to held-out domains. They also introduce the TimeSync phoneme-level metric to catch timing errors that frame-level measures miss. Text and video play complementary roles, with text supporting intelligibility and video adding timing and emotional cues. The experiments use VoxCeleb2 with some unseen domains, which gives a controlled testbed for the VTTS task. The direct comparison of the two indexing approaches is useful because co-temporal indexing avoids needing explicit timestamps, and the ordering trade-off is documented consistently. These are the concrete empirical contributions.

The soft spot is scope. Everything rests on a single decoder-only transformer trained for video-text-to-speech on speech-centric data. No tests appear on other tasks, modalities, or architectures, so it is hard to tell whether the indexing behaviors or the ordering effects come from general properties of unified decoders or from the specific sampling rates and causal attention in this VTTS setup. The claim that diverse large-scale training enables transferable strategies is stated but not backed by broad ablations.

This paper is mainly for people building or analyzing decoder-only multimodal generators who need practical guidance on positional encodings and input ordering for temporal alignment. A reader focused on speech synthesis or similar generation tasks would get the most out of the indexing and metric details. It deserves peer review because the empirical comparisons and the new metric are specific enough to be checked and potentially useful, even if the generality needs more evidence.

Referee Report

2 major / 2 minor

Summary. The paper investigates mechanisms of multimodal synchronization in unified decoder-only transformers via a video-text-to-speech (VTTS) task using the Visatronic model trained on VoxCeleb2. It examines how modalities contribute complementary information (text for intelligibility, video for temporal/emotional cues), compares positional encoding strategies (global sequential indexing vs. co-temporal ordered indexing), analyzes modality ordering trade-offs (video-first for in-domain performance vs. text-first for cross-domain generalization), and introduces the TimeSync phoneme-level metric to diagnose temporal misalignments. The central claims are that both indexing approaches enable strong synchronization without explicit timestamps and that diverse training yields transferable strategies.

Significance. If the empirical findings on indexing and ordering hold under broader validation, the work offers concrete design insights for handling heterogeneous sampling rates in decoder-only multimodal models and introduces a useful fine-grained diagnostic (TimeSync) that improves on frame-level metrics. The paper's strength lies in its controlled VTTS testbed and direct measurement of synchronization behaviors rather than derived claims.

major comments (2)

[Abstract] The assertion that the reported synchronization behaviors and modality-ordering trade-offs reveal 'general mechanisms' for decoder-only multimodal transformers is load-bearing for the paper's contribution but rests on experiments limited to VTTS on VoxCeleb2 (with speech-centric held-out domains); no ablations on alternative tasks, architectures, or non-speech modalities are described, leaving open whether results are specific to this setup's sampling rates and causal attention.
[Abstract] The abstract states clear experimental findings on indexing performance and modality contributions, yet provides no quantitative results, statistical tests, ablation controls, or details on architecture/training procedure, making it impossible to assess whether reported differences are robust or influenced by post-hoc metric/data choices.

minor comments (2)

The description of 'global sequential indexing' and 'co-temporal ordered indexing' would benefit from explicit pseudocode or a small diagram to clarify token-to-ID mapping across modalities.
Clarify whether TimeSync is evaluated with statistical significance testing across phonemes or speakers, as this would strengthen the diagnostic claims.
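One way the significance question in the last comment could be approached is a paired bootstrap over per-utterance TimeSync scores from two systems. The sketch below is not a procedure described in the paper; the function name, the choice of utterances as the resampling unit, and the percentile interval are all assumptions made for illustration.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    errors_a: List[float],
    errors_b: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean paired difference (a - b) with a percentile bootstrap interval.

    errors_a[i] and errors_b[i] are scores of two systems on the same item
    (assumed here to be one utterance, since phonemes within an utterance
    are unlikely to be independent).
    """
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return observed, lower, upper
```

If the resulting interval excludes zero, the difference between the two systems is unlikely to be a resampling artifact at the chosen level. Resampling by speaker instead of by utterance would be the stricter choice if speaker effects are a concern.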

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the scope of our claims and the abstract's level of detail. We address each point below with proposed revisions to the manuscript.

Referee: [Abstract] The assertion that the reported synchronization behaviors and modality-ordering trade-offs reveal 'general mechanisms' for decoder-only multimodal transformers is load-bearing for the paper's contribution but rests on experiments limited to VTTS on VoxCeleb2 (with speech-centric held-out domains); no ablations on alternative tasks, architectures, or non-speech modalities are described, leaving open whether results are specific to this setup's sampling rates and causal attention.

Authors: We agree that the experiments are limited to the VTTS task on VoxCeleb2 and that broader validation would be required to claim fully general mechanisms across decoder-only multimodal models. The manuscript uses VTTS as a controlled testbed precisely because of its heterogeneous sampling rates and fine-grained alignment demands, but we will revise the abstract to qualify the language. We will change the final sentence from 'These insights establish VTTS as a valuable testbed for understanding temporal synchronization in unified multimodal decoders' to 'These insights, demonstrated in the VTTS setting, provide concrete design considerations for handling heterogeneous sampling rates in decoder-only multimodal models.' This removes the load-bearing generality claim while preserving the contribution.

revision: yes

Referee: [Abstract] The abstract states clear experimental findings on indexing performance and modality contributions, yet provides no quantitative results, statistical tests, ablation controls, or details on architecture/training procedure, making it impossible to assess whether reported differences are robust or influenced by post-hoc metric/data choices.

Authors: Abstracts are conventionally high-level, but we accept that including key quantitative anchors would improve evaluability. In revision we will add concise references to core results (e.g., 'co-temporal ordered indexing matches global sequential indexing on TimeSync while improving cross-domain generalization under text-first ordering') and note that full architecture, training, and statistical details appear in Sections 3–5. Because space constraints prevent exhaustive ablation descriptions in the abstract itself, we treat this as a partial revision focused on the most salient metrics.

revision: partial

Circularity Check

0 steps flagged

No circularity; all claims are direct empirical measurements from models trained on the VoxCeleb2 VTTS task.

The paper reports experimental results from training a unified decoder-only transformer (Visatronic) on VoxCeleb2 for video-text-to-speech synthesis. It compares positional encoding strategies (global sequential vs. co-temporal ordered indexing), modality orderings (video-first vs. text-first), and measures contributions via metrics including a new phoneme-level TimeSync. All stated findings (synchronization performance, complementarity of text/video, in-domain vs. generalization trade-offs) are presented as outcomes of these trained-model evaluations rather than any derivation, fitted-parameter prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked that reduce to the inputs by construction. The work is self-contained against external benchmarks as a set of controlled ablation experiments.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claims depend on the empirical outcomes of training a large neural network whose weights are fitted to the VoxCeleb2 dataset; the work assumes the decoder-only transformer architecture possesses sufficient capacity to learn cross-modal temporal alignments from data alone.

free parameters (2)

neural network weights: All model parameters are optimized during training on the dataset and directly determine the observed synchronization behavior.
training hyperparameters: Learning rate, batch size, and optimization choices are selected to produce the reported performance.

axioms (1)

domain assumption: A decoder-only transformer can learn to align heterogeneous modalities when trained on paired video-text-speech data. This assumption underpins the entire experimental program and is not derived within the paper.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear (relation between the paper passage and the cited Recognition theorem)
Paper passage: both 'global sequential indexing' (unique position IDs across modalities) and 'co-temporal ordered indexing' (identical IDs for temporally corresponding tokens) achieve strong synchronization performance

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear (relation between the paper passage and the cited Recognition theorem)
Paper passage: TimeSync phoneme-level metric... VTTS as a valuable testbed for understanding temporal synchronization in unified multimodal decoders

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 10 internal anchors

[1] Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. LRS3-TED: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496, 2018.

[2] Hassan Akbari, Himani Arora, Liangliang Cao, and Nima Mesgarani. Lip2audspec: Speech reconstruction from silent lip movements video. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2516–2520. IEEE, 2018.

[3] He Bai, Renjie Zheng, Junkun Chen, Mingbo Ma, Xintong Li, and Liang Huang. A³T: Alignment-aware acoustic and text pretraining for speech synthesis and editing. In Proceedings of the 39th International Conference on Machine Learning, pages 1399–1411. PMLR, 2022.

[4] He Bai, Tatiana Likhomanenko, Ruixiang Zhang, Zijin Gu, Zakaria Aldeneh, and Navdeep Jaitly. dMel: Speech tokenization made simple. arXiv preprint arXiv:2407.15835, 2024.

[5] Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al. Audiolm: a language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.

[6] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about Kinetics-600. arXiv preprint arXiv:1808.01340, 2018.

[7] Mingjian Chen, Xu Tan, Bohan Li, Yanqing Liu, Tao Qin, Sheng Zhao, and Tie-Yan Liu. Adaspeech: Adaptive text to speech for custom voice. arXiv preprint arXiv:2103.00993, 2021.

[8] Qi Chen, Mingkui Tan, Yuankai Qi, Jiaqiu Zhou, Yuanqing Li, and Qi Wu. V2c: Visual voice cloning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21242–21251, 2022.

[9] Jeongsoo Choi, Joanna Hong, and Yong Man Ro. Diffv2s: Diffusion-based video-to-speech synthesis with vision-guided speaker embedding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7812–7821, 2023.

[10] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. VoxCeleb2: Deep Speaker Recognition. In Proc. Interspeech 2018, pages 1086–1090, 2018.

[11] Gaoxiang Cong, Liang Li, Yuankai Qi, Zheng-Jun Zha, Qi Wu, Wenyu Wang, Bin Jiang, Ming-Hsuan Yang, and Qingming Huang. Learning to dub movies via hierarchical prosody models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14687–14697, 2023.

[12] Gaoxiang Cong, Yuankai Qi, Liang Li, Amin Beheshti, Zhedong Zhang, Anton van den Hengel, Ming-Hsuan Yang, Chenggang Yan, and Qingming Huang. Styledubber: towards multi-scale style learning for movie dubbing. arXiv preprint arXiv:2402.12636, 2024.

[13] Alexandre Defossez, Gabriel Synnaeve, and Yossi Adi. Real time speech enhancement in the waveform domain. In Interspeech, 2020.

[14] Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
[15] Ariel Ephrat and Shmuel Peleg. Vid2speech: speech reconstruction from silent video. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5095–5099. IEEE, 2017.

[16] Shunsuke Goto, Kotaro Onishi, Yuki Saito, Kentaro Tachibana, and Koichiro Mori. Face2speech: Towards multi-speaker text-to-speech synthesis using an embedding vector predicted from a face image. In INTERSPEECH, pages 1321–1325, 2020.

[17] Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, and Yossi Adi. Revise: Self-supervised speech resynthesis with visual input for universal and generalized speech regeneration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18795–18805, 2023.

[18] Chenxu Hu, Qiao Tian, Tingle Li, Wang Yuping, Yuxuan Wang, and Hang Zhao. Neural dubber: Dubbing for videos according to scripts. Advances in Neural Information Processing Systems, 34:16582–16595, 2021.

[19] Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, and Zhou Zhao. Generspeech: Towards style transfer for generalizable out-of-domain text-to-speech. Advances in Neural Information Processing Systems, 35:10970–10983, 2022.

[20] Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Advances in Neural Information Processing Systems, 31, 2018.

[21] Sunghee Jung and Hoirin Kim. Neural voice cloning with a few low-quality samples. arXiv preprint arXiv:2006.06940, 2020.

[22] Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. Glow-tts: A generative flow for text-to-speech via monotonic alignment search. Advances in Neural Information Processing Systems, 33:8067–8077, 2020.

[23] Minsu Kim, Joanna Hong, and Yong Man Ro. Lip to speech synthesis with visual context attentional gan. Advances in Neural Information Processing Systems, 34:2758–2770, 2021.

[24] Minsu Kim, Joanna Hong, and Yong Man Ro. Lip-to-speech synthesis in the wild with multi-task learning. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.

[25] Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Joshua V. Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig ...

[26] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020.

[27] Jiyoung Lee, Joon Son Chung, and Soo-Whan Chung. Imaginary voice: Face-styled diffusion model for text-to-speech. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.

[28] Ji-Hyun Lee, Sang-Hoon Lee, Ji-Hoon Kim, and Seong-Whan Lee. Pvae-tts: Adaptive text-to-speech via progressive style adaptation. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6312–6316. IEEE, 2022.

[29] Sang-gil Lee, Heeseung Kim, Chaehun Shin, Xu Tan, Chang Liu, Qi Meng, Tao Qin, Wei Chen, Sungroh Yoon, and Tie-Yan Liu. Priorgrad: Improving conditional denoising diffusion models with data-dependent adaptive prior. arXiv preprint arXiv:2106.06406, 2021.
[30]

Multi-spectrogan: High-diversity and high-fidelity spectrogram generation with adversarial style combination for speech synthesis

Sang-Hoon Lee, Hyun-Wook Yoon, Hyeong-Rae Noh, Ji-Hoon Kim, and Seong-Whan Lee. Multi-spectrogan: High-diversity and high-fidelity spectrogram generation with adversarial style combination for speech synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021

[31]

M3tts: Multi-modal text-to-speech of multi-scale style control for dubbing

Yan Liu, Li-Fang Wei, Xinyuan Qian, Tian-Hao Zhang, Song-Lu Chen, and Xu-Cheng Yin. M3tts: Multi-modal text-to-speech of multi-scale style control for dubbing. Pattern Recognition Letters, 179:158–164, 2024

[32]

Voxtlm: Unified decoder-only models for consolidating speech recognition, synthesis and speech, text continuation tasks

Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, and Shinji Watanabe. Voxtlm: Unified decoder-only models for consolidating speech recognition, synthesis and speech, text continuation tasks. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 13326–13330. IEEE, 2024

[33]

Mm1: Methods, analysis & insights from multimodal llm pre-training, 2024

Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier Biard, Sam Dodge, Philipp Dufter, Bowen Zhang, Dhruti Shah, Xianzhi Du, Futang Peng, Haotian Zhang, Floris Weers, Anton Belyi, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu He, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Mark Lee, Zirui Wang...

[34]

Matcha-tts: A fast tts architecture with conditional flow matching

Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter. Matcha-tts: A fast tts architecture with conditional flow matching. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 11341–11345. IEEE, 2024

[35]

Meta-stylespeech: Multi-speaker adaptive text-to-speech generation

Dongchan Min, Dong Bok Lee, Eunho Yang, and Sung Ju Hwang. Meta-stylespeech: Multi-speaker adaptive text-to-speech generation. In International Conference on Machine Learning, pages 7748–7759. PMLR, 2021

[36]

Svts: scalable video-to-speech synthesis

Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Björn W Schuller, and Maja Pantic. Svts: scalable video-to-speech synthesis. arXiv preprint arXiv:2205.02058, 2022

[37]

End-to-end video-to-speech synthesis using generative adversarial networks

Rodrigo Mira, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Björn W Schuller, and Maja Pantic. End-to-end video-to-speech synthesis using generative adversarial networks. IEEE transactions on cybernetics, 53(6):3454–3466, 2022

[38]

Grad-tts: A diffusion probabilistic model for text-to-speech

Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. Grad-tts: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning, pages 8599–8608. PMLR, 2021

[39]

Learning individual speaking styles for accurate lip to speech synthesis

KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. Learning individual speaking styles for accurate lip to speech synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13796–13805, 2020

[40]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR, 2023

[41]

Fastspeech: Fast, robust and controllable text to speech

Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech. Advances in neural information processing systems, 32, 2019

[42]

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. Mmau: A massive multi-task audio understanding and reasoning benchmark. arXiv preprint arXiv:2410.19168, 2024

[43]

Natural tts synthesis by conditioning wavenet on mel spectrogram predictions

Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 4779–4783. IEEE, 2018

[44]

Learning lip-based audio-visual speaker embeddings with av-hubert

Bowen Shi, Abdelrahman Mohamed, and Wei-Ning Hsu. Learning lip-based audio-visual speaker embeddings with av-hubert. arXiv preprint arXiv:2205.07180, 2022

[45]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

[46]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

[47]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[48]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

[49]

Deep neural networks for small footprint text-dependent speaker verification

Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez. Deep neural networks for small footprint text-dependent speaker verification. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 4052–4056. IEEE, 2014

[50]

Residual-guided personalized speech synthesis based on face image

Jianrong Wang, Zixuan Wang, Xiaosheng Hu, Xuewei Li, Qiang Fang, and Li Liu. Residual-guided personalized speech synthesis based on face image. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4743–

[51]

VioLA: Unified codec language models for speech recognition, synthesis, and translation

Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, and Furu Wei. Viola: Unified codec language models for speech recognition, synthesis, and translation. arXiv preprint arXiv:2305.16107, 2023

[52]

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215, 2025

[53]

Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram

Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6199–6203. IEEE, 2020

[54]

VideoGPT: Video Generation using VQ-VAE and Transformers

Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021

[55]

Lipvoicer: Generating speech from silent videos guided by lip reading

Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot, and Ethan Fetaya. Lipvoicer: Generating speech from silent videos guided by lip reading. In The Twelfth International Conference on Learning Representations, 2024

[56]

The htk book

Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, et al. The htk book. Cambridge university engineering department, 3(175):12, 2002

[57]

Scaling autoregressive models for content-rich text-to-image generation, 2022

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022

[58]

Statistical parametric speech synthesis

Heiga Zen, Keiichi Tokuda, and Alan W Black. Statistical parametric speech synthesis. speech communication, 51(11):1039–1064, 2009

[59]

From speaker to dubber: movie dubbing with prosody and duration consistency learning

Zhedong Zhang, Liang Li, Gaoxiang Cong, Haibing Yin, Yuhan Gao, Chenggang Yan, Anton van den Hengel, and Yuankai Qi. From speaker to dubber: movie dubbing with prosody and duration consistency learning. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 7523–7532, 2024

A Ethics Discussion

The advancement of speech technologie...

[60]

It consists of over 1M face-cropped YouTube videos from more than 6k distinct identities, resulting in 1.6k hours of speech without paired transcription

VoxCeleb2 [10] is a large-scale audio-visual dataset primarily designed for the speaker recognition task but applicable to various audio-visual processing domains. It consists of over 1M face-cropped YouTube videos from more than 6k distinct identities, resulting in 1.6k hours of speech without paired transcription. The dataset is characterized by high variability...

