AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.
hub
Audiobox: Unified audio generation with natural language prompts
20 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.
PlanAudio introduces a unified autoregressive LLM framework with semantic latent chain-of-thought for generating composite speech and sound audio from free-form text, plus a new benchmark.
AVBench is a benchmark for human-centric AV generation evaluation featuring ten fine-grained dimensions and preference-learned evaluators that output continuous probabilistic scores from binary decisions.
Appropriateness of TTS varies independently across domains while naturalness scores penalize stylized speech and reward spontaneity.
Unison presents a unified audio-video generation model that decouples speech and sound effects while using bidirectional forcing to synchronize with motion, claiming SOTA perceptual quality and alignment.
A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses with new Crooks and Jarzynski identities.
PS-TTS and PS-Comet TTS use isochrony via language model paraphrasing plus phonetic synchronization with DTW on vowel distances to achieve better lip-sync and semantic preservation in automated dubbing than standard TTS or voice actors on tested language pairs.
DreamAudio generates audio clips that incorporate user-specified personalized audio events from reference samples while remaining aligned with text prompts.
Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.
VoxCPM2 scales hierarchical continuous-latent speech modeling to 2B parameters and over 2M hours of multilingual data, unifying voice cloning, style control, and continuation in one backbone with open release.
UNISON introduces a unified latent diffusion framework with layer-wise LLM fusion and channel-mask task encoding for multiple speech and sound generation and editing tasks.
ImmersiveTTS proposes an environment-aware TTS system that integrates speech with environmental audio via multimodal diffusion transformer, joint attention, and domain-specific representation alignment, claiming superior naturalness and fidelity.
Omni-Customizer proposes an end-to-end framework using Omni-Context Fusion, Masked TTS Cross-Attention, Semantic-Anchored Multimodal RoPE, and specialized training curricula to achieve precise multimodal identity binding in joint audio-video generation.
A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with competitive quality.
A singing voice conversion system with boundary-aware information bottleneck and high-frequency augmentation achieves the best naturalness in SVCC2025 subjective tests while using less extra data than competitors.
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
EntangleCodec unifies semantic and acoustic audio tokenization via caption alignment and flow-matching decoding, reporting competitive reconstruction, +7.4% gains on MMAR understanding, and 0.6B-parameter ALMs surpassing 13B-parameter continuous baselines.
Flow Matching is a generative modeling framework with mathematical foundations, design choices, extensions, and open-source PyTorch code for applications like image and text generation.
citing papers explorer
-
AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation
AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.
-
HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis
HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.
-
Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts
PlanAudio introduces a unified autoregressive LLM framework with semantic latent chain-of-thought for generating composite speech and sound audio from free-form text, plus a new benchmark.
-
AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
AVBench is a benchmark for human-centric AV generation evaluation featuring ten fine-grained dimensions and preference-learned evaluators that output continuous probabilistic scores from binary decisions.
-
Is Natural Always Appropriate? Investigating Naturalness and Appropriateness Across Different Domains for TTS Evaluation
Appropriateness of TTS varies independently across domains while naturalness scores penalize stylized speech and reward spontaneity.
-
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
Unison presents a unified audio-video generation model that decouples speech and sound effects while using bidirectional forcing to synchronize with motion, claiming SOTA perceptual quality and alignment.
-
A unified perspective on fine-tuning and sampling with diffusion and flow models
A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses with new Crooks and Jarzynski identities.
-
PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
PS-TTS and PS-Comet TTS use isochrony via language model paraphrasing plus phonetic synchronization with DTW on vowel distances to achieve better lip-sync and semantic preservation in automated dubbing than standard TTS or voice actors on tested language pairs.
-
DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
DreamAudio generates audio clips that incorporate user-specified personalized audio events from reference samples while remaining aligned with text prompts.
-
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.
-
VoxCPM2 Technical Report
VoxCPM2 scales hierarchical continuous-latent speech modeling to 2B parameters and over 2M hours of multilingual data, unifying voice cloning, style control, and continuation in one backbone with open release.
-
UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion
UNISON introduces a unified latent diffusion framework with layer-wise LLM fusion and channel-mask task encoding for multiple speech and sound generation and editing tasks.
-
ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment
ImmersiveTTS proposes an environment-aware TTS system that integrates speech with environmental audio via multimodal diffusion transformer, joint attention, and domain-specific representation alignment, claiming superior naturalness and fidelity.
-
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
Omni-Customizer proposes an end-to-end framework using Omni-Context Fusion, Masked TTS Cross-Attention, Semantic-Anchored Multimodal RoPE, and specialized training curricula to achieve precise multimodal identity binding in joint audio-video generation.
-
Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation
A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with competitive quality.
-
Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck
A singing voice conversion system with boundary-aware information bottleneck and high-frequency augmentation achieves the best naturalness in SVCC2025 subjective tests while using less extra data than competitors.
-
Movie Gen: A Cast of Media Foundation Models
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
-
EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement
EntangleCodec unifies semantic and acoustic audio tokenization via caption alignment and flow-matching decoding, reporting competitive reconstruction, +7.4% gains on MMAR understanding, and 0.6B-parameter ALMs surpassing 13B-parameter continuous baselines.
-
Flow Matching Guide and Code
Flow Matching is a generative modeling framework with mathematical foundations, design choices, extensions, and open-source PyTorch code for applications like image and text generation.