hub

Audiobox: Unified audio generation with natural language prompts

Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, et al · 2023 · arXiv 2312.15821

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

read on arXiv browse 11 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 method 1

citation-polarity summary

background 2 use method 1

representative citing papers

Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.

A unified perspective on fine-tuning and sampling with diffusion and flow models

stat.ML · 2026-04-30 · unverdicted · novelty 6.0

A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses with new Crooks and Jarzynski identities.

PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

eess.AS · 2026-04-10 · unverdicted · novelty 6.0

PS-TTS and PS-Comet TTS use isochrony via language model paraphrasing plus phonetic synchronization with DTW on vowel distances to achieve better lip-sync and semantic preservation in automated dubbing than standard TTS or voice actors on tested language pairs.

Adjoint Matching through the Lens of the Stochastic Maximum Principle in Optimal Control

math.OC · 2026-03-28 · unverdicted · novelty 6.0

Adjoint matching objectives derived from the Stochastic Maximum Principle have critical points satisfying HJB stationarity conditions for SOC problems with control-dependent drift and diffusion.

DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

cs.SD · 2025-09-07 · unverdicted · novelty 6.0

DreamAudio generates audio clips that incorporate user-specified personalized audio events from reference samples while remaining aligned with text prompts.

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

cs.SD · 2025-02-07 · unverdicted · novelty 6.0

Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

cs.CV · 2026-05-17 · unverdicted · novelty 5.0

Omni-Customizer proposes an end-to-end framework using Omni-Context Fusion, Masked TTS Cross-Attention, Semantic-Anchored Multimodal RoPE, and specialized training curricula to achieve precise multimodal identity binding in joint audio-video generation.

Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

cs.SD · 2026-05-01 · unverdicted · novelty 5.0

A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with competitive quality.

Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck

cs.SD · 2026-04-07 · unverdicted · novelty 5.0

A singing voice conversion system with boundary-aware information bottleneck and high-frequency augmentation achieves the best naturalness in SVCC2025 subjective tests while using less extra data than competitors.

Movie Gen: A Cast of Media Foundation Models

cs.CV · 2024-10-17 · unverdicted · novelty 5.0

A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

Flow Matching Guide and Code

cs.LG · 2024-12-09 · unverdicted · novelty 2.0

Flow Matching is a generative modeling framework with mathematical foundations, design choices, extensions, and open-source PyTorch code for applications like image and text generation.

citing papers explorer

Showing 11 of 11 citing papers.

Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation cs.CV · 2026-05-09 · unverdicted · none · ref 35
Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.
A unified perspective on fine-tuning and sampling with diffusion and flow models stat.ML · 2026-04-30 · unverdicted · none · ref 50
A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses with new Crooks and Jarzynski identities.
PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing eess.AS · 2026-04-10 · unverdicted · none · ref 3
PS-TTS and PS-Comet TTS use isochrony via language model paraphrasing plus phonetic synchronization with DTW on vowel distances to achieve better lip-sync and semantic preservation in automated dubbing than standard TTS or voice actors on tested language pairs.
Adjoint Matching through the Lens of the Stochastic Maximum Principle in Optimal Control math.OC · 2026-03-28 · unverdicted · none · ref 14
Adjoint matching objectives derived from the Stochastic Maximum Principle have critical points satisfying HJB stationarity conditions for SOC problems with control-dependent drift and diffusion.
DreamAudio: Customized Text-to-Audio Generation with Diffusion Models cs.SD · 2025-09-07 · unverdicted · none · ref 75
DreamAudio generates audio clips that incorporate user-specified personalized audio events from reference samples while remaining aligned with text prompts.
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound cs.SD · 2025-02-07 · unverdicted · none · ref 86
Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation cs.CV · 2026-05-17 · unverdicted · none · ref 52
Omni-Customizer proposes an end-to-end framework using Omni-Context Fusion, Masked TTS Cross-Attention, Semantic-Anchored Multimodal RoPE, and specialized training curricula to achieve precise multimodal identity binding in joint audio-video generation.
Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation cs.SD · 2026-05-01 · unverdicted · none · ref 29
A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with competitive quality.
Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck cs.SD · 2026-04-07 · unverdicted · none · ref 7
A singing voice conversion system with boundary-aware information bottleneck and high-frequency augmentation achieves the best naturalness in SVCC2025 subjective tests while using less extra data than competitors.
Movie Gen: A Cast of Media Foundation Models cs.CV · 2024-10-17 · unverdicted · none · ref 69
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
Flow Matching Guide and Code cs.LG · 2024-12-09 · unverdicted · none · ref 87
Flow Matching is a generative modeling framework with mathematical foundations, design choices, extensions, and open-source PyTorch code for applications like image and text generation.

Audiobox: Unified audio generation with natural language prompts

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer