Audiobox: Unified audio generation with natural language prompts

Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, et al. · 2023 · arXiv 2312.15821

20 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · method: 1

citation-polarity summary

background: 2 · use method: 1

representative citing papers

AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation

eess.AS · 2026-06-22 · unverdicted · novelty 7.0

AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.

HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis

eess.AS · 2026-06-08 · unverdicted · novelty 7.0

HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

cs.SD · 2026-05-27 · unverdicted · novelty 7.0

PlanAudio introduces a unified autoregressive LLM framework with semantic latent chain-of-thought for generating composite speech and sound audio from free-form text, plus a new benchmark.

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

cs.AI · 2026-05-23 · unverdicted · novelty 7.0

AVBench is a benchmark for human-centric AV generation evaluation featuring ten fine-grained dimensions and preference-learned evaluators that output continuous probabilistic scores from binary decisions.

Is Natural Always Appropriate? Investigating Naturalness and Appropriateness Across Different Domains for TTS Evaluation

eess.AS · 2026-06-30 · unverdicted · novelty 6.0

Appropriateness of TTS varies independently across domains while naturalness scores penalize stylized speech and reward spontaneity.

Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

cs.CV · 2026-05-09 · unverdicted · novelty 6.0 · 2 refs

Unison presents a unified audio-video generation model that decouples speech and sound effects while using bidirectional forcing to synchronize with motion, claiming SOTA perceptual quality and alignment.

A unified perspective on fine-tuning and sampling with diffusion and flow models

stat.ML · 2026-04-30 · unverdicted · novelty 6.0

A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses with new Crooks and Jarzynski identities.

PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

eess.AS · 2026-04-10 · unverdicted · novelty 6.0

PS-TTS and PS-Comet TTS use isochrony via language model paraphrasing plus phonetic synchronization with DTW on vowel distances to achieve better lip-sync and semantic preservation in automated dubbing than standard TTS or voice actors on tested language pairs.

DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

cs.SD · 2025-09-07 · unverdicted · novelty 6.0

DreamAudio generates audio clips that incorporate user-specified personalized audio events from reference samples while remaining aligned with text prompts.

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

cs.SD · 2025-02-07 · unverdicted · novelty 6.0

Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.

VoxCPM2 Technical Report

cs.SD · 2026-06-05 · unverdicted · novelty 5.0

VoxCPM2 scales hierarchical continuous-latent speech modeling to 2B parameters and over 2M hours of multilingual data, unifying voice cloning, style control, and continuation in one backbone with open release.

UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

eess.AS · 2026-05-29 · unverdicted · novelty 5.0

UNISON introduces a unified latent diffusion framework with layer-wise LLM fusion and channel-mask task encoding for multiple speech and sound generation and editing tasks.

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

eess.AS · 2026-05-29 · unverdicted · novelty 5.0

ImmersiveTTS proposes an environment-aware TTS system that integrates speech with environmental audio via multimodal diffusion transformer, joint attention, and domain-specific representation alignment, claiming superior naturalness and fidelity.

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

cs.CV · 2026-05-17 · unverdicted · novelty 5.0

Omni-Customizer proposes an end-to-end framework using Omni-Context Fusion, Masked TTS Cross-Attention, Semantic-Anchored Multimodal RoPE, and specialized training curricula to achieve precise multimodal identity binding in joint audio-video generation.

Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

cs.SD · 2026-05-01 · unverdicted · novelty 5.0

A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with competitive quality.

Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck

cs.SD · 2026-04-07 · unverdicted · novelty 5.0

A singing voice conversion system with boundary-aware information bottleneck and high-frequency augmentation achieves the best naturalness in SVCC2025 subjective tests while using less extra data than competitors.

Movie Gen: A Cast of Media Foundation Models

cs.CV · 2024-10-17 · unverdicted · novelty 5.0

A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement

cs.SD · 2026-06-01 · unverdicted · novelty 4.0

EntangleCodec unifies semantic and acoustic audio tokenization via caption alignment and flow-matching decoding, reporting competitive reconstruction, +7.4% gains on MMAR understanding, and 0.6B-parameter ALMs surpassing 13B-parameter continuous baselines.

Flow Matching Guide and Code

cs.LG · 2024-12-09 · unverdicted · novelty 2.0

Flow Matching is a generative modeling framework with mathematical foundations, design choices, extensions, and open-source PyTorch code for applications like image and text generation.

Adjoint Matching through the Lens of the Stochastic Maximum Principle in Optimal Control

math.OC · 2026-03-28

citing papers explorer

Showing 19 of 19 citing papers after filters.

AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation
eess.AS · 2026-06-22 · unverdicted · none · ref 26

HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis
eess.AS · 2026-06-08 · unverdicted · none · ref 63

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts
cs.SD · 2026-05-27 · unverdicted · none · ref 11

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
cs.AI · 2026-05-23 · unverdicted · none · ref 28

Is Natural Always Appropriate? Investigating Naturalness and Appropriateness Across Different Domains for TTS Evaluation
eess.AS · 2026-06-30 · unverdicted · none · ref 43

Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
cs.CV · 2026-05-09 · unverdicted · none · ref 35 · 2 links

A unified perspective on fine-tuning and sampling with diffusion and flow models
stat.ML · 2026-04-30 · unverdicted · none · ref 50

PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
eess.AS · 2026-04-10 · unverdicted · none · ref 3

DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
cs.SD · 2025-09-07 · unverdicted · none · ref 75

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
cs.SD · 2025-02-07 · unverdicted · none · ref 86

VoxCPM2 Technical Report
cs.SD · 2026-06-05 · unverdicted · none · ref 32

UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion
eess.AS · 2026-05-29 · unverdicted · none · ref 57

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment
eess.AS · 2026-05-29 · unverdicted · none · ref 63

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
cs.CV · 2026-05-17 · unverdicted · none · ref 52

Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation
cs.SD · 2026-05-01 · unverdicted · none · ref 29

Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck
cs.SD · 2026-04-07 · unverdicted · none · ref 7

Movie Gen: A Cast of Media Foundation Models
cs.CV · 2024-10-17 · unverdicted · none · ref 69

EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement
cs.SD · 2026-06-01 · unverdicted · none · ref 30

Flow Matching Guide and Code
cs.LG · 2024-12-09 · unverdicted · none · ref 87
