VibeVoice Technical Report

Peng, Z · 2025 · arXiv 2508.19205

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

ATIR: Towards Audio-Text Interleaved Contextual Retrieval

cs.SD · 2026-04-22 · unverdicted · novelty 7.0

Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

eess.AS · 2026-04-20 · unverdicted · novelty 7.0

MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

cs.CV · 2025-12-16 · unverdicted · novelty 7.0

ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.

CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

cs.SD · 2026-06-03 · unverdicted · novelty 6.0

CleanCodec reframes audio tokenization as a selective information bottleneck to encode only perceptually important features at 12.5 tokens per second, outperforming prior codecs in efficiency, speaker similarity, and intelligibility.

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

eess.AS · 2026-05-29 · unverdicted · novelty 6.0 · 2 refs

SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.

StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

StreamChar decouples LLM-based orchestration from DiT denoising to achieve real-time long-horizon streaming character audio-video generation with reduced drift and misalignment.

Qwen3-TTS Technical Report

cs.SD · 2026-01-22 · unverdicted · novelty 6.0

Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.

UNIQUE: Universal Top-k Sparse Attention for Training-free Inference and Sparsity-aware Training

cs.CL · 2026-05-26 · unverdicted · novelty 4.0

UNIQUE enables efficient top-k sparse attention in LLMs by using a mean-plus-std page importance score and a soft-mask training approach, achieving up to 11.4x kernel speedup while preserving performance.

PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

cs.SD · 2026-05-26 · unverdicted · novelty 4.0

PilotTTS achieves lowest WER 1.50% (en) and CER 0.87% (zh) plus highest speaker similarity on Seed-TTS Eval using a Q-Former conditioned autoregressive architecture and a released multi-stage open data pipeline.

On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation

cs.SD · 2026-04-14

citing papers explorer

Showing 10 of 10 citing papers.

ATIR: Towards Audio-Text Interleaved Contextual Retrieval
cs.SD · 2026-04-22 · unverdicted · none · ref 36

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
eess.AS · 2026-04-20 · unverdicted · none · ref 25

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
cs.CV · 2025-12-16 · unverdicted · none · ref 88

CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding
cs.SD · 2026-06-03 · unverdicted · none · ref 22

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue
eess.AS · 2026-05-29 · unverdicted · none · ref 36 · 2 links

StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration
cs.CV · 2026-05-25 · unverdicted · none · ref 27

Qwen3-TTS Technical Report
cs.SD · 2026-01-22 · unverdicted · none · ref 15

UNIQUE: Universal Top-k Sparse Attention for Training-free Inference and Sparsity-aware Training
cs.CL · 2026-05-26 · unverdicted · none · ref 2

PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis
cs.SD · 2026-05-26 · unverdicted · none · ref 30

On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation
cs.SD · 2026-04-14 · unreviewed · ref 21
