{"total":14,"items":[{"citing_arxiv_id":"2606.09098","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis","primary_cat":"eess.AS","submitted_at":"2026-06-08T06:49:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07080","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"dots.tts Technical Report","primary_cat":"cs.SD","submitted_at":"2026-06-05T09:19:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"dots.tts reports SOTA benchmark results on Seed-TTS-Eval and other tests via continuous latent-space autoregressive modeling with three listed innovations and code release.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06928","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VoxCPM2 Technical Report","primary_cat":"cs.SD","submitted_at":"2026-06-05T05:43:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VoxCPM2 scales hierarchical continuous-latent speech modeling to 2B parameters and over 2M hours of multilingual data, unifying voice cloning, style control, and continuation in one backbone with open 
release.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06357","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation","primary_cat":"cs.SD","submitted_at":"2026-06-04T16:25:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"F3-Tokenizer adapts audio autoencoder latents with noise-regularized bottleneck (channel normalization and stochastic perturbation) and a representation encoder (RQ-MTP plus frozen-LLM supervision) to support both high-dimensional understanding representations and normalized continuous generation ta","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04418","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding","primary_cat":"cs.SD","submitted_at":"2026-06-03T03:56:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CleanCodec reframes audio tokenization as a selective information bottleneck to encode only perceptually important features at 12.5 tokens per second, outperforming prior codecs in efficiency, speaker similarity, and intelligibility.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30993","ref_index":36,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and 
Dialogue","primary_cat":"eess.AS","submitted_at":"2026-05-29T08:27:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27740","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UNIQUE: Universal Top-k Sparse Attention for Training-free Inference and Sparsity-aware Training","primary_cat":"cs.CL","submitted_at":"2026-05-26T22:32:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"UNIQUE enables efficient top-k sparse attention in LLMs by using a mean-plus-std page importance score and a soft-mask training approach, achieving up to 11.4x kernel speedup while preserving performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27258","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis","primary_cat":"cs.SD","submitted_at":"2026-05-26T16:36:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"PilotTTS achieves lowest WER 1.50% (en) and CER 0.87% (zh) plus highest speaker similarity on Seed-TTS Eval using a Q-Former conditioned autoregressive architecture and a released multi-stage open data 
pipeline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25659","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration","primary_cat":"cs.CV","submitted_at":"2026-05-25T10:04:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StreamChar decouples LLM-based orchestration from DiT denoising to achieve real-time long-horizon streaming character audio-video generation with reduced drift and misalignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20267","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ATIR: Towards Audio-Text Interleaved Contextual Retrieval","primary_cat":"cs.SD","submitted_at":"2026-04-22T07:11:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17958","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech","primary_cat":"eess.AS","submitted_at":"2026-04-20T08:39:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic 
controls.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"CoRR abs/2402.01912 (2024). arXiv:2402.01912 doi:10.48550/ARXIV.2402.01912 [24] Ruskin Raj Manku, Yuzhi Tang, Xingjian Shi, Mu Li, and Alex Smola. 2025. EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge. CoRR abs/2505.23009 (2025). arXiv:2505.23009 doi:10.48550/ARXIV.2505.23009 [25] Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, and Furu Wei. 2025. VibeVoice Technical Report. CoRR abs/2508.19205 (2025). arXiv:2508.19205 doi:10.48550/ARXIV.2508.19205 [26] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever."},{"citing_arxiv_id":"2604.12383","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation","primary_cat":"cs.SD","submitted_at":"2026-04-14T07:17:55+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.15621","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Qwen3-TTS Technical Report","primary_cat":"cs.SD","submitted_at":"2026-01-22T03:51:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of 
data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.14234","ref_index":88,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body","primary_cat":"cs.CV","submitted_at":"2025-12-16T09:41:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In CVPR, 2024. [86] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195-4205, 2023. [87] Ziqiao Peng, Yihao Luo, Yue Shi, Hao Xu, Xiangyu Zhu, Hongyan Liu, Jun He, and Zhaoxin Fan. Selftalk: A self-supervised commutative training diagram to comprehend 3d talking faces. In ACMMM, pages 5292-5301, 2023. [88] Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, et al. Vibevoice technical report. arXiv preprint arXiv:2508.19205, 2025. [89] Xingqun Qi, Yatian Wang, Hengyuan Zhang, Jiahao Pan, Wei Xue, Shanghang Zhang, Wenhan Luo, Qifeng Liu, and Yike Guo. Co3 gesture: Towards coherent concurrent"}],"limit":50,"offset":0}