{"total":12,"items":[{"citing_arxiv_id":"2605.12310","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling","primary_cat":"cs.SD","submitted_at":"2026-05-12T15:57:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Poly-SVC converts singing voices from polyphonic recordings while keeping melody, lyrics, and harmonies by combining CQT-based pitch extraction with a conditional flow matching diffusion decoder.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"encompassing multiple languages, audio durations, and speaker counts. For speech data, we adopt the Emilia dataset [19], a 101k-hour multilingual speech corpus rich in expressive speaking styles, which provides a robust foundation for modeling natural speech. A small subset is sampled for regular voice conversion training. For singing data, we utilize m4singer [20], OpenSinger [21], OpenCpop [22], PopBuTFy [23] and VocalSet [24], which contain English and Chinese singing data with clean, single-melody vocals. The m4singer [20] additionally includes a subset with MIDI annotations. 
Notably, as no suitable open-source dataset provides ground-truth vocals paired with harmony, we simulate real-world scenarios by extracting vocal tracks directly from accompanied full-mix"},{"citing_arxiv_id":"2605.01638","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection","primary_cat":"cs.CV","submitted_at":"2026-05-02T22:56:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"others. OOD GenBuster-200K [84], VideoPainter/VPBench [8], and others. Sora [67], Pika [52], Gen3 [75], and others. Audio Set Multilingual LibriSpeech [71], PartialEdit [96], and others. Dia-1.6B [51], Kokoro-82M [35], Chatterbox [73], and others. OOD Common Voice [4], LlamaPartialSpoof [64], and others. Higgs-Audio [10], CosyVoice [27], Fish Speech [59], and others. AV-TH Set celebVHQ [101], Hallo3 [22], HDTF [97], MAVOS [20], TalkVid / TalkVid-bench [15], and others. AniPortrait [83], EchoMimic [16], Hallo2 [21], and others. OOD FakeAVCeleb [46], TalkingHead-1KH [82], and others. deepspeak-v2 [7], Ditto [56], ACTalker [36], and others. method in Omni-Fake-OOD appears in Omni-Fake-Set. 3.3. 
Overall Data Quality"},{"citing_arxiv_id":"2604.23742","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RTCFake: Speech Deepfake Detection in Real-Time Communication","primary_cat":"cs.SD","submitted_at":"2026-04-26T14:42:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RTCFake is the first large-scale dataset of real-time communication speech deepfakes paired with offline versions, paired with a phoneme-guided consistency learning method that improves cross-platform and noise-robust detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24794","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"V.O.I.C.E (Voice, Ownership, Identity, Control, Expression): Risk Taxonomy of Synthetic Voice Generation From Empirical Data","primary_cat":"cs.CR","submitted_at":"2026-04-25T23:17:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"V.O.I.C.E is a new taxonomy that organizes synthetic voice risks into five categories and shows how they interact with exposure, visibility, and legal context using empirical incident data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16211","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations","primary_cat":"cs.SD","submitted_at":"2026-04-17T16:20:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NVBench provides a standardized bilingual benchmark and evaluation protocol for assessing non-verbal vocalization generation, placement, and salience in text-to-speech 
systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11283","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey","primary_cat":"cs.CV","submitted_at":"2026-04-13T10:42:31+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Video-LLaMA 2 [55], LLaMA-Adapter [19], AudioVisual [56], AVicuna [57], , SEAMLESSM4T [58], Au-HuBERT [59], Artemis [60], PLLaVA [61], PG-Video-LLaVA [62], GroundingGPT [63], Vidi [64], REEF [65], Video-xl [66] The Expressive Performer LLM-driven TTS MegaTTS 2 [67], StyleTTS 2, CosyVoice [68], CosyVoice 2 [69], PromptTTS [70], PromptTTS 2 [71], Fish-Speech [72], HALL-E [73], VoxInstruct [74] Spark-TTS [75], InstructTTS [76], EMO-DPO [77], LLM-augmented Seed-TTS [78], MegaTTS 3 [79], F5-TTS [80], E2 TTS [81], YourTTS [82], XTTS [83], StyleTTS [84], Takin [85], DurIAN-E [86], FireREDTTS [87], FireREDTTS-2 [88], CoFi-Speech [89], VALL-E R [90], GenerSpeech [91], ControlSpeech [92], NaturalSpeech [93], NaturalSpeech2 [94], NaturalSpeech3 [95],"},{"citing_arxiv_id":"2604.00688","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-01T09:45:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on multilingual benchmarks using 581k hours of open 
data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[39] Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, et al. Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research, 25(97):1-52, 2024. [40] Resemble AI. Chatterbox-TTS. https://github.com/resemble-ai/chatterbox, 2025. GitHub repository. [41] Shijia Liao, Yuxuan Wang, Tianyu Li, Yifan Cheng, Ruoyi Zhang, Rongzhi Zhou, and Yijin Xing. Fish-speech: Leveraging large language models for advanced multilingual text-to-speech synthesis. arXiv preprint arXiv:2411.01156, 2024. [42] Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, et al."},{"citing_arxiv_id":"2604.02374","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative","primary_cat":"cs.SD","submitted_at":"2026-03-31T20:35:26+00:00","verdict":"ACCEPT","verdict_confidence":"HIGH","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RuASD is a comprehensive Russian speech anti-spoofing dataset featuring 37 synthesis systems and a robustness evaluation pipeline for real-world channel distortions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.02364","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Spoof Detectors Travel: Evaluation Across 66 Languages in the Low-Resource Language Spoofing Corpus","primary_cat":"cs.SD","submitted_at":"2026-03-02T20:11:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LRLspoof corpus and threshold-transfer evaluation demonstrate that spoof detection performance varies 
markedly across languages, identifying language as an independent domain shift factor.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.15621","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Qwen3-TTS Technical Report","primary_cat":"cs.SD","submitted_at":"2026-01-22T03:51:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.19414","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection","primary_cat":"eess.AS","submitted_at":"2025-10-22T09:34:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EchoFake is a new replay-aware dataset combining zero-shot TTS deepfakes and physical replay recordings to improve generalization of speech deepfake detection models over existing lab-focused datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.01284","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation","primary_cat":"cs.MM","submitted_at":"2025-09-30T21:03:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A single generative model uses twin DiT backbones with blockwise cross-attention and scaled-RoPE timing 
exchange to synthesize synchronized audio-video directly.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}