{"total":28,"items":[{"citing_arxiv_id":"2606.01016","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects","primary_cat":"cs.CL","submitted_at":"2026-05-31T05:13:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00507","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LaSR: Context-Aware Speech Recognition via Latent Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-30T03:44:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LaSR improves context-aware terminology recognition in speech LLMs by aligning latent CoT supervision on acoustic regions and introducing latent reasoning periods, shown on a new academic corpus to outperform standard fine-tuning without added latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28063","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts","primary_cat":"cs.SD","submitted_at":"2026-05-27T07:15:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PlanAudio introduces a unified autoregressive LLM framework with semantic latent chain-of-thought for generating composite 
speech and sound audio from free-form text, plus a new benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22083","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching","primary_cat":"cs.SD","submitted_at":"2026-05-21T07:22:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RobustSpeechFlow improves TTS alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations, lowering WER from 1.44 to 1.38 on Seed-TTS-eval and CER on ZERO500.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20830","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech","primary_cat":"eess.AS","submitted_at":"2026-05-20T07:21:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17488","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-17T14:56:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Omni-Customizer proposes an 
end-to-end framework using Omni-Context Fusion, Masked TTS Cross-Attention, Semantic-Anchored Multimodal RoPE, and specialized training curricula to achieve precise multimodal identity binding in joint audio-video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16964","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis","primary_cat":"eess.AS","submitted_at":"2026-05-16T12:37:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16026","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation","primary_cat":"cs.CL","submitted_at":"2026-05-15T15:01:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"S2ST-Omni 2 uses typology-informed hierarchical encoding, gated Dual-CTC, and typology-aware prompting to improve multilingual S2ST over flat-label baselines on CVSS-C, with gains in low-data regimes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17583","ref_index":100,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction 
Text-to-Speech","primary_cat":"cs.CV","submitted_at":"2026-05-14T13:31:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AgentSteerTTS proposes a multi-agent framework with adversarial disentanglement, dual-stream anchoring via acoustic prototypes, and fast-slow feedback to achieve intent-faithful expressive TTS for composite instructions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09568","ref_index":23,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations","primary_cat":"eess.AS","submitted_at":"2026-05-10T14:29:35+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"iFlytek1 Commercial 1,885 - 851 - 313 1,474 4,523 8.58% Houshan2 Commercial 1,678 - 989 - 1,502 - 4,169 7.91% ElevenLabs3 Commercial 1,794 539 1,221 340 833 618 5,345 10.14% Cartesia4 Commercial 1,609 - 809 - 933 868 4,219 8.00% OpenAI5 Commercial 876 - 1,415 - 717 1,187 4,195 7.96% Chatterbox6 Open-source 1,884 2,915 1,371 - 939 - 7,109 13.48% CosyVoice 3.0 [23] Open-source 1,770 2,688 1,098 348 1,371 - 7,275 13.80% Qwen3-TTS [24] Open-source 2,204 - 1,280 336 1,060 - 4,880 9.26% Fish Audio S2 Pro [25] Open-source 1,232 569 1,592 654 1,074 841 5,962 11.31% Piper7 Open-source 2,991 - 608 - - 1,450 5,049 9.58% Total - 17,923 6,711 11,234 1,678 8,742 6,438 52,726 100.00% 1 https://www.xfyun.cn/services/online tts 2 https://www."},{"citing_arxiv_id":"2605.09386","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot 
Text-to-Speech","primary_cat":"eess.AS","submitted_at":"2026-05-10T07:24:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"For sampling, the temperature is selected separately for each model on the validation set, whose pipeline is described in Appendix K. 6.3 Evaluations Evaluation datasets. We use the test-clean subset of LibriTTS [37] as the validation set, following [35] to construct prompt-target pairs. For testing, we use the test-en and test-zh of Seed-TTS [2] test sets, and the en and zh subsets of CosyVoice 3 [8] test sets. Objective evaluations. We use UTMOS [21] to evaluate naturalness. Following Seed-TTS [2], we use Whisper-large-v3 [20] to compute word error rate (WER) for English and Paraformer-zh [10] to compute character error rate (CER) for Chinese. 
For speaker similarity (SIM), we extract speaker embeddings using the WavLM-large [4] speaker verification model and compute the cosine similarity."},{"citing_arxiv_id":"2605.06765","ref_index":109,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing","primary_cat":"cs.CL","submitted_at":"2026-05-07T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02223","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Toward Fine-Grained Speech Inpainting Forensics:A Dataset, Method, and Metric for Multi-Region Tampering Localization","primary_cat":"cs.SD","submitted_at":"2026-05-04T04:54:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new dataset, iterative coarse-to-fine localization framework, and segment-level IoU F1 metric tackle the open problem of detecting multiple unknown word-level inpainted regions in speech.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26296","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding","primary_cat":"eess.AS","submitted_at":"2026-04-29T04:51:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Semantic priors from 
HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22225","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis","primary_cat":"cs.CL","submitted_at":"2026-04-24T05:01:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment on a 1,600-sample Mandarin test set while profiling six TTS paradigms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19221","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction","primary_cat":"cs.AI","submitted_at":"2026-04-21T08:24:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UAF is the first unified audio front-end LLM that turns multiple front-end tasks into one sequence prediction model processing streaming audio chunks and reference prompts to output semantic and control tokens for full-duplex interaction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17435","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech 
Translation","primary_cat":"cs.CL","submitted_at":"2026-04-19T13:34:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MoVE uses specialized LoRA expert adapters and a soft router to translate non-verbal vocalizations in S2ST, reproducing them in 76% of cases versus at most 14% for baselines while scoring highest on naturalness and emotional fidelity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22821","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use","primary_cat":"cs.SD","submitted_at":"2026-04-17T16:41:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16056","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AST: Adaptive, Seamless, and Training-Free Precise Speech Editing","primary_cat":"cs.SD","submitted_at":"2026-04-17T13:30:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency without training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15037","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Reactive to 
Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench","primary_cat":"cs.AI","submitted_at":"2026-04-16T14:06:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14548","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VoxSafeBench: Not Just What Is Said, but Who, How, and Where","primary_cat":"cs.SD","submitted_at":"2026-04-16T02:24:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"synthesize audio with CosyVoice3 [29], chosen for its strong speaker-identity preservation in both Chinese and English. For Tier 2, we manually curate salient and unambiguous paralinguistic or environmental cues so that benchmark failures are less likely to be driven by weak or noisy audio evidence. All synthesized audio is filtered with Whisper-large-v3 [30], discarding samples with WER > 5%. Full construction details, including the prompt-audio pool composition and quality control pipeline, are provided in Appendix D. Evaluation Models and Judging We curate a model set with demonstrated strong audio understanding. 
Guided by MMSU [1] and MMAU-Pro [2], two benchmarks that emphasize paralinguistic and"},{"citing_arxiv_id":"2604.10580","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark","primary_cat":"cs.CL","submitted_at":"2026-04-12T10:57:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CAST benchmark shows language models infer correct word stress from discourse context but TTS systems frequently fail to produce it in speech.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10065","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-11T07:07:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"darin with enhanced polyphone disambiguation-challenges and insights,\" arXiv preprint arXiv:2501.17790, 2025. [25] C.-J. Hsu, C.-S. Liu, M.-H. Chen, M. Chen, P.-C. Hsu, Y.-C. Chen, and D.-S. Shiu, \"The breeze 2 herd of models: Traditional chinese llms based on llama with vision-aware and function-calling capabilities,\" arXiv preprint arXiv:2501.13921, 2025. [26] C.-K. Yang et al., \"Building a taiwanese mandarin spoken language model: A first attempt,\" arXiv preprint arXiv:2411.07111, 2024. [27] C.-Y. 
Hsiao et al., \"Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models,\" in Interspeech 2025, 2025, pp. 3234-3238. [28] K.-H. Lu, Z. Chen, S.-W. Fu, H. Huang, B."},{"citing_arxiv_id":"2604.00688","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-01T09:45:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on multilingual benchmarks using 581k hours of open data.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"OmniVoice-Emilia surpasses all NAR baselines (F5-TTS [14], ZipVoice [16], MaskGCT [19]) trained on the same Emilia corpus, verifying the effectiveness of our proposed architecture. The final multilingual version OmniVoice model yields competitive overall performance across all benchmarks against baselines trained on unconstrained datasets (IndexTTS2 [5], CosyVoice3 [33], VoxCPM [10], Qwen3-TTS [42]), with particular advantages in speaker similarity and intelligibility. This demonstrates OmniVoice's strong capability on the two most high-resource languages. 
4.2 Evaluation on Multilingual Benchmarks We validate OmniVoice's multilingual capability on the 24-language MiniMax-Multilingual-24 benchmark and the 102-language FLEURS-Multilingual-102 benchmark."},{"citing_arxiv_id":"2603.19857","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts","primary_cat":"cs.SD","submitted_at":"2026-03-20T11:19:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FoleyDirector introduces structured temporal scripts and a fusion module to enable precise timing control in DiT-based video-to-audio generation while preserving audio fidelity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.12783","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise","primary_cat":"cs.IR","submitted_at":"2026-02-13T10:08:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.14234","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body","primary_cat":"cs.CV","submitted_at":"2025-12-16T09:41:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ViBES introduces a 
speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407, 2024. [23] Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117, 2024. [24] Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Xian Shi, Keyu An, et al. Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training. arXiv preprint arXiv:2505.17589, 2025. [25] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik"},{"citing_arxiv_id":"2509.17765","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Qwen3-Omni Technical Report","primary_cat":"cs.CL","submitted_at":"2025-09-22T13:26:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}