{"total":68,"items":[{"citing_arxiv_id":"2606.11167","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models","primary_cat":"cs.CL","submitted_at":"2026-06-09T17:46:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A multi-axis RL alignment technique improves pause handling, turn-taking, backchanneling, and interruption response in full-duplex spoken dialogue models by optimizing axis-specific rewards derived from human audio segments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04418","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding","primary_cat":"cs.SD","submitted_at":"2026-06-03T03:56:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CleanCodec reframes audio tokenization as a selective information bottleneck to encode only perceptually important features at 12.5 tokens per second, outperforming prior codecs in efficiency, speaker similarity, and intelligibility.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02739","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement","primary_cat":"cs.SD","submitted_at":"2026-06-01T18:05:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"EntangleCodec unifies semantic and acoustic audio tokenization via caption alignment and flow-matching decoding, reporting competitive reconstruction, +7.4% gains on MMAR understanding, and 0.6B-parameter ALMs surpassing 13B-parameter continuous baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01016","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects","primary_cat":"cs.CL","submitted_at":"2026-05-31T05:13:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00851","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning","primary_cat":"cs.SD","submitted_at":"2026-05-30T18:53:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sympatheia introduces a continuous affect-conditioned speech dialogue model and the Sympatheia-18k synthetic dataset, showing improved emotional appropriateness over baselines when speech cues are limited.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00324","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLMs Need Encoders for Semantic IDs Too","primary_cat":"cs.IR","submitted_at":"2026-05-29T20:01:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PrefixMem encoder for Semantic IDs improves deepest-level accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative on Pinterest data across LLM families.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30256","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:20:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29859","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables","primary_cat":"eess.AS","submitted_at":"2026-05-28T12:39:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MELD jointly optimizes a discrete latent variable encoder on mel-spectrograms with an autoregressive speech LM, claiming gains over codec and mel baselines on zero-shot TTS/STT plus fewer autoregressive artifacts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23373","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AffectCodec: Emotion-Preserving Neural Speech Codec with Block-Diagonal Residual FSQ","primary_cat":"cs.SD","submitted_at":"2026-05-22T08:37:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AffectCodec applies block-diagonal projections in residual FSQ to explicitly allocate bits to emotion and acoustic subspaces, combined with emotion conditioning, yielding better emotion preservation at low bitrates with competitive acoustic quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20755","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action","primary_cat":"eess.AS","submitted_at":"2026-05-20T05:54:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20519","ref_index":22,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Codec-Robust Attacks on Audio LLMs","primary_cat":"cs.SD","submitted_at":"2026-05-19T21:39:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20356","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models","primary_cat":"cs.CL","submitted_at":"2026-05-19T18:11:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Full-duplex SDMs show strong representational synchronization that peaks near zero lag and degrades with noise, with internal states encoding anticipatory turn-taking cues detectable ahead of time.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19541","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning","primary_cat":"cs.SD","submitted_at":"2026-05-19T08:40:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ClariCodec achieves 3.55% WER on LibriSpeech test-clean at 300 bps by RL fine-tuning the encoder for intelligibility, yielding a 23% relative WER reduction while preserving perceptual quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20266","ref_index":117,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook","primary_cat":"cs.SD","submitted_at":"2026-05-18T20:21:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"-✗✓ ✓ SpeechVerse [113] May 2024 Flan-T5-XL 3B EN Contin. -✗✓ ✓ GAMA [114] Jun 2024 LLaMA2-7B 7B EN Contin. 2.2M audio-caption pairs✗✓ ✓ Qwen2-Audio [18] Jul 2024 Qwen-7B 7B Multi. Contin. 520K Hrs audio✗✓ ✓ FunAudioLLM [115] Jul 2024 - - Multi. - -✗✓ ✓ Mini-Omni [116] Aug 2024 Qwen2-0.5B 0.5B - Discrete 8K Hrs speech + 2M text examples✓ ✓ ✓ Moshi [117] Sep 2024 Helium 7B EN Discrete 7M Hrs audio + 2.1T text tokens✓ ✓ ✓ LLaMA-Omni [118] Sep 2024 Llama-3.1-8B-Instruct 8B EN Contin. -✗✓ ✓ Parrot [119] Sep 2024 Llama 3.1-8B 8B EN Discrete 74,554 Hrs audio✓✗✓ OmniFlatten [120] Oct 2024 Qwen2-0.5B 0.5B EN, CN Discrete -✓ ✓ ✓ IntrinsicVoice [121] Oct 2024 Qwen2-7B-Instruct 7B - Discrete 20K Hrs audio✗✓ ✓"},{"citing_arxiv_id":"2605.18613","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SAME: A Semantically-Aligned Music Autoencoder","primary_cat":"cs.SD","submitted_at":"2026-05-18T16:23:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SAME is a semantically regularized transformer autoencoder for music that delivers 4096x compression with open-weights release of large and small variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17085","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Taming Audio VAEs via Target-KL Regularization","primary_cat":"cs.SD","submitted_at":"2026-05-16T17:01:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15442","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mind the Gap: Impact of Synthetic Conversational Data on Multi-Talker ASR and Speaker Diarization","primary_cat":"eess.AS","submitted_at":"2026-05-14T21:53:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Task-dependent simulation strategies for synthetic conversational data allow synthetic-only training to approach real-data baselines for multi-talker ASR and diarization, with mixing yielding further gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14591","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Privacy Auditing with Zero (0) Training Run","primary_cat":"cs.CR","submitted_at":"2026-05-14T09:00:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14555","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis","primary_cat":"cs.SD","submitted_at":"2026-05-14T08:32:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Break-the-Beat! renders drum MIDI audio that matches the timbre of a reference clip by fine-tuning a text-to-audio model with a content encoder and hybrid conditioning on a new paired dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11192","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Exploring Token-Space Manipulation in Latent Audio Tokenizers","primary_cat":"cs.SD","submitted_at":"2026-05-11T19:58:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LATTE creates a compact latent token bottleneck in audio tokenizers that aggregates global information and enables unsupervised editing of attributes like speaker identity via token swapping.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11098","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling","primary_cat":"cs.SD","submitted_at":"2026-05-11T18:04:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10199","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue","primary_cat":"cs.CL","submitted_at":"2026-05-11T08:46:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of weaker integration.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"single-stream sequence, as in SyncLLM [46], OmniFlatten [59], NTPP [48], and SALMONN-omni [55]. This yields a simple autoregressive formulation, but increases sequence length and often introduces many silent user chunks into the context. Another line adopts channel fusion, combining the user and system streams at each time step before entering the LLM, as in Moshi [ 7], LSLM [31], SLAM-duplex [19], FLM-Audio [54], and Fun-Audio-Chat [44]. Other architectures have also been explored, including dual decoders with cross-stream attention [ 34], dual-LLM designs [12], and encoder-decoder models that jointly encode both streams before decoding [32]. In contrast, our cross-attention variant keeps the user stream as a separate memory accessed during generation."},{"citing_arxiv_id":"2605.10084","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PoDAR: Power-Disentangled Audio Representation for Generative Modeling","primary_cat":"eess.AS","submitted_at":"2026-05-11T07:05:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PoDAR disentangles audio signal power from semantic content in latents using power augmentation and consistency objectives, yielding 2x faster convergence and gains of 0.055 speaker similarity and 0.22 UTMOS when applied to Stable Audio VAE with F5-TTS.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"only V AEs with architectures built around pretrained encoders such as DINO[21], SigLIP[22] or MAE[23] to yield semantically rich latents that accelerate the convergence of diffusion models. Although these advancements were primarily developed for the image domain, there has been parallel progress on the audio front. In the development of the Moshi codec [10], the authors align the primary codebook latents with a pre-trained WavLM encoder [24] and similarly, the DualCodec framework [11] aligns its representations with a pre-trained w2v-BERT-2.0 model [25]. Classifier-free guidanceClassifier-Free Guidance (CFG) [ 15] is a fundamental component of modern generative models, since it not only improves conditional adherence, but also generation"},{"citing_arxiv_id":"2605.08608","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation","primary_cat":"eess.AS","submitted_at":"2026-05-09T02:07:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"We compare WavCodec against repre- sentative low-bitrate neural codecs and tokenizers. These baselines can be roughly divided into two groups. The first group does not explicitly introduce semantic supervision, including SimCodec [60], BigCodec [56], and WavTokenizer [19]. The second group incorpo- rates semantic constraints from pretrained speech models, includ- ing Mimi [7], XY-Tokenizer [12], X-codec2 [61], and BiCodec [47]. While high-fidelity codecs such as DAC [24] and EnCodec [6] can achieve strong reconstruction quality at higher bitrates, prior work has shown that their performance degrades sharply as bitrate is reduced [47, 61]; therefore, we do not include them as low-bitrate baselines in our comparisons. For all codec baselines, we use the"},{"citing_arxiv_id":"2605.06870","ref_index":1,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse","primary_cat":"cs.LG","submitted_at":"2026-05-07T19:13:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An initial continuous autoencoder training phase prevents dimensional collapse in VQ-VAEs and yields lower reconstruction and perceptual losses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06765","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing","primary_cat":"cs.CL","submitted_at":"2026-05-07T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06628","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation","primary_cat":"eess.IV","submitted_at":"2026-05-07T17:42:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LiVeAction is a lightweight asymmetric neural codec using an FFT-inspired encoder and variance-based training that outperforms generative tokenizers in rate-distortion while supporting real-time use on resource-constrained sensors across modalities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06582","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:11:22+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"treats speech tokens as symbolic sequences whose edit-distance behavior matters for retrieval. Its objective can be understood as adding a CTC-style pairwise sequence constraint to a frame-level token posterior model. Given paired views, wav2tok encourages the token sequence from one view to be likely under the framewise posterior sequence of the other view: −logpCTC(T + i |Pi)−logpCTC(Ti|P+ i ),(14) whereP i andP + i are frame-indexed token-posterior sequences. This is already more sequence-aware than purely local geometric assignment: the loss requires a paired token sequence to be recoverable under a monotonic alignment model, rather than only requiring aligned frames to be close in embedding space. PairAlign retains this central idea-paired views should agree as symbolic sequences-but changes both"},{"citing_arxiv_id":"2605.05927","ref_index":3,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM","primary_cat":"cs.CL","submitted_at":"2026-05-07T09:32:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Recent advances in speech language models: A survey. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13943-13970, 2025. [2] Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. Llama-omni: Seamless speech interaction with large language models.arXiv preprint arXiv:2409.06666, 2024. [3] Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024. [4] Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, and Irwin King. Think before you talk: Enhancing meaningful dialogue generation in full-duplex speech language"},{"citing_arxiv_id":"2605.04613","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models","primary_cat":"cs.SD","submitted_at":"2026-05-06T08:03:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"on complex multi-module architectures that are diﬀicult to scale. 2.2 Large Audio Language Models (LALMs) Large Audio Language Models (LALMs) extend text LLMs to audio-based understanding by align- ing audio and text representations within a shared modeling framework [ 15]. Depending on the design, this alignment can be achieved either mainly in the audio tokenizer [ 4, 39] or directly in the language model through interleaved or parallel prompting [ 5, 37]. After multimodal adaptation and task-specific finetuning [3], LALMs have achieved strong performance in ASR [2, 40, 27] and general audio understanding [5, 8]. Recent studies have further demonstrated their promise in music-related tasks, including song structure analysis [ 28, 12, 30] and music captioning [ 9, 36]."},{"citing_arxiv_id":"2605.03937","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model","primary_cat":"cs.SD","submitted_at":"2026-05-05T16:27:33+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26296","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding","primary_cat":"eess.AS","submitted_at":"2026-04-29T04:51:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23295","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations","primary_cat":"cs.CL","submitted_at":"2026-04-25T13:18:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21406","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge","primary_cat":"eess.AS","submitted_at":"2026-04-23T08:21:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20842","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation","primary_cat":"cs.CL","submitted_at":"2026-04-22T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20940","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sema: Semantic Transport for Real-Time Multimodal Agents","primary_cat":"cs.MM","submitted_at":"2026-04-22T14:29:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"needs spatial layout and text content, not pixel-level texture. By compressing media into tokenized representations that preserve only what the downstream model uses, while dis- carding perceptual details irrelevant to the agent task, we achieve the 64-210× bandwidth ratios shown in Figures 1-2 without sacrificing task accuracy. Moreover, because com- pression efficiency correlates with model capability [11, 17], the semantic capacity of a fixed physical link grows as tok- enizer models improve, which is a scaling property no con- ventional codec can match. Decoupling Time: Event Sequences vs. Continuous Play- out.Human-facing RTC stacks deliver audio continuously at playout rate and use jitter buffers to smooth timing varia- tion, machinery that exists solely because human perception"},{"citing_arxiv_id":"2604.19949","ref_index":184,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages","primary_cat":"eess.AS","submitted_at":"2026-04-21T19:54:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16622","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning","primary_cat":"cs.CL","submitted_at":"2026-04-17T18:19:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A contrastive LLM fine-tuning method creates joint embeddings for dialogue contexts and backchannel realizations, improving retrieval performance and alignment with human judgments over raw WavLM features.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14604","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection","primary_cat":"cs.CR","submitted_at":"2026-04-16T04:22:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"bines them with text tokens as input to the LLM back- bone. During audio tokenization, LALMs extract acoustic features from raw audio signals and then apply vector quantization techniques to derive discrete audio tokens. Meanwhile, the LLM backbone extends its vocabulary and embedding matrix to accommodate audio tokens. Instead of audio tokenization, thecontinuous-featurescheme [32]- [43] directly aligns audio and text inputs within a unified embedding space. Such LALMs project acoustic features Figure 1: Different audio-text integration schemes in LALMs (speech synthesis process omitted). into the text space via a modality adapter, which is often implemented as a multi-layer perceptron [32]-[39], cross- attention layers [40] or a transformer [41]-[43]."},{"citing_arxiv_id":"2604.12145","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization","primary_cat":"eess.AS","submitted_at":"2026-04-13T23:49:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on downstream tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11594","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models","primary_cat":"eess.AS","submitted_at":"2026-04-13T15:06:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"26 3.42 / 3.49 Gemini-2.5-flash 3.68 / 3.73 3.64 / 3.93 3.26 / 3.393.60/3.60 sions, D2 and D3 are assessed by human judges. Ten experi- enced listeners evaluated both the Chinese and English subsets under a blind rating protocol. 4. Experiments 4.1. Experimental setup We evaluate eight ALMs in two categories. The open-source group includes Freeze-Omni [27], GLM-4-V oice [28], Kimi- Audio [29], Step-Audio-2-mini [30], and Qwen2.5-Omni [22]. The closed-source group consists of Doubao-realtime, GPT-4o- audio [1], and Gemini-2.5-flash [2]. These models represent the current state of the art in ALM performance. 4.2. Results and analysis Multi-Turn Emotional Tracking and ReasoningAs shown in Table 3, closed-source models (e."},{"citing_arxiv_id":"2604.11424","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation","primary_cat":"cs.CL","submitted_at":"2026-04-13T13:06:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio expressiveness on EchoMind after training on 800 hours of data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10065","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-11T07:07:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"models with reinforcement learning from AI feedback,\" in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 20 395-20 411. [Online]. Available: https: //aclanthology.org/2025.acl-long.997/ [53] C. Chen, K. Hu, C.-H. H. Yang, A. Pasad, E. Casanova, W. Wang, S.-W. Fu, J. Li, Z. Chen, J. Balamet al., \"Reinforcement learn- ing enhanced full-duplex spoken dialogue language models for conversational interactions,\" inSecond Conference on Language Modeling, 2025. [54] S. Arora, J. Tian, J. Shi, H. Futami, Y . Kashiwagi, E. Tsunoo, and S. Watanabe, \"Optimizing conversational quality in spoken"},{"citing_arxiv_id":"2604.08363","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation","primary_cat":"cs.SD","submitted_at":"2026-04-09T15:27:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This requires modeling both arXiv:2604.08363v1 [cs.SD] 9 Apr 2026 Conference'17, July 2017, Washington, DC, USA Xiaosu Su, Zihan Sun, Peilei Jia, and Jun Gao stable speaker timbre and turn-level expressive variation: the former is better characterized by speaker-level global conditioning, while the latter needs to be explicitly modeled at the current turn[7, 28, 40]. A central open problem, however, is how to preserve a satisfactorily designed timbre for ongoing generation. Once a desired voice has been crafted through text-guided design, it must be reliably reused across diverse utterances and dialogue contexts. VoiceSculptor ad- dresses this through a two-stage \"voice design + cloning\" pipeline, in which the designed voice is first rendered into a reference utter-"},{"citing_arxiv_id":"2604.08000","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory","primary_cat":"cs.AI","submitted_at":"2026-04-09T09:06:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while finding deeper intents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06129","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer","primary_cat":"cs.CV","submitted_at":"2026-04-07T17:40:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.01897","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection","primary_cat":"cs.SD","submitted_at":"2026-04-02T11:00:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"These methods are lightweight and fast, but primarily capture speech presence rather than communicative intent, making them prone to false triggers from backchannels, hesitations, or background noise. The second group introduces explicit turn prediction modules using learned models. Representative examples include Smart Turn, TEN Turn Detection, and Easy Turn [14]. The second ap- proach enhances conversational intent detection by leveraging learned models and text-based cues, making it more adaptable to complex dialogues. However, turn detection still faces significant challenges in both methodology and data. Existing approaches struggle to balance accuracy and efficiency, particularly in real-time, noisy,"},{"citing_arxiv_id":"2603.25551","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Voxtral TTS","primary_cat":"cs.AI","submitted_at":"2026-03-26T15:23:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in human evaluations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.22267","ref_index":59,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TiCo: Time-Controllable Spoken Dialogue Model","primary_cat":"cs.CL","submitted_at":"2026-03-23T17:51:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.17837","ref_index":9,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning","primary_cat":"eess.AS","submitted_at":"2026-03-18T15:30:29+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}