{"total":32,"items":[{"citing_arxiv_id":"2606.00851","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning","primary_cat":"cs.SD","submitted_at":"2026-05-30T18:53:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sympatheia introduces a continuous affect-conditioned speech dialogue model and the Sympatheia-18k synthetic dataset, showing improved emotional appropriateness over baselines when speech cues are limited.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00507","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LaSR: Context-Aware Speech Recognition via Latent Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-30T03:44:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LaSR improves context-aware terminology recognition in speech LLMs by aligning latent CoT supervision on acoustic regions and introducing latent reasoning periods, shown on a new academic corpus to outperform standard fine-tuning without added latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21008","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey of Audio Reasoning in Multimodal Foundation Models","primary_cat":"eess.AS","submitted_at":"2026-05-20T10:44:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future 
directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2) Spoken Language Models (SLMs): SLMs are foundation models built for spoken dialogue interactions. Similar to LALMs, SLMs can take audio and optional text as inputs, but their outputs are speech representations or waveforms rather than text. Architecturally, they can follow either the encoder-projector-LLM-decoder design that resembles LALMs but adds a speech generation module [36], [37] after the LLM, or a more unified token-based design [38], [39], in which continuous speech is first converted into discrete speech tokens and both input and output speech are modeled autoregressively in a shared token space together with text. The latter formulation is more end-to-end, but it also introduces challenges such as long token sequences, multi-codebook token structures, and"},{"citing_arxiv_id":"2605.20755","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action","primary_cat":"eess.AS","submitted_at":"2026-05-20T05:54:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20266","ref_index":124,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook","primary_cat":"cs.SD","submitted_at":"2026-05-18T20:21:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio 
intelligence.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Parrot [119] Sep 2024 Llama 3.1-8B 8B EN Discrete 74,554 Hrs audio✓✗✓ OmniFlatten [120] Oct 2024 Qwen2-0.5B 0.5B EN, CN Discrete -✓ ✓ ✓ IntrinsicVoice [121] Oct 2024 Qwen2-7B-Instruct 7B - Discrete 20K Hrs audio✗✓ ✓ DiVA [122] Oct 2024 Llama 3 8B EN Contin. -✗✓ ✓ Freeze-Omni [123] Nov 2024 Qwen2-7B-Instruct 7B EN, CN Contin. -✓ ✓ ✓ GLM-4-Voice [124] Dec 2024 GLM-4-9B 9B EN, CN Discrete 1T tokens✗✓ ✓ KE-Omni [125] Dec 2024 LLaMA-3.1-8B-Instruct 8B EN, CN Contin. -✗✓ ✓ MERaLiON-Audio [126] Dec 2024 SEA-LION V3 10B Multi. Contin. -✗✓ ✓ Year 2025 MinMo [60] Jan 2025 Qwen2.5-7B-Instruct 7B Multi. Contin. -✓ ✓ ✓ FireRedASR [12] Jan 2025 Qwen2-7B-Instruct 7B Multi. Contin. -✗✓ ✓ Step-Audio [127]"},{"citing_arxiv_id":"2605.11098","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling","primary_cat":"cs.SD","submitted_at":"2026-05-11T18:04:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10199","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"How Should LLMs Listen While Speaking? 
A Study of User-Stream Routing in Full-Duplex Spoken Dialogue","primary_cat":"cs.CL","submitted_at":"2026-05-11T08:46:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of weaker integration.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"including ASR, TTS, audio understanding, and spoken dialogue [57, 6, 56, 52, 25, 21, 22, 16, 15]. A common paradigm is to map speech into the LLM embedding space and represent speech and text as a single sequence that can be processed autoregressively. Recent systems further enable end-to-end spoken dialogue by generating both text and speech outputs within a unified framework [56, 10, 11, 42]. Despite their strong performance, these models typically assume strict turn taking: the system first consumes the user's complete utterance and only then generates its response. As a result, they do not naturally support full-duplex behaviors such as interruption and backchannels handling. 
Our work builds on this line of speech-based LLMs, but extends it to full-duplex spoken"},{"citing_arxiv_id":"2605.06765","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing","primary_cat":"cs.CL","submitted_at":"2026-05-07T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05927","ref_index":12,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM","primary_cat":"cs.CL","submitted_at":"2026-05-07T09:32:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"reasoning and language understanding capabilities of TLMs to spoken input and output. However, despite this shared backbone, a major challenge remains: the modality gap - speech-based question answering (QA) performance often remains substantially worse than the text-based QA performance of the underlying TLM, limiting the practical usability of SLMs. For example, prior studies [11] report that GLM-4-Voice [12] suffers up to a 20% performance drop on several QA benchmarks. 
Most existing studies attempt to bridge the modality gap from the output side. Early SLMs use the LLM backbone to generate only speech tokens, enabling fully end-to-end spoken interaction [13-15]. Later work improves performance by having the LLM first generate intermediate text tokens before"},{"citing_arxiv_id":"2605.00329","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation","primary_cat":"cs.SD","submitted_at":"2026-05-01T01:13:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with competitive quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"AudioTurbo ≈2000 1.1B 5 22.18 - 1.30 8.88 - - - - AudioTurbo ≈2000 1.1B 10 20.65 - 1.29 9.40 - - - - AUDIODEAR w/o Dist. 1700 191M 1 22.09 3.82 1.22 8.07 0.298 - - 2.61 AUDIODEAR 1700 191M 1 18.67 2.79 1.06 9.66 0.334 4.27±0.04 3.27±0.06 2.61 energy-scoring head are provided in Appendix E. During training, we apply a masking rate randomly sampled from the range [70,100) to the audio latents, enabling masked generative modeling with the energy-distance objective. For representation distillation, we adopt the transformer backbone of the diffusion-based state-of-the-art model IMPACT (Huang et al., 2025) as the teacher, and integrate the distillation loss with the energy-distance objective using a distillation weight λ = 1000, as defined in Equation 5. Unless otherwise specified, we train with a batch size of 2048 and a learning rate of 1e−3. 
At inference time, we follow IMPACT by setting the number of decoding iterations to 64. Following related work (Ma et al., 2025), we apply classifier-free guidance during inference, with CFG scale set to 4.0. Ablation studies and implementation details on CFG can be found in Appendix F. 4.3. Evaluation: We evaluate our proposed TTA generation framework using both objective and subjective metrics. For objective assessment, we report Fréchet distance (FD; Heusel et al. 2017), Fréchet audio distance (FAD; Kilgour et al. 2018), Kullback-Leibler divergence (KL), and inception score (IS; Salimans et al. 2016) following the AudioLDM evaluation protocol, and CLAP similarity (Wu et al., 2023) using the same pre-trained CLAP model employed by IMPACT. The CLAP model used for training is different from the one used for evaluation to avoid taking advantage of training and evaluating with the same model. Subjective evaluation is conducted on 90 generated audio samples conditioned on the AudioCaps evaluation set prompts, using the user interface and rating criteria defined in AudioBox (Vyas et al., 2023). 
Each sample receiv"},{"citing_arxiv_id":"2604.21406","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge","primary_cat":"eess.AS","submitted_at":"2026-04-23T08:21:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20842","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation","primary_cat":"cs.CL","submitted_at":"2026-04-22T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18489","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints","primary_cat":"cs.SD","submitted_at":"2026-04-20T16:40:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Rule-generated preference data aligned via sequential DPO and KTO reduces musical constraint violations and improves coherence in lyric-to-melody generation over 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14604","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection","primary_cat":"cs.CR","submitted_at":"2026-04-16T04:22:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"details. To combine their strengths, the hybrid scheme [44] fuses tokenized audio embeddings and projected acoustic features with text embeddings as the LLM input. Given the integrated input, listening-only LALMs [36]-[42] generate text response. Full-duplex LALMs generate both text and audio tokens in a parallel [31]-[34], [43] or interleaved [28]-[30] manner, followed by speech synthesis. In practice, LALMs are predominantly employed in two fundamental task categories [37]: (1) audio analysis: LALMs process speech, sound, or music signals alongside text instructions to perform audio understanding or reasoning. 
In this task, LALMs consume the audio input as data for analysis; (2) voice chat: LALMs listen and respond in"},{"citing_arxiv_id":"2604.13804","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-15T12:39:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In recent years, the development of multimodal technologies has enabled the alignment of audio modalities with large model inputs, thereby facilitating extensive audio understanding by large language models. Some studies encode speech into discrete tokens and incorporate them into LLMs, allowing the models to accept audio input, as seen in works such as SpeechGPT [40] and AudioPaLM [17]. Models like SALMONN [30] and Qwen-Audio [7, 8] are trained on large-scale, multi-task datasets, equipping them to perform a variety of downstream tasks including speech recognition, speech translation, and audio event detection. 
A subset of research applies large audio models to spoken dialogue, enabling more intelligent interactions, for example, by mining paralinguistic factors such as"},{"citing_arxiv_id":"2604.11594","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models","primary_cat":"eess.AS","submitted_at":"2026-04-13T15:06:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09222","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking","primary_cat":"cs.SD","submitted_at":"2026-04-10T11:27:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on four audio LLMs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"to Challenge AI Safety by Humanizing LLMs. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, 14322-14350. 
doi:10.18653/V1/2024.ACL-LONG.773 [41] Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2023. SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023 (Findings of ACL, Vol. EMNLP 2023), Houda Bouamor, Juan Pino, and Kalika Bali (Eds."},{"citing_arxiv_id":"2604.08003","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs","primary_cat":"eess.AS","submitted_at":"2026-04-09T09:07:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(1/2) log( det Σ_{AA|C} det Σ_{BB|C} / det Σ_{[A,B]|C} ), (27) where the conditional covariance matrices are defined via the Schur complement, e.g. Σ_{AA|C} = Σ_{AA} − Σ_{AC} Σ_{CC}^{-1} Σ_{CA}, (28) and analogously for Σ_{BB|C} and Σ_{[A,B]|C}. Here, Σ_{AA|C} captures the residual variability of A after removing the components that can be linearly explained by C. 
Proof. By definition, I(A;B|C) = h(A|C) + h(B|C) − h(A,B|C). (29) For jointly Gaussian variables, conditional distributions remain Gaussian, and their conditional entropies are determined by conditional covariance matrices: h(A|C) = (1/2) log( (2πe)^{d_A} det Σ_{AA|C} ), (30) h(B|C) = (1/2) log( (2πe)^{d_B} det Σ_{BB|C} ), (31) h(A,B|C) = (1/2) log( (2πe)^{d_A+d_B} det Σ_{[A,B]|C} ). (32) Substituting into the definition of conditional mutual information yields"},{"citing_arxiv_id":"2604.01897","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection","primary_cat":"cs.SD","submitted_at":"2026-04-02T11:00:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"leads to full-duplex interaction, where the system must process speech perception, partial semantic understanding, and response planning concurrently while the user is still speaking. Unlike turn-based settings [7], a full-duplex system is required to make online decisions about when to continue speaking, when to yield the floor, and when to insert or interrupt [8, 9, 10]. These decisions involve a delicate latency-accuracy trade-off: reacting too late increases overlap and errors, while reacting too early risks truncating semantics and degrading coherence, especially under noisy and overlapped observations. 
Although large language models excel at reasoning and generation with complete textual inputs, integrating them into low-latency full-"},{"citing_arxiv_id":"2603.22267","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TiCo: Time-Controllable Spoken Dialogue Model","primary_cat":"cs.CL","submitted_at":"2026-03-23T17:51:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.17837","ref_index":40,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning","primary_cat":"eess.AS","submitted_at":"2026-03-18T15:30:29+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.23578","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models","primary_cat":"cs.CL","submitted_at":"2025-12-29T16:23:54+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial 
mitigation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.14234","ref_index":129,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body","primary_cat":"cs.CV","submitted_at":"2025-12-16T09:41:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"We utilize a strong pretrained audio-text backbone and interleave the newly added modality through an audio-text-motion pathway within a global cross-attention layer that operates over an interleaved token stream. Crucially, our approach does not require large-scale audio-text-motion pretraining. Instead, we leverage the pretrained capacity of speech LLMs [129] and attach a lightweight per-layer modality expert, a small Transformer block that produces face and body queries and reads the backbone's key/value via cross-attention. 
Because these experts are side-car modules, the backbone's architecture and weights remain intact, allowing us to utilize off-the-shelf checkpoints while adding new modalities with mini-"},{"citing_arxiv_id":"2510.09592","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models","primary_cat":"cs.CL","submitted_at":"2025-10-10T17:50:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MPS proposes a dual-brain architecture separating formulation reasoning from articulation to achieve real-time CoT in SLMs with accuracy comparable to full pre-computation but much lower latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.23435","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models","primary_cat":"cs.SD","submitted_at":"2025-09-27T18:08:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AudioRole provides 1M+ character-grounded audio-text dialogues from TV series plus ARP-Eval to train and measure audio role-playing models, with ARP-Model showing 0.31 acoustic and 0.36 content personalization scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22220","ref_index":84,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs","primary_cat":"cs.CL","submitted_at":"2025-09-26T11:32:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StableToken introduces a multi-branch 
architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.14804","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages","primary_cat":"cs.SD","submitted_at":"2025-09-18T09:59:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces XLSR-Thai encoder, U-Align alignment, and Thai-SUP data pipeline to enable multitask speech understanding SLLMs for Thai.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.16632","ref_index":78,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Step-Audio 2 Technical Report","primary_cat":"cs.CL","submitted_at":"2025-07-22T14:23:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.18425","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Kimi-Audio Technical Report","primary_cat":"eess.AS","submitted_at":"2025-04-25T15:31:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art 
results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"semantic tokens and complementary continuous vectors of acoustic information to effectively represent speech signals for downstream tasks. This tokenization allows the model to leverage the efficiency and semantic focus of discrete tokens while benefiting from the rich acoustic details captured by continuous representations. We incorporate the discrete semantic tokens proposed by GLM-4-Voice [84]. This component utilizes a supervised speech tokenizer derived from an automatic speech recognition (ASR) model. [Figure 2 labels: Audio Detokenizer Audio Head Text Head Shared LLM Layer Text Token Audio Token Audio Delay Blank Token Whisper Encoder Adaptor Audio Tokenizer Audio Embedding] Figure 2: Overview of the Kimi-Audio model architecture: (1) an audio tokenizer that extracts discrete semantic tokens"},{"citing_arxiv_id":"2504.08528","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On The Landscape of Spoken Language Models: A Comprehensive Survey","primary_cat":"cs.CL","submitted_at":"2025-04-11T13:40:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.01743","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via 
Mixture-of-LoRAs","primary_cat":"cs.CL","submitted_at":"2025-03-03T17:05:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.11946","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction","primary_cat":"cs.CL","submitted_at":"2025-02-17T15:58:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Step-Audio introduces a 130B-parameter unified speech-text model with open-sourced components for understanding, generation, affordable voice cloning, and dynamic control, claiming SOTA human evaluation results on a new benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}