SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show no model excels across dimensions and compositional editing is especially difficult
Vocalbench: Benchmarking the vocal conversational abilities for speech interaction models
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
dataset 1polarities
use dataset 1representative citing papers
DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-conversation tool use.
Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.
Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.
AudioRole provides 1M+ character-grounded audio-text dialogues from TV series plus ARP-Eval to train and measure audio role-playing models, with ARP-Model showing 0.31 acoustic and 0.36 content personalization scores.
PRIME-Speech adds low-latency speech output to frozen S2T LLMs by synchronizing a causal post-decoder with intermediate hidden states and using mixed conditioning plus turn-level KV-cache packing, preserving original S2T performance across translation, QA, and dialogue tasks.
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
Aligned acoustic concept tokens from eGeMAPS improve UAR in ALM-based SER on FAU-Aibo and IEMOCAP while shuffled or corrupted tokens reduce performance without collapsing predictions, indicating partial anchoring to audio.
citing papers explorer
-
SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing
SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show no model excels across dimensions and compositional editing is especially difficult
-
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-conversation tool use.
-
Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models
Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.
-
Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.
-
AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models
AudioRole provides 1M+ character-grounded audio-text dialogues from TV series plus ARP-Eval to train and measure audio role-playing models, with ARP-Model showing 0.31 acoustic and 0.36 content personalization scores.
-
Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation
PRIME-Speech adds low-latency speech output to frozen S2T LLMs by synchronizing a causal post-decoder with intermediate hidden states and using mixed conditioning plus turn-level KV-cache packing, preserving original S2T performance across translation, QA, and dialogue tasks.
-
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
-
Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition
Aligned acoustic concept tokens from eGeMAPS improve UAR in ALM-based SER on FAU-Aibo and IEMOCAP while shuffled or corrupted tokens reduce performance without collapsing predictions, indicating partial anchoring to audio.