VocalBench: Benchmarking the vocal conversational abilities for speech interaction models

2025 · arXiv 2505.15727

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

eess.AS · 2026-06-01 · unverdicted · novelty 7.0

SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show that no model excels across all dimensions and that compositional editing is especially difficult.

DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action

eess.AS · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-conversation tool use.

Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

cs.CL · 2025-12-29 · accept · novelty 7.0

Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.

Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

eess.AS · 2025-09-30 · unverdicted · novelty 7.0

Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.

AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models

cs.SD · 2025-09-27 · unverdicted · novelty 7.0

AudioRole provides 1M+ character-grounded audio-text dialogues from TV series plus ARP-Eval to train and measure audio role-playing models, with ARP-Model showing 0.31 acoustic and 0.36 content personalization scores.

Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation

eess.AS · 2026-06-29 · unverdicted · novelty 6.0

PRIME-Speech adds low-latency speech output to frozen S2T LLMs by synchronizing a causal post-decoder with intermediate hidden states and using mixed conditioning plus turn-level KV-cache packing, preserving original S2T performance across translation, QA, and dialogue tasks.

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

cs.SD · 2026-05-18 · unverdicted · novelty 5.0

A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

cs.SD · 2026-06-05 · unverdicted · novelty 4.0

Aligned acoustic concept tokens from eGeMAPS improve UAR in ALM-based SER on FAU-Aibo and IEMOCAP while shuffled or corrupted tokens reduce performance without collapsing predictions, indicating partial anchoring to audio.

citing papers explorer

Showing 8 of 8 citing papers.

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing · eess.AS · 2026-06-01 · unverdicted · none · ref 12
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action · eess.AS · 2026-05-20 · unverdicted · none · ref 42 · 2 links
Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models · cs.CL · 2025-12-29 · accept · none · ref 28
Game-Time: Evaluating Temporal Dynamics in Spoken Language Models · eess.AS · 2025-09-30 · unverdicted · none · ref 41
AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models · cs.SD · 2025-09-27 · unverdicted · none · ref 15
Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation · eess.AS · 2026-06-29 · unverdicted · none · ref 51
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook · cs.SD · 2026-05-18 · unverdicted · none · ref 186
Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition · cs.SD · 2026-06-05 · unverdicted · none · ref 17
