pith. machine review for the scientific record.

arxiv: 2503.05085 · v2 · submitted 2025-03-07 · 💻 cs.CL · cs.SD · eess.AS

Recognition: unknown

S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models

Authors on Pith: no claims yet
classification: 💻 cs.CL · cs.SD · eess.AS
keywords: paralinguistic · models · s2s-arena · speech · systems · evaluating · evaluation · expressive
abstract

Recent advances in large language models (LLMs) have fundamentally reshaped speech-to-speech (S2S) systems, enabling increasingly natural spoken interaction. However, existing benchmarks still rely heavily on text-based evaluation and largely ignore paralinguistic cues such as prosody, emotion, and speaker traits, which are central to expressive and human-like communication. We introduce S2S-Arena, a speech-native benchmark for evaluating instruction-following S2S models with explicit assessment of both semantic understanding and paralinguistic expression. S2S-Arena features a four-level interaction protocol that systematically probes models under increasing paralinguistic complexity, a two-stage data construction pipeline that produces 1,243 speech samples spanning 100+ real-world tasks, and an arena-style evaluation framework that enables reference-free, pairwise comparison directly in the speech modality. Benchmarking 10 state-of-the-art S2S systems over 1,000+ comparisons reveals substantial performance gaps (especially under complex paralinguistic demands) between current academic and industrial systems. Our analysis further identifies key design factors governing expressive instruction following, providing actionable insights for building more natural, robust, and human-aligned speech agents.
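The abstract describes an arena-style framework that ranks systems from reference-free pairwise comparisons. The paper's exact aggregation method is not given here, so the following is a generic sketch of how such pairwise judgments are commonly turned into a leaderboard with Elo-style ratings; the model names and battle outcomes are hypothetical.

```python
def elo_update(r_a, r_b, outcome, k=32):
    """Update two Elo ratings after one pairwise comparison.

    outcome: 1.0 if system A wins, 0.0 if system B wins, 0.5 for a tie.
    Returns the new (rating_a, rating_b) pair.
    """
    # Expected score of A under the standard logistic Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    new_a = r_a + k * (outcome - expected_a)
    new_b = r_b + k * ((1.0 - outcome) - (1.0 - expected_a))
    return new_a, new_b


# Hypothetical systems, all starting from the same baseline rating.
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}

# Hypothetical pairwise judgments, recorded as (winner, loser).
battles = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]

for winner, loser in battles:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], 1.0)

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

Because each update only needs the two ratings involved, this scheme scales naturally to the 1,000+ comparisons reported in the abstract; order-sensitivity of sequential Elo is often mitigated in practice by averaging over shuffled replays or by fitting a Bradley-Terry model instead.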

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

  2. MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

    eess.AS 2026-04 unverdicted novelty 7.0

    MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.

  3. NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

    cs.SD 2026-04 unverdicted novelty 7.0

    NVBench provides a standardized bilingual benchmark and evaluation protocol for assessing non-verbal vocalization generation, placement, and salience in text-to-speech systems.