S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models
Pith reviewed 2026-05-23 00:30 UTC · model grok-4.3
The pith
S2S-Arena benchmark reveals substantial performance gaps in paralinguistic instruction following between academic and industrial speech-to-speech systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
S2S-Arena is a speech-native benchmark that uses a four-level interaction protocol, a two-stage data construction pipeline yielding 1,243 samples over 100+ tasks, and reference-free pairwise comparison to demonstrate that current S2S systems exhibit substantial performance gaps under increasing paralinguistic demands, with industrial models generally outperforming academic ones.
What carries the argument
The four-level interaction protocol combined with the two-stage data construction pipeline that generates speech samples probing paralinguistic instruction following at increasing levels of complexity.
If this is right
- Industrial S2S systems maintain higher performance than academic systems when paralinguistic demands become complex.
- Specific design choices in model architecture and training data control how well a system follows instructions involving prosody and emotion.
- Reference-free pairwise comparison in the speech domain can surface differences that text-only metrics miss.
- The identified design factors supply concrete directions for improving naturalness and robustness in future speech agents.
Where Pith is reading between the lines
- Extending the benchmark to additional languages or longer multi-turn dialogues could test whether the observed gaps persist outside English and short exchanges.
- The arena format might be adapted to compare open-source versus closed-source models on privacy-sensitive paralinguistic tasks where reference audio is unavailable.
- If the performance gaps trace mainly to training data diversity, targeted data augmentation focused on emotion and speaker variation could narrow them without changing model scale.
Load-bearing premise
The four-level interaction protocol and two-stage data construction pipeline produce speech samples that accurately and unbiasedly probe paralinguistic instruction following in real-world tasks.
What would settle it
A follow-up study that collects new human preference judgments on the same model outputs using unscripted live conversations and finds either no performance gap or reversed model rankings would falsify the benchmark's core validity.
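One way to make "reversed model rankings" concrete is a rank correlation between the benchmark's ordering and the ordering obtained from the follow-up study. The sketch below computes Kendall's tau for two tie-free rankings; it is an illustration chosen here, not a procedure described in the paper, and the model names are hypothetical placeholders.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two tie-free rankings of the same items.

    rank_a, rank_b: dicts mapping item -> rank position (1 = best).
    Returns +1.0 for identical orderings and -1.0 for fully reversed ones.
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same items")
    items = list(rank_a)
    if len(items) < 2:
        raise ValueError("need at least two items to compare rankings")

    concordant = 0
    discordant = 0
    for x, y in combinations(items, 2):
        # Positive product: both rankings order the pair the same way.
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings; these are not results from the paper.
benchmark_rank = {"model_x": 1, "model_y": 2, "model_z": 3}
followup_rank = {"model_x": 3, "model_y": 2, "model_z": 1}
tau = kendall_tau(benchmark_rank, followup_rank)
```

A tau near +1 would indicate that live-conversation judgments preserve the benchmark's ordering, while a value near -1 would correspond to the reversed-ranking outcome described above.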
Original abstract
Recent advances in large language models (LLMs) have fundamentally reshaped speech-to-speech (S2S) systems, enabling increasingly natural spoken interaction. However, existing benchmarks still rely heavily on text-based evaluation and largely ignore paralinguistic cues such as prosody, emotion, and speaker traits, which are central to expressive and human-like communication. We introduce S2S-Arena, a speech-native benchmark for evaluating instruction-following S2S models with explicit assessment of both semantic understanding and paralinguistic expression. S2S-Arena features a four-level interaction protocol that systematically probes models under increasing paralinguistic complexity, a two-stage data construction pipeline that produces 1,243 speech samples spanning 100+ real-world tasks, and an arena-style evaluation framework that enables reference-free, pairwise comparison directly in the speech modality. Benchmarking 10 state-of-the-art S2S systems over 1,000+ comparisons reveals substantial performance gaps (especially under complex paralinguistic demands) between current academic and industrial systems. Our analysis further identifies key design factors governing expressive instruction following, providing actionable insights for building more natural, robust, and human-aligned speech agents.
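The abstract does not say how pairwise judgments are aggregated into model-level results. As a minimal sketch, assuming each comparison records a protocol level, two models, and a preferred side (or a tie), per-level win rates could be tallied as follows. The record layout, the tie-as-half-win rule, and the model names are assumptions made for illustration, not details from the paper.

```python
from collections import defaultdict


def win_rates(comparisons):
    """Aggregate pairwise judgments into per-(level, model) win rates.

    Each comparison is a tuple (level, model_a, model_b, outcome), where
    outcome is "a", "b", or "tie". A tie counts as half a win per side
    (an assumption of this sketch).
    """
    score_for_a = {"a": 1.0, "b": 0.0, "tie": 0.5}
    wins = defaultdict(float)
    games = defaultdict(int)
    for level, model_a, model_b, outcome in comparisons:
        if outcome not in score_for_a:
            raise ValueError(f"unknown outcome: {outcome!r}")
        score_a = score_for_a[outcome]
        wins[(level, model_a)] += score_a
        wins[(level, model_b)] += 1.0 - score_a
        games[(level, model_a)] += 1
        games[(level, model_b)] += 1
    return {key: wins[key] / games[key] for key in games}


# Hypothetical judgments; levels and model names are placeholders.
toy_comparisons = [
    (1, "model_x", "model_y", "a"),
    (1, "model_x", "model_y", "tie"),
    (4, "model_x", "model_y", "b"),
    (4, "model_x", "model_y", "b"),
]
rates = win_rates(toy_comparisons)
```

Grouping by level keeps the comparison aligned with the four-level protocol, so a gap that only appears at the higher levels stays visible instead of being averaged away.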
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces S2S-Arena, a speech-native benchmark for evaluating paralinguistic instruction following in S2S models. It proposes a four-level interaction protocol that increases paralinguistic complexity, a two-stage pipeline generating 1,243 speech samples over 100+ real-world tasks, and an arena-style reference-free pairwise evaluation in the speech modality. Benchmarking 10 S2S systems across 1,000+ comparisons shows substantial performance gaps, particularly under complex paralinguistic demands, between academic and industrial systems, along with analysis of key design factors.
Significance. If the benchmark construction and evaluation protocol are shown to be valid and unbiased, the work would provide a valuable speech-native resource that shifts evaluation away from text proxies and highlights concrete gaps in current S2S systems' handling of prosody, emotion, and speaker traits. The identification of design factors could offer actionable guidance for improving expressive instruction following.
major comments (2)
- [Abstract / Benchmark Construction] The abstract and benchmark description provide no details on validation of the four-level protocol or two-stage pipeline (e.g., inter-annotator agreement for speech sample quality, error analysis of the 1,243 samples, or controls for annotator bias in paralinguistic judgments). This is load-bearing for the central performance-gap claim, as the gaps are only interpretable if the probes are shown to be accurate and unbiased.
- [Evaluation Framework] The arena-style evaluation framework is described as reference-free and pairwise in the speech modality, but the manuscript supplies no quantitative evidence (e.g., agreement metrics, consistency across raters, or comparison to text-based baselines) that this protocol reliably distinguishes semantic vs. paralinguistic failures. Without such evidence the reported gaps between academic and industrial systems cannot be confidently attributed to paralinguistic demands.
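The agreement metrics requested in both major comments could take several forms. As one hedged example, Cohen's kappa measures chance-corrected agreement between two raters who label the same pairwise comparisons. The label scheme ("a", "b", "tie") and the toy data are assumptions for illustration; the manuscript does not state which statistic, if any, the authors would use.

```python
from collections import Counter


def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa for two raters labelling the same items.

    labels_1, labels_2: equal-length sequences of categorical labels.
    """
    if len(labels_1) != len(labels_2) or len(labels_1) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_1)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n

    # Expected agreement under independence, from each rater's label counts.
    counts_1 = Counter(labels_1)
    counts_2 = Counter(labels_2)
    expected = sum(counts_1[k] * counts_2[k] for k in counts_1) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, and this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical preference labels from two raters on four comparisons.
rater_1 = ["a", "a", "b", "b"]
rater_2 = ["a", "a", "b", "a"]
kappa = cohens_kappa(rater_1, rater_2)
```

With more than two raters, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the analogous choice.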
minor comments (2)
- [Benchmark Construction] Clarify the exact criteria used to select the 10 S2S systems and the 100+ tasks to allow reproducibility.
- [Analysis] The claim of 'actionable insights' from the design-factor analysis would benefit from explicit mapping to model components (e.g., which architectural choices correlate with which paralinguistic levels).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for explicit validation of the benchmark construction and evaluation protocol. We agree these elements are important for interpreting the performance gaps and will revise the manuscript to address both major comments.
- Referee: [Abstract / Benchmark Construction] The abstract and benchmark description provide no details on validation of the four-level protocol or two-stage pipeline (e.g., inter-annotator agreement for speech sample quality, error analysis of the 1,243 samples, or controls for annotator bias in paralinguistic judgments). This is load-bearing for the central performance-gap claim, as the gaps are only interpretable if the probes are shown to be accurate and unbiased.
  Authors: We agree that the current manuscript lacks these validation details. In the revision we will add a new subsection on benchmark validation that reports inter-annotator agreement for the two-stage pipeline, an error analysis of the 1,243 samples, and explicit controls for annotator bias in paralinguistic judgments. These additions will directly support the reliability of the reported gaps between academic and industrial systems.
  Revision: yes
- Referee: [Evaluation Framework] The arena-style evaluation framework is described as reference-free and pairwise in the speech modality, but the manuscript supplies no quantitative evidence (e.g., agreement metrics, consistency across raters, or comparison to text-based baselines) that this protocol reliably distinguishes semantic vs. paralinguistic failures. Without such evidence the reported gaps between academic and industrial systems cannot be confidently attributed to paralinguistic demands.
  Authors: We acknowledge that the manuscript currently provides no quantitative evidence for the reliability of the arena-style protocol. In the revised version we will include rater agreement metrics, consistency statistics across raters, and a direct comparison against text-based evaluation baselines to demonstrate that the speech-native pairwise judgments reliably separate semantic from paralinguistic failures.
  Revision: yes
Circularity Check
No significant circularity
This is an empirical benchmark paper introducing S2S-Arena with a four-level protocol, two-stage data pipeline, and arena-style evaluation. The abstract and description contain no equations, fitted parameters, predictions, or derivations that could reduce to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are present. The performance-gap claim follows directly from applying the described construction and comparison methods to the 10 systems; the work is self-contained as a benchmark without internal circular reductions.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 4 Pith papers
- SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
  SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
- MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
  MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.
- NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations
  NVBench provides a standardized bilingual benchmark and evaluation protocol for assessing non-verbal vocalization generation, placement, and salience in text-to-speech systems.
- A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
  A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.