SpeakerSleuth: Can Large Audio-Language Models Judge Speaker Consistency across Multi-turn Dialogues?
Pith reviewed 2026-05-16 16:14 UTC · model grok-4.3
The pith
Large audio-language models prioritize text over acoustics when judging speaker consistency in multi-turn dialogues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LALMs struggle to reliably judge speaker consistency across multi-turn dialogues. Given audio samples from the same speaker, some models overpredict inconsistency while others are overly lenient. When textual context from other interlocutors is provided, performance degrades as models prioritize textual coherence over acoustic cues and fail to detect even obvious changes such as gender switches. Models perform better when comparing and ranking acoustic variants, indicating they possess acoustic discrimination abilities but do not apply them effectively in consistency evaluation tasks.
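The two failure modes named above (overpredicting inconsistency versus being overly lenient) can be separated by reporting two error rates instead of a single accuracy. The following is a minimal sketch, assuming binary labels where True means "speaker is consistent"; the function name `judge_error_rates` and the toy labels are hypothetical and are not taken from the paper or its repository.

```python
def judge_error_rates(gold_consistent, pred_consistent):
    """Split a judge's errors into false alarms and misses.

    false_alarm_rate: share of truly consistent dialogues flagged as inconsistent
                      (high for a judge that overpredicts inconsistency).
    miss_rate:        share of truly inconsistent dialogues accepted as consistent
                      (high for an overly lenient judge).
    """
    if len(gold_consistent) != len(pred_consistent):
        raise ValueError("gold and predicted labels must have the same length")

    pairs = list(zip(gold_consistent, pred_consistent))
    consistent_total = sum(1 for g, _ in pairs if g)
    inconsistent_total = len(pairs) - consistent_total

    false_alarms = sum(1 for g, p in pairs if g and not p)
    misses = sum(1 for g, p in pairs if not g and p)

    nan = float("nan")
    return {
        "false_alarm_rate": false_alarms / consistent_total if consistent_total else nan,
        "miss_rate": misses / inconsistent_total if inconsistent_total else nan,
    }


# Toy labels (illustrative only): 4 consistent and 4 inconsistent dialogues.
gold = [True] * 4 + [False] * 4
strict_pred = [True, False, False, False] + [False] * 4   # flags almost everything
lenient_pred = [True] * 8                                  # accepts everything

strict_rates = judge_error_rates(gold, strict_pred)
lenient_rates = judge_error_rates(gold, lenient_pred)
```

On these toy labels the strict judge has a false-alarm rate of 0.75 and a miss rate of 0.0, while the lenient judge has a false-alarm rate of 0.0 and a miss rate of 1.0, so the two behaviors are distinguishable even when overall accuracy is similar.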
What carries the argument
The SpeakerSleuth benchmark, which consists of three tasks designed to evaluate LALMs on speaker consistency detection with varying acoustic difficulty levels across four datasets.
If this is right
- LALMs cannot yet serve as reliable judges for speaker consistency in audio dialogues due to their detection struggles.
- Providing textual context causes models to ignore acoustic information and focus on text.
- Models have inherent acoustic discrimination capabilities as shown by better performance in comparison and ranking tasks.
- Addressing the text-over-acoustics bias is necessary to create reliable audio-language judges.
Where Pith is reading between the lines
- Similar modality biases may exist in other evaluation tasks involving audio and language.
- Future model training could incorporate techniques to balance attention between text and audio modalities.
- The benchmark might be adapted to assess consistency in other attributes like emotion or speaking style.
- Real-world dialogue systems could benefit from hybrid judges that combine LALMs with dedicated acoustic analyzers.
Load-bearing premise
The benchmark's three tasks and controlled difficulty levels capture the essential real-world demands for speaker consistency judgment, with human verification providing accurate ground truth labels.
What would settle it
Demonstrating that LALMs correctly identify speaker inconsistencies like gender switches even when textual context is provided would challenge the finding of text prioritization bias.
Original abstract
Large Audio-Language Models (LALMs) as judges have emerged as a prominent approach for evaluating speech generation quality, yet their ability to assess speaker consistency across multi-turn dialogues remains unexplored. We present SpeakerSleuth, a benchmark evaluating whether LALMs can reliably judge speaker consistency across multi-turn dialogues through three tasks reflecting real-world requirements. We construct 1,818 human-verified evaluation instances across four diverse datasets spanning synthetic and real speech, with controlled acoustic difficulty. Evaluating twelve widely-used LALMs, we find that models struggle to reliably detect acoustic inconsistencies. For instance, given audio samples of the same speaker's turns, some models overpredict inconsistency, whereas others are overly lenient. Models further struggle to identify the exact turns that are problematic. When other interlocutors' turns are provided as textual context, performance degrades dramatically as models prioritize textual coherence over acoustic cues, failing to detect even obvious gender switches for a speaker. On the other hand, models perform substantially better in comparing and ranking acoustic variants, demonstrating inherent acoustic discrimination capabilities. These findings expose a significant bias in LALMs: they tend to prioritize text over acoustics, revealing fundamental modality imbalances that need to be addressed to build reliable audio-language judges. Our code and data are available at https://github.com/holi-lab/SpeakerSleuth.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SpeakerSleuth, a benchmark with three tasks for evaluating whether Large Audio-Language Models (LALMs) can judge speaker consistency across multi-turn dialogues. It constructs 1,818 human-verified instances spanning four synthetic and real-speech datasets with controlled acoustic difficulty levels, then evaluates twelve LALMs. The central empirical finding is that models exhibit a strong text-over-acoustics bias: performance degrades sharply when textual interlocutor turns are supplied (including failure to detect obvious gender switches), while pure acoustic variant ranking is substantially stronger.
Significance. If the reported patterns hold, the work supplies direct evidence of a modality imbalance in current LALMs that limits their reliability as audio-language judges. The public release of code and data, together with the scale of the human-verified test set, makes the contribution reproducible and extensible. The results are relevant to any application that relies on LALMs for speech evaluation or dialogue assessment.
major comments (2)
- [Section 4] Section 4 (Experiments) and Table 2: the dramatic degradation when textual context is added is presented as the key evidence for text-over-acoustics bias, yet no statistical significance tests (p-values, confidence intervals, or paired comparisons) are reported for the performance drops across the twelve models; without these, it is difficult to judge whether the observed differences are robust or could be explained by variance in the 1,818-instance set.
- [Section 3.2] Section 3.2 (Benchmark Construction): the claim that acoustic difficulty is 'controlled' across instances is central to attributing failures to modality bias rather than acoustic complexity; the manuscript should explicitly state the acoustic features or metrics used for stratification and report how many instances fall into each difficulty bin.
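The first major comment asks for confidence intervals on the performance drop when textual context is added. One common way to obtain such an interval is a paired percentile bootstrap over evaluation instances. The sketch below is illustrative only: the function name `paired_bootstrap_ci`, the resample count, and the toy correctness vectors are assumptions, not the authors' analysis or data.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(a) - accuracy(b).

    correct_a[i] and correct_b[i] are 0/1 correctness flags for the SAME
    instance i under two conditions (e.g. audio-only vs. audio plus text).
    Resampling instance indices keeps the pairing intact.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both conditions must be scored on the same instances")
    n = len(correct_a)
    if n == 0:
        raise ValueError("need at least one instance")

    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()

    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[max(0, int((1 - alpha / 2) * n_resamples) - 1)]
    point = sum(a - b for a, b in zip(correct_a, correct_b)) / n
    return point, lower, upper


# Toy data (illustrative only): 100 paired instances.
# 50 correct in both conditions, 30 correct only without text, 20 wrong in both.
audio_only = [1] * 80 + [0] * 20
with_text = [1] * 50 + [0] * 50

drop, ci_low, ci_high = paired_bootstrap_ci(audio_only, with_text)
```

If the interval for the drop excludes zero, the degradation is unlikely to be an artifact of sampling variance in the instance set; an interval that straddles zero would support the referee's concern.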
minor comments (2)
- [Abstract] Abstract: the exact number of models (twelve) and datasets (four) should be stated numerically rather than left implicit.
- [Figure 3] Figure 3 caption: the legend for acoustic-variant ranking curves is unclear about which line corresponds to which model family.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive comments. We address the two major points below and will incorporate clarifications and additional analyses in the revised manuscript.
- Referee: [Section 4] Section 4 (Experiments) and Table 2: the dramatic degradation when textual context is added is presented as the key evidence for text-over-acoustics bias, yet no statistical significance tests (p-values, confidence intervals, or paired comparisons) are reported for the performance drops across the twelve models; without these, it is difficult to judge whether the observed differences are robust or could be explained by variance in the 1,818-instance set.
  Authors: We agree that statistical significance testing would strengthen the presentation of the results. In the revised manuscript we will add paired tests (McNemar's test for the consistency detection tasks and the Wilcoxon signed-rank test for ranking) together with 95% confidence intervals for all reported metrics in Table 2 and the associated text. These additions will establish whether the observed performance drops are statistically significant.
  revision: yes
- Referee: [Section 3.2] Section 3.2 (Benchmark Construction): the claim that acoustic difficulty is 'controlled' across instances is central to attributing failures to modality bias rather than acoustic complexity; the manuscript should explicitly state the acoustic features or metrics used for stratification and report how many instances fall into each difficulty bin.
  Authors: We acknowledge that the current manuscript does not provide sufficient detail on the stratification procedure. We will revise Section 3.2 to explicitly describe the acoustic metrics used (signal-to-noise ratio, speaker-embedding cosine similarity from a pre-trained verification model, and prosodic variation) and will add a table reporting the number of instances per difficulty bin (low/medium/high) for each of the four source datasets.
  revision: yes
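The rebuttal names McNemar's test as the paired test for the consistency detection tasks. As a rough illustration of what that test computes, the sketch below implements the exact (binomial) two-sided form using only the standard library. The function name `mcnemar_exact` and the toy correctness vectors are hypothetical; the authors may use a different variant (for example the chi-square approximation) or a statistics package.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired 0/1 correctness flags.

    Only discordant pairs matter:
      b = instances correct under condition A but wrong under condition B
      c = instances wrong under condition A but correct under condition B
    Under the null hypothesis of no difference, b ~ Binomial(b + c, 0.5).
    Returns (b, c, p_value).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both conditions must be scored on the same instances")

    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference

    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (illustrative only): 40 instances correct in both conditions,
# 12 correct only without text, 2 correct only with text.
audio_only = [1] * 40 + [1] * 12 + [0] * 2
with_text = [1] * 40 + [0] * 12 + [1] * 2

b, c, p_value = mcnemar_exact(audio_only, with_text)
```

With 12 versus 2 discordant pairs the exact p-value is 212/16384, roughly 0.013. Because the test ignores instances on which both conditions agree, it suits the with-text versus without-text comparison, where every model is scored on the same instances.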
Circularity Check
No significant circularity
The paper is a purely empirical benchmark study. It defines three tasks, constructs 1,818 human-verified instances across synthetic and real datasets with controlled acoustic difficulty, runs twelve LALMs, and reports direct performance metrics (detection accuracy, ranking quality, degradation under textual context). No equations, parameter fits, uniqueness theorems, or self-citations are used to derive the central claims; the text-over-acoustics bias is observed from the experimental patterns themselves. The work is self-contained against external benchmarks and human ground truth.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human verification provides reliable ground truth for speaker consistency labels.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We present SpeakerSleuth, a benchmark evaluating whether LALMs can reliably judge speaker consistency across multi-turn dialogues through three tasks... 1,818 human-verified evaluation instances... models prioritize textual coherence over acoustic cues"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.