The Unheard Alternative: Contrastive Explanations for Speech-to-Text Models

Dennis Fucci; Guillaume Wisniewski; Lina Conti; Luisa Bentivogli; Marco Gaido; Matteo Negri

arXiv: 2509.26543 · v1 · submitted 2025-09-30 · cs.CL · cs.AI


Pith reviewed 2026-05-18 11:52 UTC · model grok-4.3


keywords: contrastive explanations, speech-to-text, feature attribution, speech translation, gender assignment, spectrogram analysis, explainable AI, audio features


The pith

A method using feature attribution on spectrograms produces contrastive explanations for why speech-to-text models select one output over another.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops the first technique for contrastive explanations in speech-to-text models, which explain the choice of a target output instead of a foil alternative by examining influences from the input audio. It adapts existing feature attribution methods to spectrogram representations so that specific regions of the audio signal can be linked to particular model decisions. A case study on gender assignment in speech translation demonstrates that the resulting explanations correctly isolate the audio features responsible for selecting one gender over the other. A sympathetic reader would care because such explanations are viewed as more informative than standard ones and could help users understand, debug, and trust generative speech systems.

Core claim

By drawing from feature attribution techniques, the authors propose the first method to obtain contrastive explanations in speech-to-text generative models through analysis of how parts of the input spectrogram influence the choice between alternative outputs. Through a case study on gender assignment in speech translation, they show that the method accurately identifies the audio features that drive the selection of one gender over another.

What carries the argument

Contrastive feature attribution on spectrogram inputs, which distinguishes the influence of audio regions on target versus foil outputs in generative speech-to-text models.
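A minimal sketch of what contrastive attribution on a spectrogram can look like, assuming a simple occlusion scheme: mask one time-frequency patch at a time and record how much the target's share of the target-plus-foil probability drops. This is not the authors' implementation. The fixed rectangular patch grid, the zero fill value, and the names `contrastive_occlusion_map` and `toy_model` are all hypothetical, and the toy model stands in for a real speech-to-text system.

```python
import numpy as np

def relative_prob(p_target, p_foil):
    """Share of the target-plus-foil probability mass held by the target."""
    return p_target / (p_target + p_foil)

def contrastive_occlusion_map(spec, model, patch=(4, 4), fill=0.0):
    """Score each patch by the drop in the target's relative probability
    when that patch is masked (positive = evidence for target over foil)."""
    p_t, p_f = model(spec)
    base = relative_prob(p_t, p_f)
    scores = np.zeros_like(spec, dtype=float)
    n_freq, n_time = spec.shape
    for f0 in range(0, n_freq, patch[0]):
        for t0 in range(0, n_time, patch[1]):
            masked = spec.copy()
            masked[f0:f0 + patch[0], t0:t0 + patch[1]] = fill
            q_t, q_f = model(masked)
            scores[f0:f0 + patch[0], t0:t0 + patch[1]] = base - relative_prob(q_t, q_f)
    return scores

def toy_model(spec):
    """Hypothetical stand-in for an S2T model: returns (p_target, p_foil).
    Low-band energy (rows 0-3) favours the target, high-band the foil."""
    logits = np.array([spec[:4].mean(), spec[4:].mean()])
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[0], probs[1]

# Toy spectrogram (frequency x time) with energy only in one low-band patch.
spec = np.zeros((8, 8))
spec[:4, :4] = 4.0
saliency = contrastive_occlusion_map(spec, toy_model)
```

In this toy setup only the patch carrying low-band energy receives a positive score; masking any other patch leaves the relative probability unchanged, so its score is zero.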

If this is right

The approach enables identification of specific audio cues behind gender decisions in speech translation systems.
It supplies a practical way to generate more informative explanations than standard feature attributions for S2T models.
The technique offers a foundation for extending contrastive explanations to other speech-to-text tasks beyond gender assignment.
Developers can use the output to locate and address unwanted biases in how models process audio inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same spectrogram-based attribution could be tested on tasks involving accent, emotion, or speaker identity to check broader applicability.
Combining these explanations with text-side attributions might produce joint audio-text accounts of translation decisions.
If the method proves stable across languages, it could support audits of deployed speech systems for fairness compliance.

Load-bearing premise

Feature attribution techniques can be directly and effectively applied to spectrogram inputs to produce meaningful contrastive explanations that distinguish influences on target versus foil outputs in generative speech-to-text models.

What would settle it

If modifying the spectrogram regions highlighted by the explanations does not change the model's output from target to foil, while modifying unhighlighted regions does, the claim that the method accurately identifies the driving audio features would be falsified.
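The test described above can be sketched as a small deletion check, assuming a saliency map is already available: mask the most salient cells and see whether the output flips to the foil, then mask the same number of least salient cells as a control. Everything here is illustrative. `toy_model`, the hand-made `saliency` array, and the helper names are hypothetical, and a real evaluation would report a flip rate over many utterances rather than a single example.

```python
import numpy as np

def toy_model(spec):
    """Hypothetical classifier returning (p_target, p_foil).
    The target wins when low-band energy (rows 0-3) exceeds high-band energy."""
    logits = np.array([spec[:4].sum(), spec[4:].sum()])
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[0], probs[1]

def ranked_cells(saliency, k, most_salient=True):
    """Return the k (row, col) cells with the highest (or lowest) saliency."""
    order = np.argsort(saliency, axis=None)
    if most_salient:
        order = order[::-1]
    return [tuple(int(i) for i in np.unravel_index(idx, saliency.shape))
            for idx in order[:k]]

def flips_to_foil(spec, cells, model, fill=0.0):
    """Mask the given cells and report whether the foil now beats the target."""
    masked = spec.copy()
    rows, cols = zip(*cells)
    masked[list(rows), list(cols)] = fill
    p_t, p_f = model(masked)
    return bool(p_f > p_t)

# Toy input: strong low-band energy (target cue), weak high-band energy (foil cue).
spec = np.zeros((8, 8))
spec[:4, :4] = 1.0
spec[4:, :] = 0.25

# Hand-made saliency map standing in for the explanation under test.
saliency = np.zeros((8, 8))
saliency[:4, :4] = 1.0
saliency[4:, :] = -0.5

flip_top = flips_to_foil(spec, ranked_cells(saliency, 16, True), toy_model)
flip_bottom = flips_to_foil(spec, ranked_cells(saliency, 16, False), toy_model)
```

If the explanation is faithful, `flip_top` should be true and `flip_bottom` false, as happens in this toy case. The reverse pattern is the falsifying outcome described above.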

Original abstract

Contrastive explanations, which indicate why an AI system produced one output (the target) instead of another (the foil), are widely regarded in explainable AI as more informative and interpretable than standard explanations. However, obtaining such explanations for speech-to-text (S2T) generative models remains an open challenge. Drawing from feature attribution techniques, we propose the first method to obtain contrastive explanations in S2T by analyzing how parts of the input spectrogram influence the choice between alternative outputs. Through a case study on gender assignment in speech translation, we show that our method accurately identifies the audio features that drive the selection of one gender over another. By extending the scope of contrastive explanations to S2T, our work provides a foundation for better understanding S2T models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces the first contrastive explanation method for speech-to-text models by adapting feature attribution to spectrograms, shown via a gender assignment case study in translation.


The one thing to know about this paper is that it introduces what appears to be the first approach to contrastive explanations specifically for speech-to-text generative models. By using feature attribution on the input spectrogram, it aims to highlight which parts of the audio drive the model toward one output rather than a foil alternative, and the authors demonstrate this on a gender assignment task in speech translation.

What stands out positively is the identification of a genuine gap in the literature. Most contrastive explanation work focuses on text or image models, and extending it to S2T makes sense given how these systems are used in practice. The choice of gender assignment as the case study is practical because it involves a clear decision point where understanding the influencing audio features could reveal biases or decision patterns in translation outputs. If the paper includes visualizations or specific examples of attributions pointing to relevant acoustic segments, that adds value for practitioners.

On the softer side, the validation seems thin from the description. The claim that the method accurately identifies the driving features relies on a case study that is outlined at a high level, without apparent quantitative metrics, baseline comparisons, or experiments that confirm the attributions match actual causal audio properties like pitch or formants. The stress-test note correctly flags the need for intervention tests or correlations with known gender cues to verify that the saliency maps are not just artifacts. Without those, the central results are harder to trust fully, though this might be addressable in revisions.

Overall, this is the kind of paper that would interest people working on interpretability for speech and multimodal models. A reader looking for new tools to debug or understand S2T decisions could pick up useful ideas here, particularly if they are already familiar with feature attribution methods from other domains. The paper shows clear thinking in applying existing ideas to a new modality and engages honestly with the challenge of generative outputs. I think it deserves to go to peer review so that experts can assess the technical details and suggest improvements to the evaluation.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the first method to obtain contrastive explanations for speech-to-text (S2T) generative models. It adapts feature attribution techniques to spectrogram inputs to identify how specific audio regions influence the model to select one output (target) over an alternative (foil). The approach is demonstrated via a case study on gender assignment in speech translation, where the method is claimed to accurately identify the audio features driving selection of one gender over another.

Significance. If the attributions reliably isolate causal acoustic drivers rather than model artifacts or input correlations, the work would meaningfully extend contrastive explanations—an established XAI tool valued for its focus on why one output was chosen over another—to the S2T domain. This could support better debugging of generative audio models, particularly on bias-related behaviors such as gender in translation. The paper correctly draws from established feature attribution ideas and avoids circularity by using an independent case study.

major comments (2)

[Abstract / Case Study] Abstract and case-study description: the central claim that the method 'accurately identifies the audio features that drive the selection of one gender over another' rests on a high-level description without reported quantitative metrics, controls, or validation steps. No correlation is shown with established gender cues (F0, formant structure) and no intervention results (e.g., perturbation of attributed regions and measured output flip rate) are provided. This directly undermines the accuracy assertion that is load-bearing for the paper's contribution.
[Method] Method section: the adaptation of feature attribution to spectrogram inputs for autoregressive S2T decoding is presented at a high level. It is unclear how attributions are aggregated across decoding steps or whether stability checks against spurious saliency maps were performed; without these details the claim that the maps distinguish target versus foil influences cannot be evaluated.

minor comments (2)

[Abstract] The abstract would be clearer if it named the specific feature attribution technique (gradient-based, perturbation-based, etc.) rather than referring only to 'drawing from feature attribution techniques.'
[Introduction] Notation for target/foil outputs and spectrogram regions should be introduced consistently in the main text to aid readers unfamiliar with contrastive XAI.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We have carefully considered the major comments and provide point-by-point responses below. We plan to make revisions to address the concerns raised regarding validation and methodological details.


Referee: [Abstract / Case Study] Abstract and case-study description: the central claim that the method 'accurately identifies the audio features that drive the selection of one gender over another' rests on a high-level description without reported quantitative metrics, controls, or validation steps. No correlation is shown with established gender cues (F0, formant structure) and no intervention results (e.g., perturbation of attributed regions and measured output flip rate) are provided. This directly undermines the accuracy assertion that is load-bearing for the paper's contribution.

Authors: We acknowledge that the case study in the current manuscript is presented qualitatively without quantitative metrics or validation experiments. To address this, we will augment the paper with quantitative analyses, including correlations between the attributed spectrogram regions and established acoustic cues for gender such as F0 and formant structure. We will also conduct and report intervention studies involving perturbations of the highlighted audio features and measure the resulting changes in the model's gender selection output. These additions will provide stronger empirical grounding for our claims.
revision: yes
Referee: [Method] Method section: the adaptation of feature attribution to spectrogram inputs for autoregressive S2T decoding is presented at a high level. It is unclear how attributions are aggregated across decoding steps or whether stability checks against spurious saliency maps were performed; without these details the claim that the maps distinguish target versus foil influences cannot be evaluated.

Authors: We agree that the Method section would benefit from greater detail. In the revision, we will elaborate on the procedure for aggregating feature attributions across the multiple decoding steps of the autoregressive S2T model. We will specify the aggregation method used and include results from stability checks to verify that the saliency maps are reliable and not artifacts. This will clarify how the contrastive attributions effectively differentiate the influences of the target and foil outputs.
revision: yes

Circularity Check

0 steps flagged

No circularity: proposal and case-study validation are self-contained


The paper proposes an adaptation of existing feature attribution methods to produce contrastive explanations for spectrogram inputs in S2T models and demonstrates the approach via an independent case study on gender assignment. No equations, fitted parameters, or self-citations are shown to reduce the central claim to a tautology or to the inputs by construction. The derivation chain relies on external feature-attribution literature and empirical results on held-out audio data rather than self-referential definitions or predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, axioms, or invented entities; the approach adapts existing feature attribution methods without introducing new postulated components.

pith-pipeline@v0.9.0 · 5673 in / 1010 out tokens · 31192 ms · 2026-05-18T11:52:53.243955+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We repurpose the relative scorer of Jacovi et al. (2021) ... $\mathrm{SCR}(t,f) = \frac{p(t)}{p(t)+p(f)} - \frac{\tilde{p}(t)}{\tilde{p}(t)+\tilde{p}(f)}$"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "SPES segments the input spectrogram ... influence of each segment on the model's output is quantified by comparing the perturbed and original probabilities"
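To make the relative score in the first passage above concrete, here is a small numeric sketch. It assumes that p(t) and p(f) are the model's probabilities for the target and foil on the original input and that the tilde versions are the same probabilities after a perturbation; that reading is an assumption, since the quoted passage is elided. The function name `scr` and the numbers are illustrative only.

```python
def scr(p_t, p_f, p_t_pert, p_f_pert):
    """Relative score as quoted: target share before minus target share after."""
    return p_t / (p_t + p_f) - p_t_pert / (p_t_pert + p_f_pert)

# Perturbation A halves both probabilities: the target loses absolute
# probability, but its share relative to the foil is unchanged.
plain_drop_a = 0.6 - 0.3
scr_a = scr(0.6, 0.2, 0.3, 0.1)

# Perturbation B lowers the target and raises the foil: the share shifts.
plain_drop_b = 0.6 - 0.3
scr_b = scr(0.6, 0.2, 0.3, 0.3)
```

Both perturbations reduce the target probability by the same amount, yet only the second changes the balance between target and foil, so only `scr_b` is clearly non-zero. This is the sense in which a contrastive score can differ from a standard probability-drop attribution.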

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.