Explainable AI in Speaker Recognition -- Making Latent Representations Understandable

Mark D. Plumbley; Wenwu Wang; Yanze Xu

arxiv: 2604.23354 · v2 · pith:7ECUX2WTnew · submitted 2026-04-25 · 📡 eess.AS · cs.AI · eess.SP

Yanze Xu, Wenwu Wang, Mark D. Plumbley

Pith reviewed 2026-05-08 06:54 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · eess.SP

keywords explainable AI · speaker recognition · latent representations · hierarchical clustering · SLINK · HDBSCAN · HCCM · Liebig's score

The pith

Speaker recognition neural networks organize their latent representations into hierarchical clusters that align with semantic attributes like gender and nationality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how speaker recognition networks structure their internal representations by applying hierarchical clustering algorithms. It finds that these representations form nested clusters rather than separate independent groups, as shown by SLINK and HDBSCAN. To make sense of these clusters, the authors create HCCM to link them directly to semantic classes or combinations of classes. Liebig's score then evaluates how well the links work, pointing out what prevents better alignment between the network's learned patterns and human-defined speaker categories.

Core claim

This work shows that applying Single-Linkage Clustering and HDBSCAN to the latent space of a speaker recognition network uncovers hierarchical clustering phenomena, where clusters have relationships rather than being isolated. The new HCCM algorithm matches these clusters to semantic classes, succeeding for single classes such as male or UK and for conjunctions such as male and UK or female and Ireland. Liebig's score measures the quality of these matches to identify the main limitations in the matching process.
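To make the contrast between flat and hierarchical clustering concrete, here is a minimal pure-Python sketch of single-linkage agglomeration on toy 2-D points. The points stand in for embeddings, and the data and the function name `single_linkage_merges` are illustrative assumptions, not the paper's setup or the SLINK algorithm's optimised form. The sketch records every merge, so a cluster formed at a small distance reappears nested inside a cluster formed at a larger one.

```python
from itertools import combinations
from math import dist

def single_linkage_merges(points):
    """Return single-linkage merges as (distance, merged_cluster), in merge order.

    Naive Kruskal-style sketch: sort all pairwise distances and union the two
    clusters joined by each edge. Suitable for toy data only.
    """
    parent = list(range(len(points)))
    members = {i: frozenset([i]) for i in range(len(points))}

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (dist(points[a], points[b]), a, b)
        for a, b in combinations(range(len(points)), 2)
    )
    merges = []
    for d, a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # both points are already in the same cluster
        parent[rb] = ra
        members[ra] = members[ra] | members.pop(rb)
        merges.append((d, members[ra]))
    return merges

# Toy "embeddings": two tight pairs near each other, plus one far-away pair.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0), (20.0, 0.0), (20.0, 1.0)]
merges = single_linkage_merges(points)
for d, cluster in merges:
    print(f"{d:5.2f}  {sorted(cluster)}")
```

In this toy run the pair {0, 1} formed at distance 1 is contained in the cluster {0, 1, 2, 3} formed at distance 3, which is the kind of cluster-to-cluster relationship that a flat partition cannot express.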

What carries the argument

The Hierarchical Cluster-Class Matching (HCCM) algorithm establishes one-to-one correspondences between the hierarchical clusters produced by SLINK or HDBSCAN and predefined semantic classes or their conjunctions. The quality of each correspondence is evaluated with Liebig's score.
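The HCCM procedure itself is not spelled out on this page, so the following is only an illustrative stand-in for the general idea of one-to-one cluster-class matching. It assumes Jaccard overlap as the matching degree and a greedy best-first assignment; both choices, the function names, and the toy metadata are hypothetical and should not be read as the authors' algorithm.

```python
from itertools import product

def jaccard(a, b):
    """Overlap between two sets of utterance ids (1.0 means identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def build_classes(metadata):
    """Single-attribute classes plus pairwise conjunctions (e.g. 'male & UK')."""
    genders = {g for g, _ in metadata.values()}
    nations = {n for _, n in metadata.values()}
    classes = {}
    for g in genders:
        classes[g] = {u for u, (gg, _) in metadata.items() if gg == g}
    for n in nations:
        classes[n] = {u for u, (_, nn) in metadata.items() if nn == n}
    for g, n in product(genders, nations):
        members = classes[g] & classes[n]
        if members:
            classes[f"{g} & {n}"] = members
    return classes

def greedy_one_to_one(clusters, classes):
    """Pair clusters with classes, best Jaccard first, each used at most once."""
    scored = sorted(
        ((jaccard(c, k), cname, kname)
         for cname, c in clusters.items()
         for kname, k in classes.items()),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_clusters, used_classes, matches = set(), set(), {}
    for score, cname, kname in scored:
        if score > 0 and cname not in used_clusters and kname not in used_classes:
            matches[cname] = (kname, score)
            used_clusters.add(cname)
            used_classes.add(kname)
    return matches

# Hypothetical metadata: utterance id -> (gender, nationality).
metadata = {
    0: ("male", "UK"), 1: ("male", "UK"), 2: ("male", "Ireland"),
    3: ("female", "Ireland"), 4: ("female", "Ireland"), 5: ("female", "UK"),
}
# Hypothetical nested clusters, as a hierarchy might produce (B lies inside A).
clusters = {"A": {0, 1, 2}, "B": {0, 1}, "C": {3, 4}}
matches = greedy_one_to_one(clusters, build_classes(metadata))
print(matches)
```

Here the outer cluster A pairs with the single class "male", while the nested cluster B pairs with the conjunction "male & UK", mirroring the two kinds of match described above.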

If this is right

Clusters can correspond to individual semantic classes or to their logical combinations, indicating that the network encodes interacting attributes.
The matching process helps diagnose whether poor performance comes from the clustering step or from the choice of semantic labels.
Successful matches demonstrate that certain speaker attributes are explicitly grouped in the representation space.
Liebig's score provides a quantitative way to compare how different networks or clustering methods capture semantic structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the hierarchical structure reflects real speaker attribute hierarchies, it could guide the creation of networks that learn more disentangled representations.
This method might be applied to other tasks in audio processing to reveal hidden organizations in learned features.
Manual verification of the content in matched clusters could further validate whether the alignments capture intended meanings.

Load-bearing premise

The hierarchical clusters identified by SLINK and HDBSCAN reflect meaningful semantic groupings of speakers that can be systematically matched by HCCM in a non-random fashion.

What would settle it

Running HCCM on the clusters and finding that the majority of matches have low Liebig's scores or that the clusters contain utterances not sharing the expected semantic properties.

Figures

Figures reproduced from arXiv: 2604.23354 by Mark D. Plumbley, Wenwu Wang, Yanze Xu.

**Figure 1.** An approximate 2-dimensional visualisation for the …

**Figure 2.** An illustration of the Hierarchical Cluster-Class Match…

**Figure 3.** An illustration for interpreting the matching degree …

**Figure 4.** Pseudocode of the DBSCAN from RJGB Campello’s …

**Figure 5.** An illustration of intersecting predefined representation …

**Figure 6.** An overview of experimental procedures. … separately, because when the density constraint parameter minPts in McInnes et al.’s HDBSCAN implementation is set to 0, the mutual reachability distance space coincides with the original Euclidean distance space, in which case McInnes et al.’s HDBSCAN implementation reduces to running SLINK directly in the original Euclidean distance space. More detailed setups of h…

**Figure 7.** Cluster-class matching results [26] for evaluating the hierarchical clustering results obtained by applying SLINK (i.e. minPts = 0) and HDBSCAN (i.e. minPts = 2, 4, 6, 8, 12, 16, 21, 27) to representations (i.e. embeddings) of 0.2, 1, 2, and 4-sec audios. … hierarchical representation clusters of both Fig. 7b and Fig. 7c are consistently produced by applying SLINK to 4-second audio representations, achiev…
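The Figure 6 excerpt states that with minPts set to 0 the mutual reachability distance coincides with the Euclidean distance. The sketch below illustrates why, using the standard definition max(core(a), core(b), d(a, b)). The convention that the core distance is the distance to the k-th nearest other point, and that it is 0 when k = 0, is an assumption made here for illustration; library implementations may count neighbours differently.

```python
from math import dist

def core_distance(points, i, k):
    """Distance from point i to its k-th nearest other point (0.0 when k == 0)."""
    if k == 0:
        return 0.0
    others = sorted(dist(points[i], p) for j, p in enumerate(points) if j != i)
    return others[k - 1]

def mutual_reachability(points, a, b, k):
    """max(core_k(a), core_k(b), d(a, b)) for the points at indices a and b."""
    return max(core_distance(points, a, k),
               core_distance(points, b, k),
               dist(points[a], points[b]))

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]

# k = 0: every core distance is 0, so the plain Euclidean distance is recovered.
assert all(
    mutual_reachability(points, a, b, 0) == dist(points[a], points[b])
    for a in range(len(points)) for b in range(len(points)) if a != b
)

# k = 2: core distances can exceed the raw distance and push points apart.
print(mutual_reachability(points, 0, 1, 2))  # raw distance is 1.0
print(mutual_reachability(points, 2, 3, 2))  # raw distance is 8.0
```

With k = 0 a single-linkage pass over these distances is a single-linkage pass over Euclidean distances, which is the reduction to SLINK that the excerpt describes.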

Original abstract

Neural networks can be trained to learn task-relevant representations from data. Understanding how these networks make decisions falls within the Explainable AI (XAI) domain. This paper proposes to study an XAI topic: uncovering the unknown organisation in the representations, particularly those a speaker recognition network learns from utterances, for recognising speaker identity. Past studies have employed algorithms (e.g. K-means) to analyse how network representations can be naturally organised into independent clusters in different ways, i.e., to analyse flat clustering phenomena within the space defined by these representations, referred to as the network representation space. In contrast, this work applies two algorithms, Single-Linkage Clustering (SLINK) and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), to analyse how representations form hierarchical clusters in different ways, i.e., to analyse hierarchical clustering phenomena within the network representation space. To further understand these hierarchical clustering phenomena, we propose a new algorithm termed Hierarchical Cluster-Class Matching (HCCM). HCCM provides a semantic interpretation for the hierarchical clusters produced by SLINK and HDBSCAN by matching them to predefined semantic classes. Through this process, some clusters are interpreted as individual semantic classes (e.g. male), whereas others are interpreted as conjunctions of individual semantic classes (e.g. female and Ireland). In addition, we develop a new metric, the Liebig score, to quantify how well a cluster matches a semantic class, which helps identify the factor that most strongly limits each match.
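The abstract does not give the formula for the Liebig score. As a purely hypothetical stand-in for a metric that reports a limiting factor, the sketch below scores a match as the smaller of precision and recall and names whichever of the two is the bottleneck. The function name, the choice of factors, and the toy sets are all assumptions; the paper's actual definition may differ.

```python
def limiting_factor_score(cluster, semantic_class):
    """Hypothetical stand-in, NOT the paper's formula.

    precision: share of the cluster that belongs to the class
    recall:    share of the class that the cluster captures
    The score is the smaller of the two, and the name of that smaller
    factor is returned as the limiting one.
    """
    overlap = len(cluster & semantic_class)
    factors = {
        "precision": overlap / len(cluster) if cluster else 0.0,
        "recall": overlap / len(semantic_class) if semantic_class else 0.0,
    }
    limiting = min(factors, key=factors.get)
    return factors[limiting], limiting, factors

cluster = {0, 1, 2, 3}            # hypothetical utterance ids in one cluster
male = {0, 1, 2, 3, 4, 5, 6, 7}   # hypothetical ids labelled "male"
score, limiting, factors = limiting_factor_score(cluster, male)
print(score, limiting, factors)
```

In this toy case the cluster is pure but covers only half of the class, so recall is reported as the factor limiting the match.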

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces HCCM and Liebig's score to match hierarchical clusters in speaker embeddings to semantic classes like gender and accent, but without null models the hierarchy claim rests on shaky ground.

The main point is that this work moves from flat analyses, such as t-SNE visualization and k-means clustering, to SLINK and HDBSCAN on speaker recognition embeddings. It then adds HCCM to match the resulting dendrograms or density clusters to metadata labels and their conjunctions, scored by a new metric the authors call Liebig's score. Some matches land on single classes and others on pairs, which the authors use to diagnose what limits performance. The extension to hierarchical structure and the matching procedure are the concrete novelties relative to the cited prior work.

The approach is straightforward to understand and could be useful for spotting how networks encode nested speaker attributes. The paper does a reasonable job of laying out the motivation from existing XAI visualization techniques and showing example alignments.

The soft spot is the missing controls. Hierarchical algorithms will always induce some tree or density structure on finite data, so the claim that this reveals intrinsic hierarchical phenomena in the network space needs a baseline comparison, such as shuffled embeddings or permuted labels, to show the matches are stronger than chance or than what linear factors already known in speaker data would produce. The abstract gives no numbers, error rates, or validation that the matches are not post-hoc, which leaves the central results hard to assess. If the full paper supplies those and the methods section is reproducible, the contribution becomes more solid.

This is for people working on interpretability in audio models who already use clustering tools and want a way to tie clusters back to semantic metadata. A reader in that niche could get practical ideas from the HCCM design and the score, even if they end up modifying it. The paper shows clear engagement with the literature on representation analysis and proposes distinct algorithms, so it deserves peer review rather than a desk reject. I'd send it out but ask the referees to focus on the baseline experiments and quantitative results.

Referee Report

3 major / 2 minor

Summary. The paper claims that speaker recognition networks exhibit hierarchical (rather than flat) clustering phenomena in their latent representations. It demonstrates this by applying SLINK and HDBSCAN to speaker embeddings, then introduces Hierarchical Cluster-Class Matching (HCCM) to align the resulting dendrograms or density hierarchies one-to-one with semantic metadata classes (gender, accent, etc.) and their conjunctions; a new metric called Liebig's score is proposed to quantify matching performance and diagnose limiting factors.

Significance. If the observed hierarchies prove intrinsic rather than algorithm-imposed and the HCCM matches are shown to be non-arbitrary, the work would supply a concrete, reproducible method for moving XAI in speaker recognition beyond t-SNE/K-means visualizations toward interpretable hierarchical structure, with potential utility for diagnosing embedding biases or improving downstream tasks.

major comments (3)

[Methods/Results] Methods and Results sections: No null-model controls (shuffled embeddings, random vectors with matched covariance, or label-permuted baselines) are reported to test whether the dendrograms or density hierarchies recovered by SLINK/HDBSCAN are stronger or more semantically aligned than those expected from unstructured point clouds; without such controls the central claim that the algorithms reveal 'hierarchical clustering phenomena within the network representation space' cannot be distinguished from the fact that any finite set of points induces some hierarchy under these algorithms.
[Abstract/Results] Abstract and Results: The assertion that 'some hierarchical clusters are successfully matched' to classes or conjunctions is presented without any quantitative metrics (matching accuracy, Liebig's score values, confusion matrices, or statistical significance tests), error analysis, or validation that the HCCM alignments are not post-hoc; this absence makes the performance claims unverifiable and prevents assessment of whether Liebig's score actually diagnoses limiting factors.
[Proposed Method] Definition of HCCM and Liebig's score: The one-to-one matching procedure and the scoring formula are introduced as novel contributions, yet the manuscript provides no formal algorithmic description, complexity analysis, or proof that the matching is unique or stable under small perturbations of the dendrogram; these omissions render the new entities difficult to reproduce or compare against existing hierarchical clustering evaluation methods.
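The label-permuted baseline requested in the first major comment can be sketched as follows. This is a minimal illustration, assuming Jaccard overlap as a stand-in matching degree and toy labels; the function names and data are hypothetical, and the paper's own score would take the place of `jaccard` in a real control experiment.

```python
import random

def jaccard(a, b):
    """Overlap between two sets of utterance ids (1.0 means identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def permutation_p_value(cluster, labels, target, n_perm=200, seed=0):
    """Share of label shuffles whose cluster-class overlap is >= the observed one.

    cluster: set of utterance ids; labels: dict id -> class label.
    Shuffling labels keeps class sizes but breaks any link to the clustering.
    """
    rng = random.Random(seed)
    ids = sorted(labels)
    observed = jaccard(cluster, {i for i in ids if labels[i] == target})
    values = [labels[i] for i in ids]
    null_scores = []
    for _ in range(n_perm):
        rng.shuffle(values)
        shuffled_class = {i for i, v in zip(ids, values) if v == target}
        null_scores.append(jaccard(cluster, shuffled_class))
    exceed = sum(s >= observed for s in null_scores)
    # Add-one correction so the estimate is never exactly zero.
    return observed, (exceed + 1) / (n_perm + 1), null_scores

# Hypothetical data: 20 utterances, the first 10 labelled "male".
labels = {i: ("male" if i < 10 else "female") for i in range(20)}
cluster = set(range(10))  # a cluster that coincides with the "male" class
observed, p_value, null_scores = permutation_p_value(cluster, labels, "male")
print(observed, p_value)
```

A match whose observed score sits well above the bulk of the shuffled scores is unlikely to be an artefact of class sizes alone. A match that falls inside the shuffled distribution gives no evidence of semantic alignment.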

minor comments (2)

[Introduction] Notation for semantic classes and their conjunctions is introduced informally; a small table or explicit enumeration of the metadata attributes used would improve clarity.
[Figures] Figure captions for the dendrograms and cluster visualizations should include the exact hyper-parameters (minimum cluster size for HDBSCAN, linkage threshold for SLINK) and the number of embeddings plotted.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We address each of the major comments below, indicating the revisions we plan to make to enhance the manuscript's rigor and clarity.

Referee: [Methods/Results] Methods and Results sections: No null-model controls (shuffled embeddings, random vectors with matched covariance, or label-permuted baselines) are reported to test whether the dendrograms or density hierarchies recovered by SLINK/HDBSCAN are stronger or more semantically aligned than those expected from unstructured point clouds; without such controls the central claim that the algorithms reveal 'hierarchical clustering phenomena within the network representation space' cannot be distinguished from the fact that any finite set of points induces some hierarchy under these algorithms.

Authors: We agree that null-model controls are crucial to substantiate our claims about intrinsic hierarchical clustering in the learned representations rather than artifacts of the clustering algorithms. In the revised version, we will add experiments using shuffled embeddings, random vectors with matched covariance structure, and label-permuted baselines. We will apply SLINK and HDBSCAN to these controls and compare the resulting hierarchies and their semantic alignments using Liebig's score and other metrics, including statistical tests to demonstrate significance.
revision: yes

Referee: [Abstract/Results] Abstract and Results: The assertion that 'some hierarchical clusters are successfully matched' to classes or conjunctions is presented without any quantitative metrics (matching accuracy, Liebig's score values, confusion matrices, or statistical significance tests), error analysis, or validation that the HCCM alignments are not post-hoc; this absence makes the performance claims unverifiable and prevents assessment of whether Liebig's score actually diagnoses limiting factors.

Authors: We acknowledge the need for quantitative support. Although the manuscript introduces Liebig's score for this purpose, we will revise the Results section to prominently feature specific numerical values of Liebig's score for the reported matches, include confusion matrices for the HCCM procedure, provide error analysis, and conduct statistical significance tests. We will also detail the deterministic steps in HCCM to show that alignments are not arbitrary post-hoc choices but follow predefined matching criteria.
revision: yes

Referee: [Proposed Method] Definition of HCCM and Liebig's score: The one-to-one matching procedure and the scoring formula are introduced as novel contributions, yet the manuscript provides no formal algorithmic description, complexity analysis, or proof that the matching is unique or stable under small perturbations of the dendrogram; these omissions render the new entities difficult to reproduce or compare against existing hierarchical clustering evaluation methods.

Authors: We will include a formal description of the HCCM algorithm with pseudocode in the Methods section. A complexity analysis will be added, showing that the procedure scales as O(N log N) where N is the number of clusters. For stability, we will perform empirical tests by introducing small perturbations to the dendrograms and measuring the consistency of the matches. While a general mathematical proof of uniqueness may require additional assumptions on the data distribution and is beyond the current scope, the empirical evidence and comparison to standard hierarchical evaluation metrics will be provided to facilitate reproducibility and comparison.
revision: partial

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper applies the standard algorithms SLINK and HDBSCAN to existing speaker embeddings to identify clusters, then introduces the new HCCM procedure for one-to-one semantic matching and Liebig's score for quantification. No step reduces by construction to its inputs: there are no self-definitional loops where a claimed result is presupposed in the definition of the method, no fitted parameters relabeled as predictions, and no load-bearing self-citations or imported uniqueness theorems. The central claims rest on external clustering routines and a novel matching algorithm whose performance is evaluated against predefined metadata classes, keeping the chain self-contained and independent of the target observations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claims rest on the assumption that hierarchical clustering algorithms will reveal semantically meaningful structure and that the proposed matching can be evaluated meaningfully; no free parameters, axioms, or invented physical entities are introduced.

invented entities (2)

Hierarchical Cluster-Class Matching (HCCM) · no independent evidence
purpose: Perform one-to-one matching between hierarchical clusters and semantic classes
New algorithm introduced to link representation clusters to labels such as gender or nationality.

Liebig's score · no independent evidence
purpose: Quantify performance of the cluster-class matching
New metric proposed to diagnose factors limiting matching success.
