arxiv: 2601.19709 · v1 · submitted 2026-01-27 · 💻 cs.SD · cs.AI

Hyperbolic Additive Margin Softmax with Hierarchical Information for Speaker Verification

Pith reviewed 2026-05-16 10:44 UTC · model grok-4.3

keywords: speaker verification · hyperbolic space · softmax loss · additive margin · hierarchical embedding · equal error rate · speaker embedding

The pith

Projecting speaker embeddings into hyperbolic space lets softmax losses capture hierarchical structure and lower verification error rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that standard Euclidean softmax losses cannot adequately represent the hierarchical relations inside speaker features. It maps both embeddings and speaker centers onto the hyperboloid model and replaces the usual similarity computation with hyperbolic distance. The resulting H-Softmax loss embeds hierarchy directly into the training objective; an additive-margin variant, HAM-Softmax, adds a separation term on top. On common speaker verification test sets, H-Softmax reduces equal error rate (EER) by an average of 27.84 percent relative to plain softmax, and HAM-Softmax reduces it by an average of 14.23 percent relative to additive-margin softmax. The claim is that the same network backbone can therefore be kept while the loss itself supplies the missing hierarchical modeling capacity.
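The relative reductions quoted above follow the usual arithmetic for relative EER change. A minimal sketch is given here; the numbers in it are made up for illustration and are not results from the paper, and averaging the per-test-set reductions is an assumption about how the paper's averages are formed.

```python
def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


def average_relative_reduction(pairs):
    """Mean of per-test-set relative reductions (assumed averaging scheme)."""
    values = [relative_reduction(base, new) for base, new in pairs]
    return sum(values) / len(values)


# Hypothetical EER values in percent, NOT taken from the paper.
hypothetical_pairs = [(2.0, 1.5), (1.0, 0.9)]
print(relative_reduction(2.0, 1.5))  # 25.0
print(average_relative_reduction(hypothetical_pairs))
```

Note that a relative reduction says nothing about the absolute EER level, so the same percentage can correspond to very different absolute gains on easy and hard test sets.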

Core claim

Replacing the Euclidean inner-product term inside the softmax with hyperbolic distance, after first projecting the Euclidean embeddings and class centers onto the hyperboloid, produces speaker embeddings that simultaneously improve verification accuracy and retain the ability to encode hierarchical relations among speakers.

What carries the argument

The argument rests on the Hyperbolic Softmax loss, which projects Euclidean vectors onto the hyperboloid and uses hyperbolic distance in place of the Euclidean inner-product term inside the softmax computation.
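To make the mechanism concrete, here is a minimal sketch of a distance-based softmax on the hyperboloid. It is not the authors' implementation: it assumes the exponential map at the origin as the projection, a curvature parameter `c` (the space has curvature -c), and logits equal to the negative scaled hyperbolic distance. All function names and the `scale` parameter are hypothetical choices for illustration.

```python
import math


def lorentz_inner(x, y):
    """Minkowski inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def exp_map_origin(u, c=1.0):
    """Lift a Euclidean vector u onto the hyperboloid of curvature -c
    using the exponential map at the origin (an assumed projection)."""
    k = math.sqrt(c)
    r = math.sqrt(sum(a * a for a in u))
    if r < 1e-12:
        return [1.0 / k] + [0.0] * len(u)
    spatial_scale = math.sinh(k * r) / (k * r)
    return [math.cosh(k * r) / k] + [spatial_scale * a for a in u]


def hyperbolic_distance(x, y, c=1.0):
    """Geodesic distance between two points on the hyperboloid."""
    arg = max(1.0, -c * lorentz_inner(x, y))  # clamp for numerical safety
    return math.acosh(arg) / math.sqrt(c)


def h_softmax_loss(embedding, centers, label, c=1.0, scale=1.0):
    """Cross-entropy over logits = -scale * distance to each class center."""
    x = exp_map_origin(embedding, c)
    logits = [-scale * hyperbolic_distance(x, exp_map_origin(w, c), c)
              for w in centers]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]
```

With this lift, a vector of Euclidean norm r lands at hyperbolic distance r from the origin, so the network's output norm directly controls how far from the origin an embedding sits.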

If this is right

  • Existing speaker embedding networks can adopt the new losses without changing the feature-extraction layers.
  • Margin-based separation remains compatible, so the gains of additive margins combine with the geometric benefits of hyperbolic space.
  • The same loss construction applies to any embedding task whose labels contain natural hierarchy, such as language or accent groups within speakers.
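The compatibility with margins noted above can be illustrated with a short sketch that works on precomputed class distances. It assumes the margin is added to the target-class distance before the softmax, which makes the target harder to satisfy; the paper's exact margin formulation may differ, and the function name, `margin`, and `scale` are hypothetical.

```python
import math


def margin_softmax_from_distances(distances, label, margin=0.2, scale=30.0):
    """Cross-entropy over distance-based logits with an additive margin.

    distances[k] is the (hyperbolic) distance from the embedding to the
    center of class k. The margin is applied only to the target class
    (an assumed formulation, not necessarily the paper's).
    """
    logits = [-scale * (d + margin if k == label else d)
              for k, d in enumerate(distances)]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]
```

Setting `margin=0` recovers the plain distance-based softmax, so the two losses can share one code path and differ by a single hyperparameter.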

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the improvement persists even on datasets whose speaker labels lack explicit hierarchy, the benefit may stem from the geometry of the loss surface rather than from explicit hierarchy encoding.
  • The curvature radius of the hyperboloid becomes a new tunable parameter; measuring performance across a range of curvatures would show whether the reported gains are robust or sensitive to that choice.
  • On corpora that supply explicit hierarchical speaker metadata, the learned hyperbolic embeddings could be checked directly for tree-recovery accuracy against Euclidean baselines.
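The sensitivity to curvature raised above can be probed with a small experiment. The sketch below assumes embeddings are lifted by the exponential map at the origin and uses the hyperbolic law of cosines to get the distance between two lifted points; the function name and the test vectors are hypothetical. As the curvature -c approaches zero the distance approaches the Euclidean one, and it grows as c increases.

```python
import math


def lifted_distance(u, v, c):
    """Hyperbolic distance between the lifts of u and v (curvature -c, c > 0).

    Applies the hyperbolic law of cosines to the triangle formed by the
    origin and the two lifted points.
    """
    k = math.sqrt(c)
    ru = math.sqrt(sum(a * a for a in u))
    rv = math.sqrt(sum(b * b for b in v))
    dot = sum(a * b for a, b in zip(u, v))
    cos_theta = dot / (ru * rv) if ru > 0 and rv > 0 else 0.0
    arg = (math.cosh(k * ru) * math.cosh(k * rv)
           - math.sinh(k * ru) * math.sinh(k * rv) * cos_theta)
    return math.acosh(max(1.0, arg)) / k


u, v = [3.0, 0.0], [0.0, 4.0]  # Euclidean distance between them is 5
for c in (1e-6, 0.1, 1.0):
    print(c, lifted_distance(u, v, c))
```

A sweep like this, repeated with trained models rather than fixed vectors, is one way to check whether reported gains depend on a narrow curvature range.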

Load-bearing premise

Mapping embeddings into hyperbolic space adds useful hierarchical modeling without discarding distance relations that the network originally learned in Euclidean space.

What would settle it

Training identical networks with the hyperbolic and Euclidean losses on the same speaker verification data and observing equal or higher EER for the hyperbolic version on held-out trials would disprove the claimed benefit.
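For such a comparison, EER is the operating point at which the false-accept rate equals the false-reject rate. A minimal sketch of computing it from trial scores follows; it sweeps the observed scores as thresholds and returns the midpoint at the closest crossing, which is a simplification of the interpolation that evaluation toolkits typically perform. The function name is hypothetical.

```python
def compute_eer(target_scores, nontarget_scores):
    """Approximate equal error rate from same-speaker and different-speaker scores.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False rejects: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False accepts: non-target trials scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer
```

Running this on the same trial list for both losses, with identical backbones and training schedules, gives the head-to-head numbers that the test described above requires.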

Original abstract

Speaker embedding learning based on Euclidean space has achieved significant progress, but it is still insufficient in modeling hierarchical information within speaker features. Hyperbolic space, with its negative curvature geometric properties, can efficiently represent hierarchical information within a finite volume, making it more suitable for the feature distribution of speaker embeddings. In this paper, we propose Hyperbolic Softmax (H-Softmax) and Hyperbolic Additive Margin Softmax (HAM-Softmax) based on hyperbolic space. H-Softmax incorporates hierarchical information into speaker embeddings by projecting embeddings and speaker centers into hyperbolic space and computing hyperbolic distances. HAM-Softmax further enhances inter-class separability by introducing margin constraint on this basis. Experimental results show that H-Softmax and HAM-Softmax achieve average relative EER reductions of 27.84% and 14.23% compared with standard Softmax and AM-Softmax, respectively, demonstrating that the proposed methods effectively improve speaker verification performance and at the same time preserve the capability of hierarchical structure modeling. The code will be released at https://github.com/PunkMale/HAM-Softmax.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Hyperbolic Softmax (H-Softmax) and Hyperbolic Additive Margin Softmax (HAM-Softmax) for speaker embedding learning. Embeddings and speaker centers are projected into hyperbolic space, with hyperbolic distances replacing Euclidean ones to incorporate hierarchical structure; HAM-Softmax adds a margin constraint for improved separability. The central claim is that these yield average relative EER reductions of 27.84% versus standard Softmax and 14.23% versus AM-Softmax while preserving hierarchical modeling capability, with code to be released.

Significance. If the performance gains are shown to arise specifically from hyperbolic geometry rather than ancillary effects, the work would offer a concrete demonstration of hyperbolic embeddings for hierarchical speaker data, potentially improving verification in domains with tree-like speaker relationships and encouraging further geometry-aware losses in audio processing.

major comments (3)
  1. [Experimental Results] Experimental Results section: the reported average relative EER reductions of 27.84% and 14.23% are stated without dataset sizes, speaker counts, trial numbers, baseline implementation details, or statistical significance tests, preventing verification of the central performance claim.
  2. [Method] Method section: no ablation holds all factors fixed while toggling only the manifold (hyperbolic vs. Euclidean distance), so the attribution of gains to hierarchical modeling versus generic regularization cannot be assessed.
  3. [Abstract and §4] Abstract and §4: the assertion that the methods 'preserve the capability of hierarchical structure modeling' is unsupported by any diagnostic such as Gromov hyperbolicity scores, tree-distortion metrics on speaker metadata, or Poincaré-ball visualizations of embeddings.
minor comments (1)
  1. [Abstract] The abstract states that code will be released at a GitHub link but supplies no license information or reproducibility checklist.
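One of the diagnostics named in the major comments, Gromov hyperbolicity, can be estimated from pairwise distances alone. The sketch below uses the four-point definition: for every quadruple, form the three pairwise distance sums and take half the gap between the two largest; the maximum over quadruples is delta. A value of zero means the metric is exactly tree-like. The function name is hypothetical, and the exhaustive loop is only practical for small samples of embeddings.

```python
import math
from itertools import combinations


def four_point_delta(points, dist):
    """Gromov delta of a finite metric space via the four-point condition."""
    delta = 0.0
    for x, y, z, w in combinations(points, 4):
        sums = sorted([dist(x, y) + dist(z, w),
                       dist(x, z) + dist(y, w),
                       dist(x, w) + dist(y, z)])
        # Half the gap between the largest and second-largest sum.
        delta = max(delta, (sums[2] - sums[1]) / 2.0)
    return delta


# Points on a line form a tree metric, so delta is 0.
line = [0.0, 1.0, 3.0, 7.0]
print(four_point_delta(line, lambda a, b: abs(a - b)))

# Corners of a unit square under Euclidean distance are not tree-like.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(four_point_delta(square, math.dist))
```

Because delta scales with the distances themselves, it is usually compared after normalizing by the diameter of the point set.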

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment point by point below and outline the revisions we will implement to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Experimental Results] Experimental Results section: the reported average relative EER reductions of 27.84% and 14.23% are stated without dataset sizes, speaker counts, trial numbers, baseline implementation details, or statistical significance tests, preventing verification of the central performance claim.

    Authors: We agree that more details are required to verify the performance claims. In the revised manuscript, we will provide the dataset sizes, speaker counts, trial numbers, full baseline implementation details, and include statistical significance tests for the EER reductions. revision: yes

  2. Referee: [Method] Method section: no ablation holds all factors fixed while toggling only the manifold (hyperbolic vs. Euclidean distance), so the attribution of gains to hierarchical modeling versus generic regularization cannot be assessed.

    Authors: We recognize the need for a controlled ablation. We will add an ablation experiment in the revised version that keeps all factors fixed except for the choice of manifold (hyperbolic versus Euclidean distance) to isolate the contribution of hyperbolic geometry. revision: yes

  3. Referee: [Abstract and §4] Abstract and §4: the assertion that the methods 'preserve the capability of hierarchical structure modeling' is unsupported by any diagnostic such as Gromov hyperbolicity scores, tree-distortion metrics on speaker metadata, or Poincaré-ball visualizations of embeddings.

    Authors: We agree that additional diagnostics would better support this assertion. In the revision, we will add Poincaré-ball visualizations of the embeddings and report Gromov hyperbolicity scores to demonstrate the preservation of hierarchical structure. revision: yes

Circularity Check

0 steps flagged

No circularity; hyperbolic projection is an explicit modeling choice validated on external EER

Full rationale

The paper defines H-Softmax and HAM-Softmax by projecting embeddings and centers into hyperbolic space and replacing Euclidean distances with hyperbolic ones; this is an ansatz for hierarchy rather than a derived claim that reduces to fitted inputs. Reported gains (27.84% and 14.23% relative EER reduction) are measured against independent Euclidean baselines on standard speaker verification test sets, providing external falsifiability. No equations, self-citations, or uniqueness theorems are invoked that would make the result equivalent to its own construction. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the geometric property that hyperbolic space can represent hierarchies efficiently and on the assumption that the projection step preserves useful information from the original embeddings.

free parameters (1)
  • hyperbolic curvature parameter
    A curvature value must be chosen or tuned to define the hyperbolic manifold; its specific value is not stated in the abstract but is required for the distance computations.
axioms (1)
  • domain assumption: Hyperbolic space with negative curvature can efficiently represent hierarchical information within a finite volume.
    Invoked directly in the abstract as the reason hyperbolic space is more suitable than Euclidean space for speaker features.

pith-pipeline@v0.9.0 · 5488 in / 1238 out tokens · 45424 ms · 2026-05-16T10:44:07.691285+00:00 · methodology


