Hyperbolic Additive Margin Softmax with Hierarchical Information for Speaker Verification
Pith reviewed 2026-05-16 10:44 UTC · model grok-4.3
The pith
Projecting speaker embeddings into hyperbolic space lets softmax losses capture hierarchical structure and lower verification error rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Replacing the Euclidean inner-product term inside the softmax with hyperbolic distance, after first projecting the Euclidean embeddings and class centers into hyperbolic space, produces speaker embeddings that improve verification accuracy while retaining the ability to encode hierarchical relations among speakers.
What carries the argument
A Hyperbolic Softmax loss that projects Euclidean embeddings and class centers into hyperbolic space and substitutes hyperbolic distance for the Euclidean inner-product term inside the softmax computation.
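As a rough illustration of this mechanism, the sketch below maps Euclidean vectors into hyperbolic space, scores each class by negative hyperbolic distance, and applies softmax cross-entropy. It uses the Poincaré ball model because that is the model in the distance formula quoted from the paper in the Lean theorems section of this page. The exponential map at the origin as the projection, the `scale` factor, the use of negative distance as the logit, and all function names are assumptions made for illustration, not the authors' confirmed implementation.

```python
import math


def expmap0(v, c=1.0):
    """Map a Euclidean vector into the Poincare ball of curvature -c (assumed projection)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    factor = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [factor * x for x in v]


def poincare_dist(x, y, c=1.0):
    """Distance formula quoted from the paper, for points inside the ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    # Clamp the denominator so points rounded onto the boundary do not divide by zero.
    denom = max((1.0 - c * sq_x) * (1.0 - c * sq_y), 1e-12)
    return math.acosh(1.0 + 2.0 * c * sq_diff / denom)


def h_softmax_loss(embedding, centers, label, c=1.0, scale=1.0):
    """Cross-entropy over logits -scale * d(embedding, center_j) (assumed logit form)."""
    z = expmap0(embedding, c)
    logits = [-scale * poincare_dist(z, expmap0(w, c), c) for w in centers]
    top = max(logits)  # subtract the max for a stable log-sum-exp
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[label]


# Hypothetical 2-D example: the embedding sits near center 0 and far from center 1.
emb = [1.0, 0.0]
centers = [[1.0, 0.1], [-1.0, 0.0]]
loss_correct = h_softmax_loss(emb, centers, label=0)
loss_wrong = h_softmax_loss(emb, centers, label=1)
```

With these toy inputs the loss is lower when the label points at the nearby center, which is the behaviour a distance-based softmax is expected to show.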
If this is right
- Existing speaker embedding networks can adopt the new losses without changing the feature-extraction layers.
- Margin-based separation remains compatible, so the gains of additive margins combine with the geometric benefits of hyperbolic space.
- The same loss construction applies to any embedding task whose labels contain natural hierarchy, such as language or accent groups within speakers.
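The compatibility of additive margins with a distance-based softmax can be sketched independently of the manifold. The version below assumes the margin is added to the target-class distance before the softmax, which is one plausible reading of "additive margin" in a distance setting; the paper's exact HAM-Softmax formulation is not reproduced on this page, and `margin`, `scale`, and the function name are hypothetical.

```python
import math


def margin_softmax_loss(distances, label, margin=0.2, scale=1.0):
    """Cross-entropy over logits -scale * d_j, with `margin` added to the target distance.

    `distances` may come from any distance function (hyperbolic or Euclidean).
    """
    logits = []
    for j, d in enumerate(distances):
        penalised = d + margin if j == label else d
        logits.append(-scale * penalised)
    top = max(logits)  # stable log-sum-exp
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[label]


# Hypothetical distances from one embedding to two class centers.
dists = [0.5, 2.0]
loss_plain = margin_softmax_loss(dists, label=0, margin=0.0)
loss_margin = margin_softmax_loss(dists, label=0, margin=0.3)
```

Because the margin makes the target class look farther away than it is, the loss stays higher until the embedding is closer to its own center by at least the margin, which is the usual mechanism for tightening classes.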
Where Pith is reading between the lines
- If the improvement persists even on datasets whose speaker labels lack explicit hierarchy, the benefit may stem from the geometry of the loss surface rather than from explicit hierarchy encoding.
- The curvature of the hyperbolic space becomes a new tunable parameter; measuring performance across a range of curvatures would show whether the reported gains are robust or sensitive to that choice.
- On corpora that supply explicit hierarchical speaker metadata, the learned hyperbolic embeddings could be checked directly for tree-recovery accuracy against Euclidean baselines.
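One way to see why the curvature matters is to look at the distance formula quoted from the paper in the Lean theorems section. For small c, arcosh(1 + ε) ≈ √(2ε), so the quoted distance approaches 2√c‖x−y‖, that is, Euclidean distance up to a constant scale. The sketch below checks this numerically; the sample points and the normalisation by 2√c are illustrative choices, not values from the paper.

```python
import math


def poincare_dist(x, y, c):
    """Quoted distance; requires c * |x|^2 < 1 and c * |y|^2 < 1."""
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    if c * sq_x >= 1.0 or c * sq_y >= 1.0:
        raise ValueError("points must lie inside the ball of radius 1/sqrt(c)")
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.acosh(1.0 + 2.0 * c * sq_diff / ((1.0 - c * sq_x) * (1.0 - c * sq_y)))


def scaled_ratio(x, y, c):
    """Ratio of the rescaled hyperbolic distance to the Euclidean distance."""
    euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return poincare_dist(x, y, c) / (2.0 * math.sqrt(c)) / euclid


# Hypothetical points well inside the unit ball.
x, y = [0.3, 0.1], [-0.2, 0.4]
ratios = {c: scaled_ratio(x, y, c) for c in (1e-6, 0.1, 1.0)}
```

A ratio near 1 at tiny c and clearly above 1 at c = 1 suggests that a curvature sweep doubles as a soft version of the hyperbolic-versus-Euclidean comparison.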
Load-bearing premise
Mapping embeddings into hyperbolic space adds useful hierarchical modeling without discarding distance relations that the network originally learned in Euclidean space.
What would settle it
Training identical networks with the hyperbolic and Euclidean losses on the same speaker verification data and observing equal or higher EER for the hyperbolic version on held-out trials would disprove the claimed benefit.
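The comparison described here comes down to computing EER for each system on the same trial list. The sketch below is a simple threshold-sweep approximation, assuming higher scores mean "same speaker" and labels are 1 for target trials and 0 for non-target trials; evaluation toolkits often interpolate between thresholds, so their values can differ slightly. The function name and the toy scores are hypothetical.

```python
def equal_error_rate(scores, labels):
    """Approximate EER: the operating point where false accepts and false rejects balance."""
    targets = [s for s, l in zip(scores, labels) if l == 1]
    nontargets = [s for s, l in zip(scores, labels) if l == 0]
    if not targets or not nontargets:
        raise ValueError("need at least one target and one non-target trial")
    best_gap, best_eer = None, None
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < threshold for s in targets) / len(targets)
        far = sum(s >= threshold for s in nontargets) / len(nontargets)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


# Toy trial lists: the first system separates the classes, the second does not.
labels = [1, 1, 0, 0]
eer_separable = equal_error_rate([0.9, 0.8, 0.2, 0.1], labels)
eer_overlapping = equal_error_rate([0.9, 0.4, 0.6, 0.1], labels)
```

Running the same function on both systems' scores for identical held-out trials gives the two numbers the test above asks to compare.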
Original abstract
Speaker embedding learning based on Euclidean space has achieved significant progress, but it is still insufficient in modeling hierarchical information within speaker features. Hyperbolic space, with its negative curvature geometric properties, can efficiently represent hierarchical information within a finite volume, making it more suitable for the feature distribution of speaker embeddings. In this paper, we propose Hyperbolic Softmax (H-Softmax) and Hyperbolic Additive Margin Softmax (HAM-Softmax) based on hyperbolic space. H-Softmax incorporates hierarchical information into speaker embeddings by projecting embeddings and speaker centers into hyperbolic space and computing hyperbolic distances. HAM-Softmax further enhances inter-class separability by introducing margin constraint on this basis. Experimental results show that H-Softmax and HAM-Softmax achieve average relative EER reductions of 27.84% and 14.23% compared with standard Softmax and AM-Softmax, respectively, demonstrating that the proposed methods effectively improve speaker verification performance and at the same time preserve the capability of hierarchical structure modeling. The code will be released at https://github.com/PunkMale/HAM-Softmax.
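For readers checking the headline percentages, a relative EER reduction is the drop in EER divided by the baseline EER. The sketch below assumes the "average" is the plain mean of per-test-set relative reductions, which the abstract does not state explicitly; the EER values used are hypothetical placeholders, not figures from the paper.

```python
def relative_reduction(baseline_eer, new_eer):
    """Relative reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_eer <= 0.0:
        raise ValueError("baseline EER must be positive")
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


def average_relative_reduction(pairs):
    """Mean of per-test-set relative reductions (assumed averaging scheme)."""
    return sum(relative_reduction(b, n) for b, n in pairs) / len(pairs)


# Hypothetical (baseline EER %, new EER %) pairs for two test sets.
example_pairs = [(2.0, 1.5), (4.0, 3.6)]
example_average = average_relative_reduction(example_pairs)
```

Averaging relative reductions is not the same as taking the relative reduction of averaged EERs, so the averaging scheme is worth stating whenever such a figure is quoted.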
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Hyperbolic Softmax (H-Softmax) and Hyperbolic Additive Margin Softmax (HAM-Softmax) for speaker embedding learning. Embeddings and speaker centers are projected into hyperbolic space, with hyperbolic distances replacing Euclidean ones to incorporate hierarchical structure; HAM-Softmax adds a margin constraint for improved separability. The central claim is that these yield average relative EER reductions of 27.84% versus standard Softmax and 14.23% versus AM-Softmax while preserving hierarchical modeling capability, with code to be released.
Significance. If the performance gains are shown to arise specifically from hyperbolic geometry rather than ancillary effects, the work would offer a concrete demonstration of hyperbolic embeddings for hierarchical speaker data, potentially improving verification in domains with tree-like speaker relationships and encouraging further geometry-aware losses in audio processing.
major comments (3)
- [Experimental Results] Experimental Results section: the reported average relative EER reductions of 27.84% and 14.23% are stated without dataset sizes, speaker counts, trial numbers, baseline implementation details, or statistical significance tests, preventing verification of the central performance claim.
- [Method] Method section: no ablation holds all factors fixed while toggling only the manifold (hyperbolic vs. Euclidean distance), so the attribution of gains to hierarchical modeling versus generic regularization cannot be assessed.
- [Abstract and §4] Abstract and §4: the assertion that the methods 'preserve the capability of hierarchical structure modeling' is unsupported by any diagnostic such as Gromov hyperbolicity scores, tree-distortion metrics on speaker metadata, or Poincaré-ball visualizations of embeddings.
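Of the diagnostics named in the last comment, Gromov hyperbolicity is the most direct to compute from a matrix of pairwise distances. The sketch below uses the four-point condition: for every quadruple, form the three pairwise distance sums and take half the gap between the two largest. It is a brute-force O(n^4) illustration for small samples; the function name and the toy metrics are hypothetical.

```python
from itertools import combinations


def gromov_delta(dist):
    """Four-point Gromov delta of a finite metric given as a symmetric matrix."""
    delta = 0.0
    for x, y, z, w in combinations(range(len(dist)), 4):
        sums = sorted([
            dist[x][y] + dist[z][w],
            dist[x][z] + dist[y][w],
            dist[x][w] + dist[y][z],
        ])
        # Tree metrics make the two largest sums equal, giving a gap of zero.
        delta = max(delta, (sums[2] - sums[1]) / 2.0)
    return delta


# Four points on a line (a tree metric) versus a 4-cycle with unit edges.
path_metric = [[abs(i - j) for j in range(4)] for i in range(4)]
cycle_metric = [[min(abs(i - j), 4 - abs(i - j)) for j in range(4)] for i in range(4)]
delta_path = gromov_delta(path_metric)
delta_cycle = gromov_delta(cycle_metric)
```

A value near zero relative to the diameter of the sample indicates tree-like structure; on real embeddings the computation would normally be run on a random subsample because of the quartic cost.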
minor comments (1)
- [Abstract] The abstract states that code will be released at a GitHub link but supplies no license information or reproducibility checklist.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment point by point below and outline the revisions we will implement to strengthen the manuscript.
Point-by-point responses
- Referee: [Experimental Results] Experimental Results section: the reported average relative EER reductions of 27.84% and 14.23% are stated without dataset sizes, speaker counts, trial numbers, baseline implementation details, or statistical significance tests, preventing verification of the central performance claim.
  Authors: We agree that more details are required to verify the performance claims. In the revised manuscript, we will provide the dataset sizes, speaker counts, trial numbers, and full baseline implementation details, and include statistical significance tests for the EER reductions.
  Revision: yes
- Referee: [Method] Method section: no ablation holds all factors fixed while toggling only the manifold (hyperbolic vs. Euclidean distance), so the attribution of gains to hierarchical modeling versus generic regularization cannot be assessed.
  Authors: We recognize the need for a controlled ablation. We will add an ablation experiment in the revised version that keeps all factors fixed except for the choice of manifold (hyperbolic versus Euclidean distance) to isolate the contribution of hyperbolic geometry.
  Revision: yes
- Referee: [Abstract and §4] Abstract and §4: the assertion that the methods 'preserve the capability of hierarchical structure modeling' is unsupported by any diagnostic such as Gromov hyperbolicity scores, tree-distortion metrics on speaker metadata, or Poincaré-ball visualizations of embeddings.
  Authors: We agree that additional diagnostics would better support this assertion. In the revision, we will add Poincaré-ball visualizations of the embeddings and report Gromov hyperbolicity scores to demonstrate the preservation of hierarchical structure.
  Revision: yes
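The significance tests promised in the first response could take several forms. One option, sketched below under stated assumptions, is a paired bootstrap over trials: resample the trial list with replacement, recompute EER for both systems on each resample, and read a confidence interval off the differences. The function names, the 95% level, the number of resamples, and the simple threshold-sweep EER are illustrative choices, not the authors' stated procedure.

```python
import random


def _eer(scores, labels):
    """Threshold-sweep EER approximation (higher score = more likely target)."""
    tar = [s for s, l in zip(scores, labels) if l == 1]
    non = [s for s, l in zip(scores, labels) if l == 0]
    points = []
    for t in sorted(set(scores)):
        frr = sum(s < t for s in tar) / len(tar)
        far = sum(s >= t for s in non) / len(non)
        points.append((abs(far - frr), (far + frr) / 2.0))
    return min(points)[1]


def bootstrap_eer_difference(scores_a, scores_b, labels, n_boot=200, seed=0):
    """95% interval for EER(A) - EER(B) using the same resampled trials for both systems."""
    if len(set(labels)) < 2:
        raise ValueError("need both target and non-target trials")
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        lab = [labels[i] for i in idx]
        if len(set(lab)) < 2:
            continue  # skip resamples that lost one of the two classes
        a = _eer([scores_a[i] for i in idx], lab)
        b = _eer([scores_b[i] for i in idx], lab)
        diffs.append(a - b)
    diffs.sort()
    return diffs[int(0.025 * (n_boot - 1))], diffs[int(0.975 * (n_boot - 1))]


# Toy data: system A separates the classes perfectly, system B inverts them.
toy_labels = [1] * 10 + [0] * 10
toy_a = [1.0] * 10 + [0.0] * 10
toy_b = [0.0] * 10 + [1.0] * 10
ci_low, ci_high = bootstrap_eer_difference(toy_a, toy_b, toy_labels)
```

An interval that excludes zero supports a real difference between the systems; pairing the resamples matters because both systems are scored on the same trials.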
Circularity Check
No circularity: the hyperbolic projection is an explicit modeling choice, validated by EER measured against independent baselines.
full rationale
The paper defines H-Softmax and HAM-Softmax by projecting embeddings and centers into hyperbolic space and replacing Euclidean distances with hyperbolic ones; this is an ansatz for hierarchy rather than a derived claim that reduces to fitted inputs. Reported gains (27.84% and 14.23% relative EER reduction) are measured against independent Euclidean baselines on standard speaker verification test sets, providing external falsifiability. No equations, self-citations, or uniqueness theorems are invoked that would make the result equivalent to its own construction. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- hyperbolic curvature parameter
axioms (1)
- domain assumption: Hyperbolic space with negative curvature can efficiently represent hierarchical information within a finite volume.
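The intuition behind this axiom is that space grows exponentially with radius in hyperbolic geometry, much as the number of nodes grows with depth in a tree. A minimal sketch of that fact, using the standard circumference formulas for a circle of radius r (2πr in the flat plane, 2π·sinh(kr)/k in a plane of curvature −k²), is given below; the function name is hypothetical.

```python
import math


def circle_circumference(radius, curvature=-1.0):
    """Circumference of a circle in a plane of constant non-positive curvature."""
    if curvature > 0.0:
        raise ValueError("this sketch covers flat and negatively curved planes only")
    if curvature == 0.0:
        return 2.0 * math.pi * radius
    k = math.sqrt(-curvature)
    return 2.0 * math.pi * math.sinh(k * radius) / k


# Doubling the radius doubles a flat circle but multiplies a hyperbolic one many times over.
flat_growth = circle_circumference(10.0, 0.0) / circle_circumference(5.0, 0.0)
hyperbolic_growth = circle_circumference(10.0, -1.0) / circle_circumference(5.0, -1.0)
```

The exponential growth is what lets a branching hierarchy be laid out with roughly uniform spacing between siblings at every depth, which a flat space of the same dimension cannot do.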
Lean theorems connected to this paper
- IndisputableMonolith/Cost: J-cost functional equation / cosh identities (tag: echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Paper passage: hyperbolic distance between any two points x, y ∈ D^d_c is: dist_{D_c}(x,y) = arcosh(1 + 2c‖x−y‖² / ((1−c‖x‖²)(1−c‖y‖²)))
- IndisputableMonolith/Foundation: reality_from_one_distinction / hyperbolic geometry emergence (tag: echoes)
  Paper passage: H-Softmax incorporates hierarchical information into speaker embeddings by projecting embeddings and speaker centers into hyperbolic space and computing hyperbolic distances
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] INTRODUCTION. Speaker verification (SV) aims to determine whether a given speech segment belongs to a target speaker [1]. With the development of deep learning, speaker embedding learning methods based on neural networks have gradually replaced traditional approaches and have significantly improved SV performance by learning discriminative speaker embeddin...
- [2] RELATED WORKS. 2.1. Discriminative Loss for Speaker Recognition. Early speaker verification adopted the standard Softmax with cross-entropy loss, which could effectively classify speakers within the training set but performed poorly in open-set scenarios. To enlarge inter-class distance and reduce intra-class distanc...
  (arXiv:2601.19709v1 [cs.SD], 27 Jan 2026)
- [3] METHODOLOGY. Motivation. In speaker verification tasks, it is generally expected that the model learns discriminative speaker embeddings. Current popular loss functions are usually constructed in Euclidean space based on margin-based softmax. However, the differences between speaker identities in the real world resemble a tree-like structure rather than a f...
- [4] EXPERIMENTS. 4.1. Datasets and Experiments Setup. Datasets. We evaluated the proposed method on VoxCeleb1 [25], VoxCeleb2 [26], and CNCeleb [27]. VoxCeleb1 consists of 1,211 speakers with 148,642 utterances, while VoxCeleb2 contains 5,994 speakers and 1,092,009 utterances. For evaluation, we used the clean versions of Vox1-O, Vox1-E, and Vox1-H. CNCel...
- [5] We report performance using equal error rate (EER) and minimum detection cost function (minDCF) with P_target = 0.05. For VoxCeleb1, the model is trained for 150 epochs; for VoxCeleb2 and CNCeleb, the model is trained for 100 epochs. Baselines. For fair comparison, we compare the proposed H-Softmax without margin penalty with the standard Softmax and S...
- [6] CONCLUSIONS. This paper proposes H-Softmax and HAM-Softmax based on hyperbolic space, aimed at enhancing hierarchical modeling and inter-class separability of speaker embeddings. Experimental results show that H-Softmax outperforms margin-based methods on cross-domain complex data, while HAM-Softmax achieves the best or second-best performance on all da...
- [7] ACKNOWLEDGEMENTS. This work was supported in part by the National Natural Science Foundation of China under Grant 62366051, in part by the State Grid Xinjiang Electric Power Company and Xinjiang Siji Information Technology Co., Ltd. under Grant SGITXX00ZHXX2200262, and in part by the “Small Group” Aid Xinjiang Project under Grant 51052501207
- [8] Shuai Wang, Zhengyang Chen, Kong Aik Lee, Yanmin Qian, and Haizhou Li, “Overview of speaker modeling and its applications: From the lens of deep speaker representation learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 4971–4998, 2024
- [9] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in ICASSP, 2018, pp. 5329–5333
- [10] Liang He, Zhihua Fang, Zuoer Chen, Minqiang Xu, Ying Meng, and Penghao Wang, “Multi-view speaker embedding learning for enhanced stability and discriminability,” in ICASSP, 2024, pp. 10081–10085
- [11] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018
- [12] Jiankang Deng, Jia Guo, Jing Yang, Niannan Xue, Irene Kotsia, and Stefanos Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 5962–5979, 2022
- [13] Han-Ping Shen, Jui-Feng Yeh, and Chung-Hsien Wu, “Speaker clustering using decision tree-based phone cluster models with multi-space probability distributions,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 5, pp. 1289–1300, 2011
- [14] Carole Millot, Clara Ponchard, Cédric Gendrot, Jean-François Bonastre, and Orane Dufour, “Using gender, phonation and age to interpret automatically discovered speech attributes for explainable speaker recognition,” in INTERSPEECH, 2025, pp. 3638–3642
- [15] Jieun Lee, Kim Sung-Bin, Seokhyeong Kang, and Tae-Hyun Oh, “Lightweight speaker recognition in Poincaré spaces,” IEEE Signal Processing Letters, vol. 29, pp. 224–228, 2022
- [16] Menglin Yang, Min Zhou, Rex Ying, Yankai Chen, and Irwin King, “Hyperbolic representation learning: Revisiting and advancing,” in ICML, 2023, pp. 39639–39659
- [17] Maximilian Nickel and Douwe Kiela, “Poincaré embeddings for learning hierarchical representations,” in NeurIPS, 2017, pp. 6341–6350
- [18] Valentin Khrulkov, Leyla Mirvakhabova, Evgeniya Ustinova, Ivan Oseledets, and Victor Lempitsky, “Hyperbolic image embeddings,” in CVPR, 2020, pp. 6417–6427
- [19] Darius Petermann and Minje Kim, “Hyperbolic distance-based speech separation,” in ICASSP, 2024, pp. 1191–1195
- [20] Yi Liu, Liang He, and Jia Liu, “Large margin softmax loss for speaker verification,” in INTERSPEECH, 2019, pp. 2873–2877
- [21] Ya-Qi Yu, Lei Fan, and Wu-Jun Li, “Ensemble additive margin softmax for speaker verification,” in ICASSP, 2019, pp. 6046–6050
- [22] Dao Zhou, Longbiao Wang, Kong Aik Lee, Yibo Wu, Meng Liu, Jianwu Dang, and Jianguo Wei, “Dynamic margin softmax loss for speaker verification,” in INTERSPEECH, 2020, pp. 3800–3804
- [23] Runqiu Xiao, Xiaoxiao Miao, Wenchao Wang, Pengyuan Zhang, Bin Cai, and Liuping Luo, “Adaptive margin circle loss for speaker verification,” in INTERSPEECH, 2021, pp. 4618–4622
- [24] Lantian Li, Ruiqian Nai, and Dong Wang, “Real additive margin softmax for speaker verification,” in ICASSP, 2022, pp. 7527–7531
- [25] Zhe Li, Man-Wai Mak, and Helen Mei-Ling Meng, “Discriminative speaker representation via contrastive learning with class-aware attention in angular space,” in ICASSP, 2023, pp. 1–5
- [26] Bing Han, Zhengyang Chen, and Yanmin Qian, “Exploring binary classification loss for speaker verification,” in ICASSP, 2023, pp. 1–5
- [27] Yandong Wen, Weiyang Liu, Adrian Weller, Bhiksha Raj, and Rita Singh, “Sphereface2: Binary classification is all you need for deep face recognition,” in ICLR, 2022
- [28] Leying Zhang, Zhengyang Chen, and Yanmin Qian, “Adaptive large margin fine-tuning for robust speaker verification,” in ICASSP, 2023, pp. 1–5
- [29] Pantid Chantangphol, Theerat Sakdejayont, Monchai Lertsutthiwong, and Tawunrat Chalothorn, “Deep noise-aware quality loss for speaker verification,” in CIKM, 2024, pp. 3669–3673
- [30] Octavian Ganea, Gary Becigneul, and Thomas Hofmann, “Hyperbolic neural networks,” in NeurIPS, 2018, pp. 5350–5360
- [31] Aditya Sinha, Siqi Zeng, Makoto Yamada, and Han Zhao, “Learning structured representations with hyperbolic embeddings,” in NeurIPS, 2024, pp. 91220–91259
- [32] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “VoxCeleb: A Large-Scale Speaker Identification Dataset,” in INTERSPEECH, 2017, pp. 2616–2620
- [33] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “VoxCeleb2: Deep Speaker Recognition,” in INTERSPEECH, 2018, pp. 1086–1090
- [34] Lantian Li, Ruiqi Liu, Jiawen Kang, Yue Fan, Hao Cui, Yunqi Cai, Ravichander Vipperla, Thomas Fang Zheng, and Dong Wang, “Cn-celeb: Multi-genre speaker recognition,” Speech Communication, vol. 137, pp. 77–91, 2022
- [35] David Snyder, Guoguo Chen, and Daniel Povey, “Musan: A music, speech, and noise corpus,” 2015
- [36] Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L. Seltzer, and Sanjeev Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in ICASSP, 2017, pp. 5220–5224
- [37] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in INTERSPEECH, 2020, pp. 3830–3834
- [38] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015