Similarity-Distance-Magnitude Activations
Pith reviewed 2026-05-18 15:59 UTC · model grok-4.3
The pith
The SDM activation function improves robustness to covariate shifts in selective classification over pre-trained language models compared to softmax.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The SDM activation augments standard softmax with similarity awareness from correctly predicted depth-matches into training and distance-to-training-distribution awareness in addition to magnitude awareness. This enables the SDM estimator, via data-driven partitioning of the class-wise empirical CDFs, to control the class- and prediction-conditional accuracy among selective classifications. When applied to pre-trained language models, this yields greater robustness to covariate shifts and out-of-distribution inputs than softmax-based calibration while remaining informative in-distribution.
What carries the argument
The Similarity-Distance-Magnitude activation function that adds similarity awareness via depth-matches and distance awareness to the magnitude of softmax outputs for improved robustness and exemplar-based interpretability.
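As a rough illustration, the functional form quoted further down this page as Eq. 6 can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `sdm_activation` is hypothetical, and `q` and `d` are passed in as plain numbers rather than computed from depth-matches and training-set distances as the paper describes.

```python
import numpy as np

def sdm_activation(z, q, d):
    """Sketch of SDM(z')_i = (2+q)^(d*z'_i) / sum_c (2+q)^(d*z'_c).

    q (similarity) and d (distance) are supplied by the caller here;
    how the paper derives them is not reproduced in this sketch.
    """
    z = np.asarray(z, dtype=float)
    # (2+q)^(d*z) == exp(d * z * log(2+q)); work in log space for stability.
    logits = d * z * np.log(2.0 + q)
    logits = logits - logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

z = [2.0, 0.5, -1.0]                               # made-up logits
p_soft = softmax(z)
p_same = sdm_activation(z, q=np.e - 2.0, d=1.0)    # base e, d = 1: plain softmax
p_far = sdm_activation(z, q=np.e - 2.0, d=0.0)     # d = 0: uniform output
p_sim = sdm_activation(z, q=5.0, d=1.0)            # larger base: sharper output
```

With the base set to e and d = 1 the sketch coincides with softmax, d = 0 flattens the output to uniform, and a larger q sharpens it, which matches the intuition that similarity and distance modulate how much the output magnitude is trusted.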
If this is right
- The SDM estimator controls conditional accuracy in selective classifications.
- It provides greater robustness to covariate shifts than softmax methods.
- It stays informative over in-distribution data.
- It supports interpretability-by-exemplar through dense matching.
Where Pith is reading between the lines
- This method might extend to vision models or other domains facing distribution shifts.
- Future work could test combinations with existing calibration techniques for even better performance.
Load-bearing premise
That the addition of similarity awareness from depth-matches and distance awareness will produce a net gain in robustness to shifts without degrading in-distribution performance or causing instabilities.
What would settle it
Run the SDM estimator and softmax-based calibration methods on a dataset with a controlled covariate shift, then check two things: whether accuracy among the selected (non-rejected) predictions is higher for SDM under the shift, and whether SDM's in-distribution informativeness drops.
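The two quantities such a test would track, coverage and accuracy among admitted predictions, can be computed with a small helper. This is a hedged sketch of a generic selective-classification report, not the paper's evaluation protocol; the function name `selective_report` and the toy arrays are placeholders invented for illustration.

```python
import numpy as np

def selective_report(confidence, predictions, labels, threshold):
    """Return (coverage, selective accuracy) for a confidence threshold.

    coverage: fraction of inputs whose confidence reaches the threshold.
    selective accuracy: accuracy among those admitted inputs only.
    """
    confidence = np.asarray(confidence, dtype=float)
    admitted = confidence >= threshold
    correct = np.asarray(predictions) == np.asarray(labels)
    coverage = float(admitted.mean())
    accuracy = float(correct[admitted].mean()) if admitted.any() else float("nan")
    return coverage, accuracy

# Toy placeholder values, not results from the paper.
conf = [0.99, 0.97, 0.60, 0.96, 0.40]
preds = [1, 0, 1, 1, 0]
labels = [1, 0, 0, 0, 0]
coverage, accuracy = selective_report(conf, preds, labels, threshold=0.95)
```

Running the same helper on in-distribution and shifted splits, once with softmax-derived confidences and once with SDM-derived ones, would give the side-by-side comparison described above.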
Original abstract
We introduce the Similarity-Distance-Magnitude (SDM) activation function, a more robust and interpretable formulation of the standard softmax activation function, adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness, and enabling interpretability-by-exemplar via dense matching. We further introduce the SDM estimator, based on a data-driven partitioning of the class-wise empirical CDFs via the SDM activation, to control the class- and prediction-conditional accuracy among selective classifications. When used as the final-layer activation over pre-trained language models for selective classification, the SDM estimator is more robust to covariate shifts and out-of-distribution inputs than existing calibration methods using softmax activations, while remaining informative over in-distribution data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Similarity-Distance-Magnitude (SDM) activation function as a replacement for softmax in the final layer of pre-trained language models. SDM augments magnitude (decision-boundary) awareness with similarity awareness (via correctly predicted depth-matches to training examples) and distance-to-training-distribution awareness, while also enabling interpretability-by-exemplar. It further defines the SDM estimator, which constructs class-wise partitions by thresholding the empirical CDFs of SDM activation values computed on training data, with the goal of controlling class- and prediction-conditional accuracy in selective classification. The central claim is that this yields greater robustness to covariate shifts and out-of-distribution inputs than existing softmax-based calibration methods, while remaining informative on in-distribution data.
Significance. If the robustness and accuracy-control claims are substantiated with appropriate evidence, the work could provide a practically useful activation for selective classification in LLMs, particularly in settings with distribution shift, and the exemplar-based interpretability is a potential added benefit. The data-driven CDF partitioning approach is attractive in principle if it can be shown to generalize.
major comments (2)
- [Abstract] The assertion that 'the SDM estimator is more robust to covariate shifts and out-of-distribution inputs than existing calibration methods using softmax activations' is presented without any quantitative results, ablation studies, or derivation details; this is load-bearing for the central claim because the superiority in robustness is the primary asserted advantage over prior work.
- [SDM estimator description] Class-wise partitions are obtained by thresholding the empirical CDF of SDM activation values on ID training data, yet the robustness claim requires that these fixed partitions continue to deliver the target conditional accuracy on shifted or OOD inputs. Because the similarity (depth-match) and distance terms are functions of the training distribution, a covariate shift can alter the joint distribution of the three terms and therefore the rank order of SDM scores; no bound, invariance argument, or Lipschitz analysis is supplied showing that the added awareness terms compensate for such mismatches.
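To make the concern about fixed partitions concrete, the following sketch fits one threshold per predicted class from an empirical CDF and then applies it unchanged to new inputs. It is a simplified stand-in under stated assumptions, not the paper's Alg. 1: the names `ecdf`, `fit_class_thresholds`, `admit`, the `level` parameter, and the toy scores are all hypothetical.

```python
import numpy as np

def ecdf(sample, x):
    """Empirical CDF of `sample` at x: fraction of sample values <= x."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, x, side="right") / sample.size

def fit_class_thresholds(scores, predicted, level):
    """Per predicted class, the smallest observed score whose eCDF is >= level."""
    scores = np.asarray(scores, dtype=float)
    predicted = np.asarray(predicted)
    thresholds = {}
    for c in np.unique(predicted):
        s = np.sort(scores[predicted == c])
        k = max(int(np.ceil(level * s.size)) - 1, 0)
        thresholds[int(c)] = float(s[k])
    return thresholds

def admit(score, predicted_class, thresholds):
    """Admit a prediction only if its score clears its class threshold."""
    return score >= thresholds[predicted_class]

# Toy placeholder scores for two predicted classes.
scores = [0.2, 0.4, 0.6, 0.8, 0.5, 0.7, 0.9, 0.95]
predicted = [0, 0, 0, 0, 1, 1, 1, 1]
thresholds = fit_class_thresholds(scores, predicted, level=0.5)
```

Because `thresholds` is computed once and then frozen, any shift that changes how scores are distributed at test time changes what fraction of inputs, and which inputs, clear each threshold. That is the mismatch the comment asks the authors to bound.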
minor comments (2)
- The explicit functional form of the SDM activation (how the similarity, distance, and magnitude components are combined) should be stated with numbered equations for reproducibility.
- Clarify the precise definition and computation of 'correctly predicted depth-matches' used in the similarity term, including any hyperparameters involved.
Simulated Author's Rebuttal
We thank the referee for the thorough review and constructive comments. We address each major point below, indicating revisions where the manuscript is updated to strengthen the presentation of results and limitations.
- Referee: [Abstract] The assertion that 'the SDM estimator is more robust to covariate shifts and out-of-distribution inputs than existing calibration methods using softmax activations' is presented without any quantitative results, ablation studies, or derivation details; this is load-bearing for the central claim because the superiority in robustness is the primary asserted advantage over prior work.
  Authors: We agree that the abstract should more explicitly ground the robustness claim in the paper's empirical findings. The full manuscript reports quantitative comparisons and ablation studies in the experimental sections demonstrating improved performance under covariate shifts and OOD inputs relative to softmax baselines. We have revised the abstract to include a concise reference to these key results (e.g., retained conditional accuracy under shifts) while respecting length limits.
  Revision: yes
- Referee: [SDM estimator description] Class-wise partitions are obtained by thresholding the empirical CDF of SDM activation values on ID training data, yet the robustness claim requires that these fixed partitions continue to deliver the target conditional accuracy on shifted or OOD inputs. Because the similarity (depth-match) and distance terms are functions of the training distribution, a covariate shift can alter the joint distribution of the three terms and therefore the rank order of SDM scores; no bound, invariance argument, or Lipschitz analysis is supplied showing that the added awareness terms compensate for such mismatches.
  Authors: The referee correctly notes the lack of a formal bound or invariance argument. The SDM estimator uses fixed empirical CDF thresholds derived from training data, and while the similarity and distance terms are intended to reflect an input's relation to the training distribution, the manuscript does not derive a Lipschitz condition or rank-order preservation guarantee under covariate shift. Our support for robustness is empirical, based on evaluations across multiple shift scenarios. In revision we have added an explicit discussion of this theoretical gap and listed it as future work.
  Revision: partial
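A short worked rewriting of Eq. 6, as quoted elsewhere on this page, helps locate where rank order can and cannot change. This is our reading of the quoted formula rather than an argument from the paper, and it assumes $q \ge 0$ and $d \in [0, 1]$.

```latex
\[
\mathrm{SDM}(z')_i
  = \frac{(2+q)^{d\, z'_i}}{\sum_c (2+q)^{d\, z'_c}}
  = \frac{\exp(\tau z'_i)}{\sum_c \exp(\tau z'_c)},
\qquad \tau = d \,\ln(2+q).
\]
% Within one input: for d > 0 we have tau > 0, so the ordering of classes
% (and hence the argmax) is the same as under softmax.
% Across inputs: tau depends on each input's own q and d, so two inputs
% with identical z' can receive different top scores, and the ordering of
% inputs by top score need not match the ordering under softmax.
```

Read this way, the activation acts like a softmax with an input-dependent inverse temperature $\tau$. The within-input class ordering is safe by construction, while the cross-input ordering that the fixed eCDF thresholds rely on is exactly the part that lacks a guarantee under shift.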
Circularity Check
No significant circularity in SDM activation or estimator
The paper defines the SDM activation explicitly as a combination of similarity (depth-matches), distance-to-training-distribution, and magnitude terms; in Eq. 6 as quoted on this page, the similarity term sets the base and the distance term scales the exponent of a softmax-style normalization, so the combination is not additive. It then constructs the estimator via direct partitioning of the class-wise empirical CDFs computed on training data. Robustness claims to covariate shift and OOD inputs are presented as empirical outcomes when the activation is substituted into pre-trained language models, not as first-principles derivations or predictions that reduce to the training partitions by construction. No self-citation chains, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided description; the approach remains a self-contained empirical proposal whose performance on shifted data is not forced by the training-time CDF construction itself.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard softmax properties serve as the baseline magnitude awareness component.
invented entities (2)
- Similarity-Distance-Magnitude (SDM) activation function: no independent evidence
- SDM estimator: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $\mathrm{SDM}(z')_i = (2+q)^{d \cdot z'_i} / \sum_c (2+q)^{d \cdot z'_c}$ (Eq. 6); $q$ defined by consecutive correctly predicted depth-matches (Eq. 4); $d$ via $\min\bigl(1 - \mathrm{eCDF}_{y_c}(d_{\text{nearest}})\bigr)$ (Eq. 5); HIGH-RELIABILITY region via Alg. 1 $q'_{\min}$ search on class-wise eCDFs of SDM output
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat recovery and embed_strictMono [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: data-driven partitioning of class-wise empirical CDFs … to control class- and prediction-conditional accuracy
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.