
arxiv: 2509.12760 · v4 · submitted 2025-09-16 · 💻 cs.LG · cs.CL

Similarity-Distance-Magnitude Activations

Pith reviewed 2026-05-18 15:59 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords similarity-distance-magnitude · activation function · selective classification · covariate shift · out-of-distribution · language models · softmax · empirical CDF

The pith

The SDM activation function improves robustness to covariate shifts in selective classification over pre-trained language models compared to softmax.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops the Similarity-Distance-Magnitude (SDM) activation to replace softmax in the final layer of language models. On top of the usual output magnitude, it adds a similarity signal, based on depth-matches into correctly predicted training examples, and a signal for distance from the training distribution. This supports an estimator that partitions the class-wise empirical CDFs to select high-accuracy predictions. The result matters to a reader if it leads to more trustworthy selective decisions when the data distribution changes.

Core claim

The SDM activation augments standard softmax with similarity awareness from correctly predicted depth-matches into training and distance-to-training-distribution awareness in addition to magnitude awareness. This enables the SDM estimator, via data-driven partitioning of the class-wise empirical CDFs, to control the class- and prediction-conditional accuracy among selective classifications. When applied to pre-trained language models, this yields greater robustness to covariate shifts and out-of-distribution inputs than softmax-based calibration while remaining informative in-distribution.

What carries the argument

The Similarity-Distance-Magnitude activation function that adds similarity awareness via depth-matches and distance awareness to the magnitude of softmax outputs for improved robustness and exemplar-based interpretability.
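
To make the idea concrete, here is a minimal sketch of a softmax-like activation with the three signals. The functional form is an assumption chosen for illustration, since the summary above does not state it: a hypothetical similarity count `q` sets the base of the exponential, and a hypothetical distance factor `d` in [0, 1] rescales the logits. The function name `sdm_like_activation` is also made up for this example.

```python
import numpy as np


def softmax(z):
    """Standard softmax, used here only as the comparison baseline."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def sdm_like_activation(z, q, d):
    """Hypothetical SDM-style activation (illustrative form, an assumption).

    z: logits (the magnitude signal)
    q: non-negative similarity count, e.g. matched training examples
    d: distance factor in [0, 1]; values near 0 mean far from training data
    """
    z = np.asarray(z, dtype=float)
    base = 2.0 + q          # more similarity -> larger base -> sharper output
    scaled = d * z          # far from training data -> logits shrink toward 0
    # base ** x == exp(x * log(base)); subtract the max for numerical stability
    w = np.exp((scaled - scaled.max()) * np.log(base))
    return w / w.sum()


logits = [2.0, 1.0, 0.0]
print(sdm_like_activation(logits, q=0, d=1.0))   # base 2, moderately peaked
print(sdm_like_activation(logits, q=8, d=1.0))   # base 10, sharper
print(sdm_like_activation(logits, q=8, d=0.0))   # d = 0, uniform output
```

Under this assumed form, setting `q = e - 2` and `d = 1` recovers the ordinary softmax, which is one way to read the claim that SDM extends rather than replaces magnitude awareness.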

If this is right

  • The SDM estimator controls conditional accuracy in selective classifications.
  • It provides greater robustness to covariate shifts than softmax methods.
  • It stays informative over in-distribution data.
  • It supports interpretability-by-exemplar through dense matching.
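
The first bullet can be written out as two inequalities. This is one common way to formalize class- and prediction-conditional accuracy for selective classification; the notation (the selection indicator S and target level α) is assumed here and is not taken from the paper.

```latex
% Assumed notation: Y is the true label, \hat{Y} the predicted label,
% S = 1 marks an admitted (non-abstained) prediction, \alpha is the target accuracy.
\begin{align}
  \Pr\bigl(Y = c \mid \hat{Y} = c,\; S = 1\bigr) &\ge \alpha
    && \text{(prediction-conditional, for each class } c\text{)} \\
  \Pr\bigl(\hat{Y} = c \mid Y = c,\; S = 1\bigr) &\ge \alpha
    && \text{(class-conditional, for each class } c\text{)}
\end{align}
```

Both conditions are stated over admitted predictions only, so an estimator can satisfy them by abstaining more often; that is why informativeness (how many predictions are admitted) has to be reported alongside accuracy.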

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method might extend to vision models or other domains facing distribution shifts.
  • Future work could test combinations with existing calibration techniques for even better performance.

Load-bearing premise

That the addition of similarity awareness from depth-matches and distance awareness will produce a net gain in robustness to shifts without degrading in-distribution performance or causing instabilities.

What would settle it

Running the SDM estimator and softmax-based calibration methods on a dataset with a controlled covariate shift, then measuring whether selective accuracy is higher for SDM and whether its in-distribution informativeness drops.
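
The two quantities in such a comparison can be computed as in the following sketch. It assumes each method yields a scalar confidence per example and admits a prediction when that confidence meets a threshold; the function name `selective_metrics` and the toy arrays are hypothetical.

```python
import numpy as np


def selective_metrics(confidence, predictions, labels, threshold):
    """Return (coverage, selective accuracy) for a confidence threshold.

    coverage: fraction of examples admitted (confidence >= threshold)
    selective accuracy: accuracy among admitted examples (NaN if none)
    """
    confidence = np.asarray(confidence, dtype=float)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    selected = confidence >= threshold
    coverage = float(selected.mean())
    if not selected.any():
        return coverage, float("nan")
    accuracy = float((predictions[selected] == labels[selected]).mean())
    return coverage, accuracy


# Toy data standing in for one evaluation split (values are made up).
conf = [0.9, 0.8, 0.6, 0.4]
preds = [1, 0, 1, 0]
gold = [1, 0, 0, 0]
print(selective_metrics(conf, preds, gold, threshold=0.7))  # (0.5, 1.0)
```

Running the same function on an in-distribution split and on a shifted split, with the threshold fixed in advance, gives the comparison described above: coverage tracks informativeness, and selective accuracy tracks whether the admitted predictions remain trustworthy.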

Original abstract

We introduce the Similarity-Distance-Magnitude (SDM) activation function, a more robust and interpretable formulation of the standard softmax activation function, adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness, and enabling interpretability-by-exemplar via dense matching. We further introduce the SDM estimator, based on a data-driven partitioning of the class-wise empirical CDFs via the SDM activation, to control the class- and prediction-conditional accuracy among selective classifications. When used as the final-layer activation over pre-trained language models for selective classification, the SDM estimator is more robust to covariate shifts and out-of-distribution inputs than existing calibration methods using softmax activations, while remaining informative over in-distribution data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Similarity-Distance-Magnitude (SDM) activation function as a replacement for softmax in the final layer of pre-trained language models. SDM augments magnitude (decision-boundary) awareness with similarity awareness (via correctly predicted depth-matches to training examples) and distance-to-training-distribution awareness, while also enabling interpretability-by-exemplar. It further defines the SDM estimator, which constructs class-wise partitions by thresholding the empirical CDFs of SDM activation values computed on training data, with the goal of controlling class- and prediction-conditional accuracy in selective classification. The central claim is that this yields greater robustness to covariate shifts and out-of-distribution inputs than existing softmax-based calibration methods, while remaining informative on in-distribution data.
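
The thresholding step described in this summary can be sketched as follows. This is a simplified stand-in, not the paper's partitioning procedure: it assumes one scalar score per example, integer class labels, and distinct scores within a class, and it picks for each predicted class the lowest threshold at which the admitted calibration points reach a target accuracy `alpha`. All names are hypothetical.

```python
import numpy as np


def empirical_cdf(sample, x):
    """Fraction of values in `sample` that are <= x."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, x, side="right") / sample.size


def fit_class_thresholds(scores, predictions, labels, alpha):
    """Per predicted class, find the lowest score threshold whose admitted
    points (score >= threshold) have accuracy >= alpha on the given data.
    Returns {class: threshold}, with None when no threshold qualifies.
    Assumes integer labels and distinct scores within each class."""
    scores = np.asarray(scores, dtype=float)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    thresholds = {}
    for c in np.unique(predictions):
        mask = predictions == c
        s = scores[mask]
        correct = labels[mask] == c
        order = np.argsort(-s)                      # highest score first
        # accuracy among the top-k scored points, for k = 1..n
        acc = np.cumsum(correct[order]) / np.arange(1, s.size + 1)
        ok = np.nonzero(acc >= alpha)[0]
        thresholds[int(c)] = float(s[order][ok[-1]]) if ok.size else None
    return thresholds


scores = [0.9, 0.8, 0.7, 0.6, 0.95, 0.5]
preds = [0, 0, 0, 0, 1, 1]
gold = [0, 0, 1, 0, 1, 0]
t = fit_class_thresholds(scores, preds, gold, alpha=0.75)
print(t)                                            # {0: 0.6, 1: 0.95}
print(empirical_cdf([0.9, 0.8, 0.7, 0.6], t[0]))    # 0.25
```

The last line expresses the class-0 threshold as a level of that class's empirical CDF. Thresholds fitted this way are fixed after calibration, which is exactly why the referee's second major comment asks what happens to them when the score distribution shifts.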

Significance. If the robustness and accuracy-control claims are substantiated with appropriate evidence, the work could provide a practically useful activation for selective classification in LLMs, particularly in settings with distribution shift, and the exemplar-based interpretability is a potential added benefit. The data-driven CDF partitioning approach is attractive in principle if it can be shown to generalize.

major comments (2)
  1. [Abstract] The assertion that 'the SDM estimator is more robust to covariate shifts and out-of-distribution inputs than existing calibration methods using softmax activations' is presented without any quantitative results, ablation studies, or derivation details; this is load-bearing for the central claim because the superiority in robustness is the primary asserted advantage over prior work.
  2. [SDM estimator description] Class-wise partitions are obtained by thresholding the empirical CDF of SDM activation values on ID training data, yet the robustness claim requires that these fixed partitions continue to deliver the target conditional accuracy on shifted or OOD inputs. Because the similarity (depth-match) and distance terms are functions of the training distribution, a covariate shift can alter the joint distribution of the three terms and therefore the rank order of SDM scores; no bound, invariance argument, or Lipschitz analysis is supplied showing that the added awareness terms compensate for such mismatches.
minor comments (2)
  1. The explicit functional form of the SDM activation (how the similarity, distance, and magnitude components are combined) should be stated with numbered equations for reproducibility.
  2. Clarify the precise definition and computation of 'correctly predicted depth-matches' used in the similarity term, including any hyperparameters involved.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive comments. We address each major point below, indicating revisions where the manuscript is updated to strengthen the presentation of results and limitations.

Point-by-point responses
  1. Referee: [Abstract] The assertion that 'the SDM estimator is more robust to covariate shifts and out-of-distribution inputs than existing calibration methods using softmax activations' is presented without any quantitative results, ablation studies, or derivation details; this is load-bearing for the central claim because the superiority in robustness is the primary asserted advantage over prior work.

    Authors: We agree that the abstract should more explicitly ground the robustness claim in the paper's empirical findings. The full manuscript reports quantitative comparisons and ablation studies in the experimental sections demonstrating improved performance under covariate shifts and OOD inputs relative to softmax baselines. We have revised the abstract to include a concise reference to these key results (e.g., retained conditional accuracy under shifts) while respecting length limits.

    revision: yes

  2. Referee: [SDM estimator description] Class-wise partitions are obtained by thresholding the empirical CDF of SDM activation values on ID training data, yet the robustness claim requires that these fixed partitions continue to deliver the target conditional accuracy on shifted or OOD inputs. Because the similarity (depth-match) and distance terms are functions of the training distribution, a covariate shift can alter the joint distribution of the three terms and therefore the rank order of SDM scores; no bound, invariance argument, or Lipschitz analysis is supplied showing that the added awareness terms compensate for such mismatches.

    Authors: The referee correctly notes the lack of a formal bound or invariance argument. The SDM estimator uses fixed empirical CDF thresholds derived from training data, and while the similarity and distance terms are intended to reflect an input's relation to the training distribution, the manuscript does not derive a Lipschitz condition or rank-order preservation guarantee under covariate shift. Our support for robustness is empirical, based on evaluations across multiple shift scenarios. In revision we have added an explicit discussion of this theoretical gap and listed it as future work.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity in SDM activation or estimator

Full rationale

The paper defines the SDM activation explicitly as a combination of similarity (depth-matches), distance-to-training-distribution, and magnitude terms, then constructs the estimator via direct partitioning of the class-wise empirical CDFs computed on training data. Robustness claims to covariate shift and OOD inputs are presented as empirical outcomes when the activation is substituted into pre-trained language models, not as first-principles derivations or predictions that reduce to the training partitions by construction. No self-citation chains, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided description; the approach remains a self-contained empirical proposal whose performance on shifted data is not forced by the training-time CDF construction itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the novel definition of the SDM activation and the data-driven CDF partitioning procedure; no explicit free parameters are named in the abstract.

axioms (1)
  • [standard math] Standard softmax properties serve as the baseline magnitude awareness component.
    The paper explicitly builds the new function on top of existing softmax behavior.
invented entities (2)
  • Similarity-Distance-Magnitude (SDM) activation function [no independent evidence]
    purpose: To combine similarity, distance, and magnitude awareness for improved robustness and interpretability-by-exemplar.
    Newly formulated in this work as an extension of softmax.
  • SDM estimator [no independent evidence]
    purpose: To control class- and prediction-conditional accuracy in selective classification via partitioning of empirical CDFs.
    Derived directly from the SDM activation as described.

pith-pipeline@v0.9.0 · 5654 in / 1444 out tokens · 62930 ms · 2026-05-18T15:59:46.349139+00:00 · methodology


