Similarity-Distance-Magnitude Activations

Allen Schmaltz

arXiv: 2509.12760 · v4 · submitted 2025-09-16 · cs.LG · cs.CL

Pith reviewed 2026-05-18 15:59 UTC · model grok-4.3

keywords: similarity-distance-magnitude · activation function · selective classification · covariate shift · out-of-distribution · language models · softmax · empirical CDF

The pith

The SDM activation function improves robustness to covariate shifts in selective classification over pre-trained language models compared to softmax.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops the Similarity-Distance-Magnitude activation to replace softmax in the final layer of language models. It incorporates similarity through matches to correctly predicted training examples at certain depths and distance awareness from the training distribution. This supports an estimator that partitions empirical class-wise CDFs to select high-accuracy predictions. A reader would care if this leads to more trustworthy selective decisions when data distributions change.

Core claim

The SDM activation augments standard softmax with similarity awareness from correctly predicted depth-matches into training and distance-to-training-distribution awareness in addition to magnitude awareness. This enables the SDM estimator, via data-driven partitioning of the class-wise empirical CDFs, to control the class- and prediction-conditional accuracy among selective classifications. When applied to pre-trained language models, this yields greater robustness to covariate shifts and out-of-distribution inputs than softmax-based calibration while remaining informative in-distribution.

What carries the argument

The Similarity-Distance-Magnitude activation function that adds similarity awareness via depth-matches and distance awareness to the magnitude of softmax outputs for improved robustness and exemplar-based interpretability.
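As a rough illustration of how the three signals could interact, the sketch below follows the form quoted elsewhere on this page as Eq. 6: a softmax-style normalization with base (2 + q) and exponent d · z′. It assumes q and d have already been computed upstream; the function names and toy values are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def sdm_activation(z, q, d):
    """Softmax-style normalization with base (2 + q) and exponent d * z.

    z: 1-D array of final-layer outputs z' (one entry per class).
    q: non-negative similarity value (assumed precomputed).
    d: distance quantile in [0, 1] (assumed precomputed).
    """
    z = np.asarray(z, dtype=float)
    # Work in log space and subtract the max for numerical stability.
    log_terms = d * z * np.log(2.0 + q)
    log_terms -= log_terms.max()
    weights = np.exp(log_terms)
    return weights / weights.sum()

def softmax(z):
    """Standard softmax, used here only as a reference point."""
    z = np.asarray(z, dtype=float)
    shifted = np.exp(z - z.max())
    return shifted / shifted.sum()

logits = [2.0, 0.5, -1.0]                   # hypothetical toy logits
near = sdm_activation(logits, q=5, d=0.9)   # many matches, close to training data
far = sdm_activation(logits, q=0, d=0.0)    # no matches, far from training data
```

Under these assumptions, d = 0 flattens the output to a uniform distribution regardless of the logits, while q = e − 2 with d = 1 recovers the standard softmax, which is one way to see the activation as a generalization of softmax rather than a separate mechanism.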

If this is right

The SDM estimator controls conditional accuracy in selective classifications.
It provides greater robustness to covariate shifts than softmax methods.
It stays informative over in-distribution data.
It supports interpretability-by-exemplar through dense matching.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method might extend to vision models or other domains facing distribution shifts.
Future work could test combinations with existing calibration techniques for even better performance.

Load-bearing premise

That the addition of similarity awareness from depth-matches and distance awareness will produce a net gain in robustness to shifts without degrading in-distribution performance or causing instabilities.

What would settle it

Running the SDM estimator and softmax methods on a dataset with controlled covariate shift and measuring if selective accuracy is higher for SDM or if in-distribution informativeness drops.
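The comparison described above reduces to two numbers per method and per data split: coverage (the fraction of inputs the classifier accepts) and selective accuracy (accuracy among the accepted inputs). The following is a minimal sketch of that bookkeeping, assuming each method yields a confidence score per input; the helper name, threshold, and toy arrays are hypothetical and carry no experimental meaning.

```python
import numpy as np

def selective_report(scores, predictions, labels, threshold):
    """Return (coverage, selective accuracy) for a fixed acceptance threshold."""
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(predictions) == np.asarray(labels)
    selected = scores >= threshold            # inputs the classifier does not reject
    coverage = float(selected.mean())
    # Accuracy is undefined when nothing is selected.
    accuracy = float(correct[selected].mean()) if selected.any() else float("nan")
    return coverage, accuracy

# Hypothetical toy splits; real use would plug in held-out and shifted test sets.
in_dist = selective_report([0.99, 0.97, 0.60, 0.95], [1, 0, 1, 1], [1, 0, 0, 1], threshold=0.95)
shifted = selective_report([0.99, 0.96, 0.50, 0.40], [1, 0, 1, 0], [1, 1, 1, 0], threshold=0.95)
```

A method that is robust in the sense claimed would keep selective accuracy near the target on the shifted split, typically by lowering coverage, while still accepting a useful fraction of in-distribution inputs.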

Original abstract

We introduce the Similarity-Distance-Magnitude (SDM) activation function, a more robust and interpretable formulation of the standard softmax activation function, adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness, and enabling interpretability-by-exemplar via dense matching. We further introduce the SDM estimator, based on a data-driven partitioning of the class-wise empirical CDFs via the SDM activation, to control the class- and prediction-conditional accuracy among selective classifications. When used as the final-layer activation over pre-trained language models for selective classification, the SDM estimator is more robust to covariate shifts and out-of-distribution inputs than existing calibration methods using softmax activations, while remaining informative over in-distribution data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SDM adds similarity and distance terms to softmax for selective classification in LMs, but the robustness under shifts depends on unverified assumptions about fixed CDF partitions holding up.

The paper introduces the SDM activation, which layers similarity awareness (via correctly predicted depth-matches to training examples) and distance-to-training-distribution awareness on top of the usual magnitude from softmax. It pairs this with an SDM estimator that partitions class-wise empirical CDFs of the activation values to pick subsets with target accuracy for selective classification. The claim is that this setup stays informative on in-distribution data while being more robust to covariate shifts and OOD inputs than standard softmax calibration methods on pre-trained language models. Interpretability by exemplar matching is a side benefit they highlight.

That combination is the actual new piece; prior work on robust activations or selective prediction exists, but this specific three-way split and the CDF-based estimator look like a fresh formulation. The intent to improve reliability for real deployment is clear and addresses a practical gap.

The main soft spot is the one the stress-test flags. The partitions are built once on training data, yet the similarity and distance terms are themselves tied to that same distribution. A shift can reorder the SDM scores without any shown bound or Lipschitz condition guaranteeing the conditional accuracy inside each partition stays on target. The abstract states superiority without numbers, ablations, or derivations, so the central robustness result rests on the hope that the extra terms compensate exactly. If the full paper supplies controlled experiments across shifts plus a derivation that the rank order is preserved, that would tighten things; otherwise the claim stays under-supported.

This is for researchers working on calibration, selective prediction, and OOD robustness in NLP. A reader already thinking about activation functions or post-hoc estimators could extract useful ideas even if the experiments need strengthening. It deserves a serious referee to check the empirical side and any formal arguments that may be in the full text.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Similarity-Distance-Magnitude (SDM) activation function as a replacement for softmax in the final layer of pre-trained language models. SDM augments magnitude (decision-boundary) awareness with similarity awareness (via correctly predicted depth-matches to training examples) and distance-to-training-distribution awareness, while also enabling interpretability-by-exemplar. It further defines the SDM estimator, which constructs class-wise partitions by thresholding the empirical CDFs of SDM activation values computed on training data, with the goal of controlling class- and prediction-conditional accuracy in selective classification. The central claim is that this yields greater robustness to covariate shifts and out-of-distribution inputs than existing softmax-based calibration methods, while remaining informative on in-distribution data.

Significance. If the robustness and accuracy-control claims are substantiated with appropriate evidence, the work could provide a practically useful activation for selective classification in LLMs, particularly in settings with distribution shift, and the exemplar-based interpretability is a potential added benefit. The data-driven CDF partitioning approach is attractive in principle if it can be shown to generalize.

major comments (2)

[Abstract] The assertion that 'the SDM estimator is more robust to covariate shifts and out-of-distribution inputs than existing calibration methods using softmax activations' is presented without any quantitative results, ablation studies, or derivation details; this is load-bearing for the central claim because the superiority in robustness is the primary asserted advantage over prior work.
[SDM estimator description] Class-wise partitions are obtained by thresholding the empirical CDF of SDM activation values on ID training data, yet the robustness claim requires that these fixed partitions continue to deliver the target conditional accuracy on shifted or OOD inputs. Because the similarity (depth-match) and distance terms are functions of the training distribution, a covariate shift can alter the joint distribution of the three terms and therefore the rank order of SDM scores; no bound, invariance argument, or Lipschitz analysis is supplied showing that the added awareness terms compensate for such mismatches.
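To make the concern about fixed partitions concrete, the sketch below fits one cutoff per predicted class on held-out scores and reports where that cutoff sits in the class-wise empirical CDF. It is a deliberately simplified stand-in, not the paper's Alg. 1: the function names, the accuracy target, and the toy data are all hypothetical, and a careful procedure would also account for finite-sample uncertainty.

```python
import numpy as np

def ecdf(sample, x):
    """Empirical CDF of `sample` at x: the fraction of values <= x."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, x, side="right") / sample.size

def fit_class_thresholds(scores, predictions, labels, target_accuracy):
    """For each predicted class, pick the lowest score cutoff whose retained
    points reach the target accuracy (None if no cutoff does)."""
    scores = np.asarray(scores, dtype=float)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    thresholds = {}
    for c in np.unique(predictions):
        mask = predictions == c
        class_scores = scores[mask]
        class_correct = labels[mask] == c
        thresholds[int(c)] = None
        for cutoff in np.sort(class_scores):   # ascending: widest coverage first
            kept = class_scores >= cutoff
            if class_correct[kept].mean() >= target_accuracy:
                thresholds[int(c)] = float(cutoff)
                break
    return thresholds

# Hypothetical toy calibration data: activation value, predicted class, true class.
scores = [0.2, 0.6, 0.8, 0.9, 0.3, 0.7, 0.95]
predictions = [0, 0, 0, 0, 1, 1, 1]
labels = [1, 0, 0, 0, 0, 1, 1]
thresholds = fit_class_thresholds(scores, predictions, labels, target_accuracy=0.95)
quantile_class0 = ecdf([0.2, 0.6, 0.8, 0.9], thresholds[0])
```

Once fitted, these cutoffs stay fixed at test time, so any shift that changes how scores relate to correctness can move the realized accuracy inside the accepted region away from the target, which is the gap the comment above asks the authors to address.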

minor comments (2)

The explicit functional form of the SDM activation (how the similarity, distance, and magnitude components are combined) should be stated with numbered equations for reproducibility.
Clarify the precise definition and computation of 'correctly predicted depth-matches' used in the similarity term, including any hyperparameters involved.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive comments. We address each major point below, indicating revisions where the manuscript is updated to strengthen the presentation of results and limitations.

Referee: [Abstract] The assertion that 'the SDM estimator is more robust to covariate shifts and out-of-distribution inputs than existing calibration methods using softmax activations' is presented without any quantitative results, ablation studies, or derivation details; this is load-bearing for the central claim because the superiority in robustness is the primary asserted advantage over prior work.

Authors: We agree that the abstract should more explicitly ground the robustness claim in the paper's empirical findings. The full manuscript reports quantitative comparisons and ablation studies in the experimental sections demonstrating improved performance under covariate shifts and OOD inputs relative to softmax baselines. We have revised the abstract to include a concise reference to these key results (e.g., retained conditional accuracy under shifts) while respecting length limits.
revision: yes

Referee: [SDM estimator description] Class-wise partitions are obtained by thresholding the empirical CDF of SDM activation values on ID training data, yet the robustness claim requires that these fixed partitions continue to deliver the target conditional accuracy on shifted or OOD inputs. Because the similarity (depth-match) and distance terms are functions of the training distribution, a covariate shift can alter the joint distribution of the three terms and therefore the rank order of SDM scores; no bound, invariance argument, or Lipschitz analysis is supplied showing that the added awareness terms compensate for such mismatches.

Authors: The referee correctly notes the lack of a formal bound or invariance argument. The SDM estimator uses fixed empirical CDF thresholds derived from training data, and while the similarity and distance terms are intended to reflect an input's relation to the training distribution, the manuscript does not derive a Lipschitz condition or rank-order preservation guarantee under covariate shift. Our support for robustness is empirical, based on evaluations across multiple shift scenarios. In revision we have added an explicit discussion of this theoretical gap and listed it as future work.
revision: partial

Circularity Check

0 steps flagged

No significant circularity in SDM activation or estimator

The paper defines the SDM activation explicitly as a softmax-style normalization in which the similarity term q sets the base (2+q) and the distance term d scales the magnitude term z′ in the exponent (Eq. 6), then constructs the estimator via direct partitioning of the class-wise empirical CDFs computed on training data. Robustness claims to covariate shift and OOD inputs are presented as empirical outcomes when the activation is substituted into pre-trained language models, not as first-principles derivations or predictions that reduce to the training partitions by construction. No self-citation chains, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided description; the approach remains a self-contained empirical proposal whose performance on shifted data is not forced by the training-time CDF construction itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the novel definition of the SDM activation and the data-driven CDF partitioning procedure; no explicit free parameters are named in the abstract.

axioms (1)

standard math: Standard softmax properties serve as the baseline magnitude awareness component.
The paper explicitly builds the new function on top of existing softmax behavior.

invented entities (2)

Similarity-Distance-Magnitude (SDM) activation function (no independent evidence)
purpose: To combine similarity, distance, and magnitude awareness for improved robustness and interpretability-by-exemplar.
Newly formulated in this work as an extension of softmax.

SDM estimator (no independent evidence)
purpose: To control class- and prediction-conditional accuracy in selective classification via partitioning of empirical CDFs.
Derived directly from the SDM activation as described.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel

unclear: Relation between the paper passage and the cited Recognition theorem.

$\mathrm{SDM}(z')_i = \dfrac{(2+q)^{d \cdot z'_i}}{\sum_c (2+q)^{d \cdot z'_c}}$ (Eq. 6); $q$ defined by consecutive correctly-predicted depth matches (Eq. 4); $d$ via $\min\bigl(1 - \mathrm{eCDF}_{y_c}(d_{\mathrm{nearest}})\bigr)$ (Eq. 5); HIGH-RELIABILITY region via Alg. 1 $q'_{\min}$ search on class-wise eCDFs of SDM output

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery and embed_strictMono

unclear: Relation between the paper passage and the cited Recognition theorem.

data-driven partitioning of class-wise empirical CDFs … to control class- and prediction-conditional accuracy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[31]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[1] [1]

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matt...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Uncertainty Sets for Image Classifiers using Conformal Prediction

Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. Uncertainty Sets for Image Classifiers using Conformal Prediction . In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=eNdiU_DbM9

work page 2021

[3] [3]

The Internal State of an LLM Knows When It ' s Lying

Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it ' s lying. pp.\ 967--976, Singapore, December 2023. doi:10.18653/v1/2023.findings-emnlp.68. URL 2023.findings-emnlp.68

work page doi:10.18653/v1/2023.findings-emnlp.68 2023

[4] [4]

Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78 0 (1): 0 1 -- 3, 1950. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2. URL https://journals.ametsoc.org/view/journals/mwre/78/1/1520-0493_1950_078_0001_vofeit_2_0_co_2.xml

work page doi:10.1175/1520-0493(1950)078 1950

[5] [5]

C. K. Chow. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers, EC-6 0 (4): 0 247--254, 1957. doi:10.1109/TEC.1957.5222035

work page doi:10.1109/tec.1957.5222035 1957

[6] [6]

Cover and P

T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13 0 (1): 0 21--27, 1967. doi:10.1109/TIT.1967.1053964

work page doi:10.1109/tit.1967.1053964 1967

[7] [7]

A. P. Dawid. The well-calibrated bayesian. Journal of the American Statistical Association, 77 0 (379): 0 605--610, 1982. doi:10.1080/01621459.1982.10477856. URL https://www.tandfonline.com/doi/abs/10.1080/01621459.1982.10477856

work page doi:10.1080/01621459.1982.10477856 1982

[8] [8]

A Probabilistic Theory of Pattern Recognition

Luc Devroye, L \'a szl \'o Gy \"o rfi, and G \'a bor Lugosi. A Probabilistic Theory of Pattern Recognition . In Stochastic Modelling and Applied Probability, 1996

work page 1996

[9] [9]

The limits of distribution-free conditional predictive inference

Rina Foygel Barber, Emmanuel J Cand \`e s, Aaditya Ramdas, and Ryan J Tibshirani. The limits of distribution-free conditional predictive inference . Information and Inference: A Journal of the IMA, 10 0 (2): 0 455--482, 08 2020. ISSN 2049-8772. doi:10.1093/imaiai/iaaa017. URL https://doi.org/10.1093/imaiai/iaaa017

work page doi:10.1093/imaiai/iaaa017 2020

[10] [10]

Dropout as a bayesian approximation: Representing model uncertainty in deep learning

Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Maria Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp.\ 1050--1059, New York, New York, USA, 20--22 Jun 201...

work page 2016

[11] [11]

Selective classification for deep neural networks

Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/4a8423d5e91fd...

work page 2017

[12] [12]

Weinberger

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On Calibration of Modern Neural Networks . In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp.\ 1321--1330. JMLR.org, 2017

work page 2017

[13] [13]

Top-label calibration and multiclass-to-binary reductions

Chirag Gupta and Aaditya Ramdas. Top-label calibration and multiclass-to-binary reductions . In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=WqoBaaPHS-

work page 2022

[14] [14]

J. T. Gene Hwang and A. Adam Ding. Prediction intervals for artificial neural networks. Journal of the American Statistical Association, 92 0 (438): 0 748--757, 1997. ISSN 01621459. URL http://www.jstor.org/stable/2965723

work page arXiv 1997

[15] [15]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2017

[17] [17]

Beyond Temperature Scaling: Obtaining Well-Calibrated Multiclass Probabilities with Dirichlet Calibration

Meelis Kull, Miquel Perello-Nieto, Markus K\" a ngsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond Temperature Scaling: Obtaining Well-Calibrated Multiclass Probabilities with Dirichlet Calibration . Curran Associates Inc., Red Hook, NY, USA, 2019

work page 2019

[18] [18]

Simple and scalable predictive uncertainty estimation using deep ensembles

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceed...

[19]

Jing Lei and Larry Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):71–96, 2014. doi:10.1111/rssb.12021. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12021

[20]

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. UR...

[21]

Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Inform...

[22]

John C. Platt. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers, pp. 61–74. MIT Press, 1999

[23]

Yaniv Romano, Matteo Sesia, and Emmanuel J. Candès. Classification with valid and adaptive coverage. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546

[24]

Sara Rosenthal, Noura Farra, and Preslav Nakov. SemEval-2017 task 4: Sentiment analysis in Twitter. In Steven Bethard, Marine Carpuat, Marianna Apidianaki, Saif M. Mohammad, Daniel Cer, and David Jurgens (eds.), Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 502–518, Vancouver, Canada, August 2017. Ass...

[25]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=B1ckMDqlg

[26]

Juozas Vaicenavicius, David Widmann, Carl R. Andersson, Fredrik Lindsten, Jacob Roll, and Thomas Bo Schön. Evaluating model calibration in classification. In International Conference on Artificial Intelligence and Statistics, 2019. URL https://api.semanticscholar.org/CorpusID:67749814

[27]

L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, Nov. 1984. ISSN 0001-0782. doi:10.1145/1968.1972. URL https://doi.org/10.1145/1968.1972

[28]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964

[29]

Vladimir Vovk. Conditional validity of inductive conformal predictors. In Steven C. H. Hoi and Wray Buntine (eds.), Proceedings of the Asian Conference on Machine Learning, volume 25 of Proceedings of Machine Learning Research, pp. 475–490, Singapore Management University, Singapore, 04–06 Nov 2012. PMLR. URL https://proceedings.mlr.press/v25/vovk12.html

[30]

Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer-Verlag, Berlin, Heidelberg, 2005. ISBN 0387001522
