Divide et Calibra: Multiclass Local Calibration via Vector Quantization

Andrea Passerini; Andrea Pugnana; Cesare Barbera; Giovanni De Toni; Lorenzo Perini

arXiv: 2605.21060 · v1 · pith:52V7YDRLnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-21 05:35 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords multiclass calibration · vector quantization · local calibration · Dirichlet parameterization · latent space partitioning · machine learning reliability

The pith

Vector quantization creates region-specific calibration maps for multiclass models by sharing parameters across a partitioned latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that global calibration assumes uniform error across the entire latent space, while many local methods discard information through dimensionality reduction. Instead, it partitions the space by vector quantization and builds calibration maps from shared codeword-dependent factors. An indexed parameterization of Dirichlet concentrations then lets parameters be reused across regions. The resulting maps are heterogeneous yet, according to the paper, still calibrate well where data is sparse. A sympathetic reader would care because high-stakes applications need reliable uncertainty estimates everywhere, not just on average.

Core claim

By inducing a structured partition of the representation space with vector quantization and using an indexed parameterization of Dirichlet concentrations, the method constructs compositional, region-specific calibration maps that generalize to data-sparse regions while preserving global calibration and predictive performance.

What carries the argument

A vector-quantization partition of the representation space combined with an indexed parameterization of Dirichlet concentrations; the two pieces jointly enable structured parameter sharing across regions.
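To make the mechanism concrete, here is a minimal NumPy sketch of how a quantized index vector could select and compose calibration parameters. Several things in it are assumptions rather than the paper's method: the latent vector is split into `w` chunks that share one codebook (suggested by the Proposition 1 proof fragment later on this page), a region's parameters are taken to be the sum of its codewords' factors, and the log-linear map of Dirichlet calibration (Kull et al., 2019) stands in for the paper's parameterization of Dirichlet concentrations, which this summary does not spell out. The names `assign_codewords`, `calibrate`, `A_factors`, and `b_factors` are hypothetical.

```python
import numpy as np

def assign_codewords(z, codebook, w):
    """Split each latent vector into w chunks; return the nearest-codeword index per chunk."""
    n, d = z.shape
    chunks = z.reshape(n, w, d // w)                      # (n, w, d/w)
    # Squared distance between every chunk and every codeword: (n, w, K)
    d2 = ((chunks[:, :, None, :] - codebook[None, None, :, :]) ** 2).sum(-1)
    return d2.argmin(-1)                                  # (n, w) index vectors

def calibrate(probs, idx, A_factors, b_factors, eps=1e-12):
    """Apply a region-specific linear map to log-probabilities, then a softmax."""
    # ASSUMPTION: a region's parameters are the sum of its codewords' factors.
    A = A_factors[idx].sum(1)                             # (n, C, C)
    b = b_factors[idx].sum(1)                             # (n, C)
    logits = np.einsum("nij,nj->ni", A, np.log(probs + eps)) + b
    logits -= logits.max(1, keepdims=True)                # numerical stability
    out = np.exp(logits)
    return out / out.sum(1, keepdims=True)

rng = np.random.default_rng(0)
K, w, d, C, n = 4, 2, 6, 3, 5                             # illustrative sizes only
codebook = rng.normal(size=(K, d // w))
z = rng.normal(size=(n, d))
probs = rng.dirichlet(np.ones(C), size=n)

idx = assign_codewords(z, codebook, w)
# Each factor contributes I/w, so every composed map starts as the identity.
A_factors = np.tile(np.eye(C) / w, (K, 1, 1))
b_factors = np.zeros((K, C))
calibrated = calibrate(probs, idx, A_factors, b_factors)
```

With this identity initialisation the calibrated probabilities match the inputs; fitting `A_factors` and `b_factors` on a calibration set would then let regions that share a codeword share the corresponding factor.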

If this is right

Local calibration error decreases on standard benchmarks while global calibration and accuracy remain competitive.
Calibration maps become heterogeneous and adapt to different parts of the latent space.
Parameter sharing across quantized regions reduces the need for separate maps in data-poor areas.
The approach avoids the information loss that accompanies dimensionality-reduction steps in prior local methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same VQ-plus-indexed-parameter idea might be tried with other partitioning schemes such as clustering or decision trees.
If the codebook size is treated as a hyperparameter, one could test whether larger codebooks further improve local calibration at the cost of more parameters.
The method could be combined with post-hoc recalibration techniques that operate on top of the VQ indices.
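As an illustration of the first two extensions, the sketch below partitions latent vectors with plain k-means (Lloyd's algorithm) and reports how many points fall in each region for several codebook sizes. It is not the paper's VQ training procedure; the function name `fit_codebook`, the synthetic data, and the chosen values of `K` are assumptions made for the example.

```python
import numpy as np

def fit_codebook(z, K, n_iter=20, seed=0):
    """Lloyd's k-means: return a (K, d) codebook and the region index of every point."""
    rng = np.random.default_rng(seed)
    codebook = z[rng.choice(len(z), size=K, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(1)
        for k in range(K):
            members = z[idx == k]
            if len(members):                  # empty codewords are left unchanged
                codebook[k] = members.mean(0)
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook, d2.argmin(1)

rng = np.random.default_rng(1)
z = rng.normal(size=(200, 5))                 # stand-in for latent representations

occupancy = {}
for K in (4, 8, 16):                          # codebook size treated as a hyperparameter
    codebook, idx = fit_codebook(z, K)
    occupancy[K] = np.bincount(idx, minlength=K)
```

Larger codebooks give finer regions but fewer calibration points per region, which is the trade-off any per-region parameterization has to manage.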

Load-bearing premise

That partitioning the latent space with vector quantization and tying Dirichlet parameters to codewords produces useful sharing that improves calibration in sparse regions without adding new biases.

What would settle it

Measure local calibration error on a held-out test set deliberately constructed to contain large empty regions in the latent space; if error in those regions does not drop relative to a strong global baseline or if a new bias metric rises, the claim is falsified.
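A test of this kind could be scripted roughly as follows. The sketch computes a top-label expected calibration error separately for test points whose region was sparsely or densely populated in the calibration set. It is a stand-in: top-label ECE is not the paper's LCE metric, and `region_ids`, `calib_counts`, and `threshold` are hypothetical inputs that a real experiment would derive from the learned partition.

```python
import numpy as np

def top_label_ece(probs, labels, n_bins=10):
    """Expected calibration error of the top-label confidence with equal-width bins."""
    conf = probs.max(1)
    correct = (probs.argmax(1) == labels).astype(float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

def ece_by_occupancy(probs, labels, region_ids, calib_counts, threshold):
    """ECE on test points in sparse regions versus dense regions."""
    sparse = calib_counts[region_ids] < threshold
    result = {}
    for name, mask in (("sparse", sparse), ("dense", ~sparse)):
        result[name] = top_label_ece(probs[mask], labels[mask]) if mask.any() else float("nan")
    return result

# Toy data: region 0 had 100 calibration points, region 1 only 2.
probs = np.array([[1.0, 0.0], [1.0, 0.0], [0.8, 0.2], [0.8, 0.2]])
labels = np.array([0, 0, 1, 1])
region_ids = np.array([0, 0, 1, 1])
calib_counts = np.array([100, 2])
report = ece_by_occupancy(probs, labels, region_ids, calib_counts, threshold=10)
```

Comparing the `sparse` entry of such a report between the proposed method and a global baseline is one way to operationalise the falsification test described above.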

Figures

Figures reproduced from arXiv: 2605.21060 by Andrea Passerini, Andrea Pugnana, Cesare Barbera, Giovanni De Toni, Lorenzo Perini.

**Figure 2.** Local calibration metrics over five runs (…

**Figure 3.** Local calibration in density-based sub-bins over five runs (…

**Figure 4.** Global calibration metrics over five runs (…

**Figure 5.** Local calibration in density-based sub-bins over five runs on …

**Figure 6.** LCE (left) and NLL (right) for various calibration-set sizes on TissueMNIST.

… is not only more accurate in well-sampled settings but is also substantially more robust under data scarcity, highlighting the effectiveness of parameter sharing in leveraging limited calibration data. We report the effect of calibration set size on local calibration (LCE) and negative log-likelihood (NLL) for Weather in …

**Figure 7.** LCE (left) and NLL (right) for various calibration-set sizes on Weather

**Figure 8.** Local calibration in density-based sub-bins over five runs (…

**Figure 9.** Local calibration in density-based sub-bins over five runs (…

Original abstract

Accurate and well-calibrated Machine Learning (ML) models are mandatory in high-stakes settings, yet effective multiclass calibration remains challenging: global approaches assume calibration errors are homogeneous across the latent space, while local methods often rely on latent-space dimensionality reduction, which leads to information loss. To address these issues, we propose a compositional approach to multiclass calibration, where region-specific calibration maps are constructed from shared codeword-dependent factors. We instantiate this idea via Vector Quantization (VQ), which induces a structured partition of the representation space, and an indexed parameterization of Dirichlet concentrations that enables parameter sharing across regions. Our approach learns heterogeneous calibration maps that generalize well even to sparse regions of the latent space. Experiments on benchmark datasets show significant improvements in local calibration while maintaining competitive global calibration and predictive performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes Divide et Calibra, a compositional multiclass calibration method that partitions the representation space via Vector Quantization (VQ) and uses an indexed parameterization of Dirichlet concentrations to share parameters across regions. It claims this yields heterogeneous, locally adaptive calibration maps that generalize effectively to sparse latent-space regions, with experiments on benchmark datasets demonstrating improved local calibration alongside competitive global calibration and predictive performance.

Significance. If the central claims are substantiated, the work offers a scalable alternative to global calibration (which assumes homogeneity) and dimensionality-reduction-based local methods (which incur information loss). The compositional construction via VQ-induced partitions and indexed Dirichlet sharing could enable better handling of heterogeneous calibration errors in high-stakes settings. The manuscript provides experimental results on standard benchmarks but does not include reproducible code, machine-checked proofs, or parameter-free derivations.

major comments (2)

[§3.2] §3.2 (VQ partition and indexed Dirichlet parameterization): the central claim that codeword-dependent factors enable bias-free parameter sharing that improves calibration specifically in sparse regions is load-bearing but under-supported; the joint objective must be shown to align codeword assignment with calibration-error homogeneity rather than reconstruction loss alone, otherwise the indexed parameterization may simply average toward a global map.
[Table 2] Table 2 / local-ECE breakdown: without an ablation that isolates the transfer effect to low-density codewords (e.g., by varying codebook size or freezing the VQ encoder), it remains unclear whether reported gains in sparse regions stem from the proposed sharing mechanism or from other modeling choices such as the Dirichlet concentration learning.

minor comments (3)

The abstract states 'significant improvements' in local calibration; the main text should report exact local-ECE deltas, confidence intervals, and the number of runs to allow readers to assess practical magnitude.
[Notation] Notation for the indexed Dirichlet concentrations (Eq. (X)) should explicitly define how the codeword index selects the concentration vector to prevent ambiguity in the parameter-sharing construction.
[Related Work] Related-work discussion should more explicitly contrast the compositional VQ approach against prior local calibration techniques that also employ clustering or partitioning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, clarifying the methodological rationale and outlining targeted revisions to strengthen the empirical and explanatory support.

Point-by-point responses

Referee: [§3.2] §3.2 (VQ partition and indexed Dirichlet parameterization): the central claim that codeword-dependent factors enable bias-free parameter sharing that improves calibration specifically in sparse regions is load-bearing but under-supported; the joint objective must be shown to align codeword assignment with calibration-error homogeneity rather than reconstruction loss alone, otherwise the indexed parameterization may simply average toward a global map.

Authors: We agree that the alignment between VQ partitions and calibration-error homogeneity requires clearer exposition. The VQ encoder is trained to minimize reconstruction loss on the latent representations, thereby grouping inputs with similar representations into the same codeword. The indexed Dirichlet parameterization then assigns a distinct concentration vector to each codeword; these parameters are optimized jointly with the calibration loss. Because codewords cluster representationally similar inputs, the shared parameters within a codeword effectively transfer strength from dense to sparse regions without introducing bias from dissimilar inputs. In the revision we will expand §3.2 with a paragraph that (i) formalizes the joint objective, (ii) states the assumption that representational proximity implies calibration-error homogeneity, and (iii) notes that the sharing is therefore bias-free conditional on that clustering. We will also add a short remark contrasting this construction with a purely global Dirichlet model.

Revision: partial

Referee: [Table 2] Table 2 / local-ECE breakdown: without an ablation that isolates the transfer effect to low-density codewords (e.g., by varying codebook size or freezing the VQ encoder), it remains unclear whether reported gains in sparse regions stem from the proposed sharing mechanism or from other modeling choices such as the Dirichlet concentration learning.

Authors: We concur that an ablation isolating the transfer effect would strengthen the claims. In the revised manuscript we will add two controlled experiments: (1) training with codebook sizes K = 8, 16, 32, 64 and reporting local-ECE stratified by codeword occupancy, and (2) a frozen-VQ variant in which the encoder is pretrained once and then held fixed while only the indexed Dirichlet parameters are learned. These results will be presented in an extended Table 2 together with a new column showing the performance gap between low- and high-density codewords, thereby isolating the contribution of the sharing mechanism.

Revision: yes

Circularity Check

0 steps flagged

No circularity: derivation builds on standard VQ and Dirichlet concepts without reduction to inputs by construction

The paper's central proposal uses Vector Quantization to induce a partition of the representation space and an indexed parameterization of Dirichlet concentrations to enable parameter sharing. The abstract and available text present this as a compositional construction for heterogeneous calibration maps, with claimed generalization to sparse regions supported by experiments rather than any definitional equivalence or fitted-input renaming. No equations are shown that equate a prediction to its own fitting procedure, no uniqueness theorems are imported from self-citations, and no ansatz is smuggled via prior work. The approach relies on established VQ and Dirichlet machinery whose independence from the target calibration improvement is not contradicted by the provided material. This is the expected non-finding for a method paper whose load-bearing steps remain externally verifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that VQ creates a useful discrete partition allowing parameter sharing; no free parameters or invented entities are explicitly introduced beyond standard VQ and Dirichlet modeling.

axioms (1)

Domain assumption: Vector quantization induces a structured partition of the representation space suitable for region-specific calibration. Invoked to justify composing local maps from shared factors.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We instantiate this idea via Vector Quantization (VQ), which induces a structured partition of the representation space, and an indexed parameterization of Dirichlet concentrations that enables parameter sharing across regions."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Theorem 2 (Frequency-Weighted Convergence of Codeword Parameters)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages

[1] Cheng Wang. Calibration in deep learning: A survey of the state-of-the-art. CoRR, abs/2308.01222, 2023.
[2] Abhishek Singh Sambyal, Usma Niyaz, Narayanan C. Krishnan, and Deepti R. Bathula. Understanding calibration of deep neural networks for medical image classification. Comput. Methods Programs Biomed., 242:107816, 2023.
[3] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 2017.
[4] Miao Xiong, Ailin Deng, Pang Wei W. Koh, Jiaying Wu, Shen Li, Jianqing Xu, and Bryan Hooi. Proximity-informed calibration for deep neural networks. In NeurIPS, 2023.
[5] Rachel Luo, Aadyot Bhatnagar, Yu Bai, Shengjia Zhao, Huan Wang, Caiming Xiong, Silvio Savarese, Stefano Ermon, Edward Schmerling, and Marco Pavone. Local calibration: metrics and recalibration. In UAI, volume 180 of Proceedings of Machine Learning Research, pages 1286–1295. PMLR, 2022.
[6] Meelis Kull, Miquel Perelló-Nieto, Markus Kängsepp, Telmo de Menezes e Silva Filho, Hao Song, and Peter A. Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. In NeurIPS, pages 12295–12305, 2019.
[7] Eugene Berta, David Holzmüller, Michael I. Jordan, and Francis Bach. Structured Matrix Scaling for Multi-Class Calibration. In AISTATS, 2026.
[8] Juozas Vaicenavicius, David Widmann, Carl R. Andersson, Fredrik Lindsten, Jacob Roll, and Thomas B. Schön. Evaluating model calibration in classification. In AISTATS, volume 89 of Proceedings of Machine Learning Research, pages 3459–3467. PMLR, 2019.
[9] Kaspar Valk and Meelis Kull. Assuming locally equal calibration errors for non-parametric multiclass calibration. Transactions on Machine Learning Research, 2023.
[10] Cesare Barbera, Lorenzo Perini, Giovanni De Toni, Andrea Passerini, and Andrea Pugnana. Multiclass Local Calibration With the Jensen-Shannon Distance. In AISTATS, 2026.
[11] Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In AAAI, pages 2901–2907. AAAI Press, 2015.
[12] Imanol Arrieta Ibarra, Paman Gujral, Jonathan Tannen, Mark Tygert, and Cherie Xu. Metrics of calibration for probabilistic predictions. J. Mach. Learn. Res., 23:351:1–351:54, 2022.
[13] Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. In ICLR. OpenReview.net, 2023.
[14] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Taking a step back with kcal: Multi-class kernel-based calibration for deep neural networks. In ICLR. OpenReview.net, 2023.
[15] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is “nearest neighbor” meaningful? In International conference on database theory, pages 217–235. Springer, 1999.
[16] Larry Wasserman. All of nonparametric statistics. Springer, 2006.
[17] Trevor Hastie, Robert Tibshirani, Jerome Friedman, et al. The elements of statistical learning, 2009.
[18] Vladislav Polianskii, Giovanni Luca Marchetti, Alexander Kravberg, Anastasiia Varava, Florian T Pokorny, and Danica Kragic. Voronoi density estimator for high-dimensional data: Computation, compactification and convergence. In Uncertainty in Artificial Intelligence, pages 1644–1653. PMLR, 2022.
[19] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
[20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[21] Jiancheng Yang, Rui Shi, and Bingbing Ni. Medmnist classification decathlon: A lightweight automl benchmark for medical image analysis. In ISBI, pages 191–195. IEEE, 2021.
[22] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data, 10(1):41, 2023.
[23] Vebjorn Ljosa, Katherine L Sokolnicki, and Anne E Carpenter. Annotated high-throughput microscopy image sets for validation. Nature methods, 9(7):637, 2012.
[24] Andrey Malinin, Neil Band, Yarin Gal, Mark J. F. Gales, Alexander Ganshin, German Chesnokov, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, Vyas Raina, Denis Roginskiy, Mariya Shmatova, Panagiotis Tigas, and Boris Yangel. Shifts: A dataset of real distributional shift across multiple large-scale tasks. In NeurIPS...
[25] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, pages 694–699. ACM, 2002.
[26] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.
[27] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In CVPR, pages 11966–11976. IEEE, 2022.
[28] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR. OpenReview.net, 2021.
[29] Meelis Kull, Telmo Silva Filho, and Peter Flach. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Artificial intelligence and statistics, pages 623–631. PMLR, 2017.
[30] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In ICML, pages 609–616. Morgan Kaufmann, 2001.
[31] Zhipeng Ding, Xu Han, Peirong Liu, and Marc Niethammer. Local temperature scaling for probability calibration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6889–6899, 2021.
[32] Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pages 1939–1948. PMLR, 2018.
[33] Christopher Jung, Changhwa Lee, Mallesh Pai, Aaron Roth, and Rakesh Vohra. Moment multicalibration for uncertainty estimation. In Conference on Learning Theory, pages 2634–
[34] Riccardo Colini Baldeschi, Simone Di Gregorio, Simone Fioravanti, Federico Fusco, Ido Guy, Daniel Haimovich, Stefano Leonardi, Fridolin Linder, Lorenzo Perini, Matteo Russo, et al. Multicalibration yields better matchings. arXiv preprint arXiv:2511.11413, 2025.
[35] Hongyi Henry Jin, Zijun Ding, Dung Daniel Ngo, and Zhiwei Steven Wu. Discretization-free multicalibration through loss minimization over tree ensembles. arXiv preprint arXiv:2505.17435, 2025.
[36] Niek Tax, Lorenzo Perini, Fridolin Linder, Daniel Haimovich, Dima Karamshuk, Nastaran Okati, Milan Vojnovic, and Pavlos Athanasios Apostolopoulos. Mcgrad: Multicalibration at web scale. In KDD (1), pages 2470–2481. ACM, 2026.
[37] Ira Globus-Harris, Varun Gupta, Christopher Jung, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Multicalibrated regression for downstream fairness. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 259–286, 2023.
[38] Georgy Noarov and Aaron Roth. The statistical scope of multicalibration. In International Conference on Machine Learning, pages 26283–26310. PMLR, 2023.
[39] Mohammad Hassan Vali, Tom Bäckström, and Arno Solin. Diveq: Differentiable vector quantization using the reparameterization trick. 2026.
[40] Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.
[41] Halbert White. Maximum likelihood estimation of misspecified models. Econometrica: Journal of the econometric society, pages 1–25, 1982.
[42] Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data. In NeurIPS, pages 18932–18943, 2021.
[43] Ross Wightman et al. Pytorch image models. 2019.
[44] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.

A Proofs

A.1 Proof of Proposition 1

Proof. It suffices to show that

$$\min_{q \in Q} \|z - q\|^2 = \sum_{i=1}^{w} \min_{j \in \{1,\dots,|C|\}} \|z^{(i)} - c_j\|^2.$$

Take any index vector $s$:

$$\|z - q_s\|^2 = \left\|\left(z^{(1)} - c_{s(1)}, \dots, z^{(w)} - c_{s(w)}\right)\right\|^2 = \sum_{i=1}^{w} \|z^{(i)} - c_{s(i)}\|^2.$$

Because the Euclidean norm is ...
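The identity at the start of the Proposition 1 proof (the nearest concatenated codeword is found by picking the nearest codeword for each chunk independently) can be checked numerically on a small instance. The sketch below is only a sanity check under assumed sizes; `w`, `m`, and `K` are arbitrary illustrative values, and the brute-force search is feasible only because `K**w` is tiny here.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
w, m, K = 3, 2, 4                        # chunks, chunk dimension, codebook size
codebook = rng.normal(size=(K, m))       # codewords c_1, ..., c_K
z = rng.normal(size=w * m)               # latent vector made of w chunks
chunks = z.reshape(w, m)

# Left-hand side: brute force over all K**w concatenated codewords q_s.
lhs = min(
    np.sum((z - codebook[list(s)].reshape(-1)) ** 2)
    for s in itertools.product(range(K), repeat=w)
)

# Right-hand side: independent nearest-codeword search for each chunk.
rhs = sum(((chunks[i] - codebook) ** 2).sum(1).min() for i in range(w))

print(np.isclose(lhs, rhs))
```

The practical consequence is that assignment costs `w * K` distance computations instead of `K**w`, even though the partition has up to `K**w` regions.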

[1] [1]

arXiv preprint arXiv:2308.01222 , year =

Cheng Wang. Calibration in deep learning: A survey of the state-of-the-art.CoRR, abs/2308.01222, 2023

work page arXiv 2023

[2] [2]

Krishnan, and Deepti R

Abhishek Singh Sambyal, Usma Niyaz, Narayanan C. Krishnan, and Deepti R. Bathula. Under- standing calibration of deep neural networks for medical image classification.Comput. Methods Programs Biomed., 242:107816, 2023

work page 2023

[3] [3]

Weinberger

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. InICML, volume 70 ofProceedings of Machine Learning Research, pages 1321–1330. PMLR, 2017

work page 2017

[4] [4]

Koh, Jiaying Wu, Shen Li, Jianqing Xu, and Bryan Hooi

Miao Xiong, Ailin Deng, Pang Wei W. Koh, Jiaying Wu, Shen Li, Jianqing Xu, and Bryan Hooi. Proximity-informed calibration for deep neural networks. InNeurIPS, 2023

work page 2023

[5] [5]

Local calibration: metrics and recalibration

Rachel Luo, Aadyot Bhatnagar, Yu Bai, Shengjia Zhao, Huan Wang, Caiming Xiong, Silvio Savarese, Stefano Ermon, Edward Schmerling, and Marco Pavone. Local calibration: metrics and recalibration. InUAI, volume 180 ofProceedings of Machine Learning Research, pages 1286–1295. PMLR, 2022

work page 2022

[6] [6]

Meelis Kull, Miquel Perelló-Nieto, Markus Kängsepp, Telmo de Menezes e Silva Filho, Hao Song, and Peter A. Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. InNeurIPS, pages 12295–12305, 2019

work page 2019

[7] [7]

Jordan, and Francis Bach

Eugene Berta, David Holzmüller, Michael I. Jordan, and Francis Bach. Structured Matrix Scaling for Multi-Class Calibration. InAISTATS, 2026

work page 2026

[8] [8]

Andersson, Fredrik Lindsten, Jacob Roll, and Thomas B

Juozas Vaicenavicius, David Widmann, Carl R. Andersson, Fredrik Lindsten, Jacob Roll, and Thomas B. Schön. Evaluating model calibration in classification. InAISTATS, volume 89 of Proceedings of Machine Learning Research, pages 3459–3467. PMLR, 2019

work page 2019

[9] [9]

Assuming locally equal calibration errors for non-parametric multiclass calibration.Transactions on Machine Learning Research, 2023

Kaspar Valk and Meelis Kull. Assuming locally equal calibration errors for non-parametric multiclass calibration.Transactions on Machine Learning Research, 2023

work page 2023

[10] [10]

Multiclass Local Calibration With the Jensen-Shannon Distance

Cesare Barbera, Lorenzo Perini, Giovanni De Toni, Andrea Passerini, and Andrea Pugnana. Multiclass Local Calibration With the Jensen-Shannon Distance. InAISTATS, 2026

work page 2026

[11] [11]

Cooper, and Milos Hauskrecht

Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. InAAAI, pages 2901–2907. AAAI Press, 2015. 10

work page 2015

[12] [12]

Metrics of calibration for probabilistic predictions.J

Imanol Arrieta Ibarra, Paman Gujral, Jonathan Tannen, Mark Tygert, and Cherie Xu. Metrics of calibration for probabilistic predictions.J. Mach. Learn. Res., 23:351:1–351:54, 2022

work page 2022

[13] [13]

Last layer re-training is sufficient for robustness to spurious correlations

Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. InICLR. OpenReview.net, 2023

work page 2023

[14] [14]

Taking a step back with kcal: Multi-class kernel-based calibration for deep neural networks

Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Taking a step back with kcal: Multi-class kernel-based calibration for deep neural networks. InICLR. OpenReview.net, 2023

work page 2023

[15] [15]

nearest neighbor

Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is “nearest neighbor” meaningful? InInternational conference on database theory, pages 217–235. Springer, 1999

work page 1999

[16] [16]

Springer, 2006

Larry Wasserman.All of nonparametric statistics. Springer, 2006

work page 2006

[17] [17]

The elements of statistical learning, 2009

Trevor Hastie, Robert Tibshirani, Jerome Friedman, et al. The elements of statistical learning, 2009

work page 2009

[18] [18]

V oronoi density estimator for high-dimensional data: Computation, compactification and convergence

Vladislav Polianskii, Giovanni Luca Marchetti, Alexander Kravberg, Anastasiia Varava, Flo- rian T Pokorny, and Danica Kragic. V oronoi density estimator for high-dimensional data: Computation, compactification and convergence. InUncertainty in Artificial Intelligence, pages 1644–1653. PMLR, 2022

work page 2022

[19] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

[20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

[21] Jiancheng Yang, Rui Shi, and Bingbing Ni. MedMNIST classification decathlon: A lightweight AutoML benchmark for medical image analysis. In ISBI, pages 191–195. IEEE, 2021

[22] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. MedMNIST v2 - a large-scale lightweight benchmark for 2D and 3D biomedical image classification. Scientific Data, 10(1):41, 2023

[23] Vebjorn Ljosa, Katherine L Sokolnicki, and Anne E Carpenter. Annotated high-throughput microscopy image sets for validation. Nature methods, 9(7):637, 2012

[24] Andrey Malinin, Neil Band, Yarin Gal, Mark J. F. Gales, Alexander Ganshin, German Chesnokov, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, Vyas Raina, Denis Roginskiy, Mariya Shmatova, Panagiotis Tigas, and Boris Yangel. Shifts: A dataset of real distributional shift across multiple large-scale tasks. In NeurIPS..., 2021

[25] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, pages 694–699. ACM, 2002

[26] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999

[27] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In CVPR, pages 11966–11976. IEEE, 2022

[28] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR. OpenReview.net, 2021

[29] Meelis Kull, Telmo Silva Filho, and Peter Flach. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Artificial intelligence and statistics, pages 623–631. PMLR, 2017

[30] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, pages 609–616. Morgan Kaufmann, 2001

[31] Zhipeng Ding, Xu Han, Peirong Liu, and Marc Niethammer. Local temperature scaling for probability calibration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6889–6899, 2021

[32] Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pages 1939–1948. PMLR, 2018

[33] Christopher Jung, Changhwa Lee, Mallesh Pai, Aaron Roth, and Rakesh Vohra. Moment multicalibration for uncertainty estimation. In Conference on Learning Theory, pages 2634–

[34] Riccardo Colini Baldeschi, Simone Di Gregorio, Simone Fioravanti, Federico Fusco, Ido Guy, Daniel Haimovich, Stefano Leonardi, Fridolin Linder, Lorenzo Perini, Matteo Russo, et al. Multicalibration yields better matchings. arXiv preprint arXiv:2511.11413, 2025

[35] Hongyi Henry Jin, Zijun Ding, Dung Daniel Ngo, and Zhiwei Steven Wu. Discretization-free multicalibration through loss minimization over tree ensembles. arXiv preprint arXiv:2505.17435, 2025

[36] Niek Tax, Lorenzo Perini, Fridolin Linder, Daniel Haimovich, Dima Karamshuk, Nastaran Okati, Milan Vojnovic, and Pavlos Athanasios Apostolopoulos. MCGrad: Multicalibration at web scale. In KDD (1), pages 2470–2481. ACM, 2026

[37] Ira Globus-Harris, Varun Gupta, Christopher Jung, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Multicalibrated regression for downstream fairness. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 259–286, 2023

[38] Georgy Noarov and Aaron Roth. The statistical scope of multicalibration. In International Conference on Machine Learning, pages 26283–26310. PMLR, 2023

[39] Mohammad Hassan Vali, Tom Bäckström, and Arno Solin. DiVeQ: Differentiable vector quantization using the reparameterization trick. 2026

[40] Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000

[41] Halbert White. Maximum likelihood estimation of misspecified models. Econometrica: Journal of the econometric society, pages 1–25, 1982

[42] Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data. In NeurIPS, pages 18932–18943, 2021

[43] Ross Wightman et al. PyTorch image models. 2019

[44] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.

A Proofs

A.1 Proof of Proposition 1

Proof. It suffices to show that
$$\min_{q \in Q} \|z - q\|^2 = \sum_{i=1}^{w} \min_{j \in \{1,\dots,|C|\}} \|z^{(i)} - c_j\|^2.$$
Take any index vector $s$. Then
$$\|z - q_s\|^2 = \left\|\left(z^{(1)} - c_{s(1)}, \dots, z^{(w)} - c_{s(w)}\right)\right\|^2 = \sum_{i=1}^{w} \|z^{(i)} - c_{s(i)}\|^2.$$
Because the Euclidean norm is ...
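The identity that the proof of Proposition 1 reduces to can be checked numerically on a toy case. The sketch below is only an illustration, not the authors' implementation: it assumes that z splits into w equal-sized blocks, that every block shares one codebook C, and that Q is the set of all concatenations of w codewords. The sizes `w`, `d_block`, `n_codes` and the random data are hypothetical.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: w blocks, each of dimension d_block, |C| = n_codes.
w, d_block, n_codes = 3, 4, 5
codebook = rng.normal(size=(n_codes, d_block))  # shared codebook C
z = rng.normal(size=(w, d_block))               # row i is the block z^(i)

# Left-hand side: brute-force minimum over all |C|^w concatenated codewords q_s.
best_joint = min(
    np.sum((z - codebook[list(s)]) ** 2)
    for s in itertools.product(range(n_codes), repeat=w)
)

# Right-hand side: pick the nearest codeword independently for each block.
dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (w, n_codes)
best_blockwise = dists.min(axis=1).sum()

# The two quantities agree up to floating-point error.
assert np.isclose(best_joint, best_blockwise)
```

The brute-force search visits |C|^w candidates while the block-wise search visits only w·|C|, which is why the decomposition matters when the number of blocks grows.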