Divide et Calibra: Multiclass Local Calibration via Vector Quantization
Pith reviewed 2026-05-21 05:35 UTC · model grok-4.3
The pith
Vector quantization creates region-specific calibration maps for multiclass models by sharing parameters across a partitioned latent space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By inducing a structured partition of the representation space with vector quantization and using an indexed parameterization of Dirichlet concentrations, the method constructs compositional, region-specific calibration maps that generalize to data-sparse regions while preserving global calibration and predictive performance.
What carries the argument
A vector-quantization partition of the representation space, combined with an indexed parameterization of Dirichlet concentrations; the two together enable structured parameter sharing across regions.
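The mechanism can be illustrated with a minimal sketch. The page does not give the paper's exact parameterization, so everything concrete here is an assumption: the map form `softmax(W_k log p + b_k)` is borrowed from Dirichlet calibration, the tables `weights` and `biases` are simply indexed by the nearest codeword, and the names `nearest_codeword` and `calibrate` are hypothetical. The sketch shows only the indexing step, not the compositional sharing of factors across regions.

```python
import math

def nearest_codeword(z, codebook):
    """Index of the codeword closest to z in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(z, codebook[k])))

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def calibrate(p, z, codebook, weights, biases, eps=1e-12):
    """Apply the calibration map of the region that z falls into.

    Assumed map form (not confirmed by the source): softmax(W_k log p + b_k),
    where k is the index of the nearest codeword.
    """
    k = nearest_codeword(z, codebook)
    log_p = [math.log(max(x, eps)) for x in p]
    logits = [sum(w * lp for w, lp in zip(row, log_p)) + b
              for row, b in zip(weights[k], biases[k])]
    return k, softmax(logits)

# Toy setup: two regions, three classes (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
half = [[0.5 * x for x in row] for row in identity]
weights = [identity, half]          # region 0 leaves p unchanged, region 1 flattens it
biases = [[0.0] * 3, [0.0] * 3]

p = [0.7, 0.2, 0.1]
k0, q0 = calibrate(p, [0.1, -0.2], codebook, weights, biases)
k1, q1 = calibrate(p, [0.9, 1.2], codebook, weights, biases)
```

The same classifier output `p` is mapped differently depending on where the latent vector falls, which is the sense in which the calibration maps are heterogeneous.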
If this is right
- Local calibration error decreases on standard benchmarks while global calibration and accuracy remain competitive.
- Calibration maps become heterogeneous and adapt to different parts of the latent space.
- Parameter sharing across quantized regions reduces the need for separate maps in data-poor areas.
- The approach avoids the information loss that accompanies dimensionality-reduction steps in prior local methods.
Where Pith is reading between the lines
- The same VQ-plus-indexed-parameter idea might be tried with other partitioning schemes such as clustering or decision trees.
- If the codebook size is treated as a hyperparameter, one could test whether larger codebooks further improve local calibration at the cost of more parameters.
- The method could be combined with post-hoc recalibration techniques that operate on top of the VQ indices.
Load-bearing premise
That partitioning the latent space with vector quantization and tying Dirichlet parameters to codewords produces useful sharing that improves calibration in sparse regions without adding new biases.
What would settle it
Measure local calibration error on a held-out test set deliberately constructed to cover regions of the latent space that are empty or nearly empty in the calibration data; if error in those regions does not drop relative to a strong global baseline, or if a measure of bias introduced by the sharing rises, the claim is falsified.
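A test of this kind needs a calibration error reported per region rather than once for the whole test set. The following is a minimal sketch under stated assumptions: it uses top-label expected calibration error with equal-width bins, which may differ from the local metric used in the paper, and the helper names `ece` and `ece_by_region` and all data values are hypothetical.

```python
from collections import defaultdict

def ece(confidences, correct, n_bins=10):
    """Top-label expected calibration error with equal-width confidence bins."""
    bins = defaultdict(list)
    for c, ok in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)
        bins[b].append((c, ok))
    n = len(confidences)
    total = 0.0
    for items in bins.values():
        avg_conf = sum(c for c, _ in items) / len(items)
        acc = sum(ok for _, ok in items) / len(items)
        total += len(items) / n * abs(avg_conf - acc)
    return total

def ece_by_region(confidences, correct, regions, n_bins=10):
    """Compute ECE separately for each region label (e.g. a codeword index)."""
    grouped = defaultdict(lambda: ([], []))
    for c, ok, r in zip(confidences, correct, regions):
        grouped[r][0].append(c)
        grouped[r][1].append(ok)
    return {r: ece(cs, oks, n_bins) for r, (cs, oks) in grouped.items()}

# Toy data (illustrative, not from the paper): region 0 is dense, region 1 sparse.
conf = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
hit = [1, 1, 1, 0, 1, 0]
region = [0, 0, 0, 0, 1, 1]
per_region = ece_by_region(conf, hit, region)
```

Running the same function on the outputs of a global baseline and of the region-specific method gives the two numbers per region that the falsification test compares.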
Original abstract
Accurate and well-calibrated Machine Learning (ML) models are mandatory in high-stakes settings, yet effective multiclass calibration remains challenging: global approaches assume calibration errors are homogeneous across the latent space, while local methods often rely on latent-space dimensionality reduction, which leads to information loss. To address these issues, we propose a compositional approach to multiclass calibration, where region-specific calibration maps are constructed from shared codeword-dependent factors. We instantiate this idea via Vector Quantization (VQ), which induces a structured partition of the representation space, and an indexed parameterization of Dirichlet concentrations that enables parameter sharing across regions. Our approach learns heterogeneous calibration maps that generalize well even to sparse regions of the latent space. Experiments on benchmark datasets show significant improvements in local calibration while maintaining competitive global calibration and predictive performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Divide et Calibra, a compositional multiclass calibration method that partitions the representation space via Vector Quantization (VQ) and uses an indexed parameterization of Dirichlet concentrations to share parameters across regions. It claims this yields heterogeneous, locally adaptive calibration maps that generalize effectively to sparse latent-space regions, with experiments on benchmark datasets demonstrating improved local calibration alongside competitive global calibration and predictive performance.
Significance. If the central claims are substantiated, the work offers a scalable alternative to global calibration (which assumes homogeneity) and dimensionality-reduction-based local methods (which incur information loss). The compositional construction via VQ-induced partitions and indexed Dirichlet sharing could enable better handling of heterogeneous calibration errors in high-stakes settings. The manuscript provides experimental results on standard benchmarks but does not include reproducible code, machine-checked proofs, or parameter-free derivations.
major comments (2)
- [§3.2] §3.2 (VQ partition and indexed Dirichlet parameterization): the central claim that codeword-dependent factors enable bias-free parameter sharing that improves calibration specifically in sparse regions is load-bearing but under-supported; the joint objective must be shown to align codeword assignment with calibration-error homogeneity rather than reconstruction loss alone, otherwise the indexed parameterization may simply average toward a global map.
- [Table 2] Table 2 / local-ECE breakdown: without an ablation that isolates the transfer effect to low-density codewords (e.g., by varying codebook size or freezing the VQ encoder), it remains unclear whether reported gains in sparse regions stem from the proposed sharing mechanism or from other modeling choices such as the Dirichlet concentration learning.
minor comments (3)
- The abstract states 'significant improvements' in local calibration; the main text should report exact local-ECE deltas, confidence intervals, and the number of runs to allow readers to assess practical magnitude.
- [Notation] Notation for the indexed Dirichlet concentrations (Eq. (X)) should explicitly define how the codeword index selects the concentration vector to prevent ambiguity in the parameter-sharing construction.
- [Related Work] Related-work discussion should more explicitly contrast the compositional VQ approach against prior local calibration techniques that also employ clustering or partitioning.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below, clarifying the methodological rationale and outlining targeted revisions to strengthen the empirical and explanatory support.
- Referee: [§3.2] §3.2 (VQ partition and indexed Dirichlet parameterization): the central claim that codeword-dependent factors enable bias-free parameter sharing that improves calibration specifically in sparse regions is load-bearing but under-supported; the joint objective must be shown to align codeword assignment with calibration-error homogeneity rather than reconstruction loss alone, otherwise the indexed parameterization may simply average toward a global map.
  Authors: We agree that the alignment between VQ partitions and calibration-error homogeneity requires clearer exposition. The VQ encoder is trained to minimize reconstruction loss on the latent representations, thereby grouping inputs with similar representations into the same codeword. The indexed Dirichlet parameterization then assigns a distinct concentration vector to each codeword; these parameters are optimized jointly with the calibration loss. Because codewords cluster representationally similar inputs, the shared parameters within a codeword effectively transfer strength from dense to sparse regions without introducing bias from dissimilar inputs. In the revision we will expand §3.2 with a paragraph that (i) formalizes the joint objective, (ii) states the assumption that representational proximity implies calibration-error homogeneity, and (iii) notes that the sharing is therefore bias-free conditional on that clustering. We will also add a short remark contrasting this construction with a purely global Dirichlet model.
  revision: partial
- Referee: [Table 2] Table 2 / local-ECE breakdown: without an ablation that isolates the transfer effect to low-density codewords (e.g., by varying codebook size or freezing the VQ encoder), it remains unclear whether reported gains in sparse regions stem from the proposed sharing mechanism or from other modeling choices such as the Dirichlet concentration learning.
  Authors: We concur that an ablation isolating the transfer effect would strengthen the claims. In the revised manuscript we will add two controlled experiments: (1) training with codebook sizes K = 8, 16, 32, 64 and reporting local-ECE stratified by codeword occupancy, and (2) a frozen-VQ variant in which the encoder is pretrained once and then held fixed while only the indexed Dirichlet parameters are learned. These results will be presented in an extended Table 2 together with a new column showing the performance gap between low- and high-density codewords, thereby isolating the contribution of the sharing mechanism.
  revision: yes
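The occupancy-stratified report promised in the second response could be computed along the following lines. This is a sketch, not the authors' protocol: the occupancy threshold, the helper name `occupancy_gap`, and the per-codeword error values are all hypothetical toy inputs, and the per-codeword errors are assumed to come from whatever local calibration metric the paper uses.

```python
from collections import Counter

def _mean(xs):
    return sum(xs) / len(xs) if xs else float("nan")

def occupancy_gap(regions, region_error, threshold):
    """Compare mean per-codeword error between sparse and dense codewords.

    regions:      codeword index assigned to each calibration sample
    region_error: mapping codeword index -> local calibration error
    threshold:    codewords with fewer samples than this count as low-occupancy
    """
    counts = Counter(regions)
    low = [region_error[k] for k, n in counts.items() if n < threshold]
    high = [region_error[k] for k, n in counts.items() if n >= threshold]
    return {"low": _mean(low), "high": _mean(high), "gap": _mean(low) - _mean(high)}

# Toy inputs (illustrative only, not results from the paper).
regions = [0] * 50 + [1] * 40 + [2] * 3 + [3] * 2
region_error = {0: 0.02, 1: 0.04, 2: 0.10, 3: 0.12}
report = occupancy_gap(regions, region_error, threshold=10)
```

Repeating this for each codebook size K and for the frozen-VQ variant would show whether the low-versus-high gap shrinks when sharing is enabled, which is the effect the referee asks to see isolated.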
Circularity Check
No circularity: derivation builds on standard VQ and Dirichlet concepts without reduction to inputs by construction
The paper's central proposal uses Vector Quantization to induce a partition of the representation space and an indexed parameterization of Dirichlet concentrations to enable parameter sharing. The abstract and available text present this as a compositional construction for heterogeneous calibration maps, with claimed generalization to sparse regions supported by experiments rather than any definitional equivalence or fitted-input renaming. No equations are shown that equate a prediction to its own fitting procedure, no uniqueness theorems are imported from self-citations, and no ansatz is smuggled via prior work. The approach relies on established VQ and Dirichlet machinery whose independence from the target calibration improvement is not contradicted by the provided material. This is the expected non-finding for a method paper whose load-bearing steps remain externally verifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Vector quantization induces a structured partition of the representation space suitable for region-specific calibration.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "We instantiate this idea via Vector Quantization (VQ), which induces a structured partition of the representation space, and an indexed parameterization of Dirichlet concentrations that enables parameter sharing across regions."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "Theorem 2 (Frequency-Weighted Convergence of Codeword Parameters)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Proofs
A.1 Proof of Proposition 1
Proof. It suffices to show that
\[ \min_{q \in Q} \|z - q\|^2 = \sum_{i=1}^{w} \min_{j \in \{1,\dots,|C|\}} \|z^{(i)} - c_j\|^2. \]
Take any index vector $s$:
\[ \|z - q_s\|^2 = \left\| \left(z^{(1)} - c_{s(1)}, \dots, z^{(w)} - c_{s(w)}\right) \right\|^2 = \sum_{i=1}^{w} \|z^{(i)} - c_{s(i)}\|^2. \]
Because the Euclidean norm is ...
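The identity that the proof of Proposition 1 sets out to show can be checked numerically on a small instance. The sketch below assumes, based only on the formula, that z is split into w equal-length blocks, that all blocks share one codebook C, and that Q is the set of all concatenations of w codewords; the function names are hypothetical. The brute-force side builds each concatenated vector explicitly so that it does not rely on the decomposition being tested.

```python
import itertools
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_min(z_blocks, codebook):
    """Left-hand side: minimize ||z - q_s||^2 over all |C|^w index vectors s."""
    z = [x for block in z_blocks for x in block]
    best = float("inf")
    for s in itertools.product(range(len(codebook)), repeat=len(z_blocks)):
        q = [x for j in s for x in codebook[j]]   # concatenated codewords q_s
        best = min(best, sq_dist(z, q))
    return best

def blockwise_min(z_blocks, codebook):
    """Right-hand side: sum over blocks of the nearest-codeword distance."""
    return sum(min(sq_dist(block, c) for c in codebook) for block in z_blocks)

rng = random.Random(0)
codebook = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(4)]   # |C| = 4
z_blocks = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(3)]   # w = 3

lhs = brute_force_min(z_blocks, codebook)
rhs = blockwise_min(z_blocks, codebook)
```

The practical point of the identity is cost: the brute-force search visits |C|^w candidates, while the blockwise form needs only w · |C| distance evaluations.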