pith. machine review for the scientific record.

arxiv: 2605.12780 · v1 · submitted 2026-05-12 · 📊 stat.ME · cs.LG · stat.ML

Recognition: unknown

When to Trust Confidence Thresholding: Calibration Diagnostics for Pseudo-Labelled Regression

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 19:37 UTC · model grok-4.3

classification 📊 stat.ME · cs.LG · stat.ML
keywords pseudo-labeling · confidence thresholding · calibration diagnostics · attenuation bias · regression estimation · semi-supervised methods · variance-based diagnostics

The pith

The attenuation bias from thresholding confidence scores can be predicted exactly from the residual score variance on unlabelled data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that confidence thresholding on calibrated classifier scores introduces an attenuation bias in subsequent regression estimates. This bias has a closed-form expression that depends only on the expected conditional variance of the scores given the regression controls. Practitioners can compute this quantity directly from classifier outputs before performing any downstream inference, allowing them to decide whether thresholding is safe or to apply corrections. The derivation assumes a structural separation between classifier features and regression controls, ensuring the variance term is identifiable.

Core claim

Building on a recent identification result, we derive a closed-form expression for the attenuation bias that confidence thresholding induces in the downstream regression coefficient. The bias can be predicted from the residual score variance V^*=E[Var(p|X)] on the unlabelled set after partialling out the downstream controls X. We also obtain a sharp sensitivity bound under bounded calibration drift and identify the boundary V^*=0, which holds if and only if p is a deterministic function of X.
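
One way to see the boundary claim, via the law of total variance (a standard identity supplied here for clarity, not taken from the paper):

    $\operatorname{Var}(p) \;=\; \mathbb{E}[\operatorname{Var}(p \mid X)] + \operatorname{Var}(\mathbb{E}[p \mid X]) \;=\; V^{*} + \operatorname{Var}(\mathbb{E}[p \mid X]),$

so V^* = 0 forces Var(p | X) = 0 almost surely, i.e. p = E[p | X]: the score is a deterministic function of X exactly at the boundary.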

What carries the argument

The residual score variance V^* = E[Var(p | X)], which serves as a pre-inference diagnostic for the size of attenuation bias induced by thresholding.
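
As a concreteness aid, a minimal sketch of how V^* could be estimated from classifier outputs. The paper does not prescribe an estimator, so the out-of-bag random-forest regression below is an assumption, not the authors' procedure; it uses the identity V^* = E[(p - E[p|X])^2], i.e. the mean squared residual of the score after partialling out X.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def estimate_v_star(p, X, n_estimators=500, random_state=0):
        """Estimate V* = E[Var(p | X)] = E[(p - E[p | X])^2] on the unlabelled set.

        Illustrative choice: a random forest for the conditional mean E[p | X],
        scored out-of-bag so the residuals are not shrunk by overfitting.
        """
        p = np.asarray(p, dtype=float)
        forest = RandomForestRegressor(
            n_estimators=n_estimators, oob_score=True, random_state=random_state
        ).fit(X, p)
        residuals = p - forest.oob_prediction_  # score minus its partialled-out part
        return float(np.mean(residuals ** 2))

Any consistent regression of p on X would do in place of the forest; what the paper's claim pins down is only that the partialling-out uses the downstream controls X, not the full classifier features W.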

If this is right

  • The bias vanishes exactly when V^* = 0, i.e., when the score is fully determined by X.
  • A (V^*, κ) decision rule tells practitioners when thresholding is safe (sketched after this list).
  • The sensitivity bound quantifies how much calibration drift can be tolerated before the bias becomes large.
  • Simulations and the UCI Adult example confirm that the formula predicts observed bias accurately.
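
A hedged rendering of how such a rule might be wired up in practice. The paper's closed-form bias expression is not reproduced on this page, so the predictor below is a caller-supplied function; the interface and the tolerance default are illustrative assumptions, not the authors' specification.

    from typing import Callable

    def thresholding_is_safe(
        v_star: float,
        kappa: float,
        predicted_bias: Callable[[float, float], float],
        tolerance: float = 0.01,
    ) -> bool:
        """Decide whether confidence thresholding at cutoff kappa is safe.

        predicted_bias(v_star, kappa) stands in for the paper's closed-form
        attenuation expression; the rule then reduces to checking the
        predicted bias against a user-chosen tolerance.
        """
        if v_star == 0.0:
            # Boundary case: p is a deterministic function of X, so no attenuation.
            return True
        return abs(predicted_bias(v_star, kappa)) <= tolerance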

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar diagnostics could be developed for other pseudo-labeling strategies beyond simple thresholding.
  • The framework suggests designing classifiers with explicit feature separation from downstream controls.
  • Testing the diagnostic on streaming or online learning settings would reveal its robustness to distribution shift.

Load-bearing premise

The underlying moment equation is identified exactly and calibration drift remains bounded.

What would settle it

Run a controlled simulation with known true labels, apply thresholding at various cutoffs, and check if the difference between thresholded and oracle regression coefficients matches the predicted value from V^*.
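
A minimal sketch of that experiment, under assumptions that stand in for the paper's unspecified designs: a logistic score built from classifier features W = (X, Z) strictly containing the control X, a linear outcome, and hard labels 1{p > kappa} used in place of the latent group. Because the closed form is not reproduced here, the sketch prints the empirical attenuation alongside V^* so the two can be compared against the paper's formula.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    beta, gamma = 1.0, 0.5  # true downstream coefficients

    # Assumed data-generating process: control X, extra classifier feature Z
    # (so W = (X, Z) strictly contains X), latent group G with calibrated
    # score p = P(G = 1 | W), and a linear outcome Y.
    X = rng.normal(size=n)
    Z = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(-(0.8 * X + 1.2 * Z)))
    G = rng.binomial(1, p)
    Y = beta * G + gamma * X + rng.normal(scale=0.5, size=n)

    def slope_on_label(label):
        """Coefficient on `label` in an OLS of Y on (1, label, X)."""
        D = np.column_stack([np.ones(n), label, X])
        return np.linalg.lstsq(D, Y, rcond=None)[0][1]

    oracle = slope_on_label(G)
    for kappa in (0.5, 0.7, 0.9):
        ratio = slope_on_label((p > kappa).astype(float)) / oracle
        print(f"kappa = {kappa}: attenuation ratio = {ratio:.3f}")

    # V* = E[Var(p | X)], computable here by Monte Carlo because the DGP is known.
    z_draws = rng.normal(size=(1, 2_000))
    p_given_x = 1.0 / (1.0 + np.exp(-(0.8 * X[:2_000, None] + 1.2 * z_draws)))
    print(f"V* = {p_given_x.var(axis=1).mean():.3f}")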

Figures

Figures reproduced from arXiv: 2605.12780 by Marcell T. Kurbucz.

Figure 1. Empirical attenuation ratio (filled circles, with 95% Monte Carlo confidence …
Figure 2. Bias-variance decomposition of the confidence-thresholded estimator on log scale, …
Figure 3. Empirical bias of the confidence-thresholded estimator at six thresholds, plotted …
Figure 4. Empirical bias under three calibration drift shapes, against …
Figure 5. V^* on the observed score plotted against labelled-set size, with classifier features W equal to the downstream controls X. The deterministic logistic + isotonic pipeline shrinks V^* as the labelled-set size grows, because the classifier becomes a tighter function of X; posterior-predictive sampling preserves V^* at an order of magnitude above. The structural alternative used in Section 5.7, taking W strictly larger than X, …
Figure 6. UCI Adult: MSE against the full-sample target on a log scale, as a function of …
read the original abstract

Calibrated probability outputs of trained classifiers are increasingly used as inputs to downstream regression estimands such as effects, prevalences, or disparities for a latent group observed only on a small labelled subset. A standard practice is to threshold the calibrated score at a confidence cutoff and treat the hard label as the truth. Building on a recent identification result for the underlying moment equation, we develop a calibration-aware diagnostic apparatus for pseudo-labelling pipelines. We derive a closed-form expression for the attenuation bias that confidence thresholding induces in the downstream regression coefficient, and show that the bias can be predicted, before any inference is run, from the residual score variance $V^{*}=\mathbb{E}[\operatorname{Var}(p\mid X)]$ on the unlabelled set after partialling out the downstream controls $X$. We further obtain a sharp sensitivity bound under bounded calibration drift, and identify the boundary $V^{*}=0$, which holds iff $p$ is a deterministic function of $X$; this motivates a structural separation between classifier features $W$ and downstream controls $X\subsetneq W$. Five controlled simulations and a UCI Adult illustration trace the predictions. The contribution is operational: a $(V^{*}, \kappa)$ decision rule that practitioners can compute from any classifier output to decide whether confidence thresholding is safe.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to derive a closed-form expression for the attenuation bias that confidence thresholding induces in a downstream regression coefficient when using calibrated classifier scores as pseudo-labels. The bias is shown to be predictable before inference from the observable residual score variance V^*=E[Var(p|X)] on the unlabelled set after partialling out controls X. Building on an external identification result for the moment equation, the authors supply a sharp sensitivity bound under bounded calibration drift, identify the boundary V^*=0 (which holds iff p is deterministic in X), and motivate a structural separation X ⊂ W. The contribution is operationalized as a (V^*, κ) decision rule, supported by five controlled simulations and a UCI Adult illustration.

Significance. If the derivation is exact, the result supplies a practical, pre-inference diagnostic that lets practitioners decide whether thresholding is safe using only classifier outputs and observable quantities. The closed-form prediction, the explicit sensitivity bound, and the structural separation between classifier features W and downstream controls X are clear strengths. The simulations and real-data example provide direct traceability of the predicted bias. This could affect practice in semi-supervised regression and pseudo-labelling pipelines in statistics and machine learning.

major comments (1)
  1. [Derivation of closed-form bias expression] The closed-form bias expression (abstract and derivation section) treats the thresholding operator 1{p>κ} as preserving the moment identification exactly once V^* is conditioned on X. This requires that any selection-induced covariance between the thresholded pseudo-label and the downstream regression error (conditional on X) is fully absorbed into the residual variance term. The supplied sensitivity bound addresses only calibration drift; an explicit bound or verification for the additional covariance term is needed, as its presence would make the closed-form understate the bias even when V^* is observed perfectly.
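
In symbols (our notation, assuming a downstream model Y = βG + γ'X + ε with E[ε | X] = 0), the term the referee wants bounded or verified is

    $\operatorname{Cov}\bigl(\mathbf{1}\{p>\kappa\},\,\varepsilon \mid X\bigr) \;=\; \mathbb{E}\bigl[\mathbf{1}\{p>\kappa\}\,\varepsilon \mid X\bigr],$

which, if nonzero, would enter the bias of the coefficient on the hard pseudo-label on top of the attenuation predicted from V^*.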
minor comments (1)
  1. [Abstract] The abstract states that five simulations 'trace the predictions' but does not report the specific ranges of V^* and κ examined or the design of the data-generating processes; adding a short table or sentence would improve reproducibility assessment.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the potential practical utility of the (V^*, κ) rule. We address the single major comment below and will strengthen the derivation section accordingly.

read point-by-point responses
  1. Referee: The closed-form bias expression (abstract and derivation section) treats the thresholding operator 1{p>κ} as preserving the moment identification exactly once V^* is conditioned on X. This requires that any selection-induced covariance between the thresholded pseudo-label and the downstream regression error (conditional on X) is fully absorbed into the residual variance term. The supplied sensitivity bound addresses only calibration drift; an explicit bound or verification for the additional covariance term is needed, as its presence would make the closed-form understate the bias even when V^* is observed perfectly.

    Authors: We appreciate the referee drawing attention to this covariance term. Under the structural separation X ⊂ W that we introduce, the downstream regression error is conditionally independent of the classifier features W (and hence of p and the thresholded pseudo-label) given X. The identification result we build upon therefore implies that E[thresholded pseudo-label × regression error | X] = 0, so the covariance vanishes and is absorbed into the residual variance V^* without further bias. The calibration-drift sensitivity bound is stated separately because drift can induce finite-sample dependence even when the population conditional independence holds. To make the argument fully explicit, we will insert a short lemma in the derivation section proving the covariance term is zero under our maintained assumptions and will note how the existing sensitivity bound extends if conditional independence is relaxed by a small amount. This clarification does not change the closed-form result or the (V^*, κ) rule. revision: yes
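
The promised lemma, sketched in the same assumed notation: the structural separation gives ε ⊥ W | X, and p = p(W), so the hard label 1{p > κ} is a function of W and inherits the conditional independence; hence

    $\mathbb{E}\bigl[\mathbf{1}\{p>\kappa\}\,\varepsilon \mid X\bigr] \;=\; \mathbb{E}\bigl[\mathbf{1}\{p>\kappa\} \mid X\bigr]\,\mathbb{E}[\varepsilon \mid X] \;=\; 0.$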

Circularity Check

0 steps flagged

No circularity: derivation relies on external identification result and direct definition of V^*

full rationale

The paper states it builds on a recent external identification result for the underlying moment equation rather than deriving that condition from its own quantities. V^* is introduced as the observable E[Var(p|X)] computed directly from classifier outputs after partialling out controls X; the closed-form attenuation bias is expressed as a function of this quantity without fitting V^* to the target coefficient or renaming a fitted input as a prediction. No self-citation is load-bearing for the central claim, no ansatz is smuggled, and the structural separation X ⊂ W is maintained as an explicit assumption. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The derivation rests on an external identification result for the moment equation and on the assumption of bounded calibration drift; V^* is treated as directly observable from data rather than estimated as a free parameter.

axioms (2)
  • domain assumption Recent identification result for the underlying moment equation holds
    Paper states it builds directly on this result to obtain the closed-form bias expression.
  • domain assumption Calibration drift is bounded
    Used to obtain the sharp sensitivity bound on the bias.

pith-pipeline@v0.9.0 · 5529 in / 1446 out tokens · 54121 ms · 2026-05-14T19:37:52.002207+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 1 internal anchor

  1. [1] N. Kallus, X. Mao, A. Zhou, Assessing algorithmic fairness with unobserved protected class using data combination, Management Science 68 (3) (2022) 1959–1981.
  2. [2] D.-H. Lee, Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks, in: Workshop on Challenges in Representation Learning, ICML, Vol. 3, Atlanta, 2013, p. 896.
  3. [3] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, C. Raffel, FixMatch: Simplifying semi-supervised learning with consistency and confidence, in: Advances in Neural Information Processing Systems (NeurIPS), 2020.
  4. [4] B. Zhang, Y. Wang, W. Hou, H. Wu, J. Wang, M. Okumura, T. Shinozaki, FlexMatch: Boosting semi-supervised learning with curriculum pseudo labeling, Advances in Neural Information Processing Systems 34 (2021) 18408–18419.
  5. [5] Y. Wang, H. Chen, Q. Heng, W. Hou, Y. Fan, Z. Wu, J. Wang, M. Savvides, T. Shinozaki, B. Raj, B. Schiele, X. Xie, FreeMatch: Self-adaptive thresholding for semi-supervised learning, International Conference on Learning Representations (ICLR) (2023).
  6. [6] A. Tarvainen, H. Valpola, Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, in: Advances in Neural Information Processing Systems (NeurIPS), 2017.
  7. [7] E. Arazo, D. Ortego, P. Albert, N. E. O'Connor, K. McGuinness, Pseudo-labeling and confirmation bias in deep semi-supervised learning, in: 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1–8.
  8. [8] M. T. Kurbucz, Identification of latent group effects under conditional calibration (2026).
  9. [9] R. Kohavi, Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid, in: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1996, pp. 202–207.
  10. [10] J. Li, C. Xiong, S. C. Hoi, CoMatch: Semi-supervised learning with contrastive graph regularization, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9475–9484.
  11. [11] M. N. Rizve, K. Duarte, Y. S. Rawat, M. Shah, In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning, in: International Conference on Learning Representations (ICLR), 2021.
  12. [12] X. Wang, Z. Wu, L. Lian, S. X. Yu, Debiased learning from naturally imbalanced pseudo-labels, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2022).
  13. [13] J. E. Van Engelen, H. H. Hoos, A survey on semi-supervised learning, Machine Learning 109 (2) (2020) 373–440.
  14. [14] J. C. Platt, Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, in: Advances in Large Margin Classifiers, 1999.
  15. [15] B. Zadrozny, C. Elkan, Transforming classifier scores into accurate multiclass probability estimates, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 694–699.
  16. [16] A. Niculescu-Mizil, R. Caruana, Predicting good probabilities with supervised learning, in: Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 625–632.
  17. [17] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibration of modern neural networks, in: Proceedings of the 34th International Conference on Machine Learning, PMLR, 2017, pp. 1321–1330.
  18. [18] M. Minderer, J. Djolonga, R. Romijnders, F. Hubis, X. Zhai, N. Houlsby, D. Tran, M. Lucic, Revisiting the calibration of modern neural networks, Advances in Neural Information Processing Systems 34 (2021) 15682–15694.
  19. [19] Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. Dillon, B. Lakshminarayanan, J. Snoek, Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift, Advances in Neural Information Processing Systems (NeurIPS) 32 (2019).
  20. [20] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531 (2015).
  21. [21] R. Müller, S. Kornblith, G. E. Hinton, When does label smoothing help?, Advances in Neural Information Processing Systems 32 (2019).
  22. [22] A. Lewbel, Estimation of average treatment effects with misclassification, Econometrica 75 (2) (2007) 537–551.
  23. [23] A. Mahajan, Identification and estimation of regression models with misclassification, Econometrica 74 (3) (2006) 631–665.
  24. [24] P. M. Robinson, Root-N-consistent semiparametric regression, Econometrica 56 (4) (1988) 931–954.
  25. [25] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, J. Robins, Double/debiased machine learning for treatment and structural parameters, The Econometrics Journal 21 (1) (2018) C1–C68.
  26. [26] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, Advances in Neural Information Processing Systems 30 (2017).
  27. [27] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 1050–1059.
  28. [28] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
  29. [29] M. N. Wright, A. Ziegler, ranger: A fast implementation of random forests for high dimensional data in C++ and R, Journal of Statistical Software 77 (1) (2017) 1–17.