pith. machine review for the scientific record.

arxiv: 2605.12780 · v1 · submitted 2026-05-12 · 📊 stat.ME · cs.LG · stat.ML

Recognition: unknown

When to Trust Confidence Thresholding: Calibration Diagnostics for Pseudo-Labelled Regression

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 19:37 UTC · model grok-4.3

classification 📊 stat.ME · cs.LG · stat.ML
keywords pseudo-labeling · confidence thresholding · calibration diagnostics · attenuation bias · regression estimation · semi-supervised methods · variance-based diagnostics

The pith

The attenuation bias from thresholding confidence scores can be predicted exactly from the residual score variance on unlabelled data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that confidence thresholding on calibrated classifier scores introduces an attenuation bias in subsequent regression estimates. This bias has a closed-form expression that depends only on the expected conditional variance of the scores given the regression controls. Practitioners can compute this quantity directly from classifier outputs before performing any downstream inference, allowing them to decide whether thresholding is safe or to apply corrections. The derivation assumes a structural separation between classifier features and regression controls, ensuring the variance term is identifiable.

Core claim

Building on a recent identification result, we derive a closed-form expression for the attenuation bias that confidence thresholding induces in the downstream regression coefficient. The bias can be predicted from the residual score variance V^*=E[Var(p|X)] on the unlabelled set after partialling out the downstream controls X. We also obtain a sharp sensitivity bound under bounded calibration drift and identify the boundary V^*=0, which holds if and only if p is a deterministic function of X.
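
One way to see the boundary claim, via the law of total variance (a standard identity supplied here for clarity, not taken from the paper):

    $\operatorname{Var}(p) \;=\; \mathbb{E}[\operatorname{Var}(p \mid X)] + \operatorname{Var}(\mathbb{E}[p \mid X]) \;=\; V^{*} + \operatorname{Var}(\mathbb{E}[p \mid X]),$

so V^* = 0 forces Var(p | X) = 0 almost surely, i.e. p = E[p | X]: the score is a deterministic function of X exactly at the boundary.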

What carries the argument

The residual score variance V^* = E[Var(p | X)], which serves as a pre-inference diagnostic for the size of attenuation bias induced by thresholding.
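
As a concreteness aid, a minimal sketch of how V^* could be estimated from classifier outputs. The paper does not prescribe an estimator, so the out-of-bag random-forest regression below is an assumption, not the authors' procedure; it uses the identity V^* = E[(p - E[p|X])^2], i.e. the mean squared residual of the score after partialling out X.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def estimate_v_star(p, X, n_estimators=500, random_state=0):
        """Estimate V* = E[Var(p | X)] = E[(p - E[p | X])^2] on the unlabelled set.

        Illustrative choice: a random forest for the conditional mean E[p | X],
        scored out-of-bag so the residuals are not shrunk by overfitting.
        """
        p = np.asarray(p, dtype=float)
        forest = RandomForestRegressor(
            n_estimators=n_estimators, oob_score=True, random_state=random_state
        ).fit(X, p)
        residuals = p - forest.oob_prediction_  # score minus its partialled-out part
        return float(np.mean(residuals ** 2))

Any consistent regression of p on X would do in place of the forest; what the paper's claim pins down is only that the partialling-out uses the downstream controls X, not the full classifier features W.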

If this is right

  • The bias vanishes exactly when V^* = 0, i.e., when the score is fully determined by X.
  • A (V^*, κ) decision rule tells practitioners when thresholding is safe (sketched after this list).
  • The sensitivity bound quantifies how much calibration drift can be tolerated before the bias becomes large.
  • Simulations and the UCI Adult example confirm that the formula predicts observed bias accurately.
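
A hedged rendering of how such a rule might be wired up in practice. The paper's closed-form bias expression is not reproduced on this page, so the predictor below is a caller-supplied function; the interface and the tolerance default are illustrative assumptions, not the authors' specification.

    from typing import Callable

    def thresholding_is_safe(
        v_star: float,
        kappa: float,
        predicted_bias: Callable[[float, float], float],
        tolerance: float = 0.01,
    ) -> bool:
        """Decide whether confidence thresholding at cutoff kappa is safe.

        predicted_bias(v_star, kappa) stands in for the paper's closed-form
        attenuation expression; the rule then reduces to checking the
        predicted bias against a user-chosen tolerance.
        """
        if v_star == 0.0:
            # Boundary case: p is a deterministic function of X, so no attenuation.
            return True
        return abs(predicted_bias(v_star, kappa)) <= tolerance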

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar diagnostics could be developed for other pseudo-labeling strategies beyond simple thresholding.
  • The framework suggests designing classifiers with explicit feature separation from downstream controls.
  • Testing the diagnostic on streaming or online learning settings would reveal its robustness to distribution shift.

Load-bearing premise

The underlying moment equation is identified exactly and calibration drift remains bounded.

What would settle it

Run a controlled simulation with known true labels, apply thresholding at various cutoffs, and check if the difference between thresholded and oracle regression coefficients matches the predicted value from V^*.
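
A minimal sketch of that experiment, under assumptions that stand in for the paper's unspecified designs: a logistic score built from classifier features W = (X, Z) strictly containing the control X, a linear outcome, and hard labels 1{p > kappa} used in place of the latent group. Because the closed form is not reproduced here, the sketch prints the empirical attenuation alongside V^* so the two can be compared against the paper's formula.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    beta, gamma = 1.0, 0.5  # true downstream coefficients

    # Assumed data-generating process: control X, extra classifier feature Z
    # (so W = (X, Z) strictly contains X), latent group G with calibrated
    # score p = P(G = 1 | W), and a linear outcome Y.
    X = rng.normal(size=n)
    Z = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(-(0.8 * X + 1.2 * Z)))
    G = rng.binomial(1, p)
    Y = beta * G + gamma * X + rng.normal(scale=0.5, size=n)

    def slope_on_label(label):
        """Coefficient on `label` in an OLS of Y on (1, label, X)."""
        D = np.column_stack([np.ones(n), label, X])
        return np.linalg.lstsq(D, Y, rcond=None)[0][1]

    oracle = slope_on_label(G)
    for kappa in (0.5, 0.7, 0.9):
        ratio = slope_on_label((p > kappa).astype(float)) / oracle
        print(f"kappa = {kappa}: attenuation ratio = {ratio:.3f}")

    # V* = E[Var(p | X)], computable here by Monte Carlo because the DGP is known.
    z_draws = rng.normal(size=(1, 2_000))
    p_given_x = 1.0 / (1.0 + np.exp(-(0.8 * X[:2_000, None] + 1.2 * z_draws)))
    print(f"V* = {p_given_x.var(axis=1).mean():.3f}")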

Figures

Figures reproduced from arXiv: 2605.12780 by Marcell T. Kurbucz.

Figure 1. Empirical attenuation ratio (filled circles, with 95% Monte Carlo confidence …
Figure 2. Bias-variance decomposition of the confidence-thresholded estimator on log scale, …
Figure 3. Empirical bias of the confidence-thresholded estimator at six thresholds, plotted …
Figure 4. Empirical bias under three calibration drift shapes, against …
Figure 5. V^* on the observed score plotted against labelled-set size, with classifier features W equal to the downstream controls X. The deterministic logistic + isotonic pipeline shrinks V^* as the labelled-set size grows, because the classifier becomes a tighter function of X; posterior-predictive sampling preserves V^* at an order of magnitude above. The structural alternative used in Section 5.7, taking W strictly larger than X, …
Figure 6. UCI Adult: MSE against the full-sample target on a log scale, as a function of …
read the original abstract

Calibrated probability outputs of trained classifiers are increasingly used as inputs to downstream regression estimands such as effects, prevalences, or disparities for a latent group observed only on a small labelled subset. A standard practice is to threshold the calibrated score at a confidence cutoff and treat the hard label as the truth. Building on a recent identification result for the underlying moment equation, we develop a calibration-aware diagnostic apparatus for pseudo-labelling pipelines. We derive a closed-form expression for the attenuation bias that confidence thresholding induces in the downstream regression coefficient, and show that the bias can be predicted, before any inference is run, from the residual score variance $V^{*}=\mathbb{E}[\operatorname{Var}(p\mid X)]$ on the unlabelled set after partialling out the downstream controls $X$. We further obtain a sharp sensitivity bound under bounded calibration drift, and identify the boundary $V^{*}=0$, which holds iff $p$ is a deterministic function of $X$; this motivates a structural separation between classifier features $W$ and downstream controls $X\subsetneq W$. Five controlled simulations and a UCI Adult illustration trace the predictions. The contribution is operational: a $(V^{*}, \kappa)$ decision rule that practitioners can compute from any classifier output to decide whether confidence thresholding is safe.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to derive a closed-form expression for the attenuation bias that confidence thresholding induces in a downstream regression coefficient when using calibrated classifier scores as pseudo-labels. The bias is shown to be predictable before inference from the observable residual score variance V^*=E[Var(p|X)] on the unlabelled set after partialling out controls X. Building on an external identification result for the moment equation, the authors supply a sharp sensitivity bound under bounded calibration drift, identify the boundary V^*=0 (which holds iff p is deterministic in X), and motivate a structural separation X ⊂ W. The contribution is operationalized as a (V^*, κ) decision rule, supported by five controlled simulations and a UCI Adult illustration.

Significance. If the derivation is exact, the result supplies a practical, pre-inference diagnostic that lets practitioners decide whether thresholding is safe using only classifier outputs and observable quantities. The closed-form prediction, the explicit sensitivity bound, and the structural separation between classifier features W and downstream controls X are clear strengths. The simulations and real-data example provide direct traceability of the predicted bias. This could affect practice in semi-supervised regression and pseudo-labelling pipelines in statistics and machine learning.

major comments (1)
  1. [Derivation of closed-form bias expression] The closed-form bias expression (abstract and derivation section) treats the thresholding operator 1{p>κ} as preserving the moment identification exactly once V^* is conditioned on X. This requires that any selection-induced covariance between the thresholded pseudo-label and the downstream regression error (conditional on X) is fully absorbed into the residual variance term. The supplied sensitivity bound addresses only calibration drift; an explicit bound or verification for the additional covariance term is needed, as its presence would make the closed-form understate the bias even when V^* is observed perfectly.
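
In symbols (our notation, assuming a downstream model Y = βG + γ'X + ε with E[ε | X] = 0), the term the referee wants bounded or verified is

    $\operatorname{Cov}\bigl(\mathbf{1}\{p>\kappa\},\,\varepsilon \mid X\bigr) \;=\; \mathbb{E}\bigl[\mathbf{1}\{p>\kappa\}\,\varepsilon \mid X\bigr],$

which, if nonzero, would enter the bias of the coefficient on the hard pseudo-label on top of the attenuation predicted from V^*.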
minor comments (1)
  1. [Abstract] The abstract states that five simulations 'trace the predictions' but does not report the specific ranges of V^* and κ examined or the design of the data-generating processes; adding a short table or sentence would improve reproducibility assessment.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the potential practical utility of the (V^*, κ) rule. We address the single major comment below and will strengthen the derivation section accordingly.

read point-by-point responses
  1. Referee: The closed-form bias expression (abstract and derivation section) treats the thresholding operator 1{p>κ} as preserving the moment identification exactly once V^* is conditioned on X. This requires that any selection-induced covariance between the thresholded pseudo-label and the downstream regression error (conditional on X) is fully absorbed into the residual variance term. The supplied sensitivity bound addresses only calibration drift; an explicit bound or verification for the additional covariance term is needed, as its presence would make the closed-form understate the bias even when V^* is observed perfectly.

    Authors: We appreciate the referee drawing attention to this covariance term. Under the structural separation X ⊂ W that we introduce, the downstream regression error is conditionally independent of the classifier features W (and hence of p and the thresholded pseudo-label) given X. The identification result we build upon therefore implies that E[thresholded pseudo-label × regression error | X] = 0, so the covariance vanishes and is absorbed into the residual variance V^* without further bias. The calibration-drift sensitivity bound is stated separately because drift can induce finite-sample dependence even when the population conditional independence holds. To make the argument fully explicit, we will insert a short lemma in the derivation section proving the covariance term is zero under our maintained assumptions and will note how the existing sensitivity bound extends if conditional independence is relaxed by a small amount. This clarification does not change the closed-form result or the (V^*, κ) rule. revision: yes
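
The promised lemma, sketched in the same assumed notation: the structural separation gives ε ⊥ W | X, and p = p(W), so the hard label 1{p > κ} is a function of W and inherits the conditional independence; hence

    $\mathbb{E}\bigl[\mathbf{1}\{p>\kappa\}\,\varepsilon \mid X\bigr] \;=\; \mathbb{E}\bigl[\mathbf{1}\{p>\kappa\} \mid X\bigr]\,\mathbb{E}[\varepsilon \mid X] \;=\; 0.$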

Circularity Check

0 steps flagged

No circularity: derivation relies on external identification result and direct definition of V^*

full rationale

The paper states it builds on a recent external identification result for the underlying moment equation rather than deriving that condition from its own quantities. V^* is introduced as the observable E[Var(p|X)] computed directly from classifier outputs after partialling out controls X; the closed-form attenuation bias is expressed as a function of this quantity without fitting V^* to the target coefficient or renaming a fitted input as a prediction. No self-citation is load-bearing for the central claim, no ansatz is smuggled, and the structural separation X ⊂ W is maintained as an explicit assumption. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The derivation rests on an external identification result for the moment equation and on the assumption of bounded calibration drift; V^* is treated as directly observable from data rather than estimated as a free parameter.

axioms (2)
  • domain assumption Recent identification result for the underlying moment equation holds
    Paper states it builds directly on this result to obtain the closed-form bias expression.
  • domain assumption Calibration drift is bounded
    Used to obtain the sharp sensitivity bound on the bias.

pith-pipeline@v0.9.0 · 5529 in / 1446 out tokens · 54121 ms · 2026-05-14T19:37:52.002207+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 1 internal anchor

  1. [1] N. Kallus, X. Mao, A. Zhou, Assessing algorithmic fairness with unobserved protected class using data combination, Management Science 68 (3) (2022) 1959–1981.
  2. [2] D.-H. Lee, Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks, in: Workshop on Challenges in Representation Learning, ICML, Vol. 3, Atlanta, 2013, p. 896.
  3. [3] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, C. Raffel, FixMatch: Simplifying semi-supervised learning with consistency and confidence, in: Advances in Neural Information Processing Systems (NeurIPS), 2020.
  4. [4] B. Zhang, Y. Wang, W. Hou, H. Wu, J. Wang, M. Okumura, T. Shinozaki, FlexMatch: Boosting semi-supervised learning with curriculum pseudo labeling, Advances in Neural Information Processing Systems 34 (2021) 18408–18419.
  5. [5] Y. Wang, H. Chen, Q. Heng, W. Hou, Y. Fan, Z. Wu, J. Wang, M. Savvides, T. Shinozaki, B. Raj, B. Schiele, X. Xie, FreeMatch: Self-adaptive thresholding for semi-supervised learning, International Conference on Learning Representations (ICLR) (2023).
  6. [6] A. Tarvainen, H. Valpola, Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, in: Advances in Neural Information Processing Systems (NeurIPS), 2017.
  7. [7] E. Arazo, D. Ortego, P. Albert, N. E. O'Connor, K. McGuinness, Pseudo-labeling and confirmation bias in deep semi-supervised learning, in: 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1–8.
  8. [8] M. T. Kurbucz, Identification of latent group effects under conditional calibration (2026).
  9. [9] R. Kohavi, Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid, in: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1996, pp. 202–207.
  10. [10] J. Li, C. Xiong, S. C. Hoi, CoMatch: Semi-supervised learning with contrastive graph regularization, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9475–9484.
  11. [11] M. N. Rizve, K. Duarte, Y. S. Rawat, M. Shah, In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning, in: International Conference on Learning Representations (ICLR), 2021.
  12. [12] X. Wang, Z. Wu, L. Lian, S. X. Yu, Debiased learning from naturally imbalanced pseudo-labels, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2022).
  13. [13] J. E. Van Engelen, H. H. Hoos, A survey on semi-supervised learning, Machine Learning 109 (2) (2020) 373–440.
  14. [14] J. C. Platt, Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, in: Advances in Large Margin Classifiers, 1999.
  15. [15] B. Zadrozny, C. Elkan, Transforming classifier scores into accurate multiclass probability estimates, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 694–699.
  16. [16] A. Niculescu-Mizil, R. Caruana, Predicting good probabilities with supervised learning, in: Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 625–632.
  17. [17] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibration of modern neural networks, in: Proceedings of the 34th International Conference on Machine Learning, PMLR, 2017, pp. 1321–1330.
  18. [18] M. Minderer, J. Djolonga, R. Romijnders, F. Hubis, X. Zhai, N. Houlsby, D. Tran, M. Lucic, Revisiting the calibration of modern neural networks, Advances in Neural Information Processing Systems 34 (2021) 15682–15694.
  19. [19] Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. Dillon, B. Lakshminarayanan, J. Snoek, Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift, Advances in Neural Information Processing Systems (NeurIPS) 32 (2019).
  20. [20] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531 (2015).
  21. [21] R. Müller, S. Kornblith, G. E. Hinton, When does label smoothing help?, Advances in Neural Information Processing Systems 32 (2019).
  22. [22] A. Lewbel, Estimation of average treatment effects with misclassification, Econometrica 75 (2) (2007) 537–551.
  23. [23] A. Mahajan, Identification and estimation of regression models with misclassification, Econometrica 74 (3) (2006) 631–665.
  24. [24] P. M. Robinson, Root-N-consistent semiparametric regression, Econometrica 56 (4) (1988) 931–954.
  25. [25] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, J. Robins, Double/debiased machine learning for treatment and structural parameters, The Econometrics Journal 21 (1) (2018) C1–C68.
  26. [26] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, Advances in Neural Information Processing Systems 30 (2017).
  27. [27] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 1050–1059.
  28. [28] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
  29. [29] M. N. Wright, A. Ziegler, ranger: A fast implementation of random forests for high dimensional data in C++ and R, Journal of Statistical Software 77 (1) (2017) 1–17.