pith. sign in

arxiv: 2307.02719 · v4 · submitted 2023-07-06 · 💻 cs.LG · stat.ML

Understanding Uncertainty Sampling via Equivalent Loss

Pith reviewed 2026-05-24 07:46 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords uncertainty samplingactive learningequivalent lossbinary classificationsample complexitysurrogate lossconvexityasymptotic superiority
0
0 comments X

The pith

Uncertainty sampling optimizes an equivalent loss derived from the uncertainty measure and the original loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that uncertainty sampling algorithms, which query points where the model is uncertain, are in fact optimizing a constructed equivalent loss that incorporates both the uncertainty measure and the original task loss. This lens confirms when standard uncertainty measures qualify as proper via their surrogate properties and convexity. With convexity preserved, the authors derive sample complexity bounds on the equivalent loss and map them back to guarantees on binary classification error. They further establish that uncertainty sampling outperforms passive learning asymptotically under mild conditions. A reader would care because the result replaces a common heuristic with an explicit optimization target that carries provable rates.

Core claim

An uncertainty sampling algorithm is optimizing against an equivalent loss which depends on the used uncertainty measure and the original loss function. This perspective verifies the properness of existing uncertainty measures from two aspects: surrogate property and loss convexity. When the convexity is preserved, a sample complexity result is given for the equivalent loss and translated into a binary loss guarantee via the surrogate link function. The asymptotic superiority of uncertainty sampling against passive learning is proved under mild conditions.

What carries the argument

The equivalent loss, a loss function constructed from the uncertainty measure and the original loss, against which uncertainty sampling is shown to optimize.

If this is right

  • Sample complexity bounds hold for the equivalent loss when convexity is preserved.
  • The bounds translate to guarantees on the original binary loss through the surrogate link function.
  • Uncertainty sampling is asymptotically superior to passive learning under the stated mild conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The equivalent-loss construction could be used to engineer new uncertainty measures by choosing target loss properties first.
  • The same unification might be tested directly in pool-based active learning by restricting the equivalence to a finite candidate set.
  • Parallel equivalent-loss derivations may exist for regression and multi-class settings, following the extensions outlined in the paper.

Load-bearing premise

Convexity must be preserved for the equivalent loss to obtain the sample complexity result, and the mild conditions must hold for the asymptotic superiority proof.

What would settle it

Run uncertainty sampling and passive learning side-by-side on a binary classification task where the constructed equivalent loss is convex, and measure whether the active method reaches a target error rate with asymptotically fewer labeled samples.

Figures

Figures reproduced from arXiv: 2307.02719 by Shang Liu, Xiaocheng Li.

Figure 1
Figure 1. Figure 1: Probabilistic model with entropy uncertainty (Ex [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Probabilistic model with least confidence uncerta [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Margin-based model (Raj and Bach, 2022) with µ = 0.5. • Example 1 – entropy uncertainty (see [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Threshold-based model (Tifrea et al., 2022) with γ = 1.5. • Example 3 – threshold-based uncertainty (see [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Margin loss with the margin-based uncertainty whe [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Exponential loss and exponential uncertainty wit [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The decision boundaries of the original loss, the e [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Decision boundaries of Example 2 squared margin lo [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Decision boundaries of Example 2 logistic loss and [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Decision boundaries of Example 4 margin loss and m [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Decision boundaries of Example 5 exponential los [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Accuracy v.s. step number curves for the Dry Bean d [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Accuracy v.s. step number curves for the Covertyp [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Accuracy v.s. step number curves for the Waveform [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Mean squared error v.s. step number curves for the [PITH_FULL_IMAGE:figures/full_fig_p024_15.png] view at source ↗
read the original abstract

Uncertainty sampling is a prevalent active learning algorithm that queries sequentially the annotations of data samples which the current prediction model is uncertain about. However, the usage of uncertainty sampling has been largely heuristic: There is no consensus on the proper definition of ``uncertainty'' for a specific task under a specific loss, nor a theoretical guarantee that prescribes a standard protocol to implement the algorithm. In this work, we systematically examine uncertainty sampling algorithms in the binary classification problem via a notion of equivalent loss which depends on the used uncertainty measure and the original loss function, and establish that an uncertainty sampling algorithm is optimizing against such an equivalent loss. The perspective verifies the properness of existing uncertainty measures from two aspects: surrogate property and loss convexity. When the convexity is preserved, we give a sample complexity result for the equivalent loss, and later translate it into a binary loss guarantee via the surrogate link function. We prove the asymptotic superiority of the uncertainty sampling against the passive learning via this approach under mild conditions. We also discuss some potential extensions, including pool-based setting and potential generalization to the multi-class classification as well as the regression problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The manuscript proposes a theoretical framework for uncertainty sampling in binary classification by defining an 'equivalent loss' based on the uncertainty measure and the original loss function. It establishes that uncertainty sampling optimizes this equivalent loss, verifies its surrogate property and convexity, derives sample complexity results when convexity holds, translates these to guarantees for the binary loss via the surrogate link function, and proves the asymptotic superiority of uncertainty sampling over passive learning under mild conditions. Potential extensions to pool-based active learning, multi-class classification, and regression are also discussed.

Significance. If the central claims hold, this work provides a much-needed theoretical foundation for a widely used but heuristic active learning method. By linking uncertainty sampling to equivalent loss optimization and providing sample complexity and superiority results, it could help standardize the choice of uncertainty measures and clarify when active learning is beneficial. The explicit scoping of conditions (convexity preservation and mild assumptions) is a strength, as is the systematic examination of properness from surrogate and convexity perspectives.

minor comments (4)
  1. The abstract packs multiple technical claims into a single paragraph; breaking it into shorter sentences or adding a dedicated contribution sentence would improve readability for a broad audience.
  2. Notation for the equivalent loss, uncertainty measure, and surrogate link function is introduced without gradual buildup; a short preliminary section or table defining symbols before the main results would aid clarity.
  3. The discussion of extensions (pool-based, multi-class, regression) is brief; even a short paragraph outlining the key obstacles or required modifications for each would strengthen the paper without expanding scope.
  4. Ensure that all cited prior work on surrogate losses and active learning sample complexity is referenced in the introduction to properly situate the contribution.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the supportive summary, significance assessment, and recommendation of minor revision. The report lists no specific major comments under the MAJOR COMMENTS section, so there are no points requiring point-by-point rebuttal or manuscript changes.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper constructs an equivalent loss from the uncertainty measure and original loss function, then establishes that uncertainty sampling optimizes against it. This framing is presented as a perspective for verifying surrogate properties and convexity, followed by independent sample complexity and asymptotic superiority results under explicitly scoped mild conditions. No load-bearing self-citation, fitted-input-as-prediction, or definitional reduction of the central claims to their inputs is visible in the provided abstract or stated program. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Review based on abstract only; full text unavailable so ledger is necessarily incomplete.

axioms (2)
  • domain assumption convexity is preserved for the equivalent loss
    Explicitly required in abstract for the sample complexity result.
  • domain assumption mild conditions hold for asymptotic superiority
    Stated in abstract as prerequisite for the superiority proof.
invented entities (1)
  • equivalent loss no independent evidence
    purpose: captures the optimization target of uncertainty sampling given an uncertainty measure and original loss
    Newly introduced construct in the paper; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5716 in / 1183 out tokens · 28712 ms · 2026-05-24T07:46:55.967665+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

  1. [1]

    Learning Theory: 20th Annual Conference on Learning Theory, COLT 2007, S an Diego, CA, USA; June 13-15,

    Ma rgin based active learning. Learning Theory: 20th Annual Conference on Learning Theory, COLT 2007, S an Diego, CA, USA; June 13-15,

  2. [2]

    Bousquet, Olivier, Stéphane Boucheron, Gábor Lugosi

    Springer. Bousquet, Olivier, Stéphane Boucheron, Gábor Lugosi. 2003 . Introduction to statistical learning theory. Summer school on machine learning . Springer, 169–207. Cohn, David, Les Atlas, Richard Ladner

  3. [3]

    Machine Learning Proceedings 1995

    Committee-based samplin g for training probabilistic classifiers. Machine Learning Proceedings 1995 . Elsevier, 150–157. Dasgupta, Sanjoy, Adam Tauman Kalai, Claire Monteleoni. 20

  4. [4]

    Learning Theory: 18th Annual Conference on Learning Theory, COL T 2005, Bertinoro, Italy, June 27-30,

    Analysis of perceptron-based active learning. Learning Theory: 18th Annual Conference on Learning Theory, COL T 2005, Bertinoro, Italy, June 27-30,

  5. [5]

    arXiv preprint arXiv:2305.12283

    Distribution- free model-agnostic regression calibration via nonparametric methods. arXiv preprint arXiv:2305.12283 . Maurer, Andreas, Massimiliano Pontil

  6. [6]

    Empirical Bernstein Bounds and Sample Variance Penalization

    Empirical ber nstein bounds and sample variance penaliza- tion. arXiv preprint arXiv:0907.3740 . Mussmann, Stephen, Percy Liang. 2018a. On the relationship between data efficiency and error for uncertainty sampling. International Conference on Machine Learning . PMLR, 3674–3682. Mussmann, Stephen, Percy S Liang. 2018b. Uncertainty sampl ing is preconditione...

  7. [7]

    Omnipress, 433–440

    july 2, 2011 . Omnipress, 433–440. 28 Raj, Anant, Francis Bach

  8. [8]

    arXiv preprint arXiv:2212.00772

    Unifo rm versus uncertainty sampling: When being active is less efficient than staying passive. arXiv preprint arXiv:2212.00772 . Wang, Ran, Chi-Yin Chow, Sam Kwong

  9. [9]

    IEEE Transactions on Fuzzy Systems 24(1) 242–248

    Ambiguity-based m ulticlass active learning. IEEE Transactions on Fuzzy Systems 24(1) 242–248. 29 A Derivation of equivalent losses and surrogate link functi on This section will present the detailed calculations of the e quivalent losses and the surrogate link functions of all the listed examples in previous sections. The subscri pt of µ or γ will someti...

  10. [10]

    log(q) + log(q) 1− q +p·−q log(1− q) + log(1− q) q =p log(q)− (1− p) log(1− q) + (1− p) log(q)− (1− p)· log(q) 1− q − p log(1− q) +p· log(1− q) q = log(q)− log(1− q)− (1− p)· log(q) 1− q +p· log(1− q) q . Then by calculating its indefinite integral, we have ∫ U (q)·∂l ∂q dq =q log(q) + (1− q) log(1− q)− p·Li2(q)− (1− p)·Li2(1− q) +C, where Li2(z) is the Sp...

  11. [11]

    For the SVM-based methods, both the loss and the uncertainty function can be expressed as a function of Y· ˆY , where ˆY =θ⊤X

    Example 2 (Equivalent loss of ( Raj and Bach , 2022)). For the SVM-based methods, both the loss and the uncertainty function can be expressed as a function of Y· ˆY , where ˆY =θ⊤X. By the similar chain rule arguments in Example 1, we can find the equivalent loss with respect to θ as long as we can find that with respect to Y· ˆY . To simplify the notations...

  12. [12]

    Example 3 (Equivalent loss of Tifrea et al. (2022)). The uncertainty function is probably the simplest case: an indicator function of whether |s|=|Y· ˆY|=|Yθ ⊤X| is no greater than a certain threshold γ. Then for those s’s that satisfy the threshold requirement, the equivalent l oss is identical to the original loss (which is the logistic loss, as a remin...

  13. [13]

    (2006) and provide their surrogate link function computation method for the margin-based models su ch as the SVM

    A.2 Surrogate property and proof of Proposition 2 In this subsection, we summarize the arguments in Bartlett et al. (2006) and provide their surrogate link function computation method for the margin-based models su ch as the SVM. Such a surrogate property induces a mini-max optimal bound on the excessive 0-1 risk (s ee Theorem 3 in Bartlett et al. (2006))...

  14. [14]

    Bartlett et al

    (or Fisher consistent ( Lin, 2004)) if H − (p)>H (p) for any p̸= 1 2 . Bartlett et al. (2006) provide a way of computing the surrogate link function ψ : [0, 1]→ R via ˜ψ (z) = H − (1 +z 2 ) − H (1 +z 2 ) , 33 ψ (z) = ˜ψ ∗∗ (z), where g∗∗ is the Fenchel-Legendre biconjugate of the function g, characterized by epig∗∗ = co epi g. Note that those functions ar...

  15. [15]

    Remind that the original loss can be expressed as l( ˆY·Y ) =− log ( 1 + ˆY·Y 2 )

    Example 1 (Surrogate link function of Dagan and Engelson (1995); Culotta and McCallum (2005)). Remind that the original loss can be expressed as l( ˆY·Y ) =− log ( 1 + ˆY·Y 2 ) . The entropy uncertainty is U =− [q log(q) + (1− q) log(1− q)] =− [ 1 + ˆY 2 log ( 1 + ˆY 2 ) + 1− ˆY 2 log ( 1− ˆY 2 )] =− [ 1 + ˆY·Y 2 log ( 1 + ˆY·Y 2 ) + 1− ˆY·Y 2 log ( 1− ˆY...

  16. [16]

    of the same order as

    By the convexity of ˜ψ , we have ψ = ˜ψ. From the first-order derivative of ψ dψ dz = 1 2 log(1 +z), we know that the surrogate link function ψ is only tending to zero if and only if z itself tends zero. Thus, the equivalent loss is classification-calibrated. From the facts that dψ dz ⏐ ⏐ ⏐ ⏐ ⏐ z=0 = 0 36 and d2ψ dz2 ⏐ ⏐ ⏐ ⏐ ⏐ z=0 = 1 2, we know that ψ (z)∼...

  17. [17]

    By the definition, Uµ· ∂l ∂ ˆY = ∂˜lµ ∂ ˆY

    log(1 +µ )− 2 µ . By the definition, Uµ· ∂l ∂ ˆY = ∂˜lµ ∂ ˆY . Since Uµ = (1 + µ|ˆY|)− 1 is a positive and even function, minimizing the expected equ ivalent loss is identical to minimizing the expected original loss (which i s, the squared Hinge loss). By direct calculation (or referring the Example 2 in Bartlett et al. (2006)), the minimizer should be ˆY ∗ = 2p−

  18. [18]

    of the same order as

    ) . Since the equivalent loss is convex, the minimized risk of th e non-Bayes classifier must be H − (p) = Cp(0) = C. Hence we have ˜ψ (z) = 1 +z 2 · ( 2 µ ( 1 µ + 1 ) log(1 +µz ) + 2 µz ) + 1− z 2 · ( 2 µ ( 1 µ− 1 ) log(1 +µz ) + 2 µz ) = 2 µ 2·(1 +µz ) log(1 +µz )− 2 µz. The second-order derivative of ˜(ψ ) is d2 ˜ψ dz2 = 2 1 +µz > 0, 37 which guarantees...

  19. [19]

    at the same order as

    Therefore, ψ (z) =    1 2 [(1 +z) log(1 +z) + (1− z) log(1− z)], if z≤ z0; 1 2 [(1 +z) log(1 +z0) + (1− z) log(1− z0)], if z≥ z0, wherez0 is some positive constant stated above. By examining the firs t-order derivative of ψ (z), we can easily find out that the equivalent loss is classification-ca librated: dψ dz =    1 2 [log(1 +z)− log(1− z)], if z≤ z...

  20. [20]

    of the same order as

    Due to the boundedness of ψ , we have ψ (z) = Θ(z2), where the big theta notation is “of the same order as”. A.3 Convexity and Proof of Proposition 4 In this subsection, we examine how the convexity requiremen ts are fulfilled in the listed examples. Example 1 (Convexity of Dagan and Engelson (1995); Culotta and McCallum (2005)). W.l.o.g. we still assume Y...

  21. [21]

    We have shown the convexity with respect to ˆY for both cases

    Therefore, the equivalent loss is convex with respect to ˆY . We have shown the convexity with respect to ˆY for both cases. If ˆY is linear with respect to θ, then we can further conclude that the convexity regarding θ holds. But unlike the margin-based classifiers, the probabilistic models restrict that ˆY∈ (− 1, 1), where a popular model is the logistic...

  22. [22]

    Example 3 (Convexity of Tifrea et al

    Hence the model is convex but not strongly convex. Example 3 (Convexity of Tifrea et al. (2022)). The equivalent loss is non-convex for ˆY since it is a truncated logistic loss outside a region, where the truncat ion is to set the loss to be a constant. By the linearity of ˆY on θ, the model is also non-convex for θ. Example 4 (Nonconvexity of margin loss...

  23. [23]

    loss as uncertainty

    For the remaining generalization term, we summarize an eas y-to- check criterion. To begin with, we briefly review the classical statistical le arning theory. The estimator ˆf we get in any algorithm is dependent on the data points DX n , so we cannot directly get the generalization bound via the concentration inequalities that rely on the i.i.d. cond itio...

  24. [24]

    The original cross-entropy loss is not Lipschitz on the range ˆY ∈ (− 1,

    Example 1 (Lipschitzness of Dagan and Engelson (1995); Culotta and McCallum (2005)). The original cross-entropy loss is not Lipschitz on the range ˆY ∈ (− 1,

  25. [25]

    But from direct computation, for the entropy uncertainty ( Dagan and Engelson , 1995), we have (w.l.o.g

    (or equivalently, q ∈ (0, 1)), since the derivative of the negative logarithm will explode near the z ero point. But from direct computation, for the entropy uncertainty ( Dagan and Engelson , 1995), we have (w.l.o.g. assume Y =

  26. [26]

    Example 2 (Lipschitzness of Raj and Bach (2022))

    45 Its partial derivative is ∂˜l ∂ ˆY =    − 1 2· 1− ˆY 1+ ˆY, if ˆY≥ 0; − 1 2, if ˆY≤ 0, which implies that the equivalent loss is 1 2 -Lipschitz. Example 2 (Lipschitzness of Raj and Bach (2022)). From direct computation, the partial derivative w.r.t. ˆY can be upper-bounded by      ∂˜l ∂ ˆY     ≤ 2 µ. By assuming an almost upper bound MX on...

  27. [27]

    B.3 Proof of Theorem 2 Proof

    47 By the convexity of R(·), we have R (¯θT +1 ) =R ( 1 T + 1 T +1∑ t=1 θt ) ≤ 1 T + 1 T +1∑ t=1 R(θt), which finally verifies the proof. B.3 Proof of Theorem 2 Proof. Since ψ is convex by the definition in Bartlett et al. (2006), we have ψ ( E [ L01(f ¯θT +1)− inf g∈G L01(g) ]) ≤ E [ ψ ( L01(f ¯θT +1 )− inf g∈G L01(g) )] , where the expectation is taken wit...

  28. [28]

    differential

    B.4 Proof of Proposition 5 Proof. The algorithm we are considering is under the stream-based s etting (Algorithm 1), where the newly observed sample is directly taken from the unknown dis tributionP. As discussed in Section 3.4, we are directly applying SGD on the expected loss L(θ;X). Following the equivalent loss analyses, one can find the equivalent exp...