Understanding Uncertainty Sampling via Equivalent Loss

Shang Liu; Xiaocheng Li

arxiv: 2307.02719 · v4 · submitted 2023-07-06 · 💻 cs.LG · stat.ML

Understanding Uncertainty Sampling via Equivalent Loss

Shang Liu , Xiaocheng Li This is my paper

Pith reviewed 2026-05-24 07:46 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords uncertainty samplingactive learningequivalent lossbinary classificationsample complexitysurrogate lossconvexityasymptotic superiority

0 comments

The pith

Uncertainty sampling optimizes an equivalent loss derived from the uncertainty measure and the original loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that uncertainty sampling algorithms, which query points where the model is uncertain, are in fact optimizing a constructed equivalent loss that incorporates both the uncertainty measure and the original task loss. This lens confirms when standard uncertainty measures qualify as proper via their surrogate properties and convexity. With convexity preserved, the authors derive sample complexity bounds on the equivalent loss and map them back to guarantees on binary classification error. They further establish that uncertainty sampling outperforms passive learning asymptotically under mild conditions. A reader would care because the result replaces a common heuristic with an explicit optimization target that carries provable rates.

Core claim

An uncertainty sampling algorithm is optimizing against an equivalent loss which depends on the used uncertainty measure and the original loss function. This perspective verifies the properness of existing uncertainty measures from two aspects: surrogate property and loss convexity. When the convexity is preserved, a sample complexity result is given for the equivalent loss and translated into a binary loss guarantee via the surrogate link function. The asymptotic superiority of uncertainty sampling against passive learning is proved under mild conditions.

What carries the argument

The equivalent loss, a loss function constructed from the uncertainty measure and the original loss, against which uncertainty sampling is shown to optimize.

If this is right

Sample complexity bounds hold for the equivalent loss when convexity is preserved.
The bounds translate to guarantees on the original binary loss through the surrogate link function.
Uncertainty sampling is asymptotically superior to passive learning under the stated mild conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The equivalent-loss construction could be used to engineer new uncertainty measures by choosing target loss properties first.
The same unification might be tested directly in pool-based active learning by restricting the equivalence to a finite candidate set.
Parallel equivalent-loss derivations may exist for regression and multi-class settings, following the extensions outlined in the paper.

Load-bearing premise

Convexity must be preserved for the equivalent loss to obtain the sample complexity result, and the mild conditions must hold for the asymptotic superiority proof.

What would settle it

Run uncertainty sampling and passive learning side-by-side on a binary classification task where the constructed equivalent loss is convex, and measure whether the active method reaches a target error rate with asymptotically fewer labeled samples.

Figures

Figures reproduced from arXiv: 2307.02719 by Shang Liu, Xiaocheng Li.

**Figure 2.** Figure 2: Probabilistic model with least confidence uncerta [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Margin-based model (Raj and Bach, 2022) with µ = 0.5. • Example 1 – entropy uncertainty (see [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Threshold-based model (Tifrea et al., 2022) with γ = 1.5. • Example 3 – threshold-based uncertainty (see [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

**Figure 5.** Figure 5: Margin loss with the margin-based uncertainty whe [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Exponential loss and exponential uncertainty wit [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: The decision boundaries of the original loss, the e [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Decision boundaries of Example 2 squared margin lo [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Decision boundaries of Example 2 logistic loss and [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: Decision boundaries of Example 4 margin loss and m [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: Decision boundaries of Example 5 exponential los [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: Accuracy v.s. step number curves for the Dry Bean d [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗

**Figure 13.** Figure 13: Accuracy v.s. step number curves for the Covertyp [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗

**Figure 14.** Figure 14: Accuracy v.s. step number curves for the Waveform [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗

**Figure 15.** Figure 15: Mean squared error v.s. step number curves for the [PITH_FULL_IMAGE:figures/full_fig_p024_15.png] view at source ↗

read the original abstract

Uncertainty sampling is a prevalent active learning algorithm that queries sequentially the annotations of data samples which the current prediction model is uncertain about. However, the usage of uncertainty sampling has been largely heuristic: There is no consensus on the proper definition of ``uncertainty'' for a specific task under a specific loss, nor a theoretical guarantee that prescribes a standard protocol to implement the algorithm. In this work, we systematically examine uncertainty sampling algorithms in the binary classification problem via a notion of equivalent loss which depends on the used uncertainty measure and the original loss function, and establish that an uncertainty sampling algorithm is optimizing against such an equivalent loss. The perspective verifies the properness of existing uncertainty measures from two aspects: surrogate property and loss convexity. When the convexity is preserved, we give a sample complexity result for the equivalent loss, and later translate it into a binary loss guarantee via the surrogate link function. We prove the asymptotic superiority of the uncertainty sampling against the passive learning via this approach under mild conditions. We also discuss some potential extensions, including pool-based setting and potential generalization to the multi-class classification as well as the regression problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper frames uncertainty sampling as optimization of a derived equivalent loss, which lets them derive sample complexity and asymptotic superiority under scoped conditions, but the actual derivations need close checking.

read the letter

The main takeaway is that uncertainty sampling in binary classification can be recast as minimizing an equivalent loss built from the chosen uncertainty measure and the original loss. This view lets the authors check surrogate properties and convexity for existing measures, then obtain a sample complexity bound when convexity holds and translate it to a binary guarantee. They also prove asymptotic superiority over passive learning under explicitly mild conditions. That framing is new relative to the cited prior work and gives a clean way to justify why certain uncertainty measures are reasonable. The scoping language around convexity preservation and mild conditions keeps the claims from overreaching, which is a plus. The discussion of pool-based settings and possible extensions to multi-class or regression is noted but not developed in detail. On the soft side, the equivalence construction itself is the load-bearing step; if the derivation of the equivalent loss contains hidden restrictions or if convexity fails for common measures in practice, the sample-complexity and superiority results weaken. The translation through the surrogate link function may also introduce looseness that is not quantified. No obvious circularity or invented entities appear in the stated program. This is a theory paper aimed at active-learning researchers who want a loss-based justification for a common heuristic. The approach is coherent enough on its own terms to merit referee time, even if the proofs turn out to need tightening or if the mild conditions prove restrictive in applications. I would bring it to a reading group for the technical discussion and would not cite it in my own work unless the equivalence result is used directly.

Referee Report

0 major / 4 minor

Summary. The manuscript proposes a theoretical framework for uncertainty sampling in binary classification by defining an 'equivalent loss' based on the uncertainty measure and the original loss function. It establishes that uncertainty sampling optimizes this equivalent loss, verifies its surrogate property and convexity, derives sample complexity results when convexity holds, translates these to guarantees for the binary loss via the surrogate link function, and proves the asymptotic superiority of uncertainty sampling over passive learning under mild conditions. Potential extensions to pool-based active learning, multi-class classification, and regression are also discussed.

Significance. If the central claims hold, this work provides a much-needed theoretical foundation for a widely used but heuristic active learning method. By linking uncertainty sampling to equivalent loss optimization and providing sample complexity and superiority results, it could help standardize the choice of uncertainty measures and clarify when active learning is beneficial. The explicit scoping of conditions (convexity preservation and mild assumptions) is a strength, as is the systematic examination of properness from surrogate and convexity perspectives.

minor comments (4)

The abstract packs multiple technical claims into a single paragraph; breaking it into shorter sentences or adding a dedicated contribution sentence would improve readability for a broad audience.
Notation for the equivalent loss, uncertainty measure, and surrogate link function is introduced without gradual buildup; a short preliminary section or table defining symbols before the main results would aid clarity.
The discussion of extensions (pool-based, multi-class, regression) is brief; even a short paragraph outlining the key obstacles or required modifications for each would strengthen the paper without expanding scope.
Ensure that all cited prior work on surrogate losses and active learning sample complexity is referenced in the introduction to properly situate the contribution.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the supportive summary, significance assessment, and recommendation of minor revision. The report lists no specific major comments under the MAJOR COMMENTS section, so there are no points requiring point-by-point rebuttal or manuscript changes.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper constructs an equivalent loss from the uncertainty measure and original loss function, then establishes that uncertainty sampling optimizes against it. This framing is presented as a perspective for verifying surrogate properties and convexity, followed by independent sample complexity and asymptotic superiority results under explicitly scoped mild conditions. No load-bearing self-citation, fitted-input-as-prediction, or definitional reduction of the central claims to their inputs is visible in the provided abstract or stated program. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Review based on abstract only; full text unavailable so ledger is necessarily incomplete.

axioms (2)

domain assumption convexity is preserved for the equivalent loss
Explicitly required in abstract for the sample complexity result.
domain assumption mild conditions hold for asymptotic superiority
Stated in abstract as prerequisite for the superiority proof.

invented entities (1)

equivalent loss no independent evidence
purpose: captures the optimization target of uncertainty sampling given an uncertainty measure and original loss
Newly introduced construct in the paper; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5716 in / 1183 out tokens · 28712 ms · 2026-05-24T07:46:55.967665+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

[1]

Learning Theory: 20th Annual Conference on Learning Theory, COLT 2007, S an Diego, CA, USA; June 13-15,

Ma rgin based active learning. Learning Theory: 20th Annual Conference on Learning Theory, COLT 2007, S an Diego, CA, USA; June 13-15,

work page 2007
[2]

Bousquet, Olivier, Stéphane Boucheron, Gábor Lugosi

Springer. Bousquet, Olivier, Stéphane Boucheron, Gábor Lugosi. 2003 . Introduction to statistical learning theory. Summer school on machine learning . Springer, 169–207. Cohn, David, Les Atlas, Richard Ladner

work page 2003
[3]

Machine Learning Proceedings 1995

Committee-based samplin g for training probabilistic classiﬁers. Machine Learning Proceedings 1995 . Elsevier, 150–157. Dasgupta, Sanjoy, Adam Tauman Kalai, Claire Monteleoni. 20

work page 1995
[4]

Learning Theory: 18th Annual Conference on Learning Theory, COL T 2005, Bertinoro, Italy, June 27-30,

Analysis of perceptron-based active learning. Learning Theory: 18th Annual Conference on Learning Theory, COL T 2005, Bertinoro, Italy, June 27-30,

work page 2005
[5]

arXiv preprint arXiv:2305.12283

Distribution- free model-agnostic regression calibration via nonparametric methods. arXiv preprint arXiv:2305.12283 . Maurer, Andreas, Massimiliano Pontil

work page arXiv
[6]

Empirical Bernstein Bounds and Sample Variance Penalization

Empirical ber nstein bounds and sample variance penaliza- tion. arXiv preprint arXiv:0907.3740 . Mussmann, Stephen, Percy Liang. 2018a. On the relationship between data eﬃciency and error for uncertainty sampling. International Conference on Machine Learning . PMLR, 3674–3682. Mussmann, Stephen, Percy S Liang. 2018b. Uncertainty sampl ing is preconditione...

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Omnipress, 433–440

july 2, 2011 . Omnipress, 433–440. 28 Raj, Anant, Francis Bach

work page 2011
[8]

arXiv preprint arXiv:2212.00772

Unifo rm versus uncertainty sampling: When being active is less eﬃcient than staying passive. arXiv preprint arXiv:2212.00772 . Wang, Ran, Chi-Yin Chow, Sam Kwong

work page arXiv
[9]

IEEE Transactions on Fuzzy Systems 24(1) 242–248

Ambiguity-based m ulticlass active learning. IEEE Transactions on Fuzzy Systems 24(1) 242–248. 29 A Derivation of equivalent losses and surrogate link functi on This section will present the detailed calculations of the e quivalent losses and the surrogate link functions of all the listed examples in previous sections. The subscri pt of µ or γ will someti...

work page 1995
[10]

log(q) + log(q) 1− q +p·−q log(1− q) + log(1− q) q =p log(q)− (1− p) log(1− q) + (1− p) log(q)− (1− p)· log(q) 1− q − p log(1− q) +p· log(1− q) q = log(q)− log(1− q)− (1− p)· log(q) 1− q +p· log(1− q) q . Then by calculating its indeﬁnite integral, we have ∫ U (q)·∂l ∂q dq =q log(q) + (1− q) log(1− q)− p·Li2(q)− (1− p)·Li2(1− q) +C, where Li2(z) is the Sp...

work page 2005
[11]

For the SVM-based methods, both the loss and the uncertainty function can be expressed as a function of Y· ˆY , where ˆY =θ⊤X

Example 2 (Equivalent loss of ( Raj and Bach , 2022)). For the SVM-based methods, both the loss and the uncertainty function can be expressed as a function of Y· ˆY , where ˆY =θ⊤X. By the similar chain rule arguments in Example 1, we can ﬁnd the equivalent loss with respect to θ as long as we can ﬁnd that with respect to Y· ˆY . To simplify the notations...

work page 2022
[12]

Example 3 (Equivalent loss of Tifrea et al. (2022)). The uncertainty function is probably the simplest case: an indicator function of whether |s|=|Y· ˆY|=|Yθ ⊤X| is no greater than a certain threshold γ. Then for those s’s that satisfy the threshold requirement, the equivalent l oss is identical to the original loss (which is the logistic loss, as a remin...

work page 2022
[13]

(2006) and provide their surrogate link function computation method for the margin-based models su ch as the SVM

A.2 Surrogate property and proof of Proposition 2 In this subsection, we summarize the arguments in Bartlett et al. (2006) and provide their surrogate link function computation method for the margin-based models su ch as the SVM. Such a surrogate property induces a mini-max optimal bound on the excessive 0-1 risk (s ee Theorem 3 in Bartlett et al. (2006))...

work page 2006
[14]

Bartlett et al

(or Fisher consistent ( Lin, 2004)) if H − (p)>H (p) for any p̸= 1 2 . Bartlett et al. (2006) provide a way of computing the surrogate link function ψ : [0, 1]→ R via ˜ψ (z) = H − (1 +z 2 ) − H (1 +z 2 ) , 33 ψ (z) = ˜ψ ∗∗ (z), where g∗∗ is the Fenchel-Legendre biconjugate of the function g, characterized by epig∗∗ = co epi g. Note that those functions ar...

work page 2004
[15]

Remind that the original loss can be expressed as l( ˆY·Y ) =− log ( 1 + ˆY·Y 2 )

Example 1 (Surrogate link function of Dagan and Engelson (1995); Culotta and McCallum (2005)). Remind that the original loss can be expressed as l( ˆY·Y ) =− log ( 1 + ˆY·Y 2 ) . The entropy uncertainty is U =− [q log(q) + (1− q) log(1− q)] =− [ 1 + ˆY 2 log ( 1 + ˆY 2 ) + 1− ˆY 2 log ( 1− ˆY 2 )] =− [ 1 + ˆY·Y 2 log ( 1 + ˆY·Y 2 ) + 1− ˆY·Y 2 log ( 1− ˆY...

work page 1995
[16]

of the same order as

By the convexity of ˜ψ , we have ψ = ˜ψ. From the ﬁrst-order derivative of ψ dψ dz = 1 2 log(1 +z), we know that the surrogate link function ψ is only tending to zero if and only if z itself tends zero. Thus, the equivalent loss is classiﬁcation-calibrated. From the facts that dψ dz ⏐ ⏐ ⏐ ⏐ ⏐ z=0 = 0 36 and d2ψ dz2 ⏐ ⏐ ⏐ ⏐ ⏐ z=0 = 1 2, we know that ψ (z)∼...

work page 2022
[17]

By the deﬁnition, Uµ· ∂l ∂ ˆY = ∂˜lµ ∂ ˆY

log(1 +µ )− 2 µ . By the deﬁnition, Uµ· ∂l ∂ ˆY = ∂˜lµ ∂ ˆY . Since Uµ = (1 + µ|ˆY|)− 1 is a positive and even function, minimizing the expected equ ivalent loss is identical to minimizing the expected original loss (which i s, the squared Hinge loss). By direct calculation (or referring the Example 2 in Bartlett et al. (2006)), the minimizer should be ˆY ∗ = 2p−

work page 2006
[18]

of the same order as

) . Since the equivalent loss is convex, the minimized risk of th e non-Bayes classiﬁer must be H − (p) = Cp(0) = C. Hence we have ˜ψ (z) = 1 +z 2 · ( 2 µ ( 1 µ + 1 ) log(1 +µz ) + 2 µz ) + 1− z 2 · ( 2 µ ( 1 µ− 1 ) log(1 +µz ) + 2 µz ) = 2 µ 2·(1 +µz ) log(1 +µz )− 2 µz. The second-order derivative of ˜(ψ ) is d2 ˜ψ dz2 = 2 1 +µz > 0, 37 which guarantees...

work page 2022
[19]

at the same order as

Therefore, ψ (z) =    1 2 [(1 +z) log(1 +z) + (1− z) log(1− z)], if z≤ z0; 1 2 [(1 +z) log(1 +z0) + (1− z) log(1− z0)], if z≥ z0, wherez0 is some positive constant stated above. By examining the ﬁrs t-order derivative of ψ (z), we can easily ﬁnd out that the equivalent loss is classiﬁcation-ca librated: dψ dz =    1 2 [log(1 +z)− log(1− z)], if z≤ z...

work page 2006
[20]

of the same order as

Due to the boundedness of ψ , we have ψ (z) = Θ(z2), where the big theta notation is “of the same order as”. A.3 Convexity and Proof of Proposition 4 In this subsection, we examine how the convexity requiremen ts are fulﬁlled in the listed examples. Example 1 (Convexity of Dagan and Engelson (1995); Culotta and McCallum (2005)). W.l.o.g. we still assume Y...

work page 1995
[21]

We have shown the convexity with respect to ˆY for both cases

Therefore, the equivalent loss is convex with respect to ˆY . We have shown the convexity with respect to ˆY for both cases. If ˆY is linear with respect to θ, then we can further conclude that the convexity regarding θ holds. But unlike the margin-based classiﬁers, the probabilistic models restrict that ˆY∈ (− 1, 1), where a popular model is the logistic...

work page 2022
[22]

Example 3 (Convexity of Tifrea et al

Hence the model is convex but not strongly convex. Example 3 (Convexity of Tifrea et al. (2022)). The equivalent loss is non-convex for ˆY since it is a truncated logistic loss outside a region, where the truncat ion is to set the loss to be a constant. By the linearity of ˆY on θ, the model is also non-convex for θ. Example 4 (Nonconvexity of margin loss...

work page 2022
[23]

loss as uncertainty

For the remaining generalization term, we summarize an eas y-to- check criterion. To begin with, we brieﬂy review the classical statistical le arning theory. The estimator ˆf we get in any algorithm is dependent on the data points DX n , so we cannot directly get the generalization bound via the concentration inequalities that rely on the i.i.d. cond itio...

work page 2003
[24]

The original cross-entropy loss is not Lipschitz on the range ˆY ∈ (− 1,

Example 1 (Lipschitzness of Dagan and Engelson (1995); Culotta and McCallum (2005)). The original cross-entropy loss is not Lipschitz on the range ˆY ∈ (− 1,

work page 1995
[25]

But from direct computation, for the entropy uncertainty ( Dagan and Engelson , 1995), we have (w.l.o.g

(or equivalently, q ∈ (0, 1)), since the derivative of the negative logarithm will explode near the z ero point. But from direct computation, for the entropy uncertainty ( Dagan and Engelson , 1995), we have (w.l.o.g. assume Y =

work page 1995
[26]

Example 2 (Lipschitzness of Raj and Bach (2022))

45 Its partial derivative is ∂˜l ∂ ˆY =    − 1 2· 1− ˆY 1+ ˆY, if ˆY≥ 0; − 1 2, if ˆY≤ 0, which implies that the equivalent loss is 1 2 -Lipschitz. Example 2 (Lipschitzness of Raj and Bach (2022)). From direct computation, the partial derivative w.r.t. ˆY can be upper-bounded by      ∂˜l ∂ ˆY     ≤ 2 µ. By assuming an almost upper bound MX on...

work page 2022
[27]

B.3 Proof of Theorem 2 Proof

47 By the convexity of R(·), we have R (¯θT +1 ) =R ( 1 T + 1 T +1∑ t=1 θt ) ≤ 1 T + 1 T +1∑ t=1 R(θt), which ﬁnally veriﬁes the proof. B.3 Proof of Theorem 2 Proof. Since ψ is convex by the deﬁnition in Bartlett et al. (2006), we have ψ ( E [ L01(f ¯θT +1)− inf g∈G L01(g) ]) ≤ E [ ψ ( L01(f ¯θT +1 )− inf g∈G L01(g) )] , where the expectation is taken wit...

work page 2006
[28]

diﬀerential

B.4 Proof of Proposition 5 Proof. The algorithm we are considering is under the stream-based s etting (Algorithm 1), where the newly observed sample is directly taken from the unknown dis tributionP. As discussed in Section 3.4, we are directly applying SGD on the expected loss L(θ;X). Following the equivalent loss analyses, one can ﬁnd the equivalent exp...

work page 1982

[1] [1]

Learning Theory: 20th Annual Conference on Learning Theory, COLT 2007, S an Diego, CA, USA; June 13-15,

Ma rgin based active learning. Learning Theory: 20th Annual Conference on Learning Theory, COLT 2007, S an Diego, CA, USA; June 13-15,

work page 2007

[2] [2]

Bousquet, Olivier, Stéphane Boucheron, Gábor Lugosi

Springer. Bousquet, Olivier, Stéphane Boucheron, Gábor Lugosi. 2003 . Introduction to statistical learning theory. Summer school on machine learning . Springer, 169–207. Cohn, David, Les Atlas, Richard Ladner

work page 2003

[3] [3]

Machine Learning Proceedings 1995

Committee-based samplin g for training probabilistic classiﬁers. Machine Learning Proceedings 1995 . Elsevier, 150–157. Dasgupta, Sanjoy, Adam Tauman Kalai, Claire Monteleoni. 20

work page 1995

[4] [4]

Learning Theory: 18th Annual Conference on Learning Theory, COL T 2005, Bertinoro, Italy, June 27-30,

Analysis of perceptron-based active learning. Learning Theory: 18th Annual Conference on Learning Theory, COL T 2005, Bertinoro, Italy, June 27-30,

work page 2005

[5] [5]

arXiv preprint arXiv:2305.12283

Distribution- free model-agnostic regression calibration via nonparametric methods. arXiv preprint arXiv:2305.12283 . Maurer, Andreas, Massimiliano Pontil

work page arXiv

[6] [6]

Empirical Bernstein Bounds and Sample Variance Penalization

Empirical ber nstein bounds and sample variance penaliza- tion. arXiv preprint arXiv:0907.3740 . Mussmann, Stephen, Percy Liang. 2018a. On the relationship between data eﬃciency and error for uncertainty sampling. International Conference on Machine Learning . PMLR, 3674–3682. Mussmann, Stephen, Percy S Liang. 2018b. Uncertainty sampl ing is preconditione...

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Omnipress, 433–440

july 2, 2011 . Omnipress, 433–440. 28 Raj, Anant, Francis Bach

work page 2011

[8] [8]

arXiv preprint arXiv:2212.00772

Unifo rm versus uncertainty sampling: When being active is less eﬃcient than staying passive. arXiv preprint arXiv:2212.00772 . Wang, Ran, Chi-Yin Chow, Sam Kwong

work page arXiv

[9] [9]

IEEE Transactions on Fuzzy Systems 24(1) 242–248

Ambiguity-based m ulticlass active learning. IEEE Transactions on Fuzzy Systems 24(1) 242–248. 29 A Derivation of equivalent losses and surrogate link functi on This section will present the detailed calculations of the e quivalent losses and the surrogate link functions of all the listed examples in previous sections. The subscri pt of µ or γ will someti...

work page 1995

[10] [10]

log(q) + log(q) 1− q +p·−q log(1− q) + log(1− q) q =p log(q)− (1− p) log(1− q) + (1− p) log(q)− (1− p)· log(q) 1− q − p log(1− q) +p· log(1− q) q = log(q)− log(1− q)− (1− p)· log(q) 1− q +p· log(1− q) q . Then by calculating its indeﬁnite integral, we have ∫ U (q)·∂l ∂q dq =q log(q) + (1− q) log(1− q)− p·Li2(q)− (1− p)·Li2(1− q) +C, where Li2(z) is the Sp...

work page 2005

[11] [11]

For the SVM-based methods, both the loss and the uncertainty function can be expressed as a function of Y· ˆY , where ˆY =θ⊤X

Example 2 (Equivalent loss of ( Raj and Bach , 2022)). For the SVM-based methods, both the loss and the uncertainty function can be expressed as a function of Y· ˆY , where ˆY =θ⊤X. By the similar chain rule arguments in Example 1, we can ﬁnd the equivalent loss with respect to θ as long as we can ﬁnd that with respect to Y· ˆY . To simplify the notations...

work page 2022

[12] [12]

Example 3 (Equivalent loss of Tifrea et al. (2022)). The uncertainty function is probably the simplest case: an indicator function of whether |s|=|Y· ˆY|=|Yθ ⊤X| is no greater than a certain threshold γ. Then for those s’s that satisfy the threshold requirement, the equivalent l oss is identical to the original loss (which is the logistic loss, as a remin...

work page 2022

[13] [13]

(2006) and provide their surrogate link function computation method for the margin-based models su ch as the SVM

A.2 Surrogate property and proof of Proposition 2 In this subsection, we summarize the arguments in Bartlett et al. (2006) and provide their surrogate link function computation method for the margin-based models su ch as the SVM. Such a surrogate property induces a mini-max optimal bound on the excessive 0-1 risk (s ee Theorem 3 in Bartlett et al. (2006))...

work page 2006

[14] [14]

Bartlett et al

(or Fisher consistent ( Lin, 2004)) if H − (p)>H (p) for any p̸= 1 2 . Bartlett et al. (2006) provide a way of computing the surrogate link function ψ : [0, 1]→ R via ˜ψ (z) = H − (1 +z 2 ) − H (1 +z 2 ) , 33 ψ (z) = ˜ψ ∗∗ (z), where g∗∗ is the Fenchel-Legendre biconjugate of the function g, characterized by epig∗∗ = co epi g. Note that those functions ar...

work page 2004

[15] [15]

Remind that the original loss can be expressed as l( ˆY·Y ) =− log ( 1 + ˆY·Y 2 )

Example 1 (Surrogate link function of Dagan and Engelson (1995); Culotta and McCallum (2005)). Remind that the original loss can be expressed as l( ˆY·Y ) =− log ( 1 + ˆY·Y 2 ) . The entropy uncertainty is U =− [q log(q) + (1− q) log(1− q)] =− [ 1 + ˆY 2 log ( 1 + ˆY 2 ) + 1− ˆY 2 log ( 1− ˆY 2 )] =− [ 1 + ˆY·Y 2 log ( 1 + ˆY·Y 2 ) + 1− ˆY·Y 2 log ( 1− ˆY...

work page 1995

[16] [16]

of the same order as

By the convexity of ˜ψ , we have ψ = ˜ψ. From the ﬁrst-order derivative of ψ dψ dz = 1 2 log(1 +z), we know that the surrogate link function ψ is only tending to zero if and only if z itself tends zero. Thus, the equivalent loss is classiﬁcation-calibrated. From the facts that dψ dz ⏐ ⏐ ⏐ ⏐ ⏐ z=0 = 0 36 and d2ψ dz2 ⏐ ⏐ ⏐ ⏐ ⏐ z=0 = 1 2, we know that ψ (z)∼...

work page 2022

[17] [17]

By the deﬁnition, Uµ· ∂l ∂ ˆY = ∂˜lµ ∂ ˆY

log(1 +µ )− 2 µ . By the deﬁnition, Uµ· ∂l ∂ ˆY = ∂˜lµ ∂ ˆY . Since Uµ = (1 + µ|ˆY|)− 1 is a positive and even function, minimizing the expected equ ivalent loss is identical to minimizing the expected original loss (which i s, the squared Hinge loss). By direct calculation (or referring the Example 2 in Bartlett et al. (2006)), the minimizer should be ˆY ∗ = 2p−

work page 2006

[18] [18]

of the same order as

) . Since the equivalent loss is convex, the minimized risk of th e non-Bayes classiﬁer must be H − (p) = Cp(0) = C. Hence we have ˜ψ (z) = 1 +z 2 · ( 2 µ ( 1 µ + 1 ) log(1 +µz ) + 2 µz ) + 1− z 2 · ( 2 µ ( 1 µ− 1 ) log(1 +µz ) + 2 µz ) = 2 µ 2·(1 +µz ) log(1 +µz )− 2 µz. The second-order derivative of ˜(ψ ) is d2 ˜ψ dz2 = 2 1 +µz > 0, 37 which guarantees...

work page 2022

[19] [19]

at the same order as

Therefore, ψ (z) =    1 2 [(1 +z) log(1 +z) + (1− z) log(1− z)], if z≤ z0; 1 2 [(1 +z) log(1 +z0) + (1− z) log(1− z0)], if z≥ z0, wherez0 is some positive constant stated above. By examining the ﬁrs t-order derivative of ψ (z), we can easily ﬁnd out that the equivalent loss is classiﬁcation-ca librated: dψ dz =    1 2 [log(1 +z)− log(1− z)], if z≤ z...

work page 2006

[20] [20]

of the same order as

Due to the boundedness of ψ , we have ψ (z) = Θ(z2), where the big theta notation is “of the same order as”. A.3 Convexity and Proof of Proposition 4 In this subsection, we examine how the convexity requiremen ts are fulﬁlled in the listed examples. Example 1 (Convexity of Dagan and Engelson (1995); Culotta and McCallum (2005)). W.l.o.g. we still assume Y...

work page 1995

[21] [21]

We have shown the convexity with respect to ˆY for both cases

Therefore, the equivalent loss is convex with respect to ˆY . We have shown the convexity with respect to ˆY for both cases. If ˆY is linear with respect to θ, then we can further conclude that the convexity regarding θ holds. But unlike the margin-based classiﬁers, the probabilistic models restrict that ˆY∈ (− 1, 1), where a popular model is the logistic...

work page 2022

[22] [22]

Example 3 (Convexity of Tifrea et al

Hence the model is convex but not strongly convex. Example 3 (Convexity of Tifrea et al. (2022)). The equivalent loss is non-convex for ˆY since it is a truncated logistic loss outside a region, where the truncat ion is to set the loss to be a constant. By the linearity of ˆY on θ, the model is also non-convex for θ. Example 4 (Nonconvexity of margin loss...

work page 2022

[23] [23]

loss as uncertainty

For the remaining generalization term, we summarize an eas y-to- check criterion. To begin with, we brieﬂy review the classical statistical le arning theory. The estimator ˆf we get in any algorithm is dependent on the data points DX n , so we cannot directly get the generalization bound via the concentration inequalities that rely on the i.i.d. cond itio...

work page 2003

[24] [24]

The original cross-entropy loss is not Lipschitz on the range ˆY ∈ (− 1,

Example 1 (Lipschitzness of Dagan and Engelson (1995); Culotta and McCallum (2005)). The original cross-entropy loss is not Lipschitz on the range ˆY ∈ (− 1,

work page 1995

[25] [25]

But from direct computation, for the entropy uncertainty ( Dagan and Engelson , 1995), we have (w.l.o.g

(or equivalently, q ∈ (0, 1)), since the derivative of the negative logarithm will explode near the z ero point. But from direct computation, for the entropy uncertainty ( Dagan and Engelson , 1995), we have (w.l.o.g. assume Y =

work page 1995

[26] [26]

Example 2 (Lipschitzness of Raj and Bach (2022))

45 Its partial derivative is ∂˜l ∂ ˆY =    − 1 2· 1− ˆY 1+ ˆY, if ˆY≥ 0; − 1 2, if ˆY≤ 0, which implies that the equivalent loss is 1 2 -Lipschitz. Example 2 (Lipschitzness of Raj and Bach (2022)). From direct computation, the partial derivative w.r.t. ˆY can be upper-bounded by      ∂˜l ∂ ˆY     ≤ 2 µ. By assuming an almost upper bound MX on...

work page 2022

[27] [27]

B.3 Proof of Theorem 2 Proof

47 By the convexity of R(·), we have R (¯θT +1 ) =R ( 1 T + 1 T +1∑ t=1 θt ) ≤ 1 T + 1 T +1∑ t=1 R(θt), which ﬁnally veriﬁes the proof. B.3 Proof of Theorem 2 Proof. Since ψ is convex by the deﬁnition in Bartlett et al. (2006), we have ψ ( E [ L01(f ¯θT +1)− inf g∈G L01(g) ]) ≤ E [ ψ ( L01(f ¯θT +1 )− inf g∈G L01(g) )] , where the expectation is taken wit...

work page 2006

[28] [28]

diﬀerential

B.4 Proof of Proposition 5 Proof. The algorithm we are considering is under the stream-based s etting (Algorithm 1), where the newly observed sample is directly taken from the unknown dis tributionP. As discussed in Section 3.4, we are directly applying SGD on the expected loss L(θ;X). Following the equivalent loss analyses, one can ﬁnd the equivalent exp...

work page 1982