Doing well with less! On Sampling Techniques for Empirical Pairwise Loss Estimation/Minimization

Charlotte Laclau; Louise Davy; Stephan Cl\'emen\c{c}on

arxiv: 2606.02345 · v1 · pith:SBAICFU4new · submitted 2026-06-01 · 📊 stat.ML · cs.LG

Doing well with less! On Sampling Techniques for Empirical Pairwise Loss Estimation/Minimization

Louise Davy , Stephan Cl\'emen\c{c}on , Charlotte Laclau This is my paper

Pith reviewed 2026-06-28 12:32 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords pairwise losssurvey samplingempirical estimationsimilarity learningembeddingsmachine learningcomputational efficiencyranking

0 comments

The pith

Direct sampling of pairs rather than observations allows near-full pairwise loss performance with far fewer pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that empirical pairwise loss functions in machine learning can be estimated or minimized effectively by keeping only a small fraction of pairs, provided the sampling targets pairs directly and uses auxiliary information to favor informative ones. This holds for tasks such as similarity learning, ranking, and clustering where full pairwise computation scales quadratically and becomes infeasible. A sympathetic reader would care because the approach supplies a theoretically grounded way to trade accuracy for speed without simply dropping to random subsampling of points. The central demonstration combines sampling theory with experiments on high-dimensional embeddings typical in vision and graph settings.

Core claim

A frugal approach that retains only a fraction of the available information on pairs can achieve estimation or optimization performance comparable to that obtained by using all pairs, by leveraging survey sampling techniques. Such sampling plans must target pairs directly rather than individual observations. Assigning higher inclusion probabilities to informative pairs using suitable auxiliary information yields performance close to full pairwise evaluation for losses between high-dimensional vectors such as embeddings.

What carries the argument

Survey sampling plans that assign inclusion probabilities directly to pairs using auxiliary information to prioritize informative pairs.

If this is right

Performance remains close to full pairwise evaluation when pairs are sampled directly with auxiliary guidance.
A principled accuracy-computation trade-off becomes available for pairwise losses at scale.
The same sampling logic applies to embeddings in vision and graph learning tasks.
Sampling individual observations is provably less effective than direct pair sampling for these losses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be tested by varying the quality of auxiliary information to measure how much guidance is actually required.
Integration with online or streaming settings might allow the sampling probabilities to adapt as new data arrives.
Memory savings from storing only sampled pairs could extend the approach to datasets too large to hold in full.
Similar pair-targeting ideas might apply to higher-order losses involving triples or larger tuples.

Load-bearing premise

Suitable auxiliary information exists that can guide pair sampling without introducing unacceptable bias or extra computational overhead.

What would settle it

An experiment on high-dimensional embedding losses in which uniform random sampling of pairs performs as well as auxiliary-guided pair sampling would falsify the need for targeted direct-pair plans.

Figures

Figures reproduced from arXiv: 2606.02345 by Charlotte Laclau, Louise Davy, Stephan Cl\'emen\c{c}on.

**Figure 2.** Figure 2: Results on Cora (a, b, c) and LFW (d, e, f). Results are averaged over 5 seeds, shaded = [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: MAE and variance comparison on Cora (a, b, c, d) and MovieLens (e, f, g, h) between Ob [PITH_FULL_IMAGE:figures/full_fig_p025_3.png] view at source ↗

**Figure 4.** Figure 4: MAE and variance comparison on LFW with the top-1 correlated attribute (a, b, c, d), with [PITH_FULL_IMAGE:figures/full_fig_p025_4.png] view at source ↗

**Figure 5.** Figure 5: Loss curves on Cora and LFW (average over 5 seeds, shaded = [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗

**Figure 6.** Figure 6: Example of directly using the Siamese loss at the beginning of training as auxiliary infor [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗

read the original abstract

Many machine learning problems, including similarity learning, ranking, and clustering, rely on empirical pairwise loss functions whose quadratic computational cost quickly becomes prohibitive at scale. We demonstrate how a frugal approach that retains only a fraction of the available information on pairs can achieve estimation or optimization performance comparable to that obtained by using all pairs, by leveraging survey sampling techniques. A central finding, supported by both theory and experiments, is that such sampling plans must target pairs directly rather than individual observations. In particular, for pairwise losses between high-dimensional vectors such as embeddings in vision or graph learning, assigning higher inclusion probabilities to informative pairs using suitable auxiliary information yields performance close to full pairwise evaluation, providing a principled and theoretically grounded trade-off between accuracy and computational cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper argues that sampling pairs directly with auxiliary information for inclusion probabilities can match full pairwise loss performance at lower cost in embedding settings, but this hinges on the auxiliary being both cheap and informative.

read the letter

The core move here is to treat pairwise loss estimation as a survey sampling problem and insist on sampling pairs rather than observations. That distinction, plus the use of auxiliary variables to tilt inclusion probabilities toward informative pairs in high-dimensional embeddings, is the main claim. The abstract says theory and experiments back it up for ranking, similarity, and clustering tasks.

What stands out is the grounding in established survey sampling results like Horvitz-Thompson estimation. If the derivations hold and the experiments control for the usual baselines (uniform subsampling, random pair selection), this could give a cleaner way to trade accuracy for speed than ad-hoc heuristics. The experiments apparently show performance close to the full set while keeping only a fraction of pairs, which is the practical payoff.

The soft spot is exactly the one the stress-test flags: the method needs auxiliary information that correlates with pair usefulness yet costs far less than computing the loss itself. For vision or graph embeddings, if that auxiliary still requires a pass over the high-dimensional data or edges, the claimed savings shrink. The abstract does not spell out how the auxiliary is obtained or how bias is controlled when probabilities are unequal, so it is hard to judge whether the estimator stays low-bias in practice.

Without the full derivations and experimental details I cannot verify the bounds or the controls, but the framing is coherent and the literature engagement looks honest. This is worth a serious referee who can check the sampling theory application and the auxiliary construction. I would bring it to a reading group for the sampling angle but would not cite it yet.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes applying survey sampling techniques to estimate or minimize empirical pairwise loss functions at scale. It claims that sampling plans must target pairs directly (rather than observations), and that using auxiliary information to assign higher inclusion probabilities to informative pairs yields performance close to full pairwise evaluation, with supporting theory and experiments for high-dimensional embedding tasks in vision and graph learning.

Significance. If the central claims hold, the work supplies a principled, sampling-theory-grounded method for reducing the quadratic cost of pairwise losses while preserving accuracy. Explicit credit is due for grounding the approach in external survey sampling results rather than ad-hoc heuristics and for the empirical demonstration that pair-targeted unequal-probability sampling outperforms observation-level sampling.

major comments (2)

[Abstract and §1] The load-bearing assumption that suitable auxiliary information exists which is both cheap to compute and strongly correlated with pair informativeness is stated in the abstract and §1 but is not accompanied by a concrete cost analysis or bias bound that would confirm the auxiliary itself does not approach the cost of full pairwise evaluation. Without this, the claimed accuracy-cost trade-off cannot be verified.
[Theory section (presumed §3)] The manuscript asserts that the Horvitz-Thompson or similar weighting corrects for the unequal inclusion probabilities induced by the auxiliary; however, no explicit statement appears on whether the auxiliary is required to be independent of the target loss or how dependence would affect unbiasedness of the estimator.

minor comments (2)

Notation for inclusion probabilities and auxiliary variables should be introduced with a single consistent table or list of symbols.
[Experiments] The experimental section would benefit from an explicit statement of the computational cost metric used to compare sampling plans against the full-pair baseline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and will incorporate clarifications and additions in a revised manuscript.

read point-by-point responses

Referee: [Abstract and §1] The load-bearing assumption that suitable auxiliary information exists which is both cheap to compute and strongly correlated with pair informativeness is stated in the abstract and §1 but is not accompanied by a concrete cost analysis or bias bound that would confirm the auxiliary itself does not approach the cost of full pairwise evaluation. Without this, the claimed accuracy-cost trade-off cannot be verified.

Authors: We agree that an explicit cost analysis would strengthen the claims. In the revised manuscript we will add a dedicated paragraph (likely in §1 or a new subsection) noting that typical auxiliaries such as embedding norms or node degrees are computable in linear O(n) time, which is negligible relative to quadratic pairwise evaluation. On bias, the Horvitz-Thompson estimator remains design-unbiased irrespective of auxiliary quality; any performance gap is variance-driven. We will include a short analytic remark linking variance reduction to the correlation between auxiliary and loss, referencing the existing experiments. revision: yes
Referee: [Theory section (presumed §3)] The manuscript asserts that the Horvitz-Thompson or similar weighting corrects for the unequal inclusion probabilities induced by the auxiliary; however, no explicit statement appears on whether the auxiliary is required to be independent of the target loss or how dependence would affect unbiasedness of the estimator.

Authors: The Horvitz-Thompson estimator is unbiased under the sampling design alone, provided inclusion probabilities are known and positive; this holds irrespective of any dependence between the auxiliary (used only to set the design) and the target pairwise loss. Dependence influences variance but not unbiasedness—a standard result in survey sampling. We will insert an explicit clarifying sentence in the theory section of the revision to prevent misinterpretation. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation relies on external survey sampling theory

full rationale

The paper applies established survey sampling methods (e.g., unequal-probability sampling of pairs with Horvitz-Thompson-style estimators) to pairwise loss estimation. The central claim that pair-targeted sampling with auxiliary information outperforms observation-targeted sampling is presented as a consequence of sampling theory rather than a self-referential fit or definition. No load-bearing step reduces by construction to the paper's own inputs, fitted parameters, or self-citations; the auxiliary-information assumption is stated explicitly and does not derive the performance result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the paper relies on standard survey sampling theory being applicable to pairwise loss estimation without introducing new free parameters or invented entities; no explicit fitted values or ad-hoc assumptions are stated.

axioms (1)

domain assumption Standard survey sampling theory (inclusion probabilities, unbiased estimation) applies directly to empirical pairwise loss functions.
The central claim rests on transferring established sampling results to the pairwise setting without additional justification visible in the abstract.

pith-pipeline@v0.9.1-grok · 5660 in / 1212 out tokens · 23291 ms · 2026-06-28T12:32:56.968634+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 14 canonical work pages · 2 internal anchors

[1]

Maillard

R Bardenet and O.A. Maillard. Concentration inequalities for sampling without replacement. Bernoulli, 21(3):1361–1385, 2015

2015
[2]

Bellet, A

A. Bellet, A. Habrard, and M. Sebban.Metric Learning. Synthesis Lectures on Artificial Intelli- gence and Machine Learning. Morgan & Claypool Publishers, 2015. ISBN 9781627053662. URLhttps://books.google.fr/books?id=SvzRBgAAQBAJ

2015
[3]

Bertail, S

P. Bertail, S. Clémençon, Y . Guyonvarch, and N. Noiry. Learning from biased data: A semi- parametric approach. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learn- ing Research, pages 803–812. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr. press/v139/b...

2021
[4]

Boistard, H.P

H. Boistard, H.P. Lopuhaâ, and A. Ruiz-Gazen. Approximation of rejective sampling inclusion probabilities and application to high order correlations.Electron. J. Statist., 6:1967–1983, 2012

1967
[5]

Scandinavian Journal of Statistics , number =

P. Brändén and J. Jonasson. Negative dependence in sampling.Scandinavian Journal of Statis- tics, 39(4):830–838, 2012. doi: https://doi.org/10.1111/j.1467-9469.2011.00766.x. URL https: //onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9469.2011.00766.x

work page doi:10.1111/j.1467-9469.2011.00766.x 2012
[6]

M. T. Chao. A general purpose unequal probability sampling plan.Biometrika, 69(3):653–656,
[7]

URLhttp://www.jstor.org/stable/2336002

ISSN 00063444, 14643510. URLhttp://www.jstor.org/stable/2336002

work page arXiv
[8]

Chauvet and M

G. Chauvet and M. Gerber. Exponential inequalities for sampling designs.Statistics & Probabil- ity Letters, 232:110654, 2026. ISSN 0167-7152. doi: https://doi.org/10.1016/j.spl.2026.110654. URLhttps://www.sciencedirect.com/science/article/pii/S0167715226000180

work page doi:10.1016/j.spl.2026.110654 2026
[9]

T. Chen, S. Kornblith, M. Norouzi, and G.E. Hinton. A simple framework for contrastive learn- ing of visual representations.CoRR, abs/2002.05709, 2020. URL https://arxiv.org/abs/ 2002.05709

work page internal anchor Pith review Pith/arXiv arXiv 2002
[10]

Clémençon, G

S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and empirical risk minimization of U- statistics.The Annals of Statistics, 36(2):844–874, 2008

2008
[11]

Clémençon, P

S. Clémençon, P. Bertail, and G. Papa. Learning from Survey Training Samples: Rate Bounds for Horvitz-Thompson Risk Minimizers. In Robert J. Durrant and Kee-Eung Kim, editors,Pro- ceedings of The 8th Asian Conference on Machine Learning, volume 63 ofProceedings of Machine Learning Research, pages 142–157. PMLR, 2016

2016
[12]

Clémençon, P

S. Clémençon, P. Bertail, and E. Chautru. Sampling and Empirical Risk Minimization.Statis- tics, 51(1):30–42, 2017. doi: 10.1080/02331888.2016.1259810. URL https://doi.org/10. 1080/02331888.2016.1259810

work page doi:10.1080/02331888.2016.1259810 2017
[13]

Clémençon, P

S. Clémençon, P. Bertail, E. Chautru, and G. Papa. Optimal survey schemes for stochastic gradient descent with applications to M-estimation.ESAIM: PS, 23:310–337, 2019. doi: 10.1051/ps/2018021

work page doi:10.1051/ps/2018021 2019
[14]

Clémençon, I

S. Clémençon, I. Colin, and A. Bellet. Scaling-up empirical risk minimization: Optimization of incomplete u-statistics.Journal of Machine Learning Research, 17(76):1–36, 2016. URL http://jmlr.org/papers/v17/15-012.html

2016
[15]

de la Pena and E

V . de la Pena and E. Giné.Decoupling: from Dependence to Independence. Springer, 1999

1999
[16]

Dupacova

J. Dupacova. A note on rejective sampling.Contribution to Statistics (J. Hajek memorial volume) Academia Prague, pages 71–78, 1979

1979
[17]

Fey and J.E

M. Fey and J.E. Lenssen. Fast graph representation learning with PyTorch Geometric. InICLR Workshop on Representation Learning on Graphs and Manifolds, 2019

2019
[18]

2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06) , volume =

R. Hadsell, S. Chopra, and Y . LeCun. Dimensionality reduction by learning an invariant map- ping. In2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1735–1742, 2006. doi: 10.1109/CVPR.2006.100

work page doi:10.1109/cvpr.2006.100 2006
[19]

J. Hajek. Asymptotic theory of rejective sampling with varying probabilities from a finite population.The Annals of Mathematical Statistics, 35(4):1491–1523, 1964

1964
[20]

Harper and J.A

F.M. Harper and J.A. Konstan. The MovieLens datasets: History and context.ACM Transactions on Interactive Intelligent Systems, 2015. 10

2015
[21]

Hartley and J.N.K

H.O. Hartley and J.N.K. Rao. Sampling with unequal probabilities and without replacement. Ann. Math. Statist., 33:350–374, 1962

1962
[22]

Hoeffding

W. Hoeffding. Probability inequalities for sums of bounded random variables.J. Amer. Statist. Assoc., 58:13–30, 1963

1963
[23]

Horvitz and D.J

D.G. Horvitz and D.J. Thompson. A generalization of sampling without replacement from a finite universe.JASA, 47:663–685, 1951

1951
[24]

Houdré and P

C. Houdré and P. Reynaud-Bouret. Exponential inequalities, with constants, for u-statistics of order two. In E. Giné, C. Houdré, and D. Nualart, editors,Stochastic Inequalities and Applications, pages 55–69, Basel, 2003. Birkhäuser Basel. ISBN 978-3-0348-8069-5

2003
[25]

Huang, M

G. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments.Tech. rep., 10 2008

2008
[26]

Joag-Dev and F

K. Joag-Dev and F. Proschan. Negative association of random variables with applications.The Annals of Statistics, pages 286–295, 1983

1983
[27]

Katharopoulos and F

A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deep learning with impor- tance sampling. InInternational Conference on Machine Learning, pages 2525–2534. PMLR, 2018

2018
[28]

Kipf and M

T.N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. InInternational Conference on Learning Representations, 2017

2017
[29]

Knuth et al.The art of computer programming, volume 3

D.E. Knuth et al.The art of computer programming, volume 3. Addison-Wesley Reading, MA, 1973

1973
[30]

Kumar, A.C

N. Kumar, A.C. Berg, P.N. Belhumeur, and S.K. Nayar. Attribute and simile classifiers for face verification. In2009 IEEE 12th International Conference on Computer Vision, pages 365–372,
[31]

doi: 10.1109/ICCV .2009.5459250

work page doi:10.1109/iccv 2009
[32]

Lee.U-statistics: Theory and practice

A.J. Lee.U-statistics: Theory and practice. Marcel Dekker, Inc., New York, 1990

1990
[33]

Musgrave, S

K. Musgrave, S. Belongie, and S.N. Lim. A metric learning reality check. InComputer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV, page 681–699, Berlin, Heidelberg, 2020. Springer-Verlag. ISBN 978-3-030-58594-5. doi: 10.1007/978-3-030-58595-2_41. URL https://doi.org/10.1007/978-3-030-58595-2_ 41

work page doi:10.1007/978-3-030-58595-2_41 2020
[34]

E. Ohlsson. Sequential poisson sampling.Journal of Official Statistics, 14(2):149, 1998

1998
[35]

J. N. K. Rao and C. F. J. Wu. Resampling inference with complex survey data.Journal of the American Statistical Association, 83(401):231–241, 1988. ISSN 01621459, 1537274X. URL http://www.jstor.org/stable/2288945

work page arXiv 1988
[36]

Robinson, C.Y

J. Robinson, C.Y . Chuang, S. Sra, and S. Jegelka. Contrastive learning with hard negative samples.CoRR, abs/2010.04592, 2020. URLhttps://arxiv.org/abs/2010.04592

work page arXiv 2010
[37]

M. R. Sampford. On sampling without replacement with unequal probabilities of selection. Biometrika, 54:499–513, 1967

1967
[38]

Schroff, D

F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recogni- tion and clustering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015

2015
[39]

Tillé.Sampling algorithms

Y . Tillé.Sampling algorithms. Springer Series in Statistics, 2006

2006
[40]

J.S. Vitter. Random sampling with a reservoir.ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985

1985
[41]

C.Y . Wu, R. Manmatha, A.J. Smola, and P. Krähenbühl. Sampling matters in deep embedding learning, 2018. URLhttps://arxiv.org/abs/1706.07567

work page internal anchor Pith review Pith/arXiv arXiv 2018
[42]

Z. Yang, W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. InInternational Conference on Machine Learning, 2016

2016
[43]

Yates and P.M

F. Yates and P.M. Grundy. Selection without replacement from within strata with probability proportional to size.Journal of the Royal Statistical Society: Series B (Methodological), 15(2): 253–261, 1953. 11

1953
[44]

Zbontar, L

J. Zbontar, L. Jing, I. Misra, Y . LeCun, and S. Deny. Barlow twins: Self-supervised learning via redundancy reduction.CoRR, abs/2103.03230, 2021. URL https://arxiv.org/abs/2103. 03230

work page arXiv 2021
[45]

Zhao and T

P. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. Ininternational Conference on Machine Learning, pages 1–9. PMLR, 2015

2015
[46]

ℓ(Zi, Zj) ¯πi,j(WN) − ℓ(Zk, Zl) ¯πk,l(WN) 2 WN # × ¯πi,j(WN)¯πk,l(WN)−¯π(i,j),(k,l)(WN) .(44) For all(i, j),(k, l)s.t.i < jandk < l, we have E

S. Zhou, Y . Lei, and A. Kabán. Randomized pairwise learning with adaptive sampling: A pac- bayes analysis, 2025. URLhttps://arxiv.org/abs/2504.02957. 12 Organisation of the appendix The appendix is organised as follows. • Appendix A gathers additional information regarding notations and survey schemes. • Appendix B presents the proofs of the paper. • App...

work page arXiv 2025

[1] [1]

Maillard

R Bardenet and O.A. Maillard. Concentration inequalities for sampling without replacement. Bernoulli, 21(3):1361–1385, 2015

2015

[2] [2]

Bellet, A

A. Bellet, A. Habrard, and M. Sebban.Metric Learning. Synthesis Lectures on Artificial Intelli- gence and Machine Learning. Morgan & Claypool Publishers, 2015. ISBN 9781627053662. URLhttps://books.google.fr/books?id=SvzRBgAAQBAJ

2015

[3] [3]

Bertail, S

P. Bertail, S. Clémençon, Y . Guyonvarch, and N. Noiry. Learning from biased data: A semi- parametric approach. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learn- ing Research, pages 803–812. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr. press/v139/b...

2021

[4] [4]

Boistard, H.P

H. Boistard, H.P. Lopuhaâ, and A. Ruiz-Gazen. Approximation of rejective sampling inclusion probabilities and application to high order correlations.Electron. J. Statist., 6:1967–1983, 2012

1967

[5] [5]

Scandinavian Journal of Statistics , number =

P. Brändén and J. Jonasson. Negative dependence in sampling.Scandinavian Journal of Statis- tics, 39(4):830–838, 2012. doi: https://doi.org/10.1111/j.1467-9469.2011.00766.x. URL https: //onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9469.2011.00766.x

work page doi:10.1111/j.1467-9469.2011.00766.x 2012

[6] [6]

M. T. Chao. A general purpose unequal probability sampling plan.Biometrika, 69(3):653–656,

[7] [7]

URLhttp://www.jstor.org/stable/2336002

ISSN 00063444, 14643510. URLhttp://www.jstor.org/stable/2336002

work page arXiv

[8] [8]

Chauvet and M

G. Chauvet and M. Gerber. Exponential inequalities for sampling designs.Statistics & Probabil- ity Letters, 232:110654, 2026. ISSN 0167-7152. doi: https://doi.org/10.1016/j.spl.2026.110654. URLhttps://www.sciencedirect.com/science/article/pii/S0167715226000180

work page doi:10.1016/j.spl.2026.110654 2026

[9] [9]

T. Chen, S. Kornblith, M. Norouzi, and G.E. Hinton. A simple framework for contrastive learn- ing of visual representations.CoRR, abs/2002.05709, 2020. URL https://arxiv.org/abs/ 2002.05709

work page internal anchor Pith review Pith/arXiv arXiv 2002

[10] [10]

Clémençon, G

S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and empirical risk minimization of U- statistics.The Annals of Statistics, 36(2):844–874, 2008

2008

[11] [11]

Clémençon, P

S. Clémençon, P. Bertail, and G. Papa. Learning from Survey Training Samples: Rate Bounds for Horvitz-Thompson Risk Minimizers. In Robert J. Durrant and Kee-Eung Kim, editors,Pro- ceedings of The 8th Asian Conference on Machine Learning, volume 63 ofProceedings of Machine Learning Research, pages 142–157. PMLR, 2016

2016

[12] [12]

Clémençon, P

S. Clémençon, P. Bertail, and E. Chautru. Sampling and Empirical Risk Minimization.Statis- tics, 51(1):30–42, 2017. doi: 10.1080/02331888.2016.1259810. URL https://doi.org/10. 1080/02331888.2016.1259810

work page doi:10.1080/02331888.2016.1259810 2017

[13] [13]

Clémençon, P

S. Clémençon, P. Bertail, E. Chautru, and G. Papa. Optimal survey schemes for stochastic gradient descent with applications to M-estimation.ESAIM: PS, 23:310–337, 2019. doi: 10.1051/ps/2018021

work page doi:10.1051/ps/2018021 2019

[14] [14]

Clémençon, I

S. Clémençon, I. Colin, and A. Bellet. Scaling-up empirical risk minimization: Optimization of incomplete u-statistics.Journal of Machine Learning Research, 17(76):1–36, 2016. URL http://jmlr.org/papers/v17/15-012.html

2016

[15] [15]

de la Pena and E

V . de la Pena and E. Giné.Decoupling: from Dependence to Independence. Springer, 1999

1999

[16] [16]

Dupacova

J. Dupacova. A note on rejective sampling.Contribution to Statistics (J. Hajek memorial volume) Academia Prague, pages 71–78, 1979

1979

[17] [17]

Fey and J.E

M. Fey and J.E. Lenssen. Fast graph representation learning with PyTorch Geometric. InICLR Workshop on Representation Learning on Graphs and Manifolds, 2019

2019

[18] [18]

2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06) , volume =

R. Hadsell, S. Chopra, and Y . LeCun. Dimensionality reduction by learning an invariant map- ping. In2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1735–1742, 2006. doi: 10.1109/CVPR.2006.100

work page doi:10.1109/cvpr.2006.100 2006

[19] [19]

J. Hajek. Asymptotic theory of rejective sampling with varying probabilities from a finite population.The Annals of Mathematical Statistics, 35(4):1491–1523, 1964

1964

[20] [20]

Harper and J.A

F.M. Harper and J.A. Konstan. The MovieLens datasets: History and context.ACM Transactions on Interactive Intelligent Systems, 2015. 10

2015

[21] [21]

Hartley and J.N.K

H.O. Hartley and J.N.K. Rao. Sampling with unequal probabilities and without replacement. Ann. Math. Statist., 33:350–374, 1962

1962

[22] [22]

Hoeffding

W. Hoeffding. Probability inequalities for sums of bounded random variables.J. Amer. Statist. Assoc., 58:13–30, 1963

1963

[23] [23]

Horvitz and D.J

D.G. Horvitz and D.J. Thompson. A generalization of sampling without replacement from a finite universe.JASA, 47:663–685, 1951

1951

[24] [24]

Houdré and P

C. Houdré and P. Reynaud-Bouret. Exponential inequalities, with constants, for u-statistics of order two. In E. Giné, C. Houdré, and D. Nualart, editors,Stochastic Inequalities and Applications, pages 55–69, Basel, 2003. Birkhäuser Basel. ISBN 978-3-0348-8069-5

2003

[25] [25]

Huang, M

G. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments.Tech. rep., 10 2008

2008

[26] [26]

Joag-Dev and F

K. Joag-Dev and F. Proschan. Negative association of random variables with applications.The Annals of Statistics, pages 286–295, 1983

1983

[27] [27]

Katharopoulos and F

A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deep learning with impor- tance sampling. InInternational Conference on Machine Learning, pages 2525–2534. PMLR, 2018

2018

[28] [28]

Kipf and M

T.N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. InInternational Conference on Learning Representations, 2017

2017

[29] [29]

Knuth et al.The art of computer programming, volume 3

D.E. Knuth et al.The art of computer programming, volume 3. Addison-Wesley Reading, MA, 1973

1973

[30] [30]

Kumar, A.C

N. Kumar, A.C. Berg, P.N. Belhumeur, and S.K. Nayar. Attribute and simile classifiers for face verification. In2009 IEEE 12th International Conference on Computer Vision, pages 365–372,

[31] [31]

doi: 10.1109/ICCV .2009.5459250

work page doi:10.1109/iccv 2009

[32] [32]

Lee.U-statistics: Theory and practice

A.J. Lee.U-statistics: Theory and practice. Marcel Dekker, Inc., New York, 1990

1990

[33] [33]

Musgrave, S

K. Musgrave, S. Belongie, and S.N. Lim. A metric learning reality check. InComputer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV, page 681–699, Berlin, Heidelberg, 2020. Springer-Verlag. ISBN 978-3-030-58594-5. doi: 10.1007/978-3-030-58595-2_41. URL https://doi.org/10.1007/978-3-030-58595-2_ 41

work page doi:10.1007/978-3-030-58595-2_41 2020

[34] [34]

E. Ohlsson. Sequential poisson sampling.Journal of Official Statistics, 14(2):149, 1998

1998

[35] [35]

J. N. K. Rao and C. F. J. Wu. Resampling inference with complex survey data.Journal of the American Statistical Association, 83(401):231–241, 1988. ISSN 01621459, 1537274X. URL http://www.jstor.org/stable/2288945

work page arXiv 1988

[36] [36]

Robinson, C.Y

J. Robinson, C.Y . Chuang, S. Sra, and S. Jegelka. Contrastive learning with hard negative samples.CoRR, abs/2010.04592, 2020. URLhttps://arxiv.org/abs/2010.04592

work page arXiv 2010

[37] [37]

M. R. Sampford. On sampling without replacement with unequal probabilities of selection. Biometrika, 54:499–513, 1967

1967

[38] [38]

Schroff, D

F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recogni- tion and clustering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015

2015

[39] [39]

Tillé.Sampling algorithms

Y . Tillé.Sampling algorithms. Springer Series in Statistics, 2006

2006

[40] [40]

J.S. Vitter. Random sampling with a reservoir.ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985

1985

[41] [41]

C.Y . Wu, R. Manmatha, A.J. Smola, and P. Krähenbühl. Sampling matters in deep embedding learning, 2018. URLhttps://arxiv.org/abs/1706.07567

work page internal anchor Pith review Pith/arXiv arXiv 2018

[42] [42]

Z. Yang, W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. InInternational Conference on Machine Learning, 2016

2016

[43] [43]

Yates and P.M

F. Yates and P.M. Grundy. Selection without replacement from within strata with probability proportional to size.Journal of the Royal Statistical Society: Series B (Methodological), 15(2): 253–261, 1953. 11

1953

[44] [44]

Zbontar, L

J. Zbontar, L. Jing, I. Misra, Y . LeCun, and S. Deny. Barlow twins: Self-supervised learning via redundancy reduction.CoRR, abs/2103.03230, 2021. URL https://arxiv.org/abs/2103. 03230

work page arXiv 2021

[45] [45]

Zhao and T

P. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. Ininternational Conference on Machine Learning, pages 1–9. PMLR, 2015

2015

[46] [46]

ℓ(Zi, Zj) ¯πi,j(WN) − ℓ(Zk, Zl) ¯πk,l(WN) 2 WN # × ¯πi,j(WN)¯πk,l(WN)−¯π(i,j),(k,l)(WN) .(44) For all(i, j),(k, l)s.t.i < jandk < l, we have E

S. Zhou, Y . Lei, and A. Kabán. Randomized pairwise learning with adaptive sampling: A pac- bayes analysis, 2025. URLhttps://arxiv.org/abs/2504.02957. 12 Organisation of the appendix The appendix is organised as follows. • Appendix A gathers additional information regarding notations and survey schemes. • Appendix B presents the proofs of the paper. • App...

work page arXiv 2025