pith. sign in

arxiv: 2606.02345 · v1 · pith:SBAICFU4new · submitted 2026-06-01 · 📊 stat.ML · cs.LG

Doing well with less! On Sampling Techniques for Empirical Pairwise Loss Estimation/Minimization

Pith reviewed 2026-06-28 12:32 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords pairwise losssurvey samplingempirical estimationsimilarity learningembeddingsmachine learningcomputational efficiencyranking
0
0 comments X

The pith

Direct sampling of pairs rather than observations allows near-full pairwise loss performance with far fewer pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that empirical pairwise loss functions in machine learning can be estimated or minimized effectively by keeping only a small fraction of pairs, provided the sampling targets pairs directly and uses auxiliary information to favor informative ones. This holds for tasks such as similarity learning, ranking, and clustering where full pairwise computation scales quadratically and becomes infeasible. A sympathetic reader would care because the approach supplies a theoretically grounded way to trade accuracy for speed without simply dropping to random subsampling of points. The central demonstration combines sampling theory with experiments on high-dimensional embeddings typical in vision and graph settings.

Core claim

A frugal approach that retains only a fraction of the available information on pairs can achieve estimation or optimization performance comparable to that obtained by using all pairs, by leveraging survey sampling techniques. Such sampling plans must target pairs directly rather than individual observations. Assigning higher inclusion probabilities to informative pairs using suitable auxiliary information yields performance close to full pairwise evaluation for losses between high-dimensional vectors such as embeddings.

What carries the argument

Survey sampling plans that assign inclusion probabilities directly to pairs using auxiliary information to prioritize informative pairs.

If this is right

  • Performance remains close to full pairwise evaluation when pairs are sampled directly with auxiliary guidance.
  • A principled accuracy-computation trade-off becomes available for pairwise losses at scale.
  • The same sampling logic applies to embeddings in vision and graph learning tasks.
  • Sampling individual observations is provably less effective than direct pair sampling for these losses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested by varying the quality of auxiliary information to measure how much guidance is actually required.
  • Integration with online or streaming settings might allow the sampling probabilities to adapt as new data arrives.
  • Memory savings from storing only sampled pairs could extend the approach to datasets too large to hold in full.
  • Similar pair-targeting ideas might apply to higher-order losses involving triples or larger tuples.

Load-bearing premise

Suitable auxiliary information exists that can guide pair sampling without introducing unacceptable bias or extra computational overhead.

What would settle it

An experiment on high-dimensional embedding losses in which uniform random sampling of pairs performs as well as auxiliary-guided pair sampling would falsify the need for targeted direct-pair plans.

Figures

Figures reproduced from arXiv: 2606.02345 by Charlotte Laclau, Louise Davy, Stephan Cl\'emen\c{c}on.

Figure 1
Figure 1. Figure 1: Illustration of the HT-risk estimator efficiency. (a) and (b) Comparison of the variance of [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Results on Cora (a, b, c) and LFW (d, e, f). Results are averaged over 5 seeds, shaded = [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: MAE and variance comparison on Cora (a, b, c, d) and MovieLens (e, f, g, h) between Ob [PITH_FULL_IMAGE:figures/full_fig_p025_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: MAE and variance comparison on LFW with the top-1 correlated attribute (a, b, c, d), with [PITH_FULL_IMAGE:figures/full_fig_p025_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Loss curves on Cora and LFW (average over 5 seeds, shaded = [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Example of directly using the Siamese loss at the beginning of training as auxiliary infor [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗
read the original abstract

Many machine learning problems, including similarity learning, ranking, and clustering, rely on empirical pairwise loss functions whose quadratic computational cost quickly becomes prohibitive at scale. We demonstrate how a frugal approach that retains only a fraction of the available information on pairs can achieve estimation or optimization performance comparable to that obtained by using all pairs, by leveraging survey sampling techniques. A central finding, supported by both theory and experiments, is that such sampling plans must target pairs directly rather than individual observations. In particular, for pairwise losses between high-dimensional vectors such as embeddings in vision or graph learning, assigning higher inclusion probabilities to informative pairs using suitable auxiliary information yields performance close to full pairwise evaluation, providing a principled and theoretically grounded trade-off between accuracy and computational cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes applying survey sampling techniques to estimate or minimize empirical pairwise loss functions at scale. It claims that sampling plans must target pairs directly (rather than observations), and that using auxiliary information to assign higher inclusion probabilities to informative pairs yields performance close to full pairwise evaluation, with supporting theory and experiments for high-dimensional embedding tasks in vision and graph learning.

Significance. If the central claims hold, the work supplies a principled, sampling-theory-grounded method for reducing the quadratic cost of pairwise losses while preserving accuracy. Explicit credit is due for grounding the approach in external survey sampling results rather than ad-hoc heuristics and for the empirical demonstration that pair-targeted unequal-probability sampling outperforms observation-level sampling.

major comments (2)
  1. [Abstract and §1] The load-bearing assumption that suitable auxiliary information exists which is both cheap to compute and strongly correlated with pair informativeness is stated in the abstract and §1 but is not accompanied by a concrete cost analysis or bias bound that would confirm the auxiliary itself does not approach the cost of full pairwise evaluation. Without this, the claimed accuracy-cost trade-off cannot be verified.
  2. [Theory section (presumed §3)] The manuscript asserts that the Horvitz-Thompson or similar weighting corrects for the unequal inclusion probabilities induced by the auxiliary; however, no explicit statement appears on whether the auxiliary is required to be independent of the target loss or how dependence would affect unbiasedness of the estimator.
minor comments (2)
  1. Notation for inclusion probabilities and auxiliary variables should be introduced with a single consistent table or list of symbols.
  2. [Experiments] The experimental section would benefit from an explicit statement of the computational cost metric used to compare sampling plans against the full-pair baseline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and will incorporate clarifications and additions in a revised manuscript.

read point-by-point responses
  1. Referee: [Abstract and §1] The load-bearing assumption that suitable auxiliary information exists which is both cheap to compute and strongly correlated with pair informativeness is stated in the abstract and §1 but is not accompanied by a concrete cost analysis or bias bound that would confirm the auxiliary itself does not approach the cost of full pairwise evaluation. Without this, the claimed accuracy-cost trade-off cannot be verified.

    Authors: We agree that an explicit cost analysis would strengthen the claims. In the revised manuscript we will add a dedicated paragraph (likely in §1 or a new subsection) noting that typical auxiliaries such as embedding norms or node degrees are computable in linear O(n) time, which is negligible relative to quadratic pairwise evaluation. On bias, the Horvitz-Thompson estimator remains design-unbiased irrespective of auxiliary quality; any performance gap is variance-driven. We will include a short analytic remark linking variance reduction to the correlation between auxiliary and loss, referencing the existing experiments. revision: yes

  2. Referee: [Theory section (presumed §3)] The manuscript asserts that the Horvitz-Thompson or similar weighting corrects for the unequal inclusion probabilities induced by the auxiliary; however, no explicit statement appears on whether the auxiliary is required to be independent of the target loss or how dependence would affect unbiasedness of the estimator.

    Authors: The Horvitz-Thompson estimator is unbiased under the sampling design alone, provided inclusion probabilities are known and positive; this holds irrespective of any dependence between the auxiliary (used only to set the design) and the target pairwise loss. Dependence influences variance but not unbiasedness—a standard result in survey sampling. We will insert an explicit clarifying sentence in the theory section of the revision to prevent misinterpretation. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation relies on external survey sampling theory

full rationale

The paper applies established survey sampling methods (e.g., unequal-probability sampling of pairs with Horvitz-Thompson-style estimators) to pairwise loss estimation. The central claim that pair-targeted sampling with auxiliary information outperforms observation-targeted sampling is presented as a consequence of sampling theory rather than a self-referential fit or definition. No load-bearing step reduces by construction to the paper's own inputs, fitted parameters, or self-citations; the auxiliary-information assumption is stated explicitly and does not derive the performance result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the paper relies on standard survey sampling theory being applicable to pairwise loss estimation without introducing new free parameters or invented entities; no explicit fitted values or ad-hoc assumptions are stated.

axioms (1)
  • domain assumption Standard survey sampling theory (inclusion probabilities, unbiased estimation) applies directly to empirical pairwise loss functions.
    The central claim rests on transferring established sampling results to the pairwise setting without additional justification visible in the abstract.

pith-pipeline@v0.9.1-grok · 5660 in / 1212 out tokens · 23291 ms · 2026-06-28T12:32:56.968634+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 14 canonical work pages · 2 internal anchors

  1. [1]

    Maillard

    R Bardenet and O.A. Maillard. Concentration inequalities for sampling without replacement. Bernoulli, 21(3):1361–1385, 2015

  2. [2]

    Bellet, A

    A. Bellet, A. Habrard, and M. Sebban.Metric Learning. Synthesis Lectures on Artificial Intelli- gence and Machine Learning. Morgan & Claypool Publishers, 2015. ISBN 9781627053662. URLhttps://books.google.fr/books?id=SvzRBgAAQBAJ

  3. [3]

    Bertail, S

    P. Bertail, S. Clémençon, Y . Guyonvarch, and N. Noiry. Learning from biased data: A semi- parametric approach. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learn- ing Research, pages 803–812. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr. press/v139/b...

  4. [4]

    Boistard, H.P

    H. Boistard, H.P. Lopuhaâ, and A. Ruiz-Gazen. Approximation of rejective sampling inclusion probabilities and application to high order correlations.Electron. J. Statist., 6:1967–1983, 2012

  5. [5]

    Scandinavian Journal of Statistics , number =

    P. Brändén and J. Jonasson. Negative dependence in sampling.Scandinavian Journal of Statis- tics, 39(4):830–838, 2012. doi: https://doi.org/10.1111/j.1467-9469.2011.00766.x. URL https: //onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9469.2011.00766.x

  6. [6]

    M. T. Chao. A general purpose unequal probability sampling plan.Biometrika, 69(3):653–656,

  7. [7]

    URLhttp://www.jstor.org/stable/2336002

    ISSN 00063444, 14643510. URLhttp://www.jstor.org/stable/2336002

  8. [8]

    Chauvet and M

    G. Chauvet and M. Gerber. Exponential inequalities for sampling designs.Statistics & Probabil- ity Letters, 232:110654, 2026. ISSN 0167-7152. doi: https://doi.org/10.1016/j.spl.2026.110654. URLhttps://www.sciencedirect.com/science/article/pii/S0167715226000180

  9. [9]

    T. Chen, S. Kornblith, M. Norouzi, and G.E. Hinton. A simple framework for contrastive learn- ing of visual representations.CoRR, abs/2002.05709, 2020. URL https://arxiv.org/abs/ 2002.05709

  10. [10]

    Clémençon, G

    S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and empirical risk minimization of U- statistics.The Annals of Statistics, 36(2):844–874, 2008

  11. [11]

    Clémençon, P

    S. Clémençon, P. Bertail, and G. Papa. Learning from Survey Training Samples: Rate Bounds for Horvitz-Thompson Risk Minimizers. In Robert J. Durrant and Kee-Eung Kim, editors,Pro- ceedings of The 8th Asian Conference on Machine Learning, volume 63 ofProceedings of Machine Learning Research, pages 142–157. PMLR, 2016

  12. [12]

    Clémençon, P

    S. Clémençon, P. Bertail, and E. Chautru. Sampling and Empirical Risk Minimization.Statis- tics, 51(1):30–42, 2017. doi: 10.1080/02331888.2016.1259810. URL https://doi.org/10. 1080/02331888.2016.1259810

  13. [13]

    Clémençon, P

    S. Clémençon, P. Bertail, E. Chautru, and G. Papa. Optimal survey schemes for stochastic gradient descent with applications to M-estimation.ESAIM: PS, 23:310–337, 2019. doi: 10.1051/ps/2018021

  14. [14]

    Clémençon, I

    S. Clémençon, I. Colin, and A. Bellet. Scaling-up empirical risk minimization: Optimization of incomplete u-statistics.Journal of Machine Learning Research, 17(76):1–36, 2016. URL http://jmlr.org/papers/v17/15-012.html

  15. [15]

    de la Pena and E

    V . de la Pena and E. Giné.Decoupling: from Dependence to Independence. Springer, 1999

  16. [16]

    Dupacova

    J. Dupacova. A note on rejective sampling.Contribution to Statistics (J. Hajek memorial volume) Academia Prague, pages 71–78, 1979

  17. [17]

    Fey and J.E

    M. Fey and J.E. Lenssen. Fast graph representation learning with PyTorch Geometric. InICLR Workshop on Representation Learning on Graphs and Manifolds, 2019

  18. [18]

    2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06) , volume =

    R. Hadsell, S. Chopra, and Y . LeCun. Dimensionality reduction by learning an invariant map- ping. In2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1735–1742, 2006. doi: 10.1109/CVPR.2006.100

  19. [19]

    J. Hajek. Asymptotic theory of rejective sampling with varying probabilities from a finite population.The Annals of Mathematical Statistics, 35(4):1491–1523, 1964

  20. [20]

    Harper and J.A

    F.M. Harper and J.A. Konstan. The MovieLens datasets: History and context.ACM Transactions on Interactive Intelligent Systems, 2015. 10

  21. [21]

    Hartley and J.N.K

    H.O. Hartley and J.N.K. Rao. Sampling with unequal probabilities and without replacement. Ann. Math. Statist., 33:350–374, 1962

  22. [22]

    Hoeffding

    W. Hoeffding. Probability inequalities for sums of bounded random variables.J. Amer. Statist. Assoc., 58:13–30, 1963

  23. [23]

    Horvitz and D.J

    D.G. Horvitz and D.J. Thompson. A generalization of sampling without replacement from a finite universe.JASA, 47:663–685, 1951

  24. [24]

    Houdré and P

    C. Houdré and P. Reynaud-Bouret. Exponential inequalities, with constants, for u-statistics of order two. In E. Giné, C. Houdré, and D. Nualart, editors,Stochastic Inequalities and Applications, pages 55–69, Basel, 2003. Birkhäuser Basel. ISBN 978-3-0348-8069-5

  25. [25]

    Huang, M

    G. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments.Tech. rep., 10 2008

  26. [26]

    Joag-Dev and F

    K. Joag-Dev and F. Proschan. Negative association of random variables with applications.The Annals of Statistics, pages 286–295, 1983

  27. [27]

    Katharopoulos and F

    A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deep learning with impor- tance sampling. InInternational Conference on Machine Learning, pages 2525–2534. PMLR, 2018

  28. [28]

    Kipf and M

    T.N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. InInternational Conference on Learning Representations, 2017

  29. [29]

    Knuth et al.The art of computer programming, volume 3

    D.E. Knuth et al.The art of computer programming, volume 3. Addison-Wesley Reading, MA, 1973

  30. [30]

    Kumar, A.C

    N. Kumar, A.C. Berg, P.N. Belhumeur, and S.K. Nayar. Attribute and simile classifiers for face verification. In2009 IEEE 12th International Conference on Computer Vision, pages 365–372,

  31. [31]

    doi: 10.1109/ICCV .2009.5459250

  32. [32]

    Lee.U-statistics: Theory and practice

    A.J. Lee.U-statistics: Theory and practice. Marcel Dekker, Inc., New York, 1990

  33. [33]

    Musgrave, S

    K. Musgrave, S. Belongie, and S.N. Lim. A metric learning reality check. InComputer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV, page 681–699, Berlin, Heidelberg, 2020. Springer-Verlag. ISBN 978-3-030-58594-5. doi: 10.1007/978-3-030-58595-2_41. URL https://doi.org/10.1007/978-3-030-58595-2_ 41

  34. [34]

    E. Ohlsson. Sequential poisson sampling.Journal of Official Statistics, 14(2):149, 1998

  35. [35]

    J. N. K. Rao and C. F. J. Wu. Resampling inference with complex survey data.Journal of the American Statistical Association, 83(401):231–241, 1988. ISSN 01621459, 1537274X. URL http://www.jstor.org/stable/2288945

  36. [36]

    Robinson, C.Y

    J. Robinson, C.Y . Chuang, S. Sra, and S. Jegelka. Contrastive learning with hard negative samples.CoRR, abs/2010.04592, 2020. URLhttps://arxiv.org/abs/2010.04592

  37. [37]

    M. R. Sampford. On sampling without replacement with unequal probabilities of selection. Biometrika, 54:499–513, 1967

  38. [38]

    Schroff, D

    F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recogni- tion and clustering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015

  39. [39]

    Tillé.Sampling algorithms

    Y . Tillé.Sampling algorithms. Springer Series in Statistics, 2006

  40. [40]

    J.S. Vitter. Random sampling with a reservoir.ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985

  41. [41]

    C.Y . Wu, R. Manmatha, A.J. Smola, and P. Krähenbühl. Sampling matters in deep embedding learning, 2018. URLhttps://arxiv.org/abs/1706.07567

  42. [42]

    Z. Yang, W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. InInternational Conference on Machine Learning, 2016

  43. [43]

    Yates and P.M

    F. Yates and P.M. Grundy. Selection without replacement from within strata with probability proportional to size.Journal of the Royal Statistical Society: Series B (Methodological), 15(2): 253–261, 1953. 11

  44. [44]

    Zbontar, L

    J. Zbontar, L. Jing, I. Misra, Y . LeCun, and S. Deny. Barlow twins: Self-supervised learning via redundancy reduction.CoRR, abs/2103.03230, 2021. URL https://arxiv.org/abs/2103. 03230

  45. [45]

    Zhao and T

    P. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. Ininternational Conference on Machine Learning, pages 1–9. PMLR, 2015

  46. [46]

    ℓ(Zi, Zj) ¯πi,j(WN) − ℓ(Zk, Zl) ¯πk,l(WN) 2 WN # × ¯πi,j(WN)¯πk,l(WN)−¯π(i,j),(k,l)(WN) .(44) For all(i, j),(k, l)s.t.i < jandk < l, we have E

    S. Zhou, Y . Lei, and A. Kabán. Randomized pairwise learning with adaptive sampling: A pac- bayes analysis, 2025. URLhttps://arxiv.org/abs/2504.02957. 12 Organisation of the appendix The appendix is organised as follows. • Appendix A gathers additional information regarding notations and survey schemes. • Appendix B presents the proofs of the paper. • App...