pith. sign in

arxiv: 2602.04626 · v2 · submitted 2026-02-04 · ❄️ cond-mat.stat-mech · math-ph· math.MP

The Most Dispersed Subset of Random Points in mathbb{R}^d

Pith reviewed 2026-05-16 07:10 UTC · model grok-4.3

classification ❄️ cond-mat.stat-mech math-phmath.MP
keywords dispersionrandom pointsorder statisticsreplica methodlarge deviationssubset selectionrotationally symmetric
0
0 comments X

The pith

For large random populations in any dimension, the subset maximizing dispersion is exactly the points outside a self-consistently sized ball.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives the full statistics, including large-deviation tails, of the highest dispersion achievable by choosing M points out of N when each point has d independent traits drawn from a rotationally symmetric distribution. It proves that in the large-N limit the optimal choice is always to keep every point lying outside a d-dimensional ball whose radius is fixed by a self-consistency condition derived from the additive dispersion measure. Two independent analytic routes—one using mean-field order statistics and one using replicas—both reach the same geometric rule. The formulas reduce to exact solvable expressions when d equals 1 and are checked against simulations and heuristic searches for small instances.

Core claim

In all dimensions d, and for rotationally symmetric distributions, the optimal subset for large populations consists of all points lying outside a d-dimensional ball whose radius is determined self-consistently.

What carries the argument

A d-dimensional ball whose radius is fixed by the self-consistency condition that balances the expected pairwise dispersion contributions of the retained points.

Load-bearing premise

Traits are drawn independently and identically from a rotationally symmetric distribution and dispersion is strictly additive over pairwise separations, with results holding only in the large-N limit.

What would settle it

For a standard Gaussian distribution in d=2, generate many large point clouds, compute the true maximal-dispersion subset of size M by exhaustive or heuristic search, and check whether the retained points lie exactly outside the predicted radius and inside points are discarded.

Figures

Figures reproduced from arXiv: 2602.04626 by Fabio Deelan Cunden, Giovanni Gramegna, Noemi Cuppone, Pierpaolo Vivo.

Figure 1
Figure 1. Figure 1: (a) N = 10 random traits (‘×’) on a line (d = 1). In red, the subset that optimises the M-dispersion with M = 4. The blue boxes indicate the subset that optimises the M-dispersion for M = 6. The optimising subsets in d = 1 are composed by two separate clusters for every M, comprising the k leftmost and the M −k rightmost variables for certain k’s. The optimal subset for M + 1 includes the optimal subset fo… view at source ↗
Figure 2
Figure 2. Figure 2: Schematic representation of the various terms of the expression in Eq. (29). Changing variables F(x) = u, F(y) = v , and writing n k := n(n−1)(n−2). . .(n−k+1) for the falling factorial, E[D M bal] = cN,M Z 1 0 da Z 1 a db aM/2 [b − a] N−M−2 (1 − b) M/2 × n M 2 2 Z Z a −∞ du a dv a (G(u) − G(v))2 + 2  M 2 2 Z a −∞ du a Z +∞ b dv 1 − b (G(u) − G(v))2 +  M 2 2 Z Z +∞ b du 1 − b dv 1 − b (G(u) − G(v))2 … view at source ↗
Figure 3
Figure 3. Figure 3: Mean and variance of the maximal M-dispersion for d = 1, uniformly distributed random variables in [−1/2, 1/2] obtained numerically from the true (not necessarily balanced) prefix-suffix optimiser, compared with the theoretical predictions for the balanced configuration in the large N-limit provided by equations (35) and (37) (red line). The dotted line shows the predictions for the balanced configuration … view at source ↗
Figure 4
Figure 4. Figure 4: Plots for N = 500 points (x, y) sampled according to the standard Gaussian distribution in d = 2. The yellow points are producing the maximal M-dispersion for different values of α = M/N according to the greedy algorithm described in Sec. 6. The black dashed circle centered in the origin has radius R(α) given by (49). Example 4 (Gaussian distributions in d > 1). For the standard Gaussian density f(x) = 1 (… view at source ↗
Figure 5
Figure 5. Figure 5: Mean and variance of DM max (with appropriate normalisation in N) for Gaussian points in d = 2. The results obtained from the greedy algorithm with several values of N are compared with the theoretical values (red line). The numerical estimates have been obtained from a sample of size 104 . probable spatial arrangement of points: a fraction α of them lie beyond a certain radius R(α), itself determined by t… view at source ↗
Figure 6
Figure 6. Figure 6: Comparison between analytical results and a numerical estimation of the large deviation tails. The numerical results in both cases are obtained with 109 samples. (Left) Results obtained for the Gaussian case in d = 1 with N = 200 and M = 80. The blue dashed line is obtained from a numerical evaluation of the Legendre-Fenchel transform of (83), while the yellow solid line is obtained from a numerical simula… view at source ↗
Figure 7
Figure 7. Figure 7: Comparison between the results obtained from the heuristic algorithm and the exact solution obtained via a brute-force approach in N = 10, for several values of α = M/N and d. The sample-size for the estimation of probability and averages is 103 . (Left) Probability that the set selected by the greedy algorithm matches the exact solution from a brute-force algorithm. (Right) Average relative error in the m… view at source ↗
read the original abstract

Consider a population of $N$ individuals, each having $d\geq 1$ different traits, and an additive measure, called dispersion, which rewards large pairwise separations between traits. The goal is to select $M\leq N$ individuals such that their traits are as dispersed as possible. We compute analytically the full statistics (including large deviation tails) of the maximally achievable dispersion among sub-populations of size $M$ when the traits are independent and identically distributed. Two complementary approaches are developed, one based on a mean-field theory for order statistics, and the other on the replica method from the field of disordered systems. In all dimensions $d$, and for rotationally symmetric distributions, the optimal subset for large populations consists of all points lying outside a $d$-dimensional ball whose radius is determined self-consistently. For a single trait ($d=1$), the statistics of the maximal dispersion can be tackled for finite $N,M$ as well. The formulae we obtained are corroborated by numerical simulations on small instances and by heuristic algorithms that find near-optimal solutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript analyzes the problem of selecting an M-subset from N i.i.d. random points in R^d that maximizes an additive dispersion measure based on pairwise separations. For rotationally symmetric distributions, it derives that the optimal large-N subset consists of all points lying outside a d-dimensional ball whose radius R is fixed by a self-consistent equation obtained from the known distribution. Two complementary methods are used: a mean-field treatment of order statistics and the replica method from disordered systems. Full statistics of the maximal dispersion, including large-deviation tails, are computed; for d=1 the finite-N,M case is also treated exactly. Results are stated to be corroborated by simulations on small instances and heuristic algorithms.

Significance. If the central identification of the radial threshold holds, the work supplies an explicit, parameter-free characterization of the optimal dispersed subset in the large-N limit together with its complete statistics. This is a concrete advance for selection problems in high dimensions and for mean-field treatments of combinatorial optimization. The dual use of order-statistic mean-field theory and the replica method, together with the self-consistent radius construction, constitutes a technical strength that could be useful in related statistical-physics models.

major comments (2)
  1. [Replica method] Replica-method section: the derivation of the radial threshold relies on a replica-symmetric saddle point whose order parameter is a simple radial cut. The manuscript does not examine the stability of this RS solution or the possibility of replica symmetry breaking. If RSB occurs, angular rearrangements of selected points could produce distinct near-optimal subsets, undermining the claim that the pure radial selection is the typical optimum. A stability analysis or explicit check that the RS ansatz is self-consistent is required to support the explicit form of the optimal subset.
  2. [Mean-field order statistics] Mean-field order-statistics derivation (around the self-consistent equation for R): the radius is obtained from the known marginal distribution rather than fitted to the dispersion value. While this avoids circularity, the manuscript must show that the resulting large-deviation rate function is indeed extremal over all possible subsets and not merely over radially symmetric ones; otherwise the optimality statement remains conditional on the ansatz.
minor comments (3)
  1. [Introduction] The definition of the dispersion functional and the precise normalization of the pairwise term should be stated explicitly in the introduction or methods section to avoid ambiguity when comparing to simulations.
  2. [Numerical results] Simulation section: error bars, number of independent realizations, and finite-size scaling checks for the large-deviation tails are missing; these should be added to allow quantitative assessment of the analytical predictions.
  3. [Replica method] Notation for the self-consistent radius R and the associated order parameter in the replica calculation should be unified across the two approaches.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the detailed and insightful report. The comments highlight important aspects of the replica-symmetric ansatz and the optimality of the radial selection. We address each major comment below and clarify the scope of our results.

read point-by-point responses
  1. Referee: [Replica method] Replica-method section: the derivation of the radial threshold relies on a replica-symmetric saddle point whose order parameter is a simple radial cut. The manuscript does not examine the stability of this RS solution or the possibility of replica symmetry breaking. If RSB occurs, angular rearrangements of selected points could produce distinct near-optimal subsets, undermining the claim that the pure radial selection is the typical optimum. A stability analysis or explicit check that the RS ansatz is self-consistent is required to support the explicit form of the optimal subset.

    Authors: We thank the referee for this important observation. Our replica-symmetric ansatz with a radial order parameter is motivated by the rotational symmetry of the distribution, which makes angular variations unlikely to improve the dispersion in the large-N limit. We have performed additional checks showing that the RS solution satisfies the saddle-point equations consistently, and numerical simulations confirm that radial selections achieve the predicted maximal dispersion. A full RSB analysis is technically involved and left for future work, but we will add a paragraph discussing the stability of the RS ansatz based on the Hessian of the free energy in the revised version. revision: partial

  2. Referee: [Mean-field order statistics] Mean-field order-statistics derivation (around the self-consistent equation for R): the radius is obtained from the known marginal distribution rather than fitted to the dispersion value. While this avoids circularity, the manuscript must show that the resulting large-deviation rate function is indeed extremal over all possible subsets and not merely over radially symmetric ones; otherwise the optimality statement remains conditional on the ansatz.

    Authors: We appreciate this comment on the scope of the optimality. Due to the rotational symmetry, the dispersion measure is invariant under rotations, so the optimal subset can always be chosen to be radially symmetric without loss of generality. The mean-field order statistics derivation maximizes the expected contribution from pairwise distances, and any deviation from the radial threshold would decrease the average dispersion by the properties of order statistics. The large-deviation function is thus extremal over all subsets because non-radial configurations are equivalent to radial ones after averaging over rotations. We will revise the manuscript to include this symmetry argument explicitly. revision: partial

Circularity Check

0 steps flagged

No circularity: self-consistent radius derived from distribution via standard mean-field and replica methods

full rationale

The derivation uses mean-field order statistics and the replica method to obtain the large-N optimal subset as all points outside a ball of self-consistent radius R for rotationally symmetric distributions. This R is fixed by the known input distribution (not fitted to the dispersion value), and the abstract states the results are corroborated by numerical simulations on small instances. No load-bearing step reduces by construction to a fitted parameter, self-citation chain, or ansatz imported from prior work by the same authors. The central claim remains independent of the target dispersion statistic itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim rests on i.i.d. sampling from a rotationally symmetric distribution and the additive definition of dispersion; the self-consistent radius is determined directly from the distribution without additional fitted constants.

axioms (2)
  • domain assumption Traits are i.i.d. draws from a rotationally symmetric distribution
    Explicitly stated as the setting in which the optimal subset is all exterior points of a self-consistent ball.
  • domain assumption Dispersion is an additive functional of pairwise separations
    Used to define the objective whose maximum is analyzed.

pith-pipeline@v0.9.0 · 5501 in / 1257 out tokens · 35214 ms · 2026-05-16T07:10:55.159292+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 1 internal anchor

  1. [1]

    internal energy

    From (24) we get κ(1) 1 (α) = 4α Z 1/2 1−α 2 x2 dx= α2(α2 −3α+ 3) 6 .(25) 3.3 Back to finiteN, M To compute the distribution of the maximal dispersion for finiteN, M, one should evaluate Pr DM max < x = Pr DM(1)< x, . . . , D M(M−1)< x ,(26) where with (D M(1), . . . , DM(M−1)) we denote the dispersions correspond to prefix- suffix configurations withk= 1...

  2. [2]

    B. C. Arnold, N. Balakrishnan, and H. N. Nagaraja,A First Course in Order Statistics, Wiley Series in Probability and Statistics (1992)

  3. [3]

    A. N. Bergold and M. B. Kovera,Diversity’s Impact on the Quality of Deliberations, Pers. Soc. Psychol. Bull.48(9), 1406-1420 (2022)

  4. [4]

    Caracciolo and G

    S. Caracciolo and G. Sicuro,Quadratic Stochastic Euclidean Bipartite Matching Problem, Phys. Rev. Lett.115, 230601 (2015)

  5. [5]

    Caracciolo, M

    S. Caracciolo, M. D’Achille, and G. Sicuro,Random Euclidean matching problems in one dimension, Phys. Rev. E96, 042102 (2017)

  6. [6]

    Castellani and A

    T. Castellani and A. Cavagna,Spin-glass theory for pedestrians, J. Stat. Mech. P05012 (2005)

  7. [7]

    F. N. David and N. L. Johnson,Statistical Treatment of Censored Data Part I. Fundamental Formulae, Biometrika41(1/2), 228-240 (1954)

  8. [8]

    Dissanayake and M

    I. Dissanayake and M. Gunathilake,A comprehensive survey of recent advances in facility location problems: Models, solution methods, and applications, Journal of Sustainable Development of Transport and Logistics9(2), 78–91 (2024)

  9. [9]

    R. S. Ellis,Large deviations for a general class of random vectors, The Annals of Probability, 12(1), 1-12 (1984)

  10. [10]

    E. J. Elton, M. J. Gruber, S. J. Brown, and W. N. Goetzmann,Modern portfolio theory and investment analysis, (9th ed.) John Wiley & Sons (2014)

  11. [11]

    Epstein and J

    L. Epstein and J. Knight,How social identity and social diversity affect judging, Leiden Journal of International Law35(4), 897-911 (2022)

  12. [12]

    Erkut and S

    E. Erkut and S. Neuman,Analytical models for locating undesirable facilities, European Journal of Operational Research40, 275–291 (1989)

  13. [13]

    Fern´ andez et al.,The maximum dispersion problem, Omega41(4), 721-730 (2013)

    E. Fern´ andez et al.,The maximum dispersion problem, Omega41(4), 721-730 (2013)

  14. [14]

    Y. V. Fyodorov and P. Le Doussal,Topology Trivialization and Large Deviations for the Minimum in the Simplest Random Optimization, J. Stat. Phys.154, 466–490 (2014)

  15. [15]

    Galli, S

    L. Galli, S. Martello, and P. Toth,The quadratic knapsack problem, European Journal of Operational Research326(1), 1-12 (2025)

  16. [16]

    Gallo, P

    G. Gallo, P. L. Hammer, and B. Simeone,Quadratic knapsack problems, Mathematical Programming Studies12, 132–149 (1980)

  17. [17]

    Ghildiyal et al.,Genomic insights into the conservation of wild and domestic animal diversity: A review, Gene886, 147719 (2023)

    K. Ghildiyal et al.,Genomic insights into the conservation of wild and domestic animal diversity: A review, Gene886, 147719 (2023)

  18. [18]

    J. B. Ghosh,Computational aspects of the maximum diversity problem, Operations Research Letters19(4), 175-181 (1996)

  19. [19]

    Glover, C-C

    F. Glover, C-C. Kuo, and K. Dhir,Heuristic algorithms for the maximum diversity problem, Journal of information and Optimization Sciences19, 109–132 (1998)

  20. [20]

    Hakovirta et al.,The importance of diversity on boards of directors’ effectiveness and its impact on innovativeness in the bioeconomy, Humanit

    M. Hakovirta et al.,The importance of diversity on boards of directors’ effectiveness and its impact on innovativeness in the bioeconomy, Humanit. Soc. Sci. Commun.7, 116 (2020)

  21. [21]

    Hochbaum et al.,A fast and effective breakpoints heuristic algorithm for the quadratic knapsack problem, European Journal of Operational Research323(2), 425-440 (2025)

    D.S. Hochbaum et al.,A fast and effective breakpoints heuristic algorithm for the quadratic knapsack problem, European Journal of Operational Research323(2), 425-440 (2025)

  22. [22]

    Hughes, Z

    N. Hughes, Z. UH Khan, and F. Mengel,Diversity in Committees(2023). Available at SSRN: https://ssrn.com/abstract=4495035

  23. [23]

    C.-C. Kuo, F. Glover, and K. S. Dhir,Analyzing and Modeling the Maximum Diversity Problem by Zero-One Programming, Decision Sciences24, 1171-1185 (1993)

  24. [24]

    Marinari et al.,Replica Symmetry Breaking in Short-Range Spin Glasses: Theoretical Foundations and Numerical Evidences, Journal of Statistical Physics98, 973–1074 (2000)

    E. Marinari et al.,Replica Symmetry Breaking in Short-Range Spin Glasses: Theoretical Foundations and Numerical Evidences, Journal of Statistical Physics98, 973–1074 (2000)

  25. [25]

    Mart´ ı et al.,Heuristics and metaheuristics for the maximum diversity problem, Journal of Heuristics19, 591–615 (2013)

    R. Mart´ ı et al.,Heuristics and metaheuristics for the maximum diversity problem, Journal of Heuristics19, 591–615 (2013)

  26. [26]

    Mart´ ı et al.,A review on discrete diversity and dispersion maximization from an OR perspective, European Journal of Operational Research299(3), 795-813 (2022)

    R. Mart´ ı et al.,A review on discrete diversity and dispersion maximization from an OR perspective, European Journal of Operational Research299(3), 795-813 (2022)

  27. [27]

    Mazzarisi, A

    O. Mazzarisi, A. de Azevedo-Lopes, J. J. Arenzon, and F. Corberi,Maximal Diversity and Zipf’s Law, Phys. Rev. Lett.127, 128301 (2021). The Most Dispersed Subset of Random Points inR d 35

  28. [28]

    M´ ezard and G

    M. M´ ezard and G. Parisi,A replica analysis of the travelling salesman problem, J. Phys. France 47, 1285-1296 (1986)

  29. [29]

    M´ ezard, G

    M. M´ ezard, G. Parisi, and M. A. Virasoro,Spin Glass Theory and Beyond, World Scientific (1987)

  30. [30]

    M´ ezard,Spin glass theory and its new challenge: structured disorder, Indian J

    M. M´ ezard,Spin glass theory and its new challenge: structured disorder, Indian J. Phys98, 3757–3768 (2024)

  31. [31]

    Patr´ ıcio and M

    L. Patr´ ıcio and M. Franco,A Systematic Literature Review about Team Diversity and Team Performance: Future Lines of Investigation, Adm. Sci.12, 31 (2022)

  32. [32]

    Prosser and J

    C. Prosser and J. Mellon,The Twilight of the Polls? A Review of Trends in Polling Accuracy and the Causes of Polling Misses, Government and Opposition53(4), 757-790 (2018)

  33. [33]

    representative

    J. E. Rudolph, Y. Zhong, P. Duggal, S. H. Mehta, and B. Lau,What does it mean to be “representative”?, arXiv:2211.06389 (2022)

  34. [34]

    R. K. Salgotra and B. S. Chauhan,Genetic Diversity, Conservation, and Utilization of Plant Genetic Resources, Genes (Basel)14(1), 174 (2023)

  35. [35]

    Schauer,Asymptotic behavior of the quadratic knapsack problem, European Journal of Operational Research255(2), 357-363 (2016)

    J. Schauer,Asymptotic behavior of the quadratic knapsack problem, European Journal of Operational Research255(2), 357-363 (2016)

  36. [36]

    Touchette,The large deviation approach to statistical mechanics, Physics Reports478, 1-69 (2009)

    H. Touchette,The large deviation approach to statistical mechanics, Physics Reports478, 1-69 (2009)

  37. [37]

    Mean field theory of spin glasses

    F. Zamponi,Mean field theory of spin glasses, arXiv:1008.4844v5 (2014)