
arxiv: 2604.16432 · v2 · submitted 2026-04-06 · 💻 cs.CY · cs.AI · cs.LG · econ.EM

Quantifying how AI Panels improve precision

Pith reviewed 2026-05-10 19:44 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.LG · econ.EM
keywords AI panels · precision estimation · pairwise correlation · applicant screening · quantile selection · decision support · hiring algorithms

The pith

A formula estimates the precision of AI panels selecting top candidates from CV-like data by accounting for their average pairwise correlation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives a closed-form approximation for how much a panel of n AIs improves the precision of identifying the top q quantile of applicants when their outputs are correlated at level ρ. The expression P(q) ≈ [ρ n^b + q(1-ρ)] / [1 + (n^b - 1)ρ] with b ≈ q* + 0.8(1-ρ) and q* clipped to [0.07, 0.22] supplies an upper bound on that precision for realistic CV data. A reader would care because the formula lets decision makers calculate whether adding AIs is worth the cost for a given hiring or selection task. It shifts discussion from qualitative warnings about single-AI bias toward concrete trade-offs between panel size and correlation. The work therefore supplies a practical tool for designing more robust AI-supported socioeconomic processes.

Core claim

The paper establishes that for data resembling realistic CVs the precision P(q) achieved by a panel of n AIs in selecting the top q quantile is given approximately by P(q) ≈ [ρ n^b + q(1-ρ)] / [1 + (n^b - 1)ρ], where ρ is the average pairwise correlation among the AIs and the exponent b is approximated as q* + 0.8(1-ρ) with q* equal to q clipped inside [0.07, 0.22]. This relation furnishes a quantitative basis for choosing the number of AIs in a panel according to the stakes of the decision.
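
As a minimal sketch, the quoted formula can be evaluated directly. The function name `panel_precision` and the input checks are illustrative assumptions, not something the paper specifies; only the arithmetic follows the expression above.

```python
def panel_precision(n: int, q: float, rho: float) -> float:
    """Approximate precision P(q) of a panel of n AIs, per the quoted formula.

    n   : panel size (hypothetical argument name)
    q   : selection quantile, 0 < q < 1
    rho : average pairwise correlation among the AIs, 0 <= rho <= 1
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")

    q_star = min(max(q, 0.07), 0.22)   # q clipped to [0.07, 0.22]
    b = q_star + 0.8 * (1.0 - rho)     # empirical exponent from the abstract
    nb = n ** b
    return (rho * nb + q * (1.0 - rho)) / (1.0 + (nb - 1.0) * rho)


if __name__ == "__main__":
    for n in (1, 3, 5, 10):
        print(n, round(panel_precision(n, q=0.1, rho=0.8), 4))
```

The values q = 0.1 and ρ = 0.8 in the demo loop are arbitrary illustrative inputs, not figures taken from the paper.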

What carries the argument

The closed-form precision formula P(q) that incorporates panel size n, selection quantile q, and average pairwise correlation ρ, together with the empirical adjustment for the exponent b.
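
A short worked check of the formula's behaviour, using only the expression quoted above. The shorthand $N = n^b$ is introduced here for convenience and is not the paper's notation.

```latex
\begin{align*}
P(q) &\approx \frac{\rho N + q(1-\rho)}{1 + (N-1)\rho}, \qquad N = n^{b},\\[4pt]
n = 1 \;(N = 1):\quad & P(q) \approx \rho + q(1-\rho),\\
\rho = 0:\quad & P(q) \approx q,\\
\rho = 1:\quad & P(q) \approx \frac{N}{N} = 1,\\[4pt]
\frac{\partial P}{\partial N} &= \frac{\rho(1-\rho)(1-q)}{\bigl[1 + (N-1)\rho\bigr]^{2}} \;>\; 0
\quad\text{for } 0 < \rho < 1,\; 0 < q < 1.
\end{align*}
```

Because $b \ge 0.07 > 0$, $N$ grows with $n$, so at fixed $\rho$ and $q$ the approximated precision is strictly increasing in panel size under this expression.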

If this is right

  • For any fixed correlation ρ, the formula shows how precision rises with panel size n and therefore indicates when adding AIs is justified by the importance of the decision.
  • Panels remain beneficial even when AIs are moderately correlated, provided n is chosen according to the formula.
  • The expression quantifies the value of lowering ρ through greater AI diversity, directly supporting arguments for building diversity into AI hiring systems.
  • Relying on a single AI (n = 1) is shown to be suboptimal compared with panels of n > 1 for most realistic values of ρ.
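
The panel-size reasoning in the list above can be sketched as a search for the smallest n that reaches a target precision. The helper `smallest_panel`, the `n_max` cap, and the example targets are hypothetical choices for illustration; the paper's abstract gives no such procedure.

```python
from typing import Optional


def panel_precision(n: int, q: float, rho: float) -> float:
    """Formula quoted in the abstract; q* is q clipped to [0.07, 0.22]."""
    q_star = min(max(q, 0.07), 0.22)
    b = q_star + 0.8 * (1.0 - rho)
    nb = n ** b
    return (rho * nb + q * (1.0 - rho)) / (1.0 + (nb - 1.0) * rho)


def smallest_panel(target: float, q: float, rho: float,
                   n_max: int = 1000) -> Optional[int]:
    """Smallest n <= n_max whose approximated precision reaches `target`.

    Returns None if the target is not reached within n_max (an assumed cap).
    """
    for n in range(1, n_max + 1):
        if panel_precision(n, q, rho) >= target:
            return n
    return None


if __name__ == "__main__":
    # Illustrative inputs only: q = 0.1, rho = 0.8.
    for target in (0.80, 0.85, 0.90):
        print(target, smallest_panel(target, q=0.1, rho=0.8))
```

A linear scan is enough here because the approximated precision increases monotonically with n for 0 < ρ < 1, so the first n that meets the target is also the smallest.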

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bounding approach could be tested on other high-stakes selection tasks such as loan or scholarship decisions where ground-truth outcomes are eventually observable.
  • If the exponent approximation holds across datasets, regulators could require disclosure of expected panel precision rather than just individual AI accuracy.
  • Empirical measurement of ρ on production AI systems would turn the formula into an operational planning tool for organizations.
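
Measuring ρ on real systems could look roughly like the sketch below. It assumes a score matrix with one row per applicant and one column per AI, and it assumes Pearson correlation; the paper's abstract does not say which correlation measure is intended, so both are labelled assumptions.

```python
import numpy as np


def average_pairwise_correlation(scores: np.ndarray) -> float:
    """Mean Pearson correlation over all distinct pairs of AI score columns.

    scores: array of shape (m_applicants, n_ais); this layout is an assumption.
    """
    if scores.ndim != 2 or scores.shape[1] < 2:
        raise ValueError("need a 2-D array with at least two AI columns")
    corr = np.corrcoef(scores, rowvar=False)          # n_ais x n_ais matrix
    upper = np.triu_indices(corr.shape[0], k=1)       # each pair counted once
    return float(corr[upper].mean())
```

The resulting estimate can then be supplied as the ρ input to the precision formula.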

Load-bearing premise

The derivation assumes that AI outputs on CV-like data can be summarized by one average pairwise correlation ρ and that the resulting precision follows the stated closed-form expression with the given approximation for the exponent b.

What would settle it

Collect a large set of real CVs with known ground-truth quality rankings, run several independent AIs on them, compute the actual precision of panels of varying size n at different quantiles q, and compare those measured values to the formula's predictions for the observed ρ.
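
Before real CV data is available, the same measurement can be rehearsed on synthetic data. The sketch below uses a toy Gaussian model (true quality plus independent noise per AI, panel score taken as the mean); this model, the function name, and all default values are assumptions made for illustration and are not the paper's simulation.

```python
import numpy as np


def simulate_panel_precision(n: int, q: float, sigma: float,
                             m: int = 2000, trials: int = 200,
                             seed: int = 0) -> float:
    """Empirical precision of top-q selection by a panel of n noisy scorers.

    Toy model (assumed): score_i = truth + sigma * noise_i, so the implied
    pairwise correlation between any two AIs is 1 / (1 + sigma**2).
    """
    rng = np.random.default_rng(seed)
    k = max(1, int(round(q * m)))                 # number of applicants selected
    hits = 0
    for _ in range(trials):
        truth = rng.standard_normal(m)
        scores = truth[:, None] + sigma * rng.standard_normal((m, n))
        panel = scores.mean(axis=1)               # panel score = mean of AI scores
        true_top = np.argpartition(-truth, k - 1)[:k]
        picked = np.argpartition(-panel, k - 1)[:k]
        hits += np.intersect1d(true_top, picked).size
    return hits / (trials * k)                    # fraction of picks truly in top q
```

Measured values from such a run can be set beside the formula's prediction for the same n, q, and ρ; any agreement or gap would apply only to this toy model and would say nothing yet about real CVs.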

Original abstract

AI in applications like screening job applicants had become widespread, and may contribute to unemployment especially among the young. Biases in the AIs may become baked into the job selection process, but even in their absence, reliance on a single AI is problematic. In this paper we derive a simple formula to estimate, or at least place an upper bound on, the precision of such approaches for data resembling realistic CVs: $P(q) \approx \frac{\rho n^b + q(1-\rho)}{1 + (n^b - 1)\rho}$ where $b \approx q^* + 0.8 (1 - \rho)$ and $q^*$ is $q$ clipped to $[0.07, 0.22]$ where $P(q)$ is the precision of the top $q$ quantile selected by a panel of $n$ AIs and $\rho$ is their average pairwise correlation. This equation provides a basis for considering how many AIs should be used in a Panel, depending on the importance of the decision. A quantitative discussion of the merits of using a diverse panel of AIs to support decision-making in such areas will move away from dangerous reliance on single AI systems and encourage a balanced assessment of the extent to which diversity needs to be built into the AI parts of the socioeconomic systems that are so important for our future.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The manuscript claims to derive a simple closed-form formula to estimate or upper-bound the precision P(q) of top-q quantile selection by a panel of n AIs for data resembling realistic CVs: P(q) ≈ [ρ n^b + q(1-ρ)] / [1 + (n^b - 1)ρ] where b ≈ q* + 0.8(1 - ρ) and q* is q clipped to [0.07, 0.22], with ρ the average pairwise correlation. It positions this as a basis for deciding panel size in applications like job screening to reduce single-AI risks and encourage diversity.

Significance. If the formula holds after proper derivation and validation, it would supply a quantitative tool for assessing precision gains from AI panels in high-stakes decisions, supporting more balanced evaluation of multi-AI systems over single-AI reliance. The effort to move from qualitative arguments to an analytical expression is a constructive contribution, though the absence of supporting derivation or data limits its current utility.

major comments (4)
  1. [Abstract] Abstract: the formula is asserted as derived from the correlation model, yet no derivation steps, order-statistic justification, or proof that the stated functional form follows from average pairwise correlation ρ are provided. This is load-bearing for the central claim of a 'simple formula' that estimates precision.
  2. [Abstract] Abstract, definition of b: the exponent incorporates an unexplained numerical constant 0.8 together with clipping of q* to [0.07, 0.22]. No justification or fitting procedure is shown, so the expression is not parameter-free or generally derived and the claimed upper-bound property cannot be assessed.
  3. [Abstract] Abstract: P(q) depends directly on ρ, which must be measured from data or assumed, but no validation against actual AI outputs on CV-like data, error analysis, or sensitivity to the single-ρ assumption is referenced. This creates the circularity noted in the stress-test and prevents the formula from serving as an independent prediction.
  4. [Abstract] Abstract: the weakest assumption—that AI outputs for realistic CV data are adequately summarized by a single average pairwise correlation ρ—is stated without supporting evidence or test of robustness; violation of this assumption would invalidate the closed-form expression for the claimed applications.
minor comments (2)
  1. [Abstract] Abstract: grammatical error in opening sentence ('had become' should be 'has become').
  2. [Abstract] Abstract: the claim that the formula 'provides a basis for considering how many AIs should be used' is not illustrated with any concrete numerical examples or guidance on choosing n for different decision importances.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each of the major comments point by point below, indicating the revisions we plan to make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the formula is asserted as derived from the correlation model, yet no derivation steps, order-statistic justification, or proof that the stated functional form follows from average pairwise correlation ρ are provided. This is load-bearing for the central claim of a 'simple formula' that estimates precision.

    Authors: We acknowledge that the submitted manuscript did not include explicit derivation steps. The approximate formula was developed by modeling the AI panel scores as draws from a multivariate distribution with pairwise correlation ρ and approximating the probability that the top-q selected items are correctly ranked using order statistics. We will revise the manuscript to include a new section detailing these steps, including the justification based on order statistics for correlated variables. revision: yes

  2. Referee: [Abstract] Abstract, definition of b: the exponent incorporates an unexplained numerical constant 0.8 together with clipping of q* to [0.07, 0.22]. No justification or fitting procedure is shown, so the expression is not parameter-free or generally derived and the claimed upper-bound property cannot be assessed.

    Authors: The form of b, including the coefficient 0.8 and the clipping of q to [0.07, 0.22], was chosen based on empirical fitting to simulated precision curves for q in the relevant range for CV screening applications. This makes the formula a hybrid analytical-empirical approximation rather than a purely closed-form derivation. In the revision, we will describe the simulation setup used for fitting, the range of parameters tested, and clarify under what conditions the expression provides an upper bound on precision. revision: yes

  3. Referee: [Abstract] Abstract: P(q) depends directly on ρ, which must be measured from data or assumed, but no validation against actual AI outputs on CV-like data, error analysis, or sensitivity to the single-ρ assumption is referenced. This creates the circularity noted in the stress-test and prevents the formula from serving as an independent prediction.

    Authors: The formula is intended to be used with ρ estimated from observed AI correlations on the specific data. The current manuscript focuses on the analytical form and does not present extensive validation or sensitivity analysis. We will add discussion of how to measure ρ in practice, include error analysis from simulations, and address sensitivity to the single-ρ assumption. However, comprehensive validation on real-world AI outputs for CV data is beyond the scope of this initial work and would be a valuable extension. revision: partial

  4. Referee: [Abstract] Abstract: the weakest assumption—that AI outputs for realistic CV data are adequately summarized by a single average pairwise correlation ρ—is stated without supporting evidence or test of robustness; violation of this assumption would invalidate the closed-form expression for the claimed applications.

    Authors: We agree that the single average ρ is a simplifying assumption. While it is standard in such models, we will enhance the manuscript with additional simulations exploring robustness to heterogeneous correlations and discuss potential limitations for the applications mentioned. This will help assess when the formula remains useful even if the assumption is mildly violated. revision: yes
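
The concern about the single-ρ assumption can be made concrete with a small sketch. The two 4×4 correlation matrices below are hypothetical examples constructed for illustration: one uniform, one with two tight clusters. They share the same average pairwise correlation, so the formula cannot distinguish them.

```python
import numpy as np


def mean_offdiag(corr: np.ndarray) -> float:
    """Average of the off-diagonal entries, counting each pair once."""
    upper = np.triu_indices(corr.shape[0], k=1)
    return float(corr[upper].mean())


# Hypothetical structure A: every pair of AIs correlated at 0.5.
uniform = np.full((4, 4), 0.5)
np.fill_diagonal(uniform, 1.0)

# Hypothetical structure B: two clusters (0.9 within, 0.3 between).
clustered = np.array([
    [1.0, 0.9, 0.3, 0.3],
    [0.9, 1.0, 0.3, 0.3],
    [0.3, 0.3, 1.0, 0.9],
    [0.3, 0.3, 0.9, 1.0],
])

if __name__ == "__main__":
    print(mean_offdiag(uniform), mean_offdiag(clustered))
```

Whether panels drawn from these two structures actually achieve the same precision is the robustness question the referee raises; the sketch only shows that the average alone does not separate them.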

Circularity Check

1 step flagged

Formula relies on empirical exponent b ≈ q* + 0.8(1-ρ) with ad-hoc clipping, not derived from first principles

specific steps
  1. fitted input called prediction [Abstract (formula for P(q) and b)]
    "P(q) ≈ [ρ n^b + q(1-ρ)] / [1 + (n^b - 1)ρ] where b ≈ q^* + 0.8 (1 - ρ) and q^* is q clipped to [0.07, 0.22]"

    The claimed derivation of the precision formula from ρ assumes the functional form follows directly from the correlation model, yet b incorporates an empirical constant 0.8 and ad-hoc clipping of q to [0.07, 0.22]. These are not derived within the paper's premises but calibrated externally, so the 'prediction' P(q) is statistically tied to fitted inputs rather than independent first-principles output.

full rationale

The paper presents P(q) as a derived closed-form expression from average pairwise correlation ρ for estimating precision of AI panels on CV-like data. However, the exponent b is not obtained from the correlation model via order statistics or effective sample size but is instead approximated as b ≈ q* + 0.8(1-ρ) with q clipped to [0.07, 0.22]. This introduces a fitted constant 0.8 and range restriction that must come from data calibration rather than the stated premises, making the output sensitive to those choices. ρ itself is an input measured or assumed from data. While the overall structure may have independent content as an approximation tool, the load-bearing exponent reduces to empirical tuning, warranting a moderate circularity flag without the entire result being forced by definition.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on an ad-hoc functional form for precision and an empirical approximation for the exponent b; ρ is treated as a sufficient summary statistic for AI agreement.

free parameters (1)
  • 0.8 coefficient
    Numerical constant appearing in the approximation b ≈ q* + 0.8 (1 - ρ); its value is not derived from first principles in the abstract and functions as an empirical fit.
axioms (2)
  • domain assumption AI outputs on CV-like data are adequately characterized by a single average pairwise correlation ρ.
    ρ is the sole dependence parameter in the formula and must be supplied as input.
  • ad hoc to paper Precision of top-q selection by n AIs follows the stated closed-form expression involving ρ and the approximated exponent b.
    The functional form is asserted without derivation steps visible in the abstract.
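
How much the output leans on the fitted constant can be probed with a small sensitivity sketch. The generalised signature (`coef`, `q_lo`, `q_hi`) and the alternative coefficient values 0.6 and 1.0 are hypothetical; only the defaults 0.8 and [0.07, 0.22] come from the abstract.

```python
def panel_precision_general(n: int, q: float, rho: float,
                            coef: float = 0.8,
                            q_lo: float = 0.07, q_hi: float = 0.22) -> float:
    """Quoted formula with the empirical constants exposed as parameters."""
    q_star = min(max(q, q_lo), q_hi)       # clipping bounds are tunable here
    b = q_star + coef * (1.0 - rho)        # coef = 0.8 reproduces the abstract
    nb = n ** b
    return (rho * nb + q * (1.0 - rho)) / (1.0 + (nb - 1.0) * rho)


def coefficient_sweep(n: int, q: float, rho: float,
                      coefs=(0.6, 0.8, 1.0)) -> dict:
    """Approximated precision for several hypothetical values of the coefficient."""
    return {c: panel_precision_general(n, q, rho, coef=c) for c in coefs}


if __name__ == "__main__":
    # Illustrative inputs only: n = 5, q = 0.1, rho = 0.8.
    for c, p in coefficient_sweep(5, 0.1, 0.8).items():
        print(c, round(p, 4))
```

The spread across the sweep gives a rough sense of how far the predicted precision moves if the coefficient were refitted on different data.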

pith-pipeline@v0.9.0 · 5542 in / 1262 out tokens · 72346 ms · 2026-05-10T19:44:08.954509+00:00 · methodology


