
arxiv: 2605.31482 · v2 · pith:OVU7WUG2new · submitted 2026-05-29 · ✦ hep-ph

Hyperoptimisation algorithm for the next generation of PDF determinations: ensemble regression with an unbiased selection model

Pith reviewed 2026-06-28 21:40 UTC · model grok-4.3

classification ✦ hep-ph
keywords parton distribution functions · hyperoptimisation · ensemble regression · k-folding · PDF uncertainties · fitting methodologies · hyperparameter variation

The pith

A hyperoptimisation algorithm using ensemble regression generates PDF sets whose uncertainties account for variation across fitting methodologies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an automated procedure for selecting fitting methodologies in parton distribution function determinations. It extends the existing k-folding approach by creating ensembles of hyperparameter configurations and introducing a new metric that folds in full PDF uncertainties at each step. These statistically equivalent methodologies are then combined into one PDF set. The resulting uncertainties are moderately larger in regions with limited data constraints, while the central values remain consistent with prior determinations on the same inputs. A reader would care because the procedure removes a layer of manual choice from the uncertainty budget.

Core claim

The algorithm outputs an ensemble of statistically equivalent fitting methodologies, which are combined to produce a single PDF set whose uncertainties consistently account for hyperparameter variation, yielding moderately larger uncertainties in regions where data provide limited constraints while broadly confirming earlier NNPDF results.

What carries the argument

Ensemble regression with an unbiased selection model that employs a new k-folding metric incorporating full PDF uncertainties to generate and combine statistically equivalent hyperparameter configurations.
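
As a rough illustration of what a k-folding metric with PDF uncertainties folded in could look like, the sketch below evaluates a held-out χ² in which a covariance estimated from the replica spread is added to the experimental covariance, then averages over folds. The paper's actual definition is not reproduced on this page, so the function names, the additive treatment of the two covariances, and the plain average over folds are all assumptions made for illustration.

```python
import numpy as np


def chi2_with_pdf_uncertainty(data, replica_predictions, cov_exp):
    """chi2 per data point of the replica-mean prediction.

    Assumption (not taken from the paper): the PDF uncertainty enters as a
    covariance estimated from the replicas and added to the experimental one.
    replica_predictions has shape (n_replicas, n_data).
    """
    mean_pred = replica_predictions.mean(axis=0)
    # Sample covariance of the predictions across replicas, (n_data, n_data).
    cov_pdf = np.atleast_2d(np.cov(replica_predictions, rowvar=False))
    residual = data - mean_pred
    total_cov = cov_exp + cov_pdf
    return float(residual @ np.linalg.solve(total_cov, residual)) / data.size


def kfold_loss(folds):
    """Average the held-out chi2 over folds.

    `folds` is a list of (data, replica_predictions, cov_exp) tuples, where
    each set of predictions comes from a fit that did not see that fold.
    """
    return float(np.mean([chi2_with_pdf_uncertainty(*fold) for fold in folds]))
```

In this toy form, a configuration with a wider replica spread on the held-out data is penalised less for the same central-value mismatch, which is one way a metric can take the full uncertainty into account rather than the central value alone.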

If this is right

  • With identical data and theory inputs, the methodologies selected by the new procedure give moderately larger uncertainties than the previous one where data constraints are weak.
  • The final PDF set uncertainties are stated to include the effect of hyperparameter choice through the equal-weight combination of the ensemble.
  • The procedure produces a single PDF set rather than a family of separate sets, with the ensemble step internalised in the uncertainty.
  • Earlier NNPDF results are recovered to good approximation when the same inputs are used.
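
The equal-weight combination mentioned in these points can be sketched as pooling the replicas of each selected methodology into one set. The code below is a minimal illustration, assuming each methodology contributes the same number of replicas evaluated on a common grid; the function names are hypothetical and the paper's own combination procedure may differ in detail.

```python
import numpy as np


def combine_equal_weight(ensembles):
    """Pool replica sets, one array of shape (n_replicas, n_points) each."""
    sizes = {e.shape[0] for e in ensembles}
    if len(sizes) != 1:
        raise ValueError("equal weighting here assumes equal replica counts")
    pooled = np.concatenate(ensembles, axis=0)
    return pooled.mean(axis=0), pooled.std(axis=0)


def variance_split(ensembles):
    """Split the pooled variance into within- and between-methodology parts."""
    within = np.mean([e.var(axis=0) for e in ensembles], axis=0)
    between = np.var([e.mean(axis=0) for e in ensembles], axis=0)
    return within, between
```

With equal replica counts, the pooled variance is exactly the sum of the two parts returned by `variance_split`. The between-methodology term is where hyperparameter variation would show up, and it can only be sizeable where the methodologies disagree, which is consistent with the increase being confined to weakly constrained regions.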

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be applied to other global fits that rely on many tunable hyperparameters to make uncertainty estimates less dependent on single choices.
  • If the statistical equivalence of the generated ensembles holds across different data releases, future PDF updates could adopt the procedure without re-tuning by hand.
  • One could test whether the observed increase in uncertainty propagates into observables such as cross-section predictions at the LHC in a manner consistent with the enlarged error bands.

Load-bearing premise

The new k-folding metric incorporates full PDF uncertainties without selection bias and the generated hyperparameter ensembles are statistically equivalent so that equal weighting is justified.
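
One way to make "statistically equivalent" concrete is to keep every configuration whose loss lies within the uncertainty of the best one. The sketch below assumes a hypothetical criterion of this kind, with a per-configuration loss and loss uncertainty already available; the threshold rule and the `n_sigma` parameter are illustrative choices, not the paper's stated definition.

```python
import numpy as np


def select_equivalent(losses, loss_errors, n_sigma=1.0):
    """Return indices of configurations compatible with the best loss.

    Hypothetical rule: keep a configuration if its loss does not exceed the
    best loss by more than n_sigma times the best loss's uncertainty.
    """
    losses = np.asarray(losses, dtype=float)
    loss_errors = np.asarray(loss_errors, dtype=float)
    best = int(np.argmin(losses))
    threshold = losses[best] + n_sigma * loss_errors[best]
    return np.flatnonzero(losses <= threshold)
```

Under a rule like this, equal weighting is only as justified as the loss uncertainties are reliable: if they are underestimated the ensemble shrinks towards a single configuration, and if they are overestimated clearly worse configurations get the same weight as the best.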

What would settle it

A direct comparison in which the new metric selects configurations whose combined uncertainties differ systematically from the spread obtained by holding the metric fixed would falsify the claim that uncertainties now consistently include hyperparameter variation.

Original abstract

We present a new automated procedure for selecting fitting methodologies in the determination of parton distribution functions (PDFs), based on a hyperoptimisation algorithm using ensemble regression. Building on the k-folding approach previously employed by the NNPDF collaboration, we introduce a systematic strategy to generate ensembles of hyperparameter configurations and define a new k-folding metric that consistently incorporates full PDF uncertainties. The algorithm outputs an ensemble of statistically equivalent fitting methodologies, which are combined to produce a single PDF set whose uncertainties consistently account for hyperparameter variation. We assess the impact of this approach by comparing results obtained with identical data and theoretical inputs but different fitting methodologies, determined using the previous and the new hyperoptimisation procedures. The new method broadly confirms earlier NNPDF results, while yielding moderately larger uncertainties in regions where data provide limited constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a hyperoptimisation algorithm for PDF determinations based on ensemble regression. Building on the NNPDF k-folding approach, it defines a new k-folding metric that incorporates full PDF uncertainties to generate an ensemble of statistically equivalent hyperparameter configurations. These are combined into a single PDF set whose uncertainties are claimed to consistently account for hyperparameter variation. The method is assessed via comparison to prior NNPDF results using identical data and theory inputs, broadly confirming earlier findings while producing moderately larger uncertainties in data-limited regions.

Significance. If the claims regarding unbiased selection and statistical equivalence hold after validation, the work would offer a systematic procedure for incorporating hyperparameter uncertainty into PDF fits. This could improve the robustness of uncertainty estimates in global analyses, particularly in poorly constrained kinematic regions, and support more reliable inputs for LHC phenomenology. The ensemble-regression framework provides a reproducible route to methodology selection.

major comments (3)
  1. [Abstract] Abstract and results comparison: the central claim that the new metric 'consistently incorporates full PDF uncertainties' and yields 'statistically equivalent' methodologies (justifying equal weighting in the combination) is asserted without any reported quantitative validation metrics, χ² distributions, uncertainty ratios, or statistical tests for equivalence across ensembles; this directly underpins the moderately larger uncertainties and the overall procedure.
  2. [New k-folding metric] Section describing the new k-folding metric: the assertion that the metric avoids selection bias while using full PDF uncertainties requires explicit demonstration (e.g., via comparable fit-quality profiles or covariance comparisons between selected configurations), yet the qualitative comparison to prior NNPDF results provides no such test, leaving the bias-free property unverified.
  3. [Results and comparison] Ensemble combination and results section: the statement of 'moderately larger uncertainties in regions where data provide limited constraints' is presented without numerical values, tables, or figures quantifying the increase relative to the previous procedure, which is load-bearing for assessing the practical impact of the new method.
minor comments (2)
  1. [Method] Clarify the precise definition of 'statistically equivalent' (e.g., via a threshold on a distance metric between configurations) at first introduction to avoid ambiguity in the combination step.
  2. [Introduction] Ensure the prior NNPDF k-folding reference is cited with page or equation numbers when defining the new metric to highlight the incremental change.
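
The equivalence check requested in major comment 1 could take a form such as a two-sample permutation test on the per-replica χ² values of two selected configurations. The sketch below is one generic option, not a test the authors are known to use; the choice of the difference in means as test statistic, the number of permutations, and the fixed seed are assumptions.

```python
import numpy as np


def permutation_pvalue(sample_a, sample_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in sample means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[: a.size].mean() - shuffled[a.size :].mean())
        count += int(diff >= observed)
    # Add-one correction keeps the estimate strictly positive.
    return (count + 1) / (n_permutations + 1)
```

A large p-value does not prove equivalence, it only fails to detect a difference, so a report along these lines would ideally also state the size of difference the test is able to resolve.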

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and recommendation. We address each major comment below with clarifications and indicate revisions to strengthen the manuscript where the points identify gaps in quantitative support.

Point-by-point responses
  1. Referee: [Abstract] Abstract and results comparison: the central claim that the new metric 'consistently incorporates full PDF uncertainties' and yields 'statistically equivalent' methodologies (justifying equal weighting in the combination) is asserted without any reported quantitative validation metrics, χ² distributions, uncertainty ratios, or statistical tests for equivalence across ensembles; this directly underpins the moderately larger uncertainties and the overall procedure.

    Authors: We acknowledge that the abstract asserts the properties of the metric and equivalence without explicit quantitative validation metrics or statistical tests. The manuscript performs comparisons using identical data and theory inputs, but these are primarily qualitative. We will revise the abstract to reference the added quantitative material and include χ² distributions, uncertainty ratio plots, and equivalence tests in a new subsection of the results. revision: yes

  2. Referee: [New k-folding metric] Section describing the new k-folding metric: the assertion that the metric avoids selection bias while using full PDF uncertainties requires explicit demonstration (e.g., via comparable fit-quality profiles or covariance comparisons between selected configurations), yet the qualitative comparison to prior NNPDF results provides no such test, leaving the bias-free property unverified.

    Authors: The metric is constructed to incorporate full PDF uncertainties precisely to mitigate selection bias. The current manuscript relies on qualitative comparison to prior NNPDF results. We agree that explicit demonstration is needed and will add fit-quality profiles together with covariance comparisons among the selected configurations in the revised metric section. revision: yes

  3. Referee: [Results and comparison] Ensemble combination and results section: the statement of 'moderately larger uncertainties in regions where data provide limited constraints' is presented without numerical values, tables, or figures quantifying the increase relative to the previous procedure, which is load-bearing for assessing the practical impact of the new method.

    Authors: We agree that the statement would be strengthened by explicit quantification. The manuscript describes the effect on the basis of the performed comparisons but does not supply numerical values or dedicated tables/figures for the increase. We will add tables reporting average uncertainty ratios in data-limited kinematic regions and supplementary figures in the revised results section. revision: yes
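
The average uncertainty ratio promised in this response could be computed along the following lines. This is a minimal sketch assuming both determinations are available as replica arrays on a common x grid; the function name, the use of the replica standard deviation as the uncertainty, and the region boundaries are illustrative assumptions.

```python
import numpy as np


def mean_uncertainty_ratio(x, new_replicas, old_replicas, x_min, x_max):
    """Average ratio of replica standard deviations for x in [x_min, x_max].

    Both replica arrays have shape (n_replicas, n_x) on the same x grid.
    """
    x = np.asarray(x, dtype=float)
    mask = (x >= x_min) & (x <= x_max)
    if not mask.any():
        raise ValueError("no grid points in the requested x range")
    ratio = new_replicas.std(axis=0)[mask] / old_replicas.std(axis=0)[mask]
    return float(ratio.mean())
```

Reporting this number separately for a data-rich and a data-poor x range would show directly whether the increase is confined to weakly constrained regions. The sketch does not guard against a vanishing old uncertainty, which would need handling on a real grid.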

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper builds on prior NNPDF k-folding work via self-citation but introduces a new hyperoptimisation algorithm, ensemble generation strategy, and k-folding metric as independent additions. No load-bearing step reduces by construction to fitted inputs, self-definitions, or unverified self-citation chains; the output ensemble and combined PDF uncertainties are presented as results of the new procedure rather than tautological renamings or forced equivalences. The derivation remains self-contained against external benchmarks of previous NNPDF results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review prevents identification of concrete free parameters, axioms, or invented entities; the procedure appears to rest on standard domain assumptions about the validity of k-folding for PDF fitting and the statistical equivalence of selected ensembles.

axioms (1)
  • domain assumption: The k-folding approach previously employed by the NNPDF collaboration provides a valid foundation that can be extended with ensemble regression without introducing new bias.
    The paper states it builds directly on this prior method.

pith-pipeline@v0.9.1-grok · 5679 in / 1360 out tokens · 32131 ms · 2026-06-28T21:40:30.062682+00:00 · methodology

