An NLO-Matched Initial and Final State Parton Shower on a GPU

Michael H. Seymour; Siddharth Sule

arxiv: 2511.19633 · v3 · pith:35JWV43Dnew · submitted 2025-11-24 · ✦ hep-ph

Pith reviewed 2026-05-21 18:01 UTC · model grok-4.3

classification ✦ hep-ph

keywords parton shower · GPU computing · Monte Carlo event generator · NLO matching · Z production · LHC simulation · CUDA

The pith

A single NVIDIA V100 GPU matches the speed and energy use of a 96-core Intel Xeon cluster when running NLO-matched Z production simulations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases version 2 of the GAPS generator, a CUDA C++ implementation that runs initial and final state parton showers on a GPU while supporting hard-process matching. It ships with a nearly identical C++ CPU version so that direct performance comparisons can be made. When both versions simulate NLO Z production at the LHC, the speed and energy consumption of the GPU are on par with those of a 96-core cluster built from two Intel Xeon Gold 5220R processors. This result indicates that GPU hardware can serve as a practical substitute for traditional CPU clusters in Monte Carlo event generation.

Core claim

We have developed and released version 2 of the CUDA C++ parton shower event generator GAPS, which performs initial and final state emissions on a GPU and supports hard-process matching. The generator is accompanied by a near-identical C++ version for single-core and multi-core CPUs. Simulations of NLO Z production at the LHC show that the speed and energy consumption of an NVIDIA V100 GPU are comparable to those of a 96-core cluster composed of two Intel Xeon Gold 5220R processors.

What carries the argument

The GAPS CUDA C++ parton shower that executes initial and final state emissions with hard-process matching on GPU hardware.

If this is right

NLO-matched parton shower simulations for processes such as Z production can be executed on GPU hardware with performance comparable to large CPU clusters.
Monte Carlo event generators can be ported to GPUs while preserving both initial- and final-state radiation and hard-process matching.
Single-GPU machines offer a practical alternative to 96-core CPU clusters for high-energy physics event generation in terms of both throughput and energy consumption.
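
The throughput and energy comparison behind these implications reduces to simple arithmetic: total energy is mean power multiplied by wall-clock time, and dividing by the number of events gives a per-event cost that can be compared across devices. The C++ sketch below shows that arithmetic only. The struct and function names are hypothetical, and no measured values from the paper are used.

```cpp
#include <cassert>

// Summary of one benchmark run (hypothetical helper, not part of GAPS).
struct RunMetrics {
    double events_per_second;  // throughput
    double energy_joules;      // total energy = mean power * wall time
    double joules_per_event;   // energy cost of a single event
};

// n_events: number of generated events
// wall_seconds: measured wall-clock time of the run
// mean_watts: average power drawn by the device during the run
RunMetrics summarise(double n_events, double wall_seconds, double mean_watts) {
    assert(n_events > 0.0 && wall_seconds > 0.0 && mean_watts >= 0.0);
    const double energy = mean_watts * wall_seconds;
    return {n_events / wall_seconds, energy, energy / n_events};
}
```

Calling `summarise` once for the GPU run and once for the CPU run, with the same number of events, gives two `RunMetrics` values whose fields can be compared directly.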

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Research groups without access to large CPU clusters may be able to perform equivalent simulations using a single high-end GPU.
Similar GPU ports could be applied to other Monte Carlo generators, extending the approach beyond the current GAPS implementation.
Newer GPU architectures may tip the reported parity in favor of accelerators for particle physics workloads.

Load-bearing premise

The GPU port produces results that are numerically and physically equivalent to the CPU version for all observables of interest, with no unaccounted differences arising from floating-point precision, thread scheduling, or algorithmic approximations.

What would settle it

Any statistically significant discrepancy in kinematic distributions, cross sections, or other observables between the GPU and CPU runs of the NLO Z production simulation would falsify the claimed equivalence.
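
One way to make "statistically significant discrepancy" concrete is a bin-by-bin pull with a summed chi-square. The C++ sketch below assumes that each histogram bin stores the sum of weights and the sum of squared weights, and that the GPU and CPU runs use independent random number streams. The `Bin` layout and function names are illustrative assumptions, not the format used by GAPS or by any analysis tool cited by the paper.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One histogram bin (assumed layout, for illustration only).
struct Bin {
    double sumw;   // sum of event weights in the bin
    double sumw2;  // sum of squared weights, the variance estimate of sumw
};

// Pull of one bin: difference divided by the combined uncertainty.
// Returns 0 when both bins are empty.
double pull(const Bin& a, const Bin& b) {
    const double var = a.sumw2 + b.sumw2;
    return var > 0.0 ? (a.sumw - b.sumw) / std::sqrt(var) : 0.0;
}

struct Chi2Result {
    double chi2;                 // sum of squared pulls
    std::size_t populated_bins;  // number of bins that entered the sum
};

// Compare two histograms with identical binning.
Chi2Result compare(const std::vector<Bin>& gpu, const std::vector<Bin>& cpu) {
    assert(gpu.size() == cpu.size());
    Chi2Result result{0.0, 0};
    for (std::size_t i = 0; i < gpu.size(); ++i) {
        if (gpu[i].sumw2 + cpu[i].sumw2 <= 0.0) continue;  // skip empty bins
        const double p = pull(gpu[i], cpu[i]);
        result.chi2 += p * p;
        ++result.populated_bins;
    }
    return result;
}
```

For independent samples that agree, `chi2` divided by `populated_bins` should be close to one, and individual pulls should rarely exceed a few units in magnitude.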

Figures

Figures reproduced from arXiv: 2511.19633 by Michael H. Seymour, Siddharth Sule.

**Figure 2.** Partitioning the event record list. In this case, …

**Figure 3.** Z Observables and Anti-kT jets produced with R = 0.4. The Z and lepton observables are fully inclusive, while each jet’s pT distribution is shown when its |η| < 5, its η distribution is shown when its pT > 5 GeV, and the ∆R and multiplicity distributions are shown when pT > 5 GeV and |η| < 5. The Z and lepton observables agree very well with Herwig, the leading jet pretty well, the second and third jets sl…

**Figure 4.** NLO+Shower for the process pp → Z, where the Z boson is on-shell and stable. Like the LO+Shower case, the Z boson observables are in agreement. The jet observables also contained the same deviations and are omitted here. (Text following the caption, from Section 4.2, GPU Profiling and Impact of Computational Improvements:) Similar to our previous work, we used the NVIDIA V100 [14], which has 32 cores per warp and 64 warps per streaming multiprocessor…

**Figure 5.** Kernel Tuning Results, with partitioning on and off. For small …

**Figure 6.** Execution Time, Average power consumption and total energy consump…

**Figure 7.** PDF Ratio Results for c → g, b → g, g → d and g → u splittings. The CT14lo set was used for the simulations. The chosen process was Z production at NLO, to incorporate the phase space for the power shower. The results show that the fitted equation overestimates the majority of the data points. The code to generate the data for this can be found in the C++ implementation, and the plotting code is in plot-pd…
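
Figures 2 and 5 refer to partitioning the event record list, but the captions are truncated here, so the details of the paper's scheme are not available on this page. As an assumption about what such a step could look like, the C++ sketch below moves events that still need showering to the front of the list so that the remaining work is contiguous. The `Event` struct and its `shower_finished` flag are hypothetical, and a GPU code would more plausibly use a device-side partition than `std::stable_partition`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal stand-in for an event record (hypothetical fields).
struct Event {
    int id;
    bool shower_finished;  // true once no further emissions are possible
};

// Reorder the list so that active events come first, preserving relative
// order, and return the number of active events.
std::size_t partition_active(std::vector<Event>& events) {
    const auto first_finished = std::stable_partition(
        events.begin(), events.end(),
        [](const Event& e) { return !e.shower_finished; });
    const auto n_active =
        static_cast<std::size_t>(first_finished - events.begin());
    assert(n_active <= events.size());
    return n_active;
}
```

After the call, only the first `n_active` entries need to be visited in the next shower step.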

Original abstract

Recent developments have demonstrated the potential for high simulation speeds and reduced energy consumption by porting Monte Carlo Event Generators to GPUs. We release version 2 of the CUDA C++ parton shower event generator GAPS, which can simulate initial and final state emissions on a GPU and is capable of hard-process matching. As before, we accompany the generator with a near-identical C++ generator to run simulations on single-core and multi-core CPUs. Using these programs, we simulate NLO Z production at the LHC and demonstrate that the speed and energy consumption of an NVIDIA V100 GPU are on par with a 96-core cluster composed of two Intel Xeon Gold 5220R Processors, providing a potential alternative to cluster computing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

GAPS v2 adds initial-state showers and NLO matching to its GPU port and reports competitive speed versus a 96-core CPU cluster, but the numerical equivalence between GPU and CPU outputs is not shown in enough detail.

The paper ships GAPS v2, a CUDA version of a parton shower that now does both initial and final state emissions together with NLO matching. They keep a near-identical C++ CPU code for direct comparison and run NLO Z production at the LHC as the test case. The headline result is that one NVIDIA V100 GPU matches the wall-clock time and energy draw of a two-socket 96-core Xeon Gold 5220R cluster.

That is the concrete new piece: extending the earlier GPU work to the full initial-plus-final shower plus matching inside one framework, with a side-by-side CPU reference included. The implementation itself looks like a straightforward port that preserves the original algorithm structure, which is the right way to do it if you want believable timing numbers. The performance claim is therefore worth looking at for anyone who runs large-scale event generation.

The soft spot is exactly the one the stress-test flags. The speed comparison only holds if the GPU and CPU versions produce statistically identical distributions for the observables that matter. The abstract states the performance numbers but does not report bin-by-bin agreement, Kolmogorov-Smirnov distances, or pull plots for pT(Z), rapidity, or jet multiplicities at the 10^7-event level. Without those checks it is possible that reduced precision, different veto ordering, or thread-dependent effects have shifted the results by a small but non-zero amount. If the full manuscript contains those cross-checks they should be highlighted; if not, they need to be added before the performance numbers can be taken at face value.

This is a methods paper aimed at the Monte Carlo and computing groups in high-energy physics. People who maintain or port event generators will find the code release and the direct CPU-GPU comparison useful. It is solid enough on the implementation side and concrete enough on the timing side to deserve a serious referee rather than a desk reject, even though the validation section will probably need work.
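
For readers unfamiliar with the veto step mentioned in the letter, the C++ sketch below shows the standard serial Sudakov veto algorithm with a toy overestimate. It is not the GAPS kernel: the overestimated rate g(t) = c / t, the true rate f(t) = (c / t) · accept(t), and all function and parameter names are assumptions chosen so that the trial scale can be solved in closed form.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <random>

// Returns the scale of the next accepted emission below t_start, or t_cut
// if the evolution reaches the cutoff without an accepted emission.
// accept(t) must return the ratio f(t) / g(t), a number in [0, 1].
double next_emission(double t_start, double t_cut, double c,
                     const std::function<double(double)>& accept,
                     std::mt19937_64& rng) {
    assert(t_start > 0.0 && t_cut > 0.0 && c > 0.0);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double t = t_start;
    while (t > t_cut) {
        // Trial emission: solve (t_new / t)^c = r for t_new, where the left
        // side is the no-emission probability of the overestimate g.
        const double r = uniform(rng);
        t *= std::pow(r, 1.0 / c);
        if (t <= t_cut) break;  // fell below the cutoff: no emission
        // Accept with probability f / g; otherwise veto and continue
        // the evolution downwards from the vetoed scale.
        if (uniform(rng) < accept(t)) return t;
    }
    return t_cut;
}
```

With `accept` fixed to 1 the function samples the overestimate directly, so the fraction of calls that return `t_cut` should approach (t_cut / t_start)^c. The order in which trials are generated and vetoed determines how many random numbers each event consumes, which is one reason two implementations can differ event by event while still agreeing statistically.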

Referee Report

1 major / 2 minor

Summary. The manuscript presents version 2 of the CUDA C++ parton shower event generator GAPS, which implements initial- and final-state emissions together with hard-process matching and runs on GPUs. A near-identical C++ CPU version is provided for comparison. The authors simulate NLO Z production at the LHC and report that the speed and energy consumption of a single NVIDIA V100 GPU are comparable to those of a 96-core cluster built from two Intel Xeon Gold 5220R processors.

Significance. If the numerical equivalence between the GPU and CPU implementations is established and the performance numbers are reproducible, the work supplies a concrete, energy-efficient alternative to conventional CPU clusters for NLO-matched parton-shower simulations. The provision of both CUDA and reference C++ codes is a positive feature that facilitates direct benchmarking.

major comments (1)

[Results / validation of NLO-matched observables] The central performance claim (GPU parity with a 96-core Xeon cluster) presupposes that the CUDA implementation produces statistically and numerically identical results to the C++ reference for all observables of interest. No bin-by-bin comparison, pull distribution, or Kolmogorov-Smirnov test is presented for distributions such as pT(Z), y(Z), or jet multiplicities at the 10^7-event level. This validation is load-bearing for the speed/energy comparison and must be supplied.

minor comments (2)

[Abstract] The abstract states performance parity but does not quote the actual timing or energy figures; these numbers should appear in the abstract or be clearly cross-referenced to a table in the main text.
[Implementation details] Clarify the floating-point precision used on the GPU (single vs. double) and any algorithmic approximations introduced in the emission loop or Sudakov veto ordering.
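
To illustrate why the precision question in the second minor comment matters, the C++ sketch below accumulates many small identical weights in single and in double precision, with and without Kahan compensation. It is a generic numerical example and says nothing about which precision or summation scheme GAPS actually uses. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Add the weight w to a running total n times, in precision T.
template <typename T>
T naive_sum(T w, std::size_t n) {
    assert(n > 0);
    T total = 0;
    for (std::size_t i = 0; i < n; ++i) total += w;
    return total;
}

// Same sum with Kahan compensation, which carries the rounding error of
// each addition forward in a separate correction term.
template <typename T>
T kahan_sum(T w, std::size_t n) {
    assert(n > 0);
    T total = 0;
    T correction = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const T y = w - correction;
        const T t = total + y;
        correction = (t - total) - y;
        total = t;
    }
    return total;
}
```

Summing ten million weights of 0.1 with `naive_sum<float>` drifts far from the expected one million, because the weight becomes comparable to the spacing between representable floats as the total grows, while `naive_sum<double>` and `kahan_sum<float>` stay close to it. The sketch assumes the compiler is not run with value-unsafe floating-point options, which can remove the compensation.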

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for highlighting the importance of rigorous validation to support the performance claims. We address the major comment below and will revise the manuscript accordingly.

Referee: [Results / validation of NLO-matched observables] The central performance claim (GPU parity with a 96-core Xeon cluster) presupposes that the CUDA implementation produces statistically and numerically identical results to the C++ reference for all observables of interest. No bin-by-bin comparison, pull distribution, or Kolmogorov-Smirnov test is presented for distributions such as pT(Z), y(Z), or jet multiplicities at the 10^7-event level. This validation is load-bearing for the speed/energy comparison and must be supplied.

Authors: We agree that establishing numerical equivalence between the CUDA and C++ implementations is essential for the validity of the reported speed and energy comparisons. The current manuscript includes overall consistency checks for the NLO-matched Z production process but does not provide the detailed statistical tests (bin-by-bin ratios, pull distributions, or Kolmogorov-Smirnov tests) at the 10^7-event level for the specific observables listed. In the revised version we will add these comparisons for pT(Z), y(Z), and jet multiplicities, using the same event statistics as the performance benchmarks. This addition will directly address the referee's concern and strengthen the manuscript.

revision: yes
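
As an illustration of the Kolmogorov-Smirnov test named in this exchange, the C++ sketch below computes the two-sample KS distance between two sets of unweighted values, for example pT(Z) from a GPU run and from a CPU run. It assumes unit-weight events and independent samples; weighted events would need a different treatment. The function names are hypothetical, and the 5% threshold uses the standard large-sample coefficient 1.358.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Two-sample KS distance: the largest absolute difference between the
// empirical cumulative distributions of a and b. Arguments are taken by
// value because the function sorts them.
double ks_distance(std::vector<double> a, std::vector<double> b) {
    assert(!a.empty() && !b.empty());
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    const double na = static_cast<double>(a.size());
    const double nb = static_cast<double>(b.size());
    std::size_t i = 0, j = 0;
    double d = 0.0;
    while (i < a.size() && j < b.size()) {
        const double x = std::min(a[i], b[j]);
        // Step both empirical CDFs past every value equal to x.
        while (i < a.size() && a[i] <= x) ++i;
        while (j < b.size() && b[j] <= x) ++j;
        d = std::max(d, std::fabs(i / na - j / nb));
    }
    return d;
}

// Large-sample 5% critical value for the distance above.
double ks_critical_5pct(std::size_t na, std::size_t nb) {
    assert(na > 0 && nb > 0);
    const double n = static_cast<double>(na);
    const double m = static_cast<double>(nb);
    return 1.358 * std::sqrt((n + m) / (n * m));
}
```

A distance below `ks_critical_5pct` is consistent with both samples being drawn from the same distribution at the 5% level.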

Circularity Check

0 steps flagged

No circularity in empirical performance benchmarking

The paper describes a CUDA implementation of the GAPS parton shower (initial/final-state emissions plus NLO matching) and reports direct wall-clock and energy measurements for NLO Z production at the LHC. These results are obtained by running the identical generator on GPU versus a 96-core Xeon cluster; no equations, fitted parameters, or predictions are derived that reduce to the inputs by construction. Self-citations to prior GAPS versions exist but are not load-bearing for the timing claims, which rest on hardware execution rather than any self-referential theorem or ansatz. The manuscript is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claim rests on the assumption that the parton-shower physics model is correctly implemented on both platforms and that the GPU version introduces no systematic biases relative to the CPU reference.

axioms (1)

domain assumption: Standard parton-shower algorithms accurately capture the dominant QCD radiation patterns in initial- and final-state emissions.
This is the foundational modeling assumption inherited from established Monte Carlo generators in hep-ph.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem.
"We release version 2 of the CUDA C++ parton shower event generator GAPS, which can simulate initial and final state emissions on a GPU and is capable of hard-process matching."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.
"the parallelised veto algorithm... Generate Trial Emission... Calculate Acceptance Probability"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 15 internal anchors

[1] E. Bothmann, W. Giele, S. Hoeche, J. Isaacson and M. Knobbe, Many-gluon tree amplitudes on modern GPUs: A case study for novel event generators, SciPost Phys. Codeb. 2022, 3 (2022), doi:10.21468/SciPostPhysCodeb.3, 2106.06507
[2] S. Borowka, G. Heinrich, S. Jahn, S. P. Jones, M. Kerner and J. Schlenk, A GPU compatible quasi-Monte Carlo integrator interfaced to pySecDec, Comput. Phys. Commun. 240, 120 (2019), doi:10.1016/j.cpc.2019.02.015, 1811.11720
[3] G. Heinrich, S. P. Jones, M. Kerner, V. Magerya, A. Olsson and J. Schlenk, Numerical scattering amplitudes with pySecDec, Comput. Phys. Commun. 295, 108956 (2024), doi:10.1016/j.cpc.2023.108956, 2305.19768
[4] J. M. Cruz-Martinez, G. De Laurentis and M. Pellen, Accelerating Berends–Giele recursion for gluons in arbitrary dimensions over finite fields, Eur. Phys. J. C 85(5), 590 (2025), doi:10.1140/epjc/s10052-025-14318-3, 2502.07060
[5] A. Buckley, J. Ferrando, S. Lloyd, K. Nordström, B. Page, M. Rüfenacht, M. Schönherr and G. Watt, LHAPDF6: parton density access in the LHC precision era, Eur. Phys. J. C 75, 132 (2015), doi:10.1140/epjc/s10052-015-3318-8, 1412.7420
[6] S. Carrazza, J. M. Cruz-Martinez and M. Rossi, PDFFlow: Parton distribution functions on GPU, Comput. Phys. Commun. 264, 107995 (2021), doi:10.1016/j.cpc.2021.107995, 2009.06635
[7] E. Bothmann, T. Childers, W. Giele, S. Höche, J. Isaacson and M. Knobbe, A portable parton-level event generator for the high-luminosity LHC, SciPost Phys. 17(3), 081 (2024), doi:10.21468/SciPostPhys.17.3.081, 2311.06198
[8] S. Hageböck, D. Massaro, O. Mattelaer, S. Roiser, A. Valassi and Z. Wettersten, Data-parallel leading-order event generation in MadGraph5_aMC@NLO (2025), 2507.21039
[9] S. Carrazza, J. Cruz-Martinez, M. Rossi and M. Zaro, MadFlow: automating Monte Carlo simulation on GPU for particle physics processes, Eur. Phys. J. C 81(7), 656 (2021), doi:10.1140/epjc/s10052-021-09443-8, 2106.10279
[10] NVIDIA Corporation & affiliates, CUDA C++ Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#, Accessed: 2025-08-11
[11] M. H. Seymour and S. Sule, An algorithm to parallelise parton showers on a GPU, SciPost Phys. Codebases p. 33 (2024), doi:10.21468/SciPostPhysCodeb.33
[12] M. H. Seymour and S. Sule, Codebase release 1.1 for GAPS, SciPost Phys. Codebases pp. 33–r1.1 (2024), doi:10.21468/SciPostPhysCodeb.33-r1.1
[13] Intel Corporation, Intel Xeon Processor E5-2620 v4 (20M Cache, 2.10 GHz) Specifications, https://www.intel.com/content/www/us/en/products/sku/92986/intel-xeon-processor-e52620-v4-20m-cache-2-10-ghz/specifications.html, Accessed: 2025-09-06
[14] NVIDIA Corporation & affiliates, NVIDIA Tesla V100, https://www.nvidia.com/en-gb/data-center/v100/, Accessed: 2025-08-11
[15] NVIDIA Corporation & affiliates, Thrust: The C++ Parallel Algorithms Library, https://nvidia.github.io/cccl/thrust/, Accessed: 2025-08-11
[16] NVIDIA Corporation & affiliates, NVIDIA A100, https://www.nvidia.com/en-gb/data-center/a100/, Accessed: 2025-08-11
[17] B. van Werkhoven, Kernel tuner: A search-optimizing gpu code auto-tuner, Future Generation Computer Systems 90, 347 (2019), doi:10.1016/j.future.2018.08.004
[18] F. Petrovič and J. Filipovič, Kernel tuning toolkit, SoftwareX 22, 101385 (2023), doi:10.1016/j.softx.2023.101385
[19] M. Cacciari, G. P. Salam and G. Soyez, The anti-kt jet clustering algorithm, JHEP 04, 063 (2008), doi:10.1088/1126-6708/2008/04/063, 0802.1189
[20] C. Bierlich, A. Buckley, J. M. Butterworth, C. Gütschow, L. Lönnblad, T. Procter, P. Richardson and Y. Yeh, Robust independent validation of experiment and theory: Rivet version 4 release note, SciPost Phys. Codeb. 36, 1 (2024), doi:10.21468/SciPostPhysCodeb.36, 2404.15984
[21] A. Buckley, L. Corpe, M. Filipovich, C. Gütschow, N. Rozinsky, S. Thor, Y. Yeh and J. Yellen, Consistent, multidimensional differential histogramming and summary statistics with YODA 2, SciPost Phys. Codeb. 45 (2023), doi:10.21468/SciPostPhysCodeb.45, 2312.15070
[22] G. Bewick et al., Herwig 7.3 Release Note, Eur. Phys. J. C 84(10), 1053 (2024), doi:10.1140/epjc/s10052-024-13211-9, 2312.05175
[23] J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H. S. Shao, T. Stelzer, P. Torrielli and M. Zaro, The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations, JHEP 07, 079 (2014), doi:10.1007/JHEP07(2014)079, 1405.0301
[24] F. Buccioni, J. N. Lang, J. M. Lindert, P. Maierhöfer, S. Pozzorini, H. Zhang and M. F. Zoller, OpenLoops 2, Eur. Phys. J. C 79(10), 866 (2019), doi:10.1140/epjc/s10052-019-7306-2, 1907.13071
[25] S. Plätzer and S. Gieseke, Dipole Showers and Automated NLO Matching in Herwig++, Eur. Phys. J. C 72, 2187 (2012), doi:10.1140/epjc/s10052-012-2187-7, 1109.6256
[26] NVIDIA Corporation & affiliates, NVIDIA Nsight Systems, https://developer.nvidia.com/nsight-systems, Accessed: 2025-08-17 (2024)
[27] Intel Corporation, Intel Xeon Processor Gold 5220R (35.75M Cache, 2.20 GHz) Specifications, https://www.intel.com/content/www/us/en/products/sku/199354/intel-xeon-gold-5220r-processor-35-75m-cache-2-20-ghz/specifications.html, Accessed: 2025-09-06
[28] K. Lottick, S. Susai, S. A. Friedler and J. P. Wilson, Energy usage reports: Environmental awareness as part of algorithmic accountability, In Workshop on Tackling Climate Change with Machine Learning at NeurIPS 2019 (2019), 1911.08354
[29] CodeCarbon Development Team, CodeCarbon: Track and Reduce Your Carbon Emissions from Computing, https://codecarbon.io/, Accessed: 2025-08-17
[30] NVIDIA Corporation & affiliates, CUDA GPU Compute Capability, https://developer.nvidia.com/cuda-gpus#, Accessed: 2025-10-26
[31] R. K. Ellis, W. J. Stirling and B. R. Webber, QCD and collider physics, vol. 8, Cambridge University Press, ISBN 978-0-511-82328-2, 978-0-521-54589-1, doi:10.1017/CBO9780511628788 (2011)
[32] J. Campbell, J. Huston and F. Krauss, The Black Book of Quantum Chromodynamics: a Primer for the LHC Era, Oxford University Press, ISBN 978-0-19-965274-7, doi:10.1093/oso/9780199652747.001.0001 (2018)
[33] A. Buckley et al., General-purpose event generators for LHC physics, Phys. Rept. 504, 145 (2011), doi:10.1016/j.physrep.2011.03.005, 1101.2599
[34] S. Höche, Introduction to parton-shower event generators, In Theoretical Advanced Study Institute in Elementary Particle Physics: Journeys Through the Precision Frontier: Amplitudes for Colliders, pp. 235–295, doi:10.1142/9789814678766_0005 (2015), 1411.4085
[35] Yu. L. Dokshitzer, Calculation of the Structure Functions for Deep Inelastic Scattering and e+ e- Annihilation by Perturbation Theory in Quantum Chromodynamics, Sov. Phys. JETP 46, 641 (1977)
[36] V. N. Gribov and L. N. Lipatov, Deep inelastic e p scattering in perturbation theory, Sov. J. Nucl. Phys. 15, 438 (1972)
[37] G. Altarelli and G. Parisi, Asymptotic Freedom in Parton Language, Nucl. Phys. B 126, 298 (1977), doi:10.1016/0550-3213(77)90384-4
[38] L. Lönnblad, ARIADNE version 4: A Program for simulation of QCD cascades implementing the color dipole model, Comput. Phys. Commun. 71, 15 (1992), doi:10.1016/0010-4655(92)90068-A
[39] Z. Nagy and D. E. Soper, A New parton shower algorithm: Shower evolution, matching at leading and next-to-leading order level, In Ringberg Workshop on New Trends in HERA Physics 2005, pp. 101–123, doi:10.1142/9789812773524_0010 (2006), hep-ph/0601021
[40] M. Dinsdale, M. Ternick and S. Weinzierl, Parton showers from the dipole formalism, Phys. Rev. D 76, 094003 (2007), doi:10.1103/PhysRevD.76.094003, 0709.1026
[41] S. Schumann and F. Krauss, A Parton shower algorithm based on Catani-Seymour dipole factorisation, JHEP 03, 038 (2008), doi:10.1088/1126-6708/2008/03/038, 0709.1027
[42] S. Catani and M. H. Seymour, A General algorithm for calculating jet cross-sections in NLO QCD, Nucl. Phys. B 485, 291 (1997), doi:10.1016/S0550-3213(96)00589-5, [Erratum: Nucl. Phys. B 510, 503–504 (1998)], hep-ph/9605323
[43] J. C. Winter and F. Krauss, Initial-state showering based on colour dipoles connected to incoming parton lines, JHEP 07, 040 (2008), doi:10.1088/1126-6708/2008/07/040, 0712.3913
[44] S. Plätzer and S. Gieseke, Coherent Parton Showers with Local Recoils, JHEP 01, 024 (2011), doi:10.1007/JHEP01(2011)024, 0909.5593
[45] T. Sjöstrand and P. Z. Skands, Transverse-momentum-ordered showers and interleaved multiple interactions, Eur. Phys. J. C 39, 129 (2005), doi:10.1140/epjc/s2004-02084-y, hep-ph/0408302
[46] M. Dasgupta, F. A. Dreyer, K. Hamilton, P. F. Monni, G. P. Salam and G. Soyez, Parton showers beyond leading logarithmic accuracy, Phys. Rev. Lett. 125(5), 052002 (2020), doi:10.1103/PhysRevLett.125.052002, 2002.11114
[47] J. R. Forshaw, J. Holguin and S. Plätzer, Building a consistent parton shower, JHEP 09, 014 (2020), doi:10.1007/JHEP09(2020)014, 2003.06400
[48] F. Herren, S. Höche, F. Krauss, D. Reichelt and M. Schönherr, A new approach to color-coherent parton evolution, JHEP 10, 091 (2023), doi:10.1007/JHEP10(2023)091, 2208.06057
[49] Z. Nagy and D. E. Soper, Summations of large logarithms by parton showers, Phys. Rev. D 104(5), 054049 (2021), doi:10.1103/PhysRevD.104.054049, 2011.04773
[50] C. T. Preuss, A partitioned dipole-antenna shower with improved transverse recoil, JHEP 07, 161 (2024), doi:10.1007/JHEP07(2024)161, 2403.19452
[51] S. Plätzer, ExSample: A Library for Sampling Sudakov-Type Distributions, Eur. Phys. J. C 72, 1929 (2012), doi:10.1140/epjc/s10052-012-1929-x, 1108.6182
[52] M. Bähr, S. Gieseke and M. H. Seymour, Simulation of multiple partonic interactions in Herwig++, JHEP 07, 076 (2008), doi:10.1088/1126-6708/2008/07/076, 0803.3633
[53] M. H. Seymour, A Simple prescription for first order corrections to quark scattering and annihilation processes, Nucl. Phys. B 436, 443 (1995), doi:10.1016/0550-3213(94)00554-R, hep-ph/9410244
[54] T. Plehn, D. Rainwater and P. Z. Skands, Squark and gluino production with jets, Phys. Lett. B 645, 217 (2007), doi:10.1016/j.physletb.2006.12.009, hep-ph/0510144
[55] J. Bellm, G. Nail, S. Plätzer, P. Schichtel and A. Siódmok, Parton Shower Uncertainties with Herwig 7: Benchmarks at Leading Order, Eur. Phys. J. C 76(12), 665 (2016), doi:10.1140/epjc/s10052-016-4506-x, 1605.01338

[1] [1]

Bothmann, W

E. Bothmann, W . Giele, S. Hoeche, J. Isaacson and M. Knobbe, Many-gluon tree ampli- tudes on modern GPUs: A case study for novel event generators, SciPost Phys. Codeb.2022, 3 (2022), doi:10.21468 /SciPostPhysCodeb.3, 2106.06507

work page arXiv 2022

[2] [2]

A GPU compatible quasi-Monte Carlo integrator interfaced to pySecDec

S. Borowka, G. Heinrich, S. Jahn, S. P . Jones, M. Kerner and J. Schlenk,A GPU compatible quasi-Monte Carlo integrator interfaced to pySecDec , Comput. Phys. Commun. 240, 120 (2019), doi:10.1016 /j.cpc.2019.02.015, 1811.11720

work page internal anchor Pith review Pith/arXiv arXiv 2019

[3] [3]

Heinrich, S

G. Heinrich, S. P . Jones, M. Kerner, V . Magerya, A. Olsson and J. Schlenk, Numerical scattering amplitudes with pySecDec , Comput. Phys. Commun. 295, 108956 (2024), doi:10.1016/j.cpc.2023.108956, 2305.19768

work page doi:10.1016/j.cpc.2023.108956 2024

[4] [4]

J. M. Cruz-Martinez, G. De Laurentis and M. Pellen, Accelerating Berends–Giele recursion for gluons in arbitrary dimensions over finite fields , Eur. Phys. J. C 85(5), 590 (2025), doi:10.1140/epjc/s10052-025-14318-3, 2502.07060. 24 SciPost Physics Codebases Submission

work page doi:10.1140/epjc/s10052-025-14318-3 2025

[5] [5]

LHAPDF6: parton density access in the LHC precision era

A. Buckley , J. Ferrando, S. Lloyd, K. Nordström, B. Page, M. Rüfenacht, M. Schönherr and G. Watt, LHAPDF6: parton density access in the LHC precision era, Eur. Phys. J. C 75, 132 (2015), doi:10.1140 /epjc/s10052-015-3318-8, 1412.7420

work page internal anchor Pith review Pith/arXiv arXiv 2015

[6] [6]

Carrazza, J

S. Carrazza, J. M. Cruz-Martinez and M. Rossi,PDFFlow: Parton distribution functions on GPU, Comput. Phys. Commun. 264, 107995 (2021), doi:10.1016 /j.cpc.2021.107995, 2009.06635

work page arXiv 2021

[7] [7]

Bothmann, T

E. Bothmann, T . Childers, W . Giele, S. Höche, J. Isaacson and M. Knobbe, A portable parton-level event generator for the high-luminosity LHC, SciPost Phys.17(3), 081 (2024), doi:10.21468/SciPostPhys.17.3.081, 2311.06198

work page doi:10.21468/scipostphys.17.3.081 2024

[8]

S. Hageböck, D. Massaro, O. Mattelaer, S. Roiser, A. Valassi and Z. Wettersten, Data-parallel leading-order event generation in MadGraph5_aMC@NLO (2025), 2507.21039

[9]

S. Carrazza, J. Cruz-Martinez, M. Rossi and M. Zaro, MadFlow: automating Monte Carlo simulation on GPU for particle physics processes, Eur. Phys. J. C 81(7), 656 (2021), doi:10.1140/epjc/s10052-021-09443-8, 2106.10279

[10]

NVIDIA Corporation & affiliates, CUDA C++ Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#, Accessed: 2025-08-11

[11]

M. H. Seymour and S. Sule, An algorithm to parallelise parton showers on a GPU, SciPost Phys. Codebases p. 33 (2024), doi:10.21468/SciPostPhysCodeb.33

[12]

M. H. Seymour and S. Sule, Codebase release 1.1 for GAPS, SciPost Phys. Codebases pp. 33–r1.1 (2024), doi:10.21468/SciPostPhysCodeb.33-r1.1

[13]

Intel Corporation, Intel Xeon Processor E5-2620 v4 (20M Cache, 2.10 GHz) Specifications, https://www.intel.com/content/www/us/en/products/sku/92986/intel-xeon-processor-e52620-v4-20m-cache-2-10-ghz/specifications.html, Accessed: 2025-09-06

[14]

NVIDIA Corporation & affiliates, Nvidia tesla v100, https://www.nvidia.com/en-gb/data-center/v100/, Accessed: 2025-08-11

[15]

NVIDIA Corporation & affiliates, Thrust: The C++ Parallel Algorithms Library, https://nvidia.github.io/cccl/thrust/, Accessed: 2025-08-11

[16]

NVIDIA Corporation & affiliates, Nvidia a100, https://www.nvidia.com/en-gb/data-center/a100/, Accessed: 2025-08-11

[17]

B. van Werkhoven, Kernel tuner: A search-optimizing gpu code auto-tuner, Future Generation Computer Systems 90, 347 (2019), doi:10.1016/j.future.2018.08.004

[18]

F. Petrovič and J. Filipovič, Kernel tuning toolkit, SoftwareX 22, 101385 (2023), doi:10.1016/j.softx.2023.101385

[19]

M. Cacciari, G. P. Salam and G. Soyez, The anti-kt jet clustering algorithm, JHEP 04, 063 (2008), doi:10.1088/1126-6708/2008/04/063, 0802.1189

[20]

C. Bierlich, A. Buckley, J. M. Butterworth, C. Gütschow, L. Lönnblad, T. Procter, P. Richardson and Y. Yeh, Robust independent validation of experiment and theory: Rivet version 4 release note, SciPost Phys. Codeb. 36, 1 (2024), doi:10.21468/SciPostPhysCodeb.36, 2404.15984

[21]

A. Buckley, L. Corpe, M. Filipovich, C. Gütschow, N. Rozinsky, S. Thor, Y. Yeh and J. Yellen, Consistent, multidimensional differential histogramming and summary statistics with YODA 2, SciPost Phys. Codeb. 45 (2023), doi:10.21468/SciPostPhysCodeb.45, 2312.15070

[22]

G. Bewick et al., Herwig 7.3 Release Note, Eur. Phys. J. C 84(10), 1053 (2024), doi:10.1140/epjc/s10052-024-13211-9, 2312.05175

[23]

J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H. S. Shao, T. Stelzer, P. Torrielli and M. Zaro, The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations, JHEP 07, 079 (2014), doi:10.1007/JHEP07(2014)079, 1405.0301

[24]

F. Buccioni, J. N. Lang, J. M. Lindert, P. Maierhöfer, S. Pozzorini, H. Zhang and M. F. Zoller, OpenLoops 2, Eur. Phys. J. C 79(10), 866 (2019), doi:10.1140/epjc/s10052-019-7306-2, 1907.13071

[25]

S. Plätzer and S. Gieseke, Dipole Showers and Automated NLO Matching in Herwig++, Eur. Phys. J. C 72, 2187 (2012), doi:10.1140/epjc/s10052-012-2187-7, 1109.6256

[26]

NVIDIA Corporation & affiliates, NVIDIA Nsight Systems, https://developer.nvidia.com/nsight-systems, Accessed: 2025-08-17 (2024)

[27]

Intel Corporation, Intel Xeon Processor Gold 5220R (35.75M Cache, 2.20 GHz) Specifications, https://www.intel.com/content/www/us/en/products/sku/199354/intel-xeon-gold-5220r-processor-35-75m-cache-2-20-ghz/specifications.html, Accessed: 2025-09-06

[28]

K. Lottick, S. Susai, S. A. Friedler and J. P. Wilson, Energy usage reports: Environmental awareness as part of algorithmic accountability, In Workshop on Tackling Climate Change with Machine Learning at NeurIPS 2019 (2019), 1911.08354

[29]

CodeCarbon Development Team, CodeCarbon: Track and Reduce Your Carbon Emissions from Computing, https://codecarbon.io/, Accessed: 2025-08-17

[30]

NVIDIA Corporation & affiliates, CUDA GPU Compute Capability, https://developer.nvidia.com/cuda-gpus#, Accessed: 2025-10-26

[31]

R. K. Ellis, W. J. Stirling and B. R. Webber, QCD and collider physics, vol. 8, Cambridge University Press, ISBN 978-0-511-82328-2, 978-0-521-54589-1, doi:10.1017/CBO9780511628788 (2011)

[32]

J. Campbell, J. Huston and F. Krauss, The Black Book of Quantum Chromodynamics: a Primer for the LHC Era, Oxford University Press, ISBN 978-0-19-965274-7, doi:10.1093/oso/9780199652747.001.0001 (2018)

[33]

A. Buckley et al., General-purpose event generators for LHC physics, Phys. Rept. 504, 145 (2011), doi:10.1016/j.physrep.2011.03.005, 1101.2599

[34]

S. Höche, Introduction to parton-shower event generators, In Theoretical Advanced Study Institute in Elementary Particle Physics: Journeys Through the Precision Frontier: Amplitudes for Colliders, pp. 235–295, doi:10.1142/9789814678766_0005 (2015), 1411.4085

[35]

Yu. L. Dokshitzer, Calculation of the Structure Functions for Deep Inelastic Scattering and e+ e- Annihilation by Perturbation Theory in Quantum Chromodynamics., Sov. Phys. JETP 46, 641 (1977)

[36]

V. N. Gribov and L. N. Lipatov, Deep inelastic e p scattering in perturbation theory, Sov. J. Nucl. Phys. 15, 438 (1972)

[37]

G. Altarelli and G. Parisi, Asymptotic Freedom in Parton Language, Nucl. Phys. B 126, 298 (1977), doi:10.1016/0550-3213(77)90384-4

[38]

L. Lönnblad, ARIADNE version 4: A Program for simulation of QCD cascades implementing the color dipole model, Comput. Phys. Commun. 71, 15 (1992), doi:10.1016/0010-4655(92)90068-A

[39]

Z. Nagy and D. E. Soper, A New parton shower algorithm: Shower evolution, matching at leading and next-to-leading order level, In Ringberg Workshop on New Trends in HERA Physics 2005, pp. 101–123, doi:10.1142/9789812773524_0010 (2006), hep-ph/0601021

[40]

M. Dinsdale, M. Ternick and S. Weinzierl, Parton showers from the dipole formalism, Phys. Rev. D 76, 094003 (2007), doi:10.1103/PhysRevD.76.094003, 0709.1026

[41]

S. Schumann and F. Krauss, A Parton shower algorithm based on Catani-Seymour dipole factorisation, JHEP 03, 038 (2008), doi:10.1088/1126-6708/2008/03/038, 0709.1027

[42]

S. Catani and M. H. Seymour, A General algorithm for calculating jet cross-sections in NLO QCD, Nucl. Phys. B 485, 291 (1997), doi:10.1016/S0550-3213(96)00589-5, [Erratum: Nucl.Phys.B 510, 503–504 (1998)], hep-ph/9605323

[43]

J. C. Winter and F. Krauss, Initial-state showering based on colour dipoles connected to incoming parton lines, JHEP 07, 040 (2008), doi:10.1088/1126-6708/2008/07/040, 0712.3913

[44]

S. Plätzer and S. Gieseke, Coherent Parton Showers with Local Recoils, JHEP 01, 024 (2011), doi:10.1007/JHEP01(2011)024, 0909.5593

[45]

T. Sjöstrand and P. Z. Skands, Transverse-momentum-ordered showers and interleaved multiple interactions, Eur. Phys. J. C 39, 129 (2005), doi:10.1140/epjc/s2004-02084-y, hep-ph/0408302

[46]

M. Dasgupta, F. A. Dreyer, K. Hamilton, P. F. Monni, G. P. Salam and G. Soyez, Parton showers beyond leading logarithmic accuracy, Phys. Rev. Lett. 125(5), 052002 (2020), doi:10.1103/PhysRevLett.125.052002, 2002.11114

[47]

J. R. Forshaw, J. Holguin and S. Plätzer, Building a consistent parton shower, JHEP 09, 014 (2020), doi:10.1007/JHEP09(2020)014, 2003.06400

[48]

F. Herren, S. Höche, F. Krauss, D. Reichelt and M. Schönherr, A new approach to color-coherent parton evolution, JHEP 10, 091 (2023), doi:10.1007/JHEP10(2023)091, 2208.06057

[49]

Z. Nagy and D. E. Soper, Summations of large logarithms by parton showers, Phys. Rev. D 104(5), 054049 (2021), doi:10.1103/PhysRevD.104.054049, 2011.04773

[50]

C. T. Preuss, A partitioned dipole-antenna shower with improved transverse recoil, JHEP 07, 161 (2024), doi:10.1007/JHEP07(2024)161, 2403.19452

[51]

S. Plätzer, ExSample: A Library for Sampling Sudakov-Type Distributions, Eur. Phys. J. C 72, 1929 (2012), doi:10.1140/epjc/s10052-012-1929-x, 1108.6182

[52]

M. Bähr, S. Gieseke and M. H. Seymour, Simulation of multiple partonic interactions in Herwig++, JHEP 07, 076 (2008), doi:10.1088/1126-6708/2008/07/076, 0803.3633

[53]

M. H. Seymour, A Simple prescription for first order corrections to quark scattering and annihilation processes, Nucl. Phys. B 436, 443 (1995), doi:10.1016/0550-3213(94)00554-R, hep-ph/9410244

[54]

T. Plehn, D. Rainwater and P. Z. Skands, Squark and gluino production with jets, Phys. Lett. B 645, 217 (2007), doi:10.1016/j.physletb.2006.12.009, hep-ph/0510144

[55]

J. Bellm, G. Nail, S. Plätzer, P. Schichtel and A. Siódmok, Parton Shower Uncertainties with Herwig 7: Benchmarks at Leading Order, Eur. Phys. J. C 76(12), 665 (2016), doi:10.1140/epjc/s10052-016-4506-x, 1605.01338