arxiv: 2605.18360 · v2 · pith:5P4WRR26new · submitted 2026-05-18 · ✦ hep-ph

Nested-GPT for variable-multiplicity parton showers: A case study in the resummation of non-global logarithms

Pith reviewed 2026-05-21 08:02 UTC · model grok-4.3

classification ✦ hep-ph
keywords parton showers · non-global logarithms · autoregressive transformer · variable multiplicity · resummation · machine learning · dipole shower · high-energy physics

The pith

Nested-GPT generates variable-multiplicity parton showers by sequentially predicting emissions and learning a termination condition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Nested-GPT, a hierarchical autoregressive Transformer for simulating parton-shower histories whose multiplicity is not fixed in advance. The architecture predicts each emission in sequence while enforcing the ordered Markovian branching structure of a dipole shower and uses a learned condition to decide when the sequence ends. The authors test this approach on the leading-logarithmic resummation of non-global logarithms in the large-Nc limit, training on data from a stochastic Monte Carlo dipole shower and comparing gap-fraction observables under two training regimes. Generated samples from Nested-GPT agree with the reference shower within statistical uncertainties, establishing it as a consistent autoregressive surrogate for traditional shower generators.

Core claim

Nested-GPT strictly enforces the ordered Markovian branching structure, predicting emissions sequentially and dynamically evaluating a learned sequence-termination condition; the resulting generated samples agree with the reference shower within statistical uncertainties for the observables considered.
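
One generic way to write a model of this kind is given below. It is a sketch under stated assumptions, not the paper's own definition: the symbols $x_i$ (kinematics of the $i$-th emission) and $s_i \in [0,1]$ (learned probability of terminating instead of producing emission $i$) are introduced here purely for illustration. With these, the joint density over multiplicity $n$ and kinematics factorizes as

```latex
p(n, x_1, \dots, x_n)
  = \Bigl[\, \prod_{i=1}^{n} \bigl(1 - s_i(x_{<i})\bigr)\, p(x_i \mid x_{<i}) \Bigr]\,
    s_{n+1}(x_{<n+1})
```

Each factor depends only on earlier emissions, which is what makes the sequence ordered and autoregressive. The density is normalized over all $n$ provided the sequence terminates with probability one.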

What carries the argument

Nested-GPT, the hierarchical autoregressive Transformer that models sequential emission prediction together with a learned termination condition to produce variable-length shower histories.
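
The generation loop described above can be sketched in a few lines of Python. This is a toy under loud assumptions: `toy_step` is a hypothetical stand-in for the trained network, the emission fields `t`, `y`, and `phi` are illustrative names, and the stop probability is an arbitrary function rather than anything taken from the paper.

```python
import math
import random

def toy_step(history, rng):
    """Hypothetical stand-in for the trained model.

    Given the emissions so far, return (stop_probability, proposed_emission).
    """
    # Ordering variable of the last emission; evolution starts at t = 0.
    t_prev = history[-1]["t"] if history else 0.0
    # Toy assumption: stopping becomes more likely as evolution proceeds.
    p_stop = 1.0 - math.exp(-0.5 * (t_prev + 0.1))
    emission = {
        "t": t_prev + rng.expovariate(2.0),      # non-decreasing ordering variable
        "y": rng.uniform(-2.0, 2.0),             # rapidity-like coordinate
        "phi": rng.uniform(0.0, 2.0 * math.pi),  # azimuth
    }
    return p_stop, emission

def sample_history(step_fn, rng, max_emissions=64):
    """Draw one variable-length history: propose, test termination, append."""
    history = []
    while len(history) < max_emissions:
        p_stop, emission = step_fn(history, rng)
        if rng.random() < p_stop:
            break  # the learned termination condition fired
        history.append(emission)
    return history
```

Calling `sample_history(toy_step, random.Random(0))` returns a list whose length is decided during generation rather than fixed beforehand. The `max_emissions` cap is only a safety limit for the sketch.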

If this is right

  • Nested-GPT supplies a physically consistent surrogate for variable-multiplicity parton-shower generators.
  • The same architecture supports both direct training on vetoed histories and inclusive training followed by an analysis-level veto.
  • The method provides a foundation for extending the resummation treatment to subleading logarithms.
  • The results motivate further development toward finite-Nc color evolution inside the same autoregressive framework.
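
The analysis-level veto mentioned in the second bullet can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each emission is a dict with an ordering variable `t` and a boolean `in_gap`, and that the gap fraction at time `t` is the share of events with no in-gap emission up to `t`. The paper's actual observable definition may differ.

```python
def first_gap_time(history):
    """Evolution time of the earliest in-gap emission, or inf if there is none.

    The 't' and 'in_gap' keys are illustrative assumptions.
    """
    times = [e["t"] for e in history if e["in_gap"]]
    return min(times) if times else float("inf")

def gap_fraction(histories, t_values):
    """Fraction of inclusive events that survive the veto up to each t."""
    firsts = [first_gap_time(h) for h in histories]
    n = len(firsts)
    return [sum(1 for f in firsts if f > t) / n for t in t_values]

# Three toy inclusive histories.
events = [
    [{"t": 0.2, "in_gap": False}, {"t": 0.5, "in_gap": True}],
    [{"t": 0.1, "in_gap": True}],
    [{"t": 0.3, "in_gap": False}],
]
result = gap_fraction(events, [0.0, 0.3, 1.0])
```

In the direct regime the veto is already built into the training histories, so no such post-processing step is needed.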

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Enforcing the physical branching order inside the model may reduce the need for post-generation corrections in other sequential Monte Carlo simulations.
  • The dynamic termination mechanism could transfer to other variable-length processes such as hadronization or decay chains.
  • Combining the architecture with higher-order matrix elements might improve the accuracy of full event generation without manual multiplicity specification.

Load-bearing premise

The stochastic Monte Carlo dipole shower used to generate the training data correctly captures the leading-logarithmic resummation of non-global logarithms in the large-Nc limit.

What would settle it

A statistically significant mismatch between the gap-fraction distributions produced by Nested-GPT samples and those from the reference dipole shower, beyond the reported statistical uncertainties, would falsify the agreement claim.
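
One simple way to quantify such a mismatch is a per-point pull between the two gap-fraction estimates. The Python sketch below assumes binomial uncertainties and independent model and reference samples. The function name and inputs are hypothetical, and the paper may use a different uncertainty estimate.

```python
import math

def gap_fraction_pulls(k_model, n_model, k_ref, n_ref):
    """Pull per evaluation point: (p_model - p_ref) / combined binomial error.

    k_* are counts of events passing the veto; n_* are total event counts.
    """
    pulls = []
    for km, kr in zip(k_model, k_ref):
        pm, pr = km / n_model, kr / n_ref
        diff = pm - pr
        var = pm * (1 - pm) / n_model + pr * (1 - pr) / n_ref
        if var > 0:
            pulls.append(diff / math.sqrt(var))
        else:
            # Both estimates sit at 0 or 1: zero error, so any gap is infinite.
            pulls.append(0.0 if diff == 0 else math.copysign(math.inf, diff))
    return pulls
```

Gap fractions at different evolution times are computed from the same events and are therefore correlated, so the pulls should be read point by point and not combined as if they were independent.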

Figures

Figures reproduced from arXiv: 2605.18360 by Ding Yu Shao, Hao-Zhe Shi, Wanchen Li, Yu-Xuan Sun.

Figure 1. Architecture of the Nested-GPT model.
Figure 2. Comparison of the gap fraction …
Figure 3. Training history of the Nested-GPT model on the …
Figure 4. Comparison of the inclusive shower samples gen…
Figure 5. Events from both models are terminated after genera…
Original abstract

We introduce Nested-GPT, a hierarchical autoregressive Transformer architecture for simulating variable-multiplicity parton-shower histories. As a controlled benchmark, we study the leading-logarithmic resummation of non-global logarithms in the large-$N_c$ limit, utilizing a stochastic Monte Carlo dipole shower to generate reference training data. We systematically evaluate Nested-GPT against a Transformer flow-matching baseline. The flow-matching framework successfully parameterizes the joint distribution of emission kinematics at fixed multiplicity. Its phase-space representation, however, requires the final number of emissions to be specified externally rather than generated dynamically. Conversely, Nested-GPT strictly enforces the ordered Markovian branching structure, predicting emissions sequentially and dynamically evaluating a learned sequence-termination condition. We benchmark both approaches using gap-fraction observables under two complementary training regimes: direct training on vetoed histories and inclusive training followed by an analysis-level veto. The resulting generated samples agree with the reference shower within statistical uncertainties for the observables considered. These results establish Nested-GPT as a physically consistent autoregressive surrogate for variable-multiplicity shower generators and motivate extensions to subleading-logarithmic resummation and finite-$N_c$ color evolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Nested-GPT, a hierarchical autoregressive Transformer for generating variable-multiplicity parton-shower histories. Using a stochastic Monte Carlo dipole shower as reference in the large-N_c limit, it benchmarks the model on leading-logarithmic resummation of non-global logarithms. Nested-GPT is compared to a flow-matching Transformer baseline; the former enforces ordered Markovian branching with a learned termination condition while the latter requires external multiplicity specification. Generated samples are reported to agree with the reference within statistical uncertainties for gap-fraction observables under direct-veto and inclusive-plus-analysis-veto training regimes.

Significance. If the central results hold, the work provides a concrete demonstration that autoregressive Transformers can serve as physically consistent surrogates for variable-multiplicity showers by dynamically learning both emission kinematics and sequence termination. The explicit comparison to flow-matching and the use of two complementary training regimes are strengths that clarify the advantages of the Markovian structure for non-global logarithm resummation.

major comments (1)
  1. [Abstract] Abstract and benchmark setup: the central claim that Nested-GPT reproduces the reference shower via a learned sequence-termination condition is supported only by agreement on gap-fraction observables. No explicit validation of the multiplicity distribution or of the termination probability as a function of gap configuration is described, leaving open whether the joint distribution over emission number and kinematics remains faithful at multiplicities or phase-space points sparsely sampled in training.
minor comments (2)
  1. Notation for the hierarchical autoregressive structure and the precise definition of the learned termination condition should be clarified with an explicit equation or pseudocode block.
  2. Figure captions for the gap-fraction plots should state the statistical uncertainty bands and the number of generated events used for each curve.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for the constructive feedback. We address the major comment below and indicate the revisions made to strengthen the validation of the model.

Point-by-point responses
  1. Referee: [Abstract] Abstract and benchmark setup: the central claim that Nested-GPT reproduces the reference shower via a learned sequence-termination condition is supported only by agreement on gap-fraction observables. No explicit validation of the multiplicity distribution or of the termination probability as a function of gap configuration is described, leaving open whether the joint distribution over emission number and kinematics remains faithful at multiplicities or phase-space points sparsely sampled in training.

    Authors: We thank the referee for this observation. The gap-fraction observables directly probe the leading-logarithmic resummation of non-global logarithms and are sensitive to the interplay between emission kinematics and the number of branchings. Nevertheless, we agree that explicit checks on the multiplicity distribution and the learned termination probability provide valuable additional evidence. In the revised manuscript we have added new figures comparing the generated multiplicity distributions to the reference Monte Carlo shower for both the direct-veto and inclusive-plus-analysis-veto training regimes; these distributions agree within statistical uncertainties over the full range of multiplicities populated by the reference. We have also included a plot of the termination probability conditioned on gap configuration, which reproduces the expected dependence arising from the ordered Markovian branching structure. These additions confirm that the joint distribution over emission number and kinematics is reproduced faithfully, including in regions with lower training statistics. revision: yes
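
The multiplicity comparison described in this response could be scored with a two-sample chi-square statistic over multiplicity bins. The Python sketch below is an illustrative assumption and not the authors' procedure. The function name is hypothetical, and the inputs are plain lists of emission counts per event.

```python
from collections import Counter

def multiplicity_chi2(lengths_model, lengths_ref):
    """Two-sample chi-square over multiplicity bins for unequal sample sizes.

    Returns (chi2, number_of_populated_bins).
    """
    cm, cr = Counter(lengths_model), Counter(lengths_ref)
    nm, nr = len(lengths_model), len(lengths_ref)
    chi2 = 0.0
    bins = sorted(set(cm) | set(cr))
    for n in bins:
        km, kr = cm[n], cr[n]
        # Difference of normalized frequencies, scaled by both sample sizes.
        chi2 += nm * nr * (km / nm - kr / nr) ** 2 / (km + kr)
    return chi2, len(bins)
```

Bins with very few events make this statistic unreliable, which is the same low-statistics regime the referee's comment is concerned with. In practice sparse high-multiplicity bins would be merged before comparing.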

Circularity Check

0 steps flagged

No circularity: validation against external Monte Carlo reference shower

Full rationale

The paper generates training data from an independent stochastic Monte Carlo dipole shower and shows that Nested-GPT samples agree with this reference within statistical uncertainties on gap-fraction observables. The architecture enforces ordered Markovian branching by construction and learns a termination condition from the external data; neither step reduces to a self-referential fit or self-citation. A separate flow-matching baseline is used for comparison, providing an independent check. No load-bearing claim relies on prior author work or renames a known result as a new derivation. The central result is empirical reproduction of an external generator, which is self-contained and falsifiable against the reference.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The approach rests on the accuracy of the reference Monte Carlo dipole shower and the validity of the large-Nc leading-log approximation; no new physical entities are postulated.

free parameters (1)
  • Transformer hyperparameters and training schedule
    Model depth, width, learning rate, and termination threshold are chosen or optimized during training on the reference data.
axioms (2)
  • [domain assumption] Parton-shower evolution can be represented as an ordered Markovian branching process
    Invoked in the design of the autoregressive prediction and sequence-termination condition.
  • [domain assumption] The large-Nc limit plus leading-log resummation is adequately captured by the stochastic dipole shower used for training data
    Basis for the benchmark observables and reference samples.
invented entities (1)
  • Nested-GPT hierarchical autoregressive Transformer [no independent evidence]
    purpose: To generate variable-multiplicity parton-shower histories while enforcing ordered branching
    New model architecture introduced for this task.

pith-pipeline@v0.9.0 · 5751 in / 1470 out tokens · 43426 ms · 2026-05-21T08:02:51.981891+00:00 · methodology
