arxiv: 2505.03258 · v2 · submitted 2025-05-06 · ✦ hep-ph · hep-ex

IAFormer: Interaction-Aware Transformer network for collider data analysis

Pith reviewed 2026-05-22 17:09 UTC · model grok-4.3

classification ✦ hep-ph hep-ex
keywords Transformer architecture · Collider data analysis · Sparse attention · Particle interactions · Top quark tagging · Quark-gluon discrimination · Boost-invariant observables · Model efficiency

The pith

IAFormer tags top-quark jets and separates quark jets from gluon jets at state-of-the-art accuracy while using far fewer parameters than earlier particle transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents IAFormer as a Transformer architecture designed for collider event classification. It builds the attention matrix from a fixed set of boost-invariant pairwise particle quantities instead of learning all interactions from scratch, then applies differential attention to create a sparse, dynamic focus on the most relevant tokens. This combination cuts computational cost by more than an order of magnitude relative to the Particle Transformer while preserving or exceeding its accuracy on the standard top-tagging and quark-gluon datasets. The authors further show through interpretability tools that the sparse layers progressively assemble physically meaningful features that remain stable against statistical fluctuations in the input.

Core claim

IAFormer achieves state-of-the-art performance on top and quark-gluon classification by making the attention matrix depend on predefined boost-invariant pairwise quantities and by replacing standard attention with a differential sparse mechanism that dynamically prioritizes informative particle tokens, thereby reducing model size and computation without loss of accuracy.

What carries the argument

Dynamic sparse attention whose matrix is constructed from a fixed set of boost-invariant pairwise quantities, combined with differential attention to select relevant tokens on the fly.
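
As a rough illustration of this mechanism, the following NumPy sketch forms two softmax attention maps, adds a pairwise term to each, and subtracts the second map scaled by β. It is a minimal single-head sketch and not the authors' implementation: the function names, the choice to add the pairwise term as an additive bias to both maps, and the scalar β are assumptions made for illustration, and the paper's exact parameterization may differ.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)


def differential_attention(q1, k1, q2, k2, v, bias1, bias2, beta):
    """Single-head differential attention with a pairwise bias (illustrative).

    q1, k1, q2, k2: (n_particles, d) query/key projections for the two maps.
    v:              (n_particles, d_v) value projection.
    bias1, bias2:   (n_particles, n_particles) terms derived from pairwise
                    quantities; adding them to the logits is an assumption.
    beta:           scalar weight of the subtracted map (trainable in the paper).
    """
    scale = np.sqrt(q1.shape[-1])
    map1 = softmax(q1 @ k1.T / scale + bias1)
    map2 = softmax(q2 @ k2.T / scale + bias2)
    attn = map1 - beta * map2  # weight shared by both maps is reduced
    return attn @ v, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 6, 4
    q1, k1, q2, k2, v = (rng.standard_normal((n, d)) for _ in range(5))
    bias = rng.standard_normal((n, n))
    out, attn = differential_attention(q1, k1, q2, k2, v, bias, bias, beta=0.5)
    print(out.shape, attn.shape)
```

With β = 0 this reduces to ordinary biased softmax attention. As β grows, weight that both maps assign to the same tokens is suppressed, which is the sense in which the subtraction promotes sparsity.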

If this is right

  • Large collider datasets can be processed with substantially lower memory and runtime requirements.
  • Sparse attention layers produce outputs that remain stable under statistical fluctuations in the input events.
  • Layer-wise interpretability reveals that physically relevant features are assembled progressively through the sparse mechanism.
  • Domain-specific pairwise quantities can be injected into Transformer attention to reduce parameter count while retaining performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same construction could be tested on regression or anomaly-detection tasks in collider physics where full attention is currently too expensive.
  • If the predefined quantities prove sufficient across more datasets, future models might be initialized with far fewer learned interaction parameters.
  • Real-time event selection at high-luminosity colliders could become feasible with networks of this reduced complexity.

Load-bearing premise

The chosen set of predefined boost-invariant pairwise quantities already encodes all the interaction information needed for high-accuracy classification on these datasets.
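
To make the premise concrete, the sketch below computes a few pairwise quantities commonly used in jet tagging from per-particle (pT, η, φ, E): Δη, Δφ, ΔR, and the squared invariant mass of each pair. The exact set that IAFormer uses is defined in the paper and is not reproduced here, so this particular selection, the function name, and the array layout are assumptions. For massless particles, Δη and Δφ are unchanged by boosts along the beam axis, and the pair invariant mass is Lorentz invariant.

```python
import numpy as np


def pairwise_features(pt, eta, phi, energy):
    """Return an (n, n, 4) array of [d_eta, d_phi, delta_r, m2] for all pairs."""
    d_eta = eta[:, None] - eta[None, :]
    # Wrap the azimuthal difference into [-pi, pi).
    d_phi = (phi[:, None] - phi[None, :] + np.pi) % (2.0 * np.pi) - np.pi
    delta_r = np.hypot(d_eta, d_phi)

    # Cartesian momentum components of each particle.
    px, py, pz = pt * np.cos(phi), pt * np.sin(phi), pt * np.sinh(eta)

    def pair_sum(x):
        return x[:, None] + x[None, :]

    # Squared invariant mass of the summed four-momentum of each pair.
    m2 = (pair_sum(energy) ** 2 - pair_sum(px) ** 2
          - pair_sum(py) ** 2 - pair_sum(pz) ** 2)
    m2 = np.clip(m2, 0.0, None)  # remove tiny negatives from rounding
    return np.stack([d_eta, d_phi, delta_r, m2], axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pt = rng.uniform(1.0, 50.0, size=5)
    eta = rng.uniform(-2.0, 2.0, size=5)
    phi = rng.uniform(-np.pi, np.pi, size=5)
    energy = pt * np.cosh(eta)  # massless-particle assumption
    print(pairwise_features(pt, eta, phi, energy).shape)
```

Because these quantities are fixed functions of the inputs, they add no trainable parameters by themselves; any learned projection applied to them before they enter the attention logits would.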

What would settle it

A direct comparison on the same top and quark-gluon datasets in which a standard particle transformer or a version of IAFormer without the predefined pairwise quantities reaches equal or higher accuracy at comparable or lower computational cost.

Figures

Figures reproduced from arXiv: 2505.03258 by A. Hammad, M. Nojiri, W. Esmail.

Figure 1. Schematic architecture of IAFormer network. Additionally, a dynamic sparse attention pattern is included following the idea of “differential attention” [51]. The differential attention utilizes the attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the sparse attention patterns dynamically. The model has a trainable parameter β to contr…
Figure 2. Distribution of β across all IAFormer layers. To better understand the role of β, we analyze three different random seeds. Interestingly, all distributions exhibit a similar pattern: β values increase in the initial layers, reach a maximum value, and then decrease in the later layers. Moreover, the network classification accuracy improves for higher β; the blue line corresponds to the best…
Figure 3. β distribution for quark-gluon test data set for three different seed numbers. For quark-gluon tagging, the number of layers of the IAFormer architecture is reduced to 14
Figure 4. Attention maps of the final self-attention layer of…
Figure 5. Linear CKA similarity for IAFormer (left), Transformer + I_{i,j} (middle), and Plain Transformer (right) using 1000 test events from the top jet dataset. The axes represent the attention layers in each network, while the colour bar indicates the CKA values. To construct the Gram matrices, M and N, we consider 1000 test events and average over the feature dimension, resulting in Gram matrices of siz…
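
For readers unfamiliar with the similarity measure in Figure 5, linear CKA between two layer representations can be sketched as follows. The code uses the standard Gram-matrix definition from Kornblith et al., "Similarity of neural network representations revisited". The input convention (rows are events, columns are features) and the function names are assumptions; the paper's own preprocessing, such as how it averages over the feature dimension, is not reproduced here.

```python
import numpy as np


def center_gram(gram):
    """Double-center a Gram matrix: H @ K @ H with H = I - (1/n) * ones."""
    n = gram.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    return centering @ gram @ centering


def linear_cka(x, y):
    """Linear CKA between representations x (n, d1) and y (n, d2).

    Both inputs must describe the same n events in the same order.
    """
    gram_x = center_gram(x @ x.T)
    gram_y = center_gram(y @ y.T)
    # For symmetric matrices the elementwise sum equals trace(gram_x @ gram_y).
    hsic_xy = np.sum(gram_x * gram_y)
    # np.linalg.norm of a 2-D array defaults to the Frobenius norm.
    return hsic_xy / (np.linalg.norm(gram_x) * np.linalg.norm(gram_y))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer_a = rng.standard_normal((200, 16))
    layer_b = rng.standard_normal((200, 32))
    print(linear_cka(layer_a, layer_a), linear_cka(layer_a, layer_b))
```

A value near 1 means two layers encode the same event-to-event similarity structure up to rotation and overall scale; applying this to every pair of layers gives a layer-by-layer matrix like those in the figure.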
Original abstract

In this paper, we introduce IAFormer, a novel Transformer-based architecture that efficiently integrates pairwise particle interactions through a dynamic sparse attention mechanism. IAFormer has two new mechanisms within the model. First, the attention matrix depends on predefined boost invariant pairwise quantities, reducing the network parameters significantly from the original particle transformer models. Second, IAFormer incorporates the sparse attention mechanism by utilizing the "differential attention", so that it can dynamically prioritize relevant particle tokens while reducing computational overhead associated with less informative ones. This approach significantly lowers the model complexity without compromising performance. Despite being computationally efficient by more than an order of magnitude than the Particle Transformer network, IAFormer achieves state-of-the-art performance in classification tasks on the top and quark-gluon datasets. Furthermore, we employ AI interpretability techniques, verifying that the model effectively captures physically meaningful information layer by layer through its sparse attention mechanism, building an efficient network output that is resistant to statistical fluctuations. IAFormer highlights the need for sparse attention in Transformer analysis to reduce the network size while improving its performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces IAFormer, a Transformer architecture for collider data analysis that conditions the attention matrix on predefined boost-invariant pairwise particle quantities and employs a differential sparse attention mechanism. It claims this yields more than an order of magnitude computational efficiency gain over the Particle Transformer while achieving state-of-the-art classification performance on top-quark and quark-gluon jet tagging datasets, with additional support from layer-wise AI interpretability analysis showing capture of physically meaningful features.

Significance. If the performance and efficiency claims are substantiated, the work could offer a practical advance for scalable machine learning in high-energy physics by incorporating domain-specific invariants to reduce parameters and computation. The interpretability component is a constructive element that may help validate physical relevance in jet classification tasks.

major comments (2)
  1. [Abstract and results] Abstract and results sections: the central claims of SOTA performance on the top and quark-gluon datasets together with an order-of-magnitude efficiency improvement over the Particle Transformer are presented without any reported details on training procedure, data splits, baseline implementations, statistical uncertainties, or error bars. These omissions directly undermine evaluation of the empirical assertions that constitute the paper's primary contribution.
  2. [Architecture description (methods)] Architecture description (methods): the model deliberately fixes the set of boost-invariant pairwise quantities used to condition the attention matrix rather than learning interactions from embeddings. No ablation is described that replaces this fixed set with a learned pairwise module, leaving open whether the reported accuracy and efficiency gains are attributable to the differential attention or to an unusually well-matched choice of input features for these particular datasets.
minor comments (2)
  1. [Abstract] The abstract states that the model is 'resistant to statistical fluctuations' but provides no quantitative metric or test (e.g., variance across seeds or robustness to input perturbations) to support this phrasing.
  2. [Methods] Notation for the differential attention mechanism should be introduced with an explicit equation or pseudocode early in the methods section to allow readers to reproduce the sparsity implementation.
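
One way to quantify the stability raised in the first minor comment is to report the spread of a metric over independently seeded runs together with the change in model output under small input perturbations. The sketch below shows both. The function names, the Gaussian noise model, and the toy stand-in for a trained network are assumptions for illustration, and no numbers from the paper are used.

```python
import numpy as np


def summarize_runs(metric_per_seed):
    """Mean and sample standard deviation of a metric over seeded runs."""
    values = np.asarray(metric_per_seed, dtype=float)
    return values.mean(), values.std(ddof=1)


def perturbation_shift(model_fn, inputs, noise_scale, rng):
    """Mean absolute change in model output under Gaussian input noise."""
    noisy = inputs + noise_scale * rng.standard_normal(inputs.shape)
    return float(np.mean(np.abs(model_fn(noisy) - model_fn(inputs))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    def toy_model(x):
        # Hypothetical stand-in for a trained classifier: sigmoid of the feature sum.
        return 1.0 / (1.0 + np.exp(-x.sum(axis=1)))

    events = rng.standard_normal((100, 4))
    print(perturbation_shift(toy_model, events, noise_scale=0.01, rng=rng))
```

Reporting the per-seed mean and standard deviation next to each accuracy or background-rejection figure would give the quantitative support that the comment asks for.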

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major point below and have revised the manuscript to improve reproducibility and strengthen the analysis.

Point-by-point responses
  1. Referee: [Abstract and results] Abstract and results sections: the central claims of SOTA performance on the top and quark-gluon datasets together with an order-of-magnitude efficiency improvement over the Particle Transformer are presented without any reported details on training procedure, data splits, baseline implementations, statistical uncertainties, or error bars. These omissions directly undermine evaluation of the empirical assertions that constitute the paper's primary contribution.

    Authors: We agree that these details are necessary for proper evaluation. The revised manuscript includes an expanded Experimental Setup section with: complete training procedure (optimizer, learning rate, epochs, batch size); explicit data splits for both datasets; implementation details for the Particle Transformer baseline to ensure fair comparison; and results reported with statistical uncertainties and error bars from five independent runs with different random seeds. These additions directly support the reported performance and efficiency claims. revision: yes

  2. Referee: [Architecture description (methods)] Architecture description (methods): the model deliberately fixes the set of boost-invariant pairwise quantities used to condition the attention matrix rather than learning interactions from embeddings. No ablation is described that replaces this fixed set with a learned pairwise module, leaving open whether the reported accuracy and efficiency gains are attributable to the differential attention or to an unusually well-matched choice of input features for these particular datasets.

    Authors: The fixed boost-invariant quantities (invariant mass, Δη, Δφ) are a deliberate choice grounded in established collider physics to embed domain knowledge and minimize parameters. This design lets the differential attention focus on higher-order effects. We acknowledge the value of an ablation. The revised manuscript now includes a new ablation study replacing the fixed set with a learned pairwise module; the updated results and discussion clarify the relative contributions of the fixed features and the differential attention to accuracy and efficiency. revision: yes

Circularity Check

0 steps flagged

No circularity: performance claims rest on external empirical benchmarks

Full rationale

The paper introduces IAFormer as a new architecture whose attention matrix is conditioned on a fixed set of predefined boost-invariant pairwise quantities and a differential sparse attention mechanism. The central claims of order-of-magnitude efficiency gains and state-of-the-art classification accuracy are established solely by direct numerical comparison against the Particle Transformer and other baselines on the standard top-tagging and quark-gluon tagging datasets. No equation or result in the derivation reduces to a quantity defined in terms of itself, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on a self-citation whose content is itself unverified. The evaluation therefore remains externally falsifiable and independent of the model's internal definitions.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central performance claim rests on the sufficiency of a small set of hand-chosen boost-invariant quantities and on standard supervised training assumptions; no new particles or forces are postulated.

free parameters (1)
  • choice of boost-invariant pairwise quantities
    The specific set of quantities used to build the attention matrix is selected rather than learned; this choice directly affects model capacity and is not derived from first principles.
axioms (2)
  • [domain assumption] Boost invariance is a fundamental symmetry of high-energy particle collisions and can be used to define interaction features.
    Invoked to justify the construction of the attention matrix from predefined pairwise quantities.
  • [domain assumption] Standard supervised classification loss and optimization suffice to train the model to capture physically meaningful patterns.
    Underlying the claim that interpretability checks confirm layer-by-layer physical relevance.
invented entities (1)
  • Differential attention mechanism [no independent evidence]
    purpose: Dynamically prioritize relevant particle tokens while suppressing less informative ones
    New component introduced to achieve sparsity; no independent falsifiable prediction outside the model performance is provided.

pith-pipeline@v0.9.0 · 5732 in / 1485 out tokens · 31986 ms · 2026-05-22T17:09:56.171261+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What the tags mean:
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dissecting Jet-Tagger Through Mechanistic Interpretability

    hep-ph · 2026-05 · accept · novelty 8.0

    A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1]

    J. M. Butterworth, A. R. Davison, M. Rubin, and G. P. Salam, Jet substructure as a new Higgs search channel at the LHC, Phys. Rev. Lett. 100 (2008) 242001, [arXiv:0802.2470]

  2. [2]

    L. G. Almeida, M. Backović, M. Cliche, S. J. Lee, and M. Perelstein, Playing Tag with ANN: Boosted Top Identification with Pattern Recognition, JHEP 07 (2015) 086, [arXiv:1501.05968]

  3. [3]

    A. Butter, G. Kasieczka, T. Plehn, and M. Russell, Deep-learned Top Tagging with a Lorentz Layer, SciPost Phys. 5 (2018), no. 3 028, [arXiv:1707.08966]

  4. [4]

    G. Kasieczka, T. Plehn, M. Russell, and T. Schell, Deep-learning Top Taggers or The End of QCD?, JHEP 05 (2017) 006, [arXiv:1701.08784]

  5. [5]

    G. Louppe, K. Cho, C. Becot, and K. Cranmer, QCD-Aware Recursive Neural Networks for Jet Physics, JHEP 01 (2019) 057, [arXiv:1702.00748]

  6. [6]

    A. Butter et al., The Machine Learning landscape of top taggers, SciPost Phys. 7 (2019) 014, [arXiv:1902.09914]

  7. [7]

    A. Chakraborty, S. H. Lim, M. M. Nojiri, and M. Takeuchi, Neural Network-based Top Tagger with Two-Point Energy Correlations and Geometry of Soft Emissions, JHEP 07 (2020) 111, [arXiv:2003.11787]

  8. [8]

    S. Bhattacharya, M. Guchait, and A. H. Vijay, Boosted top quark tagging and polarization measurement using machine learning, Phys. Rev. D 105 (2022), no. 4 042005, [arXiv:2010.11778]

  9. [9]

    X. Ju and B. Nachman, Supervised Jet Clustering with Graph Neural Networks for Lorentz Boosted Bosons, Phys. Rev. D 102 (2020), no. 7 075014, [arXiv:2008.06064]

  10. [10]

    F. A. Dreyer and H. Qu, Jet tagging in the Lund plane with graph networks, JHEP 03 (2021) 052, [arXiv:2012.08526]

  11. [11]

    B. Tannenwald, C. Neu, A. Li, G. Buehlmann, A. Cuddeback, L. Hatfield, R. Parvatam, and C. Thompson, Benchmarking Machine Learning Techniques with Di-Higgs Production at the LHC, arXiv:2009.06754

  12. [12]

    F. A. Dreyer, R. Grabarczyk, and P. F. Monni, Leveraging universality of jet taggers through transfer learning, Eur. Phys. J. C 82 (2022), no. 6 564, [arXiv:2203.06210]

  13. [13]

    A. Hammad, S. Khalil, and S. Moretti, Search for mono-Higgs signals in bb̄ final states using deep neural networks, Phys. Rev. D 107 (2023), no. 7 075027, [arXiv:2208.10133]

  14. [14]

    I. Ahmed, A. Zada, M. Waqas, and M. U. Ashraf, Application of deep learning in top pair and single top quark production at the LHC, Eur. Phys. J. Plus 138 (2023), no. 9 795, [arXiv:2203.12871]

  15. [15]

    J. M. Munoz, I. Batatia, and C. Ortner, Boost invariant polynomials for efficient jet tagging, Mach. Learn. Sci. Tech. 3 (2022), no. 4 04LT05, [arXiv:2207.08272]

  16. [16]

    M. He and D. Wang, Quark/gluon discrimination and top tagging with dual attention transformer, Eur. Phys. J. C 83 (2023), no. 12 1116, [arXiv:2307.04723]

  17. [17]

    J. A. Aguilar-Saavedra, E. Arganda, F. R. Joaquim, R. M. Sandá Seoane, and J. F. Seabra, Gradient Boosting MUST taggers for highly-boosted jets, arXiv:2305.04957

  18. [18]

    D. Athanasakos, A. J. Larkoski, and J. Mulligan, Is infrared-collinear safe information all you need for jet classification?, arXiv:2305.08979

  19. [19]

    M. Grossi, M. Incudini, M. Pellen, and G. Pelliccioli, Amplitude-assisted tagging of longitudinally polarised bosons using wide neural networks, Eur. Phys. J. C 83 (2023), no. 8 759, [arXiv:2306.07726]

  20. [20]

    A. Hammad, P. Ko, C.-T. Lu, and M. Park, Exploring exotic decays of the Higgs boson to multi-photons at the LHC via multimodal learning approaches, JHEP 09 (2024) 166, [arXiv:2405.18834]

  21. [21]

    A. Hammad and M. M. Nojiri, Streamlined jet tagging network assisted by jet prong structure, JHEP 06 (2024) 176, [arXiv:2404.14677]. [22] CMS Collaboration, A. M. Sirunyan et al., Identification of heavy-flavour jets with the CMS detector in pp collisions at 13 TeV, JINST 13 (2018), no. 05 P05011, [arXiv:1712.07158]. [23] ATLAS Collaboration, Identification of Jets Co...

  22. [22]

    M. Andrews et al., End-to-end jet classification of boosted top quarks with the CMS open data, EPJ Web Conf. 251 (2021) 04030, [arXiv:2104.14659]

  23. [23]

    P. Keicher, Machine Learning in Top Physics in the ATLAS and CMS Collaborations, in 15th International Workshop on Top Quark Physics, 1, 2023. arXiv:2301.09534

  24. [24]

    P. Baroň, J. Kvita, R. Přívara, J. Tomeček, and R. Vodák, Application of Machine Learning Based Top Quark and W Jet Tagging to Hadronic Four-Top Final States Induced by SM as well as BSM Processes, in 16th International Workshop on Top Quark Physics, 10, 2023. arXiv:2310.13009

  25. [25]

    A. Hammad, S. Moretti, and M. Nojiri, Multi-scale cross-attention transformer encoder for event classification, JHEP 03 (2024) 144, [arXiv:2401.00452]

  26. [26]

    W. Esmail, A. Hammad, and S. Moretti, Sharpening the A→Z(∗)h signature of the Type-II 2HDM at the LHC through advanced Machine Learning, JHEP 11 (2023) 020, [arXiv:2305.13781]

  27. [27]

    K. Datta, A. Larkoski, and B. Nachman, Automating the Construction of Jet Observables with Machine Learning, Phys. Rev. D 100 (2019), no. 9 095016, [arXiv:1902.07180]

  28. [28]

    A. Chakraborty, S. H. Lim, and M. M. Nojiri, Interpretable deep learning for two-prong jet classification with jet spectra, JHEP 07 (2019) 135, [arXiv:1904.02092]

  29. [29]

    T. Kim and A. Martin, A W± polarization analyzer from Deep Neural Networks, arXiv:2102.05124

  30. [30]

    A. Subba and R. K. Singh, Role of polarizations and spin-spin correlations of W’s in e-e+ → W-W+ at √s = 250 GeV to probe anomalous W-W+Z/γ couplings, Phys. Rev. D 107 (2023), no. 7 073004, [arXiv:2212.12973]

  31. [31]

    A. Bogatskiy, T. Hoffman, D. W. Miller, J. T. Offermann, and X. Liu, Explainable equivariant neural networks for particle physics: PELICAN, JHEP 03 (2024) 113, [arXiv:2307.16506]

  32. [32]

    S. Akar, T. J. Boettcher, S. Carl, H. F. Schreiner, M. D. Sokoloff, M. Stahl, C. Weisser, and M. Williams, An updated hybrid deep learning algorithm for identifying and locating primary vertices, arXiv:2007.01023

  33. [33]

    J. Shlomi, S. Ganguly, E. Gross, K. Cranmer, Y. Lipman, H. Serviansky, H. Maron, and N. Segol, Secondary vertex finding in jets with neural networks, Eur. Phys. J. C 81 (2021), no. 6 540, [arXiv:2008.02831]

  34. [34]

    K. Goto, T. Suehara, T. Yoshioka, M. Kurata, H. Nagahara, Y. Nakashima, N. Takemura, and M. Iwasaki, Development of a vertex finding algorithm using Recurrent Neural Network, Nucl. Instrum. Meth. A 1047 (2023) 167836, [arXiv:2101.11906]

  35. [35]

    J. Guiang et al., Improving tracking algorithms with machine learning: a case for line-segment tracking at the High Luminosity LHC, in Connecting The Dots 2023, 3, 2024. arXiv:2403.13166

  36. [36]

    J. Erdmann, A tagger for strange jets based on tracking information using long short-term memory, JINST 15 (2020), no. 01 P01021, [arXiv:1907.07505]

  37. [37]

    Y. Nakai, D. Shih, and S. Thomas, Strange Jet Tagging, arXiv:2003.09517

  38. [38]

    J. Erdmann, O. Nackenhorst, and S. V. Zeißner, Maximum performance of strange-jet tagging at hadron colliders, JINST 16 (2021), no. 08 P08039, [arXiv:2011.10736]

  39. [39]

    P. T. Komiske, E. M. Metodiev, and M. D. Schwartz, Deep learning in color: towards automated quark/gluon jet discrimination, JHEP 01 (2017) 110, [arXiv:1612.01551]

  40. [40]

    T. Cheng, Recursive Neural Networks in Quark/Gluon Tagging, Comput. Softw. Big Sci. 2 (2018), no. 1 3, [arXiv:1711.02633]

  41. [41]

    M. Abbas, A. Khan, A. S. Qureshi, and M. W. Khan, Extracting Signals of Higgs Boson From Background Noise Using Deep Neural Networks, arXiv:2010.08201. [46] CMS Collaboration, A. Tumasyan et al., Search for Higgs Boson and Observation of Z Boson through their Decay into a Charm Quark-Antiquark Pair in Boosted Topologies in Proton-Proton Collisions at √s = 13 TeV,...

  42. [42]

    Z. Zhang, J. Liu, J. Hu, Q. Wang, and U.-G. Meißner, Revealing the nature of hidden charm pentaquarks with machine learning, Sci. Bull. 68 (2023) 981–989, [arXiv:2301.05364]

  43. [43]

    K. Goswami, S. Prasad, N. Mallick, R. Sahoo, and G. B. Mohanty, A machine learning-based study of open-charm hadrons in proton-proton collisions at the Large Hadron Collider, arXiv:2404.09839

  44. [44]

    H. Qu, C. Li, and S. Qian, Particle Transformer for Jet Tagging, arXiv:2202.03772

  45. [45]

    Y. Wu, K. Wang, C. Li, H. Qu, and J. Zhu, Jet tagging with more-interaction particle transformer*, Chin. Phys. C 49 (2025), no. 1 013110, [arXiv:2407.08682]

  46. [46]

    T. Ye, L. Dong, Y. Xia, Y. Sun, Y. Zhu, G. Huang, and F. Wei, Differential transformer, arXiv preprint arXiv:2410.05258 (2024)

  47. [47]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017)

  48. [48]

    P. T. Komiske, E. M. Metodiev, and J. Thaler, Energy Flow Networks: Deep Sets for Particle Jets, JHEP 01 (2019) 121, [arXiv:1810.05165]

  49. [49]

    H. Qu and L. Gouskos, ParticleNet: Jet Tagging via Particle Clouds, Phys. Rev. D 101 (2020), no. 5 056019, [arXiv:1902.08570]

  50. [50]

    B. Käch, D. Krücker, and I. Melzer-Pellmann, Point Cloud Generation using Transformer Encoders and Normalising Flows, arXiv:2211.13623

  51. [51]

    F. Blekman, F. Canelli, A. De Moor, K. Gautam, A. Ilg, A. Macchiolo, and E. Ploerer, Tagging more quark jet flavours at FCC-ee at 91 GeV with a transformer-based neural network, Eur. Phys. J. C 85 (2025), no. 2 165, [arXiv:2406.08590]

  52. [52]

    R. Child, S. Gray, A. Radford, and I. Sutskever, Generating long sequences with sparse transformers, arXiv preprint arXiv:1904.10509 (2019)

  53. [53]

    Z. Fu, W. Song, Y. Wang, X. Wu, Y. Zheng, Y. Zhang, D. Xu, X. Wei, T. Xu, and X. Zhao, Sliding window attention training for efficient large language models, arXiv preprint arXiv:2502.18845 (2025)

  54. [54]

    X. Pan, T. Ye, Z. Xia, S. Song, and G. Huang, Slide-transformer: Hierarchical vision transformer with local self-attention, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2082–2091, 2023

  55. [55]

    I. Beltagy, M. E. Peters, and A. Cohan, Longformer: The long-document transformer, arXiv preprint arXiv:2004.05150 (2020)

  56. [56]

    A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, and H. Shi, Escaping the big data paradigm with compact transformers, arXiv preprint arXiv:2104.05704 (2021)

  57. [57]

    D. Hendrycks and K. Gimpel, Gaussian error linear units (gelus), arXiv preprint arXiv:1606.08415 (2016)

  58. [58]

    B. Zhang and R. Sennrich, Root mean square layer normalization, Advances in Neural Information Processing Systems 32 (2019)

  59. [59]

    S. Elfwing, E. Uchibe, and K. Doya, Sigmoid-weighted linear units for neural network function approximation in reinforcement learning, Neural networks 107 (2018) 3–11

  60. [60]

    G. Kasieczka, T. Plehn, J. Thompson, and M. Russell, Top quark tagging reference dataset, Mar., 2019

  61. [61]

    P. Komiske, E. Metodiev, and J. Thaler, Pythia8 quark and gluon jets for energy flow, May, 2019

  62. [62]

    C. Bierlich et al., A comprehensive guide to the physics and usage of PYTHIA 8.3, SciPost Phys. Codeb. 2022 (2022) 8, [arXiv:2203.11601]. [68] DELPHES 3 Collaboration, J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi, DELPHES 3, A modular framework for fast simulation of a generic collider experiment, JHEP 02 (2014) 0...

  63. [63]

    C. Shimmin, Particle Convolution for High Energy Physics, 7, 2021. arXiv:2107.02908

  64. [64]

    S. Gong, Q. Meng, J. Zhang, H. Qu, C. Li, S. Qian, W. Du, Z.-M. Ma, and T.-Y. Liu, An efficient Lorentz equivariant graph neural network for jet tagging, JHEP 07 (2022) 030, [arXiv:2201.08187]

  65. [65]

    J. Brehmer, V. Bresó, P. de Haan, T. Plehn, H. Qu, J. Spinner, and J. Thaler, A Lorentz-Equivariant Transformer for All of the LHC, arXiv:2411.00446

  66. [66]

    V. Mikuni and F. Canelli, Point cloud transformers applied to collider physics, Mach. Learn. Sci. Tech. 2 (2021), no. 3 035027, [arXiv:2102.05073]

  67. [67]

    V. Mikuni and B. Nachman, Method to simultaneously facilitate all jet physics tasks, Phys. Rev. D 111 (2025), no. 5 054015, [arXiv:2502.14652]

  68. [68]

    I. Loshchilov and F. Hutter, Decoupled weight decay regularization, in International Conference on Learning Representations, 2019

  69. [69]

    V. Mikuni and F. Canelli, ABCNet: An attention-based method for particle tagging, Eur. Phys. J. Plus 135 (2020), no. 6 463, [arXiv:2001.05311]

  70. [70]

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al., Training compute-optimal large language models, arXiv preprint arXiv:2203.15556 (2022)

  71. [71]

    J.-B. Cordonnier, A. Loukas, and M. Jaggi, On the relationship between self-attention and convolutional layers, arXiv preprint arXiv:1911.03584 (2019)

  72. [72]

    S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, Similarity of neural network representations revisited, in International conference on machine learning, pp. 3519–3529, PMLR, 2019

  73. [73]

    T. Gomez, T. Fréour, and H. Mouchère, Metrics for saliency map evaluation of deep learning explanation methods, in International Conference on Pattern Recognition and Artificial Intelligence, pp. 84–95, Springer, 2022

  74. [74]

    A. Binder, G. Montavon, S. Lapuschkin, K.-R. Müller, and W. Samek, Layer-wise relevance propagation for neural networks with local renormalization layers, in Artificial Neural Networks and Machine Learning–ICANN 2016: 25th International Conference on Artificial Neural Networks, Barcelona, Spain, September 6-9, 2016, Proceedings, Part II 25, pp. 63–71, Spri...

  75. [75]

    R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, Grad-cam: visual explanations from deep networks via gradient-based localization, International journal of computer vision 128 (2020) 336–359

  76. [76]

    A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, PyTorch: an imperative style, high-performance deep learning library. Curran Associates Inc., Red Hook, NY, USA, 2019.