IAFormer: Interaction-Aware Transformer network for collider data analysis

A. Hammad; M. Nojiri; W. Esmail

arXiv: 2505.03258 · v2 · submitted 2025-05-06 · hep-ph · hep-ex


Pith reviewed 2026-05-22 17:09 UTC · model grok-4.3


keywords: Transformer architecture · Collider data analysis · Sparse attention · Particle interactions · Top quark tagging · Quark-gluon discrimination · Boost-invariant observables · Model efficiency


The pith

IAFormer classifies top quarks and quark-gluon jets at state-of-the-art accuracy using far fewer parameters than prior particle transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents IAFormer as a Transformer architecture designed for collider event classification. It builds the attention matrix from a fixed set of boost-invariant pairwise particle quantities instead of learning all interactions from scratch, then applies differential attention to create a sparse, dynamic focus on the most relevant tokens. This combination cuts computational cost by more than an order of magnitude relative to the Particle Transformer while preserving or exceeding its accuracy on the standard top-tagging and quark-gluon datasets. The authors further show through interpretability tools that the sparse layers progressively assemble physically meaningful features that remain stable against statistical fluctuations in the input.

Core claim

IAFormer achieves state-of-the-art performance on top and quark-gluon classification by making the attention matrix depend on predefined boost-invariant pairwise quantities and by replacing standard attention with a differential sparse mechanism that dynamically prioritizes informative particle tokens, thereby reducing model size and computation without loss of accuracy.

What carries the argument

Dynamic sparse attention whose matrix is constructed from a fixed set of boost-invariant pairwise quantities, combined with differential attention to select relevant tokens on the fly.
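The mechanism can be sketched in a few lines of NumPy. This is a minimal single-head illustration of the formula quoted in the theorem listing on this page, α = softmax(W1·I) − β·softmax(W2·I), and not the authors' implementation. The tensor layout `(N, N, F)`, the purely linear projections `w1` and `w2`, the fixed scalar `beta` (trainable in the paper), and the function names are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along `axis`.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def differential_attention(pair_feats, values, w1, w2, beta):
    """Single-head sketch of alpha = softmax(W1.I) - beta * softmax(W2.I).

    pair_feats: (N, N, F) pairwise quantities I_ij (assumed layout)
    values:     (N, D) per-particle value vectors
    w1, w2:     (F,) projections giving one score per particle pair (assumed linear)
    beta:       scalar weight of the subtracted attention map
    """
    scores1 = pair_feats @ w1                      # (N, N)
    scores2 = pair_feats @ w2                      # (N, N)
    alpha = softmax(scores1) - beta * softmax(scores2)
    return alpha @ values, alpha                   # (N, D), (N, N)

# Random placeholder inputs, purely to exercise the function.
rng = np.random.default_rng(0)
n_particles, n_feats, dim = 6, 4, 8
pair_feats = rng.normal(size=(n_particles, n_particles, n_feats))
values = rng.normal(size=(n_particles, dim))
out, alpha = differential_attention(
    pair_feats, values,
    rng.normal(size=n_feats), rng.normal(size=n_feats),
    beta=0.5,
)
```

In this sketch each row of `alpha` sums to 1 − β, and individual entries can be negative wherever the second map dominates. That is how the subtraction can push weakly attended tokens toward zero weight.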

If this is right

Large collider datasets can be processed with substantially lower memory and runtime requirements.
Sparse attention layers produce outputs that remain stable under statistical fluctuations in the input events.
Layer-wise interpretability reveals that physically relevant features are assembled progressively through the sparse mechanism.
Domain-specific pairwise quantities can be injected into Transformer attention to reduce parameter count while retaining performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same construction could be tested on regression or anomaly-detection tasks in collider physics where full attention is currently too expensive.
If the predefined quantities prove sufficient across more datasets, future models might be initialized with far fewer learned interaction parameters.
Real-time event selection at high-luminosity colliders could become feasible with networks of this reduced complexity.

Load-bearing premise

The chosen set of predefined boost-invariant pairwise quantities already encodes all the interaction information needed for high-accuracy classification on these datasets.
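To make the premise concrete, the sketch below computes a small set of pairwise quantities from particle four-momenta and checks numerically that they do not change under a boost along the beam axis. The specific set (Δy, Δφ, m²), the `(E, px, py, pz)` layout, and the function names are assumptions for illustration; the paper's exact feature list is not reproduced on this page. Rapidity y is used here because rapidity differences are exactly invariant under longitudinal boosts.

```python
import numpy as np

def pairwise_quantities(p4):
    """p4: (N, 4) array of (E, px, py, pz). Returns (N, N, 3): delta_y, delta_phi, m2."""
    energy, px, py, pz = p4[:, 0], p4[:, 1], p4[:, 2], p4[:, 3]
    rapidity = 0.5 * np.log((energy + pz) / (energy - pz))
    phi = np.arctan2(py, px)
    delta_y = rapidity[:, None] - rapidity[None, :]
    delta_phi = phi[:, None] - phi[None, :]
    delta_phi = (delta_phi + np.pi) % (2 * np.pi) - np.pi   # wrap into [-pi, pi)
    pair_sum = p4[:, None, :] + p4[None, :, :]               # (N, N, 4)
    m2 = pair_sum[..., 0] ** 2 - np.sum(pair_sum[..., 1:] ** 2, axis=-1)
    return np.stack([delta_y, delta_phi, m2], axis=-1)

def boost_z(p4, velocity):
    """Lorentz boost along the z (beam) axis with speed `velocity` in units of c."""
    gamma = 1.0 / np.sqrt(1.0 - velocity ** 2)
    energy = gamma * (p4[:, 0] - velocity * p4[:, 3])
    pz = gamma * (p4[:, 3] - velocity * p4[:, 0])
    return np.stack([energy, p4[:, 1], p4[:, 2], pz], axis=1)

# Random massive particles as placeholder input (not collider data).
rng = np.random.default_rng(0)
p3 = rng.normal(scale=5.0, size=(8, 3))
mass = 0.14
energy = np.sqrt(np.sum(p3 ** 2, axis=1) + mass ** 2)
p4 = np.column_stack([energy, p3])

feats = pairwise_quantities(p4)
feats_boosted = pairwise_quantities(boost_z(p4, velocity=0.6))
print(np.allclose(feats, feats_boosted))
```

Whether such a fixed set carries enough information for classification is exactly what the premise asserts; the sketch only shows that the inputs themselves are unaffected by the boost.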

What would settle it

A direct comparison on the same top and quark-gluon datasets in which a standard particle transformer or a version of IAFormer without the predefined pairwise quantities reaches equal or higher accuracy at comparable or lower computational cost.

Figures

Figures reproduced from arXiv: 2505.03258 by A. Hammad, M. Nojiri, W. Esmail.

**Figure 1.** Schematic architecture of the IAFormer network. Additionally, a dynamic sparse attention pattern is included following the idea of “differential attention” [51]. The differential attention computes the attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, dynamically promoting sparse attention patterns. The model has a trainable parameter β to contr…

**Figure 2.** Distribution of β across all IAFormer layers. To better understand the role of β, we analyze three different random seeds. Interestingly, all distributions exhibit a similar pattern: β values increase in the initial layers, reach a maximum value, and then decrease in the later layers. Moreover, the network classification accuracy improves for higher β; the blue line corresponds to the best…

**Figure 3.** β distribution for the quark-gluon test dataset for three different random seeds. For quark-gluon tagging, the number of layers of the IAFormer architecture is reduced to 14

**Figure 4.** Attention maps of the final self-attention layer of …

**Figure 5.** Linear CKA similarity for IAFormer (left), Transformer + I_{i,j} (middle), and plain Transformer (right) using 1000 test events from the top jet dataset. The axes represent the attention layers in each network, while the colour bar indicates the CKA values. To construct the Gram matrices, M and N, we consider 1000 test events and average over the feature dimension, resulting in Gram matrices of siz…
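For readers unfamiliar with the similarity measure in Figure 5, here is a hedged sketch of linear CKA in its standard feature-space form (Kornblith et al.). It assumes each layer's output has already been reduced to a matrix of shape `(n_events, n_features)`; how the paper pools particle embeddings into its Gram matrices is only partly visible in the truncated caption, so that step is not reproduced. The function name and the random inputs are illustrative.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two representations of the same events.

    x: (n_events, p1), y: (n_events, p2). Returns a similarity in [0, 1].
    """
    # Centre each feature column so the measure ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

# Placeholder activations: 1000 events, two hypothetical layers.
rng = np.random.default_rng(0)
layer_a = rng.normal(size=(1000, 16))
layer_b = rng.normal(size=(1000, 32))

# An orthogonal rotation plus rescaling of layer_a should leave CKA at 1.
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
cka_same = linear_cka(layer_a, 3.0 * layer_a @ rotation)
cka_unrelated = linear_cka(layer_a, layer_b)
```

Evaluating `linear_cka` for every pair of layers gives the kind of layer-by-layer similarity matrix shown in the figure.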

Original abstract

In this paper, we introduce IAFormer, a novel Transformer-based architecture that efficiently integrates pairwise particle interactions through a dynamic sparse attention mechanism. IAFormer has two new mechanisms within the model. First, the attention matrix depends on predefined boost invariant pairwise quantities, reducing the network parameters significantly from the original particle transformer models. Second, IAFormer incorporates the sparse attention mechanism by utilizing the "differential attention", so that it can dynamically prioritize relevant particle tokens while reducing computational overhead associated with less informative ones. This approach significantly lowers the model complexity without compromising performance. Despite being computationally efficient by more than an order of magnitude than the Particle Transformer network, IAFormer achieves state-of-the-art performance in classification tasks on the top and quark-gluon datasets. Furthermore, we employ AI interpretability techniques, verifying that the model effectively captures physically meaningful information layer by layer through its sparse attention mechanism, building an efficient network output that is resistant to statistical fluctuations. IAFormer highlights the need for sparse attention in Transformer analysis to reduce the network size while improving its performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

IAFormer bakes predefined boost-invariant pairwise quantities into the attention matrix and adds differential attention for sparsity, cutting size versus Particle Transformers while claiming SOTA on standard tagging tasks, but the numbers rest on thin reported evidence.


IAFormer conditions its attention directly on a fixed set of boost-invariant pairwise quantities and uses differential attention to sparsify the computation dynamically. The abstract reports a model that is more than an order of magnitude cheaper computationally than the Particle Transformer, together with state-of-the-art accuracy on the standard top-tagging and quark-gluon benchmark datasets.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces IAFormer, a Transformer architecture for collider data analysis that conditions the attention matrix on predefined boost-invariant pairwise particle quantities and employs a differential sparse attention mechanism. It claims this yields more than an order of magnitude computational efficiency gain over the Particle Transformer while achieving state-of-the-art classification performance on top-quark and quark-gluon jet tagging datasets, with additional support from layer-wise AI interpretability analysis showing capture of physically meaningful features.

Significance. If the performance and efficiency claims are substantiated, the work could offer a practical advance for scalable machine learning in high-energy physics by incorporating domain-specific invariants to reduce parameters and computation. The interpretability component is a constructive element that may help validate physical relevance in jet classification tasks.

major comments (2)

[Abstract and results] The central claims of SOTA performance on the top and quark-gluon datasets, together with an order-of-magnitude efficiency improvement over the Particle Transformer, are presented without any reported details on training procedure, data splits, baseline implementations, statistical uncertainties, or error bars. These omissions directly undermine evaluation of the empirical assertions that constitute the paper's primary contribution.
[Architecture description (methods)] The model deliberately fixes the set of boost-invariant pairwise quantities used to condition the attention matrix rather than learning interactions from embeddings. No ablation is described that replaces this fixed set with a learned pairwise module, leaving open whether the reported accuracy and efficiency gains are attributable to the differential attention or to an unusually well-matched choice of input features for these particular datasets.

minor comments (2)

[Abstract] The abstract states that the model is 'resistant to statistical fluctuations' but provides no quantitative metric or test (e.g., variance across seeds or robustness to input perturbations) to support this phrasing.
[Methods] Notation for the differential attention mechanism should be introduced with an explicit equation or pseudocode early in the methods section to allow readers to reproduce the sparsity implementation.
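The first minor comment asks for a quantitative robustness metric such as variance across seeds. One way this could be reported is sketched below: given classifier scores from several independently seeded trainings on the same test events, it returns the mean and spread of the accuracy and the average per-event spread of the scores. The function name, the 0.5 threshold, and the inputs are assumptions; the arrays are synthetic placeholders and not results from the paper.

```python
import numpy as np

def seed_stability(scores, labels, threshold=0.5):
    """Summarise run-to-run variation of a binary classifier.

    scores: (n_seeds, n_events) outputs in [0, 1], one row per random seed
    labels: (n_events,) true classes in {0, 1}
    """
    predictions = scores >= threshold
    accuracy = (predictions == labels.astype(bool)).mean(axis=1)  # one value per seed
    return {
        "acc_mean": accuracy.mean(),
        "acc_std": accuracy.std(ddof=1),
        # Average over events of the score spread across seeds.
        "score_std_per_event": scores.std(axis=0, ddof=1).mean(),
    }

# Synthetic placeholder inputs: a shared signal plus small seed-dependent noise.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=1000)
base = np.clip(labels + rng.normal(scale=0.4, size=1000), 0.0, 1.0)
scores = np.clip(base + rng.normal(scale=0.05, size=(5, 1000)), 0.0, 1.0)

summary = seed_stability(scores, labels)
identical = seed_stability(np.tile(base, (3, 1)), labels)
```

Reporting `acc_mean ± acc_std` alongside the headline numbers would give the phrase "resistant to statistical fluctuations" a measurable meaning.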

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major point below and have revised the manuscript to improve reproducibility and strengthen the analysis.


Referee: [Abstract and results] The central claims of SOTA performance on the top and quark-gluon datasets, together with an order-of-magnitude efficiency improvement over the Particle Transformer, are presented without any reported details on training procedure, data splits, baseline implementations, statistical uncertainties, or error bars. These omissions directly undermine evaluation of the empirical assertions that constitute the paper's primary contribution.

Authors: We agree that these details are necessary for proper evaluation. The revised manuscript includes an expanded Experimental Setup section with: complete training procedure (optimizer, learning rate, epochs, batch size); explicit data splits for both datasets; implementation details for the Particle Transformer baseline to ensure fair comparison; and results reported with statistical uncertainties and error bars from five independent runs with different random seeds. These additions directly support the reported performance and efficiency claims.

Revision: yes

Referee: [Architecture description (methods)] The model deliberately fixes the set of boost-invariant pairwise quantities used to condition the attention matrix rather than learning interactions from embeddings. No ablation is described that replaces this fixed set with a learned pairwise module, leaving open whether the reported accuracy and efficiency gains are attributable to the differential attention or to an unusually well-matched choice of input features for these particular datasets.

Authors: The fixed boost-invariant quantities (invariant mass, Δη, Δφ) are a deliberate choice grounded in established collider physics to embed domain knowledge and minimize parameters. This design lets the differential attention focus on higher-order effects. We acknowledge the value of an ablation. The revised manuscript now includes a new ablation study replacing the fixed set with a learned pairwise module; the updated results and discussion clarify the relative contributions of the fixed features and the differential attention to accuracy and efficiency.

Revision: yes

Circularity Check

0 steps flagged

No circularity: performance claims rest on external empirical benchmarks

full rationale

The paper introduces IAFormer as a new architecture whose attention matrix is conditioned on a fixed set of predefined boost-invariant pairwise quantities and a differential sparse attention mechanism. The central claims of order-of-magnitude efficiency gains and state-of-the-art classification accuracy are established solely by direct numerical comparison against the Particle Transformer and other baselines on the standard top-tagging and quark-gluon tagging datasets. No equation or result in the derivation reduces to a quantity defined in terms of itself, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on a self-citation whose content is itself unverified. The evaluation therefore remains externally falsifiable and independent of the model's internal definitions.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central performance claim rests on the sufficiency of a small set of hand-chosen boost-invariant quantities and on standard supervised training assumptions; no new particles or forces are postulated.

free parameters (1)

choice of boost-invariant pairwise quantities
The specific set of quantities used to build the attention matrix is selected rather than learned; this choice directly affects model capacity and is not derived from first principles.

axioms (2)

Domain assumption: Boost invariance is a fundamental symmetry of high-energy particle collisions and can be used to define interaction features.
Invoked to justify the construction of the attention matrix from predefined pairwise quantities.
Domain assumption: Standard supervised classification loss and optimization suffice to train the model to capture physically meaningful patterns.
Underlying the claim that interpretability checks confirm layer-by-layer physical relevance.

invented entities (1)

Differential attention mechanism (no independent evidence)
Purpose: dynamically prioritize relevant particle tokens while suppressing less informative ones.
New component introduced to achieve sparsity; no independent falsifiable prediction outside the model performance is provided.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "the attention matrix depends on predefined boost invariant pairwise quantities... differential attention... α_{i,i′} = softmax(W_1·I_{i,j}) − β softmax(W_2·I_{i,j})"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: "IAFormer... achieves state-of-the-art performance... with a significantly reduced network size"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dissecting Jet-Tagger Through Mechanistic Interpretability
hep-ph 2026-05 accept novelty 8.0

A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 1 Pith paper · 11 internal anchors

[1] J. M. Butterworth, A. R. Davison, M. Rubin, and G. P. Salam, Jet substructure as a new Higgs search channel at the LHC, Phys. Rev. Lett. 100 (2008) 242001, [arXiv:0802.2470].
[2] L. G. Almeida, M. Backović, M. Cliche, S. J. Lee, and M. Perelstein, Playing Tag with ANN: Boosted Top Identification with Pattern Recognition, JHEP 07 (2015) 086, [arXiv:1501.05968].
[3] A. Butter, G. Kasieczka, T. Plehn, and M. Russell, Deep-learned Top Tagging with a Lorentz Layer, SciPost Phys. 5 (2018), no. 3 028, [arXiv:1707.08966].
[4] G. Kasieczka, T. Plehn, M. Russell, and T. Schell, Deep-learning Top Taggers or The End of QCD?, JHEP 05 (2017) 006, [arXiv:1701.08784].
[5] G. Louppe, K. Cho, C. Becot, and K. Cranmer, QCD-Aware Recursive Neural Networks for Jet Physics, JHEP 01 (2019) 057, [arXiv:1702.00748].
[6] A. Butter et al., The Machine Learning landscape of top taggers, SciPost Phys. 7 (2019) 014, [arXiv:1902.09914].
[7] A. Chakraborty, S. H. Lim, M. M. Nojiri, and M. Takeuchi, Neural Network-based Top Tagger with Two-Point Energy Correlations and Geometry of Soft Emissions, JHEP 07 (2020) 111, [arXiv:2003.11787].
[8] S. Bhattacharya, M. Guchait, and A. H. Vijay, Boosted top quark tagging and polarization measurement using machine learning, Phys. Rev. D 105 (2022), no. 4 042005, [arXiv:2010.11778].
[9] X. Ju and B. Nachman, Supervised Jet Clustering with Graph Neural Networks for Lorentz Boosted Bosons, Phys. Rev. D 102 (2020), no. 7 075014, [arXiv:2008.06064].
[10] F. A. Dreyer and H. Qu, Jet tagging in the Lund plane with graph networks, JHEP 03 (2021) 052, [arXiv:2012.08526].
[11] B. Tannenwald, C. Neu, A. Li, G. Buehlmann, A. Cuddeback, L. Hatfield, R. Parvatam, and C. Thompson, Benchmarking Machine Learning Techniques with Di-Higgs Production at the LHC, arXiv:2009.06754.
[12] F. A. Dreyer, R. Grabarczyk, and P. F. Monni, Leveraging universality of jet taggers through transfer learning, Eur. Phys. J. C 82 (2022), no. 6 564, [arXiv:2203.06210].
[13] A. Hammad, S. Khalil, and S. Moretti, Search for mono-Higgs signals in bb̄ final states using deep neural networks, Phys. Rev. D 107 (2023), no. 7 075027, [arXiv:2208.10133].
[14] I. Ahmed, A. Zada, M. Waqas, and M. U. Ashraf, Application of deep learning in top pair and single top quark production at the LHC, Eur. Phys. J. Plus 138 (2023), no. 9 795, [arXiv:2203.12871].
[15] J. M. Munoz, I. Batatia, and C. Ortner, Boost invariant polynomials for efficient jet tagging, Mach. Learn. Sci. Tech. 3 (2022), no. 4 04LT05, [arXiv:2207.08272].
[16] M. He and D. Wang, Quark/gluon discrimination and top tagging with dual attention transformer, Eur. Phys. J. C 83 (2023), no. 12 1116, [arXiv:2307.04723].
[17] J. A. Aguilar-Saavedra, E. Arganda, F. R. Joaquim, R. M. Sandá Seoane, and J. F. Seabra, Gradient Boosting MUST taggers for highly-boosted jets, arXiv:2305.04957.
[18] D. Athanasakos, A. J. Larkoski, and J. Mulligan, Is infrared-collinear safe information all you need for jet classification?, arXiv:2305.08979.
[19] M. Grossi, M. Incudini, M. Pellen, and G. Pelliccioli, Amplitude-assisted tagging of longitudinally polarised bosons using wide neural networks, Eur. Phys. J. C 83 (2023), no. 8 759, [arXiv:2306.07726].
[20] A. Hammad, P. Ko, C.-T. Lu, and M. Park, Exploring exotic decays of the Higgs boson to multi-photons at the LHC via multimodal learning approaches, JHEP 09 (2024) 166, [arXiv:2405.18834].
[21] A. Hammad and M. M. Nojiri, Streamlined jet tagging network assisted by jet prong structure, JHEP 06 (2024) 176, [arXiv:2404.14677]. [22] CMS Collaboration, A. M. Sirunyan et al., Identification of heavy-flavour jets with the CMS detector in pp collisions at 13 TeV, JINST 13 (2018), no. 05 P05011, [arXiv:1712.07158]. [23] ATLAS Collaboration, Identification of Jets Co...
[22] M. Andrews et al., End-to-end jet classification of boosted top quarks with the CMS open data, EPJ Web Conf. 251 (2021) 04030, [arXiv:2104.14659].
[23] P. Keicher, Machine Learning in Top Physics in the ATLAS and CMS Collaborations, in 15th International Workshop on Top Quark Physics, 1, 2023. arXiv:2301.09534.
[24] P. Baroň, J. Kvita, R. Přívara, J. Tomeček, and R. Vodák, Application of Machine Learning Based Top Quark and W Jet Tagging to Hadronic Four-Top Final States Induced by SM as well as BSM Processes, in 16th International Workshop on Top Quark Physics, 10, 2023. arXiv:2310.13009.
[25] A. Hammad, S. Moretti, and M. Nojiri, Multi-scale cross-attention transformer encoder for event classification, JHEP 03 (2024) 144, [arXiv:2401.00452].
[26] W. Esmail, A. Hammad, and S. Moretti, Sharpening the A→Z(∗)h signature of the Type-II 2HDM at the LHC through advanced Machine Learning, JHEP 11 (2023) 020, [arXiv:2305.13781].
[27] K. Datta, A. Larkoski, and B. Nachman, Automating the Construction of Jet Observables with Machine Learning, Phys. Rev. D 100 (2019), no. 9 095016, [arXiv:1902.07180].
[28] A. Chakraborty, S. H. Lim, and M. M. Nojiri, Interpretable deep learning for two-prong jet classification with jet spectra, JHEP 07 (2019) 135, [arXiv:1904.02092].
[29] T. Kim and A. Martin, A W± polarization analyzer from Deep Neural Networks, arXiv:2102.05124.
[30] A. Subba and R. K. Singh, Role of polarizations and spin-spin correlations of W’s in e−e+ → W−W+ at √s = 250 GeV to probe anomalous W−W+Z/γ couplings, Phys. Rev. D 107 (2023), no. 7 073004, [arXiv:2212.12973].
[31] A. Bogatskiy, T. Hoffman, D. W. Miller, J. T. Offermann, and X. Liu, Explainable equivariant neural networks for particle physics: PELICAN, JHEP 03 (2024) 113, [arXiv:2307.16506].
[32] S. Akar, T. J. Boettcher, S. Carl, H. F. Schreiner, M. D. Sokoloff, M. Stahl, C. Weisser, and M. Williams, An updated hybrid deep learning algorithm for identifying and locating primary vertices, arXiv:2007.01023.
[33] J. Shlomi, S. Ganguly, E. Gross, K. Cranmer, Y. Lipman, H. Serviansky, H. Maron, and N. Segol, Secondary vertex finding in jets with neural networks, Eur. Phys. J. C 81 (2021), no. 6 540, [arXiv:2008.02831].
[34] K. Goto, T. Suehara, T. Yoshioka, M. Kurata, H. Nagahara, Y. Nakashima, N. Takemura, and M. Iwasaki, Development of a vertex finding algorithm using Recurrent Neural Network, Nucl. Instrum. Meth. A 1047 (2023) 167836, [arXiv:2101.11906].
[35] J. Guiang et al., Improving tracking algorithms with machine learning: a case for line-segment tracking at the High Luminosity LHC, in Connecting The Dots 2023, 3, 2024. arXiv:2403.13166.
[36] J. Erdmann, A tagger for strange jets based on tracking information using long short-term memory, JINST 15 (2020), no. 01 P01021, [arXiv:1907.07505].
[37] Y. Nakai, D. Shih, and S. Thomas, Strange Jet Tagging, arXiv:2003.09517.
[38] J. Erdmann, O. Nackenhorst, and S. V. Zeißner, Maximum performance of strange-jet tagging at hadron colliders, JINST 16 (2021), no. 08 P08039, [arXiv:2011.10736].
[39] P. T. Komiske, E. M. Metodiev, and M. D. Schwartz, Deep learning in color: towards automated quark/gluon jet discrimination, JHEP 01 (2017) 110, [arXiv:1612.01551].
[40] T. Cheng, Recursive Neural Networks in Quark/Gluon Tagging, Comput. Softw. Big Sci. 2 (2018), no. 1 3, [arXiv:1711.02633].
[41] M. Abbas, A. Khan, A. S. Qureshi, and M. W. Khan, Extracting Signals of Higgs Boson From Background Noise Using Deep Neural Networks, arXiv:2010.08201. [46] CMS Collaboration, A. Tumasyan et al., Search for Higgs Boson and Observation of Z Boson through their Decay into a Charm Quark-Antiquark Pair in Boosted Topologies in Proton-Proton Collisions at √s = 13 TeV,...
[42] Z. Zhang, J. Liu, J. Hu, Q. Wang, and U.-G. Meißner, Revealing the nature of hidden charm pentaquarks with machine learning, Sci. Bull. 68 (2023) 981–989, [arXiv:2301.05364].
[43] K. Goswami, S. Prasad, N. Mallick, R. Sahoo, and G. B. Mohanty, A machine learning-based study of open-charm hadrons in proton-proton collisions at the Large Hadron Collider, arXiv:2404.09839.
[44] H. Qu, C. Li, and S. Qian, Particle Transformer for Jet Tagging, arXiv:2202.03772.
[45] Y. Wu, K. Wang, C. Li, H. Qu, and J. Zhu, Jet tagging with more-interaction particle transformer*, Chin. Phys. C 49 (2025), no. 1 013110, [arXiv:2407.08682].
[46] T. Ye, L. Dong, Y. Xia, Y. Sun, Y. Zhu, G. Huang, and F. Wei, Differential transformer, arXiv preprint arXiv:2410.05258 (2024).
[47] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
[48] P. T. Komiske, E. M. Metodiev, and J. Thaler, Energy Flow Networks: Deep Sets for Particle Jets, JHEP 01 (2019) 121, [arXiv:1810.05165].
[49] H. Qu and L. Gouskos, ParticleNet: Jet Tagging via Particle Clouds, Phys. Rev. D 101 (2020), no. 5 056019, [arXiv:1902.08570].
[50] B. Käch, D. Krücker, and I. Melzer-Pellmann, Point Cloud Generation using Transformer Encoders and Normalising Flows, arXiv:2211.13623.
[51] F. Blekman, F. Canelli, A. De Moor, K. Gautam, A. Ilg, A. Macchiolo, and E. Ploerer, Tagging more quark jet flavours at FCC-ee at 91 GeV with a transformer-based neural network, Eur. Phys. J. C 85 (2025), no. 2 165, [arXiv:2406.08590].
[52] R. Child, S. Gray, A. Radford, and I. Sutskever, Generating long sequences with sparse transformers, arXiv preprint arXiv:1904.10509 (2019).
[53] Z. Fu, W. Song, Y. Wang, X. Wu, Y. Zheng, Y. Zhang, D. Xu, X. Wei, T. Xu, and X. Zhao, Sliding window attention training for efficient large language models, arXiv preprint arXiv:2502.18845 (2025).
[54] X. Pan, T. Ye, Z. Xia, S. Song, and G. Huang, Slide-transformer: Hierarchical vision transformer with local self-attention, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2082–2091, 2023.
[55] I. Beltagy, M. E. Peters, and A. Cohan, Longformer: The long-document transformer, arXiv preprint arXiv:2004.05150 (2020).
[56] A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, and H. Shi, Escaping the big data paradigm with compact transformers, arXiv preprint arXiv:2104.05704 (2021).
[57] D. Hendrycks and K. Gimpel, Gaussian error linear units (gelus), arXiv preprint arXiv:1606.08415 (2016).
[58] B. Zhang and R. Sennrich, Root mean square layer normalization, Advances in Neural Information Processing Systems 32 (2019).
[59] S. Elfwing, E. Uchibe, and K. Doya, Sigmoid-weighted linear units for neural network function approximation in reinforcement learning, Neural networks 107 (2018) 3–11.
[60] G. Kasieczka, T. Plehn, J. Thompson, and M. Russel, Top quark tagging reference dataset, Mar., 2019.
[61] P. Komiske, E. Metodiev, and J. Thaler, Pythia8 quark and gluon jets for energy flow, May, 2019.
[62] C. Bierlich et al., A comprehensive guide to the physics and usage of PYTHIA 8.3, SciPost Phys. Codeb. 2022 (2022) 8, [arXiv:2203.11601]. [68] DELPHES 3 Collaboration, J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi, DELPHES 3, A modular framework for fast simulation of a generic collider experiment, JHEP 02 (2014) 0...
[63]

Shimmin,Particle Convolution for High Energy Physics, 7, 2021

C. Shimmin,Particle Convolution for High Energy Physics, 7, 2021. arXiv:2107.02908

work page arXiv 2021
[64]

S. Gong, Q. Meng, J. Zhang, H. Qu, C. Li, S. Qian, W. Du, Z.-M. Ma, and T.-Y. Liu, An efficient Lorentz equivariant graph neural network for jet tagging,JHEP07 (2022) 030, [arXiv:2201.08187]

work page arXiv 2022
[65]

Brehmer, V

J. Brehmer, V. Bresó, P. de Haan, T. Plehn, H. Qu, J. Spinner, and J. Thaler,A Lorentz-Equivariant Transformer for All of the LHC,arXiv:2411.00446

work page arXiv
[66]

Mikuni and F

V. Mikuni and F. Canelli,Point cloud transformers applied to collider physics,Mach. Learn. Sci. Tech.2(2021), no. 3 035027, [arXiv:2102.05073]

[67] V. Mikuni and B. Nachman, Method to simultaneously facilitate all jet physics tasks, Phys. Rev. D 111 (2025), no. 5 054015, [arXiv:2502.14652]

[68] I. Loshchilov and F. Hutter, Decoupled weight decay regularization, in International Conference on Learning Representations, 2019

[69] V. Mikuni and F. Canelli, ABCNet: An attention-based method for particle tagging, Eur. Phys. J. Plus 135 (2020), no. 6 463, [arXiv:2001.05311]

[70] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al., Training compute-optimal large language models, arXiv preprint arXiv:2203.15556 (2022)

[71] J.-B. Cordonnier, A. Loukas, and M. Jaggi, On the relationship between self-attention and convolutional layers, arXiv preprint arXiv:1911.03584 (2019)

[72] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, Similarity of neural network representations revisited, in International conference on machine learning, pp. 3519–3529, PMLR, 2019

[73] T. Gomez, T. Fréour, and H. Mouchère, Metrics for saliency map evaluation of deep learning explanation methods, in International Conference on Pattern Recognition and Artificial Intelligence, pp. 84–95, Springer, 2022

[74] A. Binder, G. Montavon, S. Lapuschkin, K.-R. Müller, and W. Samek, Layer-wise relevance propagation for neural networks with local renormalization layers, in Artificial Neural Networks and Machine Learning–ICANN 2016: 25th International Conference on Artificial Neural Networks, Barcelona, Spain, September 6-9, 2016, Proceedings, Part II 25, pp. 63–71, Spri...

[75] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, Grad-cam: visual explanations from deep networks via gradient-based localization, International journal of computer vision 128 (2020) 336–359

[76] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, PyTorch: an imperative style, high-performance deep learning library. Curran Associates Inc., Red Hook, NY, USA, 2019.


[1] J. M. Butterworth, A. R. Davison, M. Rubin, and G. P. Salam, Jet substructure as a new Higgs search channel at the LHC, Phys. Rev. Lett. 100 (2008) 242001, [arXiv:0802.2470]

[2] L. G. Almeida, M. Backović, M. Cliche, S. J. Lee, and M. Perelstein, Playing Tag with ANN: Boosted Top Identification with Pattern Recognition, JHEP 07 (2015) 086, [arXiv:1501.05968]

[3] A. Butter, G. Kasieczka, T. Plehn, and M. Russell, Deep-learned Top Tagging with a Lorentz Layer, SciPost Phys. 5 (2018), no. 3 028, [arXiv:1707.08966]

[4] G. Kasieczka, T. Plehn, M. Russell, and T. Schell, Deep-learning Top Taggers or The End of QCD?, JHEP 05 (2017) 006, [arXiv:1701.08784]

[5] G. Louppe, K. Cho, C. Becot, and K. Cranmer, QCD-Aware Recursive Neural Networks for Jet Physics, JHEP 01 (2019) 057, [arXiv:1702.00748]

[6] A. Butter et al., The Machine Learning landscape of top taggers, SciPost Phys. 7 (2019) 014, [arXiv:1902.09914]

[7] A. Chakraborty, S. H. Lim, M. M. Nojiri, and M. Takeuchi, Neural Network-based Top Tagger with Two-Point Energy Correlations and Geometry of Soft Emissions, JHEP 07 (2020) 111, [arXiv:2003.11787]

[8] S. Bhattacharya, M. Guchait, and A. H. Vijay, Boosted top quark tagging and polarization measurement using machine learning, Phys. Rev. D 105 (2022), no. 4 042005, [arXiv:2010.11778]

[9] X. Ju and B. Nachman, Supervised Jet Clustering with Graph Neural Networks for Lorentz Boosted Bosons, Phys. Rev. D 102 (2020), no. 7 075014, [arXiv:2008.06064]

[10] F. A. Dreyer and H. Qu, Jet tagging in the Lund plane with graph networks, JHEP 03 (2021) 052, [arXiv:2012.08526]

[11] B. Tannenwald, C. Neu, A. Li, G. Buehlmann, A. Cuddeback, L. Hatfield, R. Parvatam, and C. Thompson, Benchmarking Machine Learning Techniques with Di-Higgs Production at the LHC, arXiv:2009.06754

[12] F. A. Dreyer, R. Grabarczyk, and P. F. Monni, Leveraging universality of jet taggers through transfer learning, Eur. Phys. J. C 82 (2022), no. 6 564, [arXiv:2203.06210]

[13] A. Hammad, S. Khalil, and S. Moretti, Search for mono-Higgs signals in bb̄ final states using deep neural networks, Phys. Rev. D 107 (2023), no. 7 075027, [arXiv:2208.10133]

[14] I. Ahmed, A. Zada, M. Waqas, and M. U. Ashraf, Application of deep learning in top pair and single top quark production at the LHC, Eur. Phys. J. Plus 138 (2023), no. 9 795, [arXiv:2203.12871]

[15] J. M. Munoz, I. Batatia, and C. Ortner, Boost invariant polynomials for efficient jet tagging, Mach. Learn. Sci. Tech. 3 (2022), no. 4 04LT05, [arXiv:2207.08272]

[16] M. He and D. Wang, Quark/gluon discrimination and top tagging with dual attention transformer, Eur. Phys. J. C 83 (2023), no. 12 1116, [arXiv:2307.04723]

[17] J. A. Aguilar-Saavedra, E. Arganda, F. R. Joaquim, R. M. Sandá Seoane, and J. F. Seabra, Gradient Boosting MUST taggers for highly-boosted jets, arXiv:2305.04957

[18] D. Athanasakos, A. J. Larkoski, and J. Mulligan, Is infrared-collinear safe information all you need for jet classification?, arXiv:2305.08979

[19] M. Grossi, M. Incudini, M. Pellen, and G. Pelliccioli, Amplitude-assisted tagging of longitudinally polarised bosons using wide neural networks, Eur. Phys. J. C 83 (2023), no. 8 759, [arXiv:2306.07726]

[20] A. Hammad, P. Ko, C.-T. Lu, and M. Park, Exploring exotic decays of the Higgs boson to multi-photons at the LHC via multimodal learning approaches, JHEP 09 (2024) 166, [arXiv:2405.18834]

[21] A. Hammad and M. M. Nojiri, Streamlined jet tagging network assisted by jet prong structure, JHEP 06 (2024) 176, [arXiv:2404.14677].
CMS Collaboration, A. M. Sirunyan et al., Identification of heavy-flavour jets with the CMS detector in pp collisions at 13 TeV, JINST 13 (2018), no. 05 P05011, [arXiv:1712.07158].
ATLAS Collaboration, Identification of Jets Co...

[22] M. Andrews et al., End-to-end jet classification of boosted top quarks with the CMS open data, EPJ Web Conf. 251 (2021) 04030, [arXiv:2104.14659]

[23] P. Keicher, Machine Learning in Top Physics in the ATLAS and CMS Collaborations, in 15th International Workshop on Top Quark Physics, 1, 2023. arXiv:2301.09534

[24] P. Baroň, J. Kvita, R. Přívara, J. Tomeček, and R. Vodák, Application of Machine Learning Based Top Quark and W Jet Tagging to Hadronic Four-Top Final States Induced by SM as well as BSM Processes, in 16th International Workshop on Top Quark Physics, 10, 2023. arXiv:2310.13009

[25] A. Hammad, S. Moretti, and M. Nojiri, Multi-scale cross-attention transformer encoder for event classification, JHEP 03 (2024) 144, [arXiv:2401.00452]

[26] W. Esmail, A. Hammad, and S. Moretti, Sharpening the A→Z(∗)h signature of the Type-II 2HDM at the LHC through advanced Machine Learning, JHEP 11 (2023) 020, [arXiv:2305.13781]

[27] K. Datta, A. Larkoski, and B. Nachman, Automating the Construction of Jet Observables with Machine Learning, Phys. Rev. D 100 (2019), no. 9 095016, [arXiv:1902.07180]

[28] A. Chakraborty, S. H. Lim, and M. M. Nojiri, Interpretable deep learning for two-prong jet classification with jet spectra, JHEP 07 (2019) 135, [arXiv:1904.02092]

[29] T. Kim and A. Martin, A W± polarization analyzer from Deep Neural Networks, arXiv:2102.05124

[30] A. Subba and R. K. Singh, Role of polarizations and spin-spin correlations of W’s in e-e+ → W-W+ at √s = 250 GeV to probe anomalous W-W+Z/γ couplings, Phys. Rev. D 107 (2023), no. 7 073004, [arXiv:2212.12973]

[31] A. Bogatskiy, T. Hoffman, D. W. Miller, J. T. Offermann, and X. Liu, Explainable equivariant neural networks for particle physics: PELICAN, JHEP 03 (2024) 113, [arXiv:2307.16506]

[32] S. Akar, T. J. Boettcher, S. Carl, H. F. Schreiner, M. D. Sokoloff, M. Stahl, C. Weisser, and M. Williams, An updated hybrid deep learning algorithm for identifying and locating primary vertices, arXiv:2007.01023

[33] J. Shlomi, S. Ganguly, E. Gross, K. Cranmer, Y. Lipman, H. Serviansky, H. Maron, and N. Segol, Secondary vertex finding in jets with neural networks, Eur. Phys. J. C 81 (2021), no. 6 540, [arXiv:2008.02831]

[34] K. Goto, T. Suehara, T. Yoshioka, M. Kurata, H. Nagahara, Y. Nakashima, N. Takemura, and M. Iwasaki, Development of a vertex finding algorithm using Recurrent Neural Network, Nucl. Instrum. Meth. A 1047 (2023) 167836, [arXiv:2101.11906]

[35] J. Guiang et al., Improving tracking algorithms with machine learning: a case for line-segment tracking at the High Luminosity LHC, in Connecting The Dots 2023, 3, 2024. arXiv:2403.13166

[36] J. Erdmann, A tagger for strange jets based on tracking information using long short-term memory, JINST 15 (2020), no. 01 P01021, [arXiv:1907.07505]

[37] Y. Nakai, D. Shih, and S. Thomas, Strange Jet Tagging, arXiv:2003.09517

[38] J. Erdmann, O. Nackenhorst, and S. V. Zeißner, Maximum performance of strange-jet tagging at hadron colliders, JINST 16 (2021), no. 08 P08039, [arXiv:2011.10736]

[39] P. T. Komiske, E. M. Metodiev, and M. D. Schwartz, Deep learning in color: towards automated quark/gluon jet discrimination, JHEP 01 (2017) 110, [arXiv:1612.01551]

[40] T. Cheng, Recursive Neural Networks in Quark/Gluon Tagging, Comput. Softw. Big Sci. 2 (2018), no. 1 3, [arXiv:1711.02633]

[41] M. Abbas, A. Khan, A. S. Qureshi, and M. W. Khan, Extracting Signals of Higgs Boson From Background Noise Using Deep Neural Networks, arXiv:2010.08201.
CMS Collaboration, A. Tumasyan et al., Search for Higgs Boson and Observation of Z Boson through their Decay into a Charm Quark-Antiquark Pair in Boosted Topologies in Proton-Proton Collisions at √s = 13 TeV,...

[42] Z. Zhang, J. Liu, J. Hu, Q. Wang, and U.-G. Meißner, Revealing the nature of hidden charm pentaquarks with machine learning, Sci. Bull. 68 (2023) 981–989, [arXiv:2301.05364]

[43] K. Goswami, S. Prasad, N. Mallick, R. Sahoo, and G. B. Mohanty, A machine learning-based study of open-charm hadrons in proton-proton collisions at the Large Hadron Collider, arXiv:2404.09839

[44] H. Qu, C. Li, and S. Qian, Particle Transformer for Jet Tagging, arXiv:2202.03772

[45] Y. Wu, K. Wang, C. Li, H. Qu, and J. Zhu, Jet tagging with more-interaction particle transformer*, Chin. Phys. C 49 (2025), no. 1 013110, [arXiv:2407.08682]

[46] T. Ye, L. Dong, Y. Xia, Y. Sun, Y. Zhu, G. Huang, and F. Wei, Differential transformer, arXiv preprint arXiv:2410.05258 (2024)

[47] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017)

[48] P. T. Komiske, E. M. Metodiev, and J. Thaler, Energy Flow Networks: Deep Sets for Particle Jets, JHEP 01 (2019) 121, [arXiv:1810.05165]

[49] H. Qu and L. Gouskos, ParticleNet: Jet Tagging via Particle Clouds, Phys. Rev. D 101 (2020), no. 5 056019, [arXiv:1902.08570]

[50] B. Käch, D. Krücker, and I. Melzer-Pellmann, Point Cloud Generation using Transformer Encoders and Normalising Flows, arXiv:2211.13623

[51] F. Blekman, F. Canelli, A. De Moor, K. Gautam, A. Ilg, A. Macchiolo, and E. Ploerer, Tagging more quark jet flavours at FCC-ee at 91 GeV with a transformer-based neural network, Eur. Phys. J. C 85 (2025), no. 2 165, [arXiv:2406.08590]

[52] R. Child, S. Gray, A. Radford, and I. Sutskever, Generating long sequences with sparse transformers, arXiv preprint arXiv:1904.10509 (2019)

[53] Z. Fu, W. Song, Y. Wang, X. Wu, Y. Zheng, Y. Zhang, D. Xu, X. Wei, T. Xu, and X. Zhao, Sliding window attention training for efficient large language models, arXiv preprint arXiv:2502.18845 (2025)

[54] X. Pan, T. Ye, Z. Xia, S. Song, and G. Huang, Slide-transformer: Hierarchical vision transformer with local self-attention, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2082–2091, 2023

[55] I. Beltagy, M. E. Peters, and A. Cohan, Longformer: The long-document transformer, arXiv preprint arXiv:2004.05150 (2020)

[56] A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, and H. Shi, Escaping the big data paradigm with compact transformers, arXiv preprint arXiv:2104.05704 (2021)
