Spatially Aware Linear Transformer (SAL-T) for Particle Jet Tagging

Aaron Wang; Abhijith Gandrakota; Elham E Khoda; Javier Duarte; Jennifer Ngadiuba; Richard Cavanaugh; Subash Katel; Vivekanand Gyanchand Sahu; Zihan Zhao

arxiv: 2510.23641 · v2 · pith:INJQUSEKnew · submitted 2025-10-24 · 💻 cs.LG · cs.AI · hep-ex · physics.ins-det

Pith reviewed 2026-05-21 19:34 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · hep-ex · physics.ins-det

keywords linear transformer · jet tagging · particle physics · linformer · spatially aware partitioning · convolutional layers · high energy physics · machine learning

The pith

SAL-T matches full-attention transformer accuracy on jet tagging while using linear attention for lower latency and fewer resources.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Spatially Aware Linear Transformer to handle the quadratic cost of standard transformers when classifying particle jets in high-energy collisions. It modifies the linformer by partitioning particles into regions according to kinematic features and inserting convolutional layers that reflect jet physics structures for local correlations. This keeps overall attention linear in sequence length. The resulting model outperforms the plain linformer and reaches classification performance comparable to full quadratic transformers while cutting resource use and inference latency. The same pattern holds on a separate point-cloud benchmark.

Core claim

SAL-T preserves linear attention complexity by computing attention across spatially aware particle regions defined by kinematic features and by using convolutional layers to model local jet correlations, thereby matching the tagging accuracy of full-attention transformers at substantially lower computational cost.
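For reference, the complexity argument behind this claim can be written out using the standard Linformer formulation (reference [9]). This is a sketch of the general idea rather than SAL-T's exact equations: the symbols n (number of particles), d (feature dimension), k (projected sequence length), and the projection matrices E and F follow the Linformer paper and are not defined in this summary.

```latex
% Full attention: the n x n score matrix makes the cost quadratic in n.
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,
\qquad Q, K, V \in \mathbb{R}^{n \times d},
\qquad \text{cost } O(n^{2} d)

% Linformer: keys and values are first projected along the sequence axis,
% so the score matrix is n x k and the cost is linear in n for fixed k.
\mathrm{LinAttn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q (E K)^{\top}}{\sqrt{d}}\right) F V,
\qquad E, F \in \mathbb{R}^{k \times n},
\qquad \text{cost } O(n k d)
```

With k held fixed as n grows, the second expression scales linearly in the number of particles, which is the property the core claim says SAL-T retains.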

What carries the argument

Spatially aware partitioning of particles into kinematic regions combined with convolutional layers inside a linear-attention linformer backbone.
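To make the mechanism concrete, here is a minimal NumPy sketch of Linformer-style attention in which keys and values are split into contiguous partitions and each partition is compressed by its own projection. It is an illustration under stated assumptions, not the authors' implementation: the function name, the projection shape `(p, r, n // p)`, and the equal-size contiguous partitions are hypothetical choices, and the convolutional step is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def partitioned_linear_attention(q, k, v, proj_k, proj_v):
    """Hypothetical partitioned linear attention.

    q, k, v        : arrays of shape (n, d), assumed already sorted so that
                     neighbouring rows are spatially close.
    proj_k, proj_v : arrays of shape (p, r, n // p); partition g compresses
                     its n // p particles into r summary rows.
    Returns the attended values (n, d) and the weights (n, p * r).
    """
    n, d = q.shape
    p, r, group_size = proj_k.shape
    assert p * group_size == n, "partitions must tile the sequence exactly"

    # Split keys and values into p contiguous groups along the sequence axis.
    k_groups = k.reshape(p, group_size, d)
    v_groups = v.reshape(p, group_size, d)

    # Compress each group independently, then stack the summaries.
    k_small = np.einsum("prg,pgd->prd", proj_k, k_groups).reshape(p * r, d)
    v_small = np.einsum("prg,pgd->prd", proj_v, v_groups).reshape(p * r, d)

    # The score matrix is (n, p * r): linear in n for fixed p and r.
    weights = softmax(q @ k_small.T / np.sqrt(d), axis=-1)
    return weights @ v_small, weights


rng = np.random.default_rng(0)
n, d, p, r = 8, 4, 4, 1
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
proj_k = rng.normal(size=(p, r, n // p))
proj_v = rng.normal(size=(p, r, n // p))
out, weights = partitioned_linear_attention(q, k, v, proj_k, proj_v)
```

Each column block of `weights` corresponds to one partition, so the attention a particle pays to a given region can be read off directly.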

If this is right

SAL-T can be deployed in high-data-throughput environments such as the CERN LHC without the latency penalty of quadratic attention.
The model outperforms the unmodified linformer on the same jet classification tasks.
Classification results remain comparable to full-attention transformers across the reported benchmarks.
The approach transfers to generic point-cloud classification as demonstrated on ModelNet10.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same partitioning-plus-convolution pattern could reduce model size requirements in other physics data tasks that involve spatially structured point clouds.
Real-time event filtering at detectors might become feasible if the latency reduction scales with larger event multiplicities.
Domain-specific region definitions may offer a general route to keep linear transformers competitive with quadratic ones in scientific applications.

Load-bearing premise

Partitioning particles by kinematic features and adding convolutional layers captures all task-relevant global and local correlations for jet tagging without meaningful information loss or bias.

What would settle it

A side-by-side run on the jet tagging dataset in which SAL-T accuracy falls clearly below that of a full-attention transformer would falsify the claimed performance equivalence; measured inference latency higher than reported would undercut the efficiency claim.

Figures

Figures reproduced from arXiv: 2510.23641 by Aaron Wang, Abhijith Gandrakota, Elham E Khoda, Javier Duarte, Jennifer Ngadiuba, Richard Cavanaugh, Subash Katel, Vivekanand Gyanchand Sahu, Zihan Zhao.

**Figure 1.** (Left) Jet constituents partitioned and sorted by kT in the (∆η, ∆ϕ) plane in SAL-T, showing how constituents are binned spatially before projection. (Center) Jet constituents partitioned by transverse momentum in the (∆η, ∆ϕ) plane. (Right) Visualization of the projection partitioning strategy used in SAL-T. Jet constituents are partitioned into spatial bins before projection, preserving local structure…

**Figure 2.** (Left) Architecture of the linear partitioned particle multi-head attention (LPP-MHA) module used in SAL-T. The input query, key, and value sequences of dimension n × m are linearly projected to dimension n × d, then spatially partitioned into p groups of size p × d. Attention weights are computed via scaled dot-product attention within each partition, followed by a depthwise convolution over the attention…

**Figure 3.** (Left) Jet classification accuracy of SAL-T, Linformer, and standard Transformer across bins of increasing number of particles per jet. SAL-T consistently matches or exceeds the accuracy of linformer and remains competitive with full transformers. Performance variance in the highest bin (115–150 particles) is attributable to its small sample size (only 41 jets). (Right) Floating-point operation (FLOP) coun…

**Figure 4.** Attention matrices for a top quark jet with 81 particles. Each trio of attention plots rep…

**Figure 5.** Attention matrices for a top quark jet with 121 particles. Each trio of attention matrices…

**Figure 6.** Attention matrices for a top quark jet with 44 particles. Each trio of attention matrices…

**Figure 7.** Histogram of ratio of attention to partition 0 in head 2 throughout all particles in the test…

**Figure 8.** Training/validation accuracy and loss curves for SAL-T with learning rate decay vs batch…

Original abstract

Transformers are very effective in capturing both global and local correlations within high-energy particle collisions, but they present deployment challenges in high-data-throughput environments, such as the CERN LHC. The quadratic complexity of transformer models demands substantial resources and increases latency during inference. In order to address these issues, we introduce the Spatially Aware Linear Transformer (SAL-T), a physics-inspired enhancement of the linformer architecture that maintains linear attention. Our method incorporates spatially aware partitioning of particles based on kinematic features, thereby computing attention between regions of physical significance. Additionally, we employ convolutional layers to capture local correlations, informed by insights from jet physics. In addition to outperforming the standard linformer in jet classification tasks, SAL-T also achieves classification results comparable to full-attention transformers, while using considerably fewer resources with lower latency during inference. Experiments on a generic point cloud classification dataset (ModelNet10) further confirm this trend. Our code is available at https://github.com/aaronw5/SAL-T4HEP.
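The sorting and partitioning step mentioned in the abstract can be sketched as follows. This is a hedged illustration, not the code from the linked repository: it assumes kT is approximated as pT · ∆R with ∆R measured from the jet axis, that ∆ϕ is already wrapped into a sensible range, and that partitions are contiguous slices of the sorted order. The function name and the descending sort direction are hypothetical.

```python
import numpy as np


def sort_and_partition(pt, deta, dphi, num_partitions):
    """Sort constituents by an assumed kT proxy and split them into groups.

    pt, deta, dphi : 1D arrays with one entry per jet constituent, where
                     deta and dphi are offsets from the jet axis.
    num_partitions : number of contiguous groups to form.
    Returns the sort order and a list of index arrays, one per partition.
    """
    delta_r = np.hypot(deta, dphi)  # angular distance from the jet axis
    kt = pt * delta_r               # assumed kT proxy: pT * Delta R
    order = np.argsort(-kt, kind="stable")  # descending kT
    return order, np.array_split(order, num_partitions)


pt = np.array([10.0, 5.0, 1.0, 20.0])
deta = np.array([0.1, 0.4, 0.2, 0.0])
dphi = np.array([0.0, 0.0, 0.0, 0.2])
order, groups = sort_and_partition(pt, deta, dphi, num_partitions=2)
```

In this toy input the kT proxies are 1.0, 2.0, 0.2, and 4.0, so the last constituent is ranked first and the two partitions hold the two highest and two lowest values respectively.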

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SAL-T is a useful domain-specific tweak to linformer that delivers practical speedups for jet tagging while matching full attention accuracy on the reported tasks, though the spatial partitioning step still needs evidence that it does not drop key cross-region correlations.

The main point is that this work takes the linformer idea and adds kinematic spatial partitioning plus convolutional layers to make it more effective for jet tagging. The result is an architecture that runs with linear complexity, beats the plain linformer, and reaches accuracy levels close to standard transformers while using less memory and time at inference. They also show the same pattern holds on a generic point-cloud dataset, and the code is public, which helps reproducibility.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the Spatially Aware Linear Transformer (SAL-T), a physics-inspired modification of the Linformer architecture for particle jet tagging. Particles are partitioned into regions based on kinematic features, with attention computed at the region level and convolutional layers added to capture local correlations. The central claims are that SAL-T outperforms the standard Linformer, achieves classification accuracy comparable to full-attention transformers on jet tagging tasks, and does so with linear complexity, substantially lower resource usage, and reduced inference latency. The approach is further tested on the ModelNet10 point-cloud dataset, and code is released publicly.

Significance. If the empirical results hold under scrutiny, the work offers a practical route to deploying expressive attention-based models in high-throughput HEP environments such as the LHC, where quadratic complexity is prohibitive. The combination of spatially aware partitioning and convolutional layers represents a targeted, physics-motivated attempt to retain task-relevant correlations while enforcing linearity. Public code release supports reproducibility and follow-on work.

major comments (1)

[Architecture / Method] The headline claim of performance comparable to full-attention transformers rests on the assumption that region-level attention plus convolutional layers fully preserve the multi-particle correlations needed for jet substructure (e.g., boosted-object identification or specific decay chains). The architecture description implies a reduction from per-particle to per-region interactions; an explicit ablation or correlation-preservation analysis (for example, comparing intra- vs. inter-region attention contributions on representative jet topologies) is required to substantiate that no critical long-range dependencies are lost.

minor comments (2)

Clarify the exact procedure for determining region boundaries and the number of regions; a sensitivity study with respect to these hyperparameters would strengthen the reproducibility of the spatially aware partitioning step.
The ModelNet10 experiments are mentioned only briefly; a short table or paragraph comparing SAL-T against the same baselines used for jet tagging would make the generalizability claim easier to evaluate.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's potential impact. We address the single major comment below and outline the revisions we will make.

Referee: [Architecture / Method] The headline claim of performance comparable to full-attention transformers rests on the assumption that region-level attention plus convolutional layers fully preserve the multi-particle correlations needed for jet substructure (e.g., boosted-object identification or specific decay chains). The architecture description implies a reduction from per-particle to per-region interactions; an explicit ablation or correlation-preservation analysis (for example, comparing intra- vs. inter-region attention contributions on representative jet topologies) is required to substantiate that no critical long-range dependencies are lost.

Authors: We agree that an explicit ablation would provide stronger substantiation for the claim that critical correlations are retained. While the comparable classification accuracy on jet tagging tasks (including boosted topologies) already indicates that the physics-motivated partitioning and convolutional layers preserve task-relevant information, we will add a dedicated correlation-preservation analysis in the revised manuscript. This will include quantitative comparison of intra- versus inter-region attention contributions on representative jet samples, attention visualization for specific decay chains, and discussion of how the kinematic region boundaries are chosen to align with expected substructure scales. revision: yes
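One way the promised intra- versus inter-region comparison could be computed is sketched below. This is an assumption about how such an analysis might look, not something taken from the paper or the rebuttal: it presumes attention weights of shape (n, p · r) whose columns are grouped by partition, equal-size contiguous partitions of the queries, and rows that sum to one. The function name is hypothetical.

```python
import numpy as np


def intra_inter_share(weights, num_partitions):
    """Average attention mass a query sends to its own vs. other partitions.

    weights        : array of shape (n, k), rows are softmax outputs and the
                     k columns are grouped into num_partitions equal blocks.
    num_partitions : number of partitions p (n and k must be multiples of p).
    Returns (intra, inter) as floats.
    """
    n, k = weights.shape
    assert n % num_partitions == 0 and k % num_partitions == 0
    r = k // num_partitions

    # Total attention each query assigns to each partition: shape (n, p).
    per_partition = weights.reshape(n, num_partitions, r).sum(axis=2)

    # Partition index of each query, assuming contiguous equal-size groups.
    query_partition = np.arange(n) * num_partitions // n

    intra = per_partition[np.arange(n), query_partition].mean()
    inter = per_partition.sum(axis=1).mean() - intra
    return float(intra), float(inter)


# Toy check: every query attends only to its own partition.
block_diagonal = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
intra_bd, inter_bd = intra_inter_share(block_diagonal, num_partitions=2)

# Toy check: attention spread evenly over both partitions.
uniform = np.full((4, 2), 0.5)
intra_u, inter_u = intra_inter_share(uniform, num_partitions=2)
```

A large inter-region share on jets with widely separated prongs would suggest that cross-region information is being used rather than discarded.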

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on external benchmarks

The paper introduces SAL-T as a physics-motivated modification to the linformer architecture, incorporating spatially aware partitioning and convolutional layers for jet tagging. All performance claims (comparable accuracy to full-attention transformers with lower latency) are supported by direct experimental results on standard jet classification datasets and ModelNet10, without any equations, derivations, or predictions that reduce by construction to fitted parameters or self-referential definitions. No load-bearing self-citations or uniqueness theorems are invoked to justify core choices; the contribution is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the spatial partitioning step implicitly requires choices of region boundaries or feature thresholds that are not detailed here.

pith-pipeline@v0.9.0 · 5747 in / 1105 out tokens · 51857 ms · 2026-05-21T19:34:01.071974+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "sort the input particles by spatial proximity in the (∆η,∆ϕ) plane, weighted by transverse momentum pT ... partition the key and value projections into p groups ... depthwise 2D convolution over each head’s raw attention scores"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Linformer encodes no spatial information ... SAL-T introduces spatial awareness through three modifications"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

E-PCN: Jet Tagging with Explainable Particle Chebyshev Networks Using Kinematic Features
hep-ph 2025-12 conditional novelty 5.0

E-PCN reaches 94.67% macro-accuracy on 10-class jet tagging by weighting graphs with angular separation, transverse momentum, momentum fraction, and invariant mass, with Grad-CAM showing the first two account for 76% ...

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 1 Pith paper · 13 internal anchors

[1] A. Vaswani et al., “Attention is all you need”, in Advances in Neural Information Processing Systems, I. Guyon et al., eds., volume 30. Curran Associates, Inc., 2017. arXiv:1706.03762
[2] L. Evans and P. Bryant, “LHC Machine”, JINST 3 (2008) S08001, doi:10.1088/1748-0221/3/08/S08001
[3] H. Qu, C. Li, and S. Qian, “Particle Transformer for Jet Tagging”, in Proceedings of the 39th International Conference on Machine Learning, K. Chaudhuri et al., eds., volume 162, p. 18281. 2022. arXiv:2202.03772
[4] CMS Collaboration, “Search for highly energetic double Higgs boson production in the two bottom quark and two vector boson all-hadronic final state”, CMS Physics Analysis Summary CMS-PAS-HIG-23-012, 2024
[5] CMS Collaboration, “Performance of the CMS Level-1 trigger in proton-proton collisions at √s = 13 TeV”, JINST 15 (2020) P10017, arXiv:2006.10165
[6] ATLAS Collaboration, “Operation of the ATLAS trigger system in Run 2”, JINST 15 (2020) P10004, arXiv:2007.12539
[7] CMS Collaboration, “Realtime Anomaly Detection at the L1 Trigger of CMS Experiment”, PoS ICHEP2024 (2025) 1025, doi:10.22323/1.476.1025, arXiv:2411.19506
[8] CMS Collaboration, “Performance of the CMS high-level trigger during LHC Run 2”, JINST 19 (2024) P11021, doi:10.1088/1748-0221/19/11/P11021, arXiv:2410.17038
[9] S. Wang et al., “Linformer: Self-attention with linear complexity”, 2020. arXiv:2006.04768
[10] CMS Collaboration, F. Beaudette, “The CMS Particle Flow Algorithm”, in International Conference on Calorimetry for the High Energy Frontier, pp. 295–304. 2013. arXiv:1401.8155
[11] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer”, 2020. arXiv:2004.05150
[12] K. Choromanski et al., “Rethinking attention with performers”, in International Conference on Learning Representations. 2021. arXiv:2009.14794
[13] N. Kitaev, Łukasz Kaiser, and A. Levskaya, “Reformer: The efficient transformer”, in International Conference on Learning Representations. 2020. arXiv:2001.04451
[14] Y. Wu et al., “Jet tagging with more-interaction particle transformer*”, Chin. Phys. C 49 (2025) 013110, doi:10.1088/1674-1137/ad7f3d, arXiv:2407.08682
[15] S. Miao et al., “Locality-Sensitive Hashing-Based Efficient Point Transformer with Applications in High-Energy Physics”, in Proceedings of the 41st International Conference on Machine Learning, R. Salakhutdinov et al., eds., volume 235, p. 35546. 2024. arXiv:2402.12535
[16] J. Lee et al., “Set transformer: A framework for attention-based permutation-invariant neural networks”, in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, eds., volume 97, p. 3744. 2019
[17] R. Kansal et al., “Evaluating generative models in high energy physics”, Phys. Rev. D 107 (2023) 076017, doi:10.1103/PhysRevD.107.076017, arXiv:2211.10295
[18] A. Li et al., “Induced Generative Adversarial Particle Transformers”, in 6th Machine Learning and the Physical Sciences Workshop at the 37th Conference on Neural Information Processing Systems. 2023. arXiv:2312.04757
[19] M. Cacciari, G. P. Salam, and G. Soyez, “The anti-kT jet clustering algorithm”, JHEP 04 (2008) 063, doi:10.1088/1126-6708/2008/04/063, arXiv:0802.1189
[20] S. Catani, Y. L. Dokshitzer, M. H. Seymour, and B. R. Webber, “Longitudinally invariant kT clustering algorithms for hadron hadron collisions”, Nucl. Phys. B 406 (1993) 187, doi:10.1016/0550-3213(93)90166-M
[21] S. D. Ellis and D. E. Soper, “Successive combination jet algorithm for hadron collisions”, Phys. Rev. D 48 (1993) 3160, doi:10.1103/PhysRevD.48.3160, arXiv:hep-ph/9305266
[22] M. Cacciari, G. P. Salam, and G. Soyez, “FastJet user manual”, Eur. Phys. J. C 72 (2012) 1896, doi:10.1140/epjc/s10052-012-1896-2, arXiv:1111.6097
[23] O. Golovneva, T. Wang, J. Weston, and S. Sukhbaatar, “Multi-token attention”, 2025. arXiv:2504.00927
[24] J. Duarte et al., “Fast inference of deep neural networks in FPGAs for particle physics”, JINST 13 (2018) P07027, doi:10.1088/1748-0221/13/07/P07027, arXiv:1804.06913
[25] M. Pierini, J. M. Duarte, N. Tran, and M. Freytsis, “hls4ml LHC jet dataset (150 particles)”, 2020. doi:10.5281/zenodo.3602260
[26] P. Odagiu et al., “Ultrafast jet classification on FPGAs for HL-LHC”, Mach. Learn.: Sci. Technol. 5 (2024) 035017, doi:10.1088/2632-2153/ad5f10, arXiv:2402.01876
[27] G. Kasieczka et al., “The Machine Learning landscape of top taggers”, SciPost Phys. 7 (2019) 014, doi:10.21468/SciPostPhys.7.1.014, arXiv:1902.09914
[28] G. Kasieczka, T. Plehn, J. Thompson, and M. Russel, “Top quark tagging reference dataset (v0 (2018 03 27))”, 2019. doi:10.5281/zenodo.2603256
[29] P. T. Komiske, E. M. Metodiev, and J. Thaler, “Energy Flow Networks: Deep Sets for Particle Jets”, JHEP 01 (2019) 121, doi:10.1007/JHEP01(2019)121, arXiv:1810.05165
[30] P. Komiske, E. Metodiev, and J. Thaler, “Pythia8 quark and gluon jets for energy flow”, 2019. doi:10.5281/zenodo.3164691
[31] Z. Wu et al., “3d shapenets: A deep representation for volumetric shapes”, 2015. https://arxiv.org/abs/1406.5670
[32] J. Zhu et al., “Transformers without normalization”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2025. arXiv:2503.10622
[33] L. Smarr et al., “The Pacific Research Platform: Making High-Speed Networking a Reality for the Scientist”, in Proceedings of the Practice and Experience on Advanced Research Computing: Seamless Creativity. 2018. doi:10.1145/3219104.3219108
[34] National Research Platform, “National Research Platform”. https://nationalresearchplatform.org/, 2025. Accessed: 2025-05-15
[35] M. Zaheer et al., “Deep sets”, in Advances in Neural Information Processing Systems, I. Guyon et al., eds., volume 30. Curran Associates, Inc., 2017. arXiv:1703.06114
[36] P. W. Battaglia et al., “Interaction networks for learning about objects, relations and physics”, in Advances in Neural Information Processing Systems, D. Lee et al., eds., volume 29. Curran Associates, Inc., 2016. arXiv:1612.00222
[37] E. A. Moreno et al., “JEDI-net: a jet identification algorithm based on interaction networks”, Eur. Phys. J. C 80 (2020) 58, doi:10.1140/epjc/s10052-020-7608-4, arXiv:1908.05318
[38] F. A. Dreyer, G. P. Salam, and G. Soyez, “The Lund Jet Plane”, JHEP 12 (2018) 064, doi:10.1007/JHEP12(2018)064, arXiv:1807.04758

Supplementary material, LLM usage: Large Language Models (LLMs) were used as a general purpose writing and editing assistant in the preparation of this manuscript. Specifically, LLMs helped with phrasing improvements, grammar check...

[39] We use the following ablations to the full SAL-T model: only partitioning the value matrix, only partitioning the key matrix, only using one set of projections for both the key and value matrix (Share EF), without convolution, and without partitioning. We repeat these experiments for inputs sorted by all pT, ∆R, and kT. We find that kT sorting of inputs co...

[1] [1]

Attention Is All You Need

A. Vaswani et al., “Attention is all you need”, inAdvances in Neural Information Processing Systems, I. Guyon et al., eds., volume 30. Curran Associates, Inc., 2017. arXiv:1706.03762

work page internal anchor Pith review Pith/arXiv arXiv 2017

[2] [2]

LHC Machine

L. Evans and P. Bryant, “LHC Machine”,JINST3(2008) S08001, doi:10.1088/1748-0221/3/08/S08001

work page doi:10.1088/1748-0221/3/08/s08001 2008

[3] [3]

Particle Transformer for Jet Tagging

H. Qu, C. Li, and S. Qian, “Particle Transformer for Jet Tagging”, inProceedings of the 39th International Conference on Machine Learning, K. Chaudhuri et al., eds., volume 162, p. 18281. 2022.arXiv:2202.03772

work page arXiv 2022

[4] [4]

Search for highly energetic double Higgs boson production in the two bottom quark and two vector boson all-hadronic final state

CMS Collaboration, “Search for highly energetic double Higgs boson production in the two bottom quark and two vector boson all-hadronic final state”, CMS Physics Analysis Summary CMS-PAS-HIG-23-012, 2024

work page 2024

[5] [5]

Performance of the CMS Level-1 trigger in proton-proton collisions at√s= 13TeV

CMS Collaboration, “Performance of the CMS Level-1 trigger in proton-proton collisions at√s= 13TeV”,JINST15(2020) P10017,arXiv:2006.10165

work page arXiv 2020

[6] [6]

Operation of the ATLAS trigger system in Run 2

ATLAS Collaboration, “Operation of the ATLAS trigger system in Run 2”,JINST15(2020) P10004,arXiv:2007.12539

work page arXiv 2020

[7] [7]

Realtime Anomaly Detection at the L1 Trigger of CMS Experiment

CMS Collaboration, “Realtime Anomaly Detection at the L1 Trigger of CMS Experiment”, PoSICHEP2024(2025) 1025,doi:10.22323/1.476.1025,arXiv:2411.19506

work page doi:10.22323/1.476.1025 2025

[8] [8]

Performance of the CMS high-level trigger during LHC Run 2

CMS Collaboration, “Performance of the CMS high-level trigger during LHC Run 2”,JINST 19(2024) P11021,doi:10.1088/1748-0221/19/11/P11021, arXiv:2410.17038

work page doi:10.1088/1748-0221/19/11/p11021 2024

[9] [9]

Linformer: Self-Attention with Linear Complexity

S. Wang et al., “Linformer: Self-attention with linear complexity”, 2020. arXiv:2006.04768. 11

work page internal anchor Pith review Pith/arXiv arXiv 2020

[10] [10]

The CMS Particle Flow Algorithm

CMS Collaboration, F. Beaudette, “The CMS Particle Flow Algorithm”, inInternational Conference on Calorimetry for the High Energy Frontier, pp. 295–304. 2013. arXiv:1401.8155

work page internal anchor Pith review Pith/arXiv arXiv 2013

[11] [11]

Longformer: The Long-Document Transformer

I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer”, 2020. arXiv:2004.05150

work page internal anchor Pith review Pith/arXiv arXiv 2020

[12] [12]

Rethinking Attention with Performers

K. Choromanski et al., “Rethinking attention with performers”, inInternational Conference on Learning Representations. 2021.arXiv:2009.14794

work page internal anchor Pith review Pith/arXiv arXiv 2021

[13] [13]

Reformer: The Efficient Transformer

N. Kitaev, Łukasz Kaiser, and A. Levskaya, “Reformer: The efficient transformer”, in International Conference on Learning Representations. 2020.arXiv:2001.04451

work page internal anchor Pith review Pith/arXiv arXiv 2020

[14] [14]

Jet tagging with more-interaction particle transformer*

Y . Wu et al., “Jet tagging with more-interaction particle transformer*”,Chin. Phys. C49 (2025) 013110,doi:10.1088/1674-1137/ad7f3d,arXiv:2407.08682

work page doi:10.1088/1674-1137/ad7f3d 2025

[15] [15]

Locality-Sensitive Hashing-Based Efficient Point Transformer with Applications in High-Energy Physics

S. Miao et al., “Locality-Sensitive Hashing-Based Efficient Point Transformer with Applications in High-Energy Physics”, inProceedings of the 41st International Conference on Machine Learning, R. Salakhutdinov et al., eds., volume 235, p. 35546. 2024. arXiv:2402.12535

work page arXiv 2024

[16] [16]

Set transformer: A framework for attention-based permutation-invariant neural networks

J. Lee et al., “Set transformer: A framework for attention-based permutation-invariant neural networks”, inProceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, eds., volume 97, p. 3744. 2019

work page 2019

[17] [17]

Evaluating generative models in high energy physics

R. Kansal et al., “Evaluating generative models in high energy physics”,Phys. Rev. D107 (2023) 076017,doi:10.1103/PhysRevD.107.076017,arXiv:2211.10295

work page doi:10.1103/physrevd.107.076017 2023

[18] [18]

Induced Generative Adversarial Particle Transformers

A. Li et al., “Induced Generative Adversarial Particle Transformers”, in6th Machine Learning and the Physical Sciences Workshop at the 37th Conference on Neural Information Processing Systems. 2023.arXiv:2312.04757

work page arXiv 2023

[19] [19]

The anti-k_t jet clustering algorithm

M. Cacciari, G. P. Salam, and G. Soyez, “The anti-k T jet clustering algorithm”,JHEP04 (2008) 063,doi:10.1088/1126-6708/2008/04/063,arXiv:0802.1189

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1088/1126-6708/2008/04/063 2008

[20] S. Catani, Y. L. Dokshitzer, M. H. Seymour, and B. R. Webber, “Longitudinally invariant k_T clustering algorithms for hadron hadron collisions”, Nucl. Phys. B 406 (1993) 187, doi:10.1016/0550-3213(93)90166-M

[21] S. D. Ellis and D. E. Soper, “Successive combination jet algorithm for hadron collisions”, Phys. Rev. D 48 (1993) 3160, doi:10.1103/PhysRevD.48.3160, arXiv:hep-ph/9305266

[22] M. Cacciari, G. P. Salam, and G. Soyez, “FastJet user manual”, Eur. Phys. J. C 72 (2012) 1896, doi:10.1140/epjc/s10052-012-1896-2, arXiv:1111.6097

[23] O. Golovneva, T. Wang, J. Weston, and S. Sukhbaatar, “Multi-token attention”, 2025. arXiv:2504.00927

[24] J. Duarte et al., “Fast inference of deep neural networks in FPGAs for particle physics”, JINST 13 (2018) P07027, doi:10.1088/1748-0221/13/07/P07027, arXiv:1804.06913

[25] M. Pierini, J. M. Duarte, N. Tran, and M. Freytsis, “hls4ml LHC jet dataset (150 particles)”, 2020. doi:10.5281/zenodo.3602260

[26] P. Odagiu et al., “Ultrafast jet classification on FPGAs for HL-LHC”, Mach. Learn.: Sci. Technol. 5 (2024) 035017, doi:10.1088/2632-2153/ad5f10, arXiv:2402.01876

[27] G. Kasieczka et al., “The Machine Learning landscape of top taggers”, SciPost Phys. 7 (2019) 014, doi:10.21468/SciPostPhys.7.1.014, arXiv:1902.09914

[28] G. Kasieczka, T. Plehn, J. Thompson, and M. Russel, “Top quark tagging reference dataset (v0 (2018 03 27))”, 2019. doi:10.5281/zenodo.2603256

[29] P. T. Komiske, E. M. Metodiev, and J. Thaler, “Energy Flow Networks: Deep Sets for Particle Jets”, JHEP 01 (2019) 121, doi:10.1007/JHEP01(2019)121, arXiv:1810.05165

[30] P. Komiske, E. Metodiev, and J. Thaler, “Pythia8 quark and gluon jets for energy flow”, 2019. doi:10.5281/zenodo.3164691

[31] Z. Wu et al., “3D ShapeNets: A deep representation for volumetric shapes”, 2015. https://arxiv.org/abs/1406.5670

[32] J. Zhu et al., “Transformers without normalization”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2025. arXiv:2503.10622

[33] L. Smarr et al., “The Pacific Research Platform: Making High-Speed Networking a Reality for the Scientist”, in Proceedings of the Practice and Experience on Advanced Research Computing: Seamless Creativity. 2018. doi:10.1145/3219104.3219108

[34] National Research Platform, “National Research Platform”. https://nationalresearchplatform.org/, 2025. Accessed: 2025-05-15

[35] M. Zaheer et al., “Deep sets”, in Advances in Neural Information Processing Systems, I. Guyon et al., eds., volume 30. Curran Associates, Inc., 2017. arXiv:1703.06114

[36] P. W. Battaglia et al., “Interaction networks for learning about objects, relations and physics”, in Advances in Neural Information Processing Systems, D. Lee et al., eds., volume 29. Curran Associates, Inc., 2016. arXiv:1612.00222

[37] E. A. Moreno et al., “JEDI-net: a jet identification algorithm based on interaction networks”, Eur. Phys. J. C 80 (2020) 58, doi:10.1140/epjc/s10052-020-7608-4, arXiv:1908.05318

[38] F. A. Dreyer, G. P. Salam, and G. Soyez, “The Lund Jet Plane”, JHEP 12 (2018) 064, doi:10.1007/JHEP12(2018)064, arXiv:1807.04758

SUPPLEMENTARY MATERIAL

LLM USAGE

Large Language Models (LLMs) were used as a general purpose writing and editing assistant in the preparation of this manuscript. Specifically, LLMs helped with phrasing improvements, grammar check...

We use the following ablations to the full SAL-T model: only partitioning the value matrix, only partitioning the key matrix, only using one set of projections for both the key and value matrix (Share EF), without convolution, and without partitioning. We repeat these experiments for inputs sorted by all p_T, ΔR, and k_T. We find that k_T sorting of inputs co...