arXiv: 2510.23641 · v2 · pith:INJQUSEKnew · submitted 2025-10-24 · 💻 cs.LG · cs.AI · hep-ex · physics.ins-det

Spatially Aware Linear Transformer (SAL-T) for Particle Jet Tagging

Pith reviewed 2026-05-21 19:34 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · hep-ex · physics.ins-det
keywords linear transformer · jet tagging · particle physics · linformer · spatially aware partitioning · convolutional layers · high energy physics · machine learning

The pith

SAL-T matches full-attention transformer accuracy on jet tagging while using linear attention for lower latency and fewer resources.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Spatially Aware Linear Transformer to handle the quadratic cost of standard transformers when classifying particle jets in high-energy collisions. It modifies the linformer by partitioning particles into regions according to kinematic features and inserting convolutional layers that reflect jet physics structures for local correlations. This keeps overall attention linear in sequence length. The resulting model outperforms the plain linformer and reaches classification performance comparable to full quadratic transformers while cutting resource use and inference latency. The same pattern holds on a separate point-cloud benchmark.
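To make the quadratic-versus-linear contrast concrete, here is a rough back-of-the-envelope sketch that is not taken from the paper. It counts multiply-adds for one attention head under a standard Linformer-style formulation. The function names, the head dimension `d`, and the projected length `k` are illustrative assumptions, and the paper's own FLOP accounting may differ.

```python
def full_attention_flops(n: int, d: int) -> int:
    """Approximate multiply-adds for one head of full attention.

    Q K^T costs n*n*d and the product of the n x n weights with V
    costs another n*n*d, so the total grows quadratically in n.
    """
    return 2 * n * n * d


def linformer_attention_flops(n: int, d: int, k: int) -> int:
    """Approximate multiply-adds for one head of Linformer-style attention.

    Projecting K and V from n rows down to k rows costs 2*n*k*d.
    The n x k score matrix and its product with the projected V
    cost another 2*n*k*d, so the total grows linearly in n for fixed k.
    """
    return 4 * n * k * d


if __name__ == "__main__":
    d, k = 16, 8  # hypothetical head dimension and projected length
    for n in (32, 64, 128):
        print(n, full_attention_flops(n, d), linformer_attention_flops(n, d, k))
```

Under these assumptions, doubling the number of particles quadruples the full-attention cost but only doubles the Linformer-style cost.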

Core claim

SAL-T preserves linear attention complexity by computing attention across spatially aware particle regions defined by kinematic features and by using convolutional layers to model local jet correlations, thereby matching the tagging accuracy of full-attention transformers at substantially lower computational cost.
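One way to picture "regions defined by kinematic features" is the minimal sketch below. It assumes each constituent is described by its transverse momentum and its angular offsets from the jet axis, and it uses the common approximation kT ≈ pT · ΔR. The descending sort order, the equal-size chunking, and every function name are assumptions made for illustration, not the authors' implementation.

```python
import math


def kt(pt: float, d_eta: float, d_phi: float) -> float:
    """Assumed kinematic feature: kT ~ pT * DeltaR relative to the jet axis."""
    return pt * math.hypot(d_eta, d_phi)


def partition_by_kt(particles, num_partitions):
    """Sort constituents by kT and split the sorted order into chunks.

    particles: list of (pt, d_eta, d_phi) tuples.
    Returns a list of at most num_partitions lists of particle indices.
    The descending order is an assumption made for this sketch.
    """
    if not particles:
        return []
    order = sorted(range(len(particles)),
                   key=lambda i: kt(*particles[i]),
                   reverse=True)
    size = math.ceil(len(order) / num_partitions)
    return [order[i:i + size] for i in range(0, len(order), size)]


if __name__ == "__main__":
    jet = [(10.0, 0.1, 0.0), (5.0, 0.4, 0.0), (1.0, 0.1, 0.0), (20.0, 0.0, 0.0)]
    print(partition_by_kt(jet, 2))
```

Each returned chunk plays the role of one region; the attention step then operates on compressed summaries of these chunks rather than on every particle pair.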

What carries the argument

Spatially aware partitioning of particles into kinematic regions combined with convolutional layers inside a linear-attention linformer backbone.
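The following pure-Python sketch illustrates how partitioning can sit inside a Linformer-style attention step. It assumes the particles are already sorted so that each partition is a contiguous block, and that each block of keys and values is compressed to a few rows by a projection matrix shared across blocks. The shared projection, the helper names, and the omission of the convolutional layers are simplifications made here; the paper's module may differ in all of these respects.

```python
import math


def matmul(a, b):
    """Multiply an (m x n) matrix by an (n x k) matrix, both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def project_partitions(x, proj, num_partitions):
    """Compress each contiguous partition of x (n x d) with proj (r x n/p)."""
    assert len(x) % num_partitions == 0, "sketch assumes equal-size partitions"
    size = len(x) // num_partitions
    out = []
    for j in range(num_partitions):
        out.extend(matmul(proj, x[j * size:(j + 1) * size]))
    return out  # shape (p * r) x d


def partitioned_linear_attention(q, k, v, proj_k, proj_v, num_partitions):
    """Attend from all n queries to p * r compressed key/value rows."""
    d = len(q[0])
    k_small = project_partitions(k, proj_k, num_partitions)
    v_small = project_partitions(v, proj_v, num_partitions)
    k_t = [list(col) for col in zip(*k_small)]   # d x (p * r)
    scores = matmul(q, k_t)                      # n x (p * r)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v_small), weights


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    mean_pool = [[0.5, 0.5]]  # hypothetical projection: average each block
    out, w = partitioned_linear_attention(q, q, q, mean_pool, mean_pool, 2)
    print(out)
```

Because the score matrix has n rows and only p · r columns, the cost grows linearly with the number of particles as long as p and r are fixed.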

If this is right

  • SAL-T can be deployed in high-data-throughput environments such as the CERN LHC without the latency penalty of quadratic attention.
  • The model outperforms the unmodified linformer on the same jet classification tasks.
  • Classification results remain comparable to full-attention transformers across the reported benchmarks.
  • The approach transfers to generic point-cloud classification as demonstrated on ModelNet10.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same partitioning-plus-convolution pattern could reduce model size requirements in other physics data tasks that involve spatially structured point clouds.
  • Real-time event filtering at detectors might become feasible if the latency reduction scales with larger event multiplicities.
  • Domain-specific region definitions may offer a general route to keep linear transformers competitive with quadratic ones in scientific applications.

Load-bearing premise

Partitioning particles by kinematic features and adding convolutional layers captures all task-relevant global and local correlations for jet tagging without meaningful information loss or bias.

What would settle it

A side-by-side run on the jet tagging dataset in which SAL-T's accuracy falls clearly below that of a full-attention transformer, or in which its measured inference latency exceeds what is claimed, would falsify the claimed combination of comparable performance and lower cost.
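A side-by-side latency comparison of this kind could be set up along the lines of the small harness below. It is a generic timing sketch using only the Python standard library; `median_latency` and its parameters are hypothetical names, and the callables being timed (for example, a SAL-T forward pass and a full-attention forward pass on the same batch) are left to the reader.

```python
import statistics
import time


def median_latency(fn, *args, repeats=20, warmup=3):
    """Return the median wall-clock time in seconds of fn(*args).

    A few warm-up calls are made first so one-off setup costs
    do not dominate the measurement.
    """
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload; replace with the two model forward passes to compare.
    print(median_latency(sum, list(range(10_000))))
```

Reporting the median rather than the mean reduces the influence of occasional slow runs, though a fair comparison would also need identical hardware, batch size, and input lengths for both models.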

Figures

Figures reproduced from arXiv: 2510.23641 by Aaron Wang, Abhijith Gandrakota, Elham E Khoda, Javier Duarte, Jennifer Ngadiuba, Richard Cavanaugh, Subash Katel, Vivekanand Gyanchand Sahu, Zihan Zhao.

Figure 1: (Left) Jet constituents partitioned and sorted by kT in the (∆η, ∆ϕ) plane in SAL-T, showing how constituents are binned spatially before projection. (Center) Jet constituents partitioned by transverse momentum in the (∆η, ∆ϕ) plane. (Right) Visualization of the projection partitioning strategy used in SAL-T. Jet constituents are partitioned into spatial bins before projection, preserving local structure…
Figure 2: (Left) Architecture of the linear partitioned particle multi-head attention (LPP-MHA) module used in SAL-T. The input query, key, and value sequences of dimension n × m are linearly projected to dimension n × d, then spatially partitioned into p groups of size p × d. Attention weights are computed via scaled dot-product attention within each partition, followed by a depthwise convolution over the attention…
Figure 3: (Left) Jet classification accuracy of SAL-T, Linformer, and standard Transformer across bins of increasing number of particles per jet. SAL-T consistently matches or exceeds the accuracy of linformer and remains competitive with full transformers. Performance variance in the highest bin (115–150 particles) is attributable to its small sample size (only 41 jets). (Right) Floating-point operation (FLOP) coun…
Figure 4: Attention matrices for a top quark jet with 81 particles. Each trio of attention plots rep…
Figure 5: Attention matrices for a top quark jet with 121 particles. Each trio of attention matrices…
Figure 6: Attention matrices for a top quark jet with 44 particles. Each trio of attention matrices…
Figure 7: Histogram of ratio of attention to partition 0 in head 2 throughout all particles in the test…
Figure 8: Training/validation accuracy and loss curves for SAL-T with learning rate decay vs batch…
Original abstract

Transformers are very effective in capturing both global and local correlations within high-energy particle collisions, but they present deployment challenges in high-data-throughput environments, such as the CERN LHC. The quadratic complexity of transformer models demands substantial resources and increases latency during inference. In order to address these issues, we introduce the Spatially Aware Linear Transformer (SAL-T), a physics-inspired enhancement of the linformer architecture that maintains linear attention. Our method incorporates spatially aware partitioning of particles based on kinematic features, thereby computing attention between regions of physical significance. Additionally, we employ convolutional layers to capture local correlations, informed by insights from jet physics. In addition to outperforming the standard linformer in jet classification tasks, SAL-T also achieves classification results comparable to full-attention transformers, while using considerably fewer resources with lower latency during inference. Experiments on a generic point cloud classification dataset (ModelNet10) further confirm this trend. Our code is available at https://github.com/aaronw5/SAL-T4HEP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the Spatially Aware Linear Transformer (SAL-T), a physics-inspired modification of the Linformer architecture for particle jet tagging. Particles are partitioned into regions based on kinematic features, with attention computed at the region level and convolutional layers added to capture local correlations. The central claims are that SAL-T outperforms the standard Linformer, achieves classification accuracy comparable to full-attention transformers on jet tagging tasks, and does so with linear complexity, substantially lower resource usage, and reduced inference latency. The approach is further tested on the ModelNet10 point-cloud dataset, and code is released publicly.

Significance. If the empirical results hold under scrutiny, the work offers a practical route to deploying expressive attention-based models in high-throughput HEP environments such as the LHC, where quadratic complexity is prohibitive. The combination of spatially aware partitioning and convolutional layers represents a targeted, physics-motivated attempt to retain task-relevant correlations while enforcing linearity. Public code release supports reproducibility and follow-on work.

major comments (1)
  1. [Architecture / Method] The headline claim of performance comparable to full-attention transformers rests on the assumption that region-level attention plus convolutional layers fully preserve the multi-particle correlations needed for jet substructure (e.g., boosted-object identification or specific decay chains). The architecture description implies a reduction from per-particle to per-region interactions; an explicit ablation or correlation-preservation analysis (for example, comparing intra- vs. inter-region attention contributions on representative jet topologies) is required to substantiate that no critical long-range dependencies are lost.
minor comments (2)
  1. Clarify the exact procedure for determining region boundaries and the number of regions; a sensitivity study with respect to these hyperparameters would strengthen the reproducibility of the spatially aware partitioning step.
  2. The ModelNet10 experiments are mentioned only briefly; a short table or paragraph comparing SAL-T against the same baselines used for jet tagging would make the generalizability claim easier to evaluate.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's potential impact. We address the single major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Architecture / Method] The headline claim of performance comparable to full-attention transformers rests on the assumption that region-level attention plus convolutional layers fully preserve the multi-particle correlations needed for jet substructure (e.g., boosted-object identification or specific decay chains). The architecture description implies a reduction from per-particle to per-region interactions; an explicit ablation or correlation-preservation analysis (for example, comparing intra- vs. inter-region attention contributions on representative jet topologies) is required to substantiate that no critical long-range dependencies are lost.

    Authors: We agree that an explicit ablation would provide stronger substantiation for the claim that critical correlations are retained. While the comparable classification accuracy on jet tagging tasks (including boosted topologies) already indicates that the physics-motivated partitioning and convolutional layers preserve task-relevant information, we will add a dedicated correlation-preservation analysis in the revised manuscript. This will include quantitative comparison of intra- versus inter-region attention contributions on representative jet samples, attention visualization for specific decay chains, and discussion of how the kinematic region boundaries are chosen to align with expected substructure scales.

    Revision: yes
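The intra- versus inter-region comparison discussed in this exchange could be quantified roughly as in the sketch below. It assumes a row-normalized attention matrix whose rows are query particles and whose columns are compressed key slots, with a region label available for each row and each column. The function name and the label arrays are hypothetical, and this is only one possible definition of the contribution split.

```python
def intra_region_fraction(weights, query_region, key_region):
    """Fraction of total attention mass that stays within the same region.

    weights: n x k attention weights (each row typically sums to 1).
    query_region: length-n list of region labels for the query particles.
    key_region: length-k list of region labels for the key columns.
    Returns a value in [0, 1]; the inter-region fraction is 1 minus this.
    """
    intra = 0.0
    total = 0.0
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            total += w
            if query_region[i] == key_region[j]:
                intra += w
    return intra / total if total else 0.0


if __name__ == "__main__":
    w = [[0.75, 0.25],
         [0.50, 0.50]]
    print(intra_region_fraction(w, [0, 1], [0, 1]))
```

A fraction near 1 would suggest that attention is mostly local to regions, while a substantially lower value would indicate that inter-region terms carry much of the weight, which is the case the referee is concerned about.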

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on external benchmarks

Full rationale

The paper introduces SAL-T as a physics-motivated modification to the linformer architecture, incorporating spatially aware partitioning and convolutional layers for jet tagging. All performance claims (comparable accuracy to full-attention transformers with lower latency) are supported by direct experimental results on standard jet classification datasets and ModelNet10, without any equations, derivations, or predictions that reduce by construction to fitted parameters or self-referential definitions. No load-bearing self-citations or uniqueness theorems are invoked to justify core choices; the contribution is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the spatial partitioning step implicitly requires choices of region boundaries or feature thresholds that are not detailed here.

pith-pipeline@v0.9.0 · 5747 in / 1105 out tokens · 51857 ms · 2026-05-21T19:34:01.071974+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. E-PCN: Jet Tagging with Explainable Particle Chebyshev Networks Using Kinematic Features

    hep-ph 2025-12 conditional novelty 5.0

    E-PCN reaches 94.67% macro-accuracy on 10-class jet tagging by weighting graphs with angular separation, transverse momentum, momentum fraction, and invariant mass, with Grad-CAM showing the first two account for 76% ...
