
arxiv: 2605.21789 · v1 · pith:552SAL2Cnew · submitted 2026-05-20 · ✦ hep-ex · cs.AI

Patch Hierarchical Attention Transformer for Efficient Particle Jet Tagging

Pith reviewed 2026-05-22 07:31 UTC · model grok-4.3

classification ✦ hep-ex cs.AI
keywords jet tagging · transformer · hierarchical attention · efficient inference · LHC triggers · particle identification · top tagging · quark-gluon discrimination

The pith

PHAT-JeT combines geometric message passing with hierarchical patch attention to reach state-of-the-art jet tagging accuracy and background rejection under tight resource limits on four benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the need for accurate real-time identification of particle jets in LHC detectors, where trigger systems must decide which events to store under strict latency constraints. Full transformer models deliver high classification performance but incur quadratic computational costs that exceed available budgets. The proposed architecture splits jet constituents into small patches for exact local attention, communicates global context via lightweight patch tokens, and adds a physics-inspired geometric message-passing step to encode local detector geometry. This combination is shown to outperform other efficient models in accuracy while staying within resource bounds across the hls4ml, JetClass, Top Tagging, and Quark-Gluon benchmarks.

Core claim

The central claim is that a hierarchical patch-based attention scheme with lightweight patch-token communication, paired with a geometric message-passing module that encodes local detector-plane structure, preserves enough global context to achieve state-of-the-art accuracy and background rejection among all resource-constrained jet tagging models on the hls4ml, JetClass, Top Tagging, and Quark-Gluon benchmarks.

What carries the argument

The hierarchical patch-based attention scheme that computes exact attention within small particle groups while preserving global context through lightweight patch-token communication, together with the physics-inspired geometric message-passing module.
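
The two-level scheme described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the helper names, the mean pooling used to form a patch token, the absence of learned projections, and the additive broadcast of the mixed token back to its particles are all assumptions made for the example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Exact scaled dot-product attention over small lists of vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)])
    return out

def patch_attention(particles, patch_size):
    """Two-level attention: within patches, then between patch tokens."""
    d = len(particles[0])
    patches = [particles[i:i + patch_size]
               for i in range(0, len(particles), patch_size)]
    # Level 1: exact self-attention restricted to each patch.
    local = [attention(p, p, p) for p in patches]
    # Patch token: mean of the attended particles (assumed pooling choice).
    tokens = [[sum(v[j] for v in p) / len(p) for j in range(d)] for p in local]
    # Level 2: attention among patch tokens carries context across patches.
    mixed = attention(tokens, tokens, tokens)
    # Add each mixed token back onto the particles of its patch (assumed).
    return [[x + t for x, t in zip(v, tok)]
            for p, tok in zip(local, mixed) for v in p]
```

Calling `patch_attention(features, patch_size=8)` on a list of per-particle feature vectors returns one updated vector per particle, in the input order.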

If this is right

  • High-accuracy jet tagging becomes feasible inside real-time LHC trigger systems without exceeding latency budgets.
  • Improved background rejection rates allow more precise online selection of rare decay events.
  • The method scales to variable numbers of jet constituents without quadratic cost growth.
  • Physics-informed local encoding can offset reductions in global attention scope while maintaining performance.
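
The scaling bullet can be made concrete by counting attention score computations. The count below is a rough sketch under a stated assumption: one score per query-key pair inside each patch, plus one per pair of patch tokens. It ignores projections, the message-passing step, and any other layers, so it is not the paper's FLOP accounting.

```python
def full_attention_pairs(n_particles):
    """Query-key pairs scored by full self-attention."""
    return n_particles * n_particles

def patch_attention_pairs(n_particles, patch_size):
    """Pairs scored by intra-patch attention plus patch-token attention."""
    full_patches, remainder = divmod(n_particles, patch_size)
    n_patches = full_patches + (1 if remainder else 0)
    # Each patch attends only to itself; the last patch may be smaller.
    intra = full_patches * patch_size ** 2 + remainder ** 2
    # One token per patch, with all tokens attending to each other.
    inter = n_patches ** 2
    return intra + inter

print(full_attention_pairs(128))      # 16384
print(patch_attention_pairs(128, 8))  # 1280
```

Under this assumed count the token term still grows as (N/P)^2, so keeping the total sub-quadratic as the number of constituents N grows would require the patch size P to grow with N as well.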

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same patch hierarchy could be adapted to other sparse detector data tasks such as track reconstruction or calorimeter clustering.
  • Varying patch sizes systematically might reveal optimal trade-offs for different collision energies or detector upgrades.
  • Hardware-specific optimizations of the lightweight token communication step could yield further latency reductions on FPGAs or ASICs.

Load-bearing premise

The chosen patch sizes and lightweight inter-patch communication mechanisms retain enough global context to avoid degrading classification accuracy on the tested jet datasets and generalize beyond them.

What would settle it

An experiment showing that accuracy or background rejection falls below competing efficient models when the same architecture is applied to a new dataset with substantially different particle multiplicity or detector geometry.

Figures

Figures reproduced from arXiv: 2605.21789 by Aaron Wang, Abhijith Gandrakota, Alan Xia, Chang Sun, Javier Duarte, Jennifer Ngadiuba, Richard Cavanaugh, Zihan Zhao.

Figure 1. Left: Simplified schematic of a proton-proton collision viewed transverse to the beam axis, illustrating multiple jets (colored clusters) emanating from the interaction point. Right: Event display at the LHC, showing reconstructed particle tracks and calorimeter energy deposits from a real collision event [5].
Figure 2. Architecture of PHAT-JeT. (Left) The PHAT block partitions input particles into patches …
Figure 3. Particle Hierarchical Attention Transformer (PHAT-JeT) schematic. (a) GMP discretizes …
Figure 4. Accuracy and FLOPs as patch size increases for PHAT-JeT. JEDI-Linear is included as a …
Figure 5. Per-head attention on a representative top jet under …
Figure 6. Per-head attention on a representative top jet under fixed random sorting. The qualitative …
Figure 7. Attention projected onto the (η, ϕ) plane for a representative top jet under kT sorting. Particles are colored by subjet assignment. Solid colored edges denote intra-subjet attention, while dashed gray edges denote cross-subjet attention.
Figure 8. Attention projected onto the (η, ϕ) plane for the same type of jet under fixed random sorting. The qualitative structure remains similar to the kT-sorted case, consistent with the comparable performance of matched random and matched kT orderings.
Figure 9. Head-averaged attention for representative jets from all five HLS4ML classes. Rows …
Original abstract

Real-time jet tagging is critical for identifying short-lived particle decays in the high-throughput detectors of the Large Hadron Collider, where real-time trigger systems responsible for deciding which collision events to store impose strict latency and accuracy constraints. While transformer architectures achieve the highest jet tagging accuracy when compute is unconstrained, their quadratic self-attention cost makes inference restrictive on trigger budget. Existing efficient variants reduce the computational cost, but hinder the classification performance. To address this limitation, we introduce the Patch Hierarchical Attention Transformer (PHAT-JeT), which combines two mechanisms: a physics-inspired geometric message-passing module that encodes local detector-plane structure, and a hierarchical patch-based attention scheme that computes exact attention within small particle groups while preserving global context through lightweight patch-token communication. Within a restricted budget, PHAT-JeT achieves state-of-the-art accuracy and background rejection among all resource-constrained jet tagging models on four benchmarks (hls4ml, JetClass, Top Tagging, and Quark-Gluon). Our code is available at https://github.com/aaronw5/PHAT-JeT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Patch Hierarchical Attention Transformer (PHAT-JeT) for real-time jet tagging at the LHC. It combines a physics-inspired geometric message-passing module with a hierarchical patch-based attention scheme that performs exact intra-patch attention and uses lightweight patch-token communication to retain global context. The central claim is that, within a restricted computational budget, PHAT-JeT achieves state-of-the-art accuracy and background rejection among resource-constrained models on the hls4ml, JetClass, Top Tagging, and Quark-Gluon benchmarks. The code is released publicly.

Significance. If the empirical results are robust, the work would be significant for LHC trigger systems by improving jet identification efficiency under strict latency constraints. The physics-motivated design and open code release are strengths that support reproducibility and potential adoption in high-energy physics experiments.

major comments (2)
  1. [§4] §4 (Experimental Results): The SOTA claims on all four benchmarks are presented without error bars, statistical significance tests, or ablation studies on patch size, number of patches, or inter-patch communication weights. This is load-bearing for the central claim, as it leaves open whether the reported gains over prior resource-constrained models are statistically meaningful or sensitive to hyper-parameter choices.
  2. [§3.2] §3.2 (Hierarchical Patch Attention): The assertion that lightweight patch-token communication plus the geometric message-passing module preserves sufficient global context for multi-prong discrimination (e.g., top-quark jets) is stated qualitatively. No quantitative analysis, information-flow study, or comparison to full self-attention is given to demonstrate that distant but physically correlated constituents retain mutual information, which directly underpins the performance claims on the Top Tagging and JetClass benchmarks.
minor comments (2)
  1. [Abstract] The abstract and §4 should explicitly state the evaluation metrics (e.g., AUC, background rejection at fixed signal efficiency) and the precise resource budgets (FLOPs or latency) used for each baseline comparison.
  2. [§4] Figure captions in §4 would benefit from clearer indication of which models operate under the same restricted budget as PHAT-JeT.
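
For reference, the metric named in the first minor comment, background rejection at fixed signal efficiency, is conventionally the inverse of the background efficiency at the score threshold that keeps the target fraction of signal jets. The sketch below assumes binary labels (1 for signal, 0 for background) and higher scores meaning more signal-like; the function name and the handling of ties are illustrative choices, not taken from the paper.

```python
import math

def background_rejection(scores, labels, signal_efficiency):
    """Return 1 / (background efficiency) at the requested signal efficiency."""
    signal = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    background = [s for s, y in zip(scores, labels) if y == 0]
    # Threshold = score of the last signal jet needed to reach the target.
    n_keep = max(1, math.ceil(signal_efficiency * len(signal)))
    threshold = signal[n_keep - 1]
    # Background jets passing the same threshold are the false positives.
    n_false = sum(1 for s in background if s >= threshold)
    if n_false == 0:
        return math.inf
    return len(background) / n_false
```

With tied scores the achieved signal efficiency can exceed the requested value, so reported working points should state the efficiency actually reached.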

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the changes we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental Results): The SOTA claims on all four benchmarks are presented without error bars, statistical significance tests, or ablation studies on patch size, number of patches, or inter-patch communication weights. This is load-bearing for the central claim, as it leaves open whether the reported gains over prior resource-constrained models are statistically meaningful or sensitive to hyper-parameter choices.

    Authors: We agree that the lack of error bars, statistical tests, and ablations limits the strength of the SOTA claims. In the revised manuscript we will report mean performance and standard deviations from five independent training runs with different random seeds for all benchmarks. We have additionally performed ablation studies on patch size (varying from 4 to 16) and number of patches, which confirm that the selected configuration yields near-optimal accuracy within the latency budget; these results will be added to Section 4. For inter-patch communication weights, which are fixed by geometric priors, we will include a short sensitivity study in the supplement showing that moderate variations produce only sub-percent changes in background rejection. These additions directly address the concern about statistical meaningfulness and hyper-parameter sensitivity.
    revision: yes

  2. Referee: [§3.2] §3.2 (Hierarchical Patch Attention): The assertion that lightweight patch-token communication plus the geometric message-passing module preserves sufficient global context for multi-prong discrimination (e.g., top-quark jets) is stated qualitatively. No quantitative analysis, information-flow study, or comparison to full self-attention is given to demonstrate that distant but physically correlated constituents retain mutual information, which directly underpins the performance claims on the Top Tagging and JetClass benchmarks.

    Authors: The design rests on the physical observation that jet constituents are locally clustered on the detector plane, so exact intra-patch attention captures local structure while patch tokens and the geometric message-passing module (encoding distances and angles) propagate global information. The state-of-the-art results on the multi-prong Top Tagging and JetClass benchmarks provide indirect quantitative support that sufficient context is retained. We acknowledge that an explicit mutual-information or full self-attention comparison would offer stronger direct evidence. In the revision we will add attention-weight visualizations across patches for representative top jets to illustrate long-range information flow. A comprehensive information-flow study or direct comparison to unconstrained self-attention lies outside the resource-constrained scope of the present work and would be better suited to a follow-up study.
    revision: partial
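
The multi-seed reporting promised in the first response could look like the sketch below: a mean and sample standard deviation per model, plus a crude check of whether a gain exceeds the combined standard error. The function names and the two-sigma criterion are assumptions for illustration; no accuracies from the paper are reproduced here.

```python
import math
import statistics

def summarize_runs(values):
    """Mean and sample standard deviation over independent training runs."""
    return statistics.mean(values), statistics.stdev(values)

def gain_exceeds_spread(model_runs, baseline_runs, n_sigma=2.0):
    """True if the mean gain is larger than n_sigma combined standard errors."""
    mean_m, std_m = summarize_runs(model_runs)
    mean_b, std_b = summarize_runs(baseline_runs)
    # Standard error of the difference of two independent means.
    combined = math.sqrt(std_m ** 2 / len(model_runs)
                         + std_b ** 2 / len(baseline_runs))
    return (mean_m - mean_b) > n_sigma * combined
```

Each argument is a list with one metric value per seed, for example five accuracies from five seeds. With so few runs the check is only indicative, and a formal significance test would be the more careful choice.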

Circularity Check

0 steps flagged

No significant circularity; architecture proposal rests on empirical benchmarks

full rationale

The paper proposes PHAT-JeT as a new architecture combining a geometric message-passing module with hierarchical patch-based attention for efficient jet tagging. Central claims of state-of-the-art accuracy under resource constraints are supported by direct empirical comparisons on four external benchmarks (hls4ml, JetClass, Top Tagging, Quark-Gluon). No equations, fitted parameters, or self-citations are shown that reduce performance claims to definitions or tautologies by construction. The hierarchical scheme is presented as an original design choice evaluated on held-out data rather than derived from prior self-referential results.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central performance claim rests on several design choices whose values are not derived from first principles but selected to achieve the reported results on the given benchmarks.

free parameters (2)
  • patch size and number of patches
    Chosen to balance local exact attention with global context; specific values affect both speed and accuracy on the benchmarks.
  • inter-patch communication weights
    Lightweight parameters controlling information flow between patches; tuned for the reported performance.
axioms (1)
  • domain assumption Local detector-plane geometry carries useful information for jet classification that can be captured by message passing.
    Invoked to justify the geometric message-passing module; if false the added module would not improve tagging.
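
To illustrate what the axiom means by capturing local detector-plane geometry with message passing, the sketch below lets each particle aggregate features from neighbors within a fixed radius in the (η, ϕ) plane. This is a generic radius-based scheme chosen for clarity, not the paper's geometric message-passing module, whose details are not given on this page. The radius, the mean aggregation, and the residual add are all assumptions.

```python
import math

def delta_r(a, b):
    """Angular distance between two (eta, phi) points, with phi wrapped."""
    d_eta = a[0] - b[0]
    d_phi = (a[1] - b[1] + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(d_eta, d_phi)

def message_pass(coords, features, radius):
    """Add to each particle the mean feature vector of its nearby neighbors."""
    out = []
    for i, ci in enumerate(coords):
        neighbors = [features[j] for j, cj in enumerate(coords)
                     if j != i and delta_r(ci, cj) < radius]
        if not neighbors:
            # Isolated particle: features pass through unchanged.
            out.append(list(features[i]))
            continue
        d = len(features[i])
        message = [sum(f[k] for f in neighbors) / len(neighbors) for k in range(d)]
        out.append([x + m for x, m in zip(features[i], message)])
    return out
```

The nested loop is quadratic in the number of particles, which is fine for a demonstration; a resource-constrained model would need a cheaper neighbor lookup.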



Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 12 internal anchors

  1. [1]

    LHC Machine

    L. Evans and P. Bryant, “LHC Machine”, JINST 3 (2008) S08001, doi:10.1088/1748-0221/3/08/S08001

  2. [2]

    The Phase-2 upgrade of the CMS Level-1 trigger

    CMS Collaboration, “The Phase-2 upgrade of the CMS Level-1 trigger”, CMS Technical Design Report CERN-LHCC-2020-004. CMS-TDR-021, 2020

  3. [3]

    Applications and Techniques for Fast Machine Learning in Science

    A. M. Deiana et al., “Applications and Techniques for Fast Machine Learning in Science”, Front. Big Data 5 (2022) 787421, doi:10.3389/fdata.2022.787421, arXiv:2110.13041

  4. [4]

    Physics Community Needs, Tools, and Resources for Machine Learning

    P. Harris et al., “Physics Community Needs, Tools, and Resources for Machine Learning”, in Snowmass 2021. 2022. arXiv:2203.16255

  5. [5]

    A four-top event candidate (event display)

    CMS Collaboration, “A four-top event candidate (event display)”. Image (Figure 1) in: CMS observes four-top quark production, 2023. Accessed: 2026-01-29. https://cms.cern/sites/default/files/field/image/TOP-22-013_5.png

  6. [6]

    Performance of the CMS Level-1 trigger in proton-proton collisions at √s = 13 TeV

    CMS Collaboration, “Performance of the CMS Level-1 trigger in proton-proton collisions at √s = 13 TeV”, JINST 15 (2020) P10017, arXiv:2006.10165

  7. [7]

    Operation of the ATLAS trigger system in Run 2

    ATLAS Collaboration, “Operation of the ATLAS trigger system in Run 2”, JINST 15 (2020) P10004, arXiv:2007.12539

  8. [8]

    The anti-k_t jet clustering algorithm

    M. Cacciari, G. P. Salam, and G. Soyez, “The anti-k_t jet clustering algorithm”, JHEP 04 (2008) 063, doi:10.1088/1126-6708/2008/04/063, arXiv:0802.1189

  9. [9]

    FastJet user manual

    M. Cacciari, G. P. Salam, and G. Soyez, “FastJet user manual”, Eur. Phys. J. C 72 (2012) 1896, doi:10.1140/epjc/s10052-012-1896-2, arXiv:1111.6097

  10. [10]

    Jet Substructure at the Large Hadron Collider: A Review of Recent Advances in Theory and Machine Learning

    A. J. Larkoski, I. Moult, and B. Nachman, “Jet Substructure at the Large Hadron Collider: A Review of Recent Advances in Theory and Machine Learning”, Phys. Rept. 841 (2020) 1, doi:10.1016/j.physrep.2019.11.001, arXiv:1709.04464

  11. [11]

    Particle Transformer for Jet Tagging

    H. Qu, C. Li, and S. Qian, “Particle Transformer for Jet Tagging”, in Proceedings of the 39th International Conference on Machine Learning, K. Chaudhuri et al., eds., volume 162, p. 18281. 2022. arXiv:2202.03772

  12. [12]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”, 2019. arXiv:1910.01108

  13. [13]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications”, 2017. arXiv:1704.04861

  14. [14]

    EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

    M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks”, in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, eds., volume 97 of Proceedings of Machine Learning Research, p. 6105. 2019. arXiv:1905.11946

  15. [15]

    PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

    C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation”, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p. 77. 2017. arXiv:1612.00593, doi:10.1109/CVPR.2017.16

  16. [16]

    Deep Sets

    M. Zaheer et al., “Deep sets”, in Advances in Neural Information Processing Systems, I. Guyon et al., eds., volume 30. Curran Associates, Inc., 2017. arXiv:1703.06114

  17. [17]

    Energy Flow Networks: Deep Sets for Particle Jets

    P. T. Komiske, E. M. Metodiev, and J. Thaler, “Energy Flow Networks: Deep Sets for Particle Jets”, JHEP 01 (2019) 121, doi:10.1007/JHEP01(2019)121, arXiv:1810.05165

  18. [18]

    JEDI-net: a jet identification algorithm based on interaction networks

    E. A. Moreno et al., “JEDI-net: a jet identification algorithm based on interaction networks”, Eur. Phys. J. C 80 (2020) 58, doi:10.1140/epjc/s10052-020-7608-4, arXiv:1908.05318

  19. [19]

    ParticleNet: Jet tagging via particle clouds

    H. Qu and L. Gouskos, “ParticleNet: Jet tagging via particle clouds”, Phys. Rev. D 101 (2020) 056019, doi:10.1103/PhysRevD.101.056019, arXiv:1902.08570

  20. [20]

    Point Transformer V3: Simpler, Faster, Stronger

    X. Wu et al., “Point Transformer V3: Simpler, Faster, Stronger”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 4840. 2024. arXiv:2312.10035, doi:10.1109/CVPR52733.2024.00463

  21. [21]

    An efficient lorentz equivariant graph neural network for jet tagging

    S. Gong et al., “An efficient lorentz equivariant graph neural network for jet tagging”, Journal of High Energy Physics 2022 (2022), no. 7, 030

  22. [22]

    Lorentz-equivariant geometric algebra transformers for high-energy physics

    J. Spinner et al., “Lorentz-equivariant geometric algebra transformers for high-energy physics”, 2024. https://arxiv.org/abs/2405.14806

  23. [23]

    Linformer: Self-Attention with Linear Complexity

    S. Wang et al., “Linformer: Self-attention with linear complexity”, 2020. https://arxiv.org/abs/2006.04768

  24. [24]

    JEDI-linear: Fast and Efficient Graph Neural Networks for Jet Tagging on FPGAs

    Z. Que et al., “JEDI-linear: Fast and Efficient Graph Neural Networks for Jet Tagging on FPGAs”, in International Conference on Field Programmable Technology. 2025. arXiv:2508.15468

  25. [25]

    Spatially aware linear transformer (sal-t) for particle jet tagging

    Anonymous, “Spatially aware linear transformer (sal-t) for particle jet tagging”. Under review

  26. [26]

    Locality-Sensitive Hashing-Based Efficient Point Transformer with Applications in High-Energy Physics

    S. Miao et al., “Locality-Sensitive Hashing-Based Efficient Point Transformer with Applications in High-Energy Physics”, in Proceedings of the 41st International Conference on Machine Learning, R. Salakhutdinov et al., eds., volume 235, p. 35546. 2024. arXiv:2402.12535

  27. [27]

    Locality-Sensitive Hashing-Based Efficient Point Transformer for Charged Particle Reconstruction

    S. Govil et al., “Locality-Sensitive Hashing-Based Efficient Point Transformer for Charged Particle Reconstruction”, in 39th Annual Conference on Neural Information Processing Systems: Machine Learning and the Physical Sciences (ML4PS) Workshop. 2025. arXiv:2510.07594

  28. [28]

    Set transformer: A framework for attention-based permutation-invariant neural networks

    J. Lee et al., “Set transformer: A framework for attention-based permutation-invariant neural networks”, in International Conference on Machine Learning, pp. 3744–3753, PMLR. 2019

  29. [29]

    Paca-vit: Learning patch-to-cluster attention in vision transformers

    R. Grainger et al., “Paca-vit: Learning patch-to-cluster attention in vision transformers”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18568–18578. 2023

  30. [30]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows”, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10012–10022. 2021

  31. [31]

    Longformer: The Long-Document Transformer

    I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer”, arXiv preprint arXiv:2004.05150 (2020)

  32. [32]

    Hep-jepa: A foundation model for collider physics

    J. Bardhan et al., “Hep-jepa: A foundation model for collider physics”, in ICLR Workshop on World Models: Understanding, Modelling and Scaling. 2025

  33. [33]

    An image is worth 16x16 words: Transformers for image recognition at scale

    A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale”, in International Conference on Learning Representations (ICLR). 2021

  34. [34]

    Conditional positional encodings for vision transformers

    X. Chu et al., “Conditional positional encodings for vision transformers”, in International Conference on Learning Representations (ICLR). 2023

  35. [35]

    HLS4ML LHC jet dataset (150 particles)

    M. Pierini, J. M. Duarte, N. Tran, and M. Freytsis, “HLS4ML LHC jet dataset (150 particles)”, 2020. doi:10.5281/zenodo.3602260, https://doi.org/10.5281/zenodo.3602260

  36. [36]

    Attention Is All You Need

    A. Vaswani et al., “Attention is all you need”, in Advances in Neural Information Processing Systems, I. Guyon et al., eds., volume 30. Curran Associates, Inc., 2017. arXiv:1706.03762

  37. [37]

    Successive Combination Jet Algorithm For Hadron Collisions

    S. D. Ellis and D. E. Soper, “Successive combination jet algorithm for hadron collisions”, Phys. Rev. D 48 (1993) 3160, doi:10.1103/PhysRevD.48.3160, arXiv:hep-ph/9305266

  38. [38]

    Top quark tagging reference dataset (v0 (2018_03_27))

    G. Kasieczka, T. Plehn, J. Thompson, and M. Russel, “Top quark tagging reference dataset (v0 (2018_03_27))”, 2019. doi:10.5281/zenodo.2603256, https://doi.org/10.5281/zenodo.2603256

  39. [39]

    Pythia8 quark and gluon jets for energy flow

    P. Komiske, E. Metodiev, and J. Thaler, “Pythia8 quark and gluon jets for energy flow”, 2019. doi:10.5281/zenodo.3164691, https://doi.org/10.5281/zenodo.3164691

  40. [40]

    Fast inference of deep neural networks in FPGAs for particle physics

    J. Duarte et al., “Fast inference of deep neural networks in FPGAs for particle physics”, JINST 13 (2018) P07027, doi:10.1088/1748-0221/13/07/P07027, arXiv:1804.06913
