arxiv: 2606.17631 · v1 · pith:ZRDJIKWQnew · submitted 2026-06-16 · ✦ hep-ex

Better Queries, Cheaper Attention: Adapting Transformers for Efficient Sparse Reconstruction

Pith reviewed 2026-06-26 22:10 UTC · model grok-4.3

classification ✦ hep-ex
keywords transformers · dynamic queries · sparse attention · particle trajectory reconstruction · high energy physics · query-based decoders · local strided cross-attention · sparse reconstruction

The pith

A geometry-aware dynamic-query decoder raises charged-particle trajectory reconstruction efficiency from 94.1% to 98.1% while cutting the fake rate by more than a factor of two.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that query-based transformer decoders can scale to high-multiplicity sparse sensor data by replacing fixed input-independent queries with input-conditioned ones and replacing dense cross-attention with geometry-restricted sparse attention. Queries are seeded from selected encoder representations of measurements that act as candidate trajectory starts, so both the number and content of queries depend on the actual input. If the approach holds, it would let such models handle the data rates of future detectors without the usual prohibitive growth in compute and memory.

Core claim

The paper claims that its dynamic-query (DQ) architecture, which initializes decoder queries from encoder-level measurement representations serving as trajectory seeds, raises reconstruction efficiency from 94.1% to 98.1% and cuts the fake rate by more than a factor of two relative to a fixed-query baseline. Adding Local Strided Cross-Attention (LSCA) replaces learned mask-gated attention with a geometry-defined local support that limits attention to physically plausible query-hit pairs. In a simplified High-Luminosity LHC detector model, the resulting DQ+LSCA model reduces end-to-end inference latency by nearly 50% and peak allocated inference memory by more than a factor of 10, both measured against the same fixed-query baseline.
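The query-seeding step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `build_dynamic_queries`, the 0.5 probability threshold, and the choice to sort seeds by azimuthal angle are assumptions made for illustration, loosely following the Figure 3 and Figure 5 captions (an auxiliary first-hit classifier selects seeds, and queries are initialised from ϕ-ordered hit embeddings).

```python
import numpy as np

def build_dynamic_queries(hit_embeddings, first_hit_logits, phi, threshold=0.5):
    """Return (queries, seed_idx) for one event.

    hit_embeddings   : (n_hits, d_model) encoder outputs
    first_hit_logits : (n_hits,) scores from an auxiliary first-hit classifier
    phi              : (n_hits,) azimuthal angle of each hit
    threshold        : hypothetical probability cut, not taken from the paper
    """
    prob = 1.0 / (1.0 + np.exp(-first_hit_logits))   # sigmoid
    seed_idx = np.flatnonzero(prob > threshold)      # candidate trajectory starts
    # Order the seeds by phi so that neighbouring queries are neighbours in azimuth.
    seed_idx = seed_idx[np.argsort(phi[seed_idx], kind="stable")]
    # Query content AND query count now depend on the input event.
    return hit_embeddings[seed_idx], seed_idx

# Toy event with random stand-ins for encoder outputs and classifier scores.
rng = np.random.default_rng(0)
n_hits, d_model = 12, 4
emb = rng.normal(size=(n_hits, d_model))
logits = rng.normal(size=n_hits)
phi = rng.uniform(-np.pi, np.pi, size=n_hits)

queries, seed_idx = build_dynamic_queries(emb, logits, phi)
print(queries.shape, seed_idx)
```

In this sketch the number of queries equals the number of hits that pass the cut, so a busier event yields more queries and a quiet one fewer, in contrast to a fixed learned query set whose size is chosen in advance.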

What carries the argument

Geometry-aware dynamic-query decoder paired with Local Strided Cross-Attention (LSCA) that restricts cross-attention to geometry-defined local support regions.
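To make "geometry-defined local support" concrete, here is a rough sketch of a banded query-hit mask. The abstract does not give the exact LSCA construction, so everything specific here is an assumption: queries and hits are taken to be sorted by ϕ, query `i` is centred on hit index `round(i * n_hits / n_queries)` (the "stride"), the window has a fixed `half_width`, and it wraps around because ϕ is periodic. The function name `local_strided_support` is hypothetical.

```python
import numpy as np

def local_strided_support(n_queries, n_hits, half_width):
    """Boolean (n_queries, n_hits) mask; True where a query may attend to a hit.

    Assumes both queries and hits are phi-ordered. This is an illustrative
    banded mask, not the paper's exact definition of LSCA.
    """
    stride = n_hits / n_queries
    centres = np.round(np.arange(n_queries) * stride).astype(int) % n_hits
    hit_idx = np.arange(n_hits)
    # Circular distance in index space between every hit and every query centre.
    delta = np.abs(hit_idx[None, :] - centres[:, None])
    delta = np.minimum(delta, n_hits - delta)
    return delta <= half_width

mask = local_strided_support(n_queries=8, n_hits=64, half_width=6)
print(mask.shape, mask.sum(axis=1), mask.mean())
```

With these toy numbers each query keeps 13 of 64 hits, so about 20% of the dense query-hit pairs survive. A mask with this banded structure is what a block-sparse attention kernel could exploit; the dense boolean array is used here only for readability.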

If this is right

  • Trajectory reconstruction efficiency reaches 98.1% instead of the 94.1% fixed-query baseline.
  • The fake rate drops by more than a factor of two.
  • End-to-end inference latency falls by nearly 50% relative to the fixed-query baseline.
  • Peak allocated inference memory falls by more than a factor of 10 relative to the same baseline.
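A short worked calculation helps put the efficiency numbers in perspective. It uses only the two efficiencies quoted above; the variable names are illustrative, and the fake rate is left out because the abstract gives it only as a relative factor.

```python
eff_fixed, eff_dq = 94.1, 98.1          # reported efficiencies, in percent

gain_pp = eff_dq - eff_fixed            # absolute gain in percentage points
ineff_fixed = 100.0 - eff_fixed         # share of trajectories the baseline misses
ineff_dq = 100.0 - eff_dq               # share the dynamic-query model misses
ineff_ratio = ineff_fixed / ineff_dq    # how much the missed share shrinks

print(f"gain: {gain_pp:.1f} pp, missed: {ineff_fixed:.1f}% -> {ineff_dq:.1f}%, "
      f"ratio: {ineff_ratio:.1f}x")
```

The 4.0 percentage-point gain corresponds to the missed-trajectory share dropping from 5.9% to 1.9%, roughly a factor of three, which is a more telling way to read a change near the top of the efficiency scale.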

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same seeding-plus-local-support pattern could be tested on other sparse scientific reconstruction tasks such as neutrino event reconstruction or medical tomography.
  • Hardware implementations of the strided local support could be benchmarked against dense attention kernels to quantify additional speedups.
  • Varying the stride and support radius in LSCA on multiple detector layouts would show how tightly the gains depend on accurate geometric ordering.

Load-bearing premise

The geometry-defined local support in LSCA captures every physically relevant query-hit interaction without omissions, and results on the simplified detector model generalize to full-scale real detector conditions.
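One way to probe this premise is to measure how many true query-hit pairs fall inside the support as the window grows. The sketch below is a hypothetical diagnostic, not something reported in the paper: the function name `support_coverage`, the circular index window, and the toy truth table are all assumptions for illustration.

```python
import numpy as np

def support_coverage(true_pairs, centres, n_hits, half_width):
    """Fraction of true (query, hit) pairs that lie inside a circular index window.

    true_pairs : (n_pairs, 2) integer array of (query index, hit index)
    centres    : (n_queries,) hit index on which each query's window is centred
    """
    q, h = true_pairs[:, 0], true_pairs[:, 1]
    delta = np.abs(h - centres[q])
    delta = np.minimum(delta, n_hits - delta)   # wrap around in phi
    return float(np.mean(delta <= half_width))

# Toy truth: 4 queries over 40 phi-ordered hits, centres every 10 hits.
n_hits = 40
centres = np.array([0, 10, 20, 30])
true_pairs = np.array([
    [0, 0], [0, 2], [0, 38],    # query 0, including one hit across the wrap
    [1, 9], [1, 12], [1, 17],   # query 1, one hit 7 indices from its centre
    [2, 20], [2, 21],
    [3, 29], [3, 33],
])

coverage = {w: support_coverage(true_pairs, centres, n_hits, w) for w in (2, 4, 8)}
print(coverage)
```

On this toy input the coverage is 0.8, 0.9, and 1.0 for half-widths of 2, 4, and 8. Any pair outside the window can never be assigned, so coverage below 1.0 puts a hard ceiling on achievable efficiency, which is why a sweep of this kind on the real detector geometry would be informative.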

What would settle it

A run of the DQ+LSCA model on full-scale HL-LHC simulation data that shows reconstruction efficiency below 98% or memory reduction below a factor of five would falsify the performance claims.

Figures

Figures reproduced from arXiv: 2606.17631 by Gabriel Facini, Max Hart, Philippa Duckett, Samuel Van Stroud, Tim Scanlon.

Figure 1: Event-display illustration of charged-particle tracking in two detector projections. The reconstruction task is to partition sparse detector measurements, or hits, into sets corresponding to individual particle trajectories. This sparse, irregular, and high-multiplicity structure contrasts with the dense lattice structure of images and motivates the query-based set-prediction formulation studied in this wo…

Figure 2: Comparison of locality in ϕ and η for charged-particle hits. Here z denotes the beam axis, r the radial coordinate, and ϕ the azimuthal angle; L1–L3 denote successive detector layers, B the solenoidal magnetic-field direction, and v the longitudinal production vertex. Panel (a) illustrates that charged-particle trajectories remain locally coherent in azimuth, while the polar angle inferred in the (r, z) pr…

Figure 3: Overview of the dynamic-query tracking architecture. Raw detector-hit coordinates are encoded by a Transformer encoder, and an auxiliary first-hit classifier identifies candidate track seeds. The selected first-hit embeddings initialise an event-dependent set of decoder object queries aligned with physical track candidates. A Transformer decoder refines these queries through query self-attention and query–…

Figure 4.

Figure 5: Visualisation of predicted query–hit assignment masks for (a) fixed learned and (b) dynamic query initialisation. Fixed learned queries produce diffuse masks with no clear geometric alignment to the ϕ-ordered hits. Dynamic queries initialised from ϕ-ordered hit embeddings produce a pronounced banded structure. The blue shaded band illustrates the locality region used by LSCA.

Figure 6: Track reconstruction efficiency for the Pix1.0 configuration as a function of transverse momentum and pseudorapidity. Both double-majority and perfect-match efficiencies are shown for the fixed-query baseline, the DQ+MA model, and the DQ+LSCA decoder. In the Pix1.0 configuration, the DQ+MA model improves the combined filter-plus-tracking double-majority efficiency to over 98%, a gain of 4.0 percentage poin…

Figure 7: Tracking efficiency for the high-occupancy pixel-only configuration, Pix0.6. Solid curves show double-majority efficiency, while dotted curves show perfect-match efficiency, for the DQ+MA and DQ+LSCA decoder variants.

Figure 8: Tracking-only per-event inference latency across pixel-only detector configurations of increasing complexity, shown as a function of retained hit multiplicity and dynamic query count. Colours distinguish the decoder variants, Masked Attention (DQ+MA) and Local Strided Cross-Attention (DQ+LSCA). Markers show mean latency values within bins of the corresponding event-level quantity, with error bars indicatin…

Figure 9: Allocated inference GPU memory across pixel-only detector configurations of increasing complexity, shown as a function of retained hit multiplicity and dynamic query count. Colours distinguish the decoder variants, Masked Attention (DQ+MA) and Local Strided Cross-Attention (DQ+LSCA). Circular markers denote allocated memory measured for individual events, which reflects event-level scaling. The unshaded re…
Original abstract

Query-based transformer decoders are effective for object reconstruction from sparse scientific sensor measurements, but their scalability to high-multiplicity data is limited by fixed, input-independent query sets and costly decoder cross-attention. We introduce a geometry-aware dynamic-query decoder that couples input-conditioned query construction with structured sparse cross-attention. Decoder queries are initialised from selected encoder-level measurement representations that serve as candidate trajectory seeds, making both query content and query multiplicity input-dependent. Local Strided Cross-Attention (LSCA) exploits the induced geometric ordering by replacing learned mask-gated cross-attention with a geometry-defined local support that restricts attention to physically plausible query-hit interactions and exposes sparsity for block-sparse execution. We study this architecture for charged-particle trajectory reconstruction in a simplified High-Luminosity Large Hadron Collider detector, where thousands of trajectories must be reconstructed from tens of thousands of sparse measurements. In the nominal configuration, the dynamic-query (DQ) architecture increases trajectory reconstruction efficiency from 94.1% to 98.1% and reduces the fake rate by more than a factor of two relative to the fixed-query baseline. The DQ+LSCA model reduces end-to-end inference latency by nearly 50% and peak allocated inference memory by more than a factor of 10 relative to the fixed-query baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a geometry-aware dynamic-query (DQ) decoder coupled with Local Strided Cross-Attention (LSCA) for transformer-based charged-particle trajectory reconstruction from sparse measurements. On a simplified HL-LHC detector model, the DQ architecture raises reconstruction efficiency from 94.1% to 98.1% and cuts the fake rate by more than a factor of two versus a fixed-query baseline; the DQ+LSCA model reduces end-to-end inference latency by ~50% and peak allocated inference memory by >10× relative to the same baseline.

Significance. If the empirical gains hold under the stated assumptions, the work demonstrates a practical route to scaling query-based transformers to the high-multiplicity regime required by HL-LHC tracking, by making both query count and attention support input-dependent and geometrically structured. The combination of efficiency and resource reductions is directly relevant to real-time and offline reconstruction pipelines.

major comments (2)
  1. [Abstract] Abstract (nominal configuration paragraph): The headline efficiency (94.1% → 98.1%) and fake-rate claims rest on the premise that the strided local support exactly matches all physically allowed trajectory-hit interactions. The manuscript provides no ablation varying stride or support radius to confirm that enlarging the window leaves efficiency unchanged; without this diagnostic the central performance attribution to LSCA remains unverified.
  2. [Abstract] Abstract and experimental setup: All quantitative results are obtained exclusively on a simplified toy geometry. Because the central claims concern practical utility for HL-LHC reconstruction, the absence of any transfer test to a full-scale, misaligned, or material-inclusive detector model leaves open whether the reported gains survive the transition to realistic conditions.
minor comments (2)
  1. The description of the fixed-query baseline should explicitly state query multiplicity, initialization, and training protocol so that the 94.1% reference point can be reproduced.
  2. Figure captions and text should clarify whether the reported latency and memory figures include the full encoder-decoder pipeline or only the decoder stage.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive comments. We respond point-by-point to the major comments below.

point-by-point responses
  1. Referee: [Abstract] Abstract (nominal configuration paragraph): The headline efficiency (94.1% → 98.1%) and fake-rate claims rest on the premise that the strided local support exactly matches all physically allowed trajectory-hit interactions. The manuscript provides no ablation varying stride or support radius to confirm that enlarging the window leaves efficiency unchanged; without this diagnostic the central performance attribution to LSCA remains unverified.

    Authors: We agree that an explicit ablation would strengthen the attribution. The LSCA stride and support radius are derived from the maximum hit displacements permitted by the toy detector geometry and the track curvature model; the 98.1% efficiency indicates the support is sufficient. In the revised manuscript we will add a methods paragraph detailing this geometric derivation together with a sensitivity table showing that moderate enlargements of the window leave efficiency and fake rate unchanged. revision: yes

  2. Referee: [Abstract] Abstract and experimental setup: All quantitative results are obtained exclusively on a simplified toy geometry. Because the central claims concern practical utility for HL-LHC reconstruction, the absence of any transfer test to a full-scale, misaligned, or material-inclusive detector model leaves open whether the reported gains survive the transition to realistic conditions.

    Authors: The manuscript explicitly positions the study as a controlled demonstration on a simplified HL-LHC toy model to isolate the effects of dynamic queries and structured attention. We acknowledge that transfer to full-scale, misaligned, and material-inclusive simulations is required to assess real-world utility; such experiments lie beyond the present scope. revision: no

standing simulated objections not resolved
  • Validation on full-scale, misaligned, or material-inclusive detector models

Circularity Check

0 steps flagged

No circularity: empirical gains measured on held-out data

full rationale

The paper proposes a dynamic-query decoder with LSCA and reports direct empirical improvements (efficiency 94.1%→98.1%, latency reduction ~50%) on held-out simulation data relative to a fixed-query baseline. No equations, parameter fits, or self-citations are used to derive the claimed metrics; the results are obtained by running the architectures on independent test samples. The derivation chain is therefore self-contained against external benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so a complete audit is impossible. The central claim rests on the validity of a simplified HL-LHC detector simulation, standard definitions of track efficiency and fake rate, and the assumption that geometry provides a reliable local support mask. No explicit free parameters or invented physical entities are stated.

axioms (1)
  • [standard math] Standard transformer encoder-decoder attention mechanisms form a valid baseline for sparse reconstruction
    The paper compares against a fixed-query baseline that uses conventional cross-attention.

pith-pipeline@v0.9.1-grok · 5768 in / 1491 out tokens · 42251 ms · 2026-06-26T22:10:50.038307+00:00 · methodology

