
arXiv: 2606.21790 · v1 · pith:373L6GQVnew · submitted 2026-06-19 · 💻 cs.LG · hep-ex · hep-ph · physics.data-an

What Do Lorentz-Equivariant Jet Taggers Learn?

Pith reviewed 2026-06-26 14:12 UTC · model grok-4.3

classification: 💻 cs.LG · hep-ex · hep-ph · physics.data-an
keywords: Lorentz equivariance · jet tagging · linear probes · grade ablations · top quark tagging · N-subjettiness · particle physics · machine learning

The pith

Lorentz-equivariant jet taggers suppress frame-dependent pseudorapidity while strongly encoding jet mass and N-subjettiness; in L-GATr, vector-like channels carry most of the discriminative power.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the internal representations learned by Lorentz-equivariant neural networks for jet tagging in particle physics. Using linear probes across five models, it shows that the equivariant networks drive pseudorapidity signals to near zero while retaining strong information about jet mass and N-subjettiness. Grade ablations on L-GATr further indicate that bivector channels contribute negligibly to top-quark discrimination, whereas vector-like channels dominate, although their measured importance varies substantially across training seeds. These findings characterize the physical observables and algebraic structures that carry discriminative power under equivariance constraints.

Core claim

Linear probes demonstrate that equivariant models suppress frame-dependent pseudorapidity to zero while encoding jet mass and N-subjettiness strongly; grade ablations on L-GATr show that bivector channels are negligible for top-quark tagging while vector-like channels are dominant but seed-variable, which is consistent with the network exploiting multiple representational pathways.

What carries the argument

Linear probes and grade ablations applied to Lorentz-equivariant models such as L-GATr, L-GATr-slim and LLoCa-T.
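
As a minimal sketch of what a linear probe does (not the authors' implementation), the following fits a ridge-regularised linear map from frozen activations to a target observable and reports held-out R². The activations are synthetic stand-ins; the function name `fit_linear_probe`, the ridge strength and the use of R² as the probe score are illustrative assumptions, since this summary does not state the paper's probe metric.

```python
import numpy as np

def fit_linear_probe(h_train, y_train, h_test, y_test, ridge=1e-3):
    """Fit a ridge-regularised linear probe and return its held-out R^2."""
    # Standardise features using training statistics only.
    mu = h_train.mean(axis=0)
    sigma = h_train.std(axis=0) + 1e-8
    x_train = (h_train - mu) / sigma
    x_test = (h_test - mu) / sigma
    y_mean = y_train.mean()
    # Closed-form ridge solution: w = (X^T X + lambda * n * I)^-1 X^T (y - mean).
    d = x_train.shape[1]
    gram = x_train.T @ x_train + ridge * len(x_train) * np.eye(d)
    w = np.linalg.solve(gram, x_train.T @ (y_train - y_mean))
    pred = x_test @ w + y_mean
    ss_res = np.sum((y_test - pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for frozen activations (assumption: 32 features per example).
rng = np.random.default_rng(0)
n, d = 4000, 32
h = rng.normal(size=(n, d))
encoded = h @ rng.normal(size=d) + 0.1 * rng.normal(size=n)  # linearly decodable target
absent = rng.normal(size=n)                                   # target independent of h

split = n // 2
r2_encoded = fit_linear_probe(h[:split], encoded[:split], h[split:], encoded[split:])
r2_absent = fit_linear_probe(h[:split], absent[:split], h[split:], absent[split:])
print(f"encoded target R^2: {r2_encoded:.3f}, absent target R^2: {r2_absent:.3f}")
```

A score near one means the target is linearly readable from the activations; a score near zero means a linear readout finds nothing, which is the pattern the summary attributes to pseudorapidity in the equivariant models.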

If this is right

  • Equivariant models achieve frame invariance by nulling out pseudorapidity dependence in their learned features.
  • Jet mass and N-subjettiness are strongly linearly decodable from the learned representations, which makes them candidate carriers of discriminative information for top tagging.
  • Vector-like channels provide the dominant representational pathway while bivector channels add little value for this task.
  • The presence of multiple pathways suggests the network maintains robustness through redundant feature extraction.
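
To illustrate why pseudorapidity counts as frame-dependent while jet mass does not, here is a small numerical check under illustrative assumptions: massless toy constituents, a boost along the x axis (transverse to the beam axis z) with γ = 2, and hypothetical helper names. None of these choices come from the paper.

```python
import numpy as np

def boost_x(p, gamma):
    """Boost four-momenta with columns (E, px, py, pz) along +x."""
    beta = np.sqrt(1.0 - 1.0 / gamma**2)
    e, px, py, pz = p.T
    return np.stack([gamma * (e + beta * px), gamma * (px + beta * e), py, pz], axis=1)

def pseudorapidity(p):
    """eta = arcsinh(pz / pT), with z taken as the beam axis."""
    return np.arcsinh(p[:, 3] / np.hypot(p[:, 1], p[:, 2]))

def jet_mass(p):
    """Invariant mass of the summed four-momentum."""
    e, px, py, pz = p.sum(axis=0)
    return np.sqrt(e**2 - px**2 - py**2 - pz**2)

# Toy jet: 20 massless constituents scattered around the +x direction.
rng = np.random.default_rng(0)
p3 = rng.normal(loc=[50.0, 0.0, 0.0], scale=10.0, size=(20, 3))
constituents = np.column_stack([np.linalg.norm(p3, axis=1), p3])
boosted = boost_x(constituents, gamma=2.0)

mass_shift = abs(jet_mass(boosted) - jet_mass(constituents))
eta_shift = np.max(np.abs(pseudorapidity(boosted) - pseudorapidity(constituents)))
print(f"jet mass shift: {mass_shift:.2e}, largest eta shift: {eta_shift:.3f}")
```

The jet mass agrees before and after the boost up to floating-point error, whereas the per-particle pseudorapidities move, so a representation that is invariant under such boosts has no stable way to store η.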

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed channel selectivity could motivate simplified architectures that drop bivector components for jet-tagging tasks without loss of accuracy.
  • Similar linear-probe and ablation techniques could be applied to other symmetry-equivariant models to map which observables survive the symmetry constraints.
  • The results raise the question of whether the same suppression of frame-dependent variables occurs in equivariant models trained on different collider observables.
  • If vector dominance generalizes, training procedures might explicitly regularize toward vector representations to improve sample efficiency.

Load-bearing premise

Linear probes and grade ablations fully capture the discriminative information learned by the models without missing non-linear encodings or interactions between channels.
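
The premise can be made concrete with a toy case in which a target is present in the activations but only non-linearly. In this sketch (synthetic data and hypothetical names, not drawn from the paper), the target is the square of one activation coordinate, so an ordinary least-squares probe scores near zero while a probe given squared features recovers it.

```python
import numpy as np

def probe_r2(x_train, y_train, x_test, y_test):
    """Held-out R^2 of an ordinary least-squares probe with an intercept."""
    a_train = np.column_stack([x_train, np.ones(len(x_train))])
    a_test = np.column_stack([x_test, np.ones(len(x_test))])
    coef, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
    residual = y_test - a_test @ coef
    return 1.0 - np.mean(residual**2) / np.var(y_test)

rng = np.random.default_rng(1)
h = rng.normal(size=(4000, 8))   # stand-in activations
target = h[:, 0] ** 2            # encoded, but not linearly
train, test = slice(0, 2000), slice(2000, 4000)

# Probe on the raw activations.
r2_linear = probe_r2(h[train], target[train], h[test], target[test])

# Probe on activations augmented with their squares (a simple non-linear probe).
h_quad = np.column_stack([h, h**2])
r2_quadratic = probe_r2(h_quad[train], target[train], h_quad[test], target[test])
print(f"linear probe R^2: {r2_linear:.3f}, quadratic probe R^2: {r2_quadratic:.3f}")
```

A near-zero linear probe score therefore shows that a quantity is not linearly readable, which is weaker than showing it is absent.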

What would settle it

A concrete experiment would be to train a non-linear probe on the same activations and observe whether it recovers significant pseudorapidity signal or bivector contributions that linear methods missed, or to measure tagging performance after explicitly zeroing vector channels versus bivector channels.
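
The zero-ablation half of this experiment can be sketched as follows. This is a toy under stated assumptions, not L-GATr code: multivectors are taken to have 16 components ordered by grade (1 scalar, 4 vector, 6 bivector, 4 trivector, 1 pseudoscalar), the grade groups follow the names used in the figure captions, and `toy_head` is a hypothetical stand-in for the downstream network with the class signal injected into grade 1 by construction.

```python
import numpy as np

# Assumed component layout of a 16-dimensional spacetime-algebra multivector.
GRADE_SLICES = {0: slice(0, 1), 1: slice(1, 5), 2: slice(5, 11),
                3: slice(11, 15), 4: slice(15, 16)}
# Grade groups as named in the figure captions.
GRADE_GROUPS = {"scalar_like": (0, 4), "vector_like": (1, 3), "bivector": (2,)}

def zero_grades(mv, grades):
    """Return a copy of mv with the components of the given grades set to zero."""
    out = mv.copy()
    for g in grades:
        out[..., GRADE_SLICES[g]] = 0.0
    return out

def auc(scores, labels):
    """ROC AUC from all positive/negative pairs; ties count one half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

def toy_head(mv):
    """Hypothetical readout: sums every component of every channel."""
    return mv.sum(axis=(1, 2))

rng = np.random.default_rng(2)
n_jets, n_channels = 600, 8
labels = rng.integers(0, 2, size=n_jets)
mv = rng.normal(size=(n_jets, n_channels, 16))
mv[:, :, GRADE_SLICES[1]] += labels[:, None, None]  # signal lives in grade 1 only

baseline = auc(toy_head(mv), labels)
delta_auc = {name: baseline - auc(toy_head(zero_grades(mv, grades)), labels)
             for name, grades in GRADE_GROUPS.items()}
print(baseline, delta_auc)
```

In this toy the vector-like ablation removes the signal and the AUC drops toward chance, while zeroing bivectors only removes noise. In a real network the ablation would be applied to hidden activations inside the model, and redundant pathways could make the per-group drops depend on the training seed.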

Figures

Figures reproduced from arXiv: 2606.21790 by Dhruv Kumar, Jay Agarwal, Siddharth Khare.

Figure 1: Mean relative logit change under pure boosts versus γ across five models, using a sweep from γ = 1 to γ = 5. Shaded bands: combined 95% bootstrap CIs pooled across 3 seeds. Lower is more equivariant. We treat this experiment as validation rather than a main result: it confirms that the trained models preserve the intended symmetries closely enough for the representation-level a…
Figure 2: Final-layer linear probe scores for selected targets across five models. Points: combined-bootstrap medians; error bars: 95% CIs pooled across 3 seeds. Targets shown: particle η, the N-subjettiness family, jet-level mass and multiplicity, then particle kinematics; particle ϕ is placed last as it is near zero for all models. Top/QCD classification and particle pT-rank quartile are reported in …
Figure 3: Scalar-channel probe trajectories across all five models. Shaded bands: combined 95% bootstrap CIs pooled across 3 seeds. Left: particle η; all three symmetry-aware models remain near zero throughout depth, while ParT and Vanilla-T retain substantial signal. Centre: τ32 is consistently harder to decode than individual τN values across all architectures. Right: jet mass; LLoCa-T achieves the highest final-layer…
Figure 4: Grade decomposition for L-GATr on TopTagging. Left: global zero-grade ablation ∆AUC for each grade group. Bars show combined 95% bootstrap CIs pooled across 3 seeds; dots show per-seed values. Bivector (G2) is consistently negligible; vector-like (G1+G3) is dominant but highly seed-variable, reflecting multiple usable pathways. Right: layer-resolved zero-grade ablation (seed mean ± std; thin lines show per…
Figure 5: Overview: head-averaged mean final-layer attention maps for L-GATr (left), L-GATr-slim (centre) and LLoCa-T (right), across all three training seeds (rows). L-GATr and L-GATr-slim both show strong attention to the leading-pT constituent; LLoCa-T displays a more diffuse pattern.
Figure 6: [caption missing in source]
Figure 7: Individual attention heads (0–7) for L-GATr-slim, across all three training seeds (rows). Pattern is broadly similar to L-GATr but with a somewhat broader spread across constituents.
Figure 8: Individual attention heads (0–7) for LLoCa-T, across all three training seeds (rows). Attention patterns are more spatially diffuse than L-GATr/L-GATr-slim, reflecting the different internal structure of canonicalization-based equivariance.
Figure 9: Linear CKA between scalar-channel representations at all layer pairs (L0–L11) for L-GATr (top row), L-GATr-slim (middle row) and LLoCa-T (bottom row), across all three training seeds (columns). Axes: L0 at bottom-left, L11 at top-right. Colorscale shared across all panels [0, 1].
Figure 10: All remaining hs probe trajectories across all five TopTagging models (4 × 3 grid; particle η and jet τ32 shown in the main text are omitted). Row 1: top/QCD AUC, jet mass, jet multiplicity. Row 2: τ1, τ2, τ3. Row 3: τ21, particle pT, particle E. Row 4: particle ϕ, pT-rank quartile, particle ∆R. Shaded bands: combined 95% bootstrap CIs pooled across 3 seeds for all models. Particle ϕ is near zero for al…
Figure 11: Per-layer geometric-algebra-invariant probe trajectories for all 8 jet-level targets (3 × 3 grid), broken down by grade group. Lines: scalar-like (G0+G4), vector-like (G1+G3), bivector (G2), all 832 features combined and hs scalar-channel reference. Shaded bands: combined 95% bootstrap CIs for the multivector feature sets (600 pooled draws for jet-level targets); hs uses seed mean ± std.
Figure 12: Per-layer geometric-algebra-invariant probe trajectories for all 6 particle-level targets (2 × 3 grid), broken down by grade group. Same line and color conventions as …
Figure 13: Grade ablation results for L-GATr trained with bivectors architecturally disabled (G2 channel zeroed during training). The model achieves baseline AUC 0.9865 ± 0.0002, indistinguishable from the standard L-GATr (0.9867 ± 0.0003). The bivector bar is trivially zero (G2 was never trained), while the vector-like ∆AUC per seed (0.216, 0.082, 0.399; mean 0.232 ± 0.131) shows the same high cross-seed …
Figure 14: Channel ablations for L-GATr-slim on TopTagging. Left: global ∆AUC for all six interventions with combined 95% bootstrap CIs (bars) and per-seed dots. Interventions ordered by decreasing impact: hv (all vectors), hs (all scalars), et (energy), exyz (3-momentum), ez (beam axis), exy (transverse). Right: layer-resolved ∆AUC for hs and hv only (component ablations were not run per-layer); thin lines show per…
Figure 15: Grade ablations for L-GATr with the alternative subgroup setting (5 independent grades G0–G4). Left: global zero-ablation ∆AUC with combined 95% bootstrap CIs and per-seed dots. Right: layer-resolved ∆AUC (seed mean ± std); thin lines show per-seed G1 trajectories. G2, G3, and G4 are all negligible; G1 is again dominant with high cross-seed variance, consistent with the 3-group main result.
Original abstract

We study what Lorentz-equivariant jet taggers learn internally, using equivariance tests, linear probes and grade ablations across five models including L-GATr, L-GATr-slim and LLoCa-T. Linear probes show that equivariant models suppress frame-dependent pseudorapidity to zero while encoding jet mass and N-subjettiness strongly. Grade ablations on L-GATr reveal that bivector channels are negligible for top-quark tagging while vector-like channels are dominant but seed variable, consistent with the network exploiting multiple representational pathways. These results characterize which physical features and algebraic grade structures carry discriminative information in equivariant taggers and may inform future development of such models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript studies the internal representations learned by Lorentz-equivariant jet taggers (L-GATr, L-GATr-slim and LLoCa-T, compared against the ParT and Vanilla-T baselines) via equivariance tests, linear probes and grade ablations. It reports that equivariant models suppress frame-dependent pseudorapidity to zero while strongly encoding jet mass and N-subjettiness, and that grade ablations on L-GATr show bivector channels to be negligible for top-quark tagging while vector-like channels dominate but are seed-variable, consistent with multiple representational pathways.

Significance. If the empirical measurements are robust, the work would usefully characterize which physical observables and algebraic grades carry discriminative information in equivariant jet taggers and could guide future architecture choices. The study performs direct empirical measurements on already-trained models using standard tools, which is a methodological strength.

major comments (2)
  1. [linear probes section] The linear-probe results (suppression of pseudorapidity, strong encoding of mass and N-subjettiness) are presented without details on training procedure, data splits, statistical controls, error estimation or hyperparameter selection; this information is required to establish that the reported quantities are not sensitive to post-hoc analysis choices.
  2. [grade ablations on L-GATr] Grade ablations on L-GATr conclude that bivector channels are negligible while vector channels dominate; because the paper itself remarks on 'multiple representational pathways,' the ablations (performed linearly and in isolation) may miss non-linear cross-grade interactions that could alter the interpretation of channel dominance.
minor comments (1)
  1. The abstract should explicitly list all five models examined and state the precise tagging task (top-quark vs. QCD).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the manuscript to incorporate additional details and clarifications where appropriate.

Point-by-point responses
  1. Referee: [linear probes section] The linear-probe results (suppression of pseudorapidity, strong encoding of mass and N-subjettiness) are presented without details on training procedure, data splits, statistical controls, error estimation or hyperparameter selection; this information is required to establish that the reported quantities are not sensitive to post-hoc analysis choices.

    Authors: We agree that the linear-probe results require more detailed documentation to ensure robustness and reproducibility. In the revised manuscript, we will add a new subsection in the methods or results detailing: the probe training procedure (including optimizer, learning rate, epochs, and regularization), data splits (using the standard jet tagging train/validation/test partitions with probes trained on training data and evaluated on test data), statistical controls (averaging over multiple random seeds for probe training), error estimation (reporting mean and standard deviation across seeds), and hyperparameter selection (e.g., via validation set). These additions will confirm that the observed suppression of pseudorapidity and strong encoding of mass and N-subjettiness are stable and insensitive to post-hoc choices. revision: yes

  2. Referee: [grade ablations on L-GATr] Grade ablations on L-GATr conclude that bivector channels are negligible while vector channels dominate; because the paper itself remarks on 'multiple representational pathways,' the ablations (performed linearly and in isolation) may miss non-linear cross-grade interactions that could alter the interpretation of channel dominance.

    Authors: We acknowledge the referee's observation that our remark on multiple representational pathways (evidenced by seed-variable vector channel importance) raises the possibility of non-linear cross-grade interactions not captured by linear, isolated ablations. However, the linear ablations provide a direct and standard measure of each grade's isolated contribution, and the negligible bivector role is consistent across seeds. Exploring non-linear interactions would require more advanced methods (e.g., non-linear probes on grade combinations) beyond the scope of this empirical characterization study. In the revision, we will add a discussion paragraph noting this limitation while maintaining that the linear results still validly indicate grade dominance for top tagging. revision: partial
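
On the error-estimation point raised in the first response, the figure captions mention "combined 95% bootstrap CIs pooled across 3 seeds" without spelling out the procedure. One plausible reading, sketched here purely as an assumption, is to bootstrap each seed's per-example scores separately and then take percentiles over the pooled draws, so the interval reflects both resampling noise and seed-to-seed spread. The function name, the 200 draws per seed and the synthetic scores are all hypothetical.

```python
import numpy as np

def pooled_bootstrap_ci(per_seed_values, statistic=np.mean, n_draws=200, level=0.95, seed=0):
    """Bootstrap each seed separately, pool all draws, return (median, lower, upper)."""
    rng = np.random.default_rng(seed)
    draws = []
    for values in per_seed_values:
        values = np.asarray(values)
        for _ in range(n_draws):
            # Resample this seed's examples with replacement.
            resample = values[rng.integers(0, len(values), size=len(values))]
            draws.append(statistic(resample))
    lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2])
    return float(np.median(draws)), float(lo), float(hi)

# Synthetic per-example scores for three training seeds with slightly different means.
rng = np.random.default_rng(3)
per_seed_scores = [rng.normal(loc=m, scale=0.1, size=1000) for m in (0.78, 0.80, 0.82)]

median, lower, upper = pooled_bootstrap_ci(per_seed_scores)
print(f"median {median:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

With well-separated seeds the pooled interval is much wider than any single seed's bootstrap interval, which is how a seed-variable quantity such as the vector-like ∆AUC would show up in this kind of summary.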

Circularity Check

0 steps flagged

No circularity: empirical measurements on trained models

Full rationale

The paper reports results from equivariance tests, linear probes, and grade ablations applied to already-trained Lorentz-equivariant jet taggers (L-GATr, LLoCa-T, etc.). These are standard post-training diagnostic tools with no derivations, equations, or fitted parameters that reduce reported findings (e.g., pseudorapidity suppression or channel dominance) to quantities defined or optimized within the same analysis. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The analysis is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on standard domain assumptions from ML interpretability without introducing new fitted parameters or postulated entities.

axioms (1)
  • domain assumption: Linear probes attached to model layers extract the primary discriminative features learned during training.
    Core premise of probe-based interpretability studies.

pith-pipeline@v0.9.1-grok · 5649 in / 1116 out tokens · 29809 ms · 2026-06-26T14:12:36.612091+00:00 · methodology

