
arxiv: 2601.11719 · v3 · submitted 2026-01-16 · 💻 cs.LG · hep-ex

jBOT: Semantic Jet Representation Clustering Emerges from Self-Distillation

Pith reviewed 2026-05-16 13:11 UTC · model grok-4.3

keywords: self-supervised learning, jet representations, anomaly detection, particle classification, self-distillation, representation clustering, high energy physics

The pith

Pre-training on unlabeled jet data via self-distillation produces emergent semantic clustering in the embedding space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

jBOT is a pre-training method that applies self-distillation to jet data from particle collisions at the LHC. It performs distillation both at the level of individual particles inside each jet and at the level of the full jet to build representations without any labels. This process causes jets that share the same underlying physics class to group together naturally in the learned space. When the model sees only background jets during pre-training, the frozen embedding supports anomaly detection through simple distance measurements. The same embedding, once fine-tuned, reaches higher classification accuracy than models trained from scratch with full supervision.

Core claim

The jBOT method demonstrates that self-distillation applied to unlabeled jets produces emergent semantic class clustering in the representation space. Pre-training performed exclusively on background jets yields a frozen embedding in which anomalies become detectable through straightforward distance-based metrics. The same embedding, when subsequently fine-tuned, delivers improved performance on classification tasks relative to supervised models trained from scratch.
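
To make "straightforward distance-based metrics" concrete, here is a minimal sketch of one such score: the mean Euclidean distance from a jet's embedding to its k nearest background embeddings. The page does not say which distance metric the paper uses, so the k-nearest-neighbour choice, the function name `knn_anomaly_score`, and the toy 2D embeddings are all assumptions for illustration.

```python
import math

def knn_anomaly_score(query, background, k=3):
    """Mean Euclidean distance from `query` to its k nearest background embeddings.

    Hypothetical scoring rule: a larger score means the jet sits farther from
    the background-only embedding cloud and is therefore more anomalous.
    """
    if not 1 <= k <= len(background):
        raise ValueError("k must be between 1 and the number of background embeddings")
    distances = sorted(math.dist(query, b) for b in background)
    return sum(distances[:k]) / k

# Toy frozen embeddings of background jets (made-up values, 2D for readability).
background = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

inlier_score = knn_anomaly_score((0.5, 0.5), background)
outlier_score = knn_anomaly_score((10.0, 10.0), background)
print(inlier_score, outlier_score)
```

In this setup the encoder stays frozen and only the scores are thresholded, so no signal labels are needed at any stage.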

What carries the argument

The jBOT self-distillation procedure, which jointly applies local particle-level distillation and global jet-level distillation to shape the representation space.
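
As a rough illustration of what jointly applying a jet-level and a particle-level distillation objective can look like, the sketch below follows the general DINO/iBOT recipe: a student is trained to match a centered, sharpened teacher distribution, and the teacher is an exponential moving average of the student. This page does not give jBOT's actual loss, so the temperatures, the centering, the masking scheme, the `weight_local` factor, and all function names are assumptions for illustration only.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def distill_loss(student_logits, teacher_logits, center,
                 t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the centered, sharpened teacher and the student."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    p_teacher = [math.exp(v) for v in log_softmax(centered, t_teacher)]
    log_p_student = log_softmax(student_logits, t_student)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))

def joint_loss(student_jet, teacher_jet, student_particles, teacher_particles,
               masked, center_jet, center_particle, weight_local=1.0):
    """Global (jet-level) term plus local (particle-level) term.

    Assumption: the local term is averaged over masked particles only,
    as in masked-token distillation.
    """
    global_term = distill_loss(student_jet, teacher_jet, center_jet)
    local_terms = [
        distill_loss(s, t, center_particle)
        for s, t, is_masked in zip(student_particles, teacher_particles, masked)
        if is_masked
    ]
    local_term = sum(local_terms) / len(local_terms) if local_terms else 0.0
    return global_term + weight_local * local_term

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights track the student through an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

Only the student receives gradients in this scheme; the teacher is updated through `ema_update` after each step.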

Load-bearing premise

The combination of particle-level and jet-level distillation is what produces the semantic clustering rather than properties of the jet data distribution or standard self-supervised objectives alone.

What would settle it

Train an otherwise identical jet model using only particle-level distillation or only jet-level distillation and test whether distinct semantic clusters still appear in the embedding space.

Original abstract

Self-supervised learning, in the context of foundation model training, is a powerful pre-training method for learning feature representations without labels, which often capture generic underlying semantics from the data and can later be fine-tuned for downstream tasks. In this work, we introduce jBOT, a pre-training method based on self-distillation for jet data from the CERN Large Hadron Collider, which combines local particle-level distillation with global jet-level distillation to learn jet representations that support downstream tasks such as anomaly detection and classification. We observe that pre-training on unlabeled jets leads to emergent semantic class clustering in the representation space. The clustering in the frozen embedding, when pre-trained on background jets only, enables anomaly detection via simple distance-based metrics, and the learned embedding can be fine-tuned for classification with improved performance compared to supervised models trained from scratch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces jBOT, a self-supervised pre-training method for jet data from the LHC that combines local particle-level self-distillation with global jet-level self-distillation. The central claim is that pre-training on unlabeled jets produces emergent semantic class clustering in the learned representation space; the frozen embedding, when trained only on background jets, supports anomaly detection via simple distance-based metrics, and the embedding can be fine-tuned for classification with performance gains over supervised models trained from scratch.

Significance. If the empirical claims hold, the work would be significant for self-supervised learning in high-energy physics by showing that dual-scale distillation on jet data can discover semantic structures without labels. This could enable more effective anomaly detection in background-only settings and improve data efficiency for classification tasks, contributing to foundation-model-style approaches for LHC analyses.

major comments (2)
  1. Abstract: the claim that the specific combination of local particle-level and global jet-level self-distillation produces emergent semantic clustering is not supported by any ablations or comparisons to simpler baselines (e.g., global-only distillation, SimCLR-style contrastive learning, or masked modeling on identical jet data). Without these controls it is impossible to attribute the clustering to the jBOT design rather than generic properties of the jet kinematic distribution.
  2. Abstract: no quantitative results, clustering metrics (purity, ARI, silhouette scores), anomaly-detection AUCs, or classification accuracy deltas are reported, so the performance claims cannot be evaluated and the soundness of the central empirical observation remains unverified.
minor comments (1)
  1. Abstract: the phrase 'improved performance compared to supervised models trained from scratch' should be accompanied by explicit metrics and dataset details to allow immediate assessment of the magnitude of the gain.
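
For reference, two of the clustering metrics named in major comment 2 can be computed from cluster assignments and true jet classes as sketched below. This is a plain-Python illustration, not the paper's evaluation code; the function names and the toy labels are hypothetical, and the silhouette score is omitted because it also requires the embedding coordinates.

```python
from collections import Counter
from math import comb

def purity(labels_true, labels_pred):
    """Fraction of samples that belong to the majority class of their cluster."""
    clusters = {}
    for true_label, cluster in zip(labels_true, labels_pred):
        clusters.setdefault(cluster, Counter())[true_label] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(labels_true)

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from the contingency table (needs >= 2 samples)."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)

# Toy example: two jet classes, clusters that recover them up to renaming.
true_classes = ["quark", "quark", "gluon", "gluon"]
perfect_clusters = [1, 1, 0, 0]
mixed_clusters = [0, 1, 0, 1]

print(purity(true_classes, perfect_clusters),
      adjusted_rand_index(true_classes, perfect_clusters))
print(purity(true_classes, mixed_clusters),
      adjusted_rand_index(true_classes, mixed_clusters))
```

Purity rewards over-segmentation (one cluster per jet gives purity 1), which is why a chance-corrected metric such as ARI is usually reported alongside it.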

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive report. We address each major comment point by point below and will revise the manuscript to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: Abstract: the claim that the specific combination of local particle-level and global jet-level self-distillation produces emergent semantic clustering is not supported by any ablations or comparisons to simpler baselines (e.g., global-only distillation, SimCLR-style contrastive learning, or masked modeling on identical jet data). Without these controls it is impossible to attribute the clustering to the jBOT design rather than generic properties of the jet kinematic distribution.

    Authors: We agree that explicit ablations are required to isolate the contribution of the dual-scale (local + global) distillation. The current manuscript focuses on the full jBOT pipeline; in the revision we will add a dedicated ablation section comparing jBOT against global-only distillation, SimCLR-style contrastive learning, and masked modeling, all trained on the identical unlabeled jet dataset. These controls will quantify how much of the observed semantic clustering is attributable to the specific combination of scales versus generic properties of the jet kinematic distribution.

    Revision: yes

  2. Referee: Abstract: no quantitative results, clustering metrics (purity, ARI, silhouette scores), anomaly-detection AUCs, or classification accuracy deltas are reported, so the performance claims cannot be evaluated and the soundness of the central empirical observation remains unverified.

    Authors: The abstract is written as a high-level summary and therefore omits specific numbers. The full manuscript already contains the requested quantitative results: clustering purity, ARI and silhouette scores demonstrating emergent semantic structure; anomaly-detection AUCs obtained with distance-based metrics on background-only training; and classification accuracy deltas after fine-tuning versus supervised baselines trained from scratch. To address the concern directly, we will insert the key numerical highlights into the revised abstract while retaining the concise style.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical method with observational claims

Full rationale

The paper introduces jBOT as a self-distillation pre-training approach for jet data that combines particle-level and jet-level objectives, then reports emergent semantic clustering in the learned representations. No equations, derivations, or fitted-parameter predictions appear in the abstract or described content that would reduce the central claim to a tautology or self-referential fit. The clustering observation is presented as an empirical outcome of pre-training on unlabeled jets, with downstream uses for anomaly detection and fine-tuning. This structure is self-contained through experimental results rather than any load-bearing self-citation chain or definitional reduction, consistent with a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only access prevents identification of any concrete free parameters, axioms, or invented entities; the description remains at the level of high-level method and observation.

pith-pipeline@v0.9.0 · 5434 in / 1038 out tokens · 39837 ms · 2026-05-16T13:11:56.333363+00:00 · methodology

