Sensitivity Limits and Operational Threshold Calibration for DINOv2-based Gravitational-Wave Glitch Characterization: A Strain-Domain Mock Data Challenge on LIGO O4a

Luca Cirfeta

arxiv: 2606.06237 · v1 · pith:JJXVH7BXnew · submitted 2026-06-04 · 🌌 astro-ph.IM · gr-qc

Pith reviewed 2026-06-27 23:28 UTC · model grok-4.3

classification 🌌 astro-ph.IM gr-qc

keywords gravitational wave glitches · DINOv2 · mock data challenge · LIGO O4a · spectrogram analysis · vision transformers · anomaly detection · false positive rate calibration

The pith

DINOv2 pipeline for LIGO glitch detection returns zero recall when false positives are controlled below 0.01%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a mock data challenge by injecting eight families of synthetic glitches into real LIGO O4a strain data and testing a DINOv2-based unsupervised detection pipeline. It shows that a loose dynamic threshold recovers some visually distinct morphologies at high SNR, but a strict operational threshold calibrated from the full embedding distribution yields no detections whatsoever. The authors attribute the failure to the global average pooling step applied to the model's [CLS] token, which spreads signal information across the entire spectrogram and weakens features that occupy only a few percent of the patches. A reader would care because the result quantifies how standard vision-transformer pipelines can miss localized signals in time-frequency representations even when those signals are injected at high amplitude.

Core claim

Under a statistically rigorous operational threshold (tau_op = 0.874) calibrated at the empirical 5x10^-5 quantile (FPR < 0.01%), the MDC yields Recall = 0 for all eight morphologies at all tested SNR levels, including narrowband structures (HarmonicComb, NarrowChirp) and impulsive transients (AsymBlip) at SNR up to 430. This insensitivity is traced to the global average pooling of the DINOv2 [CLS] token, which dilutes signals occupying a small fraction (<5%) of the spectrogram's 37x37 patch grid. The null result of the earlier work is reinterpreted as confirming the absence of novel macro-structures while leaving open the possibility of localized micro-structures.

What carries the argument

Global average pooling of the DINOv2 [CLS] token, which averages features across the full spectrogram and thereby dilutes localized signals that occupy few patches.
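
To give the dilution argument a sense of scale, here is a minimal sketch under a deliberately crude assumption: every background patch contributes the same unit feature, every signal patch contributes a unit feature orthogonal to it, and the pooled embedding is their plain average. None of this comes from the paper's code. The function names are illustrative, and a real ViT mixes patches non-linearly, so the numbers indicate magnitude only.

```python
import math

GRID = 37                    # patches per side, as stated in the paper
N_PATCHES = GRID * GRID      # 1369 patches in total
TAU_OP = 0.874               # operational threshold reported in the paper


def pooled_cosine(fraction: float) -> float:
    """Cosine similarity between the background direction and the averaged
    embedding when `fraction` of patches carry an orthogonal unit feature."""
    background_part = 1.0 - fraction
    signal_part = fraction
    return background_part / math.hypot(background_part, signal_part)


def fraction_needed(threshold: float) -> float:
    """Patch fraction at which the toy similarity equals `threshold`."""
    tangent = math.sqrt(1.0 - threshold ** 2) / threshold
    return tangent / (1.0 + tangent)


if __name__ == "__main__":
    print(f"similarity at 5% occupancy: {pooled_cosine(0.05):.4f}")
    print(f"occupancy needed to reach tau_op: {fraction_needed(TAU_OP):.3f}")
```

In this toy model a signal covering 5% of the patches lowers the similarity only to about 0.9986, and roughly 36% of the 1,369 patches would have to change before the score reached 0.874.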

If this is right

The pipeline recovers visually anisotropic morphologies at matched-filter SNR >= 70 only when using an uncontrolled dynamic threshold.
The full O4a embedding distribution (N = 188,142) is extremely non-Gaussian and its left tail is best modeled by a GEV distribution.
The null result confirms absence of novel macro-structures but cannot exclude localized micro-structures.
Next-generation ViT pipelines for this task require patch-level scoring and multi-scale windowing.
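
The non-Gaussianity in the list above is summarized in the paper by skewness and excess kurtosis. The sketch below shows how those two statistics can be computed from a list of scores. It uses the population (biased) moment estimators, which is an assumption, since the paper's estimator is not stated here, and the helper name `sample_moments` is illustrative.

```python
def sample_moments(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0   # zero for a Gaussian
    return skewness, excess_kurtosis


if __name__ == "__main__":
    # Tiny hand-made sample with one low outlier, NOT the O4a data.
    scores = [1.0, 1.0, 1.0, 1.0, 0.0]
    skew, kurt = sample_moments(scores)
    print(f"skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

A negative skewness, as reported for the O4a embeddings, means the long tail sits on the low-similarity side, which is the side where the detection threshold lives.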

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Replacing global pooling with local patch scoring may restore sensitivity to sparse signals without raising the false-positive rate.
The same pooling limitation could affect DINOv2 use in other spectrogram-based anomaly tasks where signals occupy only a small spatial fraction.
Operational calibration must be performed on the full background distribution rather than session-wise statistics to keep false positives controlled.
Earlier null results in similar unsupervised pipelines may also overlook micro-scale features rather than proving their absence.

Load-bearing premise

That global average pooling of the DINOv2 [CLS] token is the primary cause of the observed insensitivity rather than training data, pretraining, or injection fidelity.

What would settle it

Running the identical MDC injections through a modified DINOv2 pipeline that replaces global average pooling with patch-level anomaly scoring and checking whether recall becomes non-zero at the same tau_op = 0.874 threshold.
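
The proposed test can be pictured with a toy comparison between a pooled score and a patch-level score. This is a sketch under strong assumptions: features are two-dimensional, background and anomalous patches are exactly orthogonal, and the patch-level rule (mean of the `k` lowest patch similarities) is a hypothetical stand-in, not the scoring used in the paper or in any follow-up work.

```python
import math

TAU_OP = 0.874  # operational threshold reported in the paper


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def global_score(patches, reference):
    """Similarity of the averaged patch feature to the reference direction."""
    dim = len(reference)
    mean = [sum(p[i] for p in patches) / len(patches) for i in range(dim)]
    return cosine(mean, reference)


def patch_score(patches, reference, k=10):
    """Hypothetical local score: mean of the k lowest patch similarities."""
    sims = sorted(cosine(p, reference) for p in patches)
    return sum(sims[:k]) / k


def make_grid(n_patches, n_anomalous):
    """Toy grid: anomalous patches are orthogonal to the background ones."""
    background, anomaly = (1.0, 0.0), (0.0, 1.0)
    return [anomaly] * n_anomalous + [background] * (n_patches - n_anomalous)


if __name__ == "__main__":
    grid = make_grid(37 * 37, 68)        # 68 of 1369 patches, just under 5%
    reference = (1.0, 0.0)
    print("global score below tau_op:", global_score(grid, reference) < TAU_OP)
    print("patch score below tau_op:", patch_score(grid, reference) < TAU_OP)
```

In the toy grid the pooled score stays well above 0.874 while the local score drops below it. Whether the same holds for real DINOv2 features, and at what false-positive cost, is exactly what the ablation would have to establish.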

Figures

Figures reproduced from arXiv: 2606.06237 by Luca Cirfeta.

**Figure 1.** Empirical distribution of smax for 188,142 O4a segments (left tail). The Generalized Extreme Value (GEV) distribution provides a significantly better fit than a Beta or Gaussian distribution, demonstrating the heavy-tailed nature of the baseline background. From the accompanying text: the Generalized Extreme Value (GEV) distribution (Fisher & Tippett 1928; Gumbel 1958) provides a significantly better fit than a Beta distribution: LLGEV = 32,4…

**Figure 2.** MDC sensitivity curves demonstrating the threshold bifurcation. At the dynamic threshold τdyn = 0.9811 (Run A), visually anisotropic morphologies like Butterfly and ZSweep are recovered at high SNR. At the operationally calibrated threshold τop = 0.874 (Runs B and C), Recall is identically zero for all morphologies across all tested SNRs. For short-duration transients at the center of a 32-second window…

**Figure 3.** Schematic representation of the signal dilution effect. A transient occupying a small fraction (< 5%) of the 37 × 37 spectrogram patch grid is severely attenuated by the global average pooling of the DINOv2 [CLS] token, failing to suppress the global similarity below the operational detection threshold. Although the non-linear projection layers and MLP heads of the ViT introduce non-linear coupling terms, …

Original abstract

We present a Mock Data Challenge (MDC) to characterize the sensitivity limits of the gravi-signal-ml pipeline (Cirfeta 2026) for unsupervised gravitational-wave glitch detection. Strain-domain synthetic injections of eight morphological families into public LIGO O4a L1 data reveal two threshold-dependent sensitivity regimes. With a session-adaptive dynamic threshold (tau_dyn = mu_bg - 2.5 * sigma_bg), the pipeline recovers visually anisotropic morphologies (Butterfly, ZSweep) at matched-filter SNR >= 70, reaching Recall = 1.0, though the False Positive Rate (FPR) remains uncontrolled across sessions. Characterization of the full O4a embedding distribution (N = 188,142 segments) reveals extreme non-Gaussianity (skewness = -4.12, excess kurtosis = 15.38, Shapiro-Wilk p-value near 0), with the left tail best modeled by a Generalized Extreme Value (GEV) distribution. Under a statistically rigorous operational threshold (tau_op = 0.874) calibrated at the empirical 5x10^-5 quantile (FPR < 0.01%), the MDC yields Recall = 0 for all eight morphologies at all tested SNR levels, including narrowband structures (HarmonicComb, NarrowChirp) and impulsive transients (AsymBlip) at SNR up to 430. We trace this insensitivity to the global average pooling of the DINOv2 [CLS] token, which dilutes signals occupying a small fraction (<5%) of the spectrogram's 37x37 patch grid. The null result of Cirfeta (2026) is conditionally reinterpreted: it confirms the absence of novel macro-structures but cannot exclude localized micro-structures. These findings provide a quantitative roadmap for next-generation ViT-based pipelines using patch-level scoring and multi-scale windowing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows zero recall for all injected glitches at a strict FPR-controlled threshold and attributes it to DINOv2 pooling, but offers no ablation to back that up.

The main thing to know is that this mock data challenge finds the gravi-signal-ml pipeline returns recall of zero on every injected glitch morphology once the threshold is set at the 5e-5 quantile of the background embeddings. That threshold keeps FPR below 0.01 percent, and the null holds even for high-SNR injections up to 430.

What the work actually does is inject eight glitch families into real LIGO O4a L1 strain, embed the spectrograms with DINOv2, and characterize the full set of 188,142 background embeddings. The non-Gaussian left tail (skewness -4.12, excess kurtosis 15.38) is modeled with a GEV distribution, and the paper defines both a session-adaptive dynamic threshold and the stricter operational one. The reinterpretation of the author's own prior null result as excluding only macro-structures is the new framing.

The injection campaign and the explicit quantile-based threshold are concrete and reproducible steps. Reporting the actual embedding statistics and the GEV fit gives readers numbers they can check against their own runs.

The soft spot is the causal claim. The paper states that global average pooling of the [CLS] token dilutes signals that occupy less than 5 percent of the 37x37 patch grid, but there is no test of patch-level scoring, alternative aggregation, or a different backbone to see whether the model would respond under other conditions. No check is shown on whether the strain-to-spectrogram conversion preserves the injected SNR. Without those controls the attribution stays speculative.

This is for people already working on vision-transformer pipelines for LIGO data quality. A reader who wants measured sensitivity limits on one specific setup will find usable numbers; someone looking for a general method or first-principles insight will not.

It has enough concrete results and a clear experimental design to deserve peer review, though the authors should expect questions on the missing ablations for the main explanation.

Referee Report

3 major / 2 minor

Summary. The paper reports a mock data challenge (MDC) injecting eight glitch morphologies into LIGO O4a L1 strain data and processing them through the gravi-signal-ml pipeline based on DINOv2 embeddings. With a session-adaptive dynamic threshold it recovers some anisotropic morphologies at SNR >=70, but under an operational threshold tau_op=0.874 (empirical 5x10^-5 quantile of N=188142 background embeddings, targeting FPR<0.01%) it obtains Recall=0 for all morphologies even at SNR up to 430. The null result is attributed to global average pooling of the DINOv2 [CLS] token diluting signals that occupy <5% of the 37x37 spectrogram patch grid, and is used to reinterpret the null result of Cirfeta (2026) as excluding only macro-structures.

Significance. If the reported null result under the calibrated operational threshold is robust, the work supplies a quantitative upper bound on the sensitivity of standard DINOv2 [CLS]-token pipelines to localized gravitational-wave features and motivates patch-level or multi-scale alternatives. The GEV characterization of background non-Gaussianity and the explicit quantile-based threshold definition are concrete contributions that could be reused by other embedding-based searches.

major comments (3)

[abstract / concluding paragraph] The central attribution of Recall=0 to global average pooling of the [CLS] token (abstract and final paragraph) is load-bearing for the reinterpretation of Cirfeta (2026) yet is unsupported by any ablation: no comparison is shown between [CLS] pooling and patch-level scoring, alternative aggregation, or a model with different pretraining. Without such controls it is impossible to rule out injection fidelity, spectrogram mapping, or pretraining data as the dominant cause.
[methods (implied by MDC description)] The injection and SNR-preservation procedures are not described in sufficient detail to verify that the injected signals retain their nominal matched-filter SNR after strain-to-spectrogram conversion and embedding. The claim that narrowband (HarmonicComb, NarrowChirp) and impulsive (AsymBlip) morphologies remain undetectable up to SNR=430 therefore rests on an unverified assumption about signal fidelity.
[abstract / discussion] The reinterpretation of the Cirfeta (2026) null result as excluding only macro-structures depends on self-citation without an independent reproduction or external benchmark of the prior pipeline; this circularity weakens the load-bearing claim that the present MDC supplies a conditional confirmation.

minor comments (2)

[abstract] The precise definition of the 37x37 patch grid and the fraction of patches occupied by each morphology should be stated explicitly with a figure or equation.
[results] Error bars or bootstrap uncertainties on the empirical 5x10^-5 quantile (tau_op=0.874) and on the reported Recall=0 values are not provided.
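
The second minor comment has a simple arithmetic core: with N = 188,142 segments, the 5x10^-5 quantile is set by roughly the nine or ten lowest scores. The sketch below computes that count and shows one way a percentile-bootstrap interval on an empirical quantile could be obtained. The quantile convention (value at index floor(q * n)), the helper names, and the synthetic scores are all assumptions for illustration, not the paper's procedure or data.

```python
import random

N_SEGMENTS = 188_142       # background segments reported in the paper
TARGET_QUANTILE = 5e-5     # quantile used to define tau_op


def expected_tail_count(n, q):
    """Average number of background samples lying below the q-quantile."""
    return n * q


def empirical_quantile(sorted_scores, q):
    """Order-statistic estimate: the value at index floor(q * n)."""
    index = min(int(q * len(sorted_scores)), len(sorted_scores) - 1)
    return sorted_scores[index]


def bootstrap_interval(scores, q, n_boot=200, seed=0):
    """95% percentile-bootstrap interval for the empirical q-quantile."""
    rng = random.Random(seed)
    estimates = sorted(
        empirical_quantile(sorted(rng.choices(scores, k=len(scores))), q)
        for _ in range(n_boot)
    )
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    print(expected_tail_count(N_SEGMENTS, TARGET_QUANTILE))  # about 9.4
    rng = random.Random(1)
    # Synthetic left-skewed stand-in for similarity scores, NOT O4a data.
    demo = [1.0 - rng.expovariate(40.0) for _ in range(5_000)]
    print(bootstrap_interval(demo, q=2e-3))
```

Because so few samples sit below the threshold, the interval on tau_op can be wide, which is why the referee's request for an uncertainty estimate matters for the FPR claim.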

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments on our manuscript. We address each of the major comments below, indicating where revisions will be incorporated to improve clarity and rigor.

Referee: [abstract / concluding paragraph] The central attribution of Recall=0 to global average pooling of the [CLS] token (abstract and final paragraph) is load-bearing for the reinterpretation of Cirfeta (2026) yet is unsupported by any ablation: no comparison is shown between [CLS] pooling and patch-level scoring, alternative aggregation, or a model with different pretraining. Without such controls it is impossible to rule out injection fidelity, spectrogram mapping, or pretraining data as the dominant cause.

Authors: We agree that an explicit ablation would strengthen the attribution. The current manuscript bases the explanation on the DINOv2 architecture and the small fractional occupancy of signals in the spectrogram patches. In revision, we will qualify this as a plausible mechanism supported by the model design and observed results, while acknowledging the lack of direct comparison. We will also add a sentence noting that other factors cannot be fully excluded without further experiments.
Revision: partial

Referee: [methods (implied by MDC description)] The injection and SNR-preservation procedures are not described in sufficient detail to verify that the injected signals retain their nominal matched-filter SNR after strain-to-spectrogram conversion and embedding. The claim that narrowband (HarmonicComb, NarrowChirp) and impulsive (AsymBlip) morphologies remain undetectable up to SNR=430 therefore rests on an unverified assumption about signal fidelity.

Authors: We acknowledge the need for greater detail on the injection pipeline. The full text describes the use of public LIGO O4a data and the gravi-signal-ml pipeline, but we will expand the Methods section with a step-by-step description of how injections are performed and how SNR is calculated and preserved through the spectrogram conversion and embedding process, including any verification steps.
Revision: yes

Referee: [abstract / discussion] The reinterpretation of the Cirfeta (2026) null result as excluding only macro-structures depends on self-citation without an independent reproduction or external benchmark of the prior pipeline; this circularity weakens the load-bearing claim that the present MDC supplies a conditional confirmation.

Authors: The reinterpretation is presented as conditional on the shared pipeline between this MDC and Cirfeta (2026). While it relies on self-reference, the MDC provides new data on sensitivity under calibrated thresholds. In the revision, we will rephrase the concluding paragraph to emphasize that the conditional confirmation is based on applying the same embedding method to new injections with rigorous background characterization, rather than claiming an independent validation of the prior work.
Revision: partial

Circularity Check

1 step flagged

The reinterpretation of the prior null result rests on a self-citation to the author's own 2026 pipeline.

specific steps

self citation load bearing [Abstract]
"The null result of Cirfeta (2026) is conditionally reinterpreted: it confirms the absence of novel macro-structures but cannot exclude localized micro-structures."

The reinterpretation that the 2026 null result only rules out macro-structures (while allowing micro-structures) is justified by citing the author's own prior paper for both the pipeline and the conditional reading. The current paper supplies no independent verification that pooling is the dominant mechanism, making the central claim reduce to a self-citation chain.

full rationale

The paper conducts an MDC on the gravi-signal-ml pipeline defined in Cirfeta (2026) and reports Recall=0 under tau_op. The load-bearing step is the conditional reinterpretation of that prior null result as excluding only macro-structures (due to [CLS] pooling), which is justified solely by citation to the same author's earlier work. No independent reproduction, ablation, or external benchmark is cited to separate the pooling explanation from pretraining or injection factors. This matches self_citation_load_bearing; the rest of the MDC results (GEV fit, quantile calibration) are independent.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The paper relies on fitted thresholds and distributional assumptions for the embedding space; no new physical entities are postulated.

free parameters (2)

dynamic threshold multiplier = 2.5
The factor 2.5 used in tau_dyn = mu_bg - 2.5 * sigma_bg
operational threshold = 0.874
tau_op = 0.874 chosen to achieve FPR < 0.01% at the 5x10^-5 quantile
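
The two free parameters above correspond to two different threshold rules. The sketch below puts them side by side so the difference is easy to see. It assumes a segment is flagged when its similarity score falls below the threshold, uses the population standard deviation for sigma_bg, and takes the empirical quantile as the value at index floor(q * n). These conventions and the function names are illustrative assumptions, not the gravi-signal-ml implementation.

```python
import statistics


def dynamic_threshold(session_scores, k=2.5):
    """tau_dyn = mu_bg - k * sigma_bg, computed on one session's scores."""
    mu = statistics.fmean(session_scores)
    sigma = statistics.pstdev(session_scores)  # population std: an assumption
    return mu - k * sigma


def operational_threshold(all_scores, q=5e-5):
    """Empirical q-quantile of the full background score distribution."""
    ordered = sorted(all_scores)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]


def flag(score, threshold):
    """A segment is flagged when its similarity falls below the threshold."""
    return score < threshold


if __name__ == "__main__":
    session = [0.98, 0.99, 1.0]            # toy session, not real data
    print(f"tau_dyn for toy session: {dynamic_threshold(session):.4f}")
    # Thresholds reported in the paper: 0.9811 (Run A) and 0.874 (tau_op).
    print(flag(0.90, 0.9811), flag(0.90, 0.874))
```

A score of 0.90 is flagged under the loose threshold reported for Run A (0.9811) but not under tau_op = 0.874, which is the bifurcation the paper describes. Because the dynamic rule is recomputed per session, its false-positive rate depends on each session's spread, while the quantile rule fixes that rate on the full background.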

axioms (2)

domain assumption: The left tail of the DINOv2 embedding distribution for background LIGO segments is adequately modeled by a Generalized Extreme Value distribution
Invoked to justify the operational threshold calibration from the full O4a embedding distribution
domain assumption: Synthetic morphological injections into O4a data faithfully represent the statistical properties of real gravitational-wave glitches without introducing systematic artifacts
Required for the MDC to validly measure sensitivity limits

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Patch-Level DINOv2 Scoring for Gravitational-Wave Glitch Detection: Breaking the Signal Dilution Barrier via Vector-Quantized Local Feature Indexing
astro-ph.IM 2026-06 unverdicted novelty 3.0

Patch-level top-k similarity scoring against a vector-quantized DINOv2 reference index yields KS=0.963 separation for extended glitch morphologies on LIGO O4a data, addressing global CLS token dilution.

Reference graph

Works this paper leans on

11 extracted references · 1 canonical work pages · cited by 1 Pith paper

[1] Abbott, R., Abbott, T. D., Abraham, S., et al. 2020, Classical and Quantum Gravity, 37, 055002
[2] Cirfeta, L. 2026, arXiv:2605.28572
[3] Allen, B., Anderson, W. G., Brady, P. R., et al. 2012, Physical Review D, 85, 122006, doi:10.1103/PhysRevD.85.122006
[4] Darcet, T., Oquab, M., Doupé, E., et al. 2024, ICLR 2024, arXiv:2309.16588
[5] Davis, D., Areeda, J. S., Berger, B. K., et al. 2021, Classical and Quantum Gravity, 38, 135014
[6] Fisher, R. A., & Tippett, L. H. C. 1928, Mathematical Proceedings of the Cambridge Philosophical Society, 24, 180
[7] Glanzer, J., Saravanan, S., Coughlin, S., et al. 2023, Classical and Quantum Gravity, 40, 065006
[8] Gravitational Wave Open Science Center 2023, GWOSC O4a Dataset, https://gwosc.org/
[9] Gumbel, E. J. 1958, Statistics of Extremes, Columbia University Press
[10] Nuttall, L. K. 2018, Philosophical Transactions of the Royal Society A, 376, 20170286
[11] Oquab, M., Darcet, T., Moutakanni, T., et al. 2024, Transactions on Machine Learning Research (TMLR)
