Evolution of Log-Based Detection Rules in Public Repositories
Pith reviewed 2026-05-12 03:02 UTC · model grok-4.3
The pith
Detection rules in public repositories evolve non-monotonically, repeatedly adding and removing logical conditions rather than converging to stable forms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a predicate graph intermediate representation to canonicalize rule logic and a tree alignment procedure to compare versions, the authors examine 6,859 rule histories and find that roughly 56 percent undergo at least one revision to their detection logic. Evolution is predominantly non-monotonic: over half of rules both add and remove clauses over time, and recurring reversions are common. Combining structural metrics with LLM-assisted intent inference and human validation shows that roughly a quarter to a third of rules alternate between expanding coverage and reducing false positives rather than converging to stable forms.
What carries the argument
The predicate graph intermediate representation that canonicalizes the logical structure of a detection rule, together with a tree alignment procedure for quantifying changes across revisions.
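To make the idea of comparing logic rather than surface syntax concrete, here is a minimal sketch. It is not the paper's predicate graph IR or its tree alignment procedure: it assumes a rule condition is already parsed into nested `("and" | "or", ...)` tuples with string predicates, and it replaces tree alignment with a plain set difference over leaf predicates. The function names and the example predicates are hypothetical.

```python
def canonicalize(node):
    """Flatten nested same-operator nodes, drop duplicates, sort children."""
    if isinstance(node, str):
        return node  # a leaf predicate is already canonical
    op, *children = node
    flat = []
    for child in map(canonicalize, children):
        # Merge ("and", ("and", a, b), c) into ("and", a, b, c).
        if isinstance(child, tuple) and child[0] == op:
            flat.extend(child[1:])
        else:
            flat.append(child)
    unique = sorted(set(flat), key=repr)  # order-insensitive, duplicate-free
    # A compound node with a single child is equivalent to that child.
    return unique[0] if len(unique) == 1 else (op, *unique)


def predicates(node):
    """Collect the set of leaf predicates in a tree."""
    if isinstance(node, str):
        return {node}
    return set().union(*(predicates(child) for child in node[1:]))


# Hypothetical rule conditions; the predicate strings are illustrative only.
old = ("and", "proc == 'powershell.exe'",
       ("and", "args contains '-enc'", "user != 'SYSTEM'"))
new = ("and", "args contains '-enc'", "proc == 'powershell.exe'",
       "parent != 'sccm.exe'")

old_preds = predicates(canonicalize(old))
new_preds = predicates(canonicalize(new))
added = new_preds - old_preds      # clauses introduced by the revision
removed = old_preds - new_preds    # clauses dropped by the revision
```

A set difference ignores where in the tree a predicate sits, so moving a clause from an `and` branch to an `or` branch would go unnoticed here. Catching that kind of edit requires a structural comparison, which is the role the paper assigns to tree alignment.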
If this is right
- Rule changes frequently revisit prior decisions instead of strictly accumulating improvements.
- A substantial share of rules continue oscillating between broader coverage and lower false-positive rates throughout their lifetimes.
- The same non-monotonic pattern appears in both community-driven and curated public repositories.
- Detection rule development reflects ongoing trade-offs rather than convergence to an optimal stable state.
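The distinction between monotonic and non-monotonic histories, and the notion of a reversion, can be sketched as follows. This assumes each rule version has already been reduced to a set of predicates, and it reads "non-monotonic" as "the history contains at least one addition and at least one removal". The label names and the reversion test (a version that exactly equals a state older than its immediate predecessor) are assumptions for illustration, not the paper's definitions.

```python
def classify_history(versions):
    """Classify a rule history given as predicate sets, oldest first.

    Returns (label, reversions), where reversions counts revisions that
    restore an earlier state of the rule.
    """
    adds = removes = reversions = 0
    earlier = []  # states that precede the current predecessor
    for prev, curr in zip(versions, versions[1:]):
        if curr == prev:
            continue  # revision did not touch detection logic
        adds += bool(curr - prev)      # at least one clause was added
        removes += bool(prev - curr)   # at least one clause was removed
        if curr in earlier:
            reversions += 1            # logic returned to an older state
        earlier.append(prev)

    if adds and removes:
        label = "non-monotonic"
    elif adds:
        label = "monotonic-add"
    elif removes:
        label = "monotonic-remove"
    else:
        label = "unchanged"
    return label, reversions


# Hypothetical history: a clause is added, then the change is undone.
history = [{"a"}, {"a", "b"}, {"a"}]
result = classify_history(history)  # ("non-monotonic", 1)
```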
Where Pith is reading between the lines
- Tools that visualize reversion patterns could help analysts avoid repeating past adjustments.
- The oscillation may arise from real-world shifts in threat activity or alert volume that the paper does not directly measure.
- Automated systems could monitor rule histories to flag when a rule is likely to require a reversal of a recent change.
Load-bearing premise
The predicate graph and tree alignment procedure faithfully capture semantic changes in detection logic without introducing artifacts or losing critical operational distinctions.
What would settle it
A domain-expert manual audit of a random sample of rule histories that found the automated method systematically misclassified monotonic changes as non-monotonic, or missed equivalent logic expressed in different syntax.
Original abstract
Log-based detection rules remain central to modern security operations, encoding domain expertise that analysts iteratively refine to balance detection coverage against alert volume. Yet while prior work has examined the evolution of network intrusion detection signatures, the longitudinal behavior of log-based detection rules has received little empirical study. We present the first longitudinal analysis of detection rule evolution across two widely used repositories: the community-driven Sigma project and the curated Splunk Security Content (SSC). To compare rule versions based on detection logic rather than surface syntax, we introduce a predicate graph intermediate representation that canonicalizes the logical structure of a rule, together with a tree alignment procedure for analyzing changes across revisions. We apply this method to 6,859 rule histories from Sigma and SSC and find that roughly 56% of rules undergo at least one revision on detection logic. Across rule lifetimes, evolution is predominantly non-monotonic, with over half of rules both adding and removing clauses over time. We further observe recurring reversions, indicating that changes are often revisited rather than strictly accumulated. Combining structural analysis with LLM-based inference and human validation of operational intent shows that roughly a quarter to a third of rules alternate between expanding coverage and reducing false positives, rather than converging toward a stable form. Together, these results reveal that detection rule evolution in public repositories reflects ongoing operational trade-offs rather than steady convergence. Our study raises questions about why rules change the way they do and supports research towards better processes for devising and deploying security rules.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to perform the first longitudinal empirical analysis of detection rule evolution in public log-based security rule repositories (Sigma and SSC). By developing a predicate graph intermediate representation and tree alignment method to track semantic changes in detection logic across revisions, the authors analyze 6,859 rule histories. Key findings include that roughly 56% of rules are revised at least once, evolution is predominantly non-monotonic with over half of rules both adding and removing clauses, recurring reversions occur, and approximately 25-33% of rules alternate between expanding coverage and reducing false positives instead of converging to a stable form. The work concludes that such evolution reflects ongoing operational trade-offs.
Significance. If the methodological components are validated, this would represent a significant contribution by providing the first data-driven view into the iterative refinement process of log-based detection rules, revealing non-convergent patterns that contrast with assumptions of steady improvement. The predicate graph approach for abstracting syntax is a positive methodological step that enables the analysis. The results could inform security practitioners and tool developers on managing rule complexity and change. However, the current lack of strong validation for the IR and alignment reduces the immediate impact, though the empirical pipeline is clearly outlined.
major comments (2)
- [§3 (Introduction of Predicate Graph IR and Tree Alignment)] All headline quantitative results (56% revision rate, non-monotonic evolution in >50% of rules, 25-33% alternation) depend on the predicate graph canonicalization and tree alignment correctly identifying meaningful semantic edits rather than syntactic variations. The manuscript provides internal consistency checks, LLM-assisted labeling, and limited human spot-checks but lacks a rigorous blinded expert validation on a held-out sample to establish the fidelity of the procedure, such as precision/recall against ground-truth semantic changes. This is load-bearing for the central claims about operational trade-offs.
- [§5 (Results - Intent Classification)] The classification of rule changes as alternating between coverage expansion and false-positive reduction is based on LLM inference of intent with human validation. No quantitative details on the strength of this validation (e.g., agreement rates, number of checked samples, or disagreement resolution) are reported, which is necessary to support the specific fraction of rules exhibiting this behavior.
minor comments (2)
- [Abstract and Data Description] The total of 6,859 rule histories is presented without a breakdown by repository (Sigma vs. SSC), time period, or explicit filtering criteria applied to select histories, making it difficult to assess potential selection biases.
- [Throughout] The paper would benefit from more precise reporting of exact counts and percentages alongside the 'roughly' and 'over half' qualifiers to allow readers to better gauge the effect sizes.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. The comments highlight important opportunities to strengthen the validation of our methodological components. We address each major comment below and will revise the manuscript accordingly to improve transparency and rigor.
Point-by-point responses
- Referee: [§3 (Introduction of Predicate Graph IR and Tree Alignment)] All headline quantitative results (56% revision rate, non-monotonic evolution in >50% of rules, 25-33% alternation) depend on the predicate graph canonicalization and tree alignment correctly identifying meaningful semantic edits rather than syntactic variations. The manuscript provides internal consistency checks, LLM-assisted labeling, and limited human spot-checks but lacks a rigorous blinded expert validation on a held-out sample to establish the fidelity of the procedure, such as precision/recall against ground-truth semantic changes. This is load-bearing for the central claims about operational trade-offs.
  Authors: We agree that the fidelity of the predicate graph IR and tree alignment is critical, as our quantitative findings on revision rates, non-monotonic evolution, and alternation patterns depend on it distinguishing semantic edits from syntactic noise. The manuscript reports internal consistency checks, LLM-assisted labeling, and limited human spot-checks, but we acknowledge that these fall short of a rigorous blinded expert validation on a held-out sample with precision/recall metrics. In the revised manuscript, we will add a blinded validation study: experts will independently label ground-truth semantic changes on a held-out sample of rule revisions, enabling computation and reporting of precision and recall for our canonicalization and alignment procedure. This will directly address the load-bearing nature of the method for claims about operational trade-offs.
  Revision: yes
- Referee: [§5 (Results - Intent Classification)] The classification of rule changes as alternating between coverage expansion and false-positive reduction is based on LLM inference of intent with human validation. No quantitative details on the strength of this validation (e.g., agreement rates, number of checked samples, or disagreement resolution) are reported, which is necessary to support the specific fraction of rules exhibiting this behavior.
  Authors: We thank the referee for noting this gap in reporting. The intent classification combines LLM inference with human validation, but the manuscript does not provide quantitative details on validation strength. In the revised version, we will expand the relevant section to report the number of samples subjected to human validation, inter-annotator agreement rates (e.g., Cohen's kappa or percentage agreement), and the process for resolving disagreements. These additions will offer the necessary transparency and support the reported fraction of rules alternating between coverage expansion and false-positive reduction.
  Revision: yes
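The two responses above name precision and recall against expert ground truth, and Cohen's kappa for annotator agreement. A minimal sketch of how those quantities are computed follows. The input encodings (parallel lists of booleans for "was a semantic change detected", parallel lists of intent labels per annotator) and the label names are assumptions for illustration, and no values here come from the paper.

```python
from collections import Counter


def precision_recall(predicted, truth):
    """Precision and recall for parallel lists of booleans.

    predicted[i]: the automated method flagged revision i as a semantic change.
    truth[i]:     an expert labeled revision i as a semantic change.
    """
    pairs = list(zip(predicted, truth))
    tp = sum(p and t for p, t in pairs)        # flagged and real
    fp = sum(p and not t for p, t in pairs)    # flagged but syntactic only
    fn = sum(t and not p for p, t in pairs)    # real but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical intent labels: "cov" = expand coverage, "fp" = reduce false positives.
annotator_1 = ["cov", "fp", "cov", "fp"]
annotator_2 = ["cov", "fp", "fp", "fp"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 0.5
```

Kappa discounts the agreement two annotators would reach by chance, which matters when one intent label dominates the sample. Raw percentage agreement can look high in that situation even when the annotators rarely agree on the minority class.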
Circularity Check
No circularity: results are direct empirical counts from repository data
full rationale
The paper collects 6,859 rule histories from Sigma and SSC repositories, converts each to a predicate graph IR, applies tree alignment to detect changes, and reports aggregate statistics (56% revised, >50% non-monotonic, 25-33% alternating) plus LLM-assisted intent labels. These quantities are straightforward observational tallies and classifications; none are obtained by fitting parameters on a data subset and then treating a related quantity as a 'prediction,' nor are any results defined in terms of themselves. The IR and alignment procedure are introduced as an analysis tool rather than a derived claim whose correctness is presupposed by the output statistics. No load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work appear in the derivation chain. The analysis is therefore self-contained against external benchmarks and receives a score of 0.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The predicate graph canonicalizes logical structure of rules sufficiently to enable meaningful comparison of detection intent across revisions
invented entities (1)
- predicate graph intermediate representation: no independent evidence