Evolution of Log-Based Detection Rules in Public Repositories

David Evans; Minjun Long

arxiv: 2605.05383 · v3 · pith:XBA2D4RInew · submitted 2026-05-06 · 💻 cs.CR · cs.SE

Pith reviewed 2026-05-12 03:02 UTC · model grok-4.3

classification 💻 cs.CR cs.SE

keywords log-based detection rules · rule evolution · security operations · Sigma rules · Splunk security content · non-monotonic changes · predicate graphs

The pith

Detection rules in public repositories evolve non-monotonically, repeatedly adding and removing logical conditions rather than converging to stable forms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts the first longitudinal study of log-based detection rule evolution across the Sigma and Splunk Security Content repositories. It finds that 56 percent of rules receive at least one change to their detection logic, and that most revisions are non-monotonic: over half of rules both add and remove clauses during their lifetimes. Roughly a quarter to a third of rules alternate between expanding coverage to catch more events and tightening conditions to reduce false positives. These patterns indicate that rule development reflects persistent operational trade-offs instead of steady refinement toward an ideal state.

Core claim

Using a predicate graph intermediate representation to canonicalize rule logic and a tree alignment procedure to compare versions, the authors examine 6,859 rule histories and determine that roughly 56 percent undergo detection-logic revisions. Evolution is predominantly non-monotonic, with over half of rules both adding and removing clauses over time, and recurring reversions are common. Combining structural metrics with LLM-assisted intent inference and human validation shows that a quarter to a third of rules oscillate between expanding coverage and reducing false positives rather than converging to stable forms.
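The alternation claim can be made concrete with a small sketch. It assumes each predicate-changing revision has already been assigned an intent label; the label strings `expand_coverage` and `reduce_false_positives` and the function name are hypothetical and not taken from the paper. The sketch counts how often a rule's history switches direction between the two intents.

```python
# Hypothetical intent labels; the paper's actual label set is not given here.
EXPAND = "expand_coverage"
TIGHTEN = "reduce_false_positives"

def count_alternations(intents):
    """Count direction switches between expanding and tightening revisions.

    `intents` is a list of per-revision labels, oldest first. Labels other
    than EXPAND and TIGHTEN (e.g. refactoring) are ignored.
    """
    relevant = [label for label in intents if label in (EXPAND, TIGHTEN)]
    return sum(prev != curr for prev, curr in zip(relevant, relevant[1:]))

history = [EXPAND, "other", TIGHTEN, EXPAND]
switches = count_alternations(history)  # two switches in this toy history
```

Under this toy definition a rule with at least one switch would count as alternating; the threshold the authors actually use is not stated in the summary above.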

What carries the argument

The predicate graph intermediate representation that canonicalizes the logical structure of a detection rule, together with a tree alignment procedure for quantifying changes across revisions.
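To illustrate what canonicalizing logical structure can mean in practice, here is a minimal sketch. It is not the paper's IR: the nested-tuple encoding, the predicate strings, and the `canonicalize` function are all assumptions made for illustration. It flattens nested operators of the same kind, drops duplicate children, and sorts children so that two syntactically different but logically identical rules compare equal.

```python
def canonicalize(node):
    """Return a canonical form of a boolean rule tree.

    A node is either a predicate string or a tuple ("AND" | "OR", child, ...).
    This encoding is an illustrative assumption, not the paper's IR.
    """
    if isinstance(node, str):
        return node
    op, *children = node
    flat = []
    for child in map(canonicalize, children):
        if not isinstance(child, str) and child[0] == op:
            flat.extend(child[1:])  # AND(AND(a, b), c) becomes AND(a, b, c)
        else:
            flat.append(child)
    unique = sorted(set(flat), key=repr)  # drop duplicates, fix child order
    if len(unique) == 1:
        return unique[0]  # AND(a) collapses to a
    return (op, *unique)

v1 = ("AND", "Image|endswith=powershell.exe",
      ("AND", "CommandLine|contains=-enc", "CommandLine|contains=-enc"))
v2 = ("AND", "CommandLine|contains=-enc", "Image|endswith=powershell.exe")
same_logic = canonicalize(v1) == canonicalize(v2)
```

A real implementation would also need to handle negation, field-name mappings, and SIEM-specific semantics, which this sketch leaves out.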

If this is right

Rule changes frequently revisit prior decisions instead of strictly accumulating improvements.
A substantial share of rules continue oscillating between broader coverage and lower false-positive rates throughout their lifetimes.
The same non-monotonic pattern appears in both community-driven and curated public repositories.
Detection rule development reflects ongoing trade-offs rather than convergence to an optimal stable state.
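The notions of non-monotonic evolution and reversion in the list above can be sketched as follows. The sketch assumes each rule version has already been reduced to a set of clause strings; that representation, the label names, and `classify_history` are hypothetical simplifications and may differ from the paper's definitions.

```python
def classify_history(versions):
    """Classify a rule history given as a list of clause sets, oldest first.

    Returns (label, reverted). `reverted` is True when some version equals an
    earlier version other than its immediate predecessor.
    """
    added = removed = reverted = False
    for i in range(1, len(versions)):
        prev, curr = versions[i - 1], versions[i]
        if curr - prev:
            added = True    # at least one clause introduced
        if prev - curr:
            removed = True  # at least one clause dropped
        if curr != prev and curr in versions[:i - 1]:
            reverted = True  # returns to a previously seen state
    if added and removed:
        label = "non-monotonic"
    elif added:
        label = "monotonic-add"
    elif removed:
        label = "monotonic-remove"
    else:
        label = "unchanged"
    return label, reverted

history = [{"a"}, {"a", "b"}, {"a"}]
result = classify_history(history)
```

Note that under this simplification a single revision that swaps one clause for another already counts as non-monotonic.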

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Tools that visualize reversion patterns could help analysts avoid repeating past adjustments.
The oscillation may arise from real-world shifts in threat activity or alert volume that the paper does not directly measure.
Automated systems could monitor rule histories to flag when a rule is likely to require a reversal of a recent change.

Load-bearing premise

The predicate graph and tree alignment procedure faithfully capture semantic changes in detection logic without introducing artifacts or losing critical operational distinctions.

What would settle it

A domain-expert manual audit of a random sample of rule histories that found the automated method systematically misclassified monotonic changes as non-monotonic, or missed equivalent logic expressed in different syntax.

Figures

Figures reproduced from arXiv: 2605.05383 by David Evans, Minjun Long.

**Figure 1.** Alignment of two versions of a PowerShell encoded-command detection rule. Matched …

**Figure 2.** Quarterly rule creation count and revision volume for Sigma and SSC.

**Figure 3.** Cohort-wise revisions. Each bar shows the average accumulated edit magnitude per rule

**Figure 4.** Prevalence of structural operations among predicate-changing revision steps. Each bar …

**Figure 5.** Co-occurrence of structural operation labels. Each cell reports …

Original abstract

Log-based detection rules remain central to modern security operations, encoding domain expertise that analysts iteratively refine to balance detection coverage against alert volume. Yet while prior work has examined the evolution of network intrusion detection signatures, the longitudinal behavior of log-based detection rules has received little empirical study. We present the first longitudinal analysis of detection rule evolution across two widely used repositories: the community-driven Sigma project and the curated Splunk Security Content (SSC). To compare rule versions based on detection logic rather than surface syntax, we introduce a predicate graph intermediate representation that canonicalizes the logical structure of a rule, together with a tree alignment procedure for analyzing changes across revisions. We apply this method to 6,859 rule histories from Sigma and SSC and find that roughly 56% of rules undergo at least one revision on detection logic. Across rule lifetimes, evolution is predominantly non-monotonic, with over half of rules both adding and removing clauses over time. We further observe recurring reversions, indicating that changes are often revisited rather than strictly accumulated. Combining structural analysis with LLM-based inference and human validation of operational intent shows that roughly a quarter to a third of rules alternate between expanding coverage and reducing false positives, rather than converging toward a stable form. Together, these results reveal that detection rule evolution in public repositories reflects ongoing operational trade-offs rather than steady convergence. Our study raises questions about why rules change the way they do and supports research towards better processes for devising and deploying security rules.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives the first longitudinal look at log detection rule changes in public repos using a new predicate-graph IR and alignment, but the headline stats on non-monotonic evolution rest on an alignment method with only light validation.

The core contribution here is straightforward: the first study tracking how log-based detection rules actually evolve across thousands of versions in Sigma and Splunk repositories. They pull 6,859 rule histories, canonicalize the logic into a predicate graph, align versions with a tree procedure, and report that 56% see at least one detection-logic change, more than half are non-monotonic (adding and removing clauses), and a quarter to a third flip between coverage expansion and false-positive reduction rather than settling down. That observational picture of iterative trade-offs is new and useful for anyone thinking about rule maintenance tooling or processes.

The method itself is a reasonable attempt to move past raw syntax diffs, and the paper pairs it with some LLM intent labeling plus human spot checks. The data collection from real public repos is solid and reproducible in principle.

The soft spot is exactly where the stress test flags it: the predicate graph and alignment have no blinded expert comparison on held-out rule pairs to show they preserve operational distinctions like field mappings, implicit conjunctions, or SIEM evaluation order. Internal consistency and limited spot checks are not enough to rule out artifacts in the non-monotonicity counts. The LLM step for intent also lacks detail on error rates or agreement. These are fixable with more validation, but they sit at the center of the claims.

This is the kind of empirical work that belongs in a security conference or journal on operational security. Readers working on detection engineering or rule lifecycle tools will get concrete numbers and questions to build on, even if they treat the exact percentages as provisional. It deserves a serious referee rather than a desk reject; the novelty and data effort outweigh the current validation gaps, which a review can push to strengthen.

Referee Report

2 major / 2 minor

Summary. The paper claims to perform the first longitudinal empirical analysis of detection rule evolution in public log-based security rule repositories (Sigma and SSC). By developing a predicate graph intermediate representation and tree alignment method to track semantic changes in detection logic across revisions, the authors analyze 6,859 rule histories. Key findings include that roughly 56% of rules are revised at least once, evolution is predominantly non-monotonic with over half of rules both adding and removing clauses, recurring reversions occur, and approximately 25-33% of rules alternate between expanding coverage and reducing false positives instead of converging to a stable form. The work concludes that such evolution reflects ongoing operational trade-offs.

Significance. If the methodological components are validated, this would represent a significant contribution by providing the first data-driven view into the iterative refinement process of log-based detection rules, revealing non-convergent patterns that contrast with assumptions of steady improvement. The predicate graph approach for abstracting syntax is a positive methodological step that enables the analysis. The results could inform security practitioners and tool developers on managing rule complexity and change. However, the current lack of strong validation for the IR and alignment reduces the immediate impact, though the empirical pipeline is clearly outlined.

major comments (2)

[§3 (Introduction of Predicate Graph IR and Tree Alignment)] All headline quantitative results (56% revision rate, non-monotonic evolution in >50% of rules, 25-33% alternation) depend on the predicate graph canonicalization and tree alignment correctly identifying meaningful semantic edits rather than syntactic variations. The manuscript provides internal consistency checks, LLM-assisted labeling, and limited human spot-checks but lacks a rigorous blinded expert validation on a held-out sample to establish the fidelity of the procedure, such as precision/recall against ground-truth semantic changes. This is load-bearing for the central claims about operational trade-offs.
[§5 (Results - Intent Classification)] The classification of rule changes as alternating between coverage expansion and false-positive reduction is based on LLM inference of intent with human validation. No quantitative details on the strength of this validation (e.g., agreement rates, number of checked samples, or disagreement resolution) are reported, which is necessary to support the specific fraction of rules exhibiting this behavior.
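The precision/recall check requested in the first major comment could be computed along these lines. This is a minimal sketch under stated assumptions: each held-out revision pair gets a boolean from the automated pipeline ("semantic change detected") and a boolean from a blinded expert; the function name and the toy data are hypothetical.

```python
def precision_recall(predicted, truth):
    """Precision and recall of boolean predictions against expert labels."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    tp = sum(p and t for p, t in zip(predicted, truth))
    fp = sum(p and not t for p, t in zip(predicted, truth))
    fn = sum(t and not p for p, t in zip(predicted, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data only: pipeline output vs. expert ground truth for five revisions.
pipeline = [True, True, False, True, False]
expert = [True, False, False, True, True]
precision, recall = precision_recall(pipeline, expert)
```

Low precision would suggest syntactic noise is being counted as semantic change; low recall would suggest real logic changes are being canonicalized away.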

minor comments (2)

[Abstract and Data Description] The total of 6,859 rule histories is presented without a breakdown by repository (Sigma vs. SSC), time period, or explicit filtering criteria applied to select histories, making it difficult to assess potential selection biases.
[Throughout] The paper would benefit from more precise reporting of exact counts and percentages alongside the 'roughly' and 'over half' qualifiers to allow readers to better gauge the effect sizes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. The comments highlight important opportunities to strengthen the validation of our methodological components. We address each major comment below and will revise the manuscript accordingly to improve transparency and rigor.

Referee: [§3 (Introduction of Predicate Graph IR and Tree Alignment)] All headline quantitative results (56% revision rate, non-monotonic evolution in >50% of rules, 25-33% alternation) depend on the predicate graph canonicalization and tree alignment correctly identifying meaningful semantic edits rather than syntactic variations. The manuscript provides internal consistency checks, LLM-assisted labeling, and limited human spot-checks but lacks a rigorous blinded expert validation on a held-out sample to establish the fidelity of the procedure, such as precision/recall against ground-truth semantic changes. This is load-bearing for the central claims about operational trade-offs.

Authors: We agree that the fidelity of the predicate graph IR and tree alignment is critical, as our quantitative findings on revision rates, non-monotonic evolution, and alternation patterns depend on it distinguishing semantic edits from syntactic noise. The manuscript reports internal consistency checks, LLM-assisted labeling, and limited human spot-checks, but we acknowledge that these fall short of a rigorous blinded expert validation on a held-out sample with precision/recall metrics. In the revised manuscript, we will add a blinded validation study: experts will independently label ground-truth semantic changes on a held-out sample of rule revisions, enabling computation and reporting of precision and recall for our canonicalization and alignment procedure. This will directly address the load-bearing nature of the method for claims about operational trade-offs. revision: yes
Referee: [§5 (Results - Intent Classification)] The classification of rule changes as alternating between coverage expansion and false-positive reduction is based on LLM inference of intent with human validation. No quantitative details on the strength of this validation (e.g., agreement rates, number of checked samples, or disagreement resolution) are reported, which is necessary to support the specific fraction of rules exhibiting this behavior.

Authors: We thank the referee for noting this gap in reporting. The intent classification combines LLM inference with human validation, but the manuscript does not provide quantitative details on validation strength. In the revised version, we will expand the relevant section to report the number of samples subjected to human validation, inter-annotator agreement rates (e.g., Cohen's kappa or percentage agreement), and the process for resolving disagreements. These additions will offer the necessary transparency and support the reported fraction of rules alternating between coverage expansion and false-positive reduction. revision: yes
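For readers unfamiliar with the agreement statistic mentioned in this response, Cohen's kappa compares observed agreement between two annotators with the agreement expected by chance. The sketch below is a plain standard-library implementation; the label strings and sample data are illustrative assumptions, not values from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)

# Toy data only: LLM-inferred intent vs. a human annotator.
llm = ["expand", "expand", "tighten", "tighten"]
human = ["expand", "tighten", "tighten", "tighten"]
kappa = cohens_kappa(llm, human)
```

Here observed agreement is 0.75 and chance agreement is 0.5, giving a kappa of 0.5.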

Circularity Check

0 steps flagged

No circularity: results are direct empirical counts from repository data

The paper collects 6,859 rule histories from Sigma and SSC repositories, converts each to a predicate graph IR, applies tree alignment to detect changes, and reports aggregate statistics (56% revised, >50% non-monotonic, 25-33% alternating) plus LLM-assisted intent labels. These quantities are straightforward observational tallies and classifications; none are obtained by fitting parameters on a data subset and then treating a related quantity as a 'prediction,' nor are any results defined in terms of themselves. The IR and alignment procedure are introduced as an analysis tool rather than a derived claim whose correctness is presupposed by the output statistics. No load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work appear in the derivation chain. The analysis is therefore self-contained against external benchmarks and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that the newly introduced predicate graph accurately encodes detection logic semantics and that the sampled rule histories from Sigma and SSC are representative of broader practice.

axioms (1)

domain assumption: The predicate graph canonicalizes logical structure of rules sufficiently to enable meaningful comparison of detection intent across revisions.
Invoked when the paper states it compares rules based on detection logic rather than surface syntax.

invented entities (1)

predicate graph intermediate representation (no independent evidence)
purpose: To canonicalize and compare the logical structure of detection rules across versions.
Newly defined for this study; no independent evidence provided outside the paper.

pith-pipeline@v0.9.0 · 5553 in / 1362 out tokens · 69428 ms · 2026-05-12T03:02:13.612031+00:00 · methodology
