GAVEL: Towards Rule-Based Safety Through Activation Monitoring

Eyal Lenga; Gilad Gressel; Itay Zloczower; Rahul Pankajakshan; Shir Rozenfeld; Yisroel Mirsky

arxiv: 2601.19768 · v3 · submitted 2026-01-27 · 💻 cs.AI · cs.CR· cs.LG

GAVEL: Towards Rule-Based Safety Through Activation Monitoring

Shir Rozenfeld , Rahul Pankajakshan , Itay Zloczower , Eyal Lenga , Gilad Gressel , Yisroel Mirsky This is my paper

Pith reviewed 2026-05-16 10:38 UTC · model grok-4.3

classification 💻 cs.AI cs.CRcs.LG

keywords activation monitoringrule-based safetyLLM safetycognitive elementsAI governanceinterpretabilitycompositional rulesreal-time detection

0 comments

The pith

Compositional rules over fine-grained cognitive elements in activations enable precise and customizable safety monitoring for large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a rule-based approach to activation safety in LLMs by breaking down activations into interpretable cognitive elements like specific intents or actions. These elements can be combined using predicate rules to detect harmful behaviors tailored to particular domains. The framework allows real-time violation detection and rule updates without retraining, addressing issues of poor precision and lack of flexibility in existing methods. If the approach works, it would make AI safety systems more interpretable, auditable, and adaptable across different use cases.

Core claim

The paper establishes that modeling activations as cognitive elements permits the creation of predicate rules over these elements to detect violations in real time, enabling configuration and updates to safeguards without retraining models or detectors while supporting transparency and auditability.

What carries the argument

Cognitive elements (CEs), fine-grained interpretable factors such as 'making a threat' that compose to capture nuanced behaviors for rule application.

If this is right

Safety rules can be updated or customized by practitioners without retraining.
Detection achieves higher precision through compositional rules.
The system supports transparency and auditability for AI governance.
It provides a scalable foundation for interpretable safety mechanisms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Rules could be shared across organizations similar to cybersecurity practices.
The method might reduce the need for large misuse datasets in training safety systems.
Integration with interactive tools could facilitate rapid prototyping of safety policies.

Load-bearing premise

Activations can be decomposed into fine-grained, interpretable cognitive elements that compose accurately to capture nuanced domain-specific behaviors without significant loss of detection performance.

What would settle it

A test where rule-based detection using CEs fails to match or exceed the precision of traditional trained detectors on a dataset of domain-specific harmful prompts.

read the original abstract

Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, struggle with poor precision, limited flexibility, and lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as 'making a threat' and 'payment processing', that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We open source GAVEL and introduce GAVEL Studio, an interactive rule authoring and management tool. Code and datasets are available at github.com/Offensive-AI-Lab/gavel.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GAVEL frames activation monitoring as composable rules over cognitive elements, which is a fresh practical angle, but the extraction and validation of those elements stay underspecified.

read the letter

The paper's main move is to treat LLM activations as combinations of fine-grained cognitive elements and then write predicate rules over them for safety detection. This draws from cybersecurity rule practices and aims to let users customize safeguards without retraining detectors or models. The open-sourced GAVEL code and GAVEL Studio tool for rule authoring are concrete steps that make the idea testable by others. That tooling and the emphasis on auditability are the parts that feel immediately useful for practitioners who need domain-specific controls. The claim that composition improves precision over broad misuse detectors is plausible on paper and could reduce false positives in targeted settings. The soft spot is exactly the one the stress test flags. The abstract gives examples like 'making a threat' and 'payment processing' but supplies no procedure for locating these elements in activation space, verifying their stability, or showing that rules over them compose without losing detection power relative to end-to-end baselines. Without that step the precision and customization benefits remain untested assumptions rather than demonstrated results. If the full experiments include clear probe methods or clustering details that hold up, the contribution strengthens; otherwise the central representational claim stays thin. This is for AI safety researchers and deployment teams who want more interpretable monitoring options. A reader already working on activation probes or rule-based systems would get the most from it. I would send it for peer review because the framing is new enough and the tooling is public, so referees can focus on tightening the validation of the cognitive elements rather than rejecting the idea outright.

Referee Report

3 major / 2 minor

Summary. The paper introduces GAVEL, a framework for rule-based activation safety in LLMs. It models activations as fine-grained, interpretable cognitive elements (CEs) such as 'making a threat' and 'payment processing' that can be composed via predicate rules to detect nuanced harmful behaviors with higher precision than broad misuse-trained detectors. The approach enables real-time violation detection, domain customization without retraining, and improved transparency/auditability; the authors open-source the implementation and provide GAVEL Studio for interactive rule management.

Significance. If the central claims hold, the work would be significant for AI safety. It offers a practical, cybersecurity-inspired alternative to opaque activation classifiers by emphasizing composability, interpretability, and updatability without model retraining. This could support scalable, auditable governance and domain-specific safeguards. The open-sourcing of code and datasets is a clear strength that facilitates reproducibility and extension.

major comments (3)

[§3] §3 (Cognitive Element Representation): The procedure for discovering, extracting, and validating CEs from activations is not described with sufficient algorithmic detail or pseudocode. No information is given on whether extraction uses supervised probes, unsupervised methods, or manual engineering, nor on how independence or stability of CEs is verified. This is load-bearing for the claim that composition yields higher precision without performance loss.
[§5.2] §5.2 and Table 3: The reported precision gains and domain-customization benefits are presented without ablation studies isolating the contribution of CE composition versus baseline activation monitoring. It is unclear whether the improvements are statistically significant or robust across domains, undermining the central claim that rule-based composition is the key driver.
[§4.3] §4.3 (Rule Definition): The predicate rules over CEs are illustrated with examples but lack a formal semantics or proof of soundness for real-time detection. Without this, it is difficult to assess whether the framework avoids false negatives on nuanced behaviors that the abstract claims to capture.

minor comments (2)

[Abstract] The abstract would benefit from a one-sentence summary of the experimental setup and datasets used to support the precision claims.
[Figure 1] Figure 1 (framework diagram) could include explicit arrows or labels for the activation-to-CE decomposition step to improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. Below, we provide point-by-point responses to the major comments and outline the revisions we will make.

read point-by-point responses

Referee: [§3] §3 (Cognitive Element Representation): The procedure for discovering, extracting, and validating CEs from activations is not described with sufficient algorithmic detail or pseudocode. No information is given on whether extraction uses supervised probes, unsupervised methods, or manual engineering, nor on how independence or stability of CEs is verified. This is load-bearing for the claim that composition yields higher precision without performance loss.

Authors: We agree that the original presentation in §3 lacked sufficient detail on the CE extraction process. This was an oversight in focusing on the conceptual framework. In the revised manuscript, we will include a detailed algorithmic description with pseudocode for discovering and extracting CEs. The method combines supervised probes trained on labeled activation data for specific cognitive elements with unsupervised techniques to ensure stability. Independence is verified through pairwise correlation thresholds and orthogonality in the probe weights. We will also report validation experiments demonstrating stability across different model layers and inputs. These additions will strengthen the foundation for the composition claims. revision: yes
Referee: [§5.2] §5.2 and Table 3: The reported precision gains and domain-customization benefits are presented without ablation studies isolating the contribution of CE composition versus baseline activation monitoring. It is unclear whether the improvements are statistically significant or robust across domains, undermining the central claim that rule-based composition is the key driver.

Authors: We appreciate this observation regarding the need for ablations. The current results in §5.2 and Table 3 compare GAVEL to existing methods but do not isolate the effect of rule composition. We will conduct and include new ablation studies that compare the full compositional rule-based approach against non-compositional activation monitoring baselines. Statistical significance will be assessed using appropriate tests, and experiments will be extended to additional domains to confirm robustness. The revised Table 3 will incorporate these findings to better support the central claim. revision: yes
Referee: [§4.3] §4.3 (Rule Definition): The predicate rules over CEs are illustrated with examples but lack a formal semantics or proof of soundness for real-time detection. Without this, it is difficult to assess whether the framework avoids false negatives on nuanced behaviors that the abstract claims to capture.

Authors: We recognize the value of formalizing the rule semantics. While the manuscript provided illustrative examples, we will add a formal definition of the predicate rules, including their syntax and operational semantics for real-time detection. A soundness argument will be provided, demonstrating that the rules are designed to capture all specified nuanced behaviors without introducing false negatives, leveraging the completeness of the CE vocabulary. Additional examples and a brief proof sketch will be included in the revised §4.3. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper proposes a conceptual framework for rule-based activation safety by representing activations as composable cognitive elements and defining predicate rules over them, with no mathematical derivations, equations, fitted parameters, or self-referential definitions present. The central claims rest on the introduced representation and reported empirical improvements rather than any step that reduces by construction to its own inputs. No self-citations are used as load-bearing premises for uniqueness theorems or ansatzes, and the approach is presented as a new paradigm with open-sourced code, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on the unverified domain assumption that activations decompose into composable cognitive elements; no free parameters or invented entities with independent evidence are detailed in the abstract.

axioms (1)

domain assumption Activations in LLMs can be decomposed into fine-grained, interpretable cognitive elements that compose to represent nuanced behaviors
This is the foundational modeling choice stated in the abstract as the basis for rule definition.

invented entities (1)

Cognitive Elements (CEs) no independent evidence
purpose: Fine-grained interpretable factors extracted from activations for rule composition
New representation introduced to enable the rule-based approach; no independent evidence or falsifiable predictions provided in the abstract.

pith-pipeline@v0.9.0 · 5536 in / 1252 out tokens · 24627 ms · 2026-05-16T10:38:47.989952+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as 'making a threat' and 'payment processing', that can be composed to capture nuanced, domain-specific behaviors with higher precision.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.