GAVEL: Towards Rule-Based Safety Through Activation Monitoring
Pith reviewed 2026-05-16 10:38 UTC · model grok-4.3
The pith
Compositional rules over fine-grained cognitive elements in activations enable precise and customizable safety monitoring for large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper argues that modeling activations as cognitive elements lets practitioners write predicate rules over those elements and detect violations in real time. Safeguards can then be configured and updated without retraining models or detectors, and the rules stay transparent and auditable.
What carries the argument
Cognitive elements (CEs), fine-grained interpretable factors such as 'making a threat' that compose to capture nuanced behaviors for rule application.
If this is right
- Safety rules can be updated or customized by practitioners without retraining.
- Detection achieves higher precision through compositional rules.
- The system supports transparency and auditability for AI governance.
- It provides a scalable foundation for interpretable safety mechanisms.
Where Pith is reading between the lines
- Rules could be shared across organizations similar to cybersecurity practices.
- The method might reduce the need for large misuse datasets in training safety systems.
- Integration with interactive tools could facilitate rapid prototyping of safety policies.
Load-bearing premise
Activations can be decomposed into fine-grained, interpretable cognitive elements that compose accurately to capture nuanced domain-specific behaviors without significant loss of detection performance.
What would settle it
A test where rule-based detection using CEs fails to match or exceed the precision of traditional trained detectors on a dataset of domain-specific harmful prompts.
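One way such a test could be scored is a paired bootstrap over a shared labeled set, which yields a confidence interval for the precision gap between the rule-based detector and a trained baseline. The sketch below is illustrative only: the labels and predictions are toy values made up for the example (not results from the paper), and the function names `precision` and `bootstrap_precision_gap` are hypothetical.

```python
import random


def precision(preds, labels):
    """Precision = TP / (TP + FP); returns 0.0 by convention if nothing is flagged."""
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    return tp / (tp + fp) if (tp + fp) else 0.0


def bootstrap_precision_gap(labels, preds_a, preds_b,
                            n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for precision(a) - precision(b) on paired examples."""
    rng = random.Random(seed)
    n = len(labels)
    gaps = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        a = [preds_a[i] for i in idx]
        b = [preds_b[i] for i in idx]
        gaps.append(precision(a, y) - precision(b, y))
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_resamples)]
    hi = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy data (assumption): 1 = harmful prompt, 0 = benign prompt.
labels         = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
rule_preds     = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # hypothetical rule-based detector
baseline_preds = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # hypothetical trained detector

lo, hi = bootstrap_precision_gap(labels, rule_preds, baseline_preds)
print(f"precision gap CI: [{lo:.3f}, {hi:.3f}]")
```

If the interval sits entirely below zero on a realistic domain-specific dataset, that would count against the precision claim; an interval that straddles zero would leave the question open.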
Original abstract
Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, struggle with poor precision, limited flexibility, and lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as 'making a threat' and 'payment processing', that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We open source GAVEL and introduce GAVEL Studio, an interactive rule authoring and management tool. Code and datasets are available at github.com/Offensive-AI-Lab/gavel.
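To make the idea of detecting rule violations in real time concrete, here is a minimal sketch of a streaming monitor. It assumes each generated token comes with a score per CE and that a rule fires once every required CE has crossed a threshold at some point in the stream. The class name, the score format, and the 0.5 threshold are assumptions for illustration; they are not taken from the GAVEL codebase.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set


@dataclass
class StreamingRuleMonitor:
    """Flags a violation once every CE required by the rule has fired."""
    required: Set[str]
    threshold: float = 0.5  # hypothetical decision threshold
    seen: Set[str] = field(default_factory=set)

    def step(self, ce_scores: Dict[str, float]) -> bool:
        """Consume CE scores for one token; return True if the rule now holds."""
        for name, score in ce_scores.items():
            if name in self.required and score >= self.threshold:
                self.seen.add(name)
        return self.seen == self.required


def first_violation(monitor: StreamingRuleMonitor,
                    stream: Iterable[Dict[str, float]]) -> Optional[int]:
    """Return the index of the first token at which the rule fires, if any."""
    for i, scores in enumerate(stream):
        if monitor.step(scores):
            return i
    return None


# Toy per-token CE scores (assumption), using the two CE names from the abstract.
stream = [
    {"making_a_threat": 0.1, "payment_processing": 0.2},
    {"making_a_threat": 0.8, "payment_processing": 0.1},
    {"making_a_threat": 0.2, "payment_processing": 0.9},
]
monitor = StreamingRuleMonitor(required={"making_a_threat", "payment_processing"})
violation_index = first_violation(monitor, stream)
print(violation_index)
```

Because the rule is plain data rather than a trained classifier, swapping in a different set of required CEs changes the safeguard without touching any model weights, which is the property the abstract emphasizes.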
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GAVEL, a framework for rule-based activation safety in LLMs. It models activations as fine-grained, interpretable cognitive elements (CEs) such as 'making a threat' and 'payment processing' that can be composed via predicate rules to detect nuanced harmful behaviors with higher precision than broad misuse-trained detectors. The approach enables real-time violation detection, domain customization without retraining, and improved transparency/auditability; the authors open-source the implementation and provide GAVEL Studio for interactive rule management.
Significance. If the central claims hold, the work would be significant for AI safety. It offers a practical, cybersecurity-inspired alternative to opaque activation classifiers by emphasizing composability, interpretability, and updatability without model retraining. This could support scalable, auditable governance and domain-specific safeguards. The open-sourcing of code and datasets is a clear strength that facilitates reproducibility and extension.
major comments (3)
- [§3] §3 (Cognitive Element Representation): The procedure for discovering, extracting, and validating CEs from activations is not described with sufficient algorithmic detail or pseudocode. No information is given on whether extraction uses supervised probes, unsupervised methods, or manual engineering, nor on how independence or stability of CEs is verified. This is load-bearing for the claim that composition yields higher precision without performance loss.
- [§5.2] §5.2 and Table 3: The reported precision gains and domain-customization benefits are presented without ablation studies isolating the contribution of CE composition versus baseline activation monitoring. It is unclear whether the improvements are statistically significant or robust across domains, undermining the central claim that rule-based composition is the key driver.
- [§4.3] §4.3 (Rule Definition): The predicate rules over CEs are illustrated with examples but lack a formal semantics or proof of soundness for real-time detection. Without this, it is difficult to assess whether the framework avoids false negatives on nuanced behaviors that the abstract claims to capture.
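As a point of reference for the last comment, a formal semantics for predicate rules could be as small as a Boolean expression tree evaluated against the set of currently active CEs. The sketch below is one such reading, not the paper's definition: the node types `CE`, `Not`, `And`, `Or`, the function `holds`, and the CE name `fiction_writing` are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class CE:
    """Atomic predicate: true when the named cognitive element is active."""
    name: str


@dataclass(frozen=True)
class Not:
    operand: "Rule"


@dataclass(frozen=True)
class And:
    left: "Rule"
    right: "Rule"


@dataclass(frozen=True)
class Or:
    left: "Rule"
    right: "Rule"


Rule = Union[CE, Not, And, Or]


def holds(rule: Rule, active: FrozenSet[str]) -> bool:
    """Evaluate a rule against the set of active CE names."""
    if isinstance(rule, CE):
        return rule.name in active
    if isinstance(rule, Not):
        return not holds(rule.operand, active)
    if isinstance(rule, And):
        return holds(rule.left, active) and holds(rule.right, active)
    if isinstance(rule, Or):
        return holds(rule.left, active) or holds(rule.right, active)
    raise TypeError(f"unknown rule node: {rule!r}")


# Hypothetical rule: a threat combined with payment handling, outside fiction.
extortion = And(CE("making_a_threat"), CE("payment_processing"))
guarded = And(extortion, Not(CE("fiction_writing")))

print(holds(guarded, frozenset({"making_a_threat", "payment_processing"})))
```

Under a semantics like this, rule evaluation is deterministic given the active set, so any soundness statement would be relative to how accurately the upstream CE detectors populate that set.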
minor comments (2)
- [Abstract] The abstract would benefit from a one-sentence summary of the experimental setup and datasets used to support the precision claims.
- [Figure 1] Figure 1 (framework diagram) could include explicit arrows or labels for the activation-to-CE decomposition step to improve clarity.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. Below, we provide point-by-point responses to the major comments and outline the revisions we will make.
- Referee: [§3] §3 (Cognitive Element Representation): The procedure for discovering, extracting, and validating CEs from activations is not described with sufficient algorithmic detail or pseudocode. No information is given on whether extraction uses supervised probes, unsupervised methods, or manual engineering, nor on how independence or stability of CEs is verified. This is load-bearing for the claim that composition yields higher precision without performance loss.
  Authors: We agree that the original presentation in §3 lacked sufficient detail on the CE extraction process. This was an oversight in focusing on the conceptual framework. In the revised manuscript, we will include a detailed algorithmic description with pseudocode for discovering and extracting CEs. The method combines supervised probes trained on labeled activation data for specific cognitive elements with unsupervised techniques to ensure stability. Independence is verified through pairwise correlation thresholds and orthogonality in the probe weights. We will also report validation experiments demonstrating stability across different model layers and inputs. These additions will strengthen the foundation for the composition claims.
  revision: yes
- Referee: [§5.2] §5.2 and Table 3: The reported precision gains and domain-customization benefits are presented without ablation studies isolating the contribution of CE composition versus baseline activation monitoring. It is unclear whether the improvements are statistically significant or robust across domains, undermining the central claim that rule-based composition is the key driver.
  Authors: We appreciate this observation regarding the need for ablations. The current results in §5.2 and Table 3 compare GAVEL to existing methods but do not isolate the effect of rule composition. We will conduct and include new ablation studies that compare the full compositional rule-based approach against non-compositional activation monitoring baselines. Statistical significance will be assessed using appropriate tests, and experiments will be extended to additional domains to confirm robustness. The revised Table 3 will incorporate these findings to better support the central claim.
  revision: yes
- Referee: [§4.3] §4.3 (Rule Definition): The predicate rules over CEs are illustrated with examples but lack a formal semantics or proof of soundness for real-time detection. Without this, it is difficult to assess whether the framework avoids false negatives on nuanced behaviors that the abstract claims to capture.
  Authors: We recognize the value of formalizing the rule semantics. While the manuscript provided illustrative examples, we will add a formal definition of the predicate rules, including their syntax and operational semantics for real-time detection. A soundness argument will be provided, demonstrating that the rules are designed to capture all specified nuanced behaviors without introducing false negatives, leveraging the completeness of the CE vocabulary. Additional examples and a brief proof sketch will be included in the revised §4.3.
  revision: yes
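The independence check mentioned in the response on §3 (pairwise correlation thresholds plus orthogonality of probe weights) could look roughly like the sketch below. Everything specific here is an assumption for illustration: the thresholds 0.8 and 0.3, the three-dimensional toy probe weights, the toy scores, and the CE name `intimidation` are hypothetical and not drawn from the paper.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cosine(u, v):
    """Cosine similarity of two non-zero weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def flag_dependent_pairs(scores, weights, max_corr=0.8, max_cos=0.3):
    """Return CE pairs whose scores correlate or whose probes overlap too much."""
    flagged = []
    for a, b in combinations(scores, 2):
        corr = pearson(scores[a], scores[b])
        cos = cosine(weights[a], weights[b])
        if abs(corr) > max_corr or abs(cos) > max_cos:
            flagged.append((a, b, corr, cos))
    return flagged


# Toy probe outputs on four inputs and toy probe directions (assumptions).
scores = {
    "making_a_threat":    [0.1, 0.9, 0.2, 0.8],
    "payment_processing": [0.9, 0.9, 0.1, 0.1],
    "intimidation":       [0.2, 0.8, 0.1, 0.9],
}
weights = {
    "making_a_threat":    [1.0, 0.0, 0.0],
    "payment_processing": [0.0, 1.0, 0.0],
    "intimidation":       [0.9, 0.1, 0.0],
}

flagged = flag_dependent_pairs(scores, weights)
for a, b, corr, cos in flagged:
    print(f"{a} / {b}: corr={corr:.2f}, cos={cos:.2f}")
```

A flagged pair would suggest that two CEs are near-duplicates, in which case a rule that requires both adds little precision over a rule that requires one.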
Circularity Check
No circularity in derivation chain
The paper proposes a conceptual framework for rule-based activation safety by representing activations as composable cognitive elements and defining predicate rules over them, with no mathematical derivations, equations, fitted parameters, or self-referential definitions present. The central claims rest on the introduced representation and reported empirical improvements rather than any step that reduces by construction to its own inputs. No self-citations are used as load-bearing premises for uniqueness theorems or ansatzes, and the approach is presented as a new paradigm with open-sourced code, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Activations in LLMs can be decomposed into fine-grained, interpretable cognitive elements that compose to represent nuanced behaviors.
invented entities (1)
- Cognitive Elements (CEs): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as 'making a threat' and 'payment processing', that can be composed to capture nuanced, domain-specific behaviors with higher precision."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.