Mechanistic Anomaly Detection via Functional Attribution
Pith reviewed 2026-05-10 02:56 UTC · model grok-4.3
The pith
A neural network's output can be checked for anomalous internal mechanisms by measuring how much it depends on a small trusted reference set.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper quantifies the functional influence of samples from a trusted reference set on a test input's output, using influence functions and parameter-space sampling. When this attribution fails, the model is taken to be relying on anomalous internal mechanisms rather than normal learned behavior.
What carries the argument
Influence functions that measure functional coupling between a test sample and a trusted reference set through parameter-space sampling.
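To make "functional coupling through parameter-space sampling" concrete, here is a minimal toy sketch. It is not the paper's estimator: the linear model, the Gaussian perturbations, the correlation-based score, and every name (`logit`, `coupling_score`, `x_typical`, `x_odd`) are assumptions made for illustration. The idea it shows is that if small random parameter changes move the test output and the reference outputs together, the two are functionally coupled; if the test output moves independently of every reference point, attribution fails.

```python
import numpy as np

def logit(theta, x):
    """Toy model (assumption): a linear logit standing in for a real network."""
    return x @ theta

def coupling_score(theta, x_test, x_ref, n_samples=2000, sigma=1e-2, seed=0):
    """Largest absolute correlation between the output change at x_test and the
    output change at each reference point, under Gaussian parameter noise."""
    rng = np.random.default_rng(seed)
    deltas = rng.normal(0.0, sigma, size=(n_samples, theta.size))
    # Output changes induced by each sampled perturbation.
    d_test = np.array([logit(theta + d, x_test) - logit(theta, x_test) for d in deltas])
    d_ref = np.array([logit(theta + d, x_ref) - logit(theta, x_ref) for d in deltas])
    # Centre, then compute one Pearson correlation per reference point.
    d_test = d_test - d_test.mean()
    d_ref = d_ref - d_ref.mean(axis=0)
    denom = np.linalg.norm(d_ref, axis=0) * np.linalg.norm(d_test) + 1e-12
    corr = (d_ref.T @ d_test) / denom
    return float(np.max(np.abs(corr)))

theta = np.array([0.5, -0.3, 0.8, 0.1])
x_ref = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [1.0, 1.0, 0.0, 0.0]])
x_typical = np.array([1.0, 0.5, 0.0, 0.0])  # shares features with the reference set
x_odd = np.array([0.0, 0.0, 1.0, 0.0])      # uses a feature no reference point touches

print(coupling_score(theta, x_typical, x_ref))
print(coupling_score(theta, x_odd, x_ref))
```

In this toy setting the score for `x_typical` should come out much higher than the score for `x_odd`, because `x_odd` depends only on a parameter that no reference sample exercises.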
If this is right
- State-of-the-art detection of seven backdoor attacks across four vision datasets, with an average Defense Effectiveness Rating (DER) of 0.93 against 0.83 for the next-best method.
- Improved detection of backdoors in LLMs, including explicitly obfuscated cases.
- Detection of adversarial and out-of-distribution samples without modality-specific changes.
- Ability to distinguish multiple distinct anomalous mechanisms operating inside one model.
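For readers unfamiliar with DER, the sketch below computes it as it is commonly defined in the BackdoorBench line of work: a value in [0, 1] that rewards a drop in attack success rate (ASR) and penalizes a drop in clean accuracy. The summary above does not state the formula, so treating this as the exact variant used in the paper is an assumption, and the function name and example numbers are illustrative only.

```python
def defense_effectiveness_rating(acc_before, acc_after, asr_before, asr_after):
    """DER = (max(0, dASR) - max(0, dACC) + 1) / 2, where dASR is the drop in
    attack success rate and dACC the drop in clean accuracy after the defense."""
    delta_asr = asr_before - asr_after
    delta_acc = acc_before - acc_after
    return (max(0.0, delta_asr) - max(0.0, delta_acc) + 1.0) / 2.0

# Hypothetical numbers: ASR falls from 0.98 to 0.02, accuracy from 0.93 to 0.92.
print(defense_effectiveness_rating(0.93, 0.92, 0.98, 0.02))
```

A defense that changes nothing scores 0.5 under this definition, which is worth keeping in mind when reading averages such as 0.93 and 0.83.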
Where Pith is reading between the lines
- The method could support continuous monitoring of deployed models by maintaining a small trusted reference set and flagging outputs that cannot be attributed to it.
- Because the approach is modality-agnostic, it may unify detection of anomalies that previously required separate tools for vision and language models.
- If attribution failure reliably tracks mechanism shifts, the technique might help identify when fine-tuning or updates introduce unintended behaviors.
- Extending the reference set size or sampling strategy could be tested to see whether it improves robustness against sophisticated obfuscation.
Load-bearing premise
Weak attribution to the trusted reference set specifically signals anomalous internal mechanisms rather than other factors such as high uncertainty or ordinary distribution shift.
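One way to probe this premise is to check how much of an attribution score is explained by plain predictive uncertainty. The sketch below is an assumption about how such a control could look, not something the paper reports: it computes softmax entropy and removes the part of the score that is linearly predictable from it. The helper names and the synthetic data are hypothetical.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of each row of class probabilities."""
    probs = np.clip(probs, 1e-12, 1.0)
    return -np.sum(probs * np.log(probs), axis=1)

def residualize(scores, entropy):
    """Subtract the least-squares fit of scores on (1, entropy)."""
    design = np.column_stack([np.ones_like(entropy), entropy])
    coef, *_ = np.linalg.lstsq(design, scores, rcond=None)
    return scores - design @ coef

# Synthetic stand-in data: scores that partly track entropy.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=200)
ent = predictive_entropy(probs)
scores = 1.0 - 0.5 * ent + rng.normal(0.0, 0.05, size=200)
adjusted = residualize(scores, ent)
```

If anomalous inputs still separate from clean ones on `adjusted`, uncertainty alone is a less likely explanation. A linear fit is a weak control, so binning by entropy would be a reasonable next step.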
What would settle it
A controlled test where a model with a known backdoor or adversarial trigger still shows strong influence-function attribution to the trusted set, or where normal samples produce weak attribution without any anomaly present.
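Such a test could be scored with a simple ranking statistic. The sketch below computes AUROC under the assumption that lower attribution means "more anomalous"; the function name and the example scores are hypothetical, and the paper may use a different metric.

```python
import numpy as np

def auroc(scores_anomalous, scores_clean):
    """Probability that a random anomalous input receives a lower attribution
    score than a random clean input, with ties counted as one half."""
    a = np.asarray(scores_anomalous, dtype=float)[:, None]
    c = np.asarray(scores_clean, dtype=float)[None, :]
    return float(np.mean((a < c) + 0.5 * (a == c)))

# Hypothetical attribution scores for triggered and clean inputs.
print(auroc([0.10, 0.20, 0.15], [0.80, 0.90, 0.70]))
```

A value near 0.5 on a model with a known backdoor, or a value well below 1.0 driven by clean inputs with weak attribution, would be the kind of evidence described above.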
We can often verify the correctness of neural network outputs using ground truth labels, but we cannot reliably determine whether the output was produced by normal or anomalous internal mechanisms. Mechanistic anomaly detection (MAD) aims to flag these cases, but existing methods either depend on latent space analysis, which is vulnerable to obfuscation, or are specific to particular architectures and modalities. We reframe MAD as a functional attribution problem: asking to what extent samples from a trusted set can explain the model's output, where attribution failure signals anomalous behavior. We operationalize this using influence functions, measuring functional coupling between test samples and a small reference set via parameter-space sampling. We evaluate across multiple anomaly types and modalities. For backdoors in vision models, our method achieves state-of-the-art detection on BackdoorBench, with an average Defense Effectiveness Rating (DER) of 0.93 across seven attacks and four datasets (next best 0.83). For LLMs, we similarly achieve a significant improvement over baselines for several backdoor types, including on explicitly obfuscated models. Beyond backdoors, our method can detect adversarial and out-of-distribution samples, and distinguishes multiple anomalous mechanisms within a single model. Our results establish functional attribution as an effective, modality-agnostic tool for detecting anomalous behavior in deployed models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reframes mechanistic anomaly detection (MAD) as a functional attribution problem: it uses influence functions with parameter-space sampling to measure the extent to which a small trusted reference set explains a test sample's output, interpreting attribution failure as evidence of anomalous internal mechanisms. The approach is evaluated on backdoors in vision models (SOTA average DER 0.93 on BackdoorBench across 7 attacks and 4 datasets), LLMs (including obfuscated models), adversarial examples, OOD samples, and distinguishing multiple anomaly types within one model.
Significance. If the central claim holds, the work would provide a modality-agnostic and architecture-independent tool for MAD that avoids latent-space analysis and works on obfuscated models. The reported benchmark gains and ability to handle multiple anomaly types would make it a practical addition to deployed-model monitoring, especially if the functional-attribution framing proves more robust than prior latent or architecture-specific methods.
major comments (2)
- [Abstract] The load-bearing claim is that attribution failure via influence functions specifically signals anomalous internal mechanisms. However, influence functions (even with sampling) are known to produce low coupling scores under high predictive uncertainty or distribution shift even for clean models; the manuscript does not provide controls or ablations showing that the method isolates mechanistic anomalies from these other sources, despite also claiming detection of adversarial and OOD inputs.
- [Methodology and evaluation sections] The construction and size of the trusted reference set are central to the functional-coupling score, yet the manuscript lacks detailed ablations on reference-set selection, sensitivity to its composition, and error analysis of the influence-function approximations. These omissions make it difficult to assess whether the reported DER gains (0.93 vs. 0.83) are robust or depend on favorable reference-set choices.
minor comments (2)
- [Abstract] Reporting only the average DER without per-attack/per-dataset breakdowns or confidence intervals in the main text reduces the ability to judge consistency of the improvement.
- [Methods] The description of 'parameter-space sampling' for influence functions would benefit from explicit implementation details (number of samples, sampling distribution, Hessian approximation) to support reproducibility.
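For context on the last comment, the classical influence function makes clear why those details matter. The formula below is the standard textbook form, written with a damping term that is a common practical choice; whether the paper uses this form, this damping, or a sampling-based replacement for the inverse Hessian is not stated in the summary, so the notation here is an assumption.

```latex
\mathcal{I}(z, z_{\mathrm{test}})
  = -\,\nabla_\theta L(z_{\mathrm{test}}, \hat\theta)^{\top}
      \left(H_{\hat\theta} + \lambda I\right)^{-1}
      \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)
```

Here $z$ is a reference sample, $z_{\mathrm{test}}$ the test input, $\hat\theta$ the trained parameters, and $\lambda \ge 0$ the damping strength. Each unreported choice (how $H_{\hat\theta}$ is approximated, the value of $\lambda$, and the number and distribution of parameter samples) can change the resulting scores.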
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of our work on functional attribution for mechanistic anomaly detection. We address each major comment below and describe the revisions we will make to strengthen the manuscript.
- Referee: [Abstract] The load-bearing claim is that attribution failure via influence functions specifically signals anomalous internal mechanisms. However, influence functions (even with sampling) are known to produce low coupling scores under high predictive uncertainty or distribution shift even for clean models; the manuscript does not provide controls or ablations showing that the method isolates mechanistic anomalies from these other sources, despite also claiming detection of adversarial and OOD inputs.
Authors: We acknowledge that influence functions can yield low coupling scores in the presence of high predictive uncertainty or distribution shift, even for models without mechanistic anomalies. Our method is intended to detect functional anomalies more broadly, encompassing both internal mechanistic changes (such as backdoors) and other forms of anomalous behavior like adversarial examples and OOD samples. To address the need for better isolation of mechanistic anomalies, we will add targeted controls and ablations in the revised manuscript. These will evaluate the functional coupling score on clean models under controlled levels of uncertainty and non-mechanistic distribution shifts, allowing direct comparison to scores from backdoored or otherwise mechanistically altered models. We will also revise the abstract and introduction to clarify the scope of anomalies detected by the approach.
Revision: yes
- Referee: [Methodology and evaluation sections] The construction and size of the trusted reference set are central to the functional-coupling score, yet the manuscript lacks detailed ablations on reference-set selection, sensitivity to its composition, and error analysis of the influence-function approximations. These omissions make it difficult to assess whether the reported DER gains (0.93 vs. 0.83) are robust or depend on favorable reference-set choices.
Authors: We agree that additional analysis of the trusted reference set and influence function approximations is needed to demonstrate robustness. In the revised manuscript, we will include new ablations that systematically vary the size and composition of the reference set, using strategies such as random sampling, class-balanced selection, and feature-stratified selection. We will report how the Defense Effectiveness Rating and other metrics change under these variations. We will also add an error analysis of the influence function approximations, including comparisons to exact computations where feasible (e.g., on smaller models) and assessments of variance across different sampling configurations. These additions will help confirm that the reported performance improvements hold across reasonable reference set choices.
Revision: yes
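The promised reference-set ablation could be organized along the following lines. This is a sketch under stated assumptions, not the authors' protocol: `score_fn` is a hypothetical callback that takes reference-set indices and returns a detection metric such as DER, and only random and class-balanced selection are shown (feature-stratified selection is omitted).

```python
import numpy as np

def random_subset(labels, size, rng):
    """Indices of a uniformly random reference set."""
    return rng.choice(len(labels), size=size, replace=False)

def class_balanced_subset(labels, per_class, rng):
    """Indices of a reference set with the same count from every class."""
    labels = np.asarray(labels)
    picks = [rng.choice(np.flatnonzero(labels == c), size=per_class, replace=False)
             for c in np.unique(labels)]
    return np.concatenate(picks)

def ablate(score_fn, labels, sizes, n_repeats=5, seed=0):
    """Mean and standard deviation of score_fn over repeated random draws,
    reported separately for each reference-set size."""
    rng = np.random.default_rng(seed)
    results = {}
    for size in sizes:
        vals = [score_fn(random_subset(labels, size, rng)) for _ in range(n_repeats)]
        results[size] = (float(np.mean(vals)), float(np.std(vals)))
    return results
```

Reporting the spread across draws, and not only the mean, speaks directly to the referee's concern that the gains might depend on a favorable reference set.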
Circularity Check
No significant circularity; method uses standard influence functions with empirical validation
The paper reframes MAD as functional attribution and operationalizes it with influence functions plus parameter-space sampling. This builds directly on established prior techniques without self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. Performance claims (e.g., DER 0.93 on BackdoorBench) are empirical results on public benchmarks, not reductions by construction. The derivation chain rests on external methods and data and does not feed its own outputs back in as premises.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Influence functions accurately approximate the functional effect of reference samples on model outputs for anomaly detection purposes.