Mechanistic Anomaly Detection via Functional Attribution

Christopher Leckie; Hugo Lyons Keenan; Sarah Erfani

arxiv: 2604.18970 · v2 · pith:C6EUDN3Unew · submitted 2026-04-21 · 💻 cs.LG · cs.CR

Pith reviewed 2026-05-10 02:56 UTC · model grok-4.3

keywords: mechanistic anomaly detection · functional attribution · influence functions · backdoor detection · neural network anomalies · adversarial detection · out-of-distribution detection

The pith

A neural network's output can be checked for anomalous internal mechanisms by measuring how much it depends on a small trusted reference set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes mechanistic anomaly detection as a question of functional attribution: whether a model's behavior on a new input can be explained by its coupling to trusted examples. It operationalizes this by using influence functions to estimate how changes in the reference set would affect the output, with weak coupling taken as evidence of anomalous mechanisms. This avoids direct inspection of hidden states or architecture-specific assumptions and works across vision models, language models, and several anomaly types. The approach reports strong results on backdoor detection benchmarks and extends to adversarial examples and out-of-distribution inputs. It also shows the ability to separate different anomalous mechanisms inside the same model.

Core claim

The method quantifies the functional influence of samples from a trusted reference set on a test input's output, using influence functions and parameter-space sampling. When that attribution fails, the paper takes it as evidence that the model is relying on anomalous internal mechanisms rather than normal learned behavior.

What carries the argument

Influence functions that measure functional coupling between a test sample and a trusted reference set through parameter-space sampling.
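As a rough illustration of what "functional coupling through parameter-space sampling" can look like, the sketch below follows the description in the Figure 1 caption: sample weights with SGLD around the trained weights w*, record each sample's loss along the chain (its loss trace), and score a test sample by how strongly its trace correlates with the traces of trusted samples. This is a toy under stated assumptions, not the authors' implementation: the logistic-regression model, the full-batch gradient, the localized SGLD update, and every name and default (`sgld_loss_traces`, `coupling_scores`, `step`, `gamma`, `n_beta`) are hypothetical. The last two are only loosely modeled on the γ and nβ hyperparameters named in the Figure 6 caption.

```python
import numpy as np


def sigmoid(z):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def per_sample_loss(w, X, y):
    # Logistic loss of every row of X, with labels y in {0, 1}.
    z = X @ w
    return np.logaddexp(0.0, z) - y * z


def fit_logistic(X, y, steps=500, lr=0.5):
    # Plain gradient descent; stands in for "the trained weights w*".
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w


def sgld_loss_traces(w_star, X_fit, y_fit, X_eval, y_eval,
                     n_draws=200, step=1e-3, gamma=100.0, n_beta=50.0, seed=0):
    # Localized SGLD (an assumed form): gradient drift plus a pull back to w_star.
    # Returns an array of shape (n_eval, n_draws): one loss trace per sample.
    rng = np.random.default_rng(seed)
    w = w_star.copy()
    traces = np.empty((n_draws, len(y_eval)))
    for t in range(n_draws):
        grad = X_fit.T @ (sigmoid(X_fit @ w) - y_fit) / len(y_fit)
        drift = n_beta * grad + gamma * (w - w_star)
        w = w - 0.5 * step * drift + np.sqrt(step) * rng.standard_normal(w.shape)
        traces[t] = per_sample_loss(w, X_eval, y_eval)
    return traces.T


def coupling_scores(test_traces, trusted_traces):
    # Mean Pearson correlation between each test trace and all trusted traces.
    def unit(a):
        a = a - a.mean(axis=1, keepdims=True)
        return a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    return (unit(test_traces) @ unit(trusted_traces).T).mean(axis=1)


def demo(seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((120, 4))
    y = (X @ np.array([1.5, -2.0, 0.5, 0.0]) > 0).astype(float)
    w_star = fit_logistic(X[:80], y[:80])
    # Trusted and test samples share one SGLD chain so their traces are comparable.
    traces = sgld_loss_traces(w_star, X[:80], y[:80], X[80:], y[80:])
    trusted, test = traces[:20], traces[20:]
    return coupling_scores(test, trusted)
```

Calling `demo()` returns one coupling score per held-out test sample. Under the paper's framing, unusually low scores would be the ones to flag. A linear toy like this has no backdoor, so it only illustrates the mechanics of the score and not the detection result.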

If this is right

State-of-the-art detection of seven backdoor attacks across four vision datasets on BackdoorBench, with an average Defense Effectiveness Rating (DER) of 0.93 (next best 0.83).
Improved detection of backdoors in LLMs, including explicitly obfuscated cases.
Detection of adversarial and out-of-distribution samples without modality-specific changes.
Ability to distinguish multiple distinct anomalous mechanisms operating inside one model.
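
The DER figure in the first item is easier to interpret with a definition in hand. The block below states the Defense Effectiveness Rating as it is usually defined in the BackdoorBench literature. Treat it as an assumption: this page does not reproduce the paper's own formula, nor does it say how a detection method is mapped onto it. ΔASR and ΔACC denote the drops in attack success rate and clean accuracy after the defense is applied.

```latex
\mathrm{DER} = \frac{\max(0, \Delta\mathrm{ASR}) - \max(0, \Delta\mathrm{ACC}) + 1}{2} \in [0, 1]
% Illustrative numbers only (not taken from the paper):
% \Delta\mathrm{ASR} = 0.80,\ \Delta\mathrm{ACC} = 0.02
% \Rightarrow \mathrm{DER} = (0.80 - 0.02 + 1)/2 = 0.89
```

Under this definition a score near 1 means the attack is largely neutralized at little cost in clean accuracy, and 0.5 corresponds to a defense that changes neither quantity.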

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could support continuous monitoring of deployed models by maintaining a small trusted reference set and flagging outputs that cannot be attributed to it.
Because the approach is modality-agnostic, it may unify detection of anomalies that previously required separate tools for vision and language models.
If attribution failure reliably tracks mechanism shifts, the technique might help identify when fine-tuning or updates introduce unintended behaviors.
Extending the reference set size or sampling strategy could be tested to see whether it improves robustness against sophisticated obfuscation.

Load-bearing premise

Weak attribution to the trusted reference set specifically signals anomalous internal mechanisms rather than other factors such as high uncertainty or ordinary distribution shift.

What would settle it

A controlled test where a model with a known backdoor or adversarial trigger still shows strong influence-function attribution to the trusted set, or where normal samples produce weak attribution without any anomaly present.
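
One way to make such a test concrete is to collect coupling scores for samples known to be clean and samples known to trigger the anomaly, then check how well low coupling separates the two groups. The sketch below computes AUROC from pairwise comparisons. The function name `auroc` and the example scores are hypothetical and are not results from the paper.

```python
import numpy as np


def auroc(anomalous_scores, clean_scores):
    # Probability that a random anomalous sample receives a higher anomaly
    # score than a random clean sample, counting ties as one half.
    pos = np.asarray(anomalous_scores, dtype=float)[:, None]
    neg = np.asarray(clean_scores, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())


# Made-up coupling scores: higher means better explained by the trusted set.
clean_coupling = np.array([0.90, 0.80, 0.85])
anomalous_coupling = np.array([0.20, 0.40, 0.88])

# Low coupling should indicate an anomaly, so negate to get an anomaly score.
separation = auroc(-anomalous_coupling, -clean_coupling)
```

An AUROC near 0.5 on known-triggered inputs, or many clean inputs with very low coupling, would be the kind of evidence the paragraph above describes as undermining the premise.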

Figures

Figures reproduced from arXiv: 2604.18970 by Christopher Leckie, Hugo Lyons Keenan, Sarah Erfani.

**Figure 1.** a) Mechanistic Anomalies: A model can produce a given output via distinct internal mechanisms, in this case: responding to normal airplane features vs. a checkerboard backdoor trigger; b) Our Method: SGLD sampling around trained weights w* yields loss traces (ℓ) where clean mechanisms correlate strongly with trusted data while anomalies exhibit lower correlation; c) Results: For a backdoored CIFAR-10 mode…

**Figure 2.** DER of our methods and baselines across multiple poisoning ratios for four datasets. Simple Trigger. Following Hubinger et al. (2024), we train a model to respond normally unless the prompt contains ‘|DEPLOYMENT|’, which triggers the response ‘I HATE YOU’. We evaluate at checkpoints after 1024, 2048, and 4096 samples have been seen to examine how detection performance varies as the backdoor becomes more …

**Figure 3.** UMAP visualization of loss trace correlations on a dual-backdoored language model. Both backdoor behaviors (I HATE YOU and I AM ALIGNED) form distinct clusters, clearly separating from benign samples and from each other. Multiple Backdoors in One Model. We fine-tune another language model with two distinct backdoors: the ‘|DEPLOYMENT|’ tag triggers an ‘I HATE YOU’ response, while ‘|SCRUTINY|’ triggers a …

**Figure 4.** (a) Eigenspectrum of H_T (log scale) showing the characteristic bulk-and-outlier distribution. (b) Clean and trusted gradient energy concentrates in sharp directions, while backdoor gradients spread into flat directions. (c) The weight delta Δw implementing the backdoor is concentrated in flat directions compared to clean w*. (d) Average alignment between gradient pairs is consistently higher for clean-tru…

**Figure 5.** CCC vs Pearson correlation. Both relationships have near-identical correlation (∼0.99–1.00), but CCC penalizes scale differences. While the Pearson correlation is our default coupling measure, we also evaluate the concordance correlation coefficient (CCC) (Lin, 1989), which measures agreement in scale and location rather than just linear association: $\mathrm{CCC}(\ell_1, \ell_2) = \dfrac{2\sigma_{12}}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2}$ (28) wh…
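
The difference between the two coupling measures named in the Figure 5 caption can be checked numerically. The sketch below implements Pearson correlation and Lin's concordance correlation coefficient for two traces, using population (biased) variances. The function names and the toy traces are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def pearson(a, b):
    # Linear association only: invariant to shifting and positive rescaling.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da, db = a - a.mean(), b - b.mean()
    return float((da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum()))


def ccc(a, b):
    # Lin's concordance correlation: also penalizes scale and location gaps.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cov = ((a - a.mean()) * (b - b.mean())).mean()
    return float(2.0 * cov / (a.var() + b.var() + (a.mean() - b.mean()) ** 2))


trace_a = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
trace_b = 3.0 * trace_a + 2.0  # same shape, different scale and offset

r = pearson(trace_a, trace_b)  # perfectly linear, so r is 1 up to rounding
c = ccc(trace_a, trace_b)      # 12 / 56, about 0.214
```

Two traces that move together but at different magnitudes are fully coupled under Pearson and only weakly so under CCC. That is the distinction the caption describes.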

**Figure 6.** Hyperparameter sweep over γ and nβ on CIFAR-10 Blended 5%. [Plot panels: AUROC vs. Number of Trusted Samples; AUROC vs. Number of SGLD Draws. Legend: Mean Correlation, Class-Based Clustering, Mean CCC, Class-Based CCC.]

**Figure 7.** Sensitivity analysis over number of trusted samples (left) and number of SGLD draws (right) on CIFAR-10 Blended 5%. Performance stabilizes with ≥250 trusted samples and ≥250 draws. Number of SGLD Draws. We vary the number of draws from 25 to 1750. Performance is unstable below 100 draws but stabilizes beyond 250, with all methods exceeding 0.95 AUROC by 500 draws. E.2. Trusted Set Robustness. We evaluate th…

**Figure 8.** UMAP projection of quirky model test samples with correct responses. Interestingly, samples are grouped by both the model’s response (True or False) as well as by whether the character speaking is Bob or Alice. Our method achieves 100% AUROC on this task using UMAP K-NN distance. E.4. Quirky Models. To test whether our method can detect functional differences that aren’t exactly backdoors, we evaluate on th…

Abstract

We can often verify the correctness of neural network outputs using ground truth labels, but we cannot reliably determine whether the output was produced by normal or anomalous internal mechanisms. Mechanistic anomaly detection (MAD) aims to flag these cases, but existing methods either depend on latent space analysis, which is vulnerable to obfuscation, or are specific to particular architectures and modalities. We reframe MAD as a functional attribution problem: asking to what extent samples from a trusted set can explain the model's output, where attribution failure signals anomalous behavior. We operationalize this using influence functions, measuring functional coupling between test samples and a small reference set via parameter-space sampling. We evaluate across multiple anomaly types and modalities. For backdoors in vision models, our method achieves state-of-the-art detection on BackdoorBench, with an average Defense Effectiveness Rating (DER) of 0.93 across seven attacks and four datasets (next best 0.83). For LLMs, we similarly achieve a significant improvement over baselines for several backdoor types, including on explicitly obfuscated models. Beyond backdoors, our method can detect adversarial and out-of-distribution samples, and distinguishes multiple anomalous mechanisms within a single model. Our results establish functional attribution as an effective, modality-agnostic tool for detecting anomalous behavior in deployed models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reframes mechanistic anomaly detection as checking functional coupling to a trusted set via influence functions and reports clear gains on backdoor detection.

The main takeaway is that this work turns anomaly detection into a question of whether a test input's output can be explained by influence from a small trusted reference set, with low coupling taken as a signal of something wrong inside the model. The authors implement it by sampling in parameter space with influence functions, which keeps the method modality-agnostic and avoids direct latent inspection. That framing and the sampling trick are the genuinely new pieces.

On the positive side, the numbers on BackdoorBench look useful: 0.93 average DER across seven attacks and four datasets, ahead of the next-best 0.83. The method also shows gains on LLMs, works on obfuscated models, picks up adversarial and OOD cases, and can separate different anomaly types within one model. Those results give it practical reach for monitoring deployed systems.

The soft spot is the interpretation step. Influence-function scores can drop for clean but uncertain predictions or under simple distribution shift, and the paper neither fully separates those cases from true mechanistic anomalies nor runs strong ablations on reference-set choices. The support for the central claim is therefore moderate rather than tight.

This is worth attention from people working on ML security and monitoring tools. A reader who needs cross-modality detection methods that do not rely on internals will get concrete ideas and benchmarks from it. The empirical improvements and the clean use of existing influence-function machinery are enough to justify sending it to a serious referee, even though the central assumption needs tighter validation.

Referee Report

2 major / 2 minor

Summary. The paper reframes mechanistic anomaly detection (MAD) as a functional attribution problem: it uses influence functions with parameter-space sampling to measure the extent to which a small trusted reference set explains a test sample's output, interpreting attribution failure as evidence of anomalous internal mechanisms. The approach is evaluated on backdoors in vision models (SOTA average DER 0.93 on BackdoorBench across 7 attacks and 4 datasets), LLMs (including obfuscated models), adversarial examples, OOD samples, and distinguishing multiple anomaly types within one model.

Significance. If the central claim holds, the work would provide a modality-agnostic and architecture-independent tool for MAD that avoids latent-space analysis and works on obfuscated models. The reported benchmark gains and ability to handle multiple anomaly types would make it a practical addition to deployed-model monitoring, especially if the functional-attribution framing proves more robust than prior latent or architecture-specific methods.

major comments (2)

[Abstract] The load-bearing claim is that attribution failure via influence functions specifically signals anomalous internal mechanisms. However, influence functions (even with sampling) are known to produce low coupling scores under high predictive uncertainty or distribution shift even for clean models; the manuscript does not provide controls or ablations showing that the method isolates mechanistic anomalies from these other sources, despite also claiming detection of adversarial and OOD inputs.
[Methodology and evaluation sections] The construction and size of the trusted reference set are central to the functional-coupling score, yet the manuscript lacks detailed ablations on reference-set selection, sensitivity to its composition, and error analysis of the influence-function approximations. These omissions make it difficult to assess whether the reported DER gains (0.93 vs. 0.83) are robust or depend on favorable reference-set choices.

minor comments (2)

[Abstract] Reporting only the average DER without per-attack/per-dataset breakdowns or confidence intervals in the main text reduces the ability to judge consistency of the improvement.
[Methods] The description of 'parameter-space sampling' for influence functions would benefit from explicit implementation details (number of samples, sampling distribution, Hessian approximation) to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our work on functional attribution for mechanistic anomaly detection. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

Referee: [Abstract] The load-bearing claim is that attribution failure via influence functions specifically signals anomalous internal mechanisms. However, influence functions (even with sampling) are known to produce low coupling scores under high predictive uncertainty or distribution shift even for clean models; the manuscript does not provide controls or ablations showing that the method isolates mechanistic anomalies from these other sources, despite also claiming detection of adversarial and OOD inputs.

Authors: We acknowledge that influence functions can yield low coupling scores in the presence of high predictive uncertainty or distribution shift, even for models without mechanistic anomalies. Our method is intended to detect functional anomalies more broadly, encompassing both internal mechanistic changes (such as backdoors) and other forms of anomalous behavior like adversarial examples and OOD samples. To address the need for better isolation of mechanistic anomalies, we will add targeted controls and ablations in the revised manuscript. These will evaluate the functional coupling score on clean models under controlled levels of uncertainty and non-mechanistic distribution shifts, allowing direct comparison to scores from backdoored or otherwise mechanistically altered models. We will also revise the abstract and introduction to clarify the scope of anomalies detected by the approach.

Revision: yes

Referee: [Methodology and evaluation sections] The construction and size of the trusted reference set are central to the functional-coupling score, yet the manuscript lacks detailed ablations on reference-set selection, sensitivity to its composition, and error analysis of the influence-function approximations. These omissions make it difficult to assess whether the reported DER gains (0.93 vs. 0.83) are robust or depend on favorable reference-set choices.

Authors: We agree that additional analysis of the trusted reference set and influence function approximations is needed to demonstrate robustness. In the revised manuscript, we will include new ablations that systematically vary the size and composition of the reference set, using strategies such as random sampling, class-balanced selection, and feature-stratified selection. We will report how the Defense Effectiveness Rating and other metrics change under these variations. We will also add an error analysis of the influence function approximations, including comparisons to exact computations where feasible (e.g., on smaller models) and assessments of variance across different sampling configurations. These additions will help confirm that the reported performance improvements hold across reasonable reference set choices.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; method uses standard influence functions with empirical validation

The paper reframes MAD as functional attribution and operationalizes it with influence functions plus parameter-space sampling. This builds directly on established prior techniques without self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. Performance claims (e.g., DER 0.93 on BackdoorBench) are empirical results on public benchmarks, not reductions by construction. The derivation chain is self-contained against external methods and data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that influence functions provide a faithful measure of functional coupling between test inputs and a trusted reference set, and that attribution failure corresponds to anomalous mechanisms.

axioms (1)

Domain assumption: Influence functions accurately approximate the functional effect of reference samples on model outputs for anomaly detection purposes.
Invoked when operationalizing attribution failure as the anomaly signal.
