Certified Circuits: Stability Guarantees for Mechanistic Circuits

Alaa Anani; Bernt Schiele; Jonas Fischer; Mario Fritz; Tobias Lorenz

arxiv: 2602.22968 · v3 · pith:JFOEWVNQnew · submitted 2026-02-26 · 💻 cs.AI · cs.CV · cs.CY

keywords circuits · certified · circuit · concept · discovery · mechanistic · components · dataset

Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits--minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle: circuits depend strongly on the chosen concept dataset and often fail to transfer out-of-distribution, raising doubts about whether they capture the concept or merely dataset-specific artifacts. We introduce Certified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black-box discovery algorithm with randomized data subsampling to certify that inclusion decisions over circuit components--neurons or edges of the model graph, depending on the base algorithm--are invariant to bounded edit-distance perturbations of the concept dataset. Unstable components are abstained from, yielding circuits that are more compact and more accurate. We validate across three architectures (ResNet, ViT, GPT-2) on vision (ImageNet and four OOD datasets) and language (IOI, IOI-Hard, Greater-Than) tasks. Certified circuits achieve up to 56% higher accuracy and up to 80% fewer components, and remain reliable where baselines degrade. Certified Circuits puts circuit discovery on formal ground by producing mechanistic explanations that are provably stable and better aligned with the target concept. Code: https://github.com/AlaaAnani/certified-circuits.
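
The wrapper idea in the abstract (run a black-box discovery algorithm on random subsamples of the concept dataset, then keep only components whose inclusion decision is stable) can be illustrated with a minimal sketch. This is not the authors' implementation; the linked repository is the reference. The names `stable_circuit` and `toy_discover`, and the `margin` frequency threshold, are hypothetical, and the threshold is only a stand-in for the formal edit-distance certificate, which this sketch does not compute.

```python
import random
from collections import Counter
from typing import Callable, Dict, Hashable, Sequence, Set


def stable_circuit(
    dataset: Sequence,
    discover: Callable[[Sequence], Set[Hashable]],
    n_rounds: int = 100,
    subsample_size: int = 8,
    margin: float = 0.9,  # hypothetical threshold, not a formal certificate
    seed: int = 0,
) -> Dict[Hashable, str]:
    """Label each discovered component as include, exclude, or abstain."""
    rng = random.Random(seed)
    items = list(dataset)
    counts: Counter = Counter()
    for _ in range(n_rounds):
        # Draw a random subsample (without replacement) of the concept dataset.
        subset = rng.sample(items, subsample_size)
        # The base algorithm is treated as a black box returning a set of components.
        counts.update(discover(subset))

    decisions: Dict[Hashable, str] = {}
    for component, count in counts.items():
        freq = count / n_rounds
        if freq >= margin:
            decisions[component] = "include"
        elif freq <= 1 - margin:
            decisions[component] = "exclude"
        else:
            # The decision flips depending on which examples were drawn.
            decisions[component] = "abstain"
    # Components never returned by `discover` are absent, i.e. implicitly excluded.
    return decisions


def toy_discover(subset: Sequence[int]) -> Set[str]:
    """Hypothetical base algorithm: one robust component, one tied to example 0."""
    found = {"core"}
    if 0 in subset:
        found.add("artifact")
    return found


decisions = stable_circuit(list(range(20)), toy_discover)
print(decisions)
```

In this toy setup, `core` is returned on every subsample, while `artifact` appears only when one particular example happens to be drawn, so its inclusion is dataset-dependent and it is abstained from.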

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Seeing Through Circuits: Faithful Mechanistic Interpretability for Vision Transformers
cs.AI 2026-04 unverdicted novelty 6.0

Edge-based circuits in vision transformers can be automatically recovered to explain and steer model computations for classification and adversarial behaviors.

Engineering Resource-constrained Software Systems with DNN Components: a Concept-based Pruning Approach
cs.SE 2026-04 unverdicted novelty 5.0

A concept-based pruning method for DNNs guided by interpretable concepts and system requirements produces smaller, computationally efficient models that maintain effectiveness on image classification tasks.