Measuring Uncertainty in Transformer Circuits with Effective Information Consistency

Anatoly A. Krasnovsky

arXiv: 2509.07149 · v1 · submitted 2025-09-08 · cs.LG · cs.AI · cs.CL · cs.IT · math.IT

Pith reviewed 2026-05-18 17:40 UTC · model grok-4.3

keywords: mechanistic interpretability, transformer circuits, effective information, sheaf cohomology, causal emergence, uncertainty quantification, LLM analysis, white-box evaluation

The pith

A new dimensionless score combines sheaf inconsistency from Jacobians with a Gaussian effective-information proxy to quantify coherence in an active Transformer circuit from a single forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Effective-Information Consistency Score to fill a gap in mechanistic interpretability: a formal way to measure when a functional subgraph inside a large language model is operating coherently enough to be trusted. It builds this by taking local Jacobians and activations to compute a normalized measure of sheaf inconsistency, then pairing it with a Gaussian approximation of effective information that captures circuit-level causal emergence. Both quantities are extracted from the same forward state, keeping the method white-box and single-pass while ensuring the final score is dimensionless. A sympathetic reader would care because such a score could turn circuit discovery from descriptive to actionable, allowing practitioners to flag when a circuit is likely producing reliable versus uncertain behavior.

Core claim

We specialize a sheaf/cohomology and causal-emergence perspective to Transformer circuits and define the Effective-Information Consistency Score (EICS) as the combination of (i) a normalized sheaf inconsistency computed from local Jacobians and activations and (ii) a Gaussian EI proxy for circuit-level causal emergence derived from the same forward state; the resulting construction is white-box, single-pass, and dimensionless, with practical guidance supplied for score interpretation and computational modes.

What carries the argument

The Effective-Information Consistency Score (EICS), formed by merging normalized sheaf inconsistency from local Jacobians with a Gaussian effective-information proxy for causal emergence.
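
As a rough illustration of how the two ingredients combine, the sketch below follows the formula quoted in the Lean-links section of this page: an emergence term divided by (1 + C_sh). It is a minimal NumPy sketch under stated assumptions, not the author's implementation. The function names, the choice to normalize C_sh by total activation energy, the choice to normalize the emergence term by the macro log-det, and the convention of passing precomputed `node_terms` are all hypothetical, because the page does not spell these details out.

```python
import numpy as np


def sheaf_inconsistency(edges, jacobians, acts, eps=1e-12):
    """Normalized L2 energy of the edge residuals (rho_{u->v} a_u - a_v).

    ASSUMPTION: the residual energy is divided by the total squared
    activation norm; the paper's exact normalization is not given here.
    """
    num = sum(
        float(np.sum((jacobians[(u, v)] @ acts[u] - acts[v]) ** 2))
        for u, v in edges
    )
    den = sum(float(np.sum(a ** 2)) for a in acts.values()) + eps
    return num / den


def half_logdet(J, alpha):
    """Gaussian proxy term 0.5 * log det(I + alpha * J^T J)."""
    n = J.shape[1]
    _, logdet = np.linalg.slogdet(np.eye(n) + alpha * (J.T @ J))
    return 0.5 * logdet


def eics(edges, jacobians, acts, J_macro, node_terms, alpha=1.0, eps=1e-12):
    """Emergence term divided by (1 + C_sh).

    ASSUMPTIONS: `node_terms` are precomputed per-node quantities (their
    definition is not stated on this page), and the positive part of
    (macro - sum of node terms) is normalized by the macro term.
    """
    c_sh = sheaf_inconsistency(edges, jacobians, acts, eps)
    macro = half_logdet(J_macro, alpha)
    delta = max(macro - float(sum(node_terms)), 0.0) / (macro + eps)
    return delta / (1.0 + c_sh)
```

Under these assumptions the score falls as the activations drift away from what the Jacobians predict, which is the qualitative behavior the toy figure describes. Any real use would need the paper's own definitions of the node terms and normalizations.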

If this is right

A circuit can be evaluated for coherence without requiring multiple forward passes or external probes.
The score is dimensionless because both constituent quantities are normalized, computed from the same forward state, and have their units made explicit.
Practical guidance on fast versus exact computation modes and score interpretation is provided for immediate use.
Empirical validation beyond a toy sanity-check is left for future work on real LLM tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If EICS proves reliable, it could serve as a lightweight runtime monitor to route queries away from circuits that appear incoherent.
The same construction might extend to other architectures that expose Jacobians, such as state-space models or graph networks.
A natural test would be to measure whether circuits with high EICS maintain performance under small input perturbations while low-EICS circuits degrade.

Load-bearing premise

Combining sheaf inconsistency measured on Jacobians with a Gaussian effective-information proxy will reliably signal when an active circuit is behaving coherently and can therefore be treated as trustworthy.

What would settle it

Run EICS on a set of circuits whose coherence has been independently verified by ablation or intervention studies; if the scores do not separate the coherent from the incoherent cases above chance level, the central claim is false.
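
One way to make "above chance level" concrete is to compute an AUROC over the labeled circuits and compare it with a label-permutation null. The sketch below is a hedged illustration of that check, assuming EICS scores and independently obtained coherence labels are already available; the function names `auroc` and `permutation_pvalue` are hypothetical and the test design is not taken from the paper.

```python
import numpy as np


def auroc(scores, labels):
    """Probability that a coherent circuit outscores an incoherent one.

    Ties count as one half. `labels` is truthy for circuits independently
    verified as coherent (e.g. by ablation or intervention studies).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one circuit of each class")
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)


def permutation_pvalue(scores, labels, n_perm=2000, seed=0):
    """Observed AUROC and a one-sided p-value against shuffled labels."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=bool)
    observed = auroc(scores, labels)
    null = np.array(
        [auroc(scores, rng.permutation(labels)) for _ in range(n_perm)]
    )
    p_value = (1 + (null >= observed).sum()) / (1 + n_perm)
    return observed, float(p_value)
```

An AUROC near 0.5 with a large p-value would mean the scores fail to separate the two groups, which is the outcome that would count against the central claim.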

Figures

Figures reproduced from arXiv: 2509.07149 by Anatoly A. Krasnovsky.

**Figure 1.** Toy sanity-check on a 6-node circuit with two parallel branches. As node-noise τ increases, the sheaf inconsistency $C_{\mathrm{sh}}$ rises (so $1/(1 + C_{\mathrm{sh}})$ falls). We also reduce cross-branch alignment with τ (edge decoherence), causing the emergence proxy $\widetilde{\Delta \mathrm{EI}}_G$ and the overall EICS to decrease. Curves show means over seeds (no error bands for clarity). Definitions follow Eqs. (2), (3), and (7).

Original abstract

Mechanistic interpretability has identified functional subgraphs within large language models (LLMs), known as Transformer Circuits (TCs), that appear to implement specific algorithms. Yet we lack a formal, single-pass way to quantify when an active circuit is behaving coherently and thus likely trustworthy. Building on prior systems-theoretic proposals, we specialize a sheaf/cohomology and causal emergence perspective to TCs and introduce the Effective-Information Consistency Score (EICS). EICS combines (i) a normalized sheaf inconsistency computed from local Jacobians and activations, with (ii) a Gaussian EI proxy for circuit-level causal emergence derived from the same forward state. The construction is white-box, single-pass, and makes units explicit so that the score is dimensionless. We further provide practical guidance on score interpretation, computational overhead (with fast and exact modes), and a toy sanity-check analysis. Empirical validation on LLM tasks is deferred.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines a new EICS score by specializing sheaf inconsistency and causal emergence ideas to transformer circuits, but only a toy check is shown and real validation is deferred.

The core move here is packaging existing systems ideas into a single dimensionless score for circuit coherence in transformers. EICS takes normalized sheaf inconsistency from local Jacobians and activations, adds a Gaussian effective-information proxy for causal emergence, and outputs a white-box number from one forward pass. That specialization to transformer circuits is the actual new piece, along with the practical notes on fast versus exact modes and score interpretation guidelines. The construction itself is laid out cleanly enough that someone could implement the basic version without too much guesswork. Credit for making the units explicit and keeping the whole thing single-pass.

The soft spot is exactly what the stress-test flags: the claim that this combination tracks coherent versus incoherent circuit behavior rests on the toy sanity-check alone. Empirical validation on LLM tasks is explicitly left for later, so there is no evidence yet that EICS values line up with independent checks like ablation performance or causal interventions. Without that link, it is hard to tell whether the score is doing more than re-expressing properties already present in the activations and Jacobians. The circularity worry is real until the equations are inspected closely.

This paper is for readers already inside mechanistic interpretability who want formal tools for circuit reliability rather than new circuit discoveries. A practitioner looking for a quick quantitative filter might try the toy version, but anyone expecting a ready-to-use trustworthiness metric will be disappointed until the deferred experiments appear.

It deserves a serious referee. The specialization is legitimate and the formal framing is worth external feedback on the derivations and on what would count as convincing validation. I would send it out for review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Effective-Information Consistency Score (EICS) for quantifying uncertainty in Transformer Circuits (TCs) in large language models. Building on systems-theoretic ideas involving sheaves and causal emergence, EICS is defined as a combination of a normalized sheaf inconsistency derived from local Jacobians and activations, and a Gaussian Effective Information (EI) proxy for circuit-level causal emergence, both computed from the same forward pass. The score is claimed to be white-box, single-pass, and dimensionless. The paper provides practical guidance on interpreting the score, computational considerations including fast and exact modes, and includes a toy sanity-check analysis. Full empirical validation on actual LLM tasks is explicitly deferred.

Significance. If the EICS proves to reliably track circuit coherence and trustworthiness as claimed, it would represent a notable contribution to mechanistic interpretability by offering an efficient, formal metric for assessing the reliability of functional subgraphs in LLMs. This could facilitate better identification of trustworthy circuits and enhance safety in AI systems. The approach's strengths include its white-box nature, single-pass computation, and explicit handling of units to achieve a dimensionless score, along with practical implementation guidance.

major comments (2)

Abstract: The central claim that EICS quantifies when an active circuit is behaving coherently and is thus likely trustworthy is not supported by the presented evidence. The manuscript defers empirical validation on LLM tasks and mentions only a toy sanity-check, which does not sufficiently demonstrate that the combination of sheaf inconsistency and EI proxy correlates with independent measures of coherence such as task performance under ablation or causal interventions. This is load-bearing for the paper's primary motivation.
EICS definition: The construction of EICS directly from the same forward-pass Jacobians and activations it aims to evaluate raises concerns about circularity. It is unclear whether the normalized sheaf inconsistency plus Gaussian EI proxy yields an independent consistency measure or reduces to a self-referential quantity without explicit equations demonstrating independence from the forward state used to compute it.

minor comments (2)

Practical guidance section: The discussion of fast and exact computational modes is useful but would benefit from explicit complexity analysis or pseudocode for implementation in standard frameworks like PyTorch.
Notation and references: Ensure consistent use of symbols for Jacobians and activations across sections; add citations to foundational works on sheaf cohomology applications in neural networks to better situate the specialization.
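
To illustrate the kind of complexity trade-off the first minor comment asks about, the sketch below evaluates the Gaussian term ½ log det(I + α JᵀJ) three ways. It is not a description of the paper's fast and exact modes, which this page does not specify; it only uses the standard identity det(I_n + α JᵀJ) = det(I_m + α J Jᵀ) and the singular-value form. NumPy is used instead of PyTorch for brevity, and all function names are hypothetical.

```python
import numpy as np


def half_logdet_direct(J, alpha):
    """0.5 * log det(I_n + alpha J^T J); O(n^3) for J of shape (m, n)."""
    n = J.shape[1]
    _, logdet = np.linalg.slogdet(np.eye(n) + alpha * (J.T @ J))
    return 0.5 * logdet


def half_logdet_small_side(J, alpha):
    """Same value via the smaller Gram matrix; O(min(m, n)^3) after the product."""
    m, n = J.shape
    gram = J @ J.T if m < n else J.T @ J
    _, logdet = np.linalg.slogdet(np.eye(gram.shape[0]) + alpha * gram)
    return 0.5 * logdet


def half_logdet_svd(J, alpha):
    """Same value from singular values: 0.5 * sum(log(1 + alpha * s_i^2))."""
    s = np.linalg.svd(J, compute_uv=False)
    return 0.5 * float(np.sum(np.log1p(alpha * s ** 2)))
```

When a circuit maps a wide residual stream to a narrow readout (or the reverse), working on the smaller side can cut the cubic cost substantially, while the singular-value route makes it easy to truncate to the leading singular values if an approximation is acceptable.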

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below, indicating where revisions will be made to the manuscript.

Referee: Abstract: The central claim that EICS quantifies when an active circuit is behaving coherently and is thus likely trustworthy is not supported by the presented evidence. The manuscript defers empirical validation on LLM tasks and mentions only a toy sanity-check, which does not sufficiently demonstrate that the combination of sheaf inconsistency and EI proxy correlates with independent measures of coherence such as task performance under ablation or causal interventions. This is load-bearing for the paper's primary motivation.

Authors: We agree that the toy sanity-check alone does not provide sufficient evidence to support the claim that EICS reliably tracks coherence or trustworthiness in actual LLM circuits. The manuscript explicitly defers full empirical validation. We will revise the abstract to present EICS as a proposed white-box metric for circuit consistency derived from sheaf inconsistency and causal emergence, with the connection to trustworthiness framed as a motivating hypothesis rather than a demonstrated result. We will also expand the toy analysis section to include additional controls that better illustrate the score's sensitivity to coherence disruptions.

revision: yes

Referee: EICS definition: The construction of EICS directly from the same forward-pass Jacobians and activations it aims to evaluate raises concerns about circularity. It is unclear whether the normalized sheaf inconsistency plus Gaussian EI proxy yields an independent consistency measure or reduces to a self-referential quantity without explicit equations demonstrating independence from the forward state used to compute it.

Authors: The concern about circularity is well-taken. While EICS is computed from the same forward-pass quantities, the sheaf inconsistency term quantifies local-to-global mismatches in the circuit's linear approximations, and the Gaussian EI term approximates causal emergence at the circuit level; their normalized combination is intended to yield a measure of internal alignment rather than a direct restatement of the input state. To address the request for explicit demonstration, we will add equations in the revised Methods section that separate the raw Jacobian/activation inputs from the final dimensionless score, showing that the measure can detect inconsistencies even when evaluated on the model's own forward computations.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in EICS construction

The paper defines EICS as a composite score built from normalized sheaf inconsistency on local Jacobians/activations plus a Gaussian EI proxy, both extracted from the identical forward pass. This is a definitional construction of a new metric rather than a derivation or prediction that reduces to its inputs by construction. No equations are shown that make the output tautological with the input quantities, no fitted parameters are relabeled as predictions, and the provided text contains no load-bearing self-citations or uniqueness theorems imported from prior author work. The central claim concerns the interpretive utility of the resulting dimensionless score; while the abstract defers empirical validation on LLMs, this is a question of external evidence rather than internal circularity in the derivation chain. The construction is therefore self-contained as an explicit proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The construction rests on the applicability of sheaf inconsistency and causal emergence concepts to Transformer Circuits; these are treated as domain assumptions imported from prior systems-theoretic work without new independent evidence supplied here.

axioms (2)

domain assumption: Sheaf inconsistency can be meaningfully computed from local Jacobians and activations of a Transformer Circuit
Invoked when the paper specializes the sheaf/cohomology perspective to TCs
domain assumption: A Gaussian EI proxy derived from the forward state captures circuit-level causal emergence
Used to combine with the sheaf term into a single coherence score

invented entities (1)

Effective-Information Consistency Score (EICS) · no independent evidence
purpose: Quantify coherence and trustworthiness of an active Transformer Circuit
Newly defined composite metric; no independent falsifiable prediction outside the definition itself

pith-pipeline@v0.9.0 · 5686 in / 1544 out tokens · 38522 ms · 2026-05-18T17:40:06.072867+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: $\mathrm{EICS}(G_M; a) = \widetilde{\Delta \mathrm{EI}}_G(G_M) / (1 + C_{\mathrm{sh}}(G_M, a))$, where $C_{\mathrm{sh}}$ is the normalized $L_2$ energy of $(\rho_{u \to v}\, a_u - a_v)$ and $\widetilde{\Delta \mathrm{EI}}_G$ is the normalized positive part of $\tfrac{1}{2} \log\det(I + \alpha J_M^{\top} J_M)$ minus the sum of node terms.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: We place a cellular sheaf $\mathcal{F}$ on the underlying undirected version of $G_M$ with stalks $\mathbb{R}^{d_v}$ and restriction maps given by the Jacobians $\rho_e := J_{u \to v}$ evaluated at the current state.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

[1] Circuit tracing / attribution graphs: Methods & applications (2025), https://transformer-circuits.pub/2025/attribution-graphs/

[2] Angelopoulos, A.N., Bates, S.: A gentle introduction to conformal prediction and distribution-free uncertainty quantification (2021), https://arxiv.org/abs/2107.07511

[3] Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: Proceedings of the 34th International Conference on Machine Learning (ICML). pp. 1321–1330. PMLR (2017)

[4] Hansen, J., Ghrist, R.: Toward a spectral theory of cellular sheaves. Journal of Applied and Computational Topology 3(4), 315–358 (2019)

[5] Hase, P., Bansal, M., Kim, B., Ghandeharioun, A.: Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 36, pp. 17643–17668 (2023)

[6] Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., et al.: A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems 43(2), 1–55 (2025)

[7] Krasnovsky, A.A.: Sheaf-theoretic causal emergence for resilience analysis in distributed systems (2025), https://arxiv.org/abs/2503.14104

[8] Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 30 (2017)

[9] Oizumi, M., Albantakis, L., Tononi, G.: From the phenomenology to the mechanisms of consciousness: Integrated information theory 3.0. PLoS Computational Biology 10(5), e1003588 (2014)

[10] Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., et al.: In-context learning and induction heads (2022), https://arxiv.org/abs/2209.11895

[11] Robinson, M.: Topological Signal Processing. Springer (2014)

[12] Rosas, F.E., Mediano, P.A.M., Jensen, H.J., Seth, A.K., Barrett, A.B., Carhart-Harris, R.L., Bor, D.: Reconciling emergences: An information-theoretic approach to identify causal emergence in multivariate data. PLoS Computational Biology 16(12), e1008289 (2020)

[13] Tononi, G., Sporns, O.: Measuring information integration. BMC Neuroscience 4, 31 (2003)

[14] Yang, A.X., Robeyns, M., Wang, X., Aitchison, L.: Bayesian low-rank adaptation for large language models (Laplace-LoRA) (2023), https://arxiv.org/abs/2308.13111, ICLR 2024 version

[15] Yao, Y., Zhang, N., Xi, Z., Wang, M., Xu, Z., Deng, S., Chen, H.: Knowledge circuits in pretrained transformers. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 37, pp. 118571–118602 (2024)
