FaCT: Faithful Concept Traces for Explaining Neural Network Decisions
Pith reviewed 2026-05-18 02:51 UTC · model grok-4.3
The pith
Neural networks gain shared concepts whose effects on decisions can be traced directly from any layer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a model architecture whose internal concepts are shared across classes and allow direct tracing of each concept's additive contribution to the logit together with the corresponding input visualization, all starting from arbitrary layers, without post-hoc fitting or constraints on class specificity or spatial size.
What carries the argument
Faithful Concept Traces that compute and propagate each concept's direct contribution to the logit from any chosen layer while also producing an input visualization for that concept.
If this is right
- The extracted concepts achieve higher scores on the new C² consistency metric than prior concept-based methods.
- Human evaluators rate the concepts as more understandable than those from earlier approaches.
- Classification accuracy on ImageNet remains comparable to standard models without the added tracing structure.
- The same tracing procedure works from every layer rather than only the final one.
Where Pith is reading between the lines
- The method could support layer-wise debugging by revealing which concepts matter at early versus late stages of processing.
- Because concepts are not required to match human labels, the approach might surface model-specific features that current human-centric explanations miss.
- If the tracing remains accurate on architectures other than the ones tested, the same design could be applied to tasks beyond image classification.
Load-bearing premise
The concepts recovered by the method are genuinely produced by the model's own computations and their traced contribution to the logit matches the actual effect inside the network.
What would settle it
Remove or zero the units associated with a traced concept at the chosen layer and check whether the change in the output logit equals the contribution value reported by the trace.
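The check described above can be sketched on a toy model. The snippet below is a minimal illustration, assuming a purely linear path from concept activations to the logit. The function names `concept_contributions`, `logit`, and `ablation_gap`, the matrices `W` and `v`, and the random data are all hypothetical and are not taken from the paper.

```python
import numpy as np

# Toy stand-in for a network, NOT the FaCT architecture (assumption):
#   concept activations c = W @ h, logit = c @ v

def concept_contributions(h, W, v):
    """Per-concept additive contributions to one logit in the toy model."""
    c = W @ h            # concept activations, shape (k,)
    return c * v         # traced contribution of each concept

def logit(h, W, v, ablate=None):
    """Logit of the toy model, optionally with one concept zeroed out."""
    c = W @ h
    if ablate is not None:
        c = c.copy()
        c[ablate] = 0.0  # zero the unit associated with the concept
    return float(c @ v)

def ablation_gap(h, W, v, k):
    """Absolute difference between the observed logit drop and the trace."""
    traced = concept_contributions(h, W, v)[k]
    observed = logit(h, W, v) - logit(h, W, v, ablate=k)
    return abs(observed - traced)

rng = np.random.default_rng(0)
h = rng.normal(size=8)       # hypothetical layer activations
W = rng.normal(size=(4, 8))  # hypothetical projection to 4 concepts
v = rng.normal(size=4)       # hypothetical concept-to-logit weights
gaps = [ablation_gap(h, W, v, k) for k in range(4)]
```

In this linear toy case every gap is zero up to floating-point error by construction. On a real trained network the same comparison would be run against the model's actual forward pass, and a clearly nonzero gap would indicate that the reported trace does not match the concept's true effect.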
Original abstract
Deep networks have shown remarkable performance across a wide range of tasks, yet getting a global concept-level understanding of how they function remains a key challenge. Many post-hoc concept-based approaches have been introduced to understand their workings, yet they are not always faithful to the model. Further, they make restrictive assumptions on the concepts a model learns, such as class-specificity, small spatial extent, or alignment to human expectations. In this work, we put emphasis on the faithfulness of such concept-based explanations and propose a new model with model-inherent mechanistic concept-explanations. Our concepts are shared across classes and, from any layer, their contribution to the logit and their input-visualization can be faithfully traced. We also leverage foundation models to propose a new concept-consistency metric, C²-Score, that can be used to evaluate concept-based methods. We show that, compared to prior work, our concepts are quantitatively more consistent and users find our concepts to be more interpretable, all while retaining competitive ImageNet performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FaCT, a new neural network model with built-in mechanistic concept explanations. Concepts are shared across classes rather than class-specific, and the architecture allows faithful tracing of each concept's contribution to the logit as well as its input visualization from any layer. The authors propose the C²-Score, a consistency metric that leverages external foundation models, and report competitive ImageNet accuracy together with higher quantitative consistency and improved user-rated interpretability relative to prior concept-based methods.
Significance. If the tracing procedure is shown to be direct and free of auxiliary fitting or restrictive assumptions on spatial extent and class specificity, the work would meaningfully advance faithful concept-based interpretability by removing several common post-hoc limitations. The C²-Score supplies a new quantitative evaluation tool, and the combination of accuracy preservation, consistency metrics, and user studies provides a reasonably complete assessment. The significance is tempered by the need to verify that the claimed faithfulness does not collapse to a post-hoc explanation on a modified architecture.
major comments (3)
- [§3] §3 (Method): The central claim that concepts are model-inherent and that logit contributions can be traced faithfully from arbitrary layers without post-hoc fitting is load-bearing for the entire contribution. The manuscript must explicitly demonstrate that the tracing mechanism uses only direct activation readout or linear decomposition with no auxiliary learned parameters or optimization; any such component would render the 'faithful' and 'model-inherent' properties circular.
- [§4.1] §4.1 (C²-Score): The consistency metric is defined using external foundation models. It is unclear whether the reported gains remain when the same foundation models are used to evaluate the baseline methods, or whether the FaCT concepts are inadvertently aligned to those models during extraction; this must be clarified to support the quantitative superiority claim.
- [Table 1] Table 1 (ImageNet accuracy): Competitive performance is asserted, yet the paper does not report whether the architectural modifications required for layer-wise tracing alter model capacity, training dynamics, or introduce any regularization that could independently explain the accuracy parity; this detail is necessary to isolate the effect of the concept-tracing design.
minor comments (2)
- [Abstract] The abstract claims the concepts are 'quantitatively more consistent' without naming the exact baselines or reporting numerical deltas; adding these values would improve readability.
- [§3] Notation for the traced concept contribution (e.g., the symbol used for the per-layer readout) is introduced without a consolidated table of symbols; a short notation table would aid readers.
Simulated Author's Rebuttal
We are grateful to the referee for the constructive comments that have helped us improve the clarity and rigor of our work. Below we address each major comment in detail.
- Referee: [§3] §3 (Method): The central claim that concepts are model-inherent and that logit contributions can be traced faithfully from arbitrary layers without post-hoc fitting is load-bearing for the entire contribution. The manuscript must explicitly demonstrate that the tracing mechanism uses only direct activation readout or linear decomposition with no auxiliary learned parameters or optimization; any such component would render the 'faithful' and 'model-inherent' properties circular.
  Authors: We thank the referee for highlighting this important point. In our architecture, the concept traces are implemented through direct linear projections from the layer activations to the concept space, followed by a direct contribution to the logit via a fixed linear combination, with no optimization or learned parameters beyond those of standard training. To make this explicit, we will revise §3 to include a formal description of the tracing procedure, including the mathematical formulation showing that it is a direct readout. We will also add a paragraph clarifying that no post-hoc fitting is involved.
  Revision: yes
- Referee: [§4.1] §4.1 (C²-Score): The consistency metric is defined using external foundation models. It is unclear whether the reported gains remain when the same foundation models are used to evaluate the baseline methods, or whether the FaCT concepts are inadvertently aligned to those models during extraction; this must be clarified to support the quantitative superiority claim.
  Authors: We agree that fair evaluation is crucial. The C²-Score uses foundation models solely as external evaluators; they play no part in FaCT's concept extraction, which relies only on the model's internal activations. To address the concern, we have re-computed the C²-Score for all baseline methods using the identical foundation models and evaluation protocol. FaCT's advantage persists under this like-for-like comparison. We will update §4.1 and the corresponding results to include this clarification and the updated numbers.
  Revision: yes
- Referee: [Table 1] Table 1 (ImageNet accuracy): Competitive performance is asserted, yet the paper does not report whether the architectural modifications required for layer-wise tracing alter model capacity, training dynamics, or introduce any regularization that could independently explain the accuracy parity; this detail is necessary to isolate the effect of the concept-tracing design.
  Authors: We appreciate this observation. The architectural modifications that enable layer-wise tracing consist of inserting lightweight concept modules that operate on existing activations, without significantly increasing the overall parameter count or altering the core network capacity. Training dynamics remain the same as for the base model, and no additional regularization is introduced. In the revised manuscript, we will add a dedicated subsection in §4 or the appendix comparing parameter counts, FLOPs, and training procedures against the baseline models, to confirm that accuracy parity is attributable to the concept design rather than to capacity changes.
  Revision: yes
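The first response above describes the trace as a linear projection into concept space followed by a fixed linear combination into the logit. Under that description alone, a direct readout might take roughly the form below. The notation is assumed here for illustration and is not taken from the paper: $h^{(\ell)}$ for the activations at layer $\ell$, $W^{(\ell)}$ for the projection to concepts, $a_{y,k}$ for the weight of concept $k$ on class $y$, and $b_y$ for a bias term.

```latex
\begin{align}
c^{(\ell)} &= W^{(\ell)} h^{(\ell)}, \\
z_y &= \sum_{k} a_{y,k}\, c^{(\ell)}_{k} + b_y, \\
\Delta z_y^{(k)} &= a_{y,k}\, c^{(\ell)}_{k}.
\end{align}
```

Here $\Delta z_y^{(k)}$ would be the additive contribution of concept $k$ to the logit $z_y$. This is only the simplest form consistent with the response; the summary does not say how the path from an intermediate layer to the logit is kept linear, so the actual formulation in the revised §3 may differ.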
Circularity Check
No circularity: new architecture with external validation metrics
The paper introduces a custom model architecture designed for inherent concept tracing, with faithfulness defined directly by the model's construction rather than derived from fitted parameters or prior results. The C²-Score is computed using external foundation models, and quantitative claims (consistency, interpretability, ImageNet accuracy) are evaluated against independent benchmarks and user studies. There are no load-bearing self-citations, no known results renamed as new derivations, and no predictions that reduce to the same fitted inputs by construction. The derivation chain remains self-contained against external evaluation.
Forward citations
Cited by 1 Pith paper
- OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models
OCCAM discovers open-set visual concepts, estimates causal contributions via object-level interventions on black-box vision models, and induces a global concept ontology from aggregated dataset evidence.