Understanding Task Representations in Neural Networks via Bayesian Ablation

Andrew Nam; Declan Campbell; Jonathan Cohen; Sarah-Jane Leslie; Thomas Griffiths

arXiv: 2505.13742 · v2 · submitted 2025-05-19 · cs.LG · cs.AI

Pith reviewed 2026-05-22 13:44 UTC · model grok-4.3

keywords: neural network interpretability, Bayesian ablation, representational analysis, causal contribution, polysemanticity, manifold complexity, information theory

The pith

A Bayesian distribution over neural network units reveals their causal contributions to task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a probabilistic framework inspired by Bayesian inference to interpret latent task representations inside neural networks. It defines a distribution over representational units and uses ablation or sampling to infer how much each unit causally affects performance on a given task. The approach adds information-theoretic metrics that quantify how distributed the representations are, how complex the underlying manifold is, and how polysemantic individual units tend to be. A sympathetic reader would care because this supplies a systematic way to move beyond black-box performance and examine what the network has actually learned.

Core claim

The authors argue that defining a distribution over representational units and performing Bayesian ablation lets researchers infer the causal contributions of those units to task performance, while information-theoretic tools simultaneously characterize representational distributedness, manifold complexity, and polysemanticity.

What carries the argument

The central mechanism is the probability distribution placed over representational units, which supports inference of causal contributions through controlled ablation or sampling.
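The text above does not state the exact form of the distribution, so the following is only a minimal sketch under assumptions: a uniform prior over binary keep/ablate masks and a Boltzmann-style weighting `exp(-loss / T)`. The toy linear readout and the names `mask_posterior`, `unit_marginals`, and `temperature` are hypothetical and introduced purely for illustration; this is not the authors' method.

```python
import itertools
import numpy as np

def mask_posterior(loss_fn, n_units, temperature=1.0):
    """Weight every binary keep/ablate mask by exp(-loss / T).

    Assumptions (not from the paper): uniform prior over masks and a
    Boltzmann-style likelihood. Exhaustive enumeration only scales to
    a handful of units.
    """
    masks = np.array(list(itertools.product([0, 1], repeat=n_units)), dtype=float)
    losses = np.array([loss_fn(m) for m in masks])
    logits = -losses / temperature
    logits -= logits.max()              # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return masks, probs

def unit_marginals(masks, probs):
    """Probability, under the mask distribution, that each unit is kept."""
    return probs @ masks

# Hypothetical toy "network": a fixed linear readout over 4 hidden units.
rng = np.random.default_rng(0)
H = rng.normal(size=(2000, 4))          # hidden activations
w = np.array([2.0, 0.0, -1.0, 0.1])     # readout weights
y = H @ w                               # targets the intact model fits exactly

def toy_loss(mask):
    # Ablation here means zeroing the masked-out units before the readout.
    return float(np.mean(((H * mask) @ w - y) ** 2))

masks, probs = mask_posterior(toy_loss, n_units=4)
marginals = unit_marginals(masks, probs)
print(np.round(marginals, 3))
```

In this toy setting, a unit whose readout weight is zero has no effect on the loss, so its keep-probability sits at 0.5, while units with larger weights are kept with higher probability.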

If this is right

The framework identifies which units are most responsible for success on a specific task.
It supplies a quantitative measure of how distributed a representation is across units.
The metrics can evaluate the geometric complexity of the learned manifold.
It detects polysemantic units that participate in multiple distinct concepts.
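The metric definitions are not given on this page. As one plausible information-theoretic reading of "how distributed a representation is", the sketch below computes the normalized Shannon entropy of a non-negative per-unit contribution vector. The function name `normalized_entropy` and the choice of normalization are assumptions, not the paper's definitions.

```python
import numpy as np

def normalized_entropy(contributions):
    """Shannon entropy of a non-negative vector, rescaled to [0, 1].

    0 means all contribution sits on one unit; 1 means it is spread
    evenly across all units. (Illustrative definition only.)
    """
    c = np.asarray(contributions, dtype=float)
    if c.ndim != 1 or c.size < 2:
        raise ValueError("need a 1-D vector with at least two entries")
    if np.any(c < 0) or c.sum() <= 0:
        raise ValueError("contributions must be non-negative with a positive sum")
    p = c / c.sum()
    nz = p[p > 0]                        # treat 0 * log(0) as 0
    return float(-(nz * np.log(nz)).sum() / np.log(c.size))

print(normalized_entropy([1, 1, 1, 1]))          # evenly spread
print(normalized_entropy([1, 0, 0, 0]))          # fully localized
print(normalized_entropy([0.7, 0.1, 0.1, 0.1]))  # in between
```

The same function could in principle be applied across tasks for a single unit as a rough proxy for polysemanticity, though that too is an assumption rather than something stated here.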

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distribution-based ablation could be used to compare representations learned by networks trained under different objectives or architectures.
Low-contribution units identified by the method might be candidates for removal during model compression.
The metrics for distributedness and polysemanticity could guide the design of training procedures that encourage more disentangled representations.

Load-bearing premise

Ablating or sampling from a distribution over representational units accurately isolates their causal contributions without being confounded by the network's training dynamics or architecture-specific interactions.
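A small constructed example can show how this premise might fail. In the sketch below, two hidden units carry the same signal, so ablating either one alone leaves accuracy unchanged while ablating both drops it to roughly chance. The setup (duplicated signal, sign readout, zero-ablation) is hypothetical and chosen only to make redundancy visible.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=2000)
hidden = np.stack([signal, signal], axis=1)   # two perfectly redundant units
labels = signal > 0

def accuracy(mask):
    # Zero-ablate masked-out units, then classify by the sign of the sum.
    logits = (hidden * mask).sum(axis=1)
    return float(np.mean((logits > 0) == labels))

full = accuracy(np.array([1.0, 1.0]))
drop_a = accuracy(np.array([0.0, 1.0]))
drop_b = accuracy(np.array([1.0, 0.0]))
drop_both = accuracy(np.array([0.0, 0.0]))

print(full, drop_a, drop_b, drop_both)
```

Single-unit ablation would assign both units zero effect here, even though the pair is jointly necessary. This is the kind of interaction that a distribution over joint ablation patterns would need to capture.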

What would settle it

Apply the method to a linear classifier with known feature weights and check whether the inferred causal contributions recover the true importance of each unit.
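A rough version of this check can be sketched as follows. The example uses plain single-unit zero-ablation as a stand-in for the paper's procedure, with independent Gaussian inputs and hand-picked weights; all names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([3.0, 2.0, 1.0, 0.5, 0.0])   # known "ground-truth" importances
X = rng.normal(size=(5000, true_w.size))        # independent input units
labels = X @ true_w > 0                         # labels from the intact classifier

def accuracy(mask):
    # Zero-ablate masked-out units and re-run the same linear classifier.
    preds = (X * mask) @ true_w > 0
    return float(np.mean(preds == labels))

def ablation_effects():
    """Accuracy drop caused by ablating each unit on its own."""
    base = accuracy(np.ones(true_w.size))
    effects = []
    for i in range(true_w.size):
        mask = np.ones(true_w.size)
        mask[i] = 0.0
        effects.append(base - accuracy(mask))
    return np.array(effects)

effects = ablation_effects()
recovered_order = np.argsort(-effects)          # most to least important
true_order = np.argsort(-np.abs(true_w))
print(np.round(effects, 3), recovered_order, true_order)
```

With independent inputs the accuracy drop grows with the magnitude of the ablated weight, so the recovered ranking should match the true one. Correlated inputs would be a harder and more informative test.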

Figures

Figures reproduced from arXiv: 2505.13742 by Andrew Nam, Declan Campbell, Jonathan Cohen, Sarah-Jane Leslie, Thomas Griffiths.

**Figure 2.** (a) Correlation between task representation metrics and task acquisition order across …

**Figure 3.** Comparison between the true AMD, the GFlowNet approximation, and the uniform base …

**Figure 4.** Spearman correlation between similarity measures.

**Figure 5.** Spearman correlation between similarity measures.

**Figure 6.** Cosine distance between tasks.

**Figure 7.** Wasserstein distance between tasks.

Original abstract

Neural networks are powerful tools for cognitive modeling due to their flexibility and emergent properties. However, interpreting their learned representations remains challenging due to their sub-symbolic semantics. In this work, we introduce a novel probabilistic framework for interpreting latent task representations in neural networks. Inspired by Bayesian inference, our approach defines a distribution over representational units to infer their causal contributions to task performance. Using ideas from information theory, we propose a suite of tools and metrics to illuminate key model properties, including representational distributedness, manifold complexity, and polysemanticity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a Bayesian take on ablating units to measure their task contributions plus some info-theory metrics, but the causal isolation step looks shaky without handling training interactions.

The main takeaway is that they frame representational units as draws from a distribution and use ablation or sampling to estimate causal effects on performance, then layer on metrics for distributedness, manifold complexity, and polysemanticity. This is a modest step beyond plain lesioning studies because the probabilistic wrapper can capture some uncertainty in which units matter. The metrics themselves are straightforward applications of information theory and could help make qualitative claims about representations more quantitative. That part is useful for people already doing interpretability work on cognitive models or trained networks.

The soft spot is the causal inference. Ablation changes performance, but in a trained network the units are not independent; gradients and compensatory weights during learning can make the observed effect reflect the whole system rather than the isolated unit. The setup does not appear to include a model of those dependencies in the posterior, so the causal language may overreach what the procedure actually delivers. Synthetic checks with known ground-truth contributions would have helped here.

This is aimed at researchers in neural interpretability and cognitive modeling who want tools that go a bit beyond standard ablation. A reader already familiar with Bayesian methods and representation analysis would find the framework coherent enough to engage with, even if the experiments need tightening. It deserves peer review because the core proposal is clear and the gaps are fixable rather than fatal.

Referee Report

1 major / 2 minor

Summary. The paper introduces a novel probabilistic framework for interpreting latent task representations in neural networks. It defines a distribution over representational units to infer their causal contributions to task performance and proposes information-theoretic metrics to quantify properties including representational distributedness, manifold complexity, and polysemanticity.

Significance. If the causal inferences hold after accounting for training dynamics, the framework could provide a principled Bayesian approach to neural interpretability, extending cognitive modeling tools with quantitative metrics for distributed and polysemantic representations.

major comments (1)

[§3] The central claim that ablation/sampling from the distribution over units isolates causal contributions (abstract and §3) is load-bearing but lacks a derivation showing how the posterior accounts for unmodeled dependencies such as compensatory effects during training or architecture-specific unit interactions; without this, performance changes may reflect redundancies rather than true causality.

minor comments (2)

[Methods] Clarify the exact form of the proposed distribution over representational units and how it is fit from data.
[Related Work] Add explicit comparison to existing ablation or attribution methods (e.g., integrated gradients or causal mediation analysis) to highlight novelty.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. The feedback on the causal interpretation of the Bayesian ablation framework is particularly valuable, and we have revised the paper to address the concerns raised while clarifying the scope and assumptions of our approach.

Referee: [§3] The central claim that ablation/sampling from the distribution over units isolates causal contributions (abstract and §3) is load-bearing but lacks a derivation showing how the posterior accounts for unmodeled dependencies such as compensatory effects during training or architecture-specific unit interactions; without this, performance changes may reflect redundancies rather than true causality.

Authors: We agree that the original manuscript would benefit from a more explicit derivation and discussion of assumptions regarding causal isolation. In the revised version, we have added a formal derivation in §3 that applies Bayes' rule to compute the posterior over units, with the likelihood defined in terms of the observed change in task performance following ablation or sampling. This derivation proceeds under the modeling assumption of conditional independence among units given the performance metric. We have also inserted a dedicated limitations subsection that directly addresses unmodeled dependencies, including compensatory effects that may arise during training and architecture-specific interactions. The text now explicitly states that performance changes could reflect redundancies in some settings and that the inferred contributions are best viewed as effective causal roles under the stated assumptions rather than exhaustive causal attributions. To help readers diagnose such cases, we have expanded the description of the information-theoretic metrics (distributedness, manifold complexity, and polysemanticity) as diagnostic tools. Additional controlled experiments have been included to illustrate the framework's behavior when redundancies are present. These revisions clarify the claims without overstating the isolation of causality.

Revision: partial

Circularity Check

0 steps flagged

Bayesian ablation framework introduces new definitions without reducing to self-referential inputs

The paper presents a novel probabilistic framework that defines a distribution over representational units and uses ablation/sampling to quantify causal contributions, drawing on Bayesian inference and information theory concepts. No equations, fitted parameters, or self-citations are shown in the provided text that would make any claimed result equivalent to its inputs by construction. The central contribution is the introduction of this interpretive tool itself, which remains self-contained as a methodological proposal rather than a derivation that loops back to presuppose its own outputs. No load-bearing steps reduce predictions to prior fits or author-specific uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The described approach rests on standard Bayesian inference and information theory without introducing new free parameters, axioms beyond domain assumptions, or invented entities in the abstract.

axioms (1)

Domain assumption: Bayesian inference can be used to define distributions over representational units and infer causal contributions to task performance.
Basis: directly stated as the inspiration for the framework in the abstract.

pith-pipeline@v0.9.0 · 5612 in / 1122 out tokens · 36804 ms · 2026-05-22T13:44:03.437602+00:00 · methodology

