Disentangling Influence: Using Disentangled Representations to Audit Model Predictions

Carlos Scheidegger; Charles T. Marx; Richard Lanas Phillips; Sorelle A. Friedler; Suresh Venkatasubramanian

arxiv: 1906.08652 · v1 · pith:7RB2M5FDnew · submitted 2019-06-20 · 💻 cs.LG · stat.ML

Disentangling Influence: Using Disentangled Representations to Audit Model Predictions

Charles T. Marx , Richard Lanas Phillips , Sorelle A. Friedler , Carlos Scheidegger , Suresh Venkatasubramanian This is my paper

Pith reviewed 2026-05-25 19:42 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords disentangled representationsmodel auditingfeature influenceproxy featuresindirect influenceblack-box modelsinfluence audits

0 comments

The pith

Disentangled representations let auditors measure indirect proxy influences on black-box model predictions for single points or in aggregate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces disentangled influence audits as a way to separate direct feature effects from indirect ones that operate through proxies. It argues that disentangled representations provide an explicit mechanism to spot these proxies in data and then calculate their influence on classifier outcomes. The approach works both locally for individual predictions and globally across datasets. A reader would care because the method claims to detect and rank the most influential proxies more effectively than prior techniques limited to one dimension of influence at a time.

Core claim

Disentangled influence audits use disentangled representations to identify proxy features and compute their explicit influence on model predictions, either for each individual outcome or in aggregate over the data. Theory and experiments demonstrate that the audits detect proxy features and identify which ones affect the audited classifier most, making the method more powerful than existing approaches for ascertaining feature influence.

What carries the argument

Disentangled representations that isolate proxy features from direct ones, enabling separate computation of indirect influence on model outputs.

If this is right

Audits can flag the strongest proxy drivers for any single prediction.
Aggregate results can reveal overall proxy patterns across an entire dataset.
The same framework applies to influences measured on training data or test data.
Multiple proxies can be compared directly by their computed influence values.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The audits could be paired with fairness metrics to trace how proxies encode protected attributes.
If the representations are learned from the same data as the model, circularity might hide some proxies.
Extensions to regression or other output types would require only redefining the influence function.

Load-bearing premise

Disentangled representations can be obtained that reliably separate direct features from proxy features in a way that supports accurate influence calculations.

What would settle it

A controlled test on synthetic data with known proxies where the audits fail to rank the correct proxies by influence strength or miss their presence entirely.

Figures

Figures reproduced from arXiv: 1906.08652 by Carlos Scheidegger, Charles T. Marx, Richard Lanas Phillips, Sorelle A. Friedler, Suresh Venkatasubramanian.

**Figure 2.** Figure 2: Synthetic x + y data direct shap (left) and indirect (right) feature influences using a handcrafted (top row) or learned disentangled representation (bottom row). The results for the handcrafted disentangled representation (top of [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Errors on the synthetic x + y data for the reconstruction error (left) when taken across influence audits for each feature, prediction error (middle), and disentanglement error (right). These influence experiments on the x + y dataset demonstrate the importance of a good disentangled representation to the quality of the resulting indirect influence measures, since the handcrafted zero-error disentangled r… view at source ↗

**Figure 4.** Figure 4: dSprites data indirect latent factor influences [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: The mean squared reconstruction error (left), absolute prediction error (middle), and absolute [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Ten selected features for Adult dataset. Direct (left) and indirect (right) influence are shown. For all [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: The reconstruction error (left), prediction error (middle), and disentanglement error (right) of selected [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Comparison on the synthetic x + y data of the disentangled influence audits using the handcrafted (left) or learned (middle) disentangled representation with the BBA approach of [1] (right). mean over all instances of the absolute value of the per feature disentangled influence. BBA was designed to audit classifiers, so in order to compare to the results of disentangled influence audits we will consider th… view at source ↗

**Figure 9.** Figure 9: Comparison on the Adult data of the disentan [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗

**Figure 10.** Figure 10: The full influence results for the adult data direct (left) and indirect (right) feature influences. [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗

**Figure 11.** Figure 11: The full disentanglement (top), reconstruction (left) and prediction (right) error metrics for the [PITH_FULL_IMAGE:figures/full_fig_p012_11.png] view at source ↗

read the original abstract

Motivated by the need to audit complex and black box models, there has been extensive research on quantifying how data features influence model predictions. Feature influence can be direct (a direct influence on model outcomes) and indirect (model outcomes are influenced via proxy features). Feature influence can also be expressed in aggregate over the training or test data or locally with respect to a single point. Current research has typically focused on one of each of these dimensions. In this paper, we develop disentangled influence audits, a procedure to audit the indirect influence of features. Specifically, we show that disentangled representations provide a mechanism to identify proxy features in the dataset, while allowing an explicit computation of feature influence on either individual outcomes or aggregate-level outcomes. We show through both theory and experiments that disentangled influence audits can both detect proxy features and show, for each individual or in aggregate, which of these proxy features affects the classifier being audited the most. In this respect, our method is more powerful than existing methods for ascertaining feature influence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces a procedure using disentangled representations to audit proxy influences in black-box models, but the central claims rest on an unverified assumption that the disentanglement cleanly separates direct and indirect effects.

read the letter

The paper introduces disentangled influence audits to separate direct and indirect feature influences using disentangled representations. This allows detecting proxy features and ranking their impact on individual or aggregate predictions. The work builds on existing disentanglement techniques to tackle the problem of proxy detection in a more structured way than previous influence measures that focused on one aspect at a time. It claims to be more powerful because it can rank which proxy features matter most for the audited classifier. This addresses a genuine need in fairness auditing where models might use sensitive attributes indirectly through correlated features. The paper does a decent job framing the problem and positioning the method as an extension. The idea of using disentangled factors to separate direct and indirect effects is a reasonable direction. That said, the load-bearing assumption is that the disentangled representations actually achieve a clean separation between direct features and proxies. The abstract says theory and experiments back this up, but from what's visible, there's no description of how they validate the disentanglement or test on data with known ground-truth proxies. Any mixing between the factors would undermine the influence calculations. This makes the superiority claim hard to evaluate without the full details on the experiments. Overall, this is for researchers in machine learning interpretability and algorithmic fairness. Someone looking for new tools to audit models for proxy influences might get something out of it. I think it deserves a serious referee because the problem is important and the approach is distinct enough, even if the current evidence is thin on the validation side.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes 'disentangled influence audits,' a procedure that uses disentangled representations to identify proxy features in a dataset and explicitly compute their indirect influence on a black-box classifier's predictions, either locally for individual points or in aggregate. It claims to show via theory and experiments that the method detects proxies and ranks which proxy affects the audited model most, making it more powerful than prior feature-influence techniques that typically address only one dimension (direct/indirect, local/global).

Significance. If the central claims hold, the work would offer a structured mechanism for auditing indirect/proxy influences that current methods do not jointly address, with potential value for fairness auditing and interpretability of complex models. The integration of disentanglement techniques with influence computation is a distinctive contribution, though its practical utility hinges on reliable separation of factors.

major comments (2)

[Abstract] Abstract: The claim that 'disentangled representations provide a mechanism to identify proxy features' while enabling 'explicit computation of feature influence' is load-bearing, yet the visible text supplies no derivation, equation, or formal statement showing how the disentangled factors map to a decomposition of direct versus indirect influence. Without this, the asserted theoretical support cannot be evaluated.
[Abstract] Abstract: The assertion of superiority ('our method is more powerful than existing methods') and the experimental validation rest on the unverified assumption that learned disentangled factors cleanly isolate proxy features from direct ones. No mention is made of synthetic-data controls with known ground-truth proxies or quantitative recovery metrics that would confirm the separation is accurate enough for the influence ranking to be reliable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments. Below we respond point-by-point to the major comments, drawing on the full manuscript content and indicating where revisions will strengthen clarity without altering the core claims.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that 'disentangled representations provide a mechanism to identify proxy features' while enabling 'explicit computation of feature influence' is load-bearing, yet the visible text supplies no derivation, equation, or formal statement showing how the disentangled factors map to a decomposition of direct versus indirect influence. Without this, the asserted theoretical support cannot be evaluated.

Authors: The abstract is a concise summary and therefore omits detailed derivations. Section 3 of the manuscript contains the formal definitions, the mapping from disentangled factors to the direct/indirect influence decomposition, and the associated proofs. We will revise the abstract to include a short parenthetical reference to the key equation in Section 3 so that the theoretical support is signposted from the outset. revision: partial
Referee: [Abstract] Abstract: The assertion of superiority ('our method is more powerful than existing methods') and the experimental validation rest on the unverified assumption that learned disentangled factors cleanly isolate proxy features from direct ones. No mention is made of synthetic-data controls with known ground-truth proxies or quantitative recovery metrics that would confirm the separation is accurate enough for the influence ranking to be reliable.

Authors: Section 5.1 describes synthetic-data experiments that generate datasets with explicitly known proxy relationships and direct features. In these experiments we report quantitative recovery metrics (proxy identification accuracy and rank correlation of influence scores against ground truth) that verify the disentanglement isolates the proxies sufficiently for the subsequent influence ranking. These results directly support the superiority claim under controlled conditions. We will add an explicit sentence to the abstract (or a footnote) highlighting the synthetic controls and metrics. revision: yes

Circularity Check

0 steps flagged

No circularity; method relies on external disentanglement techniques without self-referential reduction

full rationale

The provided abstract and description contain no equations, derivations, or self-citations that reduce the claimed results to fitted parameters or definitions by construction. The approach is presented as building on existing disentanglement research to enable audits, with theory and experiments offered as validation; no load-bearing step is shown to collapse into its own inputs. This is the expected self-contained case where the derivation chain does not exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach depends on the learnability of disentangled representations that isolate proxy effects; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption Disentangled representations exist and can be learned to separate direct and proxy features
Central to identifying proxies and computing their influence

pith-pipeline@v0.9.0 · 5726 in / 1091 out tokens · 23022 ms · 2026-05-25T19:42:14.831484+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean (distinguishability floor) absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

adversarial autoencoders... LEnc = MSE(x, x̂) − β MSE(p, p̂)... f disentangles p from the other features

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 3 internal anchors

[1]

Adler, C

P. Adler, C. Falk, S. A. Friedler, T. Nix, G. Rybeck, C. Scheidegger, B. Smith, and S. Venkatasubramanian. Auditing black-box models for indirect inﬂuence. Knowledge and Information Systems, 54(1):95–122, 2018

work page 2018
[2]

A. A. Alemi, I. Fischer, J. V . Dillon, and K. Murphy. Deep variational information bottleneck. International Conference on Learning Representations, 2016

work page 2016
[3]

Bengio, A

Y . Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives.IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013

work page 2013
[4]

Datta, S

A. Datta, S. Sen, and Y . Zick. Algorithmic transparency via quantitative input inﬂuence: Theory and experiments with learning systems. In Proceedings of 37th IEEE Symposium on Security and Privacy, 2016

work page 2016
[5]

Edwards and A

H. Edwards and A. Storkey. Censoring representations with an adversary. In Proceedings of the 33th International Conference on Machine Learning, 2016

work page 2016
[6]

Esmaeili, H

B. Esmaeili, H. Wu, S. Jain, A. Bozkurt, N. Siddharth, B. Paige, D. H. Brooks, J. Dy, and J.-W. van de Meent. Structured disentangled representations. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89, pages 2525–2534. PMLR, 16–18 Apr 2019

work page 2019
[7]

S. A. Friedler, C. Scheidegger, S. Venkatasubramanian, S. Choudhary, E. P. Hamilton, and D. Roth. A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 329–338. ACM, 2019

work page 2019
[8]

Guidotti, A

R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi. A survey of methods for explaining black box models. ACM computing surveys (CSUR), 51(5):93, 2018

work page 2018
[9]

Henelius, K

A. Henelius, K. Puolamäki, H. Boström, L. Asker, and P. Papapetrou. A peek into the black box: exploring classiﬁers by randomization. Data Min Knowl Disc, 28:1503–1529, 2014

work page 2014
[10]

Towards a Definition of Disentangled Representations

I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, and A. Lerchner. Towards a deﬁnition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[11]

P. W. Koh and P. Liang. Understanding black-box predictions via inﬂuence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1885–1894. JMLR. org, 2017

work page 2017
[12]

Kumar, P

A. Kumar, P. Sattigeri, and A. Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. International Conference on Learning Representations, 2017

work page 2017
[13]

S. M. Lundberg and S.-I. Lee. A uniﬁed approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017

work page 2017
[14]

Madras, E

D. Madras, E. Creager, T. Pitassi, and R. Zemel. Learning adversarially fair and transferable representations. In Proceedings of the 35th International Conference on Machine Learning, 2018

work page 2018
[15]

Adversarial Autoencoders

A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[16]

Matthey, I

L. Matthey, I. Higgins, D. Hassabis, and A. Lerchner. dsprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017

work page 2017
[17]

C. Molnar. Interpretable machine learning: A guide for making black box models explainable. Christoph Molnar, Leanpub, 2018

work page 2018
[18]

Why Should I Trust You?

M. T. Ribeiro, S. Singh, and C. Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classiﬁer. In Proc. ACM KDD, 2016

work page 2016
[19]

Recent Advances in Autoencoder-Based Representation Learning

M. Tschannen, O. Bachem, and M. Lucic. Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

I. M. L. R. University of California. Adult income dataset. https://archive.ics.uci.edu/ml/datasets/ adult

work page
[21]

education_num

B. Ustun, A. Spangher, and Y . Liu. Actionable recourse in linear classiﬁcation. InProceedings of the Conference on Fairness, Accountability, and Transparency, pages 10–19. ACM, 2019. 9 Disentangling Inﬂuence: Using disentangled representations to audit model predictions A Implementation Details Syntheticx +y model and disentangled representation informat...

work page 2019

[1] [1]

Adler, C

P. Adler, C. Falk, S. A. Friedler, T. Nix, G. Rybeck, C. Scheidegger, B. Smith, and S. Venkatasubramanian. Auditing black-box models for indirect inﬂuence. Knowledge and Information Systems, 54(1):95–122, 2018

work page 2018

[2] [2]

A. A. Alemi, I. Fischer, J. V . Dillon, and K. Murphy. Deep variational information bottleneck. International Conference on Learning Representations, 2016

work page 2016

[3] [3]

Bengio, A

Y . Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives.IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013

work page 2013

[4] [4]

Datta, S

A. Datta, S. Sen, and Y . Zick. Algorithmic transparency via quantitative input inﬂuence: Theory and experiments with learning systems. In Proceedings of 37th IEEE Symposium on Security and Privacy, 2016

work page 2016

[5] [5]

Edwards and A

H. Edwards and A. Storkey. Censoring representations with an adversary. In Proceedings of the 33th International Conference on Machine Learning, 2016

work page 2016

[6] [6]

Esmaeili, H

B. Esmaeili, H. Wu, S. Jain, A. Bozkurt, N. Siddharth, B. Paige, D. H. Brooks, J. Dy, and J.-W. van de Meent. Structured disentangled representations. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89, pages 2525–2534. PMLR, 16–18 Apr 2019

work page 2019

[7] [7]

S. A. Friedler, C. Scheidegger, S. Venkatasubramanian, S. Choudhary, E. P. Hamilton, and D. Roth. A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 329–338. ACM, 2019

work page 2019

[8] [8]

Guidotti, A

R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi. A survey of methods for explaining black box models. ACM computing surveys (CSUR), 51(5):93, 2018

work page 2018

[9] [9]

Henelius, K

A. Henelius, K. Puolamäki, H. Boström, L. Asker, and P. Papapetrou. A peek into the black box: exploring classiﬁers by randomization. Data Min Knowl Disc, 28:1503–1529, 2014

work page 2014

[10] [10]

Towards a Definition of Disentangled Representations

I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, and A. Lerchner. Towards a deﬁnition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[11] [11]

P. W. Koh and P. Liang. Understanding black-box predictions via inﬂuence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1885–1894. JMLR. org, 2017

work page 2017

[12] [12]

Kumar, P

A. Kumar, P. Sattigeri, and A. Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. International Conference on Learning Representations, 2017

work page 2017

[13] [13]

S. M. Lundberg and S.-I. Lee. A uniﬁed approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017

work page 2017

[14] [14]

Madras, E

D. Madras, E. Creager, T. Pitassi, and R. Zemel. Learning adversarially fair and transferable representations. In Proceedings of the 35th International Conference on Machine Learning, 2018

work page 2018

[15] [15]

Adversarial Autoencoders

A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[16] [16]

Matthey, I

L. Matthey, I. Higgins, D. Hassabis, and A. Lerchner. dsprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017

work page 2017

[17] [17]

C. Molnar. Interpretable machine learning: A guide for making black box models explainable. Christoph Molnar, Leanpub, 2018

work page 2018

[18] [18]

Why Should I Trust You?

M. T. Ribeiro, S. Singh, and C. Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classiﬁer. In Proc. ACM KDD, 2016

work page 2016

[19] [19]

Recent Advances in Autoencoder-Based Representation Learning

M. Tschannen, O. Bachem, and M. Lucic. Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[20] [20]

I. M. L. R. University of California. Adult income dataset. https://archive.ics.uci.edu/ml/datasets/ adult

work page

[21] [21]

education_num

B. Ustun, A. Spangher, and Y . Liu. Actionable recourse in linear classiﬁcation. InProceedings of the Conference on Fairness, Accountability, and Transparency, pages 10–19. ACM, 2019. 9 Disentangling Inﬂuence: Using disentangled representations to audit model predictions A Implementation Details Syntheticx +y model and disentangled representation informat...

work page 2019