
arxiv: 1906.08652 · v1 · pith:7RB2M5FD · submitted 2019-06-20 · 💻 cs.LG · stat.ML

Disentangling Influence: Using Disentangled Representations to Audit Model Predictions

Pith reviewed 2026-05-25 19:42 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords: disentangled representations · model auditing · feature influence · proxy features · indirect influence · black-box models · influence audits

The pith

Disentangled representations let auditors measure indirect proxy influences on black-box model predictions for single points or in aggregate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces disentangled influence audits as a way to separate direct feature effects from indirect ones that operate through proxies. It argues that disentangled representations provide an explicit mechanism to spot these proxies in data and then calculate their influence on classifier outcomes. The approach works both locally for individual predictions and globally across datasets. A reader would care because the method claims to detect and rank the most influential proxies more effectively than prior techniques limited to one dimension of influence at a time.

Core claim

Disentangled influence audits use disentangled representations to identify proxy features and compute their explicit influence on model predictions, either for each individual outcome or in aggregate over the data. Theory and experiments demonstrate that the audits detect proxy features and identify which ones affect the audited classifier most, making the method more powerful than existing approaches for ascertaining feature influence.

What carries the argument

Disentangled representations that isolate proxy features from direct ones, enabling separate computation of indirect influence on model outputs.
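To make the direct/indirect split concrete, here is a minimal Python sketch under stated assumptions, not the paper's implementation. Exact interventional Shapley values stand in for the SHAP estimates named in the figure captions. The data are a hypothetical toy set with features (x, y, p), where p = 2x is a proxy for x. `encode`/`decode` are a hypothetical handcrafted disentangled representation with one latent per underlying factor, and `model` is a hypothetical audited model that reads y and the proxy but never x itself.

```python
from itertools import combinations
from math import factorial


def shapley_values(g, point, baseline):
    """Exact interventional Shapley values of g at `point` relative to `baseline`.

    Coordinates outside a coalition are replaced by their baseline value. The cost
    is exponential in len(point), so this is only meant for tiny toy examples.
    """
    n = len(point)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                keep = set(subset)
                without_i = [point[j] if j in keep else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = point[i]
                phi[i] += weight * (g(with_i) - g(without_i))
    return phi


def model(v):
    # Hypothetical audited model over (x, y, p): uses y and the proxy p, never x.
    return v[1] + v[2]


def encode(v):
    # Hypothetical handcrafted disentangled representation: latents (x, y).
    return [v[0], v[1]]


def decode(z):
    # Regenerates the proxy from the x-factor, so p is explained by z[0].
    return [z[0], z[1], 2 * z[0]]


point, baseline = [1.0, 3.0, 2.0], [0.0, 0.0, 0.0]
direct = shapley_values(model, point, baseline)
indirect = shapley_values(lambda z: model(decode(z)), encode(point), encode(baseline))
print("direct   (x, y, p):", [round(v, 6) for v in direct])
print("indirect (x, y):   ", [round(v, 6) for v in indirect])
```

The direct audit gives x zero influence because the model never reads it. The indirect audit, run on the composed function model(decode(z)) over the latent factors, assigns x an influence of 2, all of it carried through the proxy p.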

If this is right

  • Audits can flag the strongest proxy drivers for any single prediction.
  • Aggregate results can reveal overall proxy patterns across an entire dataset.
  • The same framework applies to influences measured on training data or test data.
  • Multiple proxies can be compared directly by their computed influence values.
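The local and aggregate views in these bullets can be sketched as follows. The aggregate follows the Figure 8 caption's "mean over all instances of the absolute value of the per feature disentangled influence". As a simplifying assumption, the composed model g(z) = model(decode(z)) is taken to be linear, g(z) = 2·z_x + z_y, with the dataset mean as baseline. For an additive model the interventional Shapley value of factor d is exactly w_d·(z_d − baseline_d). The weights, latents and function names are hypothetical.

```python
def local_influence(weights, z, baseline):
    """Shapley values of a linear g(z) = sum(w_d * z_d) + c relative to a baseline.

    For an additive model the interventional Shapley value of coordinate d is
    exactly w_d * (z_d - baseline_d), so no coalition enumeration is needed.
    """
    return [w * (zd - bd) for w, zd, bd in zip(weights, z, baseline)]


def aggregate_influence(weights, latents):
    """Mean absolute per-point influence of each latent factor over a dataset."""
    n, dims = len(latents), len(weights)
    baseline = [sum(z[d] for z in latents) / n for d in range(dims)]  # dataset mean
    per_point = [local_influence(weights, z, baseline) for z in latents]
    return [sum(abs(p[d]) for p in per_point) / n for d in range(dims)]


def rank_factors(names, scores):
    """Sort factors from most to least influential."""
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical composed model g(z) = 2*z_x + z_y over latent factors (z_x, z_y).
weights = [2.0, 1.0]
latents = [[1.0, 3.0], [0.0, 0.0], [2.0, 1.0], [-1.0, 0.0]]

print(local_influence(weights, latents[0], [0.5, 1.0]))  # → [1.0, 2.0]
scores = aggregate_influence(weights, latents)
print(rank_factors(["x", "y"], scores))                  # → [('x', 2.0), ('y', 1.0)]
```

For the first point alone, y dominates (2.0 against 1.0). Averaged over the dataset, x ranks first, which illustrates why the local and aggregate views can disagree.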

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The audits could be paired with fairness metrics to trace how proxies encode protected attributes.
  • If the representations are learned from the same data as the model, circularity might hide some proxies.
  • Extensions to regression or other output types would require only redefining the influence function.

Load-bearing premise

Disentangled representations can be obtained that reliably separate direct features from proxy features in a way that supports accurate influence calculations.

What would settle it

A controlled test on synthetic data with known proxies where the audits fail to rank the correct proxies by influence strength or miss their presence entirely.
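A sketch of such a controlled test follows. Everything here is an assumption for illustration: features (x, y, p) on a deterministic grid with p = 2x, a model that reads x only through p, and exact Shapley values over the latent factors with the dataset mean as baseline. The audit passes if the factor behind the known proxy (x) ranks first. The deliberately broken `bad_decode`, which forgets that p depends on x, shows what a failing audit looks like.

```python
from itertools import combinations, product
from math import factorial


def shapley_values(g, point, baseline):
    """Exact interventional Shapley values (exponential cost; toy sizes only)."""
    n = len(point)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                without_i = [point[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = point[i]
                phi[i] += weight * (g(with_i) - g(without_i))
    return phi


def audit(model, encode, decode, data):
    """Mean |indirect influence| of each latent factor, dataset mean as baseline."""
    latents = [encode(v) for v in data]
    dims = len(latents[0])
    baseline = [sum(z[d] for z in latents) / len(latents) for d in range(dims)]

    def composed(z):
        return model(decode(z))

    per_point = [shapley_values(composed, z, baseline) for z in latents]
    return [sum(abs(p[d]) for p in per_point) / len(per_point) for d in range(dims)]


def model(v):        # hypothetical: uses x only through the proxy p
    return v[1] + v[2]


def encode(v):       # hypothetical latent factors (x, y)
    return [v[0], v[1]]


def good_decode(z):  # knows that p = 2x
    return [z[0], z[1], 2 * z[0]]


def bad_decode(z):   # loses the x -> p link
    return [z[0], z[1], 0.0]


# Synthetic data with a known proxy: p = 2x.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
data = [[x, y, 2 * x] for x, y in product(grid, grid)]

for name, decode in [("good", good_decode), ("bad", bad_decode)]:
    scores = audit(model, encode, decode, data)
    top = ["x", "y"][scores.index(max(scores))]
    print(name, [round(s, 6) for s in scores], "top factor:", top)
```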

Figures

Figures reproduced from arXiv: 1906.08652 by Carlos Scheidegger, Charles T. Marx, Richard Lanas Phillips, Sorelle A. Friedler, Suresh Venkatasubramanian.

Figure 1: System diagram when auditing the indirect …
Figure 2: Synthetic x + y data direct SHAP (left) and indirect (right) feature influences using a handcrafted (top row) or learned disentangled representation (bottom row). The results for the handcrafted disentangled representation (top of …
Figure 3: Errors on the synthetic x + y data for the reconstruction error (left) when taken across influence audits for each feature, prediction error (middle), and disentanglement error (right). These influence experiments on the x + y dataset demonstrate the importance of a good disentangled representation to the quality of the resulting indirect influence measures, since the handcrafted zero-error disentangled r…
Figure 4: dSprites data indirect latent factor influences
Figure 5: The mean squared reconstruction error (left), absolute prediction error (middle), and absolute …
Figure 6: Ten selected features for Adult dataset. Direct (left) and indirect (right) influence are shown. For all …
Figure 7: The reconstruction error (left), prediction error (middle), and disentanglement error (right) of selected …
Figure 8: Comparison on the synthetic x + y data of the disentangled influence audits using the handcrafted (left) or learned (middle) disentangled representation with the BBA approach of [1] (right). Mean over all instances of the absolute value of the per feature disentangled influence. BBA was designed to audit classifiers, so in order to compare to the results of disentangled influence audits we will consider th…
Figure 9: Comparison on the Adult data of the disentan…
Figure 10: The full influence results for the adult data direct (left) and indirect (right) feature influences.
Figure 11: The full disentanglement (top), reconstruction (left) and prediction (right) error metrics for the …
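The reconstruction and prediction errors reported in Figures 3, 5, 7 and 11 can be sketched as follows. Following the Figure 5 caption, reconstruction error is taken as a mean squared error and prediction error as a mean absolute error. Averaging the squared error per feature entry, rather than summing per point, is an assumption. The disentanglement error's definition is not given on this page, so it is left out. The `model`, `encode` and `decode` functions and the three data rows are hypothetical.

```python
def reconstruction_error(data, encode, decode):
    """Mean squared error, per feature entry, between inputs and their round trips."""
    total, count = 0.0, 0
    for v in data:
        v_hat = decode(encode(v))
        total += sum((a - b) ** 2 for a, b in zip(v, v_hat))
        count += len(v)
    return total / count


def prediction_error(data, model, encode, decode):
    """Mean absolute change in the model's output after the round trip."""
    return sum(abs(model(v) - model(decode(encode(v)))) for v in data) / len(data)


def model(v):   # hypothetical audited model over (x, y, p)
    return v[1] + v[2]


def encode(v):  # hypothetical handcrafted encoder: latents (x, y)
    return [v[0], v[1]]


def decode(z):  # assumes p = 2x exactly
    return [z[0], z[1], 2 * z[0]]


# The second row breaks the p = 2x assumption, so it is not reconstructed exactly.
data = [[1.0, 3.0, 2.0], [0.0, 1.0, 0.5], [2.0, 0.0, 4.0]]
print(reconstruction_error(data, encode, decode))    # 0.25 / 9 entries ≈ 0.0278
print(prediction_error(data, model, encode, decode))  # 0.5 / 3 points ≈ 0.1667
```

Both errors reach zero only when the representation round-trips the data exactly. Figure 3's point is that the handcrafted zero-error representation is the reference against which the learned one is judged.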
Original abstract

Motivated by the need to audit complex and black box models, there has been extensive research on quantifying how data features influence model predictions. Feature influence can be direct (a direct influence on model outcomes) and indirect (model outcomes are influenced via proxy features). Feature influence can also be expressed in aggregate over the training or test data or locally with respect to a single point. Current research has typically focused on one of each of these dimensions. In this paper, we develop disentangled influence audits, a procedure to audit the indirect influence of features. Specifically, we show that disentangled representations provide a mechanism to identify proxy features in the dataset, while allowing an explicit computation of feature influence on either individual outcomes or aggregate-level outcomes. We show through both theory and experiments that disentangled influence audits can both detect proxy features and show, for each individual or in aggregate, which of these proxy features affects the classifier being audited the most. In this respect, our method is more powerful than existing methods for ascertaining feature influence.
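One way to write the quantities in the abstract is shown below. The notation is introduced here for illustration and is not taken from the paper. f is the audited model, E and D are the encoder and decoder of a disentangled representation whose i-th latent coordinate tracks feature i, and b is a baseline point. u_S keeps the coordinates of u in S and takes the rest from the baseline, which is b for direct influence and E(b) for indirect influence. d is the number of input features for direct influence and the number of latent factors for indirect influence. The aggregate follows the Figure 8 caption (mean absolute per-instance influence).

```latex
\phi_i(g, u) = \sum_{S \subseteq [d] \setminus \{i\}}
  \frac{|S|!\,(d - |S| - 1)!}{d!}\,\bigl[g(u_{S \cup \{i\}}) - g(u_S)\bigr]

\text{direct (local):}\quad \phi_i(f,\ x)
\qquad
\text{indirect (local):}\quad \phi_i\bigl(f \circ D,\ E(x)\bigr)

\text{aggregate over } x^{(1)}, \dots, x^{(n)}:\quad
\Phi_i = \frac{1}{n} \sum_{k=1}^{n} \bigl|\phi_i\bigl(f \circ D,\ E(x^{(k)})\bigr)\bigr|
```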

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes 'disentangled influence audits,' a procedure that uses disentangled representations to identify proxy features in a dataset and explicitly compute their indirect influence on a black-box classifier's predictions, either locally for individual points or in aggregate. It claims to show via theory and experiments that the method detects proxies and ranks which proxy affects the audited model most, making it more powerful than prior feature-influence techniques that typically address only one dimension (direct/indirect, local/global).

Significance. If the central claims hold, the work would offer a structured mechanism for auditing indirect/proxy influences that current methods do not jointly address, with potential value for fairness auditing and interpretability of complex models. The integration of disentanglement techniques with influence computation is a distinctive contribution, though its practical utility hinges on reliable separation of factors.

major comments (2)
  1. [Abstract] The claim that 'disentangled representations provide a mechanism to identify proxy features' while enabling 'explicit computation of feature influence' is load-bearing, yet the visible text supplies no derivation, equation, or formal statement showing how the disentangled factors map to a decomposition of direct versus indirect influence. Without this, the asserted theoretical support cannot be evaluated.
  2. [Abstract] The assertion of superiority ('our method is more powerful than existing methods') and the experimental validation rest on the unverified assumption that learned disentangled factors cleanly isolate proxy features from direct ones. No mention is made of synthetic-data controls with known ground-truth proxies or quantitative recovery metrics that would confirm the separation is accurate enough for the influence ranking to be reliable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments. Below we respond point-by-point to the major comments, drawing on the full manuscript content and indicating where revisions will strengthen clarity without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'disentangled representations provide a mechanism to identify proxy features' while enabling 'explicit computation of feature influence' is load-bearing, yet the visible text supplies no derivation, equation, or formal statement showing how the disentangled factors map to a decomposition of direct versus indirect influence. Without this, the asserted theoretical support cannot be evaluated.

    Authors: The abstract is a concise summary and therefore omits detailed derivations. Section 3 of the manuscript contains the formal definitions, the mapping from disentangled factors to the direct/indirect influence decomposition, and the associated proofs. We will revise the abstract to include a short parenthetical reference to the key equation in Section 3 so that the theoretical support is signposted from the outset.
    Revision: partial.

  2. Referee: [Abstract] The assertion of superiority ('our method is more powerful than existing methods') and the experimental validation rest on the unverified assumption that learned disentangled factors cleanly isolate proxy features from direct ones. No mention is made of synthetic-data controls with known ground-truth proxies or quantitative recovery metrics that would confirm the separation is accurate enough for the influence ranking to be reliable.

    Authors: Section 5.1 describes synthetic-data experiments that generate datasets with explicitly known proxy relationships and direct features. In these experiments we report quantitative recovery metrics (proxy identification accuracy and rank correlation of influence scores against ground truth) that verify the disentanglement isolates the proxies sufficiently for the subsequent influence ranking. These results directly support the superiority claim under controlled conditions. We will add an explicit sentence to the abstract (or a footnote) highlighting the synthetic controls and metrics.
    Revision: yes.
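The rebuttal above is simulated, and the page does not say which rank correlation would be used. Spearman's coefficient is one standard choice for comparing estimated influence scores with ground truth. The sketch below computes it as the Pearson correlation of average ranks, so ties are handled. The score vectors are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5


# Hypothetical ground-truth and estimated influence scores for four features.
truth = [2.0, 1.0, 0.0, 0.5]
print(spearman(truth, [1.8, 1.1, 0.1, 0.3]))  # → 1.0 (same ordering)
print(spearman(truth, [1.1, 1.8, 0.1, 0.3]))  # → 0.8 (top two swapped)
```

A rank correlation ignores the scale of the scores, which suits the claim being tested: that the audit orders proxies correctly, not that it reproduces their exact magnitudes.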

Circularity Check

0 steps flagged

No circularity; method relies on external disentanglement techniques without self-referential reduction

Full rationale

The provided abstract and description contain no equations, derivations, or self-citations that reduce the claimed results to fitted parameters or definitions by construction. The approach is presented as building on existing disentanglement research to enable audits, with theory and experiments offered as validation; no load-bearing step is shown to collapse into its own inputs. This is the expected self-contained case where the derivation chain does not exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach depends on the learnability of disentangled representations that isolate proxy effects; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Disentangled representations exist and can be learned to separate direct and proxy features
    Central to identifying proxies and computing their influence
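A crude way to probe this assumption on a given representation is to check whether a latent factor that should ignore feature i still correlates with it. The sketch below uses the absolute Pearson correlation. This only detects linear leakage, and it is not the paper's disentanglement error, whose definition is not given on this page. Both encoders and the grid data are hypothetical.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def leakage(encode, data, feature, factor):
    """|corr| between a raw feature and a latent factor that should ignore it."""
    return abs(pearson([v[feature] for v in data], [encode(v)[factor] for v in data]))


grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
data = [[x, y, 2 * x] for x in grid for y in grid]  # p = 2x is a proxy for x


def clean_encode(v):      # hypothetical: factor 1 is y alone
    return [v[0], v[1]]


def entangled_encode(v):  # hypothetical: factor 1 mixes y with the proxy p
    return [v[0], v[1] + v[2]]


print(round(leakage(clean_encode, data, feature=0, factor=1), 3))      # → 0.0
print(round(leakage(entangled_encode, data, feature=0, factor=1), 3))  # → 0.894
```

In the entangled encoder, the factor meant to capture only y also carries x through the proxy p. An indirect audit built on it would split x's influence across two factors.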

pith-pipeline@v0.9.0 · 5726 in / 1091 out tokens · 23022 ms · 2026-05-25T19:42:14.831484+00:00 · methodology



Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 3 internal anchors

  [1] P. Adler, C. Falk, S. A. Friedler, T. Nix, G. Rybeck, C. Scheidegger, B. Smith, and S. Venkatasubramanian. Auditing black-box models for indirect influence. Knowledge and Information Systems, 54(1):95–122, 2018

  [2] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. International Conference on Learning Representations, 2016

  [3] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013

  [4] A. Datta, S. Sen, and Y. Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Proceedings of 37th IEEE Symposium on Security and Privacy, 2016

  [5] H. Edwards and A. Storkey. Censoring representations with an adversary. In Proceedings of the 33rd International Conference on Machine Learning, 2016

  [6] B. Esmaeili, H. Wu, S. Jain, A. Bozkurt, N. Siddharth, B. Paige, D. H. Brooks, J. Dy, and J.-W. van de Meent. Structured disentangled representations. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89, pages 2525–2534. PMLR, 16–18 Apr 2019

  [7] S. A. Friedler, C. Scheidegger, S. Venkatasubramanian, S. Choudhary, E. P. Hamilton, and D. Roth. A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 329–338. ACM, 2019

  [8] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5):93, 2018

  [9] A. Henelius, K. Puolamäki, H. Boström, L. Asker, and P. Papapetrou. A peek into the black box: exploring classifiers by randomization. Data Min Knowl Disc, 28:1503–1529, 2014

  [10] I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, and A. Lerchner. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018

  [11] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1885–1894. JMLR.org, 2017

  [12] A. Kumar, P. Sattigeri, and A. Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. International Conference on Learning Representations, 2017

  [13] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017

  [14] D. Madras, E. Creager, T. Pitassi, and R. Zemel. Learning adversarially fair and transferable representations. In Proceedings of the 35th International Conference on Machine Learning, 2018

  [15] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015

  [16] L. Matthey, I. Higgins, D. Hassabis, and A. Lerchner. dsprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017

  [17] C. Molnar. Interpretable machine learning: A guide for making black box models explainable. Christoph Molnar, Leanpub, 2018

  [18] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proc. ACM KDD, 2016

  [19] M. Tschannen, O. Bachem, and M. Lucic. Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069, 2018

  [20] I. M. L. R. University of California. Adult income dataset. https://archive.ics.uci.edu/ml/datasets/adult

  [21] B. Ustun, A. Spangher, and Y. Liu. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 10–19. ACM, 2019

A. Implementation Details. Synthetic x + y model and disentangled representation informat...