Disentangling Influence: Using Disentangled Representations to Audit Model Predictions
Pith reviewed 2026-05-25 19:42 UTC · model grok-4.3
The pith
Disentangled representations let auditors measure indirect proxy influences on black-box model predictions for single points or in aggregate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Disentangled influence audits use disentangled representations to identify proxy features and compute their explicit influence on model predictions, either for each individual outcome or in aggregate over the data. Theory and experiments demonstrate that the audits detect proxy features and identify which ones affect the audited classifier most, making the method more powerful than existing approaches for ascertaining feature influence.
What carries the argument
Disentangled representations that isolate proxy features from direct ones, enabling separate computation of indirect influence on model outputs.
If this is right
- Audits can flag the strongest proxy drivers for any single prediction.
- Aggregate results can reveal overall proxy patterns across an entire dataset.
- The same framework applies to influences measured on training data or test data.
- Multiple proxies can be compared directly by their computed influence values.
Where Pith is reading between the lines
- The audits could be paired with fairness metrics to trace how proxies encode protected attributes.
- If the representations are learned from the same data as the model, circularity might hide some proxies.
- Extensions to regression or other output types would require only redefining the influence function.
Load-bearing premise
Disentangled representations can be obtained that reliably separate direct features from proxy features in a way that supports accurate influence calculations.
What would settle it
A controlled test on synthetic data with known proxies where the audits fail to rank the correct proxies by influence strength or miss their presence entirely.
Figures
read the original abstract
Motivated by the need to audit complex and black box models, there has been extensive research on quantifying how data features influence model predictions. Feature influence can be direct (a direct influence on model outcomes) and indirect (model outcomes are influenced via proxy features). Feature influence can also be expressed in aggregate over the training or test data or locally with respect to a single point. Current research has typically focused on one of each of these dimensions. In this paper, we develop disentangled influence audits, a procedure to audit the indirect influence of features. Specifically, we show that disentangled representations provide a mechanism to identify proxy features in the dataset, while allowing an explicit computation of feature influence on either individual outcomes or aggregate-level outcomes. We show through both theory and experiments that disentangled influence audits can both detect proxy features and show, for each individual or in aggregate, which of these proxy features affects the classifier being audited the most. In this respect, our method is more powerful than existing methods for ascertaining feature influence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes 'disentangled influence audits,' a procedure that uses disentangled representations to identify proxy features in a dataset and explicitly compute their indirect influence on a black-box classifier's predictions, either locally for individual points or in aggregate. It claims to show via theory and experiments that the method detects proxies and ranks which proxy affects the audited model most, making it more powerful than prior feature-influence techniques that typically address only one dimension (direct/indirect, local/global).
Significance. If the central claims hold, the work would offer a structured mechanism for auditing indirect/proxy influences that current methods do not jointly address, with potential value for fairness auditing and interpretability of complex models. The integration of disentanglement techniques with influence computation is a distinctive contribution, though its practical utility hinges on reliable separation of factors.
major comments (2)
- [Abstract] Abstract: The claim that 'disentangled representations provide a mechanism to identify proxy features' while enabling 'explicit computation of feature influence' is load-bearing, yet the visible text supplies no derivation, equation, or formal statement showing how the disentangled factors map to a decomposition of direct versus indirect influence. Without this, the asserted theoretical support cannot be evaluated.
- [Abstract] Abstract: The assertion of superiority ('our method is more powerful than existing methods') and the experimental validation rest on the unverified assumption that learned disentangled factors cleanly isolate proxy features from direct ones. No mention is made of synthetic-data controls with known ground-truth proxies or quantitative recovery metrics that would confirm the separation is accurate enough for the influence ranking to be reliable.
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive comments. Below we respond point-by-point to the major comments, drawing on the full manuscript content and indicating where revisions will strengthen clarity without altering the core claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claim that 'disentangled representations provide a mechanism to identify proxy features' while enabling 'explicit computation of feature influence' is load-bearing, yet the visible text supplies no derivation, equation, or formal statement showing how the disentangled factors map to a decomposition of direct versus indirect influence. Without this, the asserted theoretical support cannot be evaluated.
Authors: The abstract is a concise summary and therefore omits detailed derivations. Section 3 of the manuscript contains the formal definitions, the mapping from disentangled factors to the direct/indirect influence decomposition, and the associated proofs. We will revise the abstract to include a short parenthetical reference to the key equation in Section 3 so that the theoretical support is signposted from the outset. revision: partial
-
Referee: [Abstract] Abstract: The assertion of superiority ('our method is more powerful than existing methods') and the experimental validation rest on the unverified assumption that learned disentangled factors cleanly isolate proxy features from direct ones. No mention is made of synthetic-data controls with known ground-truth proxies or quantitative recovery metrics that would confirm the separation is accurate enough for the influence ranking to be reliable.
Authors: Section 5.1 describes synthetic-data experiments that generate datasets with explicitly known proxy relationships and direct features. In these experiments we report quantitative recovery metrics (proxy identification accuracy and rank correlation of influence scores against ground truth) that verify the disentanglement isolates the proxies sufficiently for the subsequent influence ranking. These results directly support the superiority claim under controlled conditions. We will add an explicit sentence to the abstract (or a footnote) highlighting the synthetic controls and metrics. revision: yes
Circularity Check
No circularity; method relies on external disentanglement techniques without self-referential reduction
full rationale
The provided abstract and description contain no equations, derivations, or self-citations that reduce the claimed results to fitted parameters or definitions by construction. The approach is presented as building on existing disentanglement research to enable audits, with theory and experiments offered as validation; no load-bearing step is shown to collapse into its own inputs. This is the expected self-contained case where the derivation chain does not exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Disentangled representations exist and can be learned to separate direct and proxy features
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean (distinguishability floor)absolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
adversarial autoencoders... LEnc = MSE(x, x̂) − β MSE(p, p̂)... f disentangles p from the other features
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
-
[2]
A. A. Alemi, I. Fischer, J. V . Dillon, and K. Murphy. Deep variational information bottleneck. International Conference on Learning Representations, 2016
work page 2016
- [3]
- [4]
-
[5]
H. Edwards and A. Storkey. Censoring representations with an adversary. In Proceedings of the 33th International Conference on Machine Learning, 2016
work page 2016
-
[6]
B. Esmaeili, H. Wu, S. Jain, A. Bozkurt, N. Siddharth, B. Paige, D. H. Brooks, J. Dy, and J.-W. van de Meent. Structured disentangled representations. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89, pages 2525–2534. PMLR, 16–18 Apr 2019
work page 2019
-
[7]
S. A. Friedler, C. Scheidegger, S. Venkatasubramanian, S. Choudhary, E. P. Hamilton, and D. Roth. A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 329–338. ACM, 2019
work page 2019
-
[8]
R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi. A survey of methods for explaining black box models. ACM computing surveys (CSUR), 51(5):93, 2018
work page 2018
-
[9]
A. Henelius, K. Puolamäki, H. Boström, L. Asker, and P. Papapetrou. A peek into the black box: exploring classifiers by randomization. Data Min Knowl Disc, 28:1503–1529, 2014
work page 2014
-
[10]
Towards a Definition of Disentangled Representations
I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, and A. Lerchner. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[11]
P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1885–1894. JMLR. org, 2017
work page 2017
- [12]
-
[13]
S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017
work page 2017
- [14]
-
[15]
A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[16]
L. Matthey, I. Higgins, D. Hassabis, and A. Lerchner. dsprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017
work page 2017
-
[17]
C. Molnar. Interpretable machine learning: A guide for making black box models explainable. Christoph Molnar, Leanpub, 2018
work page 2018
-
[18]
M. T. Ribeiro, S. Singh, and C. Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proc. ACM KDD, 2016
work page 2016
-
[19]
Recent Advances in Autoencoder-Based Representation Learning
M. Tschannen, O. Bachem, and M. Lucic. Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[20]
I. M. L. R. University of California. Adult income dataset. https://archive.ics.uci.edu/ml/datasets/ adult
-
[21]
B. Ustun, A. Spangher, and Y . Liu. Actionable recourse in linear classification. InProceedings of the Conference on Fairness, Accountability, and Transparency, pages 10–19. ACM, 2019. 9 Disentangling Influence: Using disentangled representations to audit model predictions A Implementation Details Syntheticx +y model and disentangled representation informat...
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.