pith. machine review for the scientific record.

arxiv: 2605.02962 · v1 · submitted 2026-05-03 · 💻 cs.LG · stat.CO · stat.ML

Recognition: unknown

ISAAC: Auditing Causal Reasoning in Deep Models for Drug-Target Interaction

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:39 UTC · model grok-4.3

classification 💻 cs.LG · stat.CO · stat.ML
keywords drug-target interaction · causal reasoning · model auditing · deep learning · intervention-based evaluation · structural sensitivity · sequence-based models · Davis benchmark

The pith

ISAAC reveals that deep learning models for drug-target prediction can differ substantially in causal reasoning even when their accuracy is nearly the same.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep learning models for drug-target interaction prediction often achieve strong benchmark performance without relying on mechanistically meaningful molecular features. Standard accuracy metrics such as AUROC cannot detect this limitation. The ISAAC framework evaluates models by applying matched mechanistic and spurious input-level interventions to frozen networks and computing a structural sensitivity score independent of predictive accuracy. When applied to three sequence-based architectures on the Davis benchmark, ISAAC identifies approximately 25 percent relative differences in reasoning scores across models whose AUROC values differ by only around 3 percent. These discrepancies remain stable across training seeds, intervention seeds, and two distinct perturbation operators.
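
The mechanics are concrete enough to sketch. Below is a minimal Python illustration of such an audit loop, assuming a frozen model callable that returns predicted interaction probabilities and two user-supplied perturbation operators; the function names, the absolute-shift measure, and the normalized contrast are assumptions for illustration, not the paper's exact operators or score.

    import numpy as np

    def structural_sensitivity(model, inputs, mech_perturb, spur_perturb,
                               n_draws=32, seed=0):
        # Probe a frozen model with matched mechanistic vs. spurious input
        # perturbations and contrast the induced prediction shifts.
        rng = np.random.default_rng(seed)
        base = model(inputs)  # predictions on unmodified inputs
        mech_shifts, spur_shifts = [], []
        for _ in range(n_draws):  # average over intervention seeds
            s = int(rng.integers(1 << 31))
            mech_shifts.append(np.abs(model(mech_perturb(inputs, seed=s)) - base))
            spur_shifts.append(np.abs(model(spur_perturb(inputs, seed=s)) - base))
        mech = float(np.mean(mech_shifts))  # sensitivity to mechanistic edits
        spur = float(np.mean(spur_shifts))  # sensitivity to spurious edits
        # A mechanistically grounded model should move under mechanistic
        # interventions and stay put under matched spurious ones.
        return (mech - spur) / (mech + spur + 1e-12)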

Core claim

ISAAC is a post-hoc framework that evaluates prior-relative structural sensitivity by probing frozen models through matched mechanistic and spurious input-level interventions, independently of predictive accuracy. Applied to three sequence-based DTI architectures on the Davis benchmark, it reveals approximately 25 percent relative differences in reasoning scores across models with comparable AUROC within around 3 percent, with stability across training and intervention seeds and two distinct perturbation operators. These discrepancies, undetectable under conventional accuracy metrics, motivate the use of post-hoc structural auditing as a complement to standard performance evaluation in scientific machine learning for molecular modeling.

What carries the argument

ISAAC (Intervention-based Structural Auditing Approach for Causal Reasoning): a post-hoc framework that probes frozen models with matched mechanistic and spurious input-level interventions to measure structural sensitivity independent of accuracy.

If this is right

  • Models with nearly identical AUROC can exhibit meaningfully different reliance on mechanistic versus spurious features.
  • Conventional accuracy metrics alone are insufficient to certify the scientific validity of molecular prediction models.
  • Post-hoc auditing for structural sensitivity provides a practical complement to benchmark performance in scientific machine learning.
  • The observed stability of reasoning scores across seeds and operators supports treating ISAAC scores as reproducible model properties.
  • Sequence-based DTI architectures are not interchangeable even when their predictive accuracy matches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If higher ISAAC reasoning scores correlate with better performance on unseen compound classes, the framework could serve as a selection criterion during model development.
  • Adapting the same intervention-matching logic to graph-based or 3D-structure DTI models would test whether the 25 percent gap generalizes beyond sequence inputs.
  • In drug discovery pipelines, models with lower ISAAC scores might be flagged for additional mechanistic validation before use in virtual screening.
  • Extending ISAAC-style audits to related tasks such as protein-ligand binding affinity or toxicity prediction could expose similar hidden reasoning differences.

Load-bearing premise

The chosen input-level interventions can be reliably labeled as mechanistic versus spurious in a matched way that isolates causal reasoning in sequence-based DTI models.

What would settle it

Repeating the full ISAAC evaluation on the same three models and finding either that reasoning scores show no relative differences exceeding a few percent, or that results vary strongly with the choice of perturbation operator, would falsify the reported discrepancies and stability.
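
Operationally, this is a replication check. A minimal sketch follows, assuming reasoning scores are collected per (training seed, intervention seed, operator) configuration for the three models; the keying scheme and thresholds are illustrative assumptions, not values from the paper.

    import numpy as np

    def relative_gap(scores):
        # Spread of per-model reasoning scores as a fraction of their mean;
        # the paper's headline figure is ~25% across three architectures.
        scores = np.asarray(scores, dtype=float)
        return (scores.max() - scores.min()) / abs(scores.mean())

    def is_falsified(score_runs, small=0.05, operator_ratio=2.0):
        # score_runs maps (train_seed, interv_seed, operator) to the list of
        # reasoning scores for the three models under that replication.
        gaps = {cfg: relative_gap(s) for cfg, s in score_runs.items()}
        all_small = all(g < small for g in gaps.values())  # gaps vanish
        by_op = {}
        for (_, _, op), g in gaps.items():
            by_op.setdefault(op, []).append(g)
        means = [float(np.mean(v)) for v in by_op.values()]
        # Strong operator dependence would undermine the stability claim.
        operator_dependent = len(means) > 1 and max(means) > operator_ratio * min(means)
        return all_small or operator_dependent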

read the original abstract

Deep learning models for drug–target interaction (DTI) prediction often achieve strong benchmark performance without necessarily relying on mechanistically meaningful molecular features, a limitation that standard accuracy-based evaluation cannot detect. We introduce ISAAC (Intervention-based Structural Auditing Approach for Causal Reasoning), a post-hoc framework that evaluates prior-relative structural sensitivity by probing frozen models through matched mechanistic and spurious input-level interventions, independently of predictive accuracy. Applied to three sequence-based DTI architectures on the Davis benchmark, ISAAC reveals approximately 25% relative differences in reasoning scores across models with comparable AUROC (within around 3%), stable across training and intervention seeds and two distinct perturbation operators. These discrepancies, undetectable under conventional accuracy metrics, motivate the use of post-hoc structural auditing as a complement to standard performance evaluation in scientific machine learning for molecular modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ISAAC, a post-hoc auditing framework that probes frozen sequence-based DTI models with matched mechanistic and spurious input-level interventions to compute a reasoning score measuring prior-relative structural sensitivity, independent of predictive accuracy. On the Davis benchmark, it reports that three architectures with comparable AUROC (within ~3%) exhibit ~25% relative differences in reasoning scores, with stability across training/intervention seeds and two perturbation operators.

Significance. If the intervention labeling and score computation are shown to isolate causal reasoning without confounding by sequence properties, the work would usefully demonstrate that accuracy metrics alone miss important mechanistic differences in molecular ML models. The empirical gap between AUROC parity and reasoning-score divergence, plus seed stability, would be a concrete contribution to post-hoc auditing in scientific ML.

major comments (3)
  1. [Methods] Methods section on intervention construction and labeling: the partitioning of perturbations into mechanistic versus spurious classes lacks any quantitative validation (e.g., inter-rater agreement, ablation removing the label step, or checks against correlated features such as sequence length or hydrophobicity). This labeling step is load-bearing for the central claim that observed reasoning-score gaps reflect causal reasoning rather than sensitivity to other input statistics.
  2. [Results / Methods] Results and Methods on reasoning score: no explicit equation, pseudocode, or derivation is supplied for how the reasoning score is computed from the intervention outcomes (e.g., how prior-relative structural sensitivity is aggregated or normalized). Without this, it is impossible to verify the claimed independence from accuracy or to reproduce the reported 25% relative differences.
  3. [Experiments] Experiments on Davis benchmark: the headline result (25% reasoning-score gap with ~3% AUROC parity) is presented without an ablation that holds the label assignment fixed while varying only the model architecture, leaving open whether the gap arises from the models' causal sensitivity or from differential sensitivity to the particular perturbation operators chosen.
minor comments (2)
  1. [Abstract] Abstract and introduction: the phrase 'prior-relative structural sensitivity' is used without a concise definition or reference to the precise quantity being measured.
  2. [Figures / Tables] Figure captions and tables: stability across seeds is asserted but the number of seeds and exact variance values are not reported in the main text or captions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address each major comment point by point below, indicating revisions where appropriate to improve the clarity and rigor of the manuscript.

read point-by-point responses
  1. Referee: [Methods] Methods section on intervention construction and labeling: the partitioning of perturbations into mechanistic versus spurious classes lacks any quantitative validation (e.g., inter-rater agreement, ablation removing the label step, or checks against correlated features such as sequence length or hydrophobicity). This labeling step is load-bearing for the central claim that observed reasoning-score gaps reflect causal reasoning rather than sensitivity to other input statistics.

    Authors: The intervention labeling is grounded in domain expertise from molecular biology: mechanistic perturbations target residues known from prior literature to influence drug binding affinity, while spurious ones do not. We recognize the value of additional validation and will include in the revision: (1) correlation checks with sequence features such as length and hydrophobicity to rule out confounding (a sketch of such a check follows these responses), and (2) an ablation that bypasses the labeling to assess its impact. Inter-rater agreement can be added if we consult additional experts, though the current labeling follows a deterministic rule based on binding site annotations. revision: partial

  2. Referee: [Results / Methods] Results and Methods on reasoning score: no explicit equation, pseudocode, or derivation is supplied for how the reasoning score is computed from the intervention outcomes (e.g., how prior-relative structural sensitivity is aggregated or normalized). Without this, it is impossible to verify the claimed independence from accuracy or to reproduce the reported 25% relative differences.

    Authors: We apologize for the omission of the explicit formulation. The reasoning score is computed as RS = (P_mech - P_spur) / P_prior, where P denotes the model's predicted interaction probability under each condition, aggregated over multiple interventions and normalized to ensure independence from baseline accuracy. We will add the full equation, pseudocode for the computation, and a derivation showing why this isolates structural sensitivity to the revised Methods section (a minimal sketch of the stated formula follows these responses). revision: yes

  3. Referee: [Experiments] Experiments on Davis benchmark: the headline result (25% reasoning-score gap with ~3% AUROC parity) is presented without an ablation that holds the label assignment fixed while varying only the model architecture, leaving open whether the gap arises from the models' causal sensitivity or from differential sensitivity to the particular perturbation operators chosen.

    Authors: The experimental design applies the identical set of labeled interventions to all three model architectures on the Davis benchmark, thereby holding the label assignment fixed while varying only the model. This is already the case in the reported results. To make this explicit, we will add a sentence in the Experiments section clarifying that the intervention set is shared across models. The reported stability across two perturbation operators further supports that the gaps are not due to operator choice. revision: partial
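
On the first response, the promised correlation check is straightforward to prototype. A minimal sketch, assuming perturbed protein sequences and scalar feature functions; the feature choices follow the referee's examples, and the two-sample Kolmogorov-Smirnov test is an assumption, not the authors' stated method:

    import numpy as np
    from scipy import stats

    def confound_check(mech_inputs, spur_inputs, feature_fns):
        # Compare summary-feature distributions between the mechanistic and
        # spurious intervention sets; well-matched sets should look alike on
        # features irrelevant to the mechanism (large p-values).
        report = {}
        for name, fn in feature_fns.items():
            mech_vals = [fn(x) for x in mech_inputs]
            spur_vals = [fn(x) for x in spur_inputs]
            stat, p = stats.ks_2samp(mech_vals, spur_vals)
            report[name] = {"ks_stat": float(stat), "p_value": float(p)}
        return report

    # Illustrative usage (mean_hydrophobicity is a hypothetical helper):
    # confound_check(mech_seqs, spur_seqs,
    #                {"length": len, "hydrophobicity": mean_hydrophobicity})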
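
On the second response, the quoted formula translates directly. A minimal sketch of RS = (P_mech - P_spur) / P_prior, assuming one prediction per matched intervention pair; simple averaging is an assumption, since the rebuttal does not specify the aggregation or normalization:

    import numpy as np

    def reasoning_score(p_prior, p_mech, p_spur):
        # p_prior: predicted interaction probability on the unperturbed input.
        # p_mech, p_spur: predictions under the matched mechanistic and
        # spurious interventions, one entry per intervention pair.
        p_mech = np.asarray(p_mech, dtype=float)
        p_spur = np.asarray(p_spur, dtype=float)
        return float(np.mean((p_mech - p_spur) / p_prior))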

Circularity Check

0 steps flagged

No significant circularity detected in the ISAAC derivation chain.

full rationale

The ISAAC framework is presented as a post-hoc auditing method applied to already-trained, frozen models. It computes reasoning scores via sensitivity to explicitly defined input-level interventions (mechanistic vs. spurious) that are independent of the models' predictive accuracy or training loss. No equations, fitted parameters, or self-referential definitions are shown that would make the reported 25% relative score differences equivalent to the input data or labels by construction. The abstract explicitly states independence from accuracy metrics and stability across seeds, with no load-bearing self-citations or ansatzes invoked to force the result. The central empirical claim therefore remains a non-tautological observation rather than a renaming or re-derivation of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that mechanistic and spurious interventions can be defined and matched at the input-sequence level for DTI models; no free parameters or invented physical entities are mentioned.

axioms (1)
  • domain assumption · Input-level perturbations can be partitioned into matched mechanistic and spurious sets that isolate causal reasoning in frozen sequence-based DTI models.
    This partition is required for the reasoning score to measure structural sensitivity rather than generic sensitivity.
invented entities (1)
  • Reasoning score · no independent evidence
    purpose: Quantifies relative structural sensitivity to mechanistic versus spurious interventions
    New derived quantity introduced by the ISAAC framework; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5443 in / 1383 out tokens · 49276 ms · 2026-05-10T15:39:34.897040+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    Shortcut learning in deep neural networks

    Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R. S., Brendel, W., Bethge, M., and Wichmann, F. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2:665--673, 2020

  2. [2]

    Adversarial examples are not bugs, they are features

    Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., and Madry, A. Adversarial examples are not bugs, they are features. In NeurIPS, 2019

  3. [3]

    Why normalizing flows fail to detect out-of-distribution data

    Kirichenko, P., Izmailov, P., and Wilson, A. G. Why normalizing flows fail to detect out-of-distribution data. In NeurIPS, 2020

  4. [4]

    Invariant Risk Minimization

    Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019

  5. [5]

    A unified approach to interpreting model predictions

    Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. In NeurIPS, pp. 4768--4777, 2017

  6. [6]

    Causal inference by using invariant prediction

    Peters, J., Bühlmann, P., and Meinshausen, N. Causal inference by using invariant prediction. JRSS B, 78(5):947--1012, 2016

  7. [7]

    Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

    Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019

  8. [8]

    Measuring robustness to natural distribution shifts

    Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., and Schmidt, L. Measuring robustness to natural distribution shifts. In NeurIPS, 2020

  9. [9]

    Causality

    Pearl, J. Causality. Cambridge University Press, 2nd edition, 2009

  10. [10]

    Toward causal representation learning

    Sch\"olkopf, B., Locatello, F., Bauer, S., et al. Toward causal representation learning. Proceedings of the IEEE, 109:612--634, 2021

  11. [11]

    Weakly-supervised disentanglement without compromises

    Locatello, F., Poole, B., Rätsch, G., Schölkopf, B., Bachem, O., and Tschannen, M. Weakly-supervised disentanglement without compromises. In ICML, 2020

  12. [12]

    Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

    Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks. arXiv preprint arXiv:1312.6034, 2014

  13. [13]

    Axiomatic attribution for deep networks

    Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In ICML, pp. 3319--3328, 2017

  14. [14]

    Why should I trust you?

    Ribeiro, M. T., Singh, S., and Guestrin, C. Why should I trust you? In KDD, pp. 1135--1144, 2016

  15. [15]

    Towards A Rigorous Science of Interpretable Machine Learning

    Doshi-Velez, F. and Kim, B. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017

  16. [16]

    Sanity checks for saliency maps

    Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In NeurIPS, 2018

  17. [17]

    RE-IMAGINE: Symbolic benchmark synthesis for reasoning evaluation

    Xu, X., et al. RE-IMAGINE: Symbolic benchmark synthesis for reasoning evaluation. In ICML, 2025

  18. [18]

    LLMScan: Causal scan for LLM misbehavior detection

    Zhang, M., et al. LLMScan: Causal scan for LLM misbehavior detection. In ICML, 2025

  19. [19]

    Ozt\"urk, H., \

    \"Ozt\"urk, H., \"Ozg\"ur, A., and \"Ozkirimli, E. DeepDTA: Deep drug--target binding affinity prediction. Bioinformatics, 34:i821--i829, 2018

  20. [20]

    Interpretable drug target prediction

    Gao, K. Y., et al. Interpretable drug target prediction. In IJCAI, pp. 3371--3377, 2018

  21. [21]

    Deep learning for drug-drug interaction prediction

    Li, X., et al. Deep learning for drug-drug interaction prediction. Quantitative Biology, 12:30--52, 2024

  22. [22]

    MolTrans: Molecular interaction transformer

    Huang, K., Xiao, C., Glass, L., and Sun, J. MolTrans: Molecular interaction transformer. Bioinformatics, 37:830--836, 2020

  23. [23]

    GraphDTA

    Nguyen, T., et al. GraphDTA. Bioinformatics, 37:1140--1147, 2020

  24. [24]

    DeepConv-DTI

    Lee, I., Keum, J., and Nam, H. DeepConv-DTI. PLOS Computational Biology, 15, 2019

  25. [25]

    TAPB

    Lin, G., et al. TAPB. Nature Communications, 16, 2025

  26. [26]

    TransformerCPI

    Chen, L., et al. TransformerCPI. Bioinformatics, 36:4406--4414, 2020

  27. [27]

    DrugBAN

    Bai, P., et al. DrugBAN. Nature Machine Intelligence, 5:126--136, 2022

  28. [28]

    KLIFS database

    Kooistra, A. J., et al. KLIFS database. Nucleic Acids Research, 44:D365--D371, 2015

  29. [29]

    Veridical data science

    Yu, B. and Kumbier, K. Veridical data science. PNAS, 117:3920--3929, 2020

  30. [30]

    Causal abstractions of neural networks

    Geiger, A., et al. Causal abstractions of neural networks. In NeurIPS, 2021

  31. [31]

    Investigating gender bias using causal mediation

    Vig, J., et al. Investigating gender bias using causal mediation. In NeurIPS, 2020

  32. [32]

    Perturbation-based methods for explaining neural networks

    Ivanovs, M., Kadikis, R., and Ozols, K. Perturbation-based methods for explaining neural networks. Pattern Recognition Letters, 150:228--234, 2021