pith. machine review for the scientific record.

arxiv: 2605.02962 · v1 · submitted 2026-05-03 · 💻 cs.LG · stat.CO · stat.ML

Recognition: unknown

ISAAC: Auditing Causal Reasoning in Deep Models for Drug-Target Interaction

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:39 UTC · model grok-4.3

classification 💻 cs.LG · stat.CO · stat.ML
keywords drug-target interaction · causal reasoning · model auditing · deep learning · intervention-based evaluation · structural sensitivity · sequence-based models · Davis benchmark

The pith

ISAAC reveals that deep learning models for drug-target prediction can differ substantially in causal reasoning even when their accuracy is nearly the same.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep learning models for drug-target interaction prediction often achieve strong benchmark performance without relying on mechanistically meaningful molecular features. Standard accuracy metrics such as AUROC cannot detect this limitation. The ISAAC framework evaluates models by applying matched mechanistic and spurious input-level interventions to frozen networks and computing a structural sensitivity score independent of predictive accuracy. When applied to three sequence-based architectures on the Davis benchmark, ISAAC identifies approximately 25 percent relative differences in reasoning scores across models whose AUROC values differ by only around 3 percent. These discrepancies remain stable across training seeds, intervention seeds, and two distinct perturbation operators.
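
The mechanics are concrete enough to sketch. Below is a minimal Python illustration of such an audit loop, assuming a frozen model callable that returns predicted interaction probabilities and two user-supplied perturbation operators; the function names, the absolute-shift measure, and the normalized contrast are assumptions for illustration, not the paper's exact operators or score.

    import numpy as np

    def structural_sensitivity(model, inputs, mech_perturb, spur_perturb,
                               n_draws=32, seed=0):
        # Probe a frozen model with matched mechanistic vs. spurious input
        # perturbations and contrast the induced prediction shifts.
        rng = np.random.default_rng(seed)
        base = model(inputs)  # predictions on unmodified inputs
        mech_shifts, spur_shifts = [], []
        for _ in range(n_draws):  # average over intervention seeds
            s = int(rng.integers(1 << 31))
            mech_shifts.append(np.abs(model(mech_perturb(inputs, seed=s)) - base))
            spur_shifts.append(np.abs(model(spur_perturb(inputs, seed=s)) - base))
        mech = float(np.mean(mech_shifts))  # sensitivity to mechanistic edits
        spur = float(np.mean(spur_shifts))  # sensitivity to spurious edits
        # A mechanistically grounded model should move under mechanistic
        # interventions and stay put under matched spurious ones.
        return (mech - spur) / (mech + spur + 1e-12)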

Core claim

ISAAC is a post-hoc framework that evaluates prior-relative structural sensitivity by probing frozen models through matched mechanistic and spurious input-level interventions, independently of predictive accuracy. Applied to three sequence-based DTI architectures on the Davis benchmark, it reveals approximately 25 percent relative differences in reasoning scores across models with comparable AUROC within around 3 percent, with stability across training and intervention seeds and two distinct perturbation operators. These discrepancies, undetectable under conventional accuracy metrics, motivate the use of post-hoc structural auditing as a complement to standard performance evaluation in scientific machine learning for molecular modeling.

What carries the argument

ISAAC (Intervention-based Structural Auditing Approach for Causal Reasoning): a post-hoc framework that probes frozen models with matched mechanistic and spurious input-level interventions to measure structural sensitivity independent of accuracy.

If this is right

  • Models with nearly identical AUROC can exhibit meaningfully different reliance on mechanistic versus spurious features.
  • Conventional accuracy metrics alone are insufficient to certify the scientific validity of molecular prediction models.
  • Post-hoc auditing for structural sensitivity provides a practical complement to benchmark performance in scientific machine learning.
  • The observed stability of reasoning scores across seeds and operators supports treating ISAAC scores as reproducible model properties.
  • Sequence-based DTI architectures are not interchangeable even when their predictive accuracy matches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If higher ISAAC reasoning scores correlate with better performance on unseen compound classes, the framework could serve as a selection criterion during model development.
  • Adapting the same intervention-matching logic to graph-based or 3D-structure DTI models would test whether the 25 percent gap generalizes beyond sequence inputs.
  • In drug discovery pipelines, models with lower ISAAC scores might be flagged for additional mechanistic validation before use in virtual screening.
  • Extending ISAAC-style audits to related tasks such as protein-ligand binding affinity or toxicity prediction could expose similar hidden reasoning differences.

Load-bearing premise

The chosen input-level interventions can be reliably labeled as mechanistic versus spurious in a matched way that isolates causal reasoning in sequence-based DTI models.

What would settle it

Repeating the full ISAAC evaluation on the same three models and finding either that reasoning scores show no relative differences exceeding a few percent, or that results vary strongly with the choice of perturbation operator, would falsify the reported discrepancies and stability.
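
Operationally, this is a replication check. A minimal sketch follows, assuming reasoning scores are collected per (training seed, intervention seed, operator) configuration for the three models; the keying scheme and thresholds are illustrative assumptions, not values from the paper.

    import numpy as np

    def relative_gap(scores):
        # Spread of per-model reasoning scores as a fraction of their mean;
        # the paper's headline figure is ~25% across three architectures.
        scores = np.asarray(scores, dtype=float)
        return (scores.max() - scores.min()) / abs(scores.mean())

    def is_falsified(score_runs, small=0.05, operator_ratio=2.0):
        # score_runs maps (train_seed, interv_seed, operator) to the list of
        # reasoning scores for the three models under that replication.
        gaps = {cfg: relative_gap(s) for cfg, s in score_runs.items()}
        all_small = all(g < small for g in gaps.values())  # gaps vanish
        by_op = {}
        for (_, _, op), g in gaps.items():
            by_op.setdefault(op, []).append(g)
        means = [float(np.mean(v)) for v in by_op.values()]
        # Strong operator dependence would undermine the stability claim.
        operator_dependent = len(means) > 1 and max(means) > operator_ratio * min(means)
        return all_small or operator_dependent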

read the original abstract

Deep learning models for drug–target interaction (DTI) prediction often achieve strong benchmark performance without necessarily relying on mechanistically meaningful molecular features, a limitation that standard accuracy-based evaluation cannot detect. We introduce ISAAC (Intervention-based Structural Auditing Approach for Causal Reasoning), a post-hoc framework that evaluates prior-relative structural sensitivity by probing frozen models through matched mechanistic and spurious input-level interventions, independently of predictive accuracy. Applied to three sequence-based DTI architectures on the Davis benchmark, ISAAC reveals approximately 25% relative differences in reasoning scores across models with comparable AUROC (within around 3%), stable across training and intervention seeds and two distinct perturbation operators. These discrepancies, undetectable under conventional accuracy metrics, motivate the use of post-hoc structural auditing as a complement to standard performance evaluation in scientific machine learning for molecular modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ISAAC, a post-hoc auditing framework that probes frozen sequence-based DTI models with matched mechanistic and spurious input-level interventions to compute a reasoning score measuring prior-relative structural sensitivity, independent of predictive accuracy. On the Davis benchmark, it reports that three architectures with comparable AUROC (within ~3%) exhibit ~25% relative differences in reasoning scores, with stability across training/intervention seeds and two perturbation operators.

Significance. If the intervention labeling and score computation are shown to isolate causal reasoning without confounding by sequence properties, the work would usefully demonstrate that accuracy metrics alone miss important mechanistic differences in molecular ML models. The empirical gap between AUROC parity and reasoning-score divergence, plus seed stability, would be a concrete contribution to post-hoc auditing in scientific ML.

major comments (3)
  1. [Methods] Methods section on intervention construction and labeling: the partitioning of perturbations into mechanistic versus spurious classes lacks any quantitative validation (e.g., inter-rater agreement, ablation removing the label step, or checks against correlated features such as sequence length or hydrophobicity). This labeling step is load-bearing for the central claim that observed reasoning-score gaps reflect causal reasoning rather than sensitivity to other input statistics.
  2. [Results / Methods] Results and Methods on reasoning score: no explicit equation, pseudocode, or derivation is supplied for how the reasoning score is computed from the intervention outcomes (e.g., how prior-relative structural sensitivity is aggregated or normalized). Without this, it is impossible to verify the claimed independence from accuracy or to reproduce the reported 25% relative differences.
  3. [Experiments] Experiments on Davis benchmark: the headline result (25% reasoning-score gap with ~3% AUROC parity) is presented without an ablation that holds the label assignment fixed while varying only the model architecture, leaving open whether the gap arises from the models' causal sensitivity or from differential sensitivity to the particular perturbation operators chosen.
minor comments (2)
  1. [Abstract] Abstract and introduction: the phrase 'prior-relative structural sensitivity' is used without a concise definition or reference to the precise quantity being measured.
  2. [Figures / Tables] Figure captions and tables: stability across seeds is asserted but the number of seeds and exact variance values are not reported in the main text or captions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address each major comment point by point below, indicating revisions where appropriate to improve the clarity and rigor of the manuscript.

read point-by-point responses
  1. Referee: [Methods] Methods section on intervention construction and labeling: the partitioning of perturbations into mechanistic versus spurious classes lacks any quantitative validation (e.g., inter-rater agreement, ablation removing the label step, or checks against correlated features such as sequence length or hydrophobicity). This labeling step is load-bearing for the central claim that observed reasoning-score gaps reflect causal reasoning rather than sensitivity to other input statistics.

    Authors: The intervention labeling is grounded in domain expertise from molecular biology: mechanistic perturbations target residues known from prior literature to influence drug binding affinity, while spurious ones do not. We recognize the value of additional validation and will include in the revision: (1) correlation checks with sequence features such as length and hydrophobicity to rule out confounding (a sketch of such a check follows these responses), and (2) an ablation that bypasses the labeling to assess its impact. Inter-rater agreement can be added if we consult additional experts, though the current labeling follows a deterministic rule based on binding site annotations. revision: partial

  2. Referee: [Results / Methods] Results and Methods on reasoning score: no explicit equation, pseudocode, or derivation is supplied for how the reasoning score is computed from the intervention outcomes (e.g., how prior-relative structural sensitivity is aggregated or normalized). Without this, it is impossible to verify the claimed independence from accuracy or to reproduce the reported 25% relative differences.

    Authors: We apologize for the omission of the explicit formulation. The reasoning score is computed as RS = (P_mech - P_spur) / P_prior, where P denotes the model's predicted interaction probability under each condition, aggregated over multiple interventions and normalized to ensure independence from baseline accuracy. We will add the full equation, pseudocode for the computation, and a derivation showing why this isolates structural sensitivity to the revised Methods section (a minimal sketch of the stated formula follows these responses). revision: yes

  3. Referee: [Experiments] Experiments on Davis benchmark: the headline result (25% reasoning-score gap with ~3% AUROC parity) is presented without an ablation that holds the label assignment fixed while varying only the model architecture, leaving open whether the gap arises from the models' causal sensitivity or from differential sensitivity to the particular perturbation operators chosen.

    Authors: The experimental design applies the identical set of labeled interventions to all three model architectures on the Davis benchmark, thereby holding the label assignment fixed while varying only the model. This is already the case in the reported results. To make this explicit, we will add a sentence in the Experiments section clarifying that the intervention set is shared across models. The reported stability across two perturbation operators further supports that the gaps are not due to operator choice. revision: partial
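
On the first response, the promised correlation check is straightforward to prototype. A minimal sketch, assuming perturbed protein sequences and scalar feature functions; the feature choices follow the referee's examples, and the two-sample Kolmogorov-Smirnov test is an assumption, not the authors' stated method:

    import numpy as np
    from scipy import stats

    def confound_check(mech_inputs, spur_inputs, feature_fns):
        # Compare summary-feature distributions between the mechanistic and
        # spurious intervention sets; well-matched sets should look alike on
        # features irrelevant to the mechanism (large p-values).
        report = {}
        for name, fn in feature_fns.items():
            mech_vals = [fn(x) for x in mech_inputs]
            spur_vals = [fn(x) for x in spur_inputs]
            stat, p = stats.ks_2samp(mech_vals, spur_vals)
            report[name] = {"ks_stat": float(stat), "p_value": float(p)}
        return report

    # Illustrative usage (mean_hydrophobicity is a hypothetical helper):
    # confound_check(mech_seqs, spur_seqs,
    #                {"length": len, "hydrophobicity": mean_hydrophobicity})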
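
On the second response, the quoted formula translates directly. A minimal sketch of RS = (P_mech - P_spur) / P_prior, assuming one prediction per matched intervention pair; simple averaging is an assumption, since the rebuttal does not specify the aggregation or normalization:

    import numpy as np

    def reasoning_score(p_prior, p_mech, p_spur):
        # p_prior: predicted interaction probability on the unperturbed input.
        # p_mech, p_spur: predictions under the matched mechanistic and
        # spurious interventions, one entry per intervention pair.
        p_mech = np.asarray(p_mech, dtype=float)
        p_spur = np.asarray(p_spur, dtype=float)
        return float(np.mean((p_mech - p_spur) / p_prior))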

Circularity Check

0 steps flagged

No significant circularity detected in the ISAAC derivation chain.

full rationale

The ISAAC framework is presented as a post-hoc auditing method applied to already-trained, frozen models. It computes reasoning scores via sensitivity to explicitly defined input-level interventions (mechanistic vs. spurious) that are independent of the models' predictive accuracy or training loss. No equations, fitted parameters, or self-referential definitions are shown that would make the reported 25% relative score differences equivalent to the input data or labels by construction. The abstract explicitly states independence from accuracy metrics and stability across seeds, with no load-bearing self-citations or ansatzes invoked to force the result. The central empirical claim therefore remains a non-tautological observation rather than a renaming or re-derivation of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that mechanistic and spurious interventions can be defined and matched at the input-sequence level for DTI models; no free parameters or invented physical entities are mentioned.

axioms (1)
  • domain assumption · Input-level perturbations can be partitioned into matched mechanistic and spurious sets that isolate causal reasoning in frozen sequence-based DTI models.
    This partition is required for the reasoning score to measure structural sensitivity rather than generic sensitivity.
invented entities (1)
  • Reasoning score · no independent evidence
    purpose: Quantifies relative structural sensitivity to mechanistic versus spurious interventions
    New derived quantity introduced by the ISAAC framework; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5443 in / 1383 out tokens · 49276 ms · 2026-05-10T15:39:34.897040+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    Shortcut learning in deep neural networks

    Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R. S., Brendel, W., Bethge, M., and Wichmann, F. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2:665--673, 2020

  2. [2]

    Adversarial examples are not bugs, they are features

    Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., and Madry, A. Adversarial examples are not bugs, they are features. In NeurIPS, 2019

  3. [3]

    Why normalizing flows fail to detect out-of-distribution data

    Kirichenko, P., Izmailov, P., and Wilson, A. G. Why normalizing flows fail to detect out-of-distribution data. In NeurIPS, 2020

  4. [4]

    Invariant Risk Minimization

    Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019

  5. [5]

    A unified approach to interpreting model predictions

    Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. In NeurIPS, pp. 4768--4777, 2017

  6. [6]

    Causal inference by using invariant prediction

    Peters, J., Bühlmann, P., and Meinshausen, N. Causal inference by using invariant prediction. JRSS B, 78(5):947--1012, 2016

  7. [7]

    Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

    Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019

  8. [8]

    Measuring robustness to natural distribution shifts

    Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., and Schmidt, L. Measuring robustness to natural distribution shifts. In NeurIPS, 2020

  9. [9]

    Causality

    Pearl, J. Causality. Cambridge University Press, 2nd edition, 2009

  10. [10]

    Toward causal representation learning

    Sch\"olkopf, B., Locatello, F., Bauer, S., et al. Toward causal representation learning. Proceedings of the IEEE, 109:612--634, 2021

  11. [11]

    Weakly-supervised disentanglement without compromises

    Locatello, F., Poole, B., Rätsch, G., Schölkopf, B., Bachem, O., and Tschannen, M. Weakly-supervised disentanglement without compromises. In ICML, 2020

  12. [12]

    Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

    Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks. arXiv preprint arXiv:1312.6034, 2014

  13. [13]

    Axiomatic attribution for deep networks

    Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In ICML, pp. 3319--3328, 2017

  14. [14]

    Why should I trust you?

    Ribeiro, M. T., Singh, S., and Guestrin, C. Why should I trust you? In KDD, pp. 1135--1144, 2016

  15. [15]

    Towards A Rigorous Science of Interpretable Machine Learning

    Doshi-Velez, F. and Kim, B. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017

  16. [16]

    Sanity checks for saliency maps

    Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In NeurIPS, 2018

  17. [17]

    RE-IMAGINE: Symbolic benchmark synthesis for reasoning evaluation

    Xu, X., et al. RE-IMAGINE: Symbolic benchmark synthesis for reasoning evaluation. In ICML, 2025

  18. [18]

    LLMScan: Causal scan for LLM misbehavior detection

    Zhang, M., et al. LLMScan: Causal scan for LLM misbehavior detection. In ICML, 2025

  19. [19]

    Ozt\"urk, H., \

    \"Ozt\"urk, H., \"Ozg\"ur, A., and \"Ozkirimli, E. DeepDTA: Deep drug--target binding affinity prediction. Bioinformatics, 34:i821--i829, 2018

  20. [20]

    Interpretable drug target prediction

    Gao, K. Y., et al. Interpretable drug target prediction. In IJCAI, pp. 3371--3377, 2018

  21. [21]

    Deep learning for drug-drug interaction prediction

    Li, X., et al. Deep learning for drug-drug interaction prediction. Quantitative Biology, 12:30--52, 2024

  22. [22]

    MolTrans: Molecular interaction transformer

    Huang, K., Xiao, C., Glass, L., and Sun, J. MolTrans: Molecular interaction transformer. Bioinformatics, 37:830--836, 2020

  23. [23]

    GraphDTA

    Nguyen, T., et al. GraphDTA. Bioinformatics, 37:1140--1147, 2020

  24. [24]

    DeepConv-DTI

    Lee, I., Keum, J., and Nam, H. DeepConv-DTI. PLOS Computational Biology, 15, 2019

  25. [25]

    TAPB

    Lin, G., et al. TAPB. Nature Communications, 16, 2025

  26. [26]

    TransformerCPI

    Chen, L., et al. TransformerCPI. Bioinformatics, 36:4406--4414, 2020

  27. [27]

    DrugBAN

    Bai, P., et al. DrugBAN. Nature Machine Intelligence, 5:126--136, 2022

  28. [28]

    KLIFS database

    Kooistra, A. J., et al. KLIFS database. Nucleic Acids Research, 44:D365--D371, 2015

  29. [29]

    Veridical data science

    Yu, B. and Kumbier, K. Veridical data science. PNAS, 117:3920--3929, 2020

  30. [30]

    Causal abstractions of neural networks

    Geiger, A., et al. Causal abstractions of neural networks. In NeurIPS, 2021

  31. [31]

    Investigating gender bias using causal mediation

    Vig, J., et al. Investigating gender bias using causal mediation. In NeurIPS, 2020

  32. [32]

    Perturbation-based methods for explaining neural networks

    Ivanovs, M., Kadikis, R., and Ozols, K. Perturbation-based methods for explaining neural networks. Pattern Recognition Letters, 150:228--234, 2021