
arxiv: 2605.21731 · v1 · pith:4VVZE3IUnew · submitted 2026-05-20 · 💻 cs.LG

I-SAFE: Wasserstein Coherence Metrics for Structural Auditing of Scientific AI Models

Pith reviewed 2026-05-22 09:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords scientific AI auditing · Wasserstein distance · drug-target interaction · distributional coherence · structural perturbations · post-hoc evaluation · model interpretability

The pith

I-SAFE auditing reveals different distributional profiles in DTI models with similar accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the I-SAFE framework to audit black-box scientific AI models by measuring coherence of their output distributions under perturbations guided by an external structural prior. It defines three metrics: a quantile-based measure for location shifts, the Wasserstein Coherence Metric for ordinal consistency, and a translation-invariant version for distributional shape. The approach matters because benchmark accuracy alone cannot distinguish models that capture domain-relevant structure from those that exploit shortcuts or biases. When applied to three sequence-based drug-target interaction models on the Davis benchmark, the audit detects substantially different response profiles that accuracy scores do not reveal.
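As a rough illustration of what a quantile-based location comparison could look like, the sketch below compares matching quantile cut points of baseline and perturbed model outputs. The function name `quantile_shift`, the choice of quartiles, and the averaging rule are assumptions made for illustration; this page does not reproduce the paper's QBM definition, and the numbers are toy values.

```python
from statistics import quantiles

def quantile_shift(baseline, perturbed, n=4):
    """Mean absolute difference between matching quantile cut points.

    Hypothetical stand-in for a quantile-based location measure; it is not
    the paper's QBM definition.
    """
    qb = quantiles(baseline, n=n, method="inclusive")
    qp = quantiles(perturbed, n=n, method="inclusive")
    return sum(abs(a - b) for a, b in zip(qb, qp)) / len(qb)

# Toy model outputs (made up): the perturbed set is the baseline moved down by 1.0.
baseline = [5.0, 5.5, 6.0, 6.5, 7.0]
shifted = [x - 1.0 for x in baseline]

print(quantile_shift(baseline, baseline))  # identical outputs give no shift
print(quantile_shift(baseline, shifted))   # a uniform shift moves every quartile
```

A measure of this kind reacts to where the output distribution sits, which is why a separate shape-oriented metric is needed to tell a pure shift apart from a change in spread.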

Core claim

Given a trained black-box predictor and an external structural prior encoding domain knowledge about task-relevant input structure, I-SAFE evaluates raw model outputs under structurally guided perturbations of the input. The proposed audit measures output-distribution coherence through three complementary metrics: a Quantile-Based Metric for location-level coherence, the Wasserstein Coherence Metric for ordinal coherence, and a translation-invariant WCM variant for shape coherence. Instantiated on drug-target interaction prediction using the Davis kinase benchmark, KLIFS binding-pocket annotations, and three models, the framework shows that models with comparable predictive performance can exhibit substantially different distributional response profiles.

What carries the argument

Wasserstein Coherence Metric that quantifies ordinal and shape coherence of model output distributions under perturbations derived from the external structural prior.
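The sketch below shows the standard one-dimensional Wasserstein-1 distance that a metric of this family builds on, together with a mean-centred variant that ignores pure translations. It is a minimal illustration, not the paper's WCM: the function names, the equal-sample-size restriction, and the use of mean centring to obtain translation invariance are all assumptions, and the outputs are toy values.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    For empirical distributions with the same number of points, the optimal
    coupling matches sorted values, so the distance is the mean absolute gap.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def centered_wasserstein_1d(xs, ys):
    """Distance after subtracting each sample's mean.

    Assumed way to discard pure translations; not necessarily the paper's
    translation-invariant variant.
    """
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return wasserstein_1d([x - mx for x in xs], [y - my for y in ys])

# Toy model outputs (made up).
baseline = [5.0, 5.5, 6.0, 6.5, 7.0]
shifted = [x - 1.0 for x in baseline]                  # same shape, lower location
squeezed = [6.0 + 0.5 * (x - 6.0) for x in baseline]   # same mean, half the spread

print(wasserstein_1d(baseline, shifted))             # the location change registers
print(centered_wasserstein_1d(baseline, shifted))    # the translation is removed
print(centered_wasserstein_1d(baseline, squeezed))   # the shape change remains
```

In this toy case the plain distance sees the shift, while the centred version is zero for the shifted sample and non-zero only when the spread changes.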

If this is right

  • Models can be compared and selected according to structural coherence in addition to predictive accuracy.
  • The audit can identify reliance on dataset-specific regularities rather than domain-relevant features.
  • The framework applies directly to any scientific prediction task where inputs admit structured decomposition and an external prior exists.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Combining coherence scores with standard accuracy could produce a joint ranking criterion for model deployment in scientific settings.
  • Low coherence on specific perturbation types might guide targeted data collection or architecture adjustments.

Load-bearing premise

The external structural prior accurately encodes task-relevant input structure that can be used to generate meaningful perturbations for the audit.

What would settle it

Running the three coherence metrics on the same set of KLIFS-guided perturbations and finding that the three DTI models produce identical or statistically indistinguishable distributional response profiles.
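One way to make "statistically indistinguishable" concrete is an exact permutation test on per-seed coherence scores from two models. The sketch below is an assumption about how such a check could be run; the page does not state which test the paper uses, the function name `permutation_p_value` is hypothetical, and the per-seed values are invented for illustration.

```python
from itertools import combinations

def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference of group means."""
    pooled = list(a) + list(b)
    n_a = len(a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(sum(a) / n_a - sum(b) / len(b))
    hits = 0
    splits = 0
    # Enumerate every way of relabelling the pooled scores into two groups.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float round-off
            hits += 1
        splits += 1
    return hits / splits

# Invented per-seed coherence contrasts for two hypothetical models (not paper results).
model_a = [0.21, 0.18, 0.25, 0.20, 0.22]
model_b = [-0.03, 0.02, -0.01, 0.04, -0.02]

print(permutation_p_value(model_a, model_b))  # small when the groups separate cleanly
print(permutation_p_value(model_a, model_a))  # identical groups cannot be told apart
```

With five seeds per model there are only 252 relabellings, so the test can be enumerated exactly, but the smallest attainable two-sided p-value is correspondingly coarse.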

Figures

Figures reproduced from arXiv: 2605.21731 by Barbara Tarantino, Gennaro Auricchio, Paolo Giudici.

Figure 1
Figure 1: I-SAFE prior-relative coherence contrasts on the Davis benchmark: ∆QBM (a), ∆WCM (b), and ∆TI-WCM (c), computed as spurious minus mechanistic coherence. The dashed line marks no differential coherence; positive values indicate greater coherence under mechanistic perturbations. Error bars denote 95% confidence intervals across five seeds. … and ∆WCM = −0.013 ([−0.057, 0.031]), showing no comparable prior-ali…
read the original abstract

Deep learning models are increasingly used in scientific prediction tasks where strong benchmark performance is often interpreted as evidence of scientifically meaningful behavior. This interpretation is fragile, as models may exploit shortcut features, dataset-specific regularities, or distributional biases that are predictive on held-out data but not aligned with domain-relevant structure. To address this limitation, we introduce the I-SAFE (Interventional Secure, Accurate, Fair and Explainable) framework, a post-hoc distributional auditing framework for scientific AI models centered on the Wasserstein Coherence Metric (WCM). Given a trained black-box predictor and an external structural prior encoding domain knowledge about task-relevant input structure, I-SAFE evaluates raw model outputs under structurally guided perturbations of the input. The proposed audit measures output-distribution coherence through three complementary metrics: a Quantile-Based Metric (QBM) for location-level coherence, the WCM for ordinal coherence, and a translation-invariant WCM variant for shape coherence. We instantiate I-SAFE on drug–target interaction (DTI) prediction using the Davis kinase benchmark, KLIFS (Kinase–Ligand Interaction Fingerprints and Structures) binding-pocket annotations, and three sequence-based DTI models: DeepConvDTI, DeepDTA, and TAPB. Although the models operate in a comparable predictive regime, I-SAFE reveals substantially different distributional response profiles, a distinction invisible to accuracy-based evaluation. The framework is model-agnostic and applicable to any domain where inputs admit a structured decomposition and an external prior is available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the I-SAFE framework, a post-hoc auditing method for scientific AI models that applies Wasserstein Coherence Metrics (WCM) and a Quantile-Based Metric (QBM) to evaluate output-distribution coherence under perturbations generated from an external structural prior (KLIFS binding-pocket annotations). It demonstrates the approach on three sequence-based drug-target interaction models (DeepConvDTI, DeepDTA, TAPB) trained on the Davis kinase benchmark, claiming that the models exhibit substantially different distributional response profiles despite comparable predictive accuracy.

Significance. If the central claims hold, I-SAFE offers a model-agnostic tool for detecting misalignment between model behavior and domain-relevant structure that standard accuracy metrics miss. The use of Wasserstein distances for ordinal and shape coherence, combined with an external prior, provides a concrete way to audit shortcut exploitation in scientific prediction tasks.

major comments (2)
  1. [§3.2] §3.2 (Perturbation Generation): The paper does not report whether KLIFS-guided perturbations preserve marginal input statistics such as amino-acid composition or sequence-length distribution across the three models. Without this check, the observed differences in QBM/WCM profiles could reflect architecture-specific sensitivity to any structured input change rather than genuine misalignment with binding-pocket structure, directly undermining the central claim that I-SAFE isolates task-relevant structural coherence.
  2. [§4.3] §4.3 (Results and Comparison): The claim that the distinctions found by I-SAFE are 'invisible' to accuracy-based evaluation requires explicit quantification of how much of the WCM/QBM separation is explained by residual correlations with non-structural features; the current presentation leaves open the possibility that the metrics are re-detecting known architecture differences rather than new scientific misalignment.
minor comments (2)
  1. [§2.3] The definition of the translation-invariant WCM variant should include an explicit equation showing how translation invariance is enforced, to allow readers to verify it does not inadvertently remove shape information relevant to the audit.
  2. [Figure 2] Figure 2 (Distributional response profiles): Axis labels and legend entries are too small for readability; increase font size and add a brief caption explaining the color coding for the three models.
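
On the first minor comment: this page does not reproduce the paper's equation, so the block below only states one standard shift-invariant construction as an assumption about what such a variant could look like. Here μ and ν are output distributions on the real line with finite second moments and means m_μ and m_ν, and τ_c denotes translation by c; none of this notation is taken from the paper.

```latex
% One standard shift-invariant construction (an assumption, not the paper's definition).
% (\tau_c)_{\#}\nu is the distribution \nu translated by c.
\[
  \widetilde{W}_p(\mu,\nu) \;=\; \inf_{c \in \mathbb{R}} W_p\bigl(\mu,\ (\tau_c)_{\#}\nu\bigr),
  \qquad \tau_c(x) = x + c .
\]
% For p = 2 the infimum is attained at c = m_\mu - m_\nu, which gives the decomposition
\[
  W_2^2(\mu,\nu) \;=\; (m_\mu - m_\nu)^2 \;+\; \widetilde{W}_2^{\,2}(\mu,\nu).
\]
```

Under this construction a pure location change contributes only to the first term, so the remaining term reflects differences in shape, which is the property the referee asks the authors to make verifiable.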

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. These observations help clarify how to better isolate the contribution of structural priors in the I-SAFE framework. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Perturbation Generation): The paper does not report whether KLIFS-guided perturbations preserve marginal input statistics such as amino-acid composition or sequence-length distribution across the three models. Without this check, the observed differences in QBM/WCM profiles could reflect architecture-specific sensitivity to any structured input change rather than genuine misalignment with binding-pocket structure, directly undermining the central claim that I-SAFE isolates task-relevant structural coherence.

    Authors: We agree that an explicit check on marginal input statistics would strengthen the interpretation. In the revised manuscript we will add a supplementary table and brief analysis comparing amino-acid composition and sequence-length distributions between the original sequences and the KLIFS-guided perturbations for each of the three models. The perturbations are constructed by targeted residue substitutions within the binding-pocket regions annotated by KLIFS; because the changes are localized and the overall sequence length is unchanged, we expect the marginals to remain largely preserved. Including this verification will directly address the concern that the observed coherence differences could arise from generic sensitivity to any input modification. revision: yes

  2. Referee: [§4.3] §4.3 (Results and Comparison): The claim that the distinctions found by I-SAFE are 'invisible' to accuracy-based evaluation requires explicit quantification of how much of the WCM/QBM separation is explained by residual correlations with non-structural features; the current presentation leaves open the possibility that the metrics are re-detecting known architecture differences rather than new scientific misalignment.

    Authors: We acknowledge that a quantitative separation from non-structural factors would make the claim more robust. In the revision we will add a short analysis (new panel or appendix) that reports partial correlations and a simple regression of the WCM/QBM scores against a set of non-structural covariates (model depth, embedding dimension, and basic sequence statistics). This will allow readers to see the fraction of metric separation that remains after controlling for these factors. We maintain that the primary distinction arises from differential sensitivity to the KLIFS structural prior, but the added quantification will clarify the extent to which architecture-specific traits contribute. revision: yes
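
The marginal-statistics check promised in the first response can be sketched roughly as follows: apply substitutions at annotated positions, then compare sequence length and residue composition before and after. This is a toy illustration under stated assumptions: the helper names, the alanine replacement rule, the sequence, and the pocket indices are all invented, and the paper's actual substitution scheme is not described on this page.

```python
from collections import Counter

def substitute_at(sequence, positions, replacement="A"):
    """Replace residues at the given 0-based positions (hypothetical perturbation)."""
    chars = list(sequence)
    for pos in positions:
        chars[pos] = replacement
    return "".join(chars)

def composition(sequence):
    """Relative frequency of each residue letter in the sequence."""
    counts = Counter(sequence)
    return {aa: n / len(sequence) for aa, n in counts.items()}

def composition_gap(seq_a, seq_b):
    """Total variation distance between two residue compositions."""
    ca, cb = composition(seq_a), composition(seq_b)
    letters = set(ca) | set(cb)
    return 0.5 * sum(abs(ca.get(aa, 0.0) - cb.get(aa, 0.0)) for aa in letters)

original = "MKVLGEGAFGKVYLAR"   # toy sequence, not a real kinase
pocket_positions = [4, 6, 9]    # made-up indices standing in for annotated pocket residues
perturbed = substitute_at(original, pocket_positions)

print(len(original) == len(perturbed))       # substitution keeps the length
print(composition_gap(original, perturbed))  # composition change caused by the edits
```

Reporting a gap of this kind for both pocket-guided and control perturbations would show whether the two perturbation types disturb the marginals to a similar degree.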

Circularity Check

0 steps flagged

No circularity: I-SAFE metrics defined directly from external prior and Wasserstein distances

full rationale

The paper defines the Quantile-Based Metric (QBM), Wasserstein Coherence Metric (WCM), and its translation-invariant variant explicitly as functions of raw model outputs under perturbations generated from the independent KLIFS binding-pocket annotations. These definitions rely on standard Wasserstein distance applied to the resulting output distributions and do not reduce to fitted parameters, self-referential quantities, or prior results by the same authors. The central empirical claim—that the three DTI models exhibit distinct distributional response profiles despite comparable accuracy—is an observation obtained by applying the externally defined metrics, not a tautology. No self-citation chains, uniqueness theorems, or smuggled ansatzes appear in the load-bearing steps of the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based on the abstract only. The framework rests on the domain assumption that structural priors are valid, and it introduces new metrics without explicit free parameters or invented physical entities.

axioms (1)
  • domain assumption: External structural prior encodes domain knowledge about task-relevant input structure
    Invoked when using KLIFS annotations to guide perturbations in the I-SAFE audit.
invented entities (1)
  • Wasserstein Coherence Metric (WCM) · no independent evidence
    purpose: Quantify ordinal and shape coherence of model output distributions under structural perturbations
    New metric family introduced as core of the auditing framework.



