Position: Genomic Model Research Must Move Beyond Anecdotal Evaluation of Interpretability Methods
Pith reviewed 2026-06-28 23:14 UTC · model grok-4.3
The pith
Genomic interpretability research relies on anecdotal success stories that different methods often contradict.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that anecdotal validation of interpretability methods in genomics is unreliable because the same model predictions can receive contradictory explanations from different methods, those explanations often miss known sequence motifs, and they do not faithfully track the model's internal computations; therefore genomic IML work must adopt systematic assessment of consistency, faithfulness, and biological validity.
What carries the argument
A benchmarking study on transcription factor binding that compares multiple IML methods on the same models and measures agreement, motif recovery, and faithfulness to model behavior.
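To make two of these measurements concrete, here is a minimal sketch, not the paper's actual procedure, which this summary does not specify. It assumes each IML method yields one attribution score per sequence position, measures agreement as the Jaccard overlap of the top-k positions, and scores motif recovery as the fraction of top-k positions inside a known motif interval. The function names, toy scores, and motif coordinates are all hypothetical.

```python
def top_k_positions(scores, k):
    """Return the set of indices of the k largest absolute attribution scores."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return set(ranked[:k])


def top_k_agreement(scores_a, scores_b, k):
    """Jaccard overlap between the top-k positions of two attribution vectors."""
    a, b = top_k_positions(scores_a, k), top_k_positions(scores_b, k)
    return len(a & b) / len(a | b)


def motif_recovery(scores, motif_start, motif_end, k):
    """Fraction of the top-k positions that land inside [motif_start, motif_end)."""
    top = top_k_positions(scores, k)
    hits = sum(1 for i in top if motif_start <= i < motif_end)
    return hits / k


# Toy per-position attributions for one 10-bp sequence (made-up numbers).
method_a = [0.0, 0.1, 0.9, 0.8, 0.7, 0.0, 0.1, 0.0, 0.0, 0.0]
method_b = [0.6, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.8, 0.1]

# Hypothetical known motif occupying positions 2..4, i.e. the interval [2, 5).
agreement = top_k_agreement(method_a, method_b, k=3)
recovery_a = motif_recovery(method_a, 2, 5, k=3)
recovery_b = motif_recovery(method_b, 2, 5, k=3)
```

In this toy case the two methods share no top-3 positions, and only the first one localizes the motif. That is the kind of disagreement the benchmark is described as measuring. A real study would average such scores over many sequences and models.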
Load-bearing premise
The assumption that the inconsistencies observed in one transcription-factor-binding benchmark will appear across other genomic tasks, model architectures, and datasets.
What would settle it
A replication study that applies the same set of IML methods to multiple independent genomic datasets and finds high agreement on explanations plus reliable recovery of known motifs in the great majority of cases.
Original abstract
Advances in machine learning and computational power have unlocked the predictive potential of the human genome, yet biologists now demand that these models also elucidate the underlying biological mechanisms. While interpretable machine learning (IML) techniques have been increasingly applied to bridge this gap, there has been a pervasive reliance on anecdotal validation: the vast majority of research relies on a single IML method and reports only isolated successful instances. Through a benchmarking study on transcription factor binding, we demonstrate the risks of current practices. We show that different IML methods can often (1) yield contradictory explanations for the same predictions, (2) fail to localize known regulatory motifs, and (3) fail to faithfully reflect the model's internal decision process. In light of this, we argue for a validation framework analogous to clinical trials: just as trials require rigorous design and adverse-event reporting, genomic interpretability must move beyond cherry-picked plausibility toward systematic assessment of consistency, faithfulness, and biological validity. To facilitate this, we propose a tiered framework to guide rigorous evaluation and reporting of genomic IML methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a position paper arguing that genomic ML interpretability research relies excessively on anecdotal validation of single IML methods. It supports this via a benchmarking study on transcription factor binding demonstrating that different IML methods often (1) yield contradictory explanations for the same predictions, (2) fail to localize known regulatory motifs, and (3) fail to faithfully reflect the model's internal decision process. The paper advocates replacing such practices with a clinical-trial-style validation framework emphasizing systematic assessment of consistency, faithfulness, and biological validity, and proposes a tiered framework to guide evaluation and reporting.
Significance. If the benchmarking results prove robust and the single-task findings generalize, the paper would highlight a systemic weakness in how IML is applied to genomics and provide a constructive path toward more reliable biological insights from models. The explicit analogy to clinical trials and the tiered framework are practical contributions that could influence community standards.
major comments (2)
- [Abstract] Abstract: the three concrete risks are asserted from a benchmarking study, but no details on study design, number of models, quantitative metrics, or statistical tests are supplied, preventing assessment of support for the claims of contradictory explanations, motif localization failure, and lack of faithfulness.
- [Benchmarking study and position argument] The position that 'the vast majority of research' must move to a clinical-trial-style framework rests on the single TF-binding benchmark generalizing to broader genomic IML; no multi-task replication, comparison to other targets (e.g., chromatin accessibility), or argument for representativeness of the chosen models and IML suite is provided.
minor comments (1)
- [Proposed framework] The tiered framework is introduced at a high level; adding concrete criteria, example metrics, or reporting templates for each tier would improve actionability without altering the central argument.
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive suggestions. We address each major comment below.
- Referee: [Abstract] Abstract: the three concrete risks are asserted from a benchmarking study, but no details on study design, number of models, quantitative metrics, or statistical tests are supplied, preventing assessment of support for the claims of contradictory explanations, motif localization failure, and lack of faithfulness.
Authors: The abstract is intentionally concise, as is standard. The full manuscript details the benchmarking study design, including the models evaluated, the suite of IML methods, the quantitative metrics used to measure contradictory explanations (e.g., agreement rates), motif localization performance against known ground truth, and faithfulness assessments (e.g., via perturbation or surrogate model tests), along with any statistical comparisons. To improve accessibility, we will revise the abstract to incorporate a brief statement on study scale and the nature of the quantitative findings.
Revision: yes
- Referee: [Benchmarking study and position argument] The position that 'the vast majority of research' must move to a clinical-trial-style framework rests on the single TF-binding benchmark generalizing to broader genomic IML; no multi-task replication, comparison to other targets (e.g., chromatin accessibility), or argument for representativeness of the chosen models and IML suite is provided.
Authors: We selected transcription factor binding precisely because it supplies established biological ground truth (known motifs), enabling rigorous assessment of localization and faithfulness that would be harder on less annotated tasks. The observed inconsistencies illustrate the risks of anecdotal validation even under favorable conditions. While we do not provide multi-task replication here, we will add a dedicated discussion paragraph explaining the choice of TF binding as a representative and stringent test case, acknowledging the single-task scope, and calling for future multi-task studies (including chromatin accessibility) to further test generalizability.
Revision: partial
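The perturbation-style faithfulness assessment mentioned in the first response could look roughly like a deletion test: mask the positions an explanation ranks highest and check whether the model output drops more than it does under random masking. The sketch below is an assumption-laden illustration, not the authors' protocol. `toy_model`, the `TGAC` motif, the masking character, and the attribution vectors are all hypothetical.

```python
import random


def toy_model(seq):
    """Hypothetical stand-in model: outputs 1.0 if the motif 'TGAC' is present."""
    return 1.0 if "TGAC" in seq else 0.0


def mask_positions(seq, positions, mask_char="N"):
    """Replace the given positions of seq with a mask character."""
    return "".join(mask_char if i in positions else c for i, c in enumerate(seq))


def deletion_drop(model, seq, scores, k):
    """Drop in model output after masking the k highest-attributed positions."""
    ranked = sorted(range(len(seq)), key=lambda i: scores[i], reverse=True)
    return model(seq) - model(mask_positions(seq, set(ranked[:k])))


def random_baseline_drop(model, seq, k, trials=200, seed=0):
    """Average drop after masking k random positions, as a reference point."""
    rng = random.Random(seed)
    drops = []
    for _ in range(trials):
        picked = set(rng.sample(range(len(seq)), k))
        drops.append(model(seq) - model(mask_positions(seq, picked)))
    return sum(drops) / trials


seq = "AAATGACAAA"
faithful = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]    # highlights the motif
unfaithful = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]  # highlights flanking bases

drop_faithful = deletion_drop(toy_model, seq, faithful, k=2)
drop_unfaithful = deletion_drop(toy_model, seq, unfaithful, k=2)
drop_random = random_baseline_drop(toy_model, seq, k=2)
```

An explanation that tracks the model should produce a larger drop than the random baseline, while one that highlights irrelevant bases should not. Masking can push inputs out of distribution for real models, so the choice of perturbation would itself need to be reported.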
Circularity Check
No circularity: position rests on direct empirical benchmarking, not self-referential logic or derivations
The paper is a position statement whose central claims are grounded in the authors' own benchmarking study on transcription-factor binding (explicitly described in the abstract as demonstrating contradictory explanations, motif localization failures, and faithfulness issues). No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the provided text. The generalization concern raised by the skeptic is an empirical representativeness issue, not a circular reduction where a result equals its inputs by construction. The proposed tiered framework is presented as a normative recommendation following the observations, not derived from any self-definitional step. This matches the default expectation of no significant circularity for non-derivational papers.