arxiv: 2605.20885 · v1 · pith:EPSRCYTInew · submitted 2026-05-20 · 💻 cs.LG · q-bio.QM

Training distribution determines the ceiling of drug-blind cancer sensitivity prediction

Pith reviewed 2026-05-21 06:30 UTC · model grok-4.3

classification 💻 cs.LG q-bio.QM
keywords drug sensitivity prediction · precision oncology · training distribution · mechanism of action · per-drug Pearson correlation · kinase inhibitors · cancer genomics · machine learning

The pith

Training distribution by drug mechanism sets the ceiling on drug-blind cancer sensitivity prediction

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that stalled progress in predicting drug responses for individual tumors from molecular data is caused by an evaluation artifact and by how training examples are mixed, not by weak drug representations. The standard global Pearson correlation is mostly explained by average potency differences between drugs, which a trivial mean-per-drug predictor captures without learning anything about cells. Switching the metric to per-drug Pearson correlation, which isolates how well a model ranks cells for each fixed drug, shows that cell features alone match or exceed models that add any drug encoding across four datasets. Supplying mechanism-of-action labels as an input feature gives almost no lift, yet restricting training to same-mechanism subsets markedly improves per-drug ranking for targeted kinase inhibitors because mixing all cancers together dilutes the pathway-specific signals those drugs depend on.

Core claim

Drug-blind sensitivity prediction has not advanced with richer drug encodings because the global Pearson r metric is dominated by between-drug potency differences that a drug-mean baseline already captures. Per-drug Pearson r, which evaluates within-drug cell ranking, demonstrates that no drug encoding outperforms cell-only features on four independent datasets. Treating mechanism-of-action identity as a training-distribution constraint rather than an input feature raises per-drug r for targeted kinase inhibitors, since pan-cancer co-training suppresses the pathway-specific sensitivity patterns that stratification preserves.
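The gap between the two metrics can be illustrated with a small synthetic sketch. Everything here is an assumption made for illustration: the data are simulated, the helper names `pearson` and `per_drug_pearson` are hypothetical, per-drug r is taken to be the unweighted mean of within-drug correlations, and the drug means are computed in-sample rather than under a drug-blind split. A constant prediction within a drug has no defined correlation, so the sketch scores it as zero.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation; scored as 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if min(sxx, syy) < 1e-12:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def per_drug_pearson(y_true, y_pred, drug_ids):
    """Unweighted mean of the within-drug Pearson correlations."""
    groups = {}
    for t, p, d in zip(y_true, y_pred, drug_ids):
        ts, ps = groups.setdefault(d, ([], []))
        ts.append(t)
        ps.append(p)
    return mean(pearson(ts, ps) for ts, ps in groups.values())

# Simulated responses: a large between-drug potency offset plus
# within-drug cell-to-cell variation (all values are made up).
rng = random.Random(0)
n_drugs, n_cells = 20, 50
y_true, drug_ids = [], []
for d in range(n_drugs):
    potency = rng.gauss(0.0, 3.0)
    for _ in range(n_cells):
        y_true.append(potency + rng.gauss(0.0, 1.0))
        drug_ids.append(d)

# Trivial drug-mean predictor: every cell gets its drug's mean response.
drug_mean = {
    d: mean(t for t, g in zip(y_true, drug_ids) if g == d)
    for d in range(n_drugs)
}
y_pred = [drug_mean[d] for d in drug_ids]

global_r = pearson(y_true, y_pred)
per_drug_r = per_drug_pearson(y_true, y_pred, drug_ids)
print(f"global r   = {global_r:.3f}")
print(f"per-drug r = {per_drug_r:.3f}")
```

In this toy setting the global r is high because the predictor recovers the potency offsets, while the per-drug r is zero by construction because the predictor never distinguishes one cell from another.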

What carries the argument

Mechanism-of-action stratified training, which partitions the training distribution by drug class to avoid diluting pathway-specific signals rather than adding the class label as a model input.
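The difference between supplying the class label as an input and using it to partition the training data can be sketched with a toy linear model. This is not the paper's architecture or data: the two class names, the single cell feature, the opposite-signed dependence on that feature, and the 2-versus-8 drug imbalance are all assumptions chosen to make the effect visible. A linear model with an additive class indicator cannot express a class-specific slope, whereas a flexible network could in principle; the paper's finding that it does not in practice is empirical.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation; scored as 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if min(sxx, syy) < 1e-12 else sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
cells = [rng.gauss(0.0, 1.0) for _ in range(100)]  # one toy cell feature

def simulate_drug(sign):
    """Response = drug potency + sign * cell feature + noise (hypothetical)."""
    potency = rng.gauss(0.0, 2.0)
    return [potency + sign * c + rng.gauss(0.0, 1.0) for c in cells]

# Assumed imbalance: few targeted drugs, many drugs with the opposite dependence.
train = {
    "kinase": [simulate_drug(+1) for _ in range(2)],
    "cytotoxic": [simulate_drug(-1) for _ in range(8)],
}
test_kinase = simulate_drug(+1)  # held-out drug from the targeted class

def within_class_sums(drugs):
    """Class-centred cross-product and sum of squares over (drug, cell) pairs."""
    xs = cells * len(drugs)
    ys = [v for drug in drugs for v in drug]
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy, sxx

sums = {moa: within_class_sums(drugs) for moa, drugs in train.items()}

# (a) Class label as a feature: least squares on y ~ x + onehot(class) gives
#     class-specific intercepts but ONE shared slope, the pooled within-class slope.
pooled_slope = sum(s[0] for s in sums.values()) / sum(s[1] for s in sums.values())

# (b) Class label as a training constraint: fit only on same-class drugs.
stratified_slope = sums["kinase"][0] / sums["kinase"][1]

# Intercepts are omitted because Pearson r is unchanged by adding a constant.
pooled_r = pearson(test_kinase, [pooled_slope * c for c in cells])
stratified_r = pearson(test_kinase, [stratified_slope * c for c in cells])
print(f"class as feature    : per-drug r = {pooled_r:.3f}")
print(f"class as constraint : per-drug r = {stratified_r:.3f}")
```

Under these assumptions the shared slope is dominated by the majority class, so the pooled model ranks cells in the wrong direction for the held-out targeted drug, while the stratified fit recovers the class-specific ranking.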

If this is right

  • A trivial drug-mean predictor achieves high global Pearson r without any cell-specific learning.
  • Complex drug encodings add no value over cell-only features when measured by per-drug Pearson r.
  • Using mechanism-of-action labels to stratify training recovers performance specifically for drugs with narrow pathway targets.
  • Response matching from pilot observations and mechanism-stratified training together address the main sources of gain in drug-blind settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same stratification logic could be applied to other drug classes whose efficacy depends on distinct biological pathways.
  • Models might benefit from learning to detect when a new drug belongs to an under-represented mechanism and fall back to within-class training.
  • Future benchmarks should report both global and per-drug metrics to avoid mistaking between-drug variance for genuine predictive power.

Load-bearing premise

Per-drug Pearson r is the right metric for clinical relevance, and the gains from mechanism stratification observed on these four datasets and for kinase inhibitors will hold more generally.

What would settle it

An independent dataset in which mechanism-stratified training produces no per-drug Pearson r improvement for targeted kinase inhibitors would falsify the claim that training distribution determines the performance ceiling.

Original abstract

Precision oncology requires predicting which drugs will suppress a specific tumor from its molecular profile, but drug-blind sensitivity prediction has plateaued despite increasingly complex drug representations. Here we show that this stagnation reflects a metric artifact rather than a representational bottleneck. The standard benchmark, global Pearson r, is dominated by between-drug potency differences that a trivial drug-mean predictor captures without any cell-specific learning. Per-drug Pearson r, which isolates within-drug cell ranking, reveals that no drug encoding improves over cell-only features across four independent datasets. A controlled experiment channeling mechanism-of-action identity as either a drug feature or a training-distribution constraint identifies the cause. Supplying MoA as a feature yields negligible benefit, whereas using it to stratify training raises per-drug r substantially for targeted kinase inhibitors, because pan-cancer co-training suppresses pathway-specific sensitivity signals. Mechanism-stratified training and response matching from pilot observations provide two deployable strategies that together recover the principal sources of predictive gain in drug-blind sensitivity prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that stagnation in drug-blind cancer sensitivity prediction stems from a metric artifact in the standard global Pearson r benchmark, which is largely captured by a trivial drug-mean predictor reflecting between-drug potency differences. Switching to per-drug Pearson r (isolating within-drug cell ranking) across four datasets shows no benefit from drug encodings or MoA features over cell-only baselines. A controlled experiment demonstrates that using MoA to stratify training (rather than as a feature) substantially improves per-drug r for targeted kinase inhibitors by avoiding suppression of pathway-specific signals in pan-cancer co-training. The authors propose mechanism-stratified training and response matching from pilot observations as deployable improvements.

Significance. If the central empirical findings hold under rigorous validation, the work would meaningfully redirect focus in precision oncology ML from increasingly complex drug representations toward training distribution design and metric choice. The controlled MoA feature-vs-stratification experiment provides a clear, falsifiable demonstration that data heterogeneity can mask biologically relevant signals, with potential to improve model utility for kinase-targeted therapies. Strengths include the multi-dataset consistency and the explicit contrast between feature injection and distributional constraints.

major comments (3)
  1. [Abstract] Abstract and Results: The central claim that per-drug Pearson r is the appropriate target metric (and that global r is an artifact) is load-bearing, yet the manuscript does not report whether models optimized for per-drug r also reduce absolute error (e.g., MAE or RMSE on IC50 values) or improve rank-ordering against clinical biomarkers. Per-drug normalization discards between-drug potency differences that determine dosing and toxicity margins; without this check, the reported stagnation and stratification gains may not translate to clinically actionable improvements.
  2. [Results] Results section on controlled experiment: The claim that supplying MoA as a feature yields negligible benefit while stratification raises per-drug r substantially lacks detail on implementation (e.g., how many samples per MoA group, whether models are trained separately or with group-specific heads, and statistical significance of the per-drug r gains across the four datasets). This is critical because sample-size reduction from stratification could inflate variance and undermine the cross-dataset generalization argument.
  3. [Discussion] Discussion or Experiments: The paper does not compare the per-drug r gains from stratification against a simple drug-mean baseline or against models trained on matched response distributions without explicit MoA labels. If the benefit arises primarily from response matching rather than MoA identity per se, the interpretation that 'pan-cancer co-training suppresses pathway-specific signals' would require additional controls.
minor comments (2)
  1. [Abstract] The abstract states results on 'four independent datasets' but does not name them or specify the exact data splits, preprocessing, or number of drugs/cells per dataset; this should be added for reproducibility.
  2. [Methods] Notation for per-drug Pearson r should be defined explicitly (e.g., as the average of per-drug correlations) with a clear equation, as it is central to all claims.
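One way the requested definition could be written is sketched below. It assumes the unweighted average over drugs that the referee suggests; the symbols ($D$ for the set of evaluated drugs, $C_d$ for the cells tested against drug $d$, $y_{cd}$ and $\hat{y}_{cd}$ for observed and predicted responses) are introduced here for illustration and are not taken from the manuscript.

```latex
r_d = \frac{\sum_{c \in C_d} \left(y_{cd} - \bar{y}_d\right)\left(\hat{y}_{cd} - \bar{\hat{y}}_d\right)}
           {\sqrt{\sum_{c \in C_d} \left(y_{cd} - \bar{y}_d\right)^2}\;
            \sqrt{\sum_{c \in C_d} \left(\hat{y}_{cd} - \bar{\hat{y}}_d\right)^2}},
\qquad
\bar{r}_{\text{per-drug}} = \frac{1}{|D|} \sum_{d \in D} r_d
```

Here $\bar{y}_d$ and $\bar{\hat{y}}_d$ are means over $c \in C_d$. Global r, by contrast, is a single correlation computed over all $(c, d)$ pairs at once, so it also rewards predicting the between-drug differences in $\bar{y}_d$.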

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review. Their comments have helped us better articulate the rationale for our metric choices and the implications of our findings. We address each of the major comments below, indicating where revisions have been made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract and Results: The central claim that per-drug Pearson r is the appropriate target metric (and that global r is an artifact) is load-bearing, yet the manuscript does not report whether models optimized for per-drug r also reduce absolute error (e.g., MAE or RMSE on IC50 values) or improve rank-ordering against clinical biomarkers. Per-drug normalization discards between-drug potency differences that determine dosing and toxicity margins; without this check, the reported stagnation and stratification gains may not translate to clinically actionable improvements.

    Authors: We appreciate the referee's emphasis on clinical relevance. The per-drug Pearson r metric was selected specifically to evaluate the model's ability to rank cells by sensitivity within each drug, which addresses the core challenge in drug-blind prediction where between-drug potency variations are not the focus. Nevertheless, we agree that absolute error metrics provide complementary information. In the revised version, we report MAE and RMSE for the stratification experiments, confirming that gains in per-drug r are accompanied by reductions in absolute prediction error. With respect to clinical biomarkers, integrating such validations would require external datasets with patient outcomes, which lies outside the current scope focused on cell-line benchmarks. We have added a paragraph in the Discussion acknowledging this limitation and the potential disconnect with dosing considerations.

    revision: partial

  2. Referee: [Results] Results section on controlled experiment: The claim that supplying MoA as a feature yields negligible benefit while stratification raises per-drug r substantially lacks detail on implementation (e.g., how many samples per MoA group, whether models are trained separately or with group-specific heads, and statistical significance of the per-drug r gains across the four datasets). This is critical because sample-size reduction from stratification could inflate variance and undermine the cross-dataset generalization argument.

    Authors: We thank the referee for pointing out the need for greater methodological transparency. In the controlled experiment, we used MoA annotations to define strata, restricting analysis to MoA groups with a minimum of 80 samples to ensure robust training. Separate models were trained for each stratum using the same neural network architecture and hyperparameters as the pan-cancer baseline, without group-specific heads. Statistical significance of the per-drug r improvements was evaluated using paired Wilcoxon tests across drugs, with significant gains (p < 0.05) observed consistently in the kinase inhibitor category across the four datasets. Sample counts per MoA group are now detailed in a new supplementary table. These additions clarify that the observed benefits are not attributable to variance inflation from reduced sample sizes.

    revision: yes

  3. Referee: [Discussion] Discussion or Experiments: The paper does not compare the per-drug r gains from stratification against a simple drug-mean baseline or against models trained on matched response distributions without explicit MoA labels. If the benefit arises primarily from response matching rather than MoA identity per se, the interpretation that 'pan-cancer co-training suppresses pathway-specific signals' would require additional controls.

    Authors: This suggestion for additional controls is well-taken and helps refine our interpretation. A drug-mean baseline yields a per-drug Pearson r of zero by design (strictly, the correlation is undefined because its predictions are constant within each drug, and it is conventionally scored as zero), so it serves as a trivial lower bound. To isolate the effect of response distribution matching, we added a control experiment in which we subsampled the training data to match the mean and variance of IC50 responses in the MoA-stratified sets, but without conditioning on MoA labels. The results indicate that while response matching contributes to some improvement, the full benefit of MoA stratification exceeds this, consistent with the idea that pan-cancer training can dilute pathway-specific signals. We have included this control analysis in the revised Results and Discussion sections.

    revision: yes
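The paired Wilcoxon test across drugs mentioned in response 2 can be sketched as follows. This is an illustration only: the function name `wilcoxon_signed_rank_exact` is hypothetical, the p-value is computed by exact enumeration of sign assignments (practical only for a small number of drugs), and the per-drug r values are made-up numbers, not results from the paper.

```python
from itertools import product

def wilcoxon_signed_rank_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Returns (W_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    d = [b - a for a, b in zip(x, y) if b != a]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    centre = sum(ranks) / 2  # expected W_plus under the null hypothesis
    observed = abs(w_plus - centre)

    # Under the null, each rank is positive or negative with probability 1/2,
    # so enumerate all 2**n sign patterns and count those at least as extreme.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            hits += 1
    return w_plus, hits / 2 ** n

# Hypothetical per-drug r for eight kinase inhibitors (illustrative numbers).
pooled_r = [0.12, 0.20, 0.08, 0.15, 0.25, 0.18, 0.05, 0.22]
stratified_r = [0.30, 0.31, 0.29, 0.40, 0.33, 0.45, 0.21, 0.52]

w_plus, p_value = wilcoxon_signed_rank_exact(pooled_r, stratified_r)
print(f"W+ = {w_plus}, two-sided exact p = {p_value}")
```

Because every made-up drug improves, only the all-positive and all-negative sign patterns are as extreme as the observed one, giving p = 2/256. A paired test of this kind addresses whether gains are consistent across drugs; it does not by itself rule out variance inflation from smaller strata, which is a separate question.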

Circularity Check

0 steps flagged

No circularity: empirical results rest on direct dataset comparisons

Full rationale

The paper reports empirical findings across four independent datasets, comparing global vs. per-drug Pearson r, drug encodings vs. cell-only baselines, and MoA supplied as a feature versus as a training stratification constraint. No equations, derivations, or fitted parameters are defined in terms of the target quantities; per-drug r is computed directly from held-out predictions and labels without presupposing the reported gains from stratification. The central claim that training distribution sets the performance ceiling follows from the controlled experiments rather than reducing to a self-definition or self-citation chain. Any self-citations present are not load-bearing for the main results, which are externally falsifiable on the stated datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the work implicitly relies on standard assumptions that Pearson correlation is an appropriate ranking metric and that the four datasets adequately sample pan-cancer biology.


