TACK: A statistical evaluation of degradation activity on a novel TArgeting Chimeras Knowledge dataset
Pith reviewed 2026-05-20 02:18 UTC · model grok-4.3
The pith
Classical machine learning models outperform a specialized graph neural network in predicting PROTAC degradation activity on a new aggregated dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the TACK dataset, scaffold-based 5×5 cross-validation and statistical tests support that XGBoost and MLP models reach a ROC-AUC of 0.85 for binary degradation activity classification, versus 0.74 for PROTAC-STAN (p < 0.001). Regression models trained on the best-performing feature sets attain R² = 0.66 for pDC50 but only R² = 0.36 for Dmax. Feature ablation further shows that cellular context features and simple protein representations match the performance of complex ESM protein embeddings, and ensemble-based uncertainty quantification shows that prediction variance correlates with prediction error for both regression tasks.
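The two headline metrics can be computed without any ML library, which makes their definitions explicit. The sketch below is not the authors' evaluation code: it assumes binary labels coded 0/1, computes ROC-AUC through its rank (Mann–Whitney U) identity with tied scores counted as one half, and computes R² as one minus the ratio of residual to total sum of squares. In practice, scikit-learn's `roc_auc_score` and `r2_score` compute the same quantities.

```python
from typing import Sequence


def _average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the normalised Mann-Whitney U statistic of the positives."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    ranks = _average_ranks(scores)
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` evaluates to 0.75, and a regressor that always predicts the mean scores R² = 0.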
What carries the argument
The TACK dataset of standardized PROTAC degradation endpoints together with scaffold-based cross-validation and feature ablation studies that directly compare classical models against a domain-specific graph neural network.
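A minimal sketch of scaffold-grouped, repeated 5×5 fold assignment follows. It assumes each compound's Bemis–Murcko scaffold SMILES has already been computed (for example with RDKit's `MurckoScaffold` module) and uses a simple greedy balancer over a shuffled scaffold order; the function names are hypothetical, and the page does not describe how TACK actually assigns scaffolds to folds.

```python
import random
from collections import defaultdict


def scaffold_folds(scaffolds, n_folds=5, n_repeats=5, seed=0):
    """Assign every compound to one of n_folds test folds, n_repeats times.

    Compounds sharing a scaffold always land in the same fold, so a scaffold
    is never split between the training and test part of a split.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    repeats = []
    for r in range(n_repeats):
        rng = random.Random(seed + r)  # a different shuffle per repeat
        keys = list(groups)
        rng.shuffle(keys)
        fold_sizes = [0] * n_folds
        assignment = [-1] * len(scaffolds)
        for key in keys:
            # Greedy balancing: give the whole scaffold group to the emptiest fold.
            fold = min(range(n_folds), key=lambda f: fold_sizes[f])
            for idx in groups[key]:
                assignment[idx] = fold
            fold_sizes[fold] += len(groups[key])
        repeats.append(assignment)
    return repeats


def train_test_indices(assignment, fold):
    """Split compound indices into (train, test) for one fold of one repeat."""
    train = [i for i, f in enumerate(assignment) if f != fold]
    test = [i for i, f in enumerate(assignment) if f == fold]
    return train, test
```

Five repeats of five folds give the 25 train/test splits of a 5×5 scheme. Because the grouping is by scaffold, test performance reflects generalization to unseen chemotypes rather than to near-duplicates of training compounds.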
If this is right
- Potency predictions can be used more confidently than maximum degradation predictions to prioritize compounds for synthesis and testing.
- Ensemble-based uncertainty estimates can flag low-confidence predictions so that experimental resources focus on high-variance cases.
- Simple protein representations and cellular context features can replace complex embeddings in future modeling pipelines without loss of accuracy.
- Classical methods such as XGBoost or MLP become the default choice for this task unless new architectures demonstrate clear gains on the same standardized data.
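One plausible reading of "simple protein representations and cellular context features" is categorical one-hot encodings appended to a molecular fingerprint. The page does not define these features, so the encoding below is an assumption: the helper names `one_hot` and `build_features` are hypothetical, and a one-hot target identity stands in for the protein representation.

```python
def one_hot(value, vocabulary):
    """One-hot vector over a fixed vocabulary; unseen values map to all zeros."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def build_features(fingerprint_bits, cell_line, e3_ligase, target, vocabs):
    """Concatenate a precomputed fingerprint with one-hot context blocks.

    vocabs: dict with (assumed) keys "cell_line", "e3" and "target", each a
    list of category names seen in the training fold.
    """
    return (
        [float(b) for b in fingerprint_bits]
        + one_hot(cell_line, vocabs["cell_line"])
        + one_hot(e3_ligase, vocabs["e3"])
        + one_hot(target, vocabs["target"])
    )
```

Building the vocabularies from the training fold only keeps test-set categories out of the encoding; a cell line never seen in training then contributes an all-zero block rather than a dedicated column.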
Where Pith is reading between the lines
- Standardization across public repositories may prove more valuable than architectural novelty for other targeted degradation modalities such as molecular glues.
- The stronger predictability of potency suggests computational screening campaigns could be designed to optimize DC50 values first and treat maximum degradation as a secondary filter.
- If feature engineering continues to rival sophisticated embeddings, future work could test whether the same pattern holds when the dataset is expanded with new E3 ligase or target protein families.
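The potency-first idea can be written as a hypothetical two-stage triage: rank by predicted pDC50, then use predicted Dmax only as a pass/fail filter. The thresholds (pDC50 ≥ 7, i.e. DC50 ≤ 100 nM, and Dmax ≥ 80%) are illustrative placeholders, not values from TACK.

```python
def potency_first_triage(candidates, min_pdc50=7.0, min_dmax=80.0):
    """Return compound ids ordered by predicted potency, filtered on Dmax.

    candidates: list of (compound_id, predicted_pdc50, predicted_dmax_percent).
    Potency drives the ranking because it is the more predictable endpoint;
    Dmax acts only as a secondary gate.
    """
    potent = [c for c in candidates if c[1] >= min_pdc50]
    potent.sort(key=lambda c: c[1], reverse=True)
    return [cid for cid, _, dmax in potent if dmax >= min_dmax]
```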
Load-bearing premise
Once molecular representations, protein labels, and experimental conditions are standardized, the data aggregated from three separate repositories accurately reflect true degradation activity, without major unaccounted biases from source differences or protocol variations.
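Standardization implies that repeated measurements of the same compound in the same experimental context must be reconciled. A minimal sketch, assuming canonical SMILES are already computed and that duplicates are merged by taking the median pDC50; the page does not say which aggregation TACK uses, and the record field names are hypothetical. Keeping the list of contributing sources makes it possible to check later whether one repository dominates.

```python
from collections import defaultdict
from statistics import median


def merge_duplicates(records):
    """Collapse repeated measurements of one experimental context.

    records: dicts with (assumed) keys 'smiles', 'uniprot', 'e3',
    'cell_line', 'pdc50' and 'source'.
    """
    grouped = defaultdict(list)
    for rec in records:
        key = (rec["smiles"], rec["uniprot"], rec["e3"], rec["cell_line"])
        grouped[key].append(rec)

    merged = []
    for (smiles, uniprot, e3, cell_line), recs in grouped.items():
        merged.append({
            "smiles": smiles,
            "uniprot": uniprot,
            "e3": e3,
            "cell_line": cell_line,
            "pdc50": median([r["pdc50"] for r in recs]),  # robust to one outlier source
            "sources": sorted({r["source"] for r in recs}),
            "n": len(recs),
        })
    return merged
```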
What would settle it
An independent collection of PROTAC degradation measurements obtained under uniform experimental conditions that, when evaluated with the same scaffold-based cross-validation, shows no statistically significant performance gap between classical models and PROTAC-STAN or comparable predictability for pDC50 and Dmax.
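Whether a performance gap is statistically significant under 5×5 cross-validation is usually tested on the 25 paired per-split scores. The reference list below includes the Wilcoxon signed-rank, Friedman, Tukey, and Benjamini–Hochberg procedures, but the page does not state which one produced p < 0.001, so this is only an illustration of a paired Wilcoxon signed-rank test with a normal approximation. Because repeated CV splits share training data, the scores are not independent and such p-values are best read as approximate; `scipy.stats.wilcoxon` is the standard implementation.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Two-sided paired Wilcoxon signed-rank test, normal approximation.

    Zero differences are dropped; tied |differences| share average ranks,
    but the variance is not tie-corrected. Returns (W+, p_value).
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w_plus, p_value
```

If one model wins on all 25 splits, W+ reaches its maximum of 325 and the approximate two-sided p-value falls well below 0.001.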
Original abstract
Proteolysis-targeting chimeras (PROTACs) represent a promising therapeutic modality that induces targeted protein degradation by hijacking the ubiquitin-proteasome system. However, rational PROTAC design remains challenging due to the complex interplay between molecular structure, target proteins, E3 ligases, and the cellular context. We present TACK, a statistical evaluation of degradation activity on a novel TArgeting Chimeras Knowledge dataset of 3,514 PROTACs and 6,561 degradation endpoints aggregated from three major repositories with standardized molecular representations, protein annotations, and experimental conditions. Using scaffold-based 5×5 cross-validation, we perform a rigorous statistical comparison of three machine learning methods to predict PROTAC degradation activity across three tasks: DC50 and Dmax regression, and binary activity classification. Feature ablation demonstrates that cellular context features and simple protein representations rival complex ESM protein embeddings, highlighting the importance of feature engineering over architectural sophistication. Models trained on the best performing features show that potency (pDC50, R² = 0.66) is substantially more predictable than maximum degradation (Dmax, R² = 0.36). In activity prediction, statistical tests support that classical methods (XGBoost and MLP) significantly outperform PROTAC-STAN, a domain-specific graph neural network model (ROC-AUC: 0.85 vs. 0.74, p<0.001). Finally, we propose an ensemble-based uncertainty quantification approach showing that prediction variance correlates with prediction error (pDC50: Spearman ρ = 0.36, p<0.001; Dmax: ρ = 0.69, p<0.001), enabling confidence-aware experimental prioritization. Our findings challenge assumptions about specialized architectures for degradation prediction and provide evidence-based guidance for ML-driven PROTAC assessment.
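The variance–error correlation in the abstract can be reproduced in outline: average the ensemble members' predictions, take the spread across members as the uncertainty, and rank-correlate it with the absolute error of the ensemble mean. The sketch assumes an ensemble of independently trained regressors and uses population variance; how TACK builds its ensemble is not described on this page. Spearman's ρ is computed as the Pearson correlation of average ranks, which handles ties.

```python
import math
from statistics import mean, pvariance


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def variance_error_correlation(member_preds, y_true):
    """member_preds[m][i] is ensemble member m's prediction for compound i."""
    n = len(y_true)
    means = [mean([p[i] for p in member_preds]) for i in range(n)]
    variances = [pvariance([p[i] for p in member_preds]) for i in range(n)]
    abs_errors = [abs(m - t) for m, t in zip(means, y_true)]
    return spearman_rho(variances, abs_errors)
```

Sorting candidates by the same variances is one way to implement confidence-aware prioritization: the highest-variance compounds are the ones whose measurement is most informative.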
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the TACK dataset of 3,514 PROTACs and 6,561 degradation endpoints aggregated from three repositories with standardized representations. It evaluates XGBoost, MLP, and PROTAC-STAN via scaffold-based 5×5 cross-validation for pDC50 and Dmax regression plus binary activity classification, reporting that classical methods outperform the GNN (ROC-AUC 0.85 vs. 0.74, p<0.001), potency is more predictable than Dmax (R²=0.66 vs. 0.36), and ensemble uncertainty correlates with error.
Significance. If the data aggregation is robust, the work supplies concrete statistical evidence that feature engineering and classical models can outperform specialized graph architectures for PROTAC degradation prediction, directly challenging prevailing assumptions. The scaffold-based CV, reported R²/AUC values with significance tests, and uncertainty quantification add practical value for prioritizing experiments in this therapeutic area.
major comments (2)
- [Data aggregation and standardization (Abstract and Methods)] The headline claims (XGBoost/MLP ROC-AUC 0.85 vs. PROTAC-STAN 0.74; pDC50 R²=0.66 vs. Dmax R²=0.36) rest on the assumption that the 6,561 aggregated endpoints are directly comparable after standardization of molecular representations, protein annotations, and experimental conditions. Degradation readouts such as DC50 and Dmax are known to vary with cell line, assay format, incubation time, and detection method. Without explicit details on exclusion criteria, source-specific bias checks, or sensitivity analyses, residual protocol effects could systematically inflate predictability differences and favor classical models that exploit offsets over structure-aware GNNs. This is load-bearing for the central model-comparison and feature-ablation conclusions.
- [Results, model comparison subsection] The statistical tests supporting classical methods over PROTAC-STAN are reported with p<0.001, but the manuscript should specify whether the GNN received identical feature sets (including cellular context and protein representations) or was evaluated under its native graph-only regime. If the latter, the performance gap may reflect input differences rather than architecture, weakening the claim that 'architectural sophistication' is less important than feature engineering.
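One simple diagnostic for the first major comment is a source-only baseline: predict every endpoint by the mean of its repository and compute R². If this baseline explains a sizeable share of the variance, repository identity alone carries signal that a feature-rich model could exploit. The helper below is hypothetical and computed in-sample, so it is optimistic; a stricter version would fit the source means on training folds only.

```python
from collections import defaultdict


def source_offset_r2(sources, y):
    """R² obtained by predicting each value with the mean of its source."""
    groups = defaultdict(list)
    for s, v in zip(sources, y):
        groups[s].append(v)
    source_mean = {s: sum(vals) / len(vals) for s, vals in groups.items()}
    grand_mean = sum(y) / len(y)
    ss_res = sum((v - source_mean[s]) ** 2 for s, v in zip(sources, y))
    ss_tot = sum((v - grand_mean) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot
```

A value near 0 means the repositories share a common scale; a value approaching the models' own R² would suggest that source offsets, not structure, drive much of the reported performance.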
minor comments (2)
- [Abstract] Notation inconsistency: the abstract refers to DC50 regression yet reports pDC50 R²; ensure consistent use of pDC50 throughout.
- [Methods] Clarify implementation details of the scaffold-based 5×5 cross-validation to confirm no leakage between train and test folds across the three tasks.
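The leakage question can be made checkable with an audit over the explicit splits. The sketch below (a hypothetical `audit_splits`, applied to one repeat at a time) flags any scaffold that appears in both the training and the test data of a fold. It also applies a stricter cross-task condition, which the page does not confirm TACK uses: it flags any scaffold held out in different folds for different tasks.

```python
def audit_splits(splits, scaffold_of):
    """Return a list of leakage problems for one repeat; empty means none found.

    splits: dict task_name -> list of (train_ids, test_ids), one pair per fold.
    scaffold_of: dict compound_id -> scaffold string.
    """
    problems = []
    held_out_in = {}  # scaffold -> fold index where it was first held out
    for task, folds in splits.items():
        for k, (train_ids, test_ids) in enumerate(folds):
            train_scaffolds = {scaffold_of[c] for c in train_ids}
            test_scaffolds = {scaffold_of[c] for c in test_ids}
            # Within-task leakage: a scaffold on both sides of the same split.
            for s in sorted(train_scaffolds & test_scaffolds):
                problems.append(f"{task}, fold {k}: scaffold {s} in train and test")
            # Cross-task consistency: each scaffold is held out in one fold only.
            for s in sorted(test_scaffolds):
                first = held_out_in.setdefault(s, k)
                if first != k:
                    problems.append(
                        f"{task}, fold {k}: scaffold {s} was held out in fold {first} earlier"
                    )
    return problems
```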
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments on our manuscript. These have prompted us to clarify key aspects of the data aggregation process and model evaluation protocols. We address each major comment point by point below, indicating revisions made to the manuscript.
- Referee: [Data aggregation and standardization (Abstract and Methods)] The headline claims (XGBoost/MLP ROC-AUC 0.85 vs. PROTAC-STAN 0.74; pDC50 R²=0.66 vs. Dmax R²=0.36) rest on the assumption that the 6,561 aggregated endpoints are directly comparable after standardization of molecular representations, protein annotations, and experimental conditions. Degradation readouts such as DC50 and Dmax are known to vary with cell line, assay format, incubation time, and detection method. Without explicit details on exclusion criteria, source-specific bias checks, or sensitivity analyses, residual protocol effects could systematically inflate predictability differences and favor classical models that exploit offsets over structure-aware GNNs. This is load-bearing for the central model-comparison and feature-ablation conclusions.
Authors: We agree that explicit documentation of aggregation choices is essential to support the central claims. The original Methods section describes standardization of molecular representations (canonical SMILES via RDKit), protein targets (UniProt mapping), and activity values (conversion to pDC50 where feasible). To directly address the referee's concern, we have added a dedicated subsection on data curation that specifies exclusion criteria (e.g., removal of duplicates, entries lacking cell-line metadata, or inconsistent assay units) and source-specific bias diagnostics. We also include a sensitivity analysis in the revised Supplementary Information comparing model performance on single-repository subsets versus the aggregated set; the relative ordering of methods and the potency-vs-Dmax predictability gap remain consistent. These additions confirm that residual protocol variation does not drive the reported differences. (Revision: yes.)
- Referee: [Results, model comparison subsection] The statistical tests supporting classical methods over PROTAC-STAN are reported with p<0.001, but the manuscript should specify whether the GNN received identical feature sets (including cellular context and protein representations) or was evaluated under its native graph-only regime. If the latter, the performance gap may reflect input differences rather than architecture, weakening the claim that 'architectural sophistication' is less important than feature engineering.
Authors: We appreciate the referee's clarification request. PROTAC-STAN was evaluated strictly in its native graph-only regime as defined in the original publication, using only the molecular graph without cellular-context or protein-sequence features. This choice reflects how the model is typically deployed and allows a direct comparison between a specialized GNN architecture and classical models that incorporate the engineered feature set (including cellular context and simple protein annotations) described in our ablation study. We have revised the Results and Discussion sections to state this input-regime difference explicitly and have added a sentence noting that future work could explore augmenting graph models with the same contextual features. The performance gap therefore illustrates the practical value of feature engineering in the current PROTAC modeling landscape rather than a pure architectural comparison. (Revision: yes.)
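The conversion to pDC50 mentioned in the first response follows the usual convention pDC50 = −log10(DC50 in mol/L), so a DC50 of 100 nM (1e-7 M) gives pDC50 = 7. The unit table and function name below are assumptions; the page does not state which units the source repositories report.

```python
import math

# Conversion factors to mol/L for an assumed set of reported units.
TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "µM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def pdc50_from_dc50(value, unit="nM"):
    """pDC50 = -log10(DC50 expressed in mol/L)."""
    if value <= 0:
        raise ValueError("DC50 must be positive")
    return -math.log10(value * TO_MOLAR[unit])
```

Working on the log scale makes a tenfold change in DC50 a one-unit step in pDC50, which is why potency is modeled as pDC50 rather than raw DC50.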
Circularity Check
Empirical ML evaluation on held-out splits shows no circularity
The paper conducts a purely empirical statistical evaluation of ML models for PROTAC degradation prediction. It aggregates data from repositories, applies standardization, performs scaffold-based 5×5 cross-validation, conducts feature ablation, and reports direct performance metrics (ROC-AUC, R²) plus statistical tests on held-out data. No derivation chain, equations, or self-referential quantities are described; central claims rest on external data splits and comparisons rather than reducing to fitted parameters or self-citations by construction. The methodology is self-contained against benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- XGBoost and MLP hyperparameters
axioms (1)
- Domain assumption: Standardized molecular representations and protein annotations from three repositories accurately capture experimental degradation activity without source-specific biases.
Reference graph
Works this paper leans on
- [1] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2623–2631.
- [2] Jeremy R. Ash, Cas Wognum, Raquel Rodríguez-Pérez, Matteo Aldeghi, Alan C. Cheng, Djork-Arné Clevert, Ola Engkvist, Cheng Fang, Daniel J. Price, Jacqueline M. Hughes-Oliver, and W. Patrick Walters. 2025. Practically Significant Method Comparison Protocols for Machine Learning in Small Molecule Drug Discovery. Journal of Chemical Information and Modeling 6...
- [3] Amos Bairoch. 2018. The Cellosaurus, a cell-line knowledge resource. Journal of Biomolecular Techniques 29, 2 (2018), 25–38. doi:10.7171/jbt.18-2902-002
- [4] Miklós Békés, David R Langley, and Craig M Crews. 2022. PROTAC targeted protein degraders: the past is prologue. Nature Reviews Drug Discovery 21, 3 (2022), 181–200.
- [5] Guy W. Bemis and Mark A. Murcko. 1996. The Properties of Known Drugs. 1. Molecular Frameworks. Journal of Medicinal Chemistry 39, 15 (Jan. 1996), 2887–2893. doi:10.1021/jm9602928
- [7] Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 1 (1995), 289–300.
- [8] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems 24 (2011).
- [9] Hong Cai, Gengyuan Yao, Yulong Shi, Tianyi Zhang, and Yuanjia Hu. 2025. PROTAC-PatentDB: A PROTAC patent compound dataset. Scientific Data 12, 1 (2025), 1840.
- [10] Annabel Cardno, Bryony Kennedy, and Catherine Lindon. 2025. Cellular parameters shaping pathways of targeted protein degradation. Communications Biology 8, 1 (2025), 691.
- [11] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. 2004. Ensemble selection from libraries of models. In Proceedings of the Twenty-First International Conference on Machine Learning (Banff, Alberta, Canada) (ICML ’04). Association for Computing Machinery, New York, NY, USA, 18. doi:10.1145/1015330.1015432
- [12] Zhenglu Chen, Chunbin Gu, Shuoyan Tan, Xiaorui Wang, Yuquan Li, Mutian He, Ruiqiang Lu, Shijia Sun, Chang-Yu Hsieh, Xiaojun Yao, Huanxiang Liu, and Pheng-Ann Heng. 2025. Interpretable PROTAC degradation prediction with structure-informed deep ternary attention framework. Advanced Science 12, 47 (2025), e08138. doi:10.1002/advs.202508138
- [13] Deborah Chirnomas, Keith R Hornberger, and Craig M Crews. 2023. Protein degraders enter the clinic—a new approach to cancer therapy. Nature Reviews Clinical Oncology 20, 4 (2023), 265–278.
- [14] Michael L Drummond and Christopher I Williams. 2019. In silico modeling of PROTAC-mediated ternary complexes. Journal of Chemical Information and Modeling 59, 4 (2019), 1634–1644.
- [15] Nils Dunlop, Francisco Erazo, Farzaneh Jalalypour, and Rocío Mercado. 2025. Predicting PROTAC-mediated ternary complexes with AlphaFold3 and Boltz-1. Digital Discovery 4, 12 (2025), 3782–3809.
- [16] Milton Friedman. 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Amer. Statist. Assoc. 32, 200 (1937), 675–701. doi:10.1080/01621459.1937.10503522
- [17] Jingxuan Ge, Shimeng Li, Gaoqi Weng, Huating Wang, Meijing Fang, Huiyong Sun, Yafeng Deng, Chang-Yu Hsieh, Dan Li, and Tingjun Hou. 2025. PROTAC-DB 3.0: an updated database of PROTACs with extended pharmacokinetic parameters. Nucleic Acids Research 53, D1 (2025), D1510–D1515.
- [18] Yossra Gharbi and Rocío Mercado. 2024. A comprehensive review of emerging approaches in machine learning for de novo PROTAC design. Digital Discovery 3, 11 (2024), 2158–2176.
- [19] Warren Gilchrist. 2000. Statistical modelling with quantile functions. Chapman and Hall/CRC.
- [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) (ICCV ’15). IEEE Computer Society, USA, 1026–1034. doi:10.1109/ICCV.2015.123
- [21] Dejun Jiang, Zhenxing Wu, Chang-Yu Hsieh, Guangyong Chen, Ben Liao, Zhe Wang, Chao Shen, Dongsheng Cao, Jian Wu, and Tingjun Hou. 2021. Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. Journal of Cheminformatics 13, 1 (2021), 12.
- [22] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear Attention Networks. In Advances in Neural Information Processing Systems 31. 1571–1581.
- [23] Gregory Landrum. 2024. RDKit: Open-source Cheminformatics. https://www.rdkit.org. Accessed: 2026-02-01.
- [24] Howard Levene. 1960. Robust tests for equality of variances. Contributions to Probability and Statistics (1960), 278–292.
- [25] Fenglei Li, Qiaoyu Hu, Xianglei Zhang, Renhong Sun, Zhuanghua Liu, Sanan Wu, Siyuan Tian, Xinyue Ma, Zhizhuo Dai, Xiaobao Yang, et al. 2022. DeepPROTACs is a deep learning-based targeted degradation predictor for PROTACs. Nature Communications 13, 1 (2022), 7133.
- [26] Nir London and Jaime Prilusky. 2024. PROTACpedia. https://protacpedia.weizmann.ac.il. Accessed: 2026-02-01.
- [27] Divya Nori, Connor W. Coley, and Rocío Mercado. 2022. De novo PROTAC design using graph-based deep generative models. In Proceedings of the NeurIPS 2022 Workshop on AI for Science. doi:10.48550/arXiv.2211.02660
- [29] Andrew Pike, Beth Williamson, Stephanie Harlfinger, Scott Martin, and Dermot F McGinnity. 2020. Optimising proteolysis-targeting chimeras (PROTACs) for oral drug delivery: a drug metabolism and pharmacokinetics perspective. Drug Discovery Today 25, 10 (2020), 1793–1800.
KDD AI for Sciences Track ’26, Aug 09–13, 2026, Jeju, Korea · Ribes et al.
- [30] Xinran Qin, Yinpeng Zhang, Yajunzi Wang, Yintao Zhang, Jiachen Jing, Yuyuan Zhang, Gaoxiang Xu, Haoping Teng, Tianjun Wang, Lei Fu, et al. 2026. TPDdb: the comprehensive database of targeted protein degrader. Nucleic Acids Research 54, D1 (2026), D1683–D1691.
- [31] Stefano Ribes, Eva Nittinger, Christian Tyrchan, and Rocío Mercado. 2024. Modeling PROTAC degradation activity with machine learning. Artificial Intelligence in the Life Sciences 6 (2024), 100104. doi:10.1016/j.ailsci.2024.100104
- [32] Kathleen M Sakamoto, Kyung B Kim, Akiko Kumagai, Frank Mercurio, Craig M Crews, and Raymond J Deshaies. 2001. Protacs: Chimeric molecules that target proteins to the Skp1–Cullin–F box complex for ubiquitination and degradation. Proceedings of the National Academy of Sciences 98, 15 (2001), 8554–8559.
- [33] Ashley R Schneekloth, Mathieu Pucheault, Hyun Seop Tae, and Craig M Crews. 2008. Targeted intracellular protein degradation induced by a small molecule: En route to chemical proteomics. Bioorganic & Medicinal Chemistry Letters 18, 22 (2008), 5904–5908.
- [35] The UniProt Consortium. 2025. UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Research 53, D1 (2025), D609–D617. doi:10.1093/nar/gkae1010
- [36] John W Tukey. 1949. Comparing individual means in the analysis of variance. Biometrics (1949), 99–114.
- [37] Derek van Tilborg and Francesca Grisoni. 2024. Traversing chemical space with active deep learning for low-data drug discovery. Nature Computational Science 4, 10 (2024), 786–796.
- [38] Gaoqi Weng, Xuanyan Cai, Dongsheng Cao, Hongyan Du, Chao Shen, Yafeng Deng, Qiaojun He, Bo Yang, Dan Li, and Tingjun Hou. 2023. PROTAC-DB 2.0: an updated database of PROTACs. Nucleic Acids Research 51, D1 (2023), D1367–D1372.
- [39] R. F. Woolson. 2005. Wilcoxon Signed-Rank Test. John Wiley & Sons, Ltd. doi:10.1002/0470011815.b2a15177
- [40] Daniel Zaidman, Jaime Prilusky, and Nir London. 2020. PRosettaC: Rosetta based modeling of PROTAC mediated ternary complexes. Journal of Chemical Information and Modeling 60, 9 (2020), 4467–4480.
- [41] Zuobai Zhang, Jiarui Lu, Vijil Chenthamarakshan, Aurélie Lozano, Payel Das, and Jian Tang. 2024. Structure-Informed Protein Language Model. arXiv:2402.05856. doi:10.48550/arXiv.2402.05856
A Model Details
We developed a multilayer perceptron (MLP) architecture with flexible depth and width. The MLP consists of a variable number of hidden layers with dimensi...
[Figure, partially extracted: The top-left-most model indicates the best performing model for a given metric. Panels show ROC-AUC and PR-AUC distributions (Count vs. Value) and Q-Q plots (Ordered Values vs. Theoretical quantiles).]