Benchmarking Single-Pose Docking, Consensus Rescoring, and Supervised ML on the LIT-PCBA Library: A Critical Evaluation of DiffDock, AutoDock-GPU, GNINA, and DiffDock-NMDN
Pith reviewed 2026-05-10 16:02 UTC · model grok-4.3
The pith
Supervised ML re-ranking of docking scores raises early enrichment by 110 percent over the best classical method on the LIT-PCBA library.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GNINA rescoring of AutoDock-GPU poses emerged as the strongest single method with a median EF1% of 2.14. DiffDock-based approaches underperformed relative to AutoDock-GNINA, particularly on challenging targets such as OPRK1. Carefully designed consensus ranking improved robustness but did not surpass the best single scorer. Supervised ML re-ranking delivered the largest gains, achieving a median EF1% of 4.49 (+110% over AutoDock-GNINA).
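EF1%, the metric behind these numbers, is the enrichment factor in the top 1% of the score-ranked library (see [9] for the standard treatment of early-recognition metrics). A minimal sketch of the computation — function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def enrichment_factor(scores, is_active, fraction=0.01):
    """EF at a top fraction: hit rate in the top slice of the ranked
    library divided by the hit rate of the whole library."""
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    n_top = max(1, int(round(len(scores) * fraction)))
    top = np.argsort(-scores)[:n_top]          # best-scoring ligands first
    return is_active[top].mean() / is_active.mean()

# Toy library of 1,000 ligands with 10 actives; this scorer places 4 actives
# in the top 1% (10 compounds), so EF1% = (4/10) / (10/1000) = 40.
scores = np.linspace(1.0, 0.0, 1000)           # already sorted best-first
labels = np.zeros(1000, dtype=bool)
labels[[0, 1, 2, 3, 500, 501, 502, 503, 504, 505]] = True
print(enrichment_factor(scores, labels))       # → 40.0
```

On this scale, a median EF1% of 2.14 means the top 1% of the ranked library is only about twice as enriched in actives as a random slice; the attainable maximum depends on each target's active fraction.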
What carries the argument
Supervised machine learning re-ranking model trained on features extracted from docking poses and scores to prioritize experimentally confirmed actives.
If this is right
- DiffDock-based approaches underperformed relative to AutoDock-GNINA, particularly on challenging targets such as OPRK1.
- Consensus ranking improved robustness but did not surpass the best single scorer.
- No single docking method dominates across targets.
- Rigorously validated, cost-effective combinations with supervised re-ranking currently offer the most practical value for virtual screening.
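The consensus strategies evaluated are rank-based, but the exact scheme is not spelled out here. One common variant, average-rank consensus across scorers, can be sketched as follows (all names illustrative, not the paper's implementation):

```python
import numpy as np

def average_rank_consensus(score_matrix):
    """Mean-rank consensus over several scoring functions.

    score_matrix: (n_ligands, n_scorers), higher score = better.
    Returns one consensus value per ligand; lower = better."""
    s = np.asarray(score_matrix, dtype=float)
    n = s.shape[0]
    order = np.argsort(-s, axis=0)             # per-scorer ranking, best first
    ranks = np.empty_like(order)
    for j in range(s.shape[1]):
        ranks[order[:, j], j] = np.arange(n)   # rank 0 = best for scorer j
    return ranks.mean(axis=1)

# Two scorers that disagree on the top pick; the consensus favors the
# ligand that both scorers place near the top.
scores = np.array([[0.90, 0.20],
                   [0.85, 0.90],
                   [0.10, 0.30]])
print(average_rank_consensus(scores))          # mean ranks: 1.0, 0.5, 1.5
```

Working in rank space rather than raw scores sidesteps the incompatible scales of different scoring functions, which is one reason rank-based consensus tends to be more robust than score averaging.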
Where Pith is reading between the lines
- If the ML models overfit to the specific 15 targets and feature distributions in LIT-PCBA, performance on truly novel targets could drop well below the reported levels.
- The modest absolute enrichment values suggest virtual screening may need to be paired with orthogonal filters such as ligand-based similarity or experimental follow-up to reach usable hit rates.
- Future tests could examine whether adding explicit 3D structural or interaction fingerprints to the ML feature set would push EF1% higher than the current docking-derived features allow.
Load-bearing premise
The LIT-PCBA library of 15 targets with experimentally confirmed actives and inactives represents real-world virtual screening, and ML models trained on its docking features will generalize to new targets.
What would settle it
Apply the same supervised ML re-ranking pipeline to an independent library of targets and ligands outside LIT-PCBA and measure whether median EF1% stays near 4.49 or falls substantially.
Original abstract
Virtual screening performance depends heavily on the chosen docking and scoring methods. Recent AI-based tools such as DiffDock and NMDN have reported strong benchmark results, but their practical utility on realistic, experimentally-derived datasets remains unclear. Here we perform a large-scale evaluation on the LIT-PCBA library (15 targets, 578,295 ligand-target pairs with experimentally confirmed actives and inactives). We compare AutoDock-GPU and DiffDock for pose generation, followed by rescoring with GNINA and NMDN. We further evaluate rank-based consensus strategies and supervised machine learning models trained on docking features. GNINA rescoring of AutoDock-GPU poses (AutoDock-GNINA) emerged as the strongest single method with a median EF1% of 2.14. DiffDock-based approaches underperformed relative to AutoDock-GNINA, particularly on challenging targets such as OPRK1. Carefully designed consensus ranking improved robustness but did not surpass the best single scorer. Supervised ML re-ranking delivered the largest gains, achieving a median EF1% of 4.49 (+110% over AutoDock-GNINA). Our results highlight that even the best classical+ML hybrid workflows provide only modest early enrichment on realistic benchmarks. We conclude that no single docking method dominates across targets and that rigorously validated, cost-effective combinations with supervised re-ranking currently offer the most practical value for virtual screening.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper benchmarks pose generation with DiffDock and AutoDock-GPU, rescoring with GNINA and NMDN, rank-based consensus, and supervised ML re-ranking on docking-derived features using the LIT-PCBA library (15 targets, 578,295 ligand-target pairs with experimental labels). GNINA rescoring of AutoDock-GPU poses is the strongest single method (median EF1% 2.14); DiffDock-based methods underperform, especially on targets like OPRK1; consensus improves robustness but does not exceed the top single scorer; supervised ML achieves the largest gain (median EF1% 4.49, +110% over AutoDock-GNINA). The central conclusion is that no single method dominates across targets and that classical+ML hybrids currently offer the most practical value, though overall early enrichment remains modest on this realistic benchmark.
Significance. If the ML results hold under proper validation, the work provides a valuable large-scale, experimentally grounded critique of recent AI docking tools and demonstrates that supervised re-ranking on classical docking features can deliver meaningful but still limited gains. The scale of the LIT-PCBA evaluation (hundreds of thousands of pairs) and direct use of experimental actives/inactives strengthen the assessment of practical utility; the finding that even the best hybrid yields only modest EF1% underscores the difficulty of early enrichment and could usefully inform virtual screening practice.
Major comments (2)
- [Methods] Methods (ML re-ranking subsection): The supervised models are trained on docking features from the 15 LIT-PCBA targets, but the manuscript does not describe the train/test split strategy (target hold-out vs. random per-ligand), hyperparameter selection, or feature preprocessing. With only 15 targets this is load-bearing for the headline claim of a generalizable +110% EF1% lift and for the conclusion that such hybrids offer practical value on new targets; without explicit cross-target validation, the observed gain could reflect fitting to target-specific pocket or chemotype distributions rather than transferable scoring signals.
- [Results] Results (median EF1% reporting and Table of per-target values): The median EF1% figures (2.14 for AutoDock-GNINA, 4.49 for ML) are presented without accompanying per-target distributions, inter-quartile ranges, or statistical significance tests (e.g., paired Wilcoxon across the 15 targets). This weakens the robustness claim that “no single docking method dominates across targets” and the assertion that ML delivers consistent gains, especially given the noted underperformance on challenging targets such as OPRK1.
Minor comments (2)
- [Abstract] Abstract: EF1% is used without a one-sentence definition or reference to its standard formula; while familiar to specialists, a brief parenthetical would improve accessibility for the broader readership.
- [Figures] Figure legends (consensus and ML panels): Axis labels and color keys for the EF1% boxplots or bar charts should explicitly state the number of targets (n=15) and whether medians are weighted or unweighted.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review, which has identified important areas for improving the clarity and rigor of our manuscript. We address each major comment below and will revise the paper accordingly to incorporate the requested details and analyses.
Point-by-point responses
Referee: [Methods] Methods (ML re-ranking subsection): The supervised models are trained on docking features from the 15 LIT-PCBA targets, but the manuscript does not describe the train/test split strategy (target hold-out vs. random per-ligand), hyperparameter selection, or feature preprocessing. With only 15 targets this is load-bearing for the headline claim of a generalizable +110% EF1% lift and for the conclusion that such hybrids offer practical value on new targets; without explicit cross-target validation, the observed gain could reflect fitting to target-specific pocket or chemotype distributions rather than transferable scoring signals.
Authors: We agree that the ML validation details are essential for supporting the generalizability claims. The original manuscript omitted a full description of these procedures. In the revised version, we will expand the Methods section to explicitly state that we used a leave-one-target-out cross-validation strategy (training on 14 targets and testing on the held-out target) to evaluate transferability across targets. Hyperparameters were tuned via grid search using an inner validation split within the training targets, and all features were z-score standardized using statistics computed solely from the training data for each fold. These additions will clarify that the reported EF1% gains are based on cross-target evaluation rather than within-target fitting. revision: yes
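The validation scheme the authors describe — leave-one-target-out with standardization statistics taken from the training fold only — can be sketched as below. The logistic model, hyperparameters, and all names are illustrative stand-ins; the manuscript's actual model and feature set are not specified here:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=500):
    """Tiny logistic-regression trainer (full-batch gradient descent),
    standing in for whatever supervised model is actually used."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def leave_one_target_out(X, y, target_ids):
    """Train on all targets but one, predict the held-out target.
    Features are z-scored with training-fold statistics only, so the
    held-out target leaks nothing into preprocessing."""
    preds = np.empty(len(y))
    for t in np.unique(target_ids):
        test = target_ids == t
        mu = X[~test].mean(axis=0)
        sd = X[~test].std(axis=0) + 1e-12      # guard against zero variance
        w, b = fit_logistic((X[~test] - mu) / sd, y[~test])
        preds[test] = 1.0 / (1.0 + np.exp(-((X[test] - mu) / sd @ w + b)))
    return preds

# Synthetic illustration: 3 targets x 20 ligands, 4 docking-derived features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] + 0.3 * rng.normal(size=60) > 0).astype(float)
targets = np.repeat([0, 1, 2], 20)
preds = leave_one_target_out(X, y, targets)    # one probability per ligand
```

The key design point is that every quantity derived from the data (means, standard deviations, model weights, any inner-loop hyperparameter choice) is computed inside each fold's training targets, so the held-out target's EF1% genuinely measures cross-target transfer.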
Referee: [Results] Results (median EF1% reporting and Table of per-target values): The median EF1% figures (2.14 for AutoDock-GNINA, 4.49 for ML) are presented without accompanying per-target distributions, inter-quartile ranges, or statistical significance tests (e.g., paired Wilcoxon across the 15 targets). This weakens the robustness claim that “no single docking method dominates across targets” and the assertion that ML delivers consistent gains, especially given the noted underperformance on challenging targets such as OPRK1.
Authors: We concur that additional statistical detail would strengthen the robustness of our conclusions. In the revision, we will add a supplementary table listing the per-target EF1% values for all methods, along with a figure showing boxplots of the distributions and inter-quartile ranges across the 15 targets. We will also compute and report paired Wilcoxon signed-rank tests (with p-values) comparing the methods across targets, including the significance of the ML improvement over AutoDock-GNINA. This will provide quantitative support for the statements that no single method dominates and that ML yields consistent gains despite variability on targets like OPRK1. revision: yes
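The proposed paired comparison across the 15 targets can be sketched with a normal-approximation Wilcoxon signed-rank test; in practice the authors would more likely use scipy.stats.wilcoxon, and the data below are synthetic, not the paper's per-target values:

```python
import numpy as np
from math import erf, sqrt

def paired_wilcoxon(a, b):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.
    Ties in |d| are not average-ranked in this sketch, and the
    approximation wants roughly 10+ pairs; adequate for 15 targets."""
    d = np.asarray(a, float) - np.asarray(b, float)
    d = d[d != 0.0]                              # drop zero differences
    n = len(d)
    ranks = np.abs(d).argsort().argsort() + 1.0  # ranks of |d|, 1..n
    w_plus = ranks[d > 0].sum()
    mu = n * (n + 1) / 4.0
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = abs(w_plus - mu) / sigma
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))
    return w_plus, p

# Synthetic per-target EF1% values for 15 targets (NOT the paper's data):
# here the "ML" method is strictly higher on every target.
rng = np.random.default_rng(1)
ef_gnina = rng.gamma(2.0, 1.2, size=15)
ef_ml = ef_gnina + rng.gamma(1.5, 1.0, size=15)
w, p = paired_wilcoxon(ef_ml, ef_gnina)
print(w, p)        # w = 120 (all 15 differences positive), p well below 0.05
```

Because the test operates on signed ranks of per-target differences, it directly addresses the referee's concern: a significant result requires the ML gain to be consistent across targets, not driven by a few large wins.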
Circularity Check
Empirical benchmarking on external experimental labels shows no circularity
Full rationale
The paper reports empirical enrichment factors (EF1%) from direct comparison of docking and ML rescoring outputs against experimentally confirmed actives/inactives in the LIT-PCBA library (15 targets, 578k pairs). No derivation chain, equations, or first-principles results exist that reduce to fitted parameters or self-referential definitions by construction. Supervised ML re-ranking is trained on docking-derived features and evaluated for performance gains, but these are standard held-out or cross-validated metrics on external labels rather than tautological predictions. Self-citations, if present, are not load-bearing; the central claims rest on independent experimental ground truth.
Reference graph
Works this paper leans on
[1] Pose generation: for each protein-ligand pair, DiffDock is used to sample candidate poses as the first stage of a two-stage protocol called DiffDock-NMDN, intended to improve DiffDock's screening enrichment [13]. In this first stage, DiffDock generates multiple candidate binding poses across the entire protein surface per ligand in a blind docking fashion.
[2] Rescoring with NMDN: the generated poses are then evaluated and ranked in the second stage using the Normalized Mixture Density Network (NMDN) score. NMDN is a distinct deep-learning-based scoring function that learns the probability density distribution of distances between protein residues and ligand atoms to estimate binding strength, producing a "pKd-like" score.
[3] Sunseri J, Koes DR. Virtual screening with Gnina 1.0. Molecules. 2021;26(23):7369. doi:10.3390/molecules26237369
[4] Morris CJ. MILCDock: Machine Learning-Enhanced Consensus Docking for Virtual Screening in Drug Discovery [senior thesis]. Provo (UT): Brigham Young University; 2022. Available from: https://physics.byu.edu/docs/thesis/1542
[5] da Silva MMP, Guedes IA, Custódio FL, Krempser E, Dardenne LE. Deep learning strategies for enhanced molecular docking and virtual screening. ChemRxiv [Preprint]. 2023 Nov 7. doi:10.26434/chemrxiv-2023-zfv87-v2
[6] Mysinger MM, Carchia M, Irwin JJ, Shoichet BK. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. J Med Chem. 2012;55(14):6582-6594. doi:10.1021/jm300687e
[7] Tran-Nguyen VK, Jacquemard C, Rognan D. LIT-PCBA: an unbiased data set for machine learning and virtual screening. J Chem Inf Model. 2020;60(9):4263-4273. doi:10.1021/acs.jcim.0c00155
[8] Huang A, Knight IS, Naprienko S. Data leakage and redundancy in the LIT-PCBA benchmark. arXiv [Preprint]. 2025 Jul 29; arXiv:2507.21404. Available from: https://arxiv.org/abs/2507.21404
[9] Truchon JF, Bayly CI. Evaluating virtual screening methods: good and bad metrics for the "early recognition" problem. J Chem Inf Model. 2007;47(2):488-508. doi:10.1021/ci600426e
[10] Solis-Vasquez L, Tillack AF, Santos-Martins D, Koch A, LeGrand S, Forli S. Benchmarking the performance of irregular computations in AutoDock-GPU molecular docking. Parallel Comput. 2022;109:102861. doi:10.1016/j.parco.2021.102861
[11] Darme P, Dauchez M, Renard A, Voutquenne-Nazabadioko L, Aubert D, Escotte-Binet S, et al. AMIDE v2: high-throughput screening based on AutoDock-GPU and improved workflow leading to better performance and reliability. Int J Mol Sci. 2021;22(14):7489. doi:10.3390/ijms22147489
[12] Santos-Martins D, Solis-Vasquez L, Tillack AF, Sanner MF, Koch A, Forli S. Accelerating AutoDock4 with GPUs and gradient-based local search. J Chem Theory Comput. 2021;17(2):1060-1073. doi:10.1021/acs.jctc.0c01006
[13] Corso G, Stärk H, Jing B, Barzilay R, Jaakkola T. DiffDock: diffusion steps, twists, and turns for molecular docking. arXiv [Preprint]. 2022 Oct 4; arXiv:2210.01776. Available from: https://arxiv.org/abs/2210.01776
[14] Jain AN, Cleves AE, Walters WP. Deep-learning based docking methods: fair comparisons to conventional docking workflows. arXiv [Preprint]. 2024 Dec 3; arXiv:2412.02889. Available from: https://arxiv.org/abs/2412.02889
[15] Xia S, Gu Y, Zhang Y. Normalized protein-ligand distance likelihood score for end-to-end blind docking and virtual screening. J Chem Inf Model. 2025;65(3):1101-1114. doi:10.1021/acs.jcim.4c01014
[16] Nahm FS. Receiver operating characteristic curve: overview and practical use for clinicians. Korean J Anesthesiol. 2022;75(1):25-36. doi:10.4097/kja.21209
[17] Tang S, Chen R, Lin M, Lin Q, Zhu Y, Ding J, et al. Accelerating AutoDock Vina with GPUs. Molecules. 2022;27(9):3041. doi:10.3390/molecules27093041
[18] Bizon-Tech. NVIDIA RTX 3090 vs NVIDIA A100 40 GB (PCIe): GPU benchmarks [Internet]. [cited 2025 Aug 27]. Available from: https://bizon-tech.com/gpu-benchmarks/NVIDIA-RTX-3090-vs-NVIDIA-A100-40-GB-(PCIe)/579vs592
[19] Yu Y, Cai C, Wang J, Bo Z, Zhu Z, Zheng H. Uni-Dock: GPU-accelerated docking enables ultralarge virtual screening. J Chem Theory Comput. 2023;19(11):3336-3345. doi:10.1021/acs.jctc.2c01145
[20] Zhang X, Shen C, Jiang D, et al. TB-IECS: an accurate machine learning-based scoring function for virtual screening. J Cheminform. 2023;15:63. doi:10.1186/s13321-023-00731-x
[21] Lacour A, Ibrahim H, Volkamer A, Hirsch AKH. DockM8: an all-in-one open-source platform for consensus virtual screening in drug design. ChemRxiv [Preprint]. 2024 Jul. doi:10.26434/chemrxiv-2024-17k46
[22] Isomorphic Labs Team. Accurate predictions of novel biomolecular interactions with IsoDDE. Zenodo [Report]. 2026 Feb 10. doi:10.5281/zenodo.18606681
[23] Abo-Dahab Y, Hernandez R, Duran ICA. Pharmacology knowledge graphs enable drug repurposing without chemical structure information. Discov Artif Intell. 2026. doi:10.1007/s44163-026-01303-2