Benchmarking Single-Pose Docking, Consensus Rescoring, and Supervised ML on the LIT-PCBA Library: A Critical Evaluation of DiffDock, AutoDock-GPU, GNINA, and DiffDock-NMDN
Pith reviewed 2026-05-10 16:02 UTC · model grok-4.3
The pith
Supervised ML re-ranking of docking scores raises early enrichment by 110 percent over the best classical method on the LIT-PCBA library.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GNINA rescoring of AutoDock-GPU poses emerged as the strongest single method with a median EF1% of 2.14. DiffDock-based approaches underperformed relative to AutoDock-GNINA, particularly on challenging targets such as OPRK1. Carefully designed consensus ranking improved robustness but did not surpass the best single scorer. Supervised ML re-ranking delivered the largest gains, achieving a median EF1% of 4.49 (+110% over AutoDock-GNINA).
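EF1%, the metric behind these numbers, is the enrichment factor in the top 1% of the score-ranked library (see [9] for the standard treatment of early-recognition metrics). A minimal sketch of the computation — function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def enrichment_factor(scores, is_active, fraction=0.01):
    """EF at a top fraction: hit rate in the top slice of the ranked
    library divided by the hit rate of the whole library."""
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    n_top = max(1, int(round(len(scores) * fraction)))
    top = np.argsort(-scores)[:n_top]          # best-scoring ligands first
    return is_active[top].mean() / is_active.mean()

# Toy library of 1,000 ligands with 10 actives; this scorer places 4 actives
# in the top 1% (10 compounds), so EF1% = (4/10) / (10/1000) = 40.
scores = np.linspace(1.0, 0.0, 1000)           # already sorted best-first
labels = np.zeros(1000, dtype=bool)
labels[[0, 1, 2, 3, 500, 501, 502, 503, 504, 505]] = True
print(enrichment_factor(scores, labels))       # → 40.0
```

On this scale, a median EF1% of 2.14 means the top 1% of the ranked library is only about twice as enriched in actives as a random slice; the attainable maximum depends on each target's active fraction.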
What carries the argument
Supervised machine learning re-ranking model trained on features extracted from docking poses and scores to prioritize experimentally confirmed actives.
If this is right
- DiffDock-based approaches underperformed relative to AutoDock-GNINA, particularly on challenging targets such as OPRK1.
- Consensus ranking improved robustness but did not surpass the best single scorer.
- No single docking method dominates across targets.
- Rigorously validated, cost-effective combinations with supervised re-ranking currently offer the most practical value for virtual screening.
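The consensus strategies evaluated are rank-based, but the exact scheme is not spelled out here. One common variant, average-rank consensus across scorers, can be sketched as follows (all names illustrative, not the paper's implementation):

```python
import numpy as np

def average_rank_consensus(score_matrix):
    """Mean-rank consensus over several scoring functions.

    score_matrix: (n_ligands, n_scorers), higher score = better.
    Returns one consensus value per ligand; lower = better."""
    s = np.asarray(score_matrix, dtype=float)
    n = s.shape[0]
    order = np.argsort(-s, axis=0)             # per-scorer ranking, best first
    ranks = np.empty_like(order)
    for j in range(s.shape[1]):
        ranks[order[:, j], j] = np.arange(n)   # rank 0 = best for scorer j
    return ranks.mean(axis=1)

# Two scorers that disagree on the top pick; the consensus favors the
# ligand that both scorers place near the top.
scores = np.array([[0.90, 0.20],
                   [0.85, 0.90],
                   [0.10, 0.30]])
print(average_rank_consensus(scores))          # mean ranks: 1.0, 0.5, 1.5
```

Working in rank space rather than raw scores sidesteps the incompatible scales of different scoring functions, which is one reason rank-based consensus tends to be more robust than score averaging.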
Where Pith is reading between the lines
- If the ML models overfit to the specific 15 targets and feature distributions in LIT-PCBA, performance on truly novel targets could drop well below the reported levels.
- The modest absolute enrichment values suggest virtual screening may need to be paired with orthogonal filters such as ligand-based similarity or experimental follow-up to reach usable hit rates.
- Future tests could examine whether adding explicit 3D structural or interaction fingerprints to the ML feature set would push EF1% higher than the current docking-derived features allow.
Load-bearing premise
The LIT-PCBA library of 15 targets with experimentally confirmed actives and inactives represents real-world virtual screening, and ML models trained on its docking features will generalize to new targets.
What would settle it
Apply the same supervised ML re-ranking pipeline to an independent library of targets and ligands outside LIT-PCBA and measure whether median EF1% stays near 4.49 or falls substantially.
Original abstract
Virtual screening performance depends heavily on the chosen docking and scoring methods. Recent AI-based tools such as DiffDock and NMDN have reported strong benchmark results, but their practical utility on realistic, experimentally-derived datasets remains unclear. Here we perform a large-scale evaluation on the LIT-PCBA library (15 targets, 578,295 ligand-target pairs with experimentally confirmed actives and inactives). We compare AutoDock-GPU and DiffDock for pose generation, followed by rescoring with GNINA and NMDN. We further evaluate rank-based consensus strategies and supervised machine learning models trained on docking features. GNINA rescoring of AutoDock-GPU poses (AutoDock-GNINA) emerged as the strongest single method with a median EF1% of 2.14. DiffDock-based approaches underperformed relative to AutoDock-GNINA, particularly on challenging targets such as OPRK1. Carefully designed consensus ranking improved robustness but did not surpass the best single scorer. Supervised ML re-ranking delivered the largest gains, achieving a median EF1% of 4.49 (+110% over AutoDock-GNINA). Our results highlight that even the best classical+ML hybrid workflows provide only modest early enrichment on realistic benchmarks. We conclude that no single docking method dominates across targets and that rigorously validated, cost-effective combinations with supervised re-ranking currently offer the most practical value for virtual screening.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper benchmarks pose generation with DiffDock and AutoDock-GPU, rescoring with GNINA and NMDN, rank-based consensus, and supervised ML re-ranking on docking-derived features using the LIT-PCBA library (15 targets, 578,295 ligand-target pairs with experimental labels). GNINA rescoring of AutoDock-GPU poses is the strongest single method (median EF1% 2.14); DiffDock-based methods underperform, especially on targets like OPRK1; consensus improves robustness but does not exceed the top single scorer; supervised ML achieves the largest gain (median EF1% 4.49, +110% over AutoDock-GNINA). The central conclusion is that no single method dominates across targets and that classical+ML hybrids currently offer the most practical value, though overall early enrichment remains modest on this realistic benchmark.
Significance. If the ML results hold under proper validation, the work provides a valuable large-scale, experimentally grounded critique of recent AI docking tools and demonstrates that supervised re-ranking on classical docking features can deliver meaningful but still limited gains. The scale of the LIT-PCBA evaluation (hundreds of thousands of pairs) and direct use of experimental actives/inactives strengthen the assessment of practical utility; the finding that even the best hybrid yields only modest EF1% underscores the difficulty of early enrichment and could usefully inform virtual screening practice.
Major comments (2)
- [Methods] Methods (ML re-ranking subsection): The supervised models are trained on docking features from the 15 LIT-PCBA targets, but the manuscript does not describe the train/test split strategy (target hold-out vs. random per-ligand), hyperparameter selection, or feature preprocessing. With only 15 targets this is load-bearing for the headline claim of a generalizable +110% EF1% lift and for the conclusion that such hybrids offer practical value on new targets; without explicit cross-target validation, the observed gain could reflect fitting to target-specific pocket or chemotype distributions rather than transferable scoring signals.
- [Results] Results (median EF1% reporting and Table of per-target values): The median EF1% figures (2.14 for AutoDock-GNINA, 4.49 for ML) are presented without accompanying per-target distributions, inter-quartile ranges, or statistical significance tests (e.g., paired Wilcoxon across the 15 targets). This weakens the robustness claim that “no single docking method dominates across targets” and the assertion that ML delivers consistent gains, especially given the noted underperformance on challenging targets such as OPRK1.
Minor comments (2)
- [Abstract] Abstract: EF1% is used without a one-sentence definition or reference to its standard formula; while familiar to specialists, a brief parenthetical would improve accessibility for the broader readership.
- [Figures] Figure legends (consensus and ML panels): Axis labels and color keys for the EF1% boxplots or bar charts should explicitly state the number of targets (n=15) and whether medians are weighted or unweighted.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review, which has identified important areas for improving the clarity and rigor of our manuscript. We address each major comment below and will revise the paper accordingly to incorporate the requested details and analyses.
Point-by-point responses
Referee: [Methods] Methods (ML re-ranking subsection): The supervised models are trained on docking features from the 15 LIT-PCBA targets, but the manuscript does not describe the train/test split strategy (target hold-out vs. random per-ligand), hyperparameter selection, or feature preprocessing. With only 15 targets this is load-bearing for the headline claim of a generalizable +110% EF1% lift and for the conclusion that such hybrids offer practical value on new targets; without explicit cross-target validation, the observed gain could reflect fitting to target-specific pocket or chemotype distributions rather than transferable scoring signals.
Authors: We agree that the ML validation details are essential for supporting the generalizability claims. The original manuscript omitted a full description of these procedures. In the revised version, we will expand the Methods section to explicitly state that we used a leave-one-target-out cross-validation strategy (training on 14 targets and testing on the held-out target) to evaluate transferability across targets. Hyperparameters were tuned via grid search using an inner validation split within the training targets, and all features were z-score standardized using statistics computed solely from the training data for each fold. These additions will clarify that the reported EF1% gains are based on cross-target evaluation rather than within-target fitting. revision: yes
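The validation scheme the authors describe — leave-one-target-out with standardization statistics taken from the training fold only — can be sketched as below. The logistic model, hyperparameters, and all names are illustrative stand-ins; the manuscript's actual model and feature set are not specified here:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=500):
    """Tiny logistic-regression trainer (full-batch gradient descent),
    standing in for whatever supervised model is actually used."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def leave_one_target_out(X, y, target_ids):
    """Train on all targets but one, predict the held-out target.
    Features are z-scored with training-fold statistics only, so the
    held-out target leaks nothing into preprocessing."""
    preds = np.empty(len(y))
    for t in np.unique(target_ids):
        test = target_ids == t
        mu = X[~test].mean(axis=0)
        sd = X[~test].std(axis=0) + 1e-12      # guard against zero variance
        w, b = fit_logistic((X[~test] - mu) / sd, y[~test])
        preds[test] = 1.0 / (1.0 + np.exp(-((X[test] - mu) / sd @ w + b)))
    return preds

# Synthetic illustration: 3 targets x 20 ligands, 4 docking-derived features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] + 0.3 * rng.normal(size=60) > 0).astype(float)
targets = np.repeat([0, 1, 2], 20)
preds = leave_one_target_out(X, y, targets)    # one probability per ligand
```

The key design point is that every quantity derived from the data (means, standard deviations, model weights, any inner-loop hyperparameter choice) is computed inside each fold's training targets, so the held-out target's EF1% genuinely measures cross-target transfer.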
Referee: [Results] Results (median EF1% reporting and Table of per-target values): The median EF1% figures (2.14 for AutoDock-GNINA, 4.49 for ML) are presented without accompanying per-target distributions, inter-quartile ranges, or statistical significance tests (e.g., paired Wilcoxon across the 15 targets). This weakens the robustness claim that “no single docking method dominates across targets” and the assertion that ML delivers consistent gains, especially given the noted underperformance on challenging targets such as OPRK1.
Authors: We concur that additional statistical detail would strengthen the robustness of our conclusions. In the revision, we will add a supplementary table listing the per-target EF1% values for all methods, along with a figure showing boxplots of the distributions and inter-quartile ranges across the 15 targets. We will also compute and report paired Wilcoxon signed-rank tests (with p-values) comparing the methods across targets, including the significance of the ML improvement over AutoDock-GNINA. This will provide quantitative support for the statements that no single method dominates and that ML yields consistent gains despite variability on targets like OPRK1. revision: yes
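The proposed paired comparison across the 15 targets can be sketched with a normal-approximation Wilcoxon signed-rank test; in practice the authors would more likely use scipy.stats.wilcoxon, and the data below are synthetic, not the paper's per-target values:

```python
import numpy as np
from math import erf, sqrt

def paired_wilcoxon(a, b):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.
    Ties in |d| are not average-ranked in this sketch, and the
    approximation wants roughly 10+ pairs; adequate for 15 targets."""
    d = np.asarray(a, float) - np.asarray(b, float)
    d = d[d != 0.0]                              # drop zero differences
    n = len(d)
    ranks = np.abs(d).argsort().argsort() + 1.0  # ranks of |d|, 1..n
    w_plus = ranks[d > 0].sum()
    mu = n * (n + 1) / 4.0
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = abs(w_plus - mu) / sigma
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))
    return w_plus, p

# Synthetic per-target EF1% values for 15 targets (NOT the paper's data):
# here the "ML" method is strictly higher on every target.
rng = np.random.default_rng(1)
ef_gnina = rng.gamma(2.0, 1.2, size=15)
ef_ml = ef_gnina + rng.gamma(1.5, 1.0, size=15)
w, p = paired_wilcoxon(ef_ml, ef_gnina)
print(w, p)        # w = 120 (all 15 differences positive), p well below 0.05
```

Because the test operates on signed ranks of per-target differences, it directly addresses the referee's concern: a significant result requires the ML gain to be consistent across targets, not driven by a few large wins.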
Circularity Check
Empirical benchmarking on external experimental labels shows no circularity
Full rationale
The paper reports empirical enrichment factors (EF1%) from direct comparison of docking and ML rescoring outputs against experimentally confirmed actives/inactives in the LIT-PCBA library (15 targets, 578k pairs). No derivation chain, equations, or first-principles results exist that reduce to fitted parameters or self-referential definitions by construction. Supervised ML re-ranking is trained on docking-derived features and evaluated for performance gains, but these are standard held-out or cross-validated metrics on external labels rather than tautological predictions. Self-citations, if present, are not load-bearing; the central claims rest on independent experimental ground truth.
Reference graph
Works this paper leans on
[1] Pose generation: for each protein-ligand pair, DiffDock is used to sample candidate poses as the first stage of a two-stage protocol called DiffDock-NMDN, intended to improve DiffDock's screening enrichment [13]. In this first stage, DiffDock generates multiple candidate binding poses across the entire protein surface per ligand in a blind docking fashion.
[2] Rescoring with NMDN: the generated poses are then evaluated and ranked in the second stage using the Normalized Mixture Density Network (NMDN) score. NMDN is a distinct deep-learning-based scoring function that learns the probability density distribution of distances between protein residues and ligand atoms to estimate binding strength, producing a "pKd-like" score.
[3] Sunseri J, Koes DR. Virtual screening with Gnina 1.0. Molecules. 2021;26(23):7369. doi:10.3390/molecules26237369
[4] Morris CJ. MILCDock: Machine Learning-Enhanced Consensus Docking for Virtual Screening in Drug Discovery [senior thesis]. Provo (UT): Brigham Young University; 2022. Available from: https://physics.byu.edu/docs/thesis/1542
[5] da Silva MMP, Guedes IA, Custódio FL, Krempser E, Dardenne LE. Deep learning strategies for enhanced molecular docking and virtual screening. ChemRxiv [Preprint]. 2023 Nov 7. doi:10.26434/chemrxiv-2023-zfv87-v2
[6] Mysinger MM, Carchia M, Irwin JJ, Shoichet BK. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. J Med Chem. 2012;55(14):6582-6594. doi:10.1021/jm300687e
[7] Tran-Nguyen VK, Jacquemard C, Rognan D. LIT-PCBA: an unbiased data set for machine learning and virtual screening. J Chem Inf Model. 2020;60(9):4263-4273. doi:10.1021/acs.jcim.0c00155
[8] Huang A, Knight IS, Naprienko S. Data leakage and redundancy in the LIT-PCBA benchmark. arXiv [Preprint]. 2025 Jul 29; arXiv:2507.21404. Available from: https://arxiv.org/abs/2507.21404
[9] Truchon JF, Bayly CI. Evaluating virtual screening methods: good and bad metrics for the "early recognition" problem. J Chem Inf Model. 2007;47(2):488-508. doi:10.1021/ci600426e
[10] Solis-Vasquez L, Tillack AF, Santos-Martins D, Koch A, LeGrand S, Forli S. Benchmarking the performance of irregular computations in AutoDock-GPU molecular docking. Parallel Comput. 2022;109:102861. doi:10.1016/j.parco.2021.102861
[11] Darme P, Dauchez M, Renard A, Voutquenne-Nazabadioko L, Aubert D, Escotte-Binet S, et al. AMIDE v2: high-throughput screening based on AutoDock-GPU and improved workflow leading to better performance and reliability. Int J Mol Sci. 2021;22(14):7489. doi:10.3390/ijms22147489
[12] Santos-Martins D, Solis-Vasquez L, Tillack AF, Sanner MF, Koch A, Forli S. Accelerating AutoDock4 with GPUs and gradient-based local search. J Chem Theory Comput. 2021;17(2):1060-1073. doi:10.1021/acs.jctc.0c01006
[13] Corso G, Stärk H, Jing B, Barzilay R, Jaakkola T. DiffDock: diffusion steps, twists, and turns for molecular docking. arXiv [Preprint]. 2022 Oct 4; arXiv:2210.01776. Available from: https://arxiv.org/abs/2210.01776
[14] Jain AN, Cleves AE, Walters WP. Deep-learning based docking methods: fair comparisons to conventional docking workflows. arXiv [Preprint]. 2024 Dec 3; arXiv:2412.02889. Available from: https://arxiv.org/abs/2412.02889
[15] Xia S, Gu Y, Zhang Y. Normalized protein-ligand distance likelihood score for end-to-end blind docking and virtual screening. J Chem Inf Model. 2025;65(3):1101-1114. doi:10.1021/acs.jcim.4c01014
[16] Nahm FS. Receiver operating characteristic curve: overview and practical use for clinicians. Korean J Anesthesiol. 2022;75(1):25-36. doi:10.4097/kja.21209
[17] Tang S, Chen R, Lin M, Lin Q, Zhu Y, Ding J, et al. Accelerating AutoDock Vina with GPUs. Molecules. 2022;27(9):3041. doi:10.3390/molecules27093041
[18] Bizon-Tech. NVIDIA RTX 3090 vs NVIDIA A100 40 GB (PCIe): GPU benchmarks [Internet]. [cited 2025 Aug 27]. Available from: https://bizon-tech.com/gpu-benchmarks/NVIDIA-RTX-3090-vs-NVIDIA-A100-40-GB-(PCIe)/579vs592
[19] Yu Y, Cai C, Wang J, Bo Z, Zhu Z, Zheng H. Uni-Dock: GPU-accelerated docking enables ultralarge virtual screening. J Chem Theory Comput. 2023;19(11):3336-3345. doi:10.1021/acs.jctc.2c01145
[20] Zhang X, Shen C, Jiang D, et al. TB-IECS: an accurate machine learning-based scoring function for virtual screening. J Cheminform. 2023;15:63. doi:10.1186/s13321-023-00731-x
[21] Lacour A, Ibrahim H, Volkamer A, Hirsch AKH. DockM8: an all-in-one open-source platform for consensus virtual screening in drug design. ChemRxiv [Preprint]. 2024 Jul. doi:10.26434/chemrxiv-2024-17k46
[22] Isomorphic Labs Team. Accurate predictions of novel biomolecular interactions with IsoDDE. Zenodo [Report]. 2026 Feb 10. doi:10.5281/zenodo.18606681
[23] Abo-Dahab Y, Hernandez R, Duran ICA. Pharmacology knowledge graphs enable drug repurposing without chemical structure information. Discov Artif Intell. 2026. doi:10.1007/s44163-026-01303-2