arxiv: 2505.01700 · v3 · submitted 2025-05-03 · 💻 cs.LG · q-bio.QM

PoseX: AI Defeats Physics Approaches on Protein-Ligand Cross Docking

Pith reviewed 2026-05-22 16:44 UTC · model grok-4.3

classification 💻 cs.LG q-bio.QM
keywords protein-ligand docking · cross-docking · AI docking · benchmark · relaxation · binding pockets · chirality

The pith

AI docking methods outperform physics-based ones in realistic cross-docking tests on the new PoseX benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates PoseX, an open benchmark that measures how well different docking methods place ligands into proteins under both easy self-docking conditions and harder cross-docking conditions that better match real drug design. It assembles a fresh dataset, runs twenty-three methods from physics-based, AI docking, and AI co-folding families, adds a relaxation step to clean up poses, and publishes a live leaderboard. The main finding is that AI methods reach higher success rates than physics methods across the board. Relaxation removes most of the remaining clashes in AI poses, and telling the model where the pocket is improves results further, especially for co-folding approaches. These patterns matter because cross-docking tests whether a method can handle proteins whose shape was not seen with the ligand in question.

Core claim

AI approaches have consistently outperformed physics-based methods in overall docking success rate. Most intra- and intermolecular clashes of AI approaches can be greatly alleviated with relaxation, which means combining AI modeling with physics-based post-processing could achieve excellent performance. AI co-folding methods exhibit ligand chirality issues, except for Boltz-1x, which introduced physics-inspired potentials to fix hallucinations. Specifying binding pockets significantly promotes docking performance, indicating that pocket information can be leveraged adequately, particularly for AI co-folding methods.

What carries the argument

The PoseX benchmark, which supplies curated self-docking and cross-docking datasets, runs a panel of twenty-three methods, applies energy-minimizing relaxation, and ranks results on a public leaderboard.

If this is right

  • Combining AI-generated poses with a physics-based relaxation step produces high-quality results with fewer clashes.
  • Providing explicit pocket information improves accuracy for AI co-folding models in particular.
  • Physics-inspired corrections for stereochemistry can eliminate chirality hallucinations in co-folding outputs.
  • Future method development should prioritize cross-docking evaluation over self-docking because it better reflects practical use.
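
The chirality issue raised in the list above can be illustrated with a simple geometric test. This is a minimal sketch, not the paper's procedure: it assumes a stereocenter and three of its neighbors are given in the same atom order for the reference and predicted poses, and it uses the sign of the scalar triple product as a handedness indicator. The names `signed_volume` and `chirality_flipped` are hypothetical.

```python
def signed_volume(center, n1, n2, n3):
    """Scalar triple product a . (b x c) of the three center-to-neighbor vectors.

    Its sign changes under mirror reflection but not under rotation or
    translation, so it can serve as a handedness indicator.
    """
    a = [n1[i] - center[i] for i in range(3)]
    b = [n2[i] - center[i] for i in range(3)]
    c = [n3[i] - center[i] for i in range(3)]
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


def chirality_flipped(ref, pred):
    """Return True if the predicted stereocenter has the opposite handedness.

    ref and pred are (center, n1, n2, n3) tuples of (x, y, z) coordinates,
    with neighbors listed in the same order in both (an assumption).
    """
    return signed_volume(*ref) * signed_volume(*pred) < 0


# Hypothetical toy coordinates, not data from the benchmark.
reference = ((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1))
mirrored = ((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, -1))
rotated = ((0, 0, 0), (0, 1, 0), (-1, 0, 0), (0, 0, 1))

print(chirality_flipped(reference, mirrored))  # True
print(chirality_flipped(reference, rotated))   # False
```

A production check would more likely compare stereo labels assigned by a cheminformatics toolkit; the geometric version is shown only because it needs no dependencies.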

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Drug-design workflows could adopt AI docking first and then apply quick relaxation, shifting away from pure physics pipelines.
  • The same benchmark approach could be repeated on larger or more diverse protein families to test whether the AI advantage persists.
  • Models that already handle pocket specification well may become the default starting point when only sequence and ligand identity are known.

Load-bearing premise

The chosen success metrics, RMSD thresholds, and rules for selecting cross-docking pairs and defining pockets produce an unbiased comparison that holds for other datasets and methods.

What would settle it

A new test set of cross-docking pairs chosen by different rules, or evaluated with a stricter RMSD cutoff or different pocket definition, on which the physics-based methods exceed the success rates of the AI methods.

Figures

Figures reproduced from arXiv: 2505.01700 by Ayush Pandit, Fang Wu, Guang Yang, Jin Han, Junhong Liu, Mengdi Wang, Mengyang Wang, Minjie Shen, Tianfan Fu, Wu-Jun Li, Xinze Li, Yejin Choi, Yize Jiang, Youjun Xu, Yuanyuan Zhang, Zaixi Zhang.

Figure 1: Illustration of two docking setups: (1) self-docking vs. (2) cross-docking.
Figure 2: Performance on PoseX-SD and PoseX-CD. Mean values of three independent runs are ...
Figure 3: Performance of each model on the PoseX-SD and PoseX-CD datasets. Results are split ...
Figure 4: Cross-docking performance difference on “similar” and “dissimilar” binding pockets.
Original abstract

Existing protein-ligand docking studies typically focus on the self-docking scenario, which is less practical in real applications. Moreover, some studies involve heavy frameworks requiring extensive training, posing challenges for convenient and efficient assessment of docking methods. To fill these gaps, we design PoseX, an open-source benchmark to evaluate both self-docking and cross-docking, enabling a practical and comprehensive assessment of algorithmic advances. Specifically, first, we curated a novel dataset comprising 718 entries for self-docking and 1,312 entries for cross-docking; second, we incorporated 23 docking methods in three methodological categories, including physics-based methods (e.g., Schrödinger Glide), AI docking methods (e.g., DiffDock) and AI co-folding methods (e.g., AlphaFold3); third, we developed a relaxation method for post-processing to minimize conformational energy and refine binding poses; fourth, we built a leaderboard to rank submitted models in real-time. We derived some key insights and conclusions from extensive experiments: (1) AI approaches have consistently outperformed physics-based methods in overall docking success rate. (2) Most intra- and intermolecular clashes of AI approaches can be greatly alleviated with relaxation, which means combining AI modeling with physics-based post-processing could achieve excellent performance. (3) AI co-folding methods exhibit ligand chirality issues, except for Boltz-1x, which introduced physics-inspired potentials to fix hallucinations, suggesting modeling on stereochemistry improves the structural plausibility markedly. (4) Specifying binding pockets significantly promotes docking performance, indicating that pocket information can be leveraged adequately, particularly for AI co-folding methods, in future modeling efforts. The code, dataset, and leaderboard are released at https://github.com/CataAI/PoseX.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PoseX, an open-source benchmark for protein-ligand docking that covers both self-docking (718 entries) and cross-docking (1,312 entries). It evaluates 23 methods spanning physics-based approaches (e.g., Schrödinger Glide), AI docking methods (e.g., DiffDock), and AI co-folding methods (e.g., AlphaFold3), introduces a custom relaxation post-processing step to reduce clashes, and provides a real-time leaderboard. The main conclusions are that AI methods consistently outperform physics-based ones in success rate, relaxation markedly improves AI poses, most AI co-folding methods exhibit ligand chirality issues (except Boltz-1x), and specifying binding pockets substantially boosts performance, especially for co-folding approaches.

Significance. If the performance differences prove robust under identical protocols, PoseX could serve as a practical standard for docking evaluation, emphasizing cross-docking relevance to real-world applications and the value of hybrid AI-plus-physics pipelines. The public release of the curated dataset, code, and leaderboard supports reproducibility and community benchmarking, which are strengths in this empirical domain.

major comments (3)
  1. [Abstract and §4 (Results)] The headline claim that 'AI approaches have consistently outperformed physics-based methods in overall docking success rate' depends on uniform experimental conditions. The abstract states that specifying pockets 'significantly promotes' docking performance, particularly for AI co-folding methods, and that clashes in AI poses 'can be greatly alleviated with relaxation'; it is therefore necessary to confirm that physics-based methods (e.g., Glide) received identical pocket definitions, search radii, flexibility settings, and equivalent post-processing, otherwise the reported gap may reflect protocol differences rather than intrinsic method quality.
  2. [§3 (Dataset Curation)] The 1,312 cross-docking entries are central to the generalization claim. Details on sequence-identity cutoffs, minimum conformational RMSD between paired structures, and the precise rule used to enforce 'cross' (as opposed to self) docking must be provided, together with any statistical controls for pocket similarity or ligand diversity, to rule out curation bias that could favor one methodological category.
  3. [§4.2 (Evaluation Protocol)] The exact success definition (RMSD threshold, whether top-1 or top-N poses are considered, and handling of chirality or clash penalties) is not stated in the abstract and must be explicitly reported for all 23 methods. Without this, it is impossible to assess whether the outperformance result is sensitive to the chosen metric or to potential data overlap between training sets of the AI methods and the benchmark ligands.
minor comments (2)
  1. [Abstract] The phrase 'extensive experiments' could be accompanied by a brief mention of the RMSD cutoff used for success to allow readers to immediately gauge the strength of the reported rates.
  2. [Figure 1 or equivalent] The schematic of the relaxation procedure would benefit from an explicit before/after clash count or energy metric to illustrate the magnitude of improvement for AI methods.
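
The before/after comparison requested in the last comment could be reported with a count as simple as the one sketched here. This is an illustration under stated assumptions, not the benchmark's clash definition: a clash is taken to be any ligand-protein heavy-atom pair closer than a hypothetical 2.0 Å cutoff, and `count_clashes` is an illustrative name.

```python
import math


def count_clashes(ligand, protein, cutoff=2.0):
    """Count ligand-protein heavy-atom pairs closer than `cutoff` (in Å).

    ligand and protein are lists of (x, y, z) tuples. The 2.0 Å default is
    an assumed value chosen for illustration only.
    """
    return sum(1 for a in ligand for b in protein if math.dist(a, b) < cutoff)


# Hypothetical toy coordinates, not data from the benchmark.
protein = [(1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
pose_before = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]   # overlaps the first protein atom
pose_after = [(-3.0, 0.0, 0.0), (-1.5, 0.0, 0.0)]  # same ligand shifted away

print(count_clashes(pose_before, protein))  # 2
print(count_clashes(pose_after, protein))   # 0
```

Reporting this count for each method before and after relaxation would make the size of the improvement directly comparable across method families.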

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments below and indicate the revisions we will make to improve clarity and completeness.

  1. Referee: [Abstract and §4 (Results)] The headline claim that 'AI approaches have consistently outperformed physics-based methods in overall docking success rate' depends on uniform experimental conditions. The abstract states that specifying pockets 'significantly promotes' docking performance, particularly for AI co-folding methods, and that clashes in AI poses 'can be greatly alleviated with relaxation'; it is therefore necessary to confirm that physics-based methods (e.g., Glide) received identical pocket definitions, search radii, flexibility settings, and equivalent post-processing, otherwise the reported gap may reflect protocol differences rather than intrinsic method quality.

    Authors: We confirm that the experimental conditions were uniform across all methods. Physics-based methods such as Schrödinger Glide were provided with the same pocket definitions, search radii, and flexibility settings as the AI-based methods. The relaxation post-processing was applied consistently to refine poses from all approaches. We will revise the manuscript to explicitly document these identical protocols in §4, ensuring that the performance differences are attributable to the methods themselves rather than variations in setup. revision: yes

  2. Referee: [§3 (Dataset Curation)] The 1,312 cross-docking entries are central to the generalization claim. Details on sequence-identity cutoffs, minimum conformational RMSD between paired structures, and the precise rule used to enforce 'cross' (as opposed to self) docking must be provided, together with any statistical controls for pocket similarity or ligand diversity, to rule out curation bias that could favor one methodological category.

    Authors: We agree that these details are important for reproducibility and to address potential bias concerns. We will expand §3 to include the specific sequence-identity cutoffs used, the minimum conformational RMSD threshold for selecting cross-docking pairs, the exact criteria for distinguishing cross-docking from self-docking, and the statistical analyses performed on pocket similarity and ligand diversity. revision: yes

  3. Referee: [§4.2 (Evaluation Protocol)] The exact success definition (RMSD threshold, whether top-1 or top-N poses are considered, and handling of chirality or clash penalties) is not stated in the abstract and must be explicitly reported for all 23 methods. Without this, it is impossible to assess whether the outperformance result is sensitive to the chosen metric or to potential data overlap between training sets of the AI methods and the benchmark ligands.

    Authors: The success criterion is defined in §4.2 as the fraction of top-1 poses with heavy-atom RMSD less than 2.0 Å to the crystal structure reference. Chirality issues are separately analyzed and reported for the co-folding methods, as highlighted in our results. Clash penalties are addressed through the relaxation procedure but are not part of the primary RMSD-based success metric. We will add a clear statement of this definition to the abstract and ensure it is reiterated for all methods in §4.2. On the matter of training set overlap, we note that the benchmark was designed with recent structures where possible, but a comprehensive audit of all AI models' training data is beyond the scope of this work; we will add a discussion of this limitation. revision: partial
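
The success criterion described in the response above (fraction of top-1 poses with heavy-atom RMSD below 2.0 Å) can be sketched as follows. This is a minimal illustration, not the benchmark's code: it assumes predicted and reference heavy atoms are listed in the same order and already share a coordinate frame, and it ignores molecular symmetry, which a real evaluation would usually correct for. The function names are hypothetical.

```python
import math


def heavy_atom_rmsd(pred, ref):
    """RMSD (Å) between two equally ordered lists of (x, y, z) heavy-atom coordinates."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for a, b in zip(pred, ref) for p, r in zip(a, b))
    return math.sqrt(squared / len(pred))


def success_rate(top1_rmsds, threshold=2.0):
    """Fraction of entries whose top-1 pose has RMSD strictly below `threshold` (Å)."""
    if not top1_rmsds:
        raise ValueError("no RMSD values given")
    return sum(r < threshold for r in top1_rmsds) / len(top1_rmsds)


# Hypothetical toy values, not results from the paper.
rmsds = [0.8, 1.9, 2.0, 3.5]
print(success_rate(rmsds))                 # 0.5 (2.0 itself does not count)
print(success_rate(rmsds, threshold=1.0))  # 0.25 under a stricter cutoff
```

Exposing the threshold as a parameter makes it straightforward to check how sensitive a ranking is to the cutoff, which is the concern behind the referee's comment.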

Circularity Check

0 steps flagged

No significant circularity: purely empirical benchmark with direct measurements

The paper curates fixed datasets (718 self-docking and 1,312 cross-docking entries), runs 23 pre-existing methods from three categories, applies a post-processing relaxation step, and reports empirical success rates using standard RMSD thresholds. It contains no derivation chain, equations, first-principles results, or fitted parameters whose outputs are claimed as independent predictions. The key insights (AI outperforming physics-based methods, benefits of relaxation and pocket specification) are direct observations from the benchmark runs, not reductions to author-defined quantities. No self-citation carries the weight of a uniqueness theorem or ansatz. The code and data release further supports external verification. This is self-contained empirical work with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard domain assumptions about docking evaluation rather than new free parameters or invented entities.

axioms (2)
  • domain assumption: Docking success is correctly measured by RMSD thresholds to native poses.
    Used to compute the reported success rates for all 23 methods.
  • domain assumption: The curated cross-docking pairs and pocket definitions do not introduce selection bias favoring AI methods.
    Central to the claim that AI outperforms physics-based approaches.

pith-pipeline@v0.9.0 · 5912 in / 1391 out tokens · 46111 ms · 2026-05-22T16:44:35.715012+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. D-Flow: Multi-modality Flow Matching for D-peptide Design

    cs.CE · 2024-11 · unverdicted · novelty 6.0

    D-Flow applies multi-modality flow matching and a mirror-image data augmentation to generate D-peptides with 10.2% higher sequence identity and 24.31% top affinity on the PepMerge benchmark.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 1 Pith paper

  1. [1]

    Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi Jaakkola

    Accessed: 2025-04-12. Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi Jaakkola. DiffDock: Diffusion steps, twists, and turns for molecular docking. arXiv preprint arXiv:2210.01776, 2022. Gabriele Corso, Arthur Deng, Benjamin Fry, Nicholas Polizzi, Regina Barzilay, and Tommi Jaakkola. Deep confident steps to new pockets: Strategies for docking generalizat...

  2. [2]

    Use PrepWizard to preprocess the protein files by adding hydrogens and optimizing with the OPLS3 force field at pH 7.4

  3. [3]

    Use Epik to predict the pKa and protonation states of small molecules at pH 7.0

    Use LigPrep to preprocess small molecules, preserving the chirality of the input ligand. Use Epik to predict the pKa and protonation states of small molecules at pH 7.0. Optimize the small-molecule conformations using the S-OPLS force field, and output one small-molecule conformation as the input for docking

  4. [4]

    Generate a grid file

    Define the INNERBOX dimensions as 10×10×10 Å, and the OUTERBOX dimensions as: $(\mathrm{Size}_x, \mathrm{Size}_y, \mathrm{Size}_z)^\top = (x_{\max} - x_{\min} + 20,\; y_{\max} - y_{\min} + 20,\; z_{\max} - z_{\min} + 20)^\top$. The force field is set to OPLS3, and all other parameters are set by default. Generate a grid file

  5. [5]

    Runtime Environment: Run on an Intel i9-10920X CPU using 16 cores

    Perform molecular docking using Glide SP (Standard Precision), and output one small molecule pose as the docking result. Runtime Environment: Run on an Intel i9-10920X CPU using 16 cores. B.1.2 Discovery Studio Discovery Studio (Pawar & Rohane, 2021), developed by Dassault Systèmes BIOVIA, is a comprehensive life sciences research platform that covers mo...

  6. [6]

    The protein was protonated at pH 7.4 with a solvent ionic strength of 0.145 M

    Use the Proteins Preparation components in Discovery Studio to process the protein files. The protein was protonated at pH 7.4 with a solvent ionic strength of 0.145 M. Minimization was performed using the CHARMm force field to optimize the protein structure, and all other parameters are set by default

  7. [7]

    Enumerate ionization states for each ligand within a pH range of 6.5-8.5

    Use the Ligands Preparation components in Discovery Studio to process the ligand files. Enumerate ionization states for each ligand within a pH range of 6.5-8.5. Enumerate tautomeric forms for each ligand with a maximum of 10 tautomers per ligand. Fix the bad valencies by adjusting formal charges, and all other parameters are set by default

  8. [8]

    Dock the prepared proteins and the corresponding prepared ligands using the CDOCKER components in Discovery Studio. The docking site was centered at: $(x_c, y_c, z_c)^\top = \left(\frac{x_{\max} + x_{\min}}{2},\; \frac{y_{\max} + y_{\min}}{2},\; \frac{z_{\max} + z_{\min}}{2}\right)^\top$. Define the binding sphere radius as: $R = \max\{(x_{\max} - x_{\min}),\, (y_{\max} - y_{\min}),\, (z_{\max} - z_{\min})\} + 20$. The docking simulations were performed using the CHARMm force...

  9. [9]

    An SVL script automates the docking pipeline

  10. [10]

    The StructurePreparation function is employed to preprocess protein structures

  11. [11]

    The binding site is defined by reference ligands

  12. [12]

    The Triangle Matcher algorithm is utilized to generate initial ligand poses

  13. [13]

    The scoring function is configured as London dG, with a maximum of 30 poses generated

  14. [14]

    Runtime Environment: Run on an AMD EPYC 9554 CPU

    Poses are refined using a fixed receptor, optimizing only the ligand’s position and conformation, with the re-scoring function configured as GBVI/WSA dG and a maximum of 5 poses retained. Runtime Environment: Run on an AMD EPYC 9554 CPU. B.1.4 AutoDock Vina AutoDock Vina (Eberhardt et al., 2021) is one of the fastest and most widely used open-source molecul...

  15. [15]

    Use Reduce to add polar hydrogens to the protein structure

  16. [16]

    Use OpenBabel to add non-polar hydrogens and normalize atom names, exporting the protein in a format recognizable by MGLTools

  17. [17]

    Use the receptor_prepare4.py script from MGLTools to convert the hydrogen-added protein PDB file into a PDBQT file

  18. [18]

    Use OpenBabel to add hydrogens to the ligand molecule at pH 7.4

  19. [19]

    Use the mk_prepare_ligand.py script from Meeko to convert the hydrogen-added ligand SDF file into a PDBQT file

  20. [20]

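    Define the docking box center and size as follows: $(x_c, y_c, z_c)^\top = \left(\frac{x_{\max} + x_{\min}}{2},\; \frac{y_{\max} + y_{\min}}{2},\; \frac{z_{\max} + z_{\min}}{2}\right)^\top$ and $(\mathrm{Size}_x, \mathrm{Size}_y, \mathrm{Size}_z)^\top = (x_{\max} - x_{\min} + 20,\; y_{\max} - y_{\min} + 20,\; z_{\max} - z_{\min} + 20)^\top$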

  21. [21]

    Perform molecular docking using the prepared protein and ligand PDBQT files

  22. [22]

    Runtime Environment: Run on an AMD EPYC 9554 CPU, with no specified core limit and up to 256 cores available

    Usevina_split to split the output file, extract the best-scored pose for each ligand, and convert the resulting PDBQT file into an SDF file using Meeko for the final output. Runtime Environment: Run on an AMD EPYC 9554 CPU, with no specified core limit and up to 256 cores available. B.1.5 GNINA GNINA (McNutt et al., 2021; 2025) is a relatively new project t...
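
The docking box definition in the AutoDock Vina steps listed above (center at the midpoint of the coordinate extremes, size equal to the extent plus 20 per axis) can be sketched in a few lines. This is an illustration under stated assumptions: the extracted steps do not say which atoms the extremes are taken over, so the sketch assumes a reference ligand's coordinates, and it assumes the padding is in Å. The name `docking_box` is hypothetical.

```python
def docking_box(coords, padding=20.0):
    """Return (center, size) of an axis-aligned box around `coords`.

    coords: list of (x, y, z) tuples, assumed here to be reference-ligand atoms.
    center: midpoint of the minimum and maximum along each axis.
    size:   extent along each axis plus `padding`.
    """
    if not coords:
        raise ValueError("at least one coordinate is required")
    axes = list(zip(*coords))  # three tuples: all x, all y, all z values
    center = tuple((max(v) + min(v)) / 2 for v in axes)
    size = tuple(max(v) - min(v) + padding for v in axes)
    return center, size


# Hypothetical toy coordinates, not data from the benchmark.
ligand = [(0.0, 0.0, 0.0), (4.0, 2.0, 6.0), (1.0, 1.0, 1.0)]
print(docking_box(ligand))  # ((2.0, 1.0, 3.0), (24.0, 22.0, 26.0))
```

The returned center and size correspond to the six box values a grid-based docking program typically expects as input.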