pith. sign in

arxiv: 2605.04118 · v2 · pith:YXRUPWHInew · submitted 2026-05-05 · 🧬 q-bio.QM · cs.AI

ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation

Pith reviewed 2026-05-25 06:45 UTC · model grok-4.3

classification 🧬 q-bio.QM cs.AI
keywords protein binder designbenchmarkevaluation protocolstructure predictiongenerative methodswet-lab validationthroughput metricsstructural diversity
0
0 comments X

The pith

ProtDBench standardizes binder design evaluation and shows that structure prediction verifiers disagree on the same designs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ProtDBench as a fixed set of tasks, protocols, and success rules to make protein binder design results comparable across studies. It first uses a large set of wet-lab annotated sequences to test how different structure prediction models perform as verifiers, finding that they produce different success calls even when the filtering rules stay the same. It then runs several open-source generative methods on ten protein targets under one shared protocol that includes a 24-hour compute budget and cluster-based diversity checks. The resulting data show that rankings of methods shift when success is measured by per-sequence rate versus throughput or diversity.

Core claim

ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria for protein binder design. Using wet-lab annotated data, it reveals substantial verifier-dependent bias and limited agreement among structure prediction models under identical filtering. Benchmarking representative generative methods across ten targets under a fixed protocol that includes throughput-aware metrics and cluster-level criteria exposes systematic differences induced by filtering rules, success definitions, and evaluation settings.

What carries the argument

ProtDBench, the evaluation framework that applies a fixed 24-hour budget, cluster-level success criteria, and consistent verifier protocols across methods and targets.

If this is right

  • The same set of designed sequences receives different success labels depending on which structure prediction model is used as verifier.
  • Throughput-aware scoring under a fixed time budget produces different efficiency rankings than per-sequence success rates alone.
  • Cluster-level criteria that penalize redundant structures change which methods appear to succeed compared with sequence-level counts.
  • A shared protocol makes it possible to attribute performance gaps to method differences rather than evaluation choices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adopting one shared benchmark could reduce the number of designs that later fail experimental validation.
  • Methods that balance high success under multiple verifiers with good throughput may prove more practical than those optimized for a single metric.
  • The observed verifier disagreement suggests that combining several structure predictors could give more stable estimates of design quality.
  • The framework could be extended to additional targets or to closed-source methods to test whether current rankings hold.

Load-bearing premise

The wet-lab annotations are treated as accurate ground truth for measuring verifier bias and method performance on the chosen targets.

What would settle it

A new independent wet-lab screen on the same ten targets that assigns different binding labels to the evaluated sequences would alter both the measured verifier disagreements and the method rankings.

Figures

Figures reproduced from arXiv: 2605.04118 by Chengyue Gong, Cong Liu, Jiaqi Guan, Jinyuan Sun, Milong Ren, Wenzhi Xiao, Xinshi Chen.

Figure 1
Figure 1. Figure 1: Benchmarking structure prediction models as filters on the Cao dataset. (a) Top-1% success rates achieved by individual confidence metrics (e.g., ipTM, pTM, pLDDT, ipAE) derived from AF2-IG, Boltz-1, Boltz-2, Chai-1, ColabFold, Protenix, and Protenix￾Mini. (b) Success rates of combined filtering strategies across eight targets. “AF3 (published)” denotes baseline results from prior work, while other bars co… view at source ↗
Figure 2
Figure 2. Figure 2: Binder design benchmarks: (a) Success rates under AF2-IG-Easy; (b) Structural consistency across targets. Enlarged view across columns for better granularity. defined as the fraction of sequences whose predicted struc￾tures recapitulate the generated backbone: CR = 1 P i |Si | X i X s∈Si C(s). (8) 4.3. Main Findings Generation throughput varies substantially across gen￾erative paradigms view at source ↗
Figure 3
Figure 3. Figure 3: Binder structural diversity benchmark. For each target, bars show the diversity-adjusted cluster pass rate of each model after AF2-IG-Easy filtering. Clustering is performed only on backbones that pass the filter, and the reported percentage corresponds to the number of unique structural clusters at TM-score thresholds 0.6, 0.8, and 1.0, normalized by the total number of designed backbones. Different color… view at source ↗
Figure 4
Figure 4. Figure 4: Protenix-Mini sequence-level success rates. Fraction of generated binder sequences passing the Protenix-Mini filter for each target. Absolute success rates are lower than AF2-IG-Easy due to the conservative confidence scores of Protenix-Mini view at source ↗
Figure 5
Figure 5. Figure 5: Protenix-Mini cluster pass rates. Diversity-adjusted success rate computed as the number of unique structural clusters among passed designs (clustered at TM-score thresholds 0.6/0.8/1.0) divided by the total number of generated backbones. This metric captures both filter success and diversity among successful designs. Protenix-Mini filter for each target view at source ↗
Figure 6
Figure 6. Figure 6: Performance of binders with different secondary structures designed by various methods on AlphaFold2- and Protenix-based metrics view at source ↗
Figure 7
Figure 7. Figure 7: Alpha-helix ratio on different targets across various methods. 15 view at source ↗
Figure 7
Figure 7. Figure 7: Performance of binders with different secondary structures designed by various methods on AlphaFold2- and Protenix-based metrics [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Reference ratio of gyration radius on different targets across various methods view at source ↗
Figure 9
Figure 9. Figure 9: AlphaFold2 initial guess interface pAE on different targets across various methods. 16 view at source ↗
Figure 9
Figure 9. Figure 9: Reference ratio of gyration radius on different targets across various methods [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: AlphaFold2 initial guess pLDDT on different targets across various methods view at source ↗
Figure 11
Figure 11. Figure 11: Bound unbound RMSD on different targets across various methods. 17 view at source ↗
Figure 11
Figure 11. Figure 11: AlphaFold2 initial guess pLDDT on different targets across various methods [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: AlphaFold2 initial guess pTM on different targets across various methods. Since BindCraft integrates hallucination and evaluation in a single pipeline, we removed evaluation time from our mea￾surement to enable a fairer comparison. Specifically, we measured the time consumed by the binder hallucination function (https://github.com/martinpacesa/BindCraft/blob/main/bindcraft.py, Lines 109- 111). Within this… view at source ↗
Figure 13
Figure 13. Figure 13: The performance of the AF2 confidence score filters. The SR for each confidence combination is plotted as one gray dot. The pareto frontier filters are highlighted as red stars, and the selected one is marked as a black star. For each threshold combination, we compute the success rate (SR) on each target. Following the definition of Pareto Frontier, the search algorithm can come to a set of optimal points… view at source ↗
Figure 14
Figure 14. Figure 14: AUC and Average Precision scores for individual confidence metrics on subsampled Cao data. (a) Higher values indicate better global discrimination between binders and non-binders. (b) Similar to AUC, but more sensitive to top-ranking false positives. We report AUC and average precision scores for individual confidence metrics across diverse design targets ( view at source ↗
Figure 14
Figure 14. Figure 14: The performance of the AF2 confidence score filters. The SR for each confidence combination is plotted as one gray dot. The pareto frontier filters are highlighted as red stars, and the selected one is marked as a black star. For each threshold combination, we compute the success rate (SR) on each target. Following the definition of Pareto Frontier, the search algorithm can come to a set of optimal points… view at source ↗
Figure 15
Figure 15. Figure 15: Per-target agreement curves under Top-1% ipTM filtering. For each target, recall (%) is plotted against the agreement threshold k (a true positive must be identified by at least k verifiers simultaneously). Annotations on each curve indicate the number of true positives that satisfy the agreement threshold. Recall decreases rapidly as k increases, indicating limited consensus among verifiers and highlight… view at source ↗
Figure 16
Figure 16. Figure 16: AUC and Average Precision scores for individual confidence metrics on the nine subsampled Cao targets. Same target set and verifier set as Figure 1a. (a) ROC-AUC: higher values indicate better global discrimination between binders and non-binders, integrated across all decision thresholds. (b) Average Precision (area under the PR curve): more sensitive than AUC to top-ranking false positives, and therefor… view at source ↗
read the original abstract

Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to non-standardized evaluation protocols. We introduce ProtDBench, a standardized and throughput-aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet-lab annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier-dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open-source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. Beyond per-sequence success rates, ProtDBench incorporates throughput-aware metrics based on a fixed 24-hour budget, as well as cluster-level success criteria to account for structural diversity. Together, these results expose systematic differences induced by filtering rules, success definitions, and throughput-aware evaluation between computational efficiency, success rate, and structural diversity. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ProtDBench, a standardized evaluation framework for protein binder design that defines unified benchmark tasks, protocols, success criteria, and throughput-aware metrics (including a fixed 24-hour budget and cluster-level diversity). It uses a large wet-lab annotated dataset to analyze structure prediction models as verifiers, reporting substantial verifier-dependent bias and limited agreement under identical filtering, then benchmarks representative open-source generative methods across ten diverse targets under a fixed protocol.

Significance. If the wet-lab annotations prove reliable and representative, ProtDBench would provide a needed reproducible pipeline for fair comparison of binder design methods, exposing how filtering rules and success definitions affect rankings of computational efficiency, success rate, and structural diversity. The empirical focus on external wet-lab data and inclusion of throughput metrics are strengths for practical utility.

major comments (2)
  1. [Abstract / verifier analysis section] Abstract and the section on verifier analysis: the central claim of 'substantial verifier-dependent bias and limited agreement' is stated without any quantitative details on dataset size, exact filtering rules, agreement metrics (e.g., percentage overlap or statistical tests), or how the wet-lab annotations were obtained and validated. This directly undermines assessment of the bias finding.
  2. [Wet-lab dataset / benchmarking results section] Section describing the wet-lab dataset and its use for method benchmarking: the dataset is treated as ground truth both for quantifying verifier bias and for ranking generative methods on ten targets, yet no validation of annotation accuracy, no distributional comparison (sequence, fold, or binding regime) to the benchmark targets, and no discussion of potential systematic assay errors are provided. This is load-bearing for both the bias results and the reported method rankings.
minor comments (2)
  1. [Methods / evaluation protocol] Clarify the exact implementation of the 24-hour throughput budget and cluster-level success criteria, including any pseudocode or parameter values used.
  2. [Benchmark targets description] Add explicit statements on how the ten target proteins were selected and whether they overlap with or are representative of the wet-lab dataset proteins.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the clarity and transparency of our claims. We address each major point below and have made revisions to incorporate additional details and discussion.

read point-by-point responses
  1. Referee: [Abstract / verifier analysis section] Abstract and the section on verifier analysis: the central claim of 'substantial verifier-dependent bias and limited agreement' is stated without any quantitative details on dataset size, exact filtering rules, agreement metrics (e.g., percentage overlap or statistical tests), or how the wet-lab annotations were obtained and validated. This directly undermines assessment of the bias finding.

    Authors: We agree that the abstract and verifier analysis section would benefit from explicit quantitative support for the central claim. In the revised manuscript we have added the size of the wet-lab annotated dataset, the precise filtering rules applied, agreement metrics (including percentage overlap and appropriate statistical tests), and a description of how the annotations were sourced and validated from the original experimental studies. These additions directly address the concern and allow readers to better evaluate the reported verifier-dependent bias. revision: yes

  2. Referee: [Wet-lab dataset / benchmarking results section] Section describing the wet-lab dataset and its use for method benchmarking: the dataset is treated as ground truth both for quantifying verifier bias and for ranking generative methods on ten targets, yet no validation of annotation accuracy, no distributional comparison (sequence, fold, or binding regime) to the benchmark targets, and no discussion of potential systematic assay errors are provided. This is load-bearing for both the bias results and the reported method rankings.

    Authors: We acknowledge that treating the wet-lab annotations as ground truth requires supporting context. In the revision we have expanded the dataset description to include (i) a summary of annotation validation procedures reported in the source studies, (ii) distributional comparisons of sequence identity, fold classes, and binding regimes between the annotated dataset and the ten benchmark targets, and (iii) an explicit discussion of potential systematic assay errors together with the limitations these impose on interpretation. These additions provide the necessary caveats for both the bias analysis and the method rankings. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark against external wet-lab data

full rationale

The paper introduces ProtDBench as a standardized evaluation framework and performs empirical comparisons of structure prediction models and generative methods using a large external wet-lab annotated dataset as ground truth. No derivation chain, equations, fitted parameters renamed as predictions, or self-citations are present in the provided text that reduce any claim to its own inputs by construction. The work is self-contained as a benchmark study; assumptions about the wet-lab data's reliability are correctness concerns, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claims rest on the representativeness of the wet-lab dataset and the assumption that the chosen ten targets and fixed protocol capture realistic evaluation conditions; no free parameters or invented physical entities are introduced.

axioms (2)
  • domain assumption Wet-lab annotated dataset serves as reliable ground truth for in silico verifier evaluation
    Invoked when using the dataset to quantify bias and agreement among structure prediction models
  • domain assumption Fixed 24-hour budget and cluster-level criteria are appropriate proxies for practical utility
    Used to define throughput-aware and diversity-aware success
invented entities (1)
  • ProtDBench benchmark framework no independent evidence
    purpose: Unified evaluation of binder design methods
    Newly defined tasks, protocols, and metrics introduced in the paper

pith-pipeline@v0.9.0 · 5749 in / 1398 out tokens · 21709 ms · 2026-05-25T06:45:06.296459+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

  1. [1]

    Geoflow-v2: A unified atomic diffusion model for protein structure prediction and de novo de- sign.bioRxiv, pp

    BioGeometry. Geoflow-v2: A unified atomic diffusion model for protein structure prediction and de novo de- sign.bioRxiv, pp. 2025–05,

  2. [2]

    Butcher, J. K. V ., Krishna, R., Mitra, R., Brent, R. I., Li, Y ., Corley, N., Kim, P., Funk, J., Mathis, S. V ., Salike, S., et al. De novo design of all-atom biomolecular interac- tions with rfdiffusion3.bioRxiv, pp. 2025–09,

  3. [3]

    Chai- 1: Decoding the molecular interactions of life.BioRxiv, pp

    ChaiDiscovery, Boitreaud, J., Dent, J., McPartlon, M., Meier, J., Reis, V ., Rogozhonikov, A., and Wu, K. Chai- 1: Decoding the molecular interactions of life.BioRxiv, pp. 2024–10,

  4. [4]

    Zero-shot antibody design in a 24-well plate.bioRxiv, pp

    ChaiDiscovery, Boitreaud, J., Dent, J., Geisz, D., McPart- lon, M., Meier, J., Qiao, Z., Rogozhnikov, A., Rollins, N., Wollenhaupt, P., et al. Zero-shot antibody design in a 24-well plate.bioRxiv, pp. 2025–07,

  5. [5]

    Cho, Y ., Pacesa, M., Zhang, Z., Correia, B

    doi: 10.1101/2025.01.08.631967. Cho, Y ., Pacesa, M., Zhang, Z., Correia, B. E., and Ovchin- nikov, S. Boltzdesign1: Inverting all-atom structure pre- diction model for generalized biomolecular binder de- sign.bioRxiv, pp. 2025–04,

  6. [6]

    URLhttps://www.biorxiv.org/content/ early/2023/05/25/2023.05.24.542194

    doi: 10.1101/2023.05.24.542194. URLhttps://www.biorxiv.org/content/ early/2023/05/25/2023.05.24.542194. Evans, R., O’Neill, M., Pritzel, A., Antropova, N., Senior, A., Green, T., ˇZ´ıdek, A., Bates, R., Blackwell, S., Yim, J., et al. Protein complex prediction with alphafold- multimer.biorxiv, pp. 2021–10,

  7. [7]

    Protenix-mini: Efficient structure predictor via compact architecture, few-step diffusion and switchable plm.arXiv preprint arXiv:2507.11839,

    Gong, C., Chen, X., Zhang, Y ., Song, Y ., Zhou, H., and Xiao, W. Protenix-mini: Efficient structure predictor via compact architecture, few-step diffusion and switchable plm.arXiv preprint arXiv:2507.11839,

  8. [8]

    Qu, W., Ma, Y ., Ye, F., Lu, C., Zhou, Y ., Zhang, K., Wang, L., Gui, M., and Gu, Q

    doi: 10.1101/2025.06.14.659707. Qu, W., Ma, Y ., Ye, F., Lu, C., Zhou, Y ., Zhang, K., Wang, L., Gui, M., and Gu, Q. Seedproteo: Accurate de novo all-atom design of protein binders,

  9. [9]

    URLhttps: //arxiv.org/abs/2512.24192. Stark, H., Faltings, F., Choi, M., Xie, Y ., Hur, E., O’Donnell, T., Bushuiev, A., Uc ¸ar, T., Passaro, S., Mao, W., Reveiz, M., Bushuiev, R., Pluskal, T., Sivic, J., Kreis, K., Vahdat, A., Ray, S., Goldstein, J. T., Savinov, A., Hambalek, J. A., Gupta, A., Taquiri-Diaz, D. A., Zhang, Y ., Hatstat, A. K., Arada, A., K...

  10. [10]

    URLhttps://www.biorxiv.org/content/ early/2025/11/24/2025.11.20.689494

    doi: 10.1101/2025.11.20.689494. URLhttps://www.biorxiv.org/content/ early/2025/11/24/2025.11.20.689494. Team, P., Ren, M., Sun, J., Guan, J., Liu, C., Gong, C., Wang, Y ., Wang, L., Cai, Q., Chen, X., and Xiao, W. Pxdesign: Fast, modular, and accu- rate de novo design of protein binders.bioRxiv,

  11. [11]

    URL https://www.biorxiv.org/content/ early/2025/08/16/2025.08.15.670450

    doi: 10.1101/2025.08.15.670450. URL https://www.biorxiv.org/content/ early/2025/08/16/2025.08.15.670450. Van Kempen, M., Kim, S. S., Tumescheit, C., Mirdita, M., Lee, J., Gilchrist, C. L., S ¨oding, J., and Steinegger, M. Fast and accurate protein structure search with foldseek. Nature biotechnology, 42(2):243–246,

  12. [12]

    E., Patani, H., Danson, A

    Zambaldi, V ., La, D., Chu, A. E., Patani, H., Danson, A. E., Kwan, T. O., Frerix, T., Schneider, R. G., Saxton, D., Thillaisundaram, A., et al. De novo design of high- affinity protein binders with alphaproteo.arXiv preprint arXiv:2409.08022,

  13. [13]

    URLhttps: //arxiv.org/abs/2510.22304. 11 ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation Target PDB ID Crop Hotspot Natural binder Binder length BHRF12wh6A2–158 A65, A74, A77, A82, A85, A93 BH3 helix 80–120 SC2RBD6m0jE333–526 E485, E489, E494, E500, E505 ACE2 receptor 80–120 IL-7RA3di3B17–209 B58, B80, B139 IL-7 50–120 PD-L15o45A17...

  14. [14]

    For each target we report the total number of experimentally annotated non-binders (Neg.) and binders (Pos.), the number of non-binders retained after subsampling (Ret., capped at 20,000per crop for tractability in Figure 1a), and binder-sequence length statistics before/after subsampling. The EGFR row pools two crops, EGFRc and EGFRn, that share the same...

  15. [15]

    On this AF2-IG-pre-filtered AI-generated dataset, Protenix-Mini is more selective and retains fewer candidates, but achieves non-trivial precision on Mdm2 (0.623), PDL1 (0.389), IL7Ra (0.414), and InsulinReceptor (0.300), suggesting that our framework remains meaningfully correlated with wet-lab outcomes on AI-generated candidate distributions and not onl...

  16. [16]

    Multiple open-source variants of AF3 are released later, including Boltz-1 (Wohlwend et al., 2024), Boltz-2 (Passaro et al., 2025), Chai-1 (ChaiDiscovery et al.,

    achieves higher pre- diction accuracy and is able to predict the joint structure of complexes including proteins, nucleic acids, small molecules, ions and modified residues. Multiple open-source variants of AF3 are released later, including Boltz-1 (Wohlwend et al., 2024), Boltz-2 (Passaro et al., 2025), Chai-1 (ChaiDiscovery et al.,

  17. [17]

    Wet+/Wet−

    and Protenix (Chen et al., 2025). These predictors are now integral to design and evaluation workflows. 23 ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation Table 10.Retrospective verifier evaluation on AF2-IG-prefiltered RFdiffusion designs.“Wet+/Wet−” are the wet-lab confirmed binder/non-binder counts per target. “TP” is the number...