PepBenchmark: A Standardized Benchmark for Peptide Machine Learning

Jiahui Zhang; Kuangqi Zhou; Lingyan Zhu; Rouyi Wang; Tianshu Xiao; Yang Wang; Yaosen Min

arxiv: 2604.10531 · v1 · submitted 2026-04-12 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-10 15:47 UTC · model grok-4.3

keywords: peptide machine learning, benchmark, drug discovery, standardized pipeline, peptide datasets, machine learning models, GNN, PLM

The pith

PepBenchmark supplies the first standardized benchmark for peptide machine learning through unified datasets, preprocessing, and evaluation protocols.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Peptide therapeutics are considered the third generation of drugs, but machine learning for peptides has struggled due to inconsistent benchmarks. The authors address this by assembling PepBenchmark, which includes a collection of 35 peptide datasets, a fixed pipeline for handling them, and a leaderboard for model evaluation. This matters to a reader because without such standardization, progress is hard to measure or build upon across different studies. If the benchmark succeeds, it should allow clearer identification of effective techniques and smoother movement toward real peptide drug applications.

Core claim

PepBenchmark comprises PepBenchData with 29 canonical and 6 non-canonical peptide datasets in 7 groups, PepBenchPipeline for standardized cleaning and splitting, and PepBenchLeaderboard with baselines across fingerprint, GNN, PLM, and SMILES model families, establishing a common foundation for peptide drug discovery.

What carries the argument

The PepBenchmark framework, which unifies data resources and protocols to enable consistent and comparable machine learning experiments on peptides.

If this is right

Peptide ML methods can now be evaluated under identical conditions for direct comparison.
Baselines from four methodological families provide reference points for new work.
The coverage of both canonical and non-canonical peptides broadens applicability to diverse drug design problems.
Standardized preprocessing reduces quality issues that arise in custom pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The availability of this resource may encourage more researchers to focus on peptide-specific modeling challenges rather than data preparation.
Performance insights from the leaderboard could guide the selection of model architectures for particular peptide tasks.
Over time, the benchmark might be expanded to incorporate additional real-world metrics such as experimental validation results.

Load-bearing premise

The 35 datasets and the preprocessing and splitting rules chosen for PepBenchPipeline are representative of real peptide drug-discovery challenges and free from systematic biases introduced by curation decisions.

What would settle it

An experiment showing that using different datasets or preprocessing steps outside PepBenchPipeline produces substantially altered model performance rankings would indicate that the benchmark does not fully capture the domain.
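A minimal sketch of how such a ranking comparison could be quantified, assuming each pipeline yields one score per model family. It computes Kendall's tau between the two resulting rankings; the function name `kendall_tau` and all scores are hypothetical illustrations, not values or code from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {model: score} dicts over their shared models."""
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models")
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # A pair is concordant when both pipelines order the two models the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for illustration only (not results from the paper).
benchmark_pipeline = {"fingerprint": 0.81, "gnn": 0.78, "plm": 0.86, "smiles": 0.74}
alternative_pipeline = {"fingerprint": 0.79, "gnn": 0.83, "plm": 0.85, "smiles": 0.70}
tau = kendall_tau(benchmark_pipeline, alternative_pipeline)
print(f"Kendall tau between rankings: {tau:.3f}")
```

A tau near 1 would suggest the rankings are stable under the alternative preprocessing, while a value near 0 or below would point to the kind of ranking shift described above.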

Figures

Figures reproduced from arXiv: 2604.10531 by Jiahui Zhang, Kuangqi Zhou, Lingyan Zhu, Rouyi Wang, Tianshu Xiao, Yang Wang, Yaosen Min.

**Figure 2.** Amino acid distribution comparison between positive and negative samples for nonfouling…

**Figure 3.** Length distribution of nonfouling dataset.

**Figure 4.** Property comparison between positive and negative samples for nonfouling dataset.

**Figure 5.** Amino acid distribution comparison between positive and negative samples for cpp dataset.

**Figure 6.** Length distribution of cpp dataset.

**Figure 7.** Property comparison between positive and negative samples for cpp dataset.

**Figure 8.** Amino acid distribution for nc-cpp pampa dataset.

**Figure 9.** Non-canonical monomers distribution for nc-cpp.

**Figure 10.** Length distribution of nc-cpp pampa dataset.

B.1.4 BLOOD-BRAIN BARRIER PEPTIDES (BBP)
• Property and Application: The bbp dataset focuses on a peptide's ability to penetrate the blood-brain barrier, a critical property for developing neuroactive peptides suitable for central nervous system diseases, such as Alzheimer's disease.
• Data Source: The dataset is sourced from BBPpred (Dai et al., 2021), which col…

**Figure 11.** Amino acid distribution comparison between positive and negative samples for bbp…

**Figure 12.** Length distribution of bbp dataset.

**Figure 13.** Property comparison between positive and negative samples for bbp dataset.

**Figure 14.** Amino acid distribution comparison between positive and negative samples for antimi…

**Figure 15.** Length distribution of antimicrobial dataset.

**Figure 16.** Property comparison between positive and negative samples for antimicrobial dataset.

**Figure 17.** Amino acid distribution comparison between positive and negative samples for antibac…

**Figure 18.** Length distribution of antibacterial dataset.

**Figure 19.** Property comparison between positive and negative samples for antibacterial dataset.

**Figure 20.** Amino acid distribution comparison between positive and negative samples for antifungal…

**Figure 21.** Length distribution of antifungal dataset.

**Figure 22.** Property comparison between positive and negative samples for antifungal dataset.

**Figure 23.** Amino acid distribution comparison between positive and negative samples for antipara…

**Figure 24.** Length distribution of antiparasitic dataset.

**Figure 25.** Property comparison between positive and negative samples for antiparasitic dataset.

**Figure 26.** Amino acid distribution comparison between positive and negative samples for antiviral…

**Figure 27.** Length distribution of antiviral dataset.

**Figure 28.** Property comparison between positive and negative samples for antiviral dataset.

**Figure 29.** Length distribution of E. coli MIC dataset.

**Figure 30.** Label distribution of E. coli MIC dataset.

**Figure 31.** Amino acid distribution of E. coli MIC dataset.

**Figure 32.** Length distribution of P. aeruginosa MIC dataset.

**Figure 33.** Label distribution of P. aeruginosa MIC dataset.

**Figure 34.** Amino acid distribution of P. aeruginosa MIC dataset.

**Figure 35.** Length distribution of S. aureus MIC dataset.

**Figure 36.** Label distribution of S. aureus MIC dataset.

**Figure 37.** Amino acid distribution of S. aureus MIC dataset.

**Figure 38.** Canonical amino acid distribution comparison between positive and negative samples…

**Figure 39.** Non-canonical amino acid distribution comparison between positive and negative sam…

**Figure 40.** Length distribution of nc-antimicrobial dataset.

**Figure 41.** Canonical amino acid distribution comparison between positive and negative samples…

**Figure 42.** Non-canonical amino acid distribution comparison between positive and negative sam…

**Figure 43.** Length distribution of nc-antibacterial dataset.

**Figure 44.** Canonical amino acid distribution comparison between positive and negative samples…

**Figure 45.** Non-canonical amino acid distribution comparison between positive and negative sam…

**Figure 46.** Length distribution of nc-antifungal dataset.

**Figure 47.** Amino acid distribution comparison between positive and negative samples for anticancer…

**Figure 48.** Length distribution of anticancer dataset.

**Figure 49.** Property comparison between positive and negative samples for anticancer dataset.

**Figure 50.** Amino acid distribution comparison between positive and negative samples for tumor…

**Figure 51.** Length distribution of tumor T-cell antigen dataset.

**Figure 52.** Property comparison between positive and negative samples for tumor T-cell antigen…

**Figure 53.** Amino acid distribution comparison between positive and negative samples for ace in…

**Figure 54.** Length distribution of ace inhibitory dataset.

**Figure 55.** Property comparison between positive and negative samples for ace inhibitory dataset.

**Figure 56.** Length distribution of ace inhibitory ic50 dataset.

**Figure 57.** Label distribution of ACE inhibitory IC50 dataset.

**Figure 58.** Amino acid distribution of ace inhibitory ic50 dataset.

**Figure 59.** Amino acid distribution comparison between positive and negative samples for dpp-iv…

**Figure 60.** Length distribution of dpp-iv inhibitor dataset.

**Figure 61.** Property comparison between positive and negative samples for dpp-iv inhibitor dataset.

**Figure 62.** Amino acid distribution comparison between positive and negative samples for antidia…

**Figure 63.** Length distribution of antidiabetic dataset.

**Figure 64.** Property comparison between positive and negative samples for antidiabetic dataset.

**Figure 65.** Amino acid distribution comparison between positive and negative samples for antiaging…

**Figure 66.** Length distribution of antiaging dataset.

**Figure 67.** Property comparison between positive and negative samples for antiaging dataset.

**Figure 68.** Amino acid distribution comparison between positive and negative samples for anti…

**Figure 69.** Length distribution of anti-inflammatory dataset.

**Figure 70.** Property comparison between positive and negative samples for anti-inflammatory…

**Figure 71.** Amino acid distribution comparison between positive and negative samples for antioxi…

**Figure 72.** Length distribution of antioxidant dataset.

**Figure 73.** Property comparison between positive and negative samples for antioxidant dataset.

**Figure 74.** Amino acid distribution comparison between positive and negative samples for neuropep…

**Figure 75.** Length distribution of neuropeptide dataset.

**Figure 76.** Property comparison between positive and negative samples for neuropeptide dataset.

**Figure 77.** Amino acid distribution comparison between positive and negative samples for quorum…

**Figure 78.** Length distribution of quorum sensing dataset.

**Figure 79.** Property comparison between positive and negative samples for quorum sensing dataset.

**Figure 80.** Amino acid distribution for PpI dataset.

**Figure 81.** Length distribution of PpI dataset.

B.6.2 PEPTIDE-PROTEIN BINDING AFFINITY (PPI BA)
• Property and Application: The PpI ba dataset provides binding affinity measurements for peptide-protein interactions, reported as -lgKd (M). These quantitative values allow evaluation of interaction strength and support predictive modeling of high-affinity peptide ligands.
• Data Source: The dataset is sourced…

**Figure 82.** Amino acid distribution for PpI ba dataset.

**Figure 83.** Length distribution of PpI ba dataset.

**Figure 84.** Amino acid distribution comparison between positive and negative samples for allergen…

**Figure 85.** Length distribution of allergen dataset.

**Figure 86.** Property comparison between positive and negative samples for allergen dataset.

**Figure 87.** Length distribution of hemolytic hc50 dataset.

**Figure 88.** Label distribution of hemolytic HC50 dataset.

**Figure 89.** Amino acid distribution of hemolytic hc50 dataset.

**Figure 90.** Amino acid distribution comparison between positive and negative samples for hemolytic…

**Figure 91.** Length distribution of hemolytic dataset.

**Figure 92.** Property comparison between positive and negative samples for hemolytic dataset.

**Figure 93.** Canonical amino acid distribution comparison between positive and negative samples…

**Figure 94.** Non-canonical amino acid distribution comparison between positive and negative sam…

**Figure 95.** Length distribution of nc-hemolytic dataset.

**Figure 96.** Amino acid distribution comparison between positive and negative samples for neuro…

**Figure 97.** Length distribution of neurotoxin dataset.

**Figure 98.** Property comparison between positive and negative samples for neurotoxin dataset.

**Figure 99.** Amino acid distribution comparison between positive and negative samples for toxicity…

**Figure 100.** Length distribution of toxicity dataset.

**Figure 101.** Property comparison between positive and negative samples for toxicity dataset.

**Figure 102.** Overview of the construction pipeline for peptide classification datasets.

**Figure 103.** Dataset sequence overlap heatmap. The color intensity represents the proportion of…

**Figure 104.** Distributional comparison of positive and negative samples before and after sampling.

**Figure 105.** Isolated Sequences Ratio vs MMseqs2 Identity Threshold (from 0.1 to 1.0) Across…

**Figure 106.** Overview of the construction pipeline for peptide regression datasets.

**Figure 107.** Isolated Sequences Ratio vs MMseqs2 Identity Threshold (from 0.1 to 1.0) Across…

**Figure 108.** Schematic diagram of the classification dataset construction.

**Figure 109.** Length distribution of the uniref50 50 dataset.

**Figure 110.** Amino acid distribution of the uniref50 50 dataset.

… masked language models require the following variant: $\mathrm{PseudoPPL}(x) = \exp\left(-\frac{1}{L}\sum_{i=1}^{L}\log p(x_i \mid x_{j \neq i})\right)$. We used 85,113 unique peptide sequences absent from UniRef100 for evaluation. As shown in…

**Figure 111.** Perplexity of ESM-150M (ESM2) and fine-tuned ESM-150M-F (ours) across peptide…

**Figure 112.** Comparison of physicochemical property distributions between generated molecules…
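The pseudo-perplexity variant quoted in the Figure 110 excerpt can be sketched as below. This is a minimal illustration, assuming the caller has already obtained, for each position i, the natural-log probability that a masked language model assigns to the true residue when only that position is masked. The function name `pseudo_perplexity` is hypothetical and no particular model API is implied.

```python
import math


def pseudo_perplexity(log_probs):
    """Return exp(-(1/L) * sum_i log p(x_i | x_{j != i})) for one sequence.

    log_probs: natural-log probabilities of the true residue at each position,
    each obtained with only that position masked (assumed precomputed).
    """
    if not log_probs:
        raise ValueError("sequence must have at least one position")
    return math.exp(-sum(log_probs) / len(log_probs))


# Sanity check with a hypothetical model that is uniform over 20 amino acids:
# every position has probability 1/20, so the pseudo-perplexity is about 20.
uniform_log_probs = [math.log(1 / 20)] * 8
print(pseudo_perplexity(uniform_log_probs))
```

Lower values indicate that the model finds the sequence more predictable; the uniform case gives the upper reference point for a 20-letter alphabet.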

Original abstract

Peptide therapeutics are widely regarded as the "third generation" of drugs, yet progress in peptide Machine Learning (ML) is hindered by the absence of standardized benchmarks. Here we present PepBenchmark, which unifies datasets, preprocessing, and evaluation protocols for peptide drug discovery. PepBenchmark comprises three components: (1) PepBenchData, a well-curated collection comprising 29 canonical-peptide and 6 non-canonical-peptide datasets across 7 groups, systematically covering key aspects of peptide drug development, representing, to the best of our knowledge, the most comprehensive AI-ready dataset resource to date; (2) PepBenchPipeline, a standardized preprocessing pipeline that ensures consistent dataset cleaning, construction, splitting, and feature transformation, mitigating quality issues common in ad hoc pipelines; and (3) PepBenchLeaderboard, a unified evaluation protocol and leaderboard with strong baselines across 4 major methodological families: Fingerprint-based, GNN-based, PLM-based, and SMILES-based models. Together, PepBenchmark provides the first standardized and comparable foundation for peptide drug discovery, facilitating methodological advances and translation into real-world applications. The data and code are publicly available at https://github.com/ZGCI-AI4S-Pep/PepBenchmark/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PepBenchmark aggregates 35 peptide datasets into one pipeline and leaderboard, a useful standardization move whose representativeness is asserted without quantitative checks against real-world peptide distributions.

The main takeaway is that this paper collects 35 datasets (29 canonical, 6 non-canonical) across 7 groups, applies one consistent preprocessing and splitting pipeline, and runs a shared leaderboard with baselines from fingerprint, GNN, PLM, and SMILES models. That unification is the concrete advance over the scattered datasets and protocols that came before, and the public GitHub release of code and data makes it immediately usable.

Referee Report

2 major / 2 minor

Summary. The paper introduces PepBenchmark as a unified resource for peptide ML in drug discovery, comprising PepBenchData (a curated set of 35 datasets: 29 canonical and 6 non-canonical across 7 groups), PepBenchPipeline (standardized preprocessing, cleaning, splitting, and feature transformation), and PepBenchLeaderboard (unified evaluation with baselines from fingerprint-based, GNN-based, PLM-based, and SMILES-based models). It claims this provides the first standardized and comparable foundation for the field, with public data and code release.

Significance. If the datasets prove representative of real-world peptide drug-discovery tasks and the pipeline avoids introducing curation biases, the benchmark could enable reproducible comparisons across methods and support translation to applications, analogous to MoleculeNet-style resources in other domains. The public release of datasets and code is a clear positive contribution.

major comments (2)

[PepBenchData] PepBenchData section: The claim that the 35 datasets constitute 'the most comprehensive AI-ready dataset resource to date' and systematically cover key aspects of peptide drug development is asserted without quantitative validation. No statistical comparisons (e.g., sequence length distributions, modification frequencies, physicochemical property distributions, or task difficulty metrics) are provided against external references such as clinical peptide therapeutics databases or large peptide libraries, which is load-bearing for the 'standardized and comparable foundation' claim.
[PepBenchPipeline] PepBenchPipeline description: The pipeline is presented as mitigating quality issues via systematic cleaning and splitting, yet no validation (e.g., ablation studies or checks confirming preservation of original task semantics) is reported. This leaves open the possibility that curation choices systematically bias the leaderboard results or limit generalizability to real discovery pipelines.
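One of the statistical comparisons the referee asks for, sequence-length distributions against an external reference, could be sketched as follows. This is an illustration only: it computes the two-sample Kolmogorov-Smirnov statistic by hand, and the function name `ks_statistic` and both peptide lists are hypothetical placeholders rather than data from the paper or from any named database.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    # bisect_right(a, x) counts values <= x, so dividing by len gives the ECDF at x.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


# Hypothetical sequences for illustration only.
benchmark_peptides = ["ACDKL", "GGKRWW", "KLLKLLK"]
reference_peptides = ["ACD", "KLAKLAKK", "RRWWRRWWRR"]

benchmark_lengths = [len(seq) for seq in benchmark_peptides]
reference_lengths = [len(seq) for seq in reference_peptides]
d_stat = ks_statistic(benchmark_lengths, reference_lengths)
print(f"KS statistic on sequence lengths: {d_stat:.3f}")
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap), so it gives one number per dataset that could be tabulated against a chosen reference collection.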

minor comments (2)

The introduction could include explicit comparisons to prior benchmarks in related areas (e.g., protein or small-molecule ML) to sharpen the novelty statement.
Clarify the precise selection criteria for the 7 groups and 35 datasets, including any exclusion rules, to improve reproducibility of the curation process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive view of PepBenchmark's potential contribution. We address each major comment below and describe the revisions we will make.

Referee: [PepBenchData] PepBenchData section: The claim that the 35 datasets constitute 'the most comprehensive AI-ready dataset resource to date' and systematically cover key aspects of peptide drug development is asserted without quantitative validation. No statistical comparisons (e.g., sequence length distributions, modification frequencies, physicochemical property distributions, or task difficulty metrics) are provided against external references such as clinical peptide therapeutics databases or large peptide libraries, which is load-bearing for the 'standardized and comparable foundation' claim.

Authors: We agree that quantitative validation would strengthen the claim. Our statement that PepBenchData is the most comprehensive AI-ready resource to date rests on a literature survey showing that prior peptide collections are smaller, task-specific, and lack unified preprocessing. The seven groups were deliberately chosen to span core peptide drug-development tasks (binding, toxicity, stability, etc.). In the revised manuscript we will add a dedicated subsection containing statistical comparisons of sequence-length distributions, modification frequencies, and physicochemical properties against accessible external references such as PepBank and curated clinical-peptide lists, thereby providing the requested quantitative support. revision: yes
Referee: [PepBenchPipeline] PepBenchPipeline description: The pipeline is presented as mitigating quality issues via systematic cleaning and splitting, yet no validation (e.g., ablation studies or checks confirming preservation of original task semantics) is reported. This leaves open the possibility that curation choices systematically bias the leaderboard results or limit generalizability to real discovery pipelines.

Authors: We appreciate the concern. The pipeline applies standard, widely adopted cleaning and splitting procedures to eliminate common ad-hoc artifacts and ensure reproducibility. We acknowledge that the original submission did not include explicit ablation or semantic-preservation checks. In the revised version we will add (i) ablation experiments quantifying the effect of each cleaning step on downstream model performance and (ii) distributional analyses demonstrating that the chosen splits preserve the original task semantics and label distributions. These additions will directly address the possibility of systematic bias. revision: yes
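The label-distribution part of the promised check could look roughly like the sketch below, assuming binary labels and splits held as plain lists. The helper names `positive_rate` and `label_shift` and the example labels are hypothetical; the rebuttal does not specify how the authors will run this analysis.

```python
def positive_rate(labels):
    """Fraction of positive (1) labels in a non-empty list of 0/1 labels."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(labels) / len(labels)


def label_shift(splits):
    """Map each split name to (its positive rate minus the overall positive rate)."""
    all_labels = [y for labels in splits.values() for y in labels]
    overall = positive_rate(all_labels)
    return {name: positive_rate(labels) - overall for name, labels in splits.items()}


# Hypothetical labels for illustration only.
example_splits = {
    "train": [1, 0, 1, 0, 1, 0, 1, 0],
    "valid": [1, 0],
    "test": [1, 1],
}
shifts = label_shift(example_splits)
for name, shift in shifts.items():
    print(f"{name}: {shift:+.3f}")
```

Shifts close to zero in every split would support the claim that splitting preserves the label distribution; a large shift in one split, as in the toy test split here, would flag a partition worth inspecting.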

Circularity Check

0 steps flagged

No circularity; benchmark construction is explicit data aggregation and protocol definition with no derivations or self-referential reductions.

full rationale

The manuscript describes PepBenchmark as the assembly of 35 curated datasets (PepBenchData), a fixed preprocessing/splitting pipeline (PepBenchPipeline), and a unified leaderboard with baselines across model families (PepBenchLeaderboard). No equations, parameter fitting, predictions, or uniqueness theorems appear in the provided text. The claim of supplying the 'first standardized and comparable foundation' is an assertion of novelty and coverage rather than a derivation that reduces to its own inputs by construction. No self-citations are used to justify load-bearing steps, and the representativeness of the chosen datasets is presented as a curation choice rather than a fitted or self-defined result. The work is therefore self-contained as an engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on domain choices about which datasets matter for peptide drug discovery and how to clean them; no numerical parameters are fitted and no new physical entities are postulated.

axioms (1)

Domain assumption: The 29 canonical and 6 non-canonical datasets grouped into 7 categories collectively cover the key tasks in peptide drug development.
Invoked when the paper states the collection is the most comprehensive and systematically covers key aspects.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1]

In 12 datasets, all sequences have lengths below 50.

[2]

In 18 datasets, fewer than 10% of the sequences exceed a length of 50. (Published as a conference paper at ICLR 2026.)

[3]

In 5 datasets, at least 10% of the sequences are longer than 50, with maximum lengths reaching up to 150 (including datasets on antifungal, antimicrobial, antiparasitic, anticancer, allergen, neurotoxin, and toxicity). For the benchmark, we believe it is necessary to define a unified length criterion. From the perspective of drug development experts, di...

[4]

Principle 1: The differences between positive and negative samples should reflect gener- alizable biological properties rather than dataset-specific artifacts (e.g., clear distributional differences, inherent contrasts between active and inactive peptides)

work page
[5]

highly related,

Principle 2: False negatives should be avoided as much as possible. Based on these principles, we introduce Biologically Informed and Distribution-Controlled Negative Sampling (BDNegSamp). The procedure involves three key steps: Step 1: Construction of the Biologically Informed Negative Sample Pool. The initial negative pool is defined as all collected bi...

[6]

Run the enrichment analysis to identify significantly enriched k-mers (henceforth motifs)

[7]

Group sequences into clusters such that any two sequences sharing at least one enriched motif are placed in the same cluster

[8]

Assign clusters (not sequences) to train/validation/test according to the target ratio (8:1:1 in our experiments). By construction, this protocol prevents any enriched motif from appearing across partitions, thereby forcing models to generalize beyond dataset-specific shortcuts and yielding a more faithful estimate of cross-motif generalization difficulty...
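
The motif-clustering and cluster-assignment steps quoted above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `motif_clusters` and `assign_clusters`, the greedy assignment rule, and the toy sequences and motifs are all hypothetical, and the enrichment analysis is taken as given in the form of a precomputed set of motifs.

```python
from collections import defaultdict


def motif_clusters(seqs, motifs):
    """Group sequence indices so that sequences sharing a motif end up together.

    Uses union-find, so sharing is transitive (connected components).
    `motifs` is assumed to be the output of a separate enrichment analysis.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Path-halving lookup of the component representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_with = {}  # motif -> index of the first sequence seen containing it
    for i, s in enumerate(seqs):
        for m in motifs:
            if m in s:
                if m in first_with:
                    parent[find(i)] = find(first_with[m])
                else:
                    first_with[m] = i

    groups = defaultdict(list)
    for i in range(len(seqs)):
        groups[find(i)].append(i)
    return list(groups.values())


def assign_clusters(clusters, ratios=(0.8, 0.1, 0.1)):
    """Assign whole clusters to train/validation/test.

    Hypothetical greedy rule: largest cluster first, into the split that is
    currently furthest below its target size.
    """
    total = sum(len(c) for c in clusters)
    targets = [r * total for r in ratios]
    splits = [[] for _ in ratios]
    for c in sorted(clusters, key=len, reverse=True):
        k = max(range(len(ratios)), key=lambda j: targets[j] - len(splits[j]))
        splits[k].extend(c)
    return splits


if __name__ == "__main__":
    toy = ["AKKLW", "KKLGG", "GGPRR", "PRRAA", "MMMMM",
           "TTTTT", "SSSSS", "VVVVV", "LLLLL", "IIIII"]
    train, val, test = assign_clusters(motif_clusters(toy, {"KKL", "PRR"}))
    print(len(train), len(val), len(test))
```

Because clusters are never broken up, the realized ratio only approximates 8:1:1 when one motif cluster is very large; that trade-off is inherent to cluster-level assignment, whatever assignment rule is used.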

[9]

Apply the k-mer-aware procedure in subsubsection D.4.1 to identify motif clusters and allocate them to splits

[10]

For MMseqs2-based homology splitting, parameter settings also need to be standardized

For sequences that do not contain any enriched motif, apply MMseqs2 to form homology clusters and allocate these clusters to the existing splits, maintaining the desired proportions. For MMseqs2-based homology splitting, parameter settings also need to be standardized. By examining the changes in the number of isolated sequences under different identity...
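
The second stage of the hybrid split, adding homology clusters to splits that already hold the motif clusters, might look like the sketch below. Everything here is an assumption for illustration: the helper names `read_clusters` and `extend_splits` are hypothetical, the clustering is assumed to have been run beforehand and exported as tab-separated `representative<TAB>member` lines, and the greedy fill rule is one simple way to keep the proportions, not necessarily the one used in the paper.

```python
from collections import defaultdict


def read_clusters(tsv_lines):
    """Parse 'representative<TAB>member' lines into lists of member ids."""
    groups = defaultdict(list)
    for line in tsv_lines:
        rep, member = line.rstrip("\n").split("\t")
        groups[rep].append(member)
    return list(groups.values())


def extend_splits(splits, clusters, ratios=(0.8, 0.1, 0.1)):
    """Add whole homology clusters to existing splits.

    Targets are computed over the final dataset size (existing members plus
    all new cluster members), so the motif-based allocation already in
    `splits` counts toward the desired proportions.
    """
    splits = [list(s) for s in splits]  # copy, leave the input untouched
    total = sum(len(s) for s in splits) + sum(len(c) for c in clusters)
    for c in sorted(clusters, key=len, reverse=True):
        k = max(range(len(ratios)),
                key=lambda j: ratios[j] * total - len(splits[j]))
        splits[k].extend(c)
    return splits


if __name__ == "__main__":
    motif_stage = [["m1", "m2", "m3", "m4"], [], []]  # toy output of stage one
    tsv = ["a\ta", "a\tb", "c\tc", "d\td", "e\te", "f\tf"]
    train, val, test = extend_splits(motif_stage, read_clusters(tsv))
    print(len(train), len(val), len(test))
```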

[11]

Results on 22 canonical peptide classification datasets under four splitting strategies: hybrid-split (Table 21), MMseqs2-split (Table 23), k-mer–split (Table 22), and random-split (Table 24)

[12]

Results on 4 non-canonical peptide regression datasets under four splitting strategies: hybrid-split (Table 25), MMseqs2-split (Table 27), k-mer–split (Table 26), and random-split (Table 28)

[13]

Results on 5 non-canonical peptide datasets under five splitting strategies: ECFP-split (Table 29), hybrid-split (Table 30), MMseqs2-split (Table 32), k-mer–split (Table 31), and random-split (Table 33)

[14]

Results on 3 PepPI datasets without freezing the protein encoder (Table 34)

[15]

For most datasets in PepBenchData-150, the proportion of sequences longer than 50 residues is below 10% (see Table 7 and Table 8)

Results on the PepBenchData-150 benchmark (Table 36). For most datasets in PepBenchData-150, the proportion of sequences longer than 50 residues is below 10% (see Table 7 and Table 8). Therefore, we consider these datasets to be largely comparable to the PepBenchData-50 version. Only five datasets contain more than 50% sequences longer than 50 residues, and...

[16]

The conclusions in the main text hold across all partitioning strategies

[17]

The hybrid partition combines the advantages of MMseqs2-split and kmer-split, making it a more challenging partitioning strategy

The performance of kmer-split is significantly lower than that of random-split, indicating that the issue of kmer leakage is severe. The hybrid partition combines the advantages of MMseqs2-split and kmer-split, making it a more challenging partitioning strategy

[18]

This result is expected, as ESM2-150-F is fine-tuned only on peptide data and thus forgets the protein pre-training knowledge

In the PepBenchData-150 version, the fine-tuned ESM2-150-F performs worse than ESM2-150M. This result is expected, as ESM2-150-F is fine-tuned only on peptide data and thus forgets the protein pre-training knowledge. As shown in Figure 111, ESM2-150M-F exhibits higher perplexity than ESM2-150M on sequences longer than 50

[19]

bbp", official_feature_names=[

For the PepPI task, it remains uncertain whether freezing the protein encoder is necessary. Table 21: Performance of models on canonical peptide classification (ROC-AUC↑, %) with hybrid-split. Dataset sizes are shown separately; results are mean±std. Best and second-best scores per row are in bold and gray sh...

