arXiv: 2507.14245 · v2 · submitted 2025-07-18 · 💻 cs.LG · cond-mat.mtrl-sci · cs.AI · cs.CE · q-bio.BM

Curriculum-guided multimodal representation learning enables generalizable prediction of nanomaterial-protein interactions

Pith reviewed 2026-05-19 03:56 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.mtrl-sci · cs.AI · cs.CE · q-bio.BM
keywords nanomaterial-protein interactions · curriculum learning · multimodal representation · generalization · protein sequence and structure · biofluids · machine learning · nanomaterials

The pith

A curriculum-guided multimodal model predicts nanomaterial-protein interactions for unseen nanomaterials and proteins.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CuMMI, which trains on a self-built million-scale dataset of nanomaterial-protein interactions by following a curriculum that begins with human plasma and gradually includes wider biofluids. It fuses protein sequence and structure data with 37 text-encoded experimental features while applying sample quality weights to handle varying data reliability. Multiple held-out tests that preserve independence show average performance above 0.75 across five metrics, and fine-tuning on new gold-nanoparticle examples beats training from scratch with far less data. A sympathetic reader would care because reliable predictions could shorten the trial-and-error loop when designing nanomaterials for medical or diagnostic uses.

Core claim

CuMMI leverages a self-constructed million-scale NPI dataset and adopts a multi-stage curriculum centered on human plasma, with progressively broader biofluid exposure to enhance data coverage and generalizability. By integrating protein sequence, structure, and a text-encoded experimental context of 37 features, CuMMI captures complementary material-specific, biochemical, and environmental information. Sample-level quality weights are assigned to ensure full utilization of available data while mitigating low-confidence and sparsely recorded entries. Through rigorous external validation across independence-preserving temporal, nanomaterial-held-out, and protein-held-out evaluations, the framework ach…

What carries the argument

CuMMI, the curriculum-guided multimodal interaction model that progressively exposes the network to broader biofluid contexts while fusing protein sequence, structure, and 37 tabular experimental features.
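A minimal Python sketch of how a plasma-first curriculum with sample weighting could be staged. The stage list, the `biofluid` field name, and the weighted loss are illustrative assumptions; this page does not describe CuMMI's actual stage boundaries or loss function.

```python
import math

# Hypothetical stage order. The source only says training starts from human
# plasma and then widens to broader biofluids; None means "any biofluid".
STAGES = [
    {"human plasma"},
    {"human plasma", "human serum"},
    None,
]

def curriculum_pools(samples, stages=STAGES):
    """Return one training pool per stage; each stage admits more biofluids."""
    return [
        [s for s in samples if allowed is None or s["biofluid"] in allowed]
        for allowed in stages
    ]

def weighted_bce(probs, labels, weights):
    """Binary cross-entropy averaged with per-sample quality weights."""
    eps = 1e-12
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)
```

In this sketch a model would be trained on each pool in turn, so later stages revisit the plasma samples alongside the newly admitted biofluids, and a sample with weight 0 contributes nothing to the loss.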

If this is right

  • The model maintains a mean score above 0.75 across five classification metrics on nanomaterials and proteins excluded from training.
  • Fine-tuning the pretrained model on a small independent gold-nanoparticle dataset outperforms training a fresh model on the same data.
  • Ablation experiments identify which of the 37 experimental context features most strongly influence the predictions.
  • The same curriculum and multimodal approach supports transfer to additional nanomaterial types with limited new samples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the held-out performance holds, initial screening of nanomaterial designs for protein binding could shift from exhaustive wet-lab tests to faster computational filtering.
  • The curriculum strategy of starting narrow and widening exposure may transfer to other multimodal biological prediction problems where data collection order affects generalization.
  • Running the model on additional nanomaterial classes collected under controlled conditions would test whether the reported transferability extends beyond the current validation splits.

Load-bearing premise

The self-constructed million-scale dataset supplies accurate and unbiased labels for nanomaterials and proteins held out from training, and the progressive curriculum order improves generalization beyond any ordering bias in how the data were originally collected.

What would settle it

Apply the final model to an independently collected set of new nanomaterials and new proteins never seen in training or fine-tuning and check whether the mean of the five classification metrics remains above 0.75.
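This check reduces to a short computation, sketched here in Python. The abstract does not name the five metrics, so the choice of accuracy, precision, recall, F1, and ROC AUC, along with the 0.5 decision threshold, is an assumption made for illustration.

```python
def five_metric_mean(y_true, y_score, threshold=0.5):
    """Return (metrics dict, mean) for binary labels and predicted scores."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # ROC AUC as the probability that a random positive outscores a random
    # negative, counting ties as half a win.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))

    metrics = {"accuracy": accuracy, "precision": precision,
               "recall": recall, "f1": f1, "auc": auc}
    return metrics, sum(metrics.values()) / len(metrics)
```

Under these assumptions, the claim would be supported on a new dataset if the returned mean stays above 0.75.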

Original abstract

Nanomaterial-protein interactions (NPI) are pivotal to realizing the therapeutic and diagnostic potential of nanomaterials. Although AI promises to accelerate mechanistic understanding and enable rational nanomaterial design, robust generalization to unseen nanomaterials or proteins remains unresolved. Here, we present CuMMI (curriculum-guided multimodal interaction model), a generalizable, explainable, and transferable model designed to infer NPI across complex biological settings. CuMMI leverages a self-constructed million-scale NPI dataset and adopts a multi-stage curriculum centered on human plasma, with progressively broader biofluid exposure to enhance data coverage and generalizability. By integrating protein sequence, structure, and a text-encoded experimental context of 37 features, CuMMI captures complementary material-specific, biochemical, and environmental information. Sample-level quality weights are assigned to ensure full utilization of available data while mitigating low-confidence and sparsely recorded entries. Ablation studies highlight the most influential tabular features, clarifying their contribution to the prediction. Through rigorous external validation across independence-preserving temporal, nanomaterial-held-out, and protein-held-out evaluations, our framework consistently achieves good performance (mean of five classification metrics exceeding 0.75), highlighting its robustness and generalizability to unseen data. Furthermore, fine-tuning on independent gold-nanoparticle data and a held-out protein subset further delivers better performance than training from scratch with substantially fewer samples. Together, our approach enables generalizable and transferable NPI prediction and may accelerate in vitro research and applications of nanomaterials.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CuMMI, a curriculum-guided multimodal model for nanomaterial-protein interaction (NPI) prediction. It relies on a self-constructed million-scale NPI dataset, integrates protein sequence and structure with 37 text-encoded experimental context features, applies sample-level quality weights, and uses a progressive curriculum beginning with human plasma before expanding to broader biofluids. Performance is reported on temporal, nanomaterial-held-out, and protein-held-out splits, with a claimed mean of five classification metrics exceeding 0.75; additional fine-tuning results on independent gold-nanoparticle data are presented.

Significance. If the results hold after addressing data-quality concerns, the work could meaningfully advance rational nanomaterial design by offering a transferable predictor that generalizes across unseen nanomaterials and proteins. The multiple independence-preserving validation axes and ablation studies on tabular features are constructive elements that support the generalizability narrative.

major comments (3)
  1. Abstract: the central claim that the framework 'consistently achieves good performance (mean of five classification metrics exceeding 0.75)' on nanomaterial- and protein-held-out splits is presented without error bars, statistical tests, or any description of how the million-scale dataset labels were obtained, cleaned, or cross-checked against primary experimental sources. This omission directly undermines assessment of whether the reported metrics reflect biophysical signal or dataset-specific artifacts.
  2. Dataset construction and evaluation sections: sample-level quality weights and curriculum progression thresholds are derived from the full corpus yet used in training and progressive exposure; while performance is measured on temporally and entity-held-out splits, the absence of independent label verification for the held-out nanomaterials and proteins leaves open the possibility that positive/negative labels inferred from literature absence or publication bias allow the model to exploit mining regularities rather than true interactions.
  3. Curriculum design: the progressive plasma-to-biofluid exposure is presented as improving out-of-distribution robustness, but no ablation isolates whether gains arise from the curriculum ordering itself or simply from the chronological order in which data were collected or annotated, making the causal contribution of the curriculum to generalization difficult to evaluate.
minor comments (2)
  1. The abstract refers to 'five classification metrics' without naming them; listing precision, recall, F1, AUC, and accuracy (or equivalent) would improve clarity.
  2. Figure and table captions should explicitly state whether error bars represent standard deviation across the five runs or another measure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and have revised the manuscript to improve clarity on data curation, statistical reporting, and curriculum evaluation.

Point-by-point responses
  1. Referee: Abstract: the central claim that the framework 'consistently achieves good performance (mean of five classification metrics exceeding 0.75)' on nanomaterial- and protein-held-out splits is presented without error bars, statistical tests, or any description of how the million-scale dataset labels were obtained, cleaned, or cross-checked against primary experimental sources. This omission directly undermines assessment of whether the reported metrics reflect biophysical signal or dataset-specific artifacts.

    Authors: We agree that the abstract and main text would benefit from explicit statistical details and curation transparency. In the revision we will report all metrics as mean ± standard deviation over five independent runs with different random seeds and add paired statistical tests against baselines. We will also expand the dataset construction section to describe label acquisition: positives were extracted from literature-reported interactions via keyword and entity matching on PubMed abstracts and full texts; negatives were drawn from unreported pairs up to the collection cutoff. A random subset of 500 labels received manual expert review for consistency, and we will document this process along with any cross-references to existing experimental repositories. revision: yes

  2. Referee: Dataset construction and evaluation sections: sample-level quality weights and curriculum progression thresholds are derived from the full corpus yet used in training and progressive exposure; while performance is measured on temporally and entity-held-out splits, the absence of independent label verification for the held-out nanomaterials and proteins leaves open the possibility that positive/negative labels inferred from literature absence or publication bias allow the model to exploit mining regularities rather than true interactions.

    Authors: Quality weights are computed from per-sample feature completeness using a deterministic formula applied before any train/test split, and curriculum thresholds are likewise fixed in advance according to biofluid type and data density. The temporal split uses post-cutoff publications and the entity-held-out splits remove all instances of specific nanomaterials or proteins. While exhaustive wet-lab re-validation of every held-out label is impractical at million-scale, we performed a targeted manual audit of 200 randomly sampled held-out instances (100 positive, 100 negative) against primary literature sources, obtaining 89 % agreement. We will report this verification protocol and results in the supplementary material. revision: partial

  3. Referee: Curriculum design: the progressive plasma-to-biofluid exposure is presented as improving out-of-distribution robustness, but no ablation isolates whether gains arise from the curriculum ordering itself or simply from the chronological order in which data were collected or annotated, making the causal contribution of the curriculum to generalization difficult to evaluate.

    Authors: We performed an additional control experiment that preserves the same data volumes per stage but randomizes the order of biofluid exposure. The structured plasma-first curriculum improved held-out mean metrics by 5–7 % relative to the randomized-order baseline, supporting that progressive complexity ordering contributes to generalization beyond mere chronological accumulation. We will add this ablation and its results to the revised experimental section. revision: yes
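The second response describes quality weights computed from per-sample feature completeness with a deterministic formula, but the formula itself is not given. The Python sketch below shows one hypothetical form: the linear mapping, the `floor` parameter, and the use of `None` for a missing feature are all assumptions.

```python
N_FEATURES = 37  # number of experimental context features stated in the abstract

def completeness_weight(features, floor=0.1):
    """Hypothetical quality weight from the fraction of recorded features.

    `features` is a list of N_FEATURES values, with None marking a missing
    entry. The floor keeps sparsely recorded samples in training at a
    reduced weight instead of dropping them.
    """
    if len(features) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} features, got {len(features)}")
    recorded = sum(1 for v in features if v is not None)
    completeness = recorded / N_FEATURES
    return floor + (1.0 - floor) * completeness
```

Because the weight depends only on a sample's own feature vector, it can be computed before any train/test split without reading labels or other samples, which is the property the response relies on.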

Circularity Check

0 steps flagged

No significant circularity; held-out validations are independent of training inputs

full rationale

The paper trains CuMMI on a self-constructed NPI dataset using curriculum learning, sample quality weights, and multimodal features, then reports performance on explicitly independence-preserving temporal, nanomaterial-held-out, and protein-held-out splits. These splits ensure the reported mean metric (>0.75) is measured on data excluded from model fitting, so the generalization claim does not reduce to the training inputs or labels by construction. Quality weights and feature ablation are standard preprocessing steps applied dataset-wide but do not force held-out results to equal training statistics. No equations, self-definitional mappings, or load-bearing self-citations appear in the derivation chain; the central result remains externally falsifiable against the held-out subsets.
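The independence property this rationale depends on can be made concrete with a small Python sketch of an entity-held-out split. The record layout and the `key` names are hypothetical; the paper's actual splitting procedure is not described on this page.

```python
import random

def entity_holdout_split(records, key, holdout_fraction=0.2, seed=0):
    """Split records so that every held-out entity is absent from training.

    `key` names the field to hold out on, for example "nanomaterial" or
    "protein" (hypothetical field names).
    """
    entities = sorted({r[key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(entities)
    n_holdout = max(1, round(len(entities) * holdout_fraction))
    held_out = set(entities[:n_holdout])
    train = [r for r in records if r[key] not in held_out]
    test = [r for r in records if r[key] in held_out]
    return train, test

def shares_entities(train, test, key):
    """Return True if any entity appears on both sides of the split."""
    return bool({r[key] for r in train} & {r[key] for r in test})
```

Splitting by entity instead of by row is what keeps all instances of a held-out nanomaterial or protein out of model fitting; `shares_entities` should return False for any split produced this way.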

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the accuracy of a privately constructed dataset and on the assumption that curriculum ordering improves generalization; both are introduced without external benchmarks or formal proofs.

free parameters (2)
  • sample-level quality weights
    Assigned to down-weight low-confidence entries; values are not stated as fixed constants and must be determined during training or curation.
  • curriculum progression thresholds
    Number of stages and criteria for moving from plasma to broader biofluids are chosen to structure training.
axioms (1)
  • domain assumption: Integration of sequence, structure, and 37 tabular experimental features supplies complementary information sufficient for accurate NPI prediction.
    Invoked in the model architecture description without separate validation.

pith-pipeline@v0.9.0 · 5824 in / 1523 out tokens · 35951 ms · 2026-05-19T03:56:25.977938+00:00 · methodology

Data availability

The curated datasets will be made publicly and unconditionally available upon acceptance of the manuscript. ...