
arxiv: 2606.04566 · v1 · pith:TTIVCRU5new · submitted 2026-06-03 · 🧬 q-bio.BM

AF_Cache: Efficient Pipeline for Running AlphaFold for High-Throughput Protein-Protein Interaction Prediction

Pith reviewed 2026-06-28 03:19 UTC · model grok-4.3

classification 🧬 q-bio.BM
keywords AF_Cache · AlphaFold · protein-protein interaction · multiple sequence alignment · feature caching · sequence bucketing · high-throughput prediction · Nextflow pipeline

The pith

The AF_Cache pipeline roughly halves AlphaFold2 inference time for protein pairs and speeds up MSA generation by up to 13-fold through feature caching and sequence length bucketing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AF_Cache as a Nextflow-based pipeline that makes large-scale protein-protein interaction predictions feasible with AlphaFold2 and AlphaFold3. It tackles the main bottlenecks of repeated CPU-based multiple sequence alignment and model recompilations by adding GPU-accelerated MSA generation with MMseqs2, feature caching to skip redundant work, and length-based bucketing of sequences. On a test set of 5,050 human mitochondrial protein pairs the method delivers the reported speedups while the authors state that predictive performance remains unchanged. The result is a reusable workflow that supports high-throughput screening without repeating the full computational load for every pair.

Core claim

AF_Cache is a high-throughput pipeline that integrates GPU-accelerated MSA generation with MMseqs2, feature caching to eliminate redundant alignment computations, and sequence length bucketing to minimise repeated JAX compilations, yielding a ~2-fold reduction in AlphaFold2 inference time and up to a 13-fold speedup of MSA generation on 5,050 human mitochondrial protein pairs.

What carries the argument

Feature caching combined with sequence length bucketing inside a Nextflow workflow that calls MMseqs2 for MSA.
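
To make the bucketing idea concrete, here is a minimal Python sketch that groups pairs by padded length. The bucket width of 256 residues, the function names, and the toy lengths are illustrative assumptions, not values taken from the paper or from the AF_Cache code. The only point is that JAX recompiles when input shapes change, so padding to a small set of shared lengths bounds the number of compilations.

```python
from collections import defaultdict
from math import ceil

BUCKET_WIDTH = 256  # hypothetical bucket width in residues (assumption)


def padded_length(total_length: int, width: int = BUCKET_WIDTH) -> int:
    """Round a pair's combined length up to the next multiple of `width`."""
    if total_length <= 0:
        raise ValueError("total_length must be positive")
    return ceil(total_length / width) * width


def bucket_pairs(pair_lengths: dict, width: int = BUCKET_WIDTH) -> dict:
    """Group pair identifiers by the padded length they would be run at."""
    buckets = defaultdict(list)
    for pair_id, length in pair_lengths.items():
        buckets[padded_length(length, width)].append(pair_id)
    return dict(buckets)


# Toy input: combined residue counts for five hypothetical pairs.
toy_lengths = {"A-B": 310, "A-C": 498, "B-C": 512, "C-D": 700, "D-E": 513}
toy_buckets = bucket_pairs(toy_lengths)

# Without bucketing, every distinct length is a distinct input shape.
distinct_shapes_unbucketed = len(set(toy_lengths.values()))
# With bucketing, only one shape per bucket needs to be compiled.
distinct_shapes_bucketed = len(toy_buckets)
```

In this toy case five distinct input lengths collapse into two padded shapes. The trade-off, under these assumptions, is that wider buckets mean fewer compilations but more wasted computation on padding.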

If this is right

  • Large-scale screening of thousands of protein pairs becomes practical on modest hardware.
  • The same caching and bucketing steps apply to both AlphaFold2 and AlphaFold3 workflows.
  • Researchers gain a reusable Nextflow template for deploying structure-based PPI prediction at scale.
  • Redundant MSA generation across overlapping pairs is avoided without altering the final models.
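
The saving from avoiding redundant MSA generation can be illustrated with a small Python sketch. It assumes, hypothetically, an all-against-all screen over 100 proteins including self-pairs, which gives 100·101/2 = 5,050 pairs and so matches the size of the benchmark set; the text here does not state how the 5,050 pairs were constructed. The `compute_msa` stand-in, the toy sequence strings, and the in-memory cache are illustrative and are not the AF_Cache implementation.

```python
import hashlib
from itertools import combinations_with_replacement

msa_calls = 0  # counts how often the expensive step actually runs


def compute_msa(sequence: str) -> str:
    """Stand-in for an expensive MSA search (hypothetical placeholder)."""
    global msa_calls
    msa_calls += 1
    return f"msa-for-{sequence}"


feature_cache = {}  # maps sequence hash -> cached result


def cached_msa(sequence: str) -> str:
    """Return the MSA for `sequence`, computing it only on first use."""
    key = hashlib.sha256(sequence.encode()).hexdigest()
    if key not in feature_cache:
        feature_cache[key] = compute_msa(sequence)
    return feature_cache[key]


# 100 distinct toy strings standing in for protein sequences (assumption).
proteins = [f"SEQ{i:03d}" for i in range(100)]
# Assumed screen design: all-against-all including self-pairs.
pairs = list(combinations_with_replacement(proteins, 2))

for first, second in pairs:
    cached_msa(first)
    cached_msa(second)

n_pairs = len(pairs)
uncached_calls = 2 * n_pairs  # one search per chain per pair, no reuse
cached_calls = msa_calls      # searches actually performed with the cache
```

Under these assumptions, a per-pair workflow would run 10,100 chain-level searches, while the cached workflow runs one per unique protein. Keying the cache on a hash of the sequence rather than on a protein name means identical sequences under different identifiers would also share an entry.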

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same caching pattern could reduce cost in other alignment-heavy bioinformatics pipelines.
  • If accuracy holds, the approach would support proteome-wide interaction maps that are currently too expensive.
  • Speed gains may allow tighter integration with downstream tools such as docking or dynamics simulations.

Load-bearing premise

Caching alignments and bucketing sequences by length leave the AlphaFold interaction scores and structures unchanged.

What would settle it

Direct comparison of interaction scores or predicted structures between standard AlphaFold runs and AF_Cache runs on the same 5,050 mitochondrial pairs that reveals any consistent difference.
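
A check of this kind could start as a paired comparison of per-pair scores. The Python sketch below assumes each run has already been reduced to a mapping from pair identifier to ipTM; the score values, the tolerance of 0.05, and the function name are made up for illustration and do not come from the paper.

```python
from statistics import mean


def compare_scores(baseline: dict, cached: dict, tolerance: float = 0.05) -> dict:
    """Summarise per-pair score differences between two runs."""
    shared = sorted(baseline.keys() & cached.keys())
    if not shared:
        raise ValueError("no pairs in common between the two runs")
    diffs = [cached[p] - baseline[p] for p in shared]
    return {
        "n_pairs": len(shared),
        # A consistent shift shows up as a non-zero mean difference.
        "mean_diff": mean(diffs),
        "max_abs_diff": max(abs(d) for d in diffs),
        # Isolated discrepancies show up as pairs beyond the tolerance.
        "outliers": [p for p, d in zip(shared, diffs) if abs(d) > tolerance],
    }


# Invented ipTM values for three hypothetical pairs.
baseline_iptm = {"A-B": 0.82, "A-C": 0.31, "B-C": 0.55}
cached_iptm = {"A-B": 0.81, "A-C": 0.31, "B-C": 0.67}
summary = compare_scores(baseline_iptm, cached_iptm)
```

A real comparison would also need to account for run-to-run variation in the baseline itself, for example by comparing two baseline runs with different seeds, before attributing any difference to caching or bucketing.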

Figures

Figures reproduced from arXiv: 2606.04566 by Arne Elofsson, Claudio Mirabello, Sarah Narrowe.

Figure 1: Overview of the AF_Cache and AlphaFold pipelines. The input to the pipeline is a directory containing the input FASTA files. It is possible to either run all-against-all (the default) or use an input file. The next step of the pipeline is to generate MSAs for all proteins simultaneously using MMseqs2-GPU. Then the pairs are divided into size buckets and padded to a common size. Next, each pair is executed …
Figure 2: AF_Cache speeds up AF2 and AF3 and achieves similar structural scores on pairs with shared PDB entries. a) Runtime breakdown, here the time spent on MSAs, other prediction tasks and prediction (inference) time is separated for each pipeline. 8, 16, or 32 cores were used to generate the MSAs for the default pipeline, but here we have taken the total core time and divided it by 128, representing 128 cores, s…
Abstract

Motivation: Accurate prediction of protein-protein interactions is essential for understanding biological processes, and recent advances such as AlphaFold2 and AlphaFold3 have enabled structure-based interaction prediction at unprecedented accuracy. However, the high computational cost of these methods, driven primarily by CPU-based repeated multiple sequence alignment (MSA) generation and, for AlphaFold2, repeated model recompilations, limits their applicability in large-scale, high-throughput settings. This creates a need for efficient pipelines that retain predictive performance while substantially reducing runtime.

Results: We present AF_Cache, a high-throughput Nextflow pipeline for accelerating protein-protein interaction prediction using AlphaFold2 and AlphaFold3. AF_Cache combines GPU-accelerated MSA generation with MMseqs2, feature caching to eliminate redundant alignment computations, and sequence length bucketing to minimise repeated JAX compilations. Benchmarking on a dataset of 5,050 human mitochondrial protein pairs demonstrates a ~2-fold reduction in inference time for AlphaFold2 and up to a 13-fold speedup of the MSA generation. AF_Cache enables efficient large-scale interaction screening and provides a practical framework for deploying AlphaFold-based methods in high-throughput applications.

Availability and implementation: The code and Nextflow pipeline are available on GitHub: https://github.com/clami66/AF_cache. The code for reproducing the results of the paper, the MSAs, and the predicted models can be found at Zenodo: https://zenodo.org/records/20478892

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents AF_Cache, a Nextflow pipeline for high-throughput protein-protein interaction prediction with AlphaFold2 and AlphaFold3. It combines GPU-accelerated MSA generation using MMseqs2, feature caching to eliminate redundant computations, and sequence length bucketing to reduce JAX recompilations. Benchmarking on 5,050 human mitochondrial protein pairs reports a ~2-fold reduction in AlphaFold2 inference time and up to 13-fold speedup in MSA generation, with the claim that predictive performance is retained. The pipeline, reproduction code, MSAs, and models are publicly available on GitHub and Zenodo.

Significance. If the optimizations preserve original AlphaFold outputs, the work supplies a practical, reproducible framework for large-scale PPI screening that directly targets the dominant CPU and compilation bottlenecks. The explicit release of code, data, and models is a clear strength that enables independent verification and adoption.

major comments (1)
  1. [Abstract] Abstract (Results paragraph): the claim that AF_Cache 'retain[s] predictive performance' is unsupported by any reported metrics. No ipTM, pLDDT, RMSD, DockQ, or other equivalence statistics are provided comparing the baseline AlphaFold pipeline against the feature-cached and length-bucketed versions on the 5,050-pair set. Because feature caching assumes bit-identical MMseqs2 outputs and bucketing can alter padding/compilation paths, this verification is load-bearing for the central claim that the measured speedups are usable without accuracy loss.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the need for explicit verification of retained predictive performance. We agree this is a substantive point and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract (Results paragraph): the claim that AF_Cache 'retain[s] predictive performance' is unsupported by any reported metrics. No ipTM, pLDDT, RMSD, DockQ, or other equivalence statistics are provided comparing the baseline AlphaFold pipeline against the feature-cached and length-bucketed versions on the 5,050-pair set. Because feature caching assumes bit-identical MMseqs2 outputs and bucketing can alter padding/compilation paths, this verification is load-bearing for the central claim that the measured speedups are usable without accuracy loss.

    Authors: We agree that the claim requires supporting metrics and that the assumptions around bit-identical caching and bucketing do not substitute for empirical verification. In the revised manuscript we will add a dedicated results subsection (and associated supplementary table) that reports ipTM, pLDDT, and, where relevant, DockQ or RMSD values on a representative random subset of the 5,050 pairs, directly comparing the baseline AlphaFold2/3 pipeline against the cached and bucketed versions. The abstract will be updated to reference these equivalence results. We will also make the comparison scripts available in the reproduction repository. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical runtime claims rest on direct measurements

full rationale

The paper is an engineering description of a Nextflow pipeline implementing feature caching and length bucketing for AlphaFold. All reported results are direct wall-clock timings on a fixed 5,050-pair mitochondrial dataset; no equations, fitted parameters, predictions derived from first principles, or uniqueness theorems appear. The preservation-of-accuracy premise is an unverified assumption (correctness issue) but is not part of any derivation chain that reduces to the inputs by construction. No self-citations are load-bearing for any mathematical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering and benchmarking paper; it introduces no free parameters, mathematical axioms, or new postulated entities.


