
arxiv: 2605.06644 · v2 · submitted 2026-05-07 · 💻 cs.LG

Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction

Pith reviewed 2026-05-12 04:34 UTC · model grok-4.3

classification 💻 cs.LG
keywords fluorescent proteins · quantum yield · 3D mechanism graphs · chromophore · signal propagation · structure-based prediction · machine learning · protein engineering

The pith

A chromophore-centred 3D graph model captures local signal propagation to predict fluorescent protein quantum yield better than sequence models, with largest gains for remote homologs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fluorescent protein quantum yield is governed by the mature chromophore and its immediate three-dimensional microenvironment rather than by sequence identity alone. The paper introduces a method that converts PDB structures into typed 3D residue graphs, registers them to the mature chromophore state, partitions the chromophore into phenolate, bridge and imidazolinone regions, and propagates signals along contact channels. The resulting 52 non-identity features feed band-specific ExtraTrees regression and outperform the protein language models ESM-C and SaProt under random cross-validation on a 531-protein benchmark. The advantage is clearest in the remote bucket (below 50 percent similarity to the training set), and the same features recover band-specific mechanisms such as aromatic packing in GFP-like proteins or charge/clamp balance in Red proteins. Interpretability is intrinsic because each feature directly encodes a contact channel, a seed signal and a target region.

Core claim

The paper claims that edge-specific signal propagation on mature chromophore-region 3D mechanism graphs supplies 52 non-identity features that are sufficient for band-specific ExtraTrees regression. The graphs are obtained by converting PDB structures to typed residue graphs, registering them to the mature-CRO state, partitioning the chromophore into regions, and applying channel-signal-region propagation. With these features the model reaches R = 0.772 and MAE = 0.131 under random cross-validation, exceeds the sequence baselines, ranks first in bright screening, and keeps its lead in the remote-homology bucket. The stably selected features also recover band-specific mechanisms.

What carries the argument

The chromophore-centred mechanism graph that encodes typed 3D residue contacts as channels carrying seed signals to partitioned target regions in the mature chromophore.
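The idea of channels carrying seed signals to target regions can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: the residue names, seed values and the one-hop summation rule are all assumptions made for illustration. Only the channel names (steric, hydrophobic) and the three region names come from the text on this page.

```python
from collections import defaultdict

# Hypothetical toy inputs; residue names, edges and values are illustrative.
# Each edge: (contacting residue, chromophore region it touches, channel type).
edges = [
    ("res1", "phenolate", "steric"),
    ("res2", "phenolate", "hydrophobic"),
    ("res3", "imidazolinone", "steric"),
    ("res4", "bridge", "hydrophobic"),
    ("res2", "bridge", "hydrophobic"),
]

# Seed signals attached to each residue (made-up numbers).
seed = {
    "res1": {"aromatic": 1.0, "charge": 0.0},
    "res2": {"aromatic": 1.0, "charge": 0.0},
    "res3": {"aromatic": 0.0, "charge": 1.0},
    "res4": {"aromatic": 0.0, "charge": 0.0},
}


def propagate(edges, seed):
    """Sum every seed signal over the edges, keyed by (channel, signal, region)."""
    features = defaultdict(float)
    for residue, region, channel in edges:
        for signal, value in seed[residue].items():
            features[(channel, signal, region)] += value
    return dict(features)


features = propagate(edges, seed)
for key in sorted(features):
    print(key, features[key])
```

Each key names a channel, a seed signal and a target region, which is why a feature of this shape can be read directly without post-hoc attribution.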

If this is right

  • The largest performance margin appears in the remote-homology bucket below 50 percent sequence similarity.
  • The model achieves the highest Bright P@5 of 0.704 among tested methods.
  • Stable features recover distinct mechanisms for GFP-like, Red and Far-red bands without post-hoc analysis.
  • Removing identity shortcuts leaves 52 features that still support the reported accuracy.
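For readers unfamiliar with the screening metric, precision at K can be sketched as follows. The brightness threshold and the example values are assumptions; the page does not state how the paper defines a bright protein.

```python
def precision_at_k(y_true, y_pred, k, bright_threshold):
    """Fraction of the k highest-predicted items whose true value is bright."""
    if k <= 0 or k > len(y_pred):
        raise ValueError("k must be between 1 and the number of items")
    # Rank item indices by predicted value, highest first.
    ranked = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if y_true[i] >= bright_threshold)
    return hits / k


# Made-up quantum yields and predictions, with a hypothetical threshold of 0.6.
y_true = [0.9, 0.2, 0.7, 0.1, 0.8, 0.3]
y_pred = [0.8, 0.6, 0.7, 0.2, 0.4, 0.1]
p_at_3 = precision_at_k(y_true, y_pred, k=3, bright_threshold=0.6)
print(p_at_3)
```

A dark-screening variant would rank by ascending prediction and count items below a dark threshold instead.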

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph construction could be tested on other photophysical properties controlled by local microenvironments, such as photostability or emission wavelength shifts.
  • If the channel-based propagation model is accurate, targeted mutation of high-importance contact residues should produce predictable changes in measured quantum yield.
  • Adding explicit dynamics or solvent terms to the 3D graphs might further improve accuracy for flexible chromophore environments.
  • The method offers a route to structure-guided design that reduces dependence on evolutionary sequence patterns.

Load-bearing premise

Converting PDB structures into typed 3D residue graphs registered to a mature chromophore state and partitioned into chromophore regions produces features that reflect genuine physical signal propagation without introducing artifacts or selection bias.

What would settle it

Randomize the chromophore-region partition labels while leaving all contact channels and node features intact, then retrain and re-evaluate. If regression performance is essentially unchanged, the region-specific propagation step is not responsible for the reported gains. If performance instead collapses to the level of the sequence-only baselines, the region-specific step is what carries them.
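The feature-side half of this control could look like the sketch below. It only shows how region labels might be permuted while channels and seed values stay fixed; the edge tuples and helper names are hypothetical, and retraining the regressor on the shuffled features is left out.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy edges: (residue, region, channel, seed value).
edges = [
    ("res1", "phenolate", "steric", 1.0),
    ("res2", "phenolate", "hydrophobic", 0.5),
    ("res3", "bridge", "hydrophobic", 2.0),
    ("res4", "imidazolinone", "steric", 1.5),
    ("res5", "imidazolinone", "hydrophobic", 0.25),
]


def region_features(edges):
    """Sum seed values per (channel, region) pair."""
    feats = defaultdict(float)
    for _, region, channel, value in edges:
        feats[(channel, region)] += value
    return dict(feats)


def shuffle_regions(edges, rng):
    """Permute region labels across edges; channels and values are untouched."""
    regions = [edge[1] for edge in edges]
    rng.shuffle(regions)
    return [(res, new, ch, val) for (res, _, ch, val), new in zip(edges, regions)]


def channel_totals(feats):
    """Collapse the region axis to check that per-channel mass is preserved."""
    totals = defaultdict(float)
    for (channel, _), value in feats.items():
        totals[channel] += value
    return dict(totals)


rng = random.Random(0)  # fixed seed so the control is reproducible
shuffled = shuffle_regions(edges, rng)
original_feats = region_features(edges)
control_feats = region_features(shuffled)
print(channel_totals(original_feats), channel_totals(control_feats))
```

Because the per-channel totals are identical before and after shuffling, any change in downstream accuracy can be attributed to the region assignment rather than to the amount of signal.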

Figures

Figures reproduced from arXiv: 2605.06644 by Steven Aw Yoong Kit, Swee Keong Yeap, Yuchen Xiong.

Figure 1. Clean overview of the proposed algorithm. A PDB structure is converted into a typed 3D residue graph, registered to a mature chromophore state, partitioned into functional CRO regions, transformed by edge-specific signal propagation, filtered to remove identity shortcuts and routed to a band-specific predictor.
Adjacent text: Mature-state chromophore registration. The immature precursor is represented as c_i^(0) = triad…
Figure 2. Prediction strength across random and homology-controlled regimes. The mechanism graph model is best under random CV and strongest in the most remote-homology bucket.
Adjacent text: … labelled neighbours. In the <50% similarity bucket, the proposed method produced the best bright precision at K = 10, 15, 20 and 25, and the best dark precision at all tested K values from 5 to 25 (…
Adjacent text: … Bucket membership was determined by each test protein's maximum 5-mer Jaccard similarity to the fixed training set, not by random fold assignment. In the 0.70–0.85 bucket, the method reached R = 0.756, outperforming Band mean (0.643), ESM-C (0.672) and SaProt (0.701). In the 0.50–0.70 bucket, it reached R = 0.824, essentially matching Band mean (0.830) while clearly exceeding ESM-C (0.626) and SaProt (…
Figure 3. Top-K retrieval frontiers for bright and dark proteins. Left: random CV. Right: the most remote fixed-split bucket, where each held-out protein has maximum 5-mer Jaccard similarity m_k < 0.50 to the fixed training set. The embedded panel label "Remote homology (< 50%)" is a compact label for this J_5 < 0.50 bucket and does not denote pairwise sequence identity. The mechanism graph model provides the strongest overall screening behaviour in the remote-homology setting.
Adjacent text: … matches crystallographic and photophysical work on far-red proteins showing that chromophore isomerization, planarity and steric restriction of torsional relaxation are central …
Adjacent text: … together with clamp asymmetry. This agrees with…
Figure 4. Stable selected mechanism descriptors across seeds and folds. The three columns correspond to the GFP-like, Red and Far-red band-specific models. Each bubble denotes a channel–signal–region descriptor or a local clamp descriptor that repeatedly appeared among the top-10 selected features across five seeds and five folds; bubble area is proportional to recurrence. Colours denote feature families, not propagation channels. The activated propagation channels are steric and hydrophobic; feature families describe the physicochemical seed signal or clamp descriptor carried by those channels. The label clamp asymmetry …
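The Figure 2 and Figure 3 text defines bucket membership by each test protein's maximum 5-mer Jaccard similarity to the training set. A minimal sketch of that computation follows. The bucket edges 0.50, 0.70 and 0.85 appear on this page; the boundary inclusivity, the top bucket label and the toy sequences are assumptions.

```python
def kmers(seq, k=5):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_similarity(test_seq, train_seqs, k=5):
    """Maximum k-mer Jaccard similarity of one test sequence to the training set."""
    test_set = kmers(test_seq, k)
    return max(jaccard(test_set, kmers(s, k)) for s in train_seqs)


def bucket(m):
    """Map a similarity value to a bucket label (boundary handling is assumed)."""
    if m < 0.50:
        return "<0.50"
    if m < 0.70:
        return "0.50-0.70"
    if m < 0.85:
        return "0.70-0.85"
    return ">=0.85"


# Toy sequences, not real proteins.
train = ["ABCDEXX", "QQQQQQQ"]
m = max_similarity("ABCDEFG", train)
print(m, bucket(m))
```

Because the similarity is taken against a fixed training set, a protein's bucket does not depend on random fold assignment.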
Original abstract

Fluorescent protein quantum yield (QY) is governed by the mature chromophore and its three-dimensional microenvironment rather than sequence identity alone. Protein language models and emission-band averages capture global trends, but do not model how local physical signals act on specific chromophore regions. We present a chromophore-centred mechanism graph algorithm for QY prediction. Each PDB structure is converted into a typed 3D residue graph, registered to a mature-CRO state, partitioned into phenolate, bridge and imidazolinone regions, and transformed by channel-signal-region propagation. The representation contains 121 enrichment features; after removing identity shortcuts, 52 non-identity features are used for band-specific ExtraTrees regression. Because each feature encodes a contact channel, seed signal and target CRO region, interpretation is intrinsic rather than post hoc. On a 531-protein benchmark, the method achieved the best random-CV performance among model-based baselines (R = 0.772 +/- 0.008, MAE = 0.131 +/- 0.002), exceeding Band mean (R = 0.632), ESM-C (R = 0.734) and SaProt (R = 0.731), and ranked first in bright screening (Bright P@5 = 0.704). Under homology control, the advantage was clearest in the remote bucket (<50% similarity; R = 0.697 versus 0.633, 0.575 and 0.408), with the strongest overall bright/dark Top-K screening. Stable selected features recovered band-specific mechanisms: aromatic packing and clamp asymmetry in GFP-like proteins, charge/clamp balance in Red proteins, and flexibility-risk/bulky-contact features in Far-red proteins. Source code, feature tables and evaluation scripts are available from the first author upon request. Contact: yuchenak05@gmail.com
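The two headline metrics and the simplest baseline can be written in a few lines. This sketch assumes that "Band mean" predicts the mean training quantum yield of a protein's emission band, which matches the abstract's phrase "emission-band averages" but is not spelled out on this page. All values are made up.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def band_mean_predict(train, test_bands):
    """Predict each test protein's QY as the mean training QY of its band."""
    sums, counts = {}, {}
    for band, qy in train:
        sums[band] = sums.get(band, 0.0) + qy
        counts[band] = counts.get(band, 0) + 1
    means = {band: sums[band] / counts[band] for band in sums}
    return [means[band] for band in test_bands]


# Made-up (band, QY) training pairs and a small test set.
train = [("GFP-like", 0.6), ("GFP-like", 0.8), ("Red", 0.2), ("Red", 0.4), ("Far-red", 0.1)]
test_bands = ["GFP-like", "Red", "Far-red", "GFP-like"]
y_true = [0.75, 0.25, 0.15, 0.65]

y_pred = band_mean_predict(train, test_bands)
print(pearson_r(y_true, y_pred), mae(y_true, y_pred))
```

A baseline of this kind can score a high R across bands while saying nothing about variation within a band, which is the gap the structural features are meant to fill.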

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a chromophore-centred mechanism graph algorithm for predicting fluorescent protein quantum yields from PDB structures. Each structure is converted to a typed 3D residue graph, registered to a mature-CRO state, partitioned into phenolate/bridge/imidazolinone regions, and transformed via channel-signal-region propagation to produce 121 enrichment features; after removing identity shortcuts this yields 52 non-identity features for band-specific ExtraTrees regression. On a 531-protein benchmark the method reports the highest random-CV performance (R = 0.772 +/- 0.008, MAE = 0.131 +/- 0.002) among model-based baselines, outperforming Band mean, ESM-C and SaProt, with the largest advantage in the sequence-remote bucket (<50% similarity, R = 0.697) and strongest bright/dark Top-K screening; selected features recover band-specific mechanisms.

Significance. If the reported gains are shown to arise from genuine physical signal propagation rather than structural-homology leakage, the work would establish that explicit 3D contact-channel graphs can outperform sequence language models for QY prediction, especially for remote homologs. The intrinsic interpretability of each feature (contact channel + seed signal + target CRO region) and the provision of code, feature tables and evaluation scripts upon request are clear strengths that support reproducibility and mechanistic insight for fluorescent-protein engineering.

major comments (2)
  1. [Homology control results] Homology-control section: the remote bucket is defined exclusively by sequence identity <50%. Because the model inputs are 3D residue graphs whose edges encode contacts and channels, structural similarity metrics (TM-score, Dali Z-score or equivalent) must be reported for proteins in this bucket. Sequence <50% frequently permits conserved folds and similar chromophore microenvironments, which could allow the 52 features to match training examples and inflate the reported R = 0.697 advantage over baselines.
  2. [Methods / Feature derivation] Feature-selection paragraph: the reduction from 121 to 52 non-identity features by 'removing identity shortcuts' is described without stating whether the selection criteria and threshold were fixed a priori, performed inside each CV fold, or applied to the full labelled set. If the latter, this data-driven step risks circularity that undermines the cross-validation claims.
minor comments (2)
  1. [Abstract] Abstract and results: clarify whether 'Band mean' is treated as a model-based baseline or a simple statistical baseline when claiming 'best ... among model-based baselines'.
  2. [Data and code availability] Reproducibility statement: while code and scripts are offered upon request, public deposition (e.g., GitHub/Zenodo) would remove the barrier to independent verification.
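The stratified report requested in major comment 1 could be produced along these lines once structural similarity scores exist. This sketch assumes the TM-scores were computed beforehand by an external tool; the record layout, the 0.5 cut-off and all numbers are illustrative.

```python
def stratified_mae(records, threshold=0.5):
    """Mean absolute error below and at-or-above a TM-score threshold.

    records: iterable of (tm_score, y_true, y_pred) tuples, where tm_score is
    the precomputed score against the nearest training-set structure.
    """
    groups = {"below": [], "at_or_above": []}
    for tm_score, y_true, y_pred in records:
        key = "below" if tm_score < threshold else "at_or_above"
        groups[key].append(abs(y_true - y_pred))
    # An empty stratum yields None rather than a misleading zero.
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}


# Made-up remote-bucket records: (TM-score, measured QY, predicted QY).
records = [
    (0.42, 0.60, 0.50),
    (0.48, 0.30, 0.45),
    (0.81, 0.70, 0.68),
    (0.93, 0.20, 0.22),
]
report = stratified_mae(records)
print(report)
```

If the error in the low-TM-score stratum is much larger than in the high one, the remote-bucket advantage would be better explained by fold conservation than by transferable local physics.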

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive and detailed review. We appreciate the focus on strengthening the homology controls and clarifying the feature derivation process. We address each major comment below and will incorporate the suggested revisions to improve the manuscript.

  1. Referee: [Homology control results] Homology-control section: the remote bucket is defined exclusively by sequence identity <50%. Because the model inputs are 3D residue graphs whose edges encode contacts and channels, structural similarity metrics (TM-score, Dali Z-score or equivalent) must be reported for proteins in this bucket. Sequence <50% frequently permits conserved folds and similar chromophore microenvironments, which could allow the 52 features to match training examples and inflate the reported R = 0.697 advantage over baselines.

    Authors: We agree that structural similarity metrics are necessary to rule out fold-level leakage in the remote-homology bucket. In the revised manuscript we will compute TM-scores (via TM-align) between every remote-bucket test protein and its nearest training-set neighbor (by sequence identity). We will report the distribution of these TM-scores together with performance stratified by TM-score thresholds (e.g., TM-score < 0.5). This addition will allow readers to assess whether the reported R = 0.697 advantage persists for structurally divergent proteins and will directly address the concern that conserved chromophore microenvironments may be driving the results. revision: yes

  2. Referee: [Methods / Feature derivation] Feature-selection paragraph: the reduction from 121 to 52 non-identity features by 'removing identity shortcuts' is described without stating whether the selection criteria and threshold were fixed a priori, performed inside each CV fold, or applied to the full labelled set. If the latter, this data-driven step risks circularity that undermines the cross-validation claims.

    Authors: The identity-shortcut removal follows a fixed, a priori rule defined during mechanism-graph construction: any feature whose propagation path connects a residue to itself without involving a chromophore-region channel is excluded. This deterministic criterion was applied independently inside each cross-validation training fold using only training data. We will revise the Methods section to state this procedure explicitly, including the precise definition of identity shortcuts and confirmation that selection never used test-set information. This clarification will remove any ambiguity regarding circularity. revision: yes
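The rule described in the second response depends only on how a feature is constructed, not on any label. A minimal sketch of such a label-free filter is shown below; the `Feature` fields and feature names are hypothetical stand-ins for whatever metadata the authors' mechanism graph records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    self_path: bool       # propagation path connects a residue to itself
    region_channel: bool  # path involves a chromophore-region channel


def is_identity_shortcut(feature):
    """Rule from the rebuttal: self-path without a chromophore-region channel."""
    return feature.self_path and not feature.region_channel


def keep_features(features):
    """Names of features that survive the identity-shortcut filter."""
    return [f.name for f in features if not is_identity_shortcut(f)]


def kfold_indices(n, k):
    """Yield (train, test) index lists for a simple interleaved k-fold split."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# Hypothetical feature metadata.
features = [
    Feature("steric_aromatic_phenolate", self_path=False, region_channel=True),
    Feature("residue_identity_flag", self_path=True, region_channel=False),
    Feature("hydrophobic_charge_bridge", self_path=False, region_channel=True),
    Feature("self_loop_via_region", self_path=True, region_channel=True),
]

# The filter never reads labels or sample indices, so every fold selects the same set.
selected_per_fold = [keep_features(features) for _train, _test in kfold_indices(10, 5)]
print(selected_per_fold[0])
```

A rule of this kind cannot leak test labels by construction. A data-driven threshold, by contrast, would have to be refit on each training fold to support the same claim.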

Circularity Check

0 steps flagged

No significant circularity; feature engineering and CV evaluation remain independent of target labels


The paper constructs typed 3D residue graphs from PDB structures, registers them to a mature chromophore state, partitions regions, and computes 121 enrichment features (later reduced to 52 non-identity features) that encode contact channels, seed signals, and target regions. These fixed structural descriptors are then fed to ExtraTrees regression. Performance is measured via random CV and sequence-homology buckets on a 531-protein benchmark, with no evidence that the regression outputs or selected features are algebraically equivalent to the QY labels by definition. No load-bearing self-citations, uniqueness theorems, or fitted-input-as-prediction steps appear in the derivation; the central claim rests on empirical generalization from pre-specified graph-derived features rather than tautological reduction to the training targets.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The method rests on standard structural biology inputs and introduces new algorithmic representations for signal propagation; no major new physical entities are postulated beyond the graph construction itself.

free parameters (2)
  • Selection threshold for 52 non-identity features
    Chosen after removing identity shortcuts; exact selection criterion and whether it was tuned on the benchmark are not stated in the abstract.
  • ExtraTrees hyperparameters
    Model parameters for regression not specified; typical for tree ensembles but still data-influenced.
axioms (2)
  • [domain assumption] PDB structures provide accurate 3D coordinates for residue contacts around the chromophore.
    Invoked when converting structures into typed 3D residue graphs.
  • [domain assumption] Partitioning the chromophore into phenolate, bridge, and imidazolinone regions is physically meaningful for signal propagation.
    Central to the mechanism-graph construction and feature encoding.
invented entities (2)
  • Mature chromophore-region 3D mechanism graph [no independent evidence]
    purpose: To represent local physical signals acting on specific chromophore sub-regions.
    New representation introduced by the algorithm.
  • Channel-signal-region propagation [no independent evidence]
    purpose: To generate enrichment features that encode contact channels, seed signals, and target CRO regions.
    Core step for creating the 121 (then 52) features.

pith-pipeline@v0.9.0 · 5655 in / 1889 out tokens · 84976 ms · 2026-05-12T04:34:09.614840+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  1. Tsien,R.Y. (1998) The green fluorescent protein. Annu. Rev. Biochem., 67, 509–544.
  2. Shaner,N.C., Steinbach,P.A. and Tsien,R.Y. (2005) A guide to choosing fluorescent proteins. Nat. Methods, 2, 905–909.
  3. Piatkevich,K.D. and Verkhusha,V.V. (2010) Advances in engineering of fluorescent proteins and photoactivatable proteins with red emission. Curr. Opin. Chem. Biol., 14, 23–29.
  4. Rives,A. et al. (2021) Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA, 118, e2016239118.
  5. Geurts,P., Ernst,D. and Wehenkel,L. (2006) Extremely randomized trees. Mach. Learn., 63, 3–42.
  6. Jumper,J. et al. (2021) Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589.
  7. Chalfie,M., Tu,Y., Euskirchen,G., Ward,W.W. and Prasher,D.C. (1994) Green fluorescent protein as a marker for gene expression. Science, 263, 802–805.
  8. Ormö,M., Cubitt,A.B., Kallio,K., Gross,L.A., Tsien,R.Y. and Remington,S.J. (1996) Crystal structure of the Aequorea victoria green fluorescent protein. Science, 273, 1392–1395.
  9. Wall,M.A., Socolich,M. and Ranganathan,R. (2000) The structural basis for red fluorescence in the tetrameric GFP homolog DsRed. Nat. Struct. Biol., 7, 1133–1138.
  10. Ferreira,J.R.M., Rodrigues,J.V., Silva,A.M.S. and da Silva,J.C.G.E. (2022) Locking the GFP fluorophore to enhance its emission intensity. Molecules, 28, 234.
  11. Follenius-Wund,A., Bourotte,M., Schmitt,M., Iyice,F., Lami,H., Bourguignon,J.-J., Haiech,J. and Pigault,C. (2003) Fluorescent derivatives of the GFP chromophore give a new insight into the GFP fluorescence process. Biophys. J., 85, 1839–1850.
  12. Park,J.W. and Rhee,Y.M. (2016) Electric field keeps chromophore planar and produces high yield fluorescence in green fluorescent protein. J. Am. Chem. Soc., 138, 13619–13629.
  13. Drobizhev,M., Molina,R.S., Callis,P.R., Scott,J.N., Lambert,G.G., Salih,A., Shaner,N.C. and Hughes,T.E. (2021) Local electric field controls fluorescence quantum yield of red and far-red fluorescent proteins. Front. Mol. Biosci., 8, 633217.
  14. Bindels,D.S., Haarbosch,L., van Weeren,L., Postma,M., Wiese,K.E., Mastop,M., Aumonier,S., Gotthard,G., Royant,A., Hink,M.A. and Gadella,T.W.J. (2017) mScarlet: a bright monomeric red fluorescent protein for cellular imaging. Nat. Methods, 14, 53–56.
  15. Legault,S., Fraser-Halberg,D.P., McAnelly,R.L., Eason,M.G., Thompson,M.C. and Chica,R.A. (2022) Generation of bright monomeric red fluorescent proteins via computational design of enhanced chromophore packing. Chem. Sci., 13, 1408–1418.
  16. Pletnev,S., Shcherbo,D., Chudakov,D.M., Pletneva,N., Merzlyak,E.M., Wlodawer,A., Dauter,Z. and Pletnev,V. (2008) A crystallographic study of bright far-red fluorescent protein mKate reveals pH-induced cis–trans isomerization of the chromophore. J. Biol. Chem., 283, 28980–28987.
  17. Petersen,J., Wilmann,P.G., Beddoe,T., Oakley,A.J., Devenish,R.J., Prescott,M. and Rossjohn,J. (2003) The 2.0-Å crystal structure of eqFP611, a far red fluorescent protein from the sea anemone Entacmaea quadricolor. J. Biol. Chem., 278, 44626–44631.
  18. Lambert,T.J. (2019) FPbase: a community-editable fluorescent protein database. Nat. Methods, 16, 277–278.
  19. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
  20. The OpenFold3 Team. (2025) OpenFold3-preview: a fully open-source biomolecular structure prediction model based on AlphaFold3. Zenodo. https://doi.org/10.5281/zenodo.19001000
  21. EvolutionaryScale Team. (2024) ESM Cambrian: revealing the mysteries of proteins with unsupervised learning.
  22. Su,J., Han,C., Zhou,Y., Shan,J., Zhou,X. and Yuan,F. (2024) SaProt: protein language modeling with structure-aware vocabulary. International Conference on Learning Representations.
  23. Pedregosa,F. et al. (2011) Scikit-learn: machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.
  24. Xie,Z., Zhang,P., Lin,Q., Zhang,Q. and Fan,Z. (2025) EM-PLA: environment-aware heterogeneous graph-based multimodal protein–ligand binding affinity prediction. Bioinformatics, 41, btaf298.
  25. Yang,J., Li,Z., Fan,X., Cheng,Y., Chu,Q. and Zhang,Q. (2022) Deep learning identifies explainable reasoning paths of mechanism of action for drug repurposing from multilayer biological network. Brief. Bioinform., 23, bbac469.
  26. Yang,J., Xu,Z., Wu,W.K.K., Chu,Q. and Zhang,Q. (2021) GraphSynergy: a network-inspired deep learning model for anticancer drug combination prediction. J. Am. Med. Inform. Assoc., 28, 2336–2345.