Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction
Pith reviewed 2026-05-12 04:34 UTC · model grok-4.3
The pith
A chromophore-centred 3D graph model captures local signal propagation to predict fluorescent protein quantum yield better than sequence models, with largest gains for remote homologs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that edge-specific signal propagation on mature chromophore-region 3D mechanism graphs supplies 52 non-identity features for band-specific ExtraTrees regression. The graphs are built by converting PDB structures to typed residue graphs, registering them to the mature-CRO state, partitioning them into chromophore regions, and applying channel-signal-region propagation. With these features the model reaches R = 0.772 and MAE = 0.131 on random cross-validation, exceeds sequence baselines, ranks first in bright screening, and stays ahead in the remote-homology bucket. Stable selected features are reported to recover band-specific mechanisms.
What carries the argument
The chromophore-centred mechanism graph that encodes typed 3D residue contacts as channels carrying seed signals to partitioned target regions in the mature chromophore.
If this is right
- The largest performance margin appears in the remote-homology bucket below 50 percent sequence similarity.
- The model achieves the highest Bright P@5 of 0.704 among tested methods.
- Stable features recover distinct mechanisms for GFP-like, Red and Far-red bands without post-hoc analysis.
- Removing identity shortcuts leaves 52 features that still support the reported accuracy.
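The Bright P@5 figure in the list above is a top-K screening metric. The abstract does not define it, so the following is only a minimal sketch under an assumed definition: the fraction of the K highest-predicted proteins whose measured quantum yield clears a brightness threshold. The function name, the threshold of 0.6 and the toy values are all hypothetical.

```python
from typing import Sequence


def precision_at_k(predicted: Sequence[float], measured: Sequence[float],
                   k: int = 5, bright_threshold: float = 0.6) -> float:
    """Fraction of the k highest-predicted proteins whose measured QY is bright.

    bright_threshold is an assumed cut-off, not a value taken from the paper.
    """
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    if not 0 < k <= len(predicted):
        raise ValueError("k must be between 1 and the number of proteins")
    # Rank protein indices by predicted QY, highest first.
    ranked = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if measured[i] >= bright_threshold)
    return hits / k


# Toy values for illustration only.
predicted_qy = [0.82, 0.15, 0.70, 0.66, 0.40, 0.91, 0.30]
measured_qy = [0.79, 0.10, 0.45, 0.72, 0.35, 0.88, 0.65]
score = precision_at_k(predicted_qy, measured_qy, k=5, bright_threshold=0.6)
print(f"Bright P@5 = {score:.3f}")
```

A reported value such as 0.704 would then presumably be an average of this quantity over folds or repeats, which is itself an assumption about the evaluation protocol.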
Where Pith is reading between the lines
- The same graph construction could be tested on other photophysical properties controlled by local microenvironments, such as photostability or emission wavelength shifts.
- If the channel-based propagation model is accurate, targeted mutation of high-importance contact residues should produce predictable changes in measured quantum yield.
- Adding explicit dynamics or solvent terms to the 3D graphs might further improve accuracy for flexible chromophore environments.
- The method offers a route to structure-guided design that reduces dependence on evolutionary sequence patterns.
Load-bearing premise
Converting PDB structures into typed 3D residue graphs registered to a mature chromophore state and partitioned into chromophore regions produces features that reflect genuine physical signal propagation without introducing artifacts or selection bias.
What would settle it
Randomize the chromophore-region partition labels while leaving all contact channels and node features intact, then refit the regression. If performance stays at the reported level, the region-specific propagation step is not responsible for the gains; if it collapses to the level of sequence-only baselines, the partition is carrying the signal.
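The label-shuffling control described above can be sketched on a toy graph. This is an illustration under stated assumptions, not the paper's implementation: the edge list, atom identifiers, channel names and the simple sum-per-(channel, region) feature are all hypothetical. The sketch shows only the shuffle itself; the actual test would refit the regression on the shuffled features.

```python
import random
from collections import defaultdict

# Hypothetical toy edges: (contact channel, seed signal strength, chromophore atom id).
EDGES = [
    ("hbond", 1.0, "a1"), ("hbond", 0.5, "a2"),
    ("aromatic", 0.8, "a1"), ("aromatic", 0.3, "a3"),
    ("charge", 0.6, "a4"), ("charge", 0.2, "a2"),
]
# Hypothetical atom-to-region assignment.
REGION_OF = {"a1": "phenolate", "a2": "bridge",
             "a3": "imidazolinone", "a4": "imidazolinone"}


def region_features(edges, region_of):
    """Sum seed signal per (channel, region) pair."""
    feats = defaultdict(float)
    for channel, signal, atom in edges:
        feats[(channel, region_of[atom])] += signal
    return dict(feats)


def shuffle_regions(region_of, seed):
    """Permute region labels across atoms; contacts and signals are untouched."""
    atoms = sorted(region_of)
    labels = [region_of[a] for a in atoms]
    random.Random(seed).shuffle(labels)
    return dict(zip(atoms, labels))


def channel_totals(feats):
    """Collapse the region axis, leaving total signal per channel."""
    totals = defaultdict(float)
    for (channel, _region), value in feats.items():
        totals[channel] += value
    return dict(totals)


original = region_features(EDGES, REGION_OF)
shuffled = region_features(EDGES, shuffle_regions(REGION_OF, seed=0))
print(channel_totals(original))
print(channel_totals(shuffled))
```

Because only the region labels move, the per-channel totals are identical before and after shuffling. Any change in downstream accuracy can therefore be attributed to the region assignment rather than to the contacts themselves.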
Original abstract
Fluorescent protein quantum yield (QY) is governed by the mature chromophore and its three-dimensional microenvironment rather than sequence identity alone. Protein language models and emission-band averages capture global trends, but do not model how local physical signals act on specific chromophore regions. We present a chromophore-centred mechanism graph algorithm for QY prediction. Each PDB structure is converted into a typed 3D residue graph, registered to a mature-CRO state, partitioned into phenolate, bridge and imidazolinone regions, and transformed by channel-signal-region propagation. The representation contains 121 enrichment features; after removing identity shortcuts, 52 non-identity features are used for band-specific ExtraTrees regression. Because each feature encodes a contact channel, seed signal and target CRO region, interpretation is intrinsic rather than post hoc. On a 531-protein benchmark, the method achieved the best random-CV performance among model-based baselines (R = 0.772 +/- 0.008, MAE = 0.131 +/- 0.002), exceeding Band mean (R = 0.632), ESM-C (R = 0.734) and SaProt (R = 0.731), and ranked first in bright screening (Bright P@5 = 0.704). Under homology control, the advantage was clearest in the remote bucket (<50% similarity; R = 0.697 versus 0.633, 0.575 and 0.408), with the strongest overall bright/dark Top-K screening. Stable selected features recovered band-specific mechanisms: aromatic packing and clamp asymmetry in GFP-like proteins, charge/clamp balance in Red proteins, and flexibility-risk/bulky-contact features in Far-red proteins. Source code, feature tables and evaluation scripts are available from the first author upon request. Contact: yuchenak05@gmail.com
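To make the channel, seed-signal and target-region naming concrete, here is a minimal sketch of how such features might be assembled. It assumes a simple distance cut-off for contacts and a plain sum as the aggregation; the paper's actual enrichment formula, channel vocabulary, signal definitions and cut-off are not given in the abstract, so every name, coordinate and number below is hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical chromophore atoms: id -> (region, xyz in angstroms).
CRO_ATOMS = {
    "p1": ("phenolate", (0.0, 0.0, 0.0)),
    "b1": ("bridge", (4.0, 0.0, 0.0)),
    "i1": ("imidazolinone", (8.0, 0.0, 0.0)),
}
# Hypothetical neighbouring residues: id -> (contact channel, seed signals, xyz).
RESIDUES = {
    "res_A": ("aromatic", {"bulk": 1.0, "charge": 0.0}, (0.0, 3.5, 0.0)),
    "res_B": ("charged", {"bulk": 0.8, "charge": 1.0}, (8.0, 3.0, 0.0)),
    "res_C": ("polar", {"bulk": 0.2, "charge": 0.0}, (1.0, 3.0, 0.0)),
    "res_D": ("hydrophobic", {"bulk": 0.7, "charge": 0.0}, (20.0, 0.0, 0.0)),
}
CUTOFF = 4.5  # assumed contact distance in angstroms


def channel_signal_region_features(residues, cro_atoms, cutoff=CUTOFF):
    """Sum each seed signal over residues contacting each chromophore region.

    Feature keys have the form "channel|signal|region".
    """
    feats = defaultdict(float)
    for channel, signals, xyz in residues.values():
        for region, atom_xyz in cro_atoms.values():
            if math.dist(xyz, atom_xyz) <= cutoff:
                for signal, value in signals.items():
                    feats[f"{channel}|{signal}|{region}"] += value
    return dict(feats)


features = channel_signal_region_features(RESIDUES, CRO_ATOMS)
for name in sorted(features):
    print(f"{name}: {features[name]:.2f}")
```

Each key names its own contact channel, signal and target region, which is the sense in which the abstract calls interpretation intrinsic rather than post hoc.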
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a chromophore-centred mechanism graph algorithm for predicting fluorescent protein quantum yields from PDB structures. Each structure is converted to a typed 3D residue graph, registered to a mature-CRO state, partitioned into phenolate/bridge/imidazolinone regions, and transformed via channel-signal-region propagation to produce 121 enrichment features; after removing identity shortcuts this yields 52 non-identity features for band-specific ExtraTrees regression. On a 531-protein benchmark the method reports the highest random-CV performance (R = 0.772 +/- 0.008, MAE = 0.131 +/- 0.002) among model-based baselines, outperforming Band mean, ESM-C and SaProt, with the largest advantage in the sequence-remote bucket (<50% similarity, R = 0.697) and strongest bright/dark Top-K screening; selected features recover band-specific mechanisms.
Significance. If the reported gains are shown to arise from genuine physical signal propagation rather than structural-homology leakage, the work would establish that explicit 3D contact-channel graphs can outperform sequence language models for QY prediction, especially for remote homologs. The intrinsic interpretability of each feature (contact channel + seed signal + target CRO region) and the provision of code, feature tables and evaluation scripts upon request are clear strengths that support reproducibility and mechanistic insight for fluorescent-protein engineering.
major comments (2)
- [Homology control results] Homology-control section: the remote bucket is defined exclusively by sequence identity <50%. Because the model inputs are 3D residue graphs whose edges encode contacts and channels, structural similarity metrics (TM-score, Dali Z-score or equivalent) must be reported for proteins in this bucket. Sequence <50% frequently permits conserved folds and similar chromophore microenvironments, which could allow the 52 features to match training examples and inflate the reported R = 0.697 advantage over baselines.
- [Methods / Feature derivation] Feature-selection paragraph: the reduction from 121 to 52 non-identity features by 'removing identity shortcuts' is described without stating whether the selection criteria and threshold were fixed a priori, performed inside each CV fold, or applied to the full labelled set. If the latter, this data-driven step risks circularity that undermines the cross-validation claims.
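The structural-similarity check requested in the first major comment could be organised as below. This is a sketch under assumptions: TM-scores to the nearest training structure are taken as precomputed inputs (for example from a structural aligner), and the record fields, the 0.5 TM-score cut and all numeric values are hypothetical toy data rather than results from the paper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mae(x, y):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def stratified_report(records, seq_cut=0.50, tm_cut=0.5):
    """Keep the sequence-remote bucket, then split it by structural similarity."""
    remote = [r for r in records if r["seq_identity"] < seq_cut]
    strata = {"divergent": [], "similar": []}
    for r in remote:
        strata["divergent" if r["tm_to_train"] < tm_cut else "similar"].append(r)
    out = {}
    for name, recs in strata.items():
        measured = [r["measured"] for r in recs]
        predicted = [r["predicted"] for r in recs]
        out[name] = {
            "n": len(recs),
            "r": pearson_r(measured, predicted) if len(recs) >= 3 else None,
            "mae": mae(measured, predicted) if recs else None,
        }
    return out


# Toy test-set records; every value is invented for illustration.
records = [
    {"seq_identity": 0.42, "tm_to_train": 0.45, "measured": 0.20, "predicted": 0.30},
    {"seq_identity": 0.38, "tm_to_train": 0.48, "measured": 0.40, "predicted": 0.40},
    {"seq_identity": 0.47, "tm_to_train": 0.41, "measured": 0.60, "predicted": 0.50},
    {"seq_identity": 0.45, "tm_to_train": 0.92, "measured": 0.30, "predicted": 0.35},
    {"seq_identity": 0.49, "tm_to_train": 0.88, "measured": 0.70, "predicted": 0.55},
    {"seq_identity": 0.30, "tm_to_train": 0.95, "measured": 0.50, "predicted": 0.60},
    {"seq_identity": 0.85, "tm_to_train": 0.99, "measured": 0.80, "predicted": 0.80},
]
summary = stratified_report(records)
print(summary)
```

If the advantage over baselines held in the divergent stratum, the fold-level leakage concern would be weakened; if it appeared only in the similar stratum, the concern would stand.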
minor comments (2)
- [Abstract] Abstract and results: clarify whether 'Band mean' is treated as a model-based baseline or a simple statistical baseline when claiming 'best ... among model-based baselines'.
- [Data and code availability] Reproducibility statement: while code and scripts are offered upon request, public deposition (e.g., GitHub/Zenodo) would remove the barrier to independent verification.
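On the first minor comment, a Band mean predictor is plausibly nothing more than a per-band average of training labels, which is why its status as a "model-based" baseline is worth clarifying. The sketch below assumes that reading; the fallback to a global mean for unseen bands, the "Cyan" band and the toy quantum yields are hypothetical.

```python
from collections import defaultdict


def fit_band_means(train):
    """Return per-band mean QY plus a global mean used for unseen bands."""
    by_band = defaultdict(list)
    for band, qy in train:
        by_band[band].append(qy)
    band_mean = {band: sum(vals) / len(vals) for band, vals in by_band.items()}
    global_mean = sum(qy for _, qy in train) / len(train)
    return band_mean, global_mean


def predict_band_mean(bands, band_mean, global_mean):
    """Predict each protein's QY as the training mean of its emission band."""
    return [band_mean.get(band, global_mean) for band in bands]


# Toy training labels: (emission band, measured QY). Values are invented.
train = [("GFP-like", 0.60), ("GFP-like", 0.80),
         ("Red", 0.30), ("Red", 0.50), ("Far-red", 0.10)]
band_mean, global_mean = fit_band_means(train)
preds = predict_band_mean(["GFP-like", "Red", "Far-red", "Cyan"],
                          band_mean, global_mean)
print(preds)
```

Under this reading the baseline has no learned parameters beyond three or so averages, so it behaves as a statistical reference point rather than a fitted model.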
Simulated Author's Rebuttal
Thank you for the constructive and detailed review. We appreciate the focus on strengthening the homology controls and clarifying the feature derivation process. We address each major comment below and will incorporate the suggested revisions to improve the manuscript.
- Referee: [Homology control results] Homology-control section: the remote bucket is defined exclusively by sequence identity <50%. Because the model inputs are 3D residue graphs whose edges encode contacts and channels, structural similarity metrics (TM-score, Dali Z-score or equivalent) must be reported for proteins in this bucket. Sequence <50% frequently permits conserved folds and similar chromophore microenvironments, which could allow the 52 features to match training examples and inflate the reported R = 0.697 advantage over baselines.
  Authors: We agree that structural similarity metrics are necessary to rule out fold-level leakage in the remote-homology bucket. In the revised manuscript we will compute TM-scores (via TM-align) between every remote-bucket test protein and its nearest training-set neighbor (by sequence identity). We will report the distribution of these TM-scores together with performance stratified by TM-score thresholds (e.g., TM-score < 0.5). This addition will allow readers to assess whether the reported R = 0.697 advantage persists for structurally divergent proteins and will directly address the concern that conserved chromophore microenvironments may be driving the results.
  Revision: yes
- Referee: [Methods / Feature derivation] Feature-selection paragraph: the reduction from 121 to 52 non-identity features by 'removing identity shortcuts' is described without stating whether the selection criteria and threshold were fixed a priori, performed inside each CV fold, or applied to the full labelled set. If the latter, this data-driven step risks circularity that undermines the cross-validation claims.
  Authors: The identity-shortcut removal follows a fixed, a priori rule defined during mechanism-graph construction: any feature whose propagation path connects a residue to itself without involving a chromophore-region channel is excluded. This deterministic criterion was applied independently inside each cross-validation training fold using only training data. We will revise the Methods section to state this procedure explicitly, including the precise definition of identity shortcuts and confirmation that selection never used test-set information. This clarification will remove any ambiguity regarding circularity.
  Revision: yes
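The distinction at issue in the second exchange, a rule fixed before seeing data versus a filter fitted inside each training fold, can be sketched as follows. This is an illustration only: the "identity_" name prefix, the constant-feature filter and the toy matrix are hypothetical stand-ins, not the authors' actual criteria.

```python
def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def a_priori_mask(feature_names):
    """Label-free rule fixed in advance (hypothetical naming convention)."""
    return [not name.startswith("identity_") for name in feature_names]


def training_only_mask(rows, train_idx):
    """Data-driven filter fitted on training rows only: drop constant features."""
    n_feat = len(rows[0])
    return [len({rows[i][j] for i in train_idx}) > 1 for j in range(n_feat)]


FEATURES = ["identity_self_loop", "aromatic|bulk|phenolate", "charged|charge|bridge"]
ROWS = [  # toy feature matrix, one row per protein
    [1.0, 0.2, 0.0],
    [1.0, 0.4, 0.0],
    [1.0, 0.1, 0.0],
    [1.0, 0.9, 0.5],
]

selected_per_fold = []
for train_idx, test_idx in kfold_indices(len(ROWS), k=2):
    fixed = a_priori_mask(FEATURES)
    fitted = training_only_mask(ROWS, train_idx)  # test rows are never inspected
    keep = [a and b for a, b in zip(fixed, fitted)]
    selected_per_fold.append([f for f, flag in zip(FEATURES, keep) if flag])

print(selected_per_fold)
```

The a priori mask is identical in every fold, while the fitted mask can differ between folds. A single fixed set of 52 features would therefore be expected from the first kind of rule, which is a point the revised Methods could state explicitly.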
Circularity Check
No significant circularity; feature engineering and CV evaluation remain independent of target labels
The paper constructs typed 3D residue graphs from PDB structures, registers them to a mature chromophore state, partitions regions, and computes 121 enrichment features (later reduced to 52 non-identity features) that encode contact channels, seed signals, and target regions. These fixed structural descriptors are then fed to ExtraTrees regression. Performance is measured via random CV and sequence-homology buckets on a 531-protein benchmark, with no evidence that the regression outputs or selected features are algebraically equivalent to the QY labels by definition. No load-bearing self-citations, uniqueness theorems, or fitted-input-as-prediction steps appear in the derivation; the central claim rests on empirical generalization from pre-specified graph-derived features rather than tautological reduction to the training targets.
Axiom & Free-Parameter Ledger
free parameters (2)
- Selection threshold for 52 non-identity features
- ExtraTrees hyperparameters
axioms (2)
- domain assumption: PDB structures provide accurate 3D coordinates for residue contacts around the chromophore.
- domain assumption: Partitioning the chromophore into phenolate, bridge, and imidazolinone regions is physically meaningful for signal propagation.
invented entities (2)
- Mature chromophore-region 3D mechanism graph: no independent evidence
- Channel-signal-region propagation: no independent evidence