Physics-Aware Auxiliary Losses Improve Out-of-Distribution Generalization of a GNN Synthesizability Filter

Dhruv Agarwal; Riya Bisht

arXiv: 2606.12651 · v1 · pith:64JDHNSZnew · submitted 2026-06-10 · cs.LG · q-bio.QM

Pith reviewed 2026-06-27 10:15 UTC · model grok-4.3

keywords: graph neural networks · out-of-distribution generalization · synthesizability prediction · auxiliary losses · molecular property prediction · physics-informed machine learning · drug discovery

The pith

Adding auxiliary losses for molecular complexity and strain energy improves a GNN's out-of-distribution accuracy on synthesizability prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether two cheap physical priors, used as auxiliary supervision, can help a graph neural network maintain performance when its training molecules differ from the test molecules. A GINE backbone is trained to classify molecules as synthesizable or not using SAScore labels. On an out-of-distribution split that trains on drug-like compounds and tests on natural products, the versions that add a Bertz-index regression loss, an MMFF94 strain penalty, or both all produce small but statistically significant AUC gains over the baseline. The improvements vanish in-distribution, and the authors note that single-seed runs can produce misleading patterns that disappear under repeated evaluation.

Core claim

On a 65,177-molecule corpus labeled by SAScore thresholds, all three physics-aware variants give a small but statistically significant OOD improvement over the baseline (mean OOD AUC 0.9774): +complexity Delta = +0.0060 (95% CI [+0.0023, +0.0102]), +strain Delta = +0.0032 ([+0.0008, +0.0052]), +both Delta = +0.0066 ([+0.0038, +0.0093]); every interval excludes zero, and the combination is best. The variants are indistinguishable in-distribution, so the effect is visible only under OOD evaluation.
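As a rough illustration of how paired intervals of this kind can be computed, the sketch below resamples per-seed differences with replacement and takes percentiles of the resampled means. The function name `paired_bootstrap_ci`, the choice of seeds (rather than test molecules) as the resampling unit, the number of resamples, and the example AUC values are all assumptions made for illustration; this page does not say how the paper's bootstrap was configured.

```python
import random
import statistics


def paired_bootstrap_ci(baseline, variant, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (variant - baseline).

    baseline[i] and variant[i] must come from the same seed (paired design).
    """
    if len(baseline) != len(variant):
        raise ValueError("paired samples must have equal length")
    deltas = [v - b for b, v in zip(baseline, variant)]
    rng = random.Random(seed)
    n = len(deltas)
    boot_means = []
    for _ in range(n_boot):
        # Resample the paired differences with replacement.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        boot_means.append(statistics.fmean(sample))
    boot_means.sort()
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(deltas), lo, hi


# Hypothetical per-seed OOD AUCs (illustrative only, NOT the paper's numbers).
baseline_auc = [0.976, 0.978, 0.977, 0.979, 0.975]
variant_auc = [0.983, 0.982, 0.984, 0.985, 0.981]

mean_delta, ci_lo, ci_hi = paired_bootstrap_ci(baseline_auc, variant_auc)
print(f"delta = {mean_delta:+.4f}, 95% CI [{ci_lo:+.4f}, {ci_hi:+.4f}]")
```

Pairing by seed removes variance shared between the baseline and the variant, which is one reason a small delta can still yield an interval that excludes zero. With only five seeds the interval is coarse, so it is best read alongside the per-seed values.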

What carries the argument

Two auxiliary losses added to a GINE backbone: topological complexity regression supervised by the Bertz index, and a strain-energy soft penalty supervised by MMFF94 force-field energy. These supply closed-form physical priors as extra supervision signals during training.
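One way to picture the setup is as a weighted sum of a classification loss and two auxiliary terms. The sketch below is only an assumed reading: the weights `lam_cplx` and `lam_strain`, the squared-error form of the complexity term, and the hinge-style strain penalty with `strain_margin` are hypothetical choices, since the page does not give the paper's exact loss definitions or weighting. Targets are assumed to be pre-normalized.

```python
import math


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def combined_loss(examples, lam_cplx=0.1, lam_strain=0.1, strain_margin=1.0):
    """Classification loss plus two optional auxiliary terms (assumed forms).

    Each example is a dict with:
      p_synth     predicted probability of "synthesizable"
      label       0/1 label (e.g. derived from an SAScore threshold)
      bertz_pred  model's predicted (normalized) Bertz complexity
      bertz_true  (normalized) Bertz index of the molecule
      strain      (normalized) MMFF94 strain energy of the molecule
    """
    n = len(examples)
    cls = sum(bce(e["p_synth"], e["label"]) for e in examples) / n
    # Complexity head: mean squared error against the Bertz target.
    cplx = sum((e["bertz_pred"] - e["bertz_true"]) ** 2 for e in examples) / n
    # Soft strain penalty (hypothetical form): discourage confident
    # "synthesizable" predictions on molecules whose strain exceeds a margin.
    strain = sum(
        e["p_synth"] * max(0.0, e["strain"] - strain_margin) for e in examples
    ) / n
    total = cls + lam_cplx * cplx + lam_strain * strain
    return {"total": total, "cls": cls, "complexity": cplx, "strain": strain}


# Toy batch with made-up values, used to show the 4-way ablation switches.
examples = [
    {"p_synth": 0.9, "label": 1, "bertz_pred": 0.4, "bertz_true": 0.5, "strain": 0.2},
    {"p_synth": 0.3, "label": 0, "bertz_pred": 1.2, "bertz_true": 1.0, "strain": 2.5},
]
settings = {
    "baseline": (0.0, 0.0),
    "+complexity": (0.1, 0.0),
    "+strain": (0.0, 0.1),
    "+both": (0.1, 0.1),
}
ablation = {
    name: combined_loss(examples, lam_cplx=lc, lam_strain=ls)["total"]
    for name, (lc, ls) in settings.items()
}
for name, value in ablation.items():
    print(f"{name:12s} total loss = {value:.4f}")
```

Setting either weight to zero switches that term off, so the four ablation arms differ only in which auxiliary weights are non-zero.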

If this is right

The combination of both auxiliary losses produces the largest OOD gain.
The auxiliary losses leave in-distribution performance unchanged.
Single-seed experiments can yield non-monotonic patterns that do not survive multi-seed bootstrap evaluation.
The effect is modest in absolute size yet detectable with paired confidence intervals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same auxiliary-loss pattern could be tested on other molecular property tasks that have inexpensive physical or topological priors.
In a generative drug-discovery loop the small OOD lift might reduce the fraction of proposed molecules that later fail synthesis checks.
Larger gains might appear if the auxiliary signals were integrated more deeply than as simple regression or penalty terms.

Load-bearing premise

The chosen split that trains on drug-like molecules and tests on natural products is a representative test of out-of-distribution generalization for synthesizability filters.

What would settle it

Repeating the 4-way ablation on a new OOD split that trains on natural products and tests on drug-like molecules and finding that none of the physics-aware variants produce statistically significant AUC gains.
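The forward split and the reversed split proposed here can both be expressed as a partition by data source. The following is a minimal sketch under stated assumptions: the helper `split_by_source`, the record layout, and the exact-string leak check are hypothetical, and the toy SMILES strings are placeholders rather than molecules from the paper's corpus.

```python
def split_by_source(records, train_sources, test_sources):
    """Partition records into train/test sets by originating dataset.

    Test molecules whose SMILES string also appears in the training set are
    dropped to avoid leakage. In practice the strings would need to be
    canonicalized first; this sketch compares them as-is.
    """
    train_sources, test_sources = set(train_sources), set(test_sources)
    if train_sources & test_sources:
        raise ValueError("a source cannot be in both train and test")
    train = [r for r in records if r["source"] in train_sources]
    seen = {r["smiles"] for r in train}
    test = [
        r for r in records
        if r["source"] in test_sources and r["smiles"] not in seen
    ]
    return train, test


# Placeholder records (illustrative only).
records = [
    {"smiles": "CCO", "source": "HIV"},
    {"smiles": "c1ccccc1", "source": "Tox21"},
    {"smiles": "CC(=O)O", "source": "COCONUT"},
    {"smiles": "CCN", "source": "COCONUT"},
]

# Split described on this page: train on drug-like sources, test on natural products.
forward_train, forward_test = split_by_source(records, {"HIV", "Tox21"}, {"COCONUT"})
# Reversed split suggested as the settling experiment.
reverse_train, reverse_test = split_by_source(records, {"COCONUT"}, {"HIV", "Tox21"})
print(len(forward_train), len(forward_test), len(reverse_train), len(reverse_test))
```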

Figures

Figures reproduced from arXiv: 2606.12651 by Dhruv Agarwal, Riya Bisht.

**Figure 2.** Phase 1 baseline training loss (left) and validation ROC-AUC (right). Validation AUC…

**Figure 3.** In-distribution training loss (left; the strain/combined variants carry a constant aux offset)…

**Figure 4.** Left: mean OOD ROC-AUC per variant (bars = mean ± std over 5 seeds; dots = individual seeds); all three aux variants sit above the baseline. Right: paired Δ vs. baseline with 95% bootstrap CIs; all three intervals lie entirely right of zero.

Original abstract

Machine-learning drug-discovery pipelines increasingly rely on generative models that propose molecules far from the data used to train downstream synthesizability filters. Existing filters (SAScore, SCScore, RAscore, DeepSA) are purely statistical and degrade in exactly this out-of-distribution (OOD) regime. We ask whether cheap, closed-form physical priors, used as auxiliary supervision on a graph neural network (GNN), improve OOD generalization. We add two auxiliary losses to a GINE backbone: a topological complexity regression supervised by the Bertz index, and a strain-energy soft penalty supervised by MMFF94 force-field energy. On a 65,177-molecule corpus (HIV, Tox21, COCONUT) labeled by SAScore thresholds we reproduce a strong in-distribution baseline, then evaluate a 4-way ablation (baseline / +complexity / +strain / +both) on a single-source OOD split (train on drug-like HIV+Tox21, test on COCONUT natural products), repeated over 5 seeds with paired bootstrap confidence intervals. All three physics-aware variants give a small but statistically significant OOD improvement over the baseline (mean OOD AUC 0.9774): +complexity Delta = +0.0060 (95% CI [+0.0023, +0.0102]), +strain Delta = +0.0032 ([+0.0008, +0.0052]), +both Delta = +0.0066 ([+0.0038, +0.0093]); every interval excludes zero, and the combination is best. The variants are indistinguishable in-distribution, so the effect is visible only under OOD evaluation. We are explicit that the effects are modest, and we report a cautionary methodological finding: a single-seed version of this experiment produced a qualitatively different (non-monotone) story that did not survive multi-seed evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Small, statistically backed OOD AUC gains from Bertz and MMFF94 auxiliaries on one split, with good multi-seed reporting but narrow scope.

The paper shows that two auxiliary losses, based on the Bertz index and MMFF94 energies, give small but detectable OOD AUC lifts (about +0.003 to +0.007) over a GINE baseline on the HIV+Tox21 → COCONUT split, with all three variants beating the baseline under 5-seed paired bootstrap intervals that exclude zero.

The work does a straightforward 4-way ablation and correctly flags that single-seed runs produced unstable, non-monotone results while the multi-seed version stabilizes. That caution is useful and strengthens the claim. The in-distribution numbers stay flat, so the auxiliaries only show value under the reported shift.

The gains are modest and the OOD test rests on a single-source split. Nothing in the abstract tests whether the same pattern appears on other distribution shifts that matter in generative pipelines. Details of the loss weighting and the exact training setup are absent from the provided text, which limits how far one can trust the numbers without the full methods.

This is for people already running GNN filters who want a cheap way to add physical supervision and are willing to run multi-seed checks. The empirical care is solid enough that the paper should go to peer review rather than desk rejection.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that adding two auxiliary losses to a GINE GNN for binary synthesizability classification (one regressing on the Bertz topological complexity index, one softly penalizing MMFF94 strain energy) yields small but statistically significant OOD AUC gains on a single-source split (train on HIV+Tox21 drug-like molecules, test on COCONUT natural products). A 4-way ablation (baseline vs. +complexity vs. +strain vs. +both) repeated over 5 seeds with paired bootstrap CIs shows deltas of +0.0060, +0.0032, and +0.0066 respectively (all CIs exclude zero), while in-distribution performance remains equivalent; the authors explicitly caution that single-seed runs can produce misleading non-monotonic patterns.

Significance. If the result holds under broader testing, the work demonstrates that inexpensive closed-form physical priors can provide a detectable robustness boost to statistical synthesizability filters precisely in the OOD regime relevant to generative drug-discovery models. The multi-seed design, explicit single-seed caution, and use of external physical targets (Bertz, MMFF94) rather than self-referential supervision are methodological strengths. The modest effect sizes and restriction to one distribution shift, however, constrain immediate practical impact.

major comments (2)

[Experimental Setup / OOD Evaluation] The OOD claim rests on a single-source split (HIV+Tox21 → COCONUT). While the authors scope the result narrowly and report statistically significant deltas on this split, the representativeness of this particular shift for the broader class of OOD scenarios encountered by synthesizability filters is not tested; additional splits or multi-source OOD protocols would be needed to support the generalization narrative.
[Methods] Full implementation details (exact GINE architecture, hyper-parameters, loss weighting, SAScore threshold used for labeling, and data preprocessing) are absent from the provided text. This absence makes it impossible to verify the reproduced in-distribution baseline or to assess whether the auxiliary losses interact with any hidden modeling choices.

minor comments (2)

[Results] The abstract and results section correctly flag the modest effect sizes; adding the corresponding in-distribution AUC values (even if statistically indistinguishable) would help readers quantify how OOD-specific the benefit is.
[Introduction] Consider citing prior work on auxiliary physical losses in molecular GNNs (e.g., force-field or graph-topology regularizers) to situate the contribution more explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. We address the two major comments point by point below.

Referee: [Experimental Setup / OOD Evaluation] The OOD claim rests on a single-source split (HIV+Tox21 → COCONUT). While the authors scope the result narrowly and report statistically significant deltas on this split, the representativeness of this particular shift for the broader class of OOD scenarios encountered by synthesizability filters is not tested; additional splits or multi-source OOD protocols would be needed to support the generalization narrative.

Authors: We agree that the evaluation uses only one OOD split and that this limits claims about broader representativeness. The manuscript is explicitly scoped to this single-source shift (drug-like molecules to natural products), which is directly relevant to the generative drug-discovery setting highlighted in the introduction. We already caution readers about the narrow scope and the risk of non-monotonic single-seed results. While additional splits would strengthen a broader generalization narrative, they fall outside the stated scope of the work; the contribution is the demonstration of a detectable, statistically significant auxiliary-loss effect under this specific, practically relevant OOD condition. We therefore do not plan to add further splits in the revision.

Revision: no

Referee: [Methods] Full implementation details (exact GINE architecture, hyper-parameters, loss weighting, SAScore threshold used for labeling, and data preprocessing) are absent from the provided text. This absence makes it impossible to verify the reproduced in-distribution baseline or to assess whether the auxiliary losses interact with any hidden modeling choices.

Authors: We agree that the absence of these details hinders reproducibility. In the revised manuscript we will add a new Methods subsection (or appendix) that specifies the exact GINE architecture, all training hyper-parameters, the loss-weighting coefficients for the auxiliary terms, the SAScore threshold used to create binary labels, and the full data-preprocessing pipeline.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical ablation on external targets

The paper reports a standard ML ablation (GINE backbone + optional auxiliary regression/penalty losses) evaluated via repeated multi-seed training and paired bootstrap CIs on a fixed single-source OOD split (HIV+Tox21 train, COCONUT test). Supervision targets (Bertz index, MMFF94 energies) are computed by independent external algorithms on the molecules themselves; they are not derived from the model or fitted to the target metric. No equations, uniqueness theorems, ansatzes, or self-citations appear as load-bearing steps in any derivation chain. The result is a set of statistical deltas on held-out data, fully falsifiable outside the paper's fitted weights.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard supervised GNN training assumptions plus the domain assumption that the chosen physical descriptors are relevant proxies; no free parameters, new entities, or ad-hoc axioms beyond those are introduced.

axioms (1)

Domain assumption: Bertz index and MMFF94 force-field energies are appropriate physical priors for auxiliary supervision in synthesizability prediction.
Invoked directly as regression and penalty targets without further empirical validation in the abstract.

Reference graph

Works this paper leans on

29 extracted references

[1] Batatia, I. et al. (2022). MACE: Higher order equivariant message passing neural networks for fast and accurate force fields. NeurIPS 35.
[2] Bengio, E. et al. (2021). Flow network based generative models for non-iterative diverse candidate generation. NeurIPS 34.
[3] Bertz, S.H. (1981). The first general index of molecular complexity. JACS 103(12), 3599–3601.
[4] Coley, C.W. et al. (2018). SCScore: synthetic complexity learned from a reaction corpus. JCIM 58(2), 252–261.
[5] Wang, S. et al. (2023). DeepSA: a deep-learning driven predictor of compound synthesis accessibility. J. Cheminform. 15, 103.
[6] Dey, V. & Ning, X. (2024). Enhancing molecular property prediction with auxiliary learning and task-specific adaptation (RCGrad). J. Cheminform. 16, 87. arXiv:2401.16299.
[7] Ji, Y. et al. (2023). DrugOOD: Out-of-distribution dataset curator and benchmark for AI-aided drug discovery. AAAI 37(7), 8023–8031. arXiv:2201.09637.
[8] Ertl, P. & Schuffenhauer, A. (2009). Estimation of synthetic accessibility score. J. Cheminform. 1(1), 8.
[9] Fabian, B. et al. (2020). Molecular representation learning with language models and domain-relevant auxiliary tasks (MolBERT). Machine Learning for Molecules Workshop, NeurIPS. arXiv:2011.13230.
[10] Fan, S., Wang, X., Shi, C., Cui, P. & Wang, B. (2024). Generalizing graph neural networks on out-of-distribution graphs (StableGNN). IEEE Trans. Pattern Anal. Mach. Intell. 46(1), 322–337. arXiv:2111.10657.
[11] Gao, W. & Coley, C.W. (2020). The synthesizability of molecules proposed by generative models. JCIM 60(12), 5714–5723.
[12] Yu, J. et al. (2022). Organic compound synthetic accessibility prediction based on the graph attention mechanism (GASA). JCIM 62(12), 2973–2986.
[13] Gasteiger, J. et al. (2020). Directional message passing for molecular graphs (DimeNet). ICLR.
[14] Genheden, S. et al. (2020). AiZynthFinder. J. Cheminform. 12(1), 70.
[15] Gómez-Bombarelli, R. et al. (2018). Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4(2), 268–276.
[16] Gui, S., Li, X., Wang, L. & Ji, S. (2022). GOOD: A graph out-of-distribution benchmark. NeurIPS Datasets & Benchmarks Track 35. arXiv:2206.08452.
[17] Halgren, T.A. (1996). Merck molecular force field (MMFF94). J. Comput. Chem. 17(5-6), 490–519.
[18] Hu, W. et al. (2020). Strategies for pre-training graph neural networks. ICLR.
[19] Karpatne, A. et al. (2017). Theory-guided data science: A new paradigm for scientific discovery from data. IEEE Trans. Knowl. Data Eng. 29(10), 2318–2331.
[20] Li, H., Wang, X., Zhang, Z. & Zhu, W. (2023). OOD-GNN: Out-of-distribution generalized graph neural network. IEEE Trans. Knowl. Data Eng. 35(7), 7328–7340.
[21] Raissi, M. et al. (2019). Physics-informed neural networks. J. Comput. Phys. 378, 686–707.
[22] Schütt, K.T. et al. (2017). SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. NeurIPS 30.
[23] Sorokina, M. et al. (2021). COCONUT online: Collection of Open Natural Products. J. Cheminform. 13(1), 2.
[24] Stokes, J.M. et al. (2020). A deep learning approach to antibiotic discovery. Cell 180(4), 688–702.
[25] Voršilák, M. et al. (2020). SYBA: Bayesian estimation of synthetic accessibility of organic compounds. J. Cheminform. 12, 35.
[26] Takamoto, M., Zaverkin, V. & Niepert, M. (2025). Physics-informed weakly supervised learning for interatomic potentials. ICML. arXiv:2408.05215.
[27] Thakkar, A. et al. (2021). Retrosynthetic accessibility score (RAscore). Chem. Sci. 12(9), 3339–3349.
[28] Vignac, C. et al. (2023). DiGress: Discrete denoising diffusion for graph generation. ICLR.
[29] Yang, K. et al. (2019). Analyzing learned molecular representations. JCIM 59(8), 3370–3388.

A Per-seed OOD results

Table 5 gives the per-seed OOD ROC-AUC behind Table 3. The effect is consistent on average but not unanimous per seed: +complexity is marginally negative on seed 2 (−0.0002) and +strain on seed 4 (−0.0009); +both is positive on all five seed...
