CompleteRXN: Toward Completing Open Chemical Reaction Databases

Evgeny Pidko; Gabriel Vogel; Jana M. Weber; Minouk Noordsij

arxiv: 2605.00222 · v2 · pith:6FU5WRCZnew · submitted 2026-04-30 · 💻 cs.LG · physics.chem-ph

Gabriel Vogel, Minouk Noordsij, Evgeny Pidko, Jana M. Weber

Pith reviewed 2026-05-09 20:22 UTC · model grok-4.3

keywords: chemical reactions, reaction completion, benchmark dataset, machine learning, USPTO, constrained decoding, atom balance, incomplete data

The pith

A new benchmark and a constrained-decoding model complete missing parts of chemical reactions from open databases, reaching 99.20% equivalence accuracy on random benchmark splits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds CompleteRXN, a supervised benchmark that pairs incomplete reactions drawn from USPTO records with their atom-balanced counterparts from mechanistic sources. It introduces the Constrained Reaction Balancer, an encoder-decoder model that uses constrained decoding to generate valid completions of missing byproducts, co-reactants, and coefficients. The model reaches 99.20% equivalence accuracy on random splits and 91.12% on extreme out-of-distribution splits, while other methods such as SynRBL produce plausible but less accurate results. Performance falls as incompleteness grows and drops sharply when tested on the full uncurated USPTO collection, revealing a gap between controlled benchmark conditions and practical robustness.

Core claim

We construct CompleteRXN by mapping USPTO incomplete reactions to curated mechanistic reactions that supply the missing species and stoichiometric coefficients. The Constrained Reaction Balancer achieves 99.20% equivalence accuracy on random test splits and 91.12% on extreme out-of-distribution splits. SynRBL yields many balanced and plausible completions yet lower accuracy on the benchmark splits. All methods degrade with greater incompleteness, and performance declines substantially on the full uncurated USPTO set.
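The claim that all methods degrade with greater incompleteness can be checked by grouping predictions by how incomplete the input was. The sketch below is a minimal illustration, assuming incompleteness is measured as the number of missing atoms per reaction (the paper may use a different measure); the function name and the toy records are hypothetical and are not results from the paper.

```python
from collections import defaultdict

def accuracy_by_incompleteness(records):
    """Group (missing_atoms, is_correct) pairs and return accuracy per group.

    `missing_atoms` is an assumed incompleteness measure, not necessarily
    the one used in the paper.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for missing_atoms, is_correct in records:
        totals[missing_atoms] += 1
        hits[missing_atoms] += int(is_correct)
    return {k: hits[k] / totals[k] for k in sorted(totals)}

# Invented toy records for illustration only (not data from the paper).
toy_records = [(1, True), (1, True), (3, True), (3, False), (6, False)]
per_level = accuracy_by_incompleteness(toy_records)
```

Plotting `per_level` against its keys would give a curve of the kind described in the Figure 3 caption.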

What carries the argument

The Constrained Reaction Balancer (CRB): an encoder-decoder neural network with constrained decoding that forces generated reaction completions to be atom-balanced.
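To make "atom-balanced" concrete, here is a minimal sketch of an atom-balance check. It assumes species are given as simple molecular formulas with integer coefficients, which is a simplification: the paper works with reaction records, and this parser handles neither parentheses nor charges. All function names are hypothetical.

```python
import re
from collections import Counter

# Matches an element symbol followed by an optional count, e.g. "C2", "H", "Cl".
FORMULA_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")

def atoms_in(formula):
    """Count atoms in a flat formula such as 'C2H4O2' (no parentheses or charges)."""
    counts = Counter()
    for element, n in FORMULA_TOKEN.findall(formula):
        counts[element] += int(n) if n else 1
    return counts

def side_atoms(side):
    """Total atoms on one side, given a list of (coefficient, formula) pairs."""
    total = Counter()
    for coeff, formula in side:
        for element, n in atoms_in(formula).items():
            total[element] += coeff * n
    return total

def atom_deficit(reactants, products):
    """Per-element reactant count minus product count; empty dict means balanced."""
    left, right = side_atoms(reactants), side_atoms(products)
    elements = sorted(set(left) | set(right))
    return {el: left[el] - right[el] for el in elements if left[el] != right[el]}

# Esterification with the water byproduct omitted, then included.
reactants = [(1, "C2H4O2"), (1, "C2H6O")]
incomplete = atom_deficit(reactants, [(1, "C4H8O2")])
complete = atom_deficit(reactants, [(1, "C4H8O2"), (1, "H2O")])
```

Here `incomplete` reports two hydrogens and one oxygen unaccounted for on the product side, which is exactly the missing water, and `complete` is empty.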

If this is right

Reaction completion becomes a learnable supervised task when aligned incomplete and balanced reaction pairs are available.
Accuracy declines steadily as the degree of incompleteness or distributional shift in test reactions increases.
Benchmark success does not guarantee high performance on raw, uncurated reaction collections, motivating more robust methods.
Constrained decoding provides a practical way to enforce chemical validity during generation of completions.
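The idea behind constrained decoding in the last point can be illustrated with a toy example. This is not the CRB's decoder, which operates on model tokens and is not specified in this summary; it is a hedged sketch that greedily picks whole species from a hypothetical scored candidate list and masks any candidate that would overshoot the remaining atom deficit. The candidate names and scores are invented stand-ins for model output.

```python
from collections import Counter

def fits(atoms, remaining):
    """True if adding this species would not overshoot any element's deficit."""
    return all(remaining[el] >= n for el, n in atoms.items())

def constrained_greedy(deficit, candidates):
    """Pick species until the atom deficit is exactly zero.

    `candidates` maps a species name to (score, atom Counter); both are
    hypothetical stand-ins for model output. Returns the chosen species,
    or None if greedy selection reaches a dead end.
    """
    remaining = +Counter(deficit)  # unary plus drops zero or negative counts
    chosen = []
    while remaining:
        allowed = [
            (score, name)
            for name, (score, atoms) in candidates.items()
            if sum(atoms.values()) > 0 and fits(atoms, remaining)
        ]
        if not allowed:
            return None
        _, best = max(allowed)
        chosen.append(best)
        # Counter subtraction keeps only positive counts.
        remaining = remaining - candidates[best][1]
    return chosen

deficit = {"H": 2, "O": 1}
candidates = {
    "HCl": (0.9, Counter(H=1, Cl=1)),  # highest score, but would add unbalanced Cl
    "H2O": (0.6, Counter(H=2, O=1)),
    "H2": (0.3, Counter(H=2)),
}
completion = constrained_greedy(deficit, candidates)
```

An unconstrained argmax would pick the highest-scoring candidate even though it breaks atom balance; the mask removes it. Greedy selection can still dead-end, so a practical decoder would more likely combine such a mask with beam search.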

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Completed reaction data from this approach could serve as cleaner training material for downstream models that predict new reactions or plan syntheses.
The alignment technique used to build the benchmark could be adapted to fill gaps in other large but incomplete scientific datasets.
Models trained on CompleteRXN might be tested for robustness by deliberately adding controlled noise or missing entries to existing reaction collections.

Load-bearing premise

Aligning USPTO records with curated mechanistic reactions produces incomplete inputs that accurately reflect the kinds of missing data found in real chemical databases.

What would settle it

If manual chemical validation of a large sample of completions generated on the full uncurated USPTO set showed frequent invalid or unbalanced reactions, that would demonstrate that the benchmark results do not generalize to practical conditions.

Figures

Figures reproduced from arXiv: 2605.00222 by Evgeny Pidko, Gabriel Vogel, Jana M. Weber, Minouk Noordsij.

**Figure 2.** Distribution shift induced by the proposed data splits. We plot cumulative distributions of …

**Figure 3.** Equivalence accuracy over reaction incompleteness across random, group-based, and ex…

**Figure 4.** Prediction probability distributions of accurate, balanced but inaccurate and unbalanced …

Original abstract

Chemical reaction datasets such as USPTO suffer from substantial incompleteness, frequently missing byproducts, co-reactants, and stoichiometric coefficients. This limits their applicability and reliability in downstream applications. Here, we introduce CompleteRXN, a large-scale supervised benchmark for reaction completion under realistic missing-data conditions. We construct a dataset of aligned incomplete and atom-balanced reactions by mapping USPTO records to curated mechanistic reactions. We evaluate representative baselines, including a novel encoder-decoder reaction completion model with constrained decoding, the Constrained Reaction Balancer (CRB), and a recent algorithmic method, SynRBL. On our CompleteRXN benchmark, the CRB achieves high performance across splits of increasing difficulty, reaching 99.20% equivalence accuracy on the random split and 91.12% on the extreme out-of-distribution split. SynRBL produces many balanced and chemically plausible completions, but with lower accuracy on the benchmark test splits. Across all methods, performance degrades with increasing incompleteness. We observe a substantial drop when evaluating on reactions outside the benchmark (full uncurated USPTO), highlighting the gap between benchmark performance and practical robustness and motivating future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CompleteRXN gives a new benchmark and CRB model for reaction completion, but the dataset construction leaves open whether the incompleteness patterns match real USPTO gaps.

The paper's main contribution is the CompleteRXN benchmark, created by mapping USPTO records to curated mechanistic reactions to produce aligned incomplete and atom-balanced pairs, plus the Constrained Reaction Balancer (CRB) encoder-decoder with constrained decoding. It reports 99.20% equivalence accuracy on the random split and 91.12% on the extreme out-of-distribution split, with clear degradation as incompleteness grows and a direct comparison showing CRB ahead of SynRBL on the benchmark splits. The authors also test on the full uncurated USPTO and note the substantial performance drop, which is a straightforward acknowledgment of the remaining gap to practical use.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces CompleteRXN, a supervised benchmark for chemical reaction completion under missing-data conditions. It is constructed by aligning incomplete USPTO records with curated mechanistic reactions to produce atom-balanced pairs. The authors propose the Constrained Reaction Balancer (CRB), an encoder-decoder model with constrained decoding, and compare it to SynRBL. On the benchmark, CRB achieves 99.20% equivalence accuracy on the random split and 91.12% on the extreme out-of-distribution split, with performance degrading as incompleteness increases. A notable drop is observed when testing on the full uncurated USPTO dataset.

Significance. If the benchmark accurately models realistic incompleteness patterns in open chemical reaction databases, the CRB approach could significantly enhance the completeness and reliability of such datasets for downstream machine learning and chemical applications. The empirical results on splits of increasing difficulty demonstrate the model's robustness within the benchmark, and the comparison with SynRBL provides a useful baseline. However, the performance gap on uncurated data indicates that further validation is needed for practical impact.

major comments (3)

Abstract: The construction of the CompleteRXN dataset via mapping USPTO records to curated mechanistic reactions is central to the claim of 'realistic missing-data conditions,' yet no quantitative validation is provided, such as overlap statistics between mapped missing fragments and actual USPTO omissions or inter-annotator agreement on alignments. This is load-bearing for the benchmark's validity.
Abstract: The reported equivalence accuracies (99.20% random split, 91.12% extreme OOD) lack accompanying details on the definition of the equivalence metric, statistical significance tests, error bars, or exact rules for data exclusion and split construction, which are necessary to evaluate the reliability of these figures.
Abstract: The substantial performance drop on the full uncurated USPTO dataset compared to the benchmark splits suggests that the constructed benchmark may not fully capture the distribution of real-world incompleteness, potentially limiting the generalizability of the CRB model's reported strengths.

minor comments (1)

Abstract: The abstract mentions 'representative baselines' but provides limited details on the CRB architecture and training procedure; expanding this would improve clarity for readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript introducing the CompleteRXN benchmark and CRB model. The comments highlight important areas for improving clarity and validation, and we have revised the manuscript accordingly to address them point by point.

Referee: Abstract: The construction of the CompleteRXN dataset via mapping USPTO records to curated mechanistic reactions is central to the claim of 'realistic missing-data conditions,' yet no quantitative validation is provided, such as overlap statistics between mapped missing fragments and actual USPTO omissions or inter-annotator agreement on alignments. This is load-bearing for the benchmark's validity.

Authors: We agree that additional quantitative validation strengthens the benchmark. In the revised manuscript, we have expanded Section 3 to include overlap statistics comparing the types and frequencies of missing fragments in our aligned pairs against documented USPTO incompleteness patterns (e.g., missing byproducts and co-reactants). The alignment procedure is deterministic and based on atom mapping plus template matching from the curated mechanistic set; we have added a description of this algorithm along with results from a manual audit of a random sample of alignments to verify quality. We have also clarified that inter-annotator agreement does not apply directly as the process is automated rather than manual annotation.

revision: yes

Referee: Abstract: The reported equivalence accuracies (99.20% random split, 91.12% extreme OOD) lack accompanying details on the definition of the equivalence metric, statistical significance tests, error bars, or exact rules for data exclusion and split construction, which are necessary to evaluate the reliability of these figures.

Authors: We apologize for the insufficient detail in the abstract. The equivalence metric is defined as the fraction of predictions where the completed reaction matches the ground-truth balanced reaction under canonical SMILES equivalence after atom balancing. In the revision, we have updated the abstract to briefly define the metric and added a new subsection in Methods detailing: (i) exact split construction rules (random 80/10/10, OOD by reaction class, extreme OOD by unseen templates), (ii) data exclusion criteria (invalid mappings, duplicate reactions, reactions with >10 missing atoms), and (iii) statistical reporting (error bars from 5 runs with different seeds and significance tests for method comparisons).

revision: yes

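The metric definition in this response can be sketched as follows. This is a hedged illustration, not the authors' evaluation code: it assumes reactions are written as `reactants>>products` with species separated by `.`, that repeated species encode stoichiometric coefficients, and that a `canonicalize` function is supplied by a cheminformatics toolkit. The default used here only strips whitespace, so it is a placeholder and does not produce real canonical SMILES.

```python
from collections import Counter

def reaction_key(rxn, canonicalize=str.strip):
    """Order-insensitive key for a 'reactants>>products' string.

    Returns None for strings that do not contain exactly one '>>'.
    `canonicalize` is a placeholder for a real SMILES canonicalizer.
    """
    parts = rxn.split(">>")
    if len(parts) != 2:
        return None

    def side_key(side):
        species = Counter(canonicalize(s) for s in side.split(".") if s.strip())
        return frozenset(species.items())

    return side_key(parts[0]), side_key(parts[1])

def equivalence_accuracy(predictions, targets, canonicalize=str.strip):
    """Fraction of predictions whose species multisets match the target on both sides."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = 0
    for pred, target in zip(predictions, targets):
        pred_key = reaction_key(pred, canonicalize)
        if pred_key is not None and pred_key == reaction_key(target, canonicalize):
            hits += 1
    return hits / len(targets)

# First prediction matches up to species order; second omits the water byproduct.
preds = ["CC(=O)O.OCC>>CC(=O)OCC.O", "CC(=O)O.OCC>>CC(=O)OCC"]
golds = ["OCC.CC(=O)O>>O.CC(=O)OCC", "CC(=O)O.OCC>>CC(=O)OCC.O"]
score = equivalence_accuracy(preds, golds)
```

Comparing multisets rather than raw strings makes the metric insensitive to species order while still penalizing a wrong coefficient, since a repeated species changes the count.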
Referee: Abstract: The substantial performance drop on the full uncurated USPTO dataset compared to the benchmark splits suggests that the constructed benchmark may not fully capture the distribution of real-world incompleteness, potentially limiting the generalizability of the CRB model's reported strengths.

Authors: We agree this performance gap is an important observation and already highlighted it in the original manuscript as motivation for future work. The drop arises because the benchmark uses controlled alignments to curated complete reactions, while uncurated USPTO includes additional noise (erroneous entries, non-standard notations, and omission patterns outside our alignment distribution). In the revised discussion, we have added analysis of error modes on the uncurated set and positioned CompleteRXN explicitly as a controlled testbed for progressive difficulty rather than a complete proxy for all real-world cases.

revision: partial

Circularity Check

0 steps flagged

No circularity: performance claims rest on held-out empirical evaluation of a constructed benchmark

The paper constructs CompleteRXN by mapping USPTO records to curated mechanistic reactions, then reports model accuracies on random and OOD splits of that dataset. These metrics are measured on test portions never used for construction or fitting, with no equations, self-citations, or renamings that reduce the reported numbers to the inputs by definition. The derivation chain consists of dataset creation followed by standard supervised evaluation; no load-bearing step collapses into a tautology or self-referential fit.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that USPTO-to-mechanistic mapping produces faithful incomplete-complete pairs without introducing systematic bias; no explicit free parameters or invented entities are described beyond standard model training.

axioms (1)

Domain assumption: USPTO records can be reliably mapped to curated mechanistic reactions to create aligned incomplete and atom-balanced pairs.
This mapping is the basis for constructing the CompleteRXN dataset as stated in the abstract.
