CompleteRXN: Toward Completing Open Chemical Reaction Databases

Evgeny Pidko; Gabriel Vogel; Jana M. Weber; Minouk Noordsij

arxiv: 2605.00222 · v2 · pith:6FU5WRCZnew · submitted 2026-04-30 · 💻 cs.LG · physics.chem-ph

CompleteRXN: Toward Completing Open Chemical Reaction Databases

Gabriel Vogel , Minouk Noordsij , Evgeny Pidko , Jana M. Weber This is my paper

Pith reviewed 2026-07-01 08:00 UTC · model grok-4.3

classification 💻 cs.LG physics.chem-ph

keywords chemical reaction completionreaction balancingUSPTO datasetmachine learning benchmarkconstrained decodingencoder-decoder model

0 comments

The pith

A constrained decoder model completes missing chemical reaction components with 99% accuracy on standard splits and 91% on extreme out-of-distribution cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CompleteRXN, a benchmark dataset for completing incomplete chemical reactions from open databases like USPTO. It maps USPTO records to curated mechanistic reactions to create aligned incomplete and balanced pairs. A new model called the Constrained Reaction Balancer uses encoder-decoder architecture with constrained decoding to predict missing parts. On the benchmark, it reaches high equivalence accuracy across difficulty levels, though performance drops on fully uncurated data. This highlights the challenge of realistic missing data conditions in reaction databases.

Core claim

The central claim is that the Constrained Reaction Balancer achieves high performance on the CompleteRXN benchmark, with 99.20% equivalence accuracy on random splits and 91.12% on extreme out-of-distribution splits, outperforming other methods like SynRBL in accuracy while producing balanced completions.

What carries the argument

The Constrained Reaction Balancer (CRB), an encoder-decoder model with constrained decoding that enforces atom balance and chemical plausibility in reaction completions.

If this is right

Performance degrades with increasing incompleteness in reactions.
Substantial drop occurs when evaluating on full uncurated USPTO data.
SynRBL produces many balanced and plausible completions but with lower accuracy.
Future work is motivated to improve practical robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the benchmark holds, similar completion models could improve reliability of reaction databases for downstream ML tasks.
The drop on uncurated data suggests need for better generalization techniques.
Mapping to mechanistic reactions may not capture all real-world incompleteness patterns.

Load-bearing premise

That mapping USPTO records to curated mechanistic reactions creates a dataset representative of realistic missing-data conditions in open chemical reaction databases.

What would settle it

If the CRB model fails to achieve above 80% equivalence accuracy on a new set of reactions extracted directly from uncurated USPTO without mapping, the claim of high performance under realistic conditions would be falsified.

Figures

Figures reproduced from arXiv: 2605.00222 by Evgeny Pidko, Gabriel Vogel, Jana M. Weber, Minouk Noordsij.

**Figure 2.** Figure 2: Distribution shift induced by the proposed data splits. We plot cumulative distributions of [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Equivalence accuracy over reaction incompleteness across random, group-based, and ex [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Prediction probability distributions of accurate, balanced but inaccurate and unbalanced [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Chemical reaction datasets such as USPTO suffer from substantial incompleteness, frequently missing byproducts, co-reactants, and stoichiometric coefficients. This limits their applicability and reliability in downstream applications. Here, we introduce CompleteRXN, a large-scale supervised benchmark for reaction completion under realistic missing-data conditions. We construct a dataset of aligned incomplete and atom-balanced reactions by mapping USPTO records to curated mechanistic reactions. We evaluate representative baselines, including a novel encoder-decoder reaction completion model with constrained decoding, the Constrained Reaction Balancer (CRB), and a recent algorithmic method, SynRBL. On our CompleteRXN benchmark, the CRB achieves high performance across splits of increasing difficulty, reaching 99.20% equivalence accuracy on the random split and 91.12% on the extreme out-of-distribution split. SynRBL produces many balanced and chemically plausible completions, but with lower accuracy on the benchmark test splits. Across all methods, performance degrades with increasing incompleteness. We observe a substantial drop when evaluating on reactions outside the benchmark (full uncurated USPTO), highlighting the gap between benchmark performance and practical robustness and motivating future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces CompleteRXN by mapping USPTO to mechanistic reactions for a reaction completion benchmark, with CRB hitting high numbers on the splits but a clear drop on uncurated data that questions how realistic the setup is.

read the letter

The punchline is that this paper builds a new benchmark for completing incomplete chemical reactions by aligning USPTO data with mechanistic records, and their CRB model performs strongly on the controlled splits but shows a clear drop on the full uncurated set.

What is new is the CompleteRXN dataset itself, which provides paired incomplete and complete reactions under varying levels of difficulty. The encoder-decoder model with constrained decoding is a solid baseline that enforces atom balance during generation. They also compare to SynRBL and show how all methods struggle more as incompleteness increases.

The paper does a reasonable job highlighting the gap between benchmark performance and real-world application by reporting the lower results on raw USPTO. This is honest and useful.

The main soft spot is whether the mapping process creates missing data patterns that are representative of actual open databases. The performance difference suggests it might not, so the high accuracies on CompleteRXN may overstate the practical value for improving downstream ML tasks. Without the full details on how the alignments were done and how the splits were defined, it's difficult to assess if there are any biases in the evaluation.

This paper is for researchers in computational chemistry and machine learning who work with reaction databases and need ways to handle incompleteness. It would be of interest to anyone building or evaluating models for reaction prediction or balancing.

It deserves peer review because the benchmark construction is a concrete step forward on a known problem, and the results point to important limitations that referees could help refine.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces CompleteRXN, a supervised benchmark for reaction completion under missing-data conditions. It constructs paired incomplete/atom-balanced reactions by mapping USPTO records onto curated mechanistic reactions, then evaluates baselines including a new encoder-decoder model (Constrained Reaction Balancer, CRB) with constrained decoding and the algorithmic method SynRBL. The central empirical claims are that CRB reaches 99.20% equivalence accuracy on the random split and 91.12% on the extreme out-of-distribution split, that performance degrades with increasing incompleteness, and that a substantial drop occurs when the same models are tested on full uncurated USPTO.

Significance. If the USPTO-to-mechanistic mapping produces incompleteness patterns that match those actually present in open databases, CompleteRXN and the CRB model would provide a useful supervised signal for improving reaction database completeness. The explicit reporting of the performance drop on uncurated USPTO is a strength that correctly flags the generalization gap.

major comments (2)

[Dataset construction] Dataset construction (abstract and §3): the claim that the mapping 'produces a dataset that accurately represents realistic missing-data conditions' is load-bearing for the benchmark's utility, yet the manuscript provides no quantitative comparison of the induced distribution of missing byproducts, co-reactants, or stoichiometric omissions against statistics measured directly on uncurated USPTO; the reported substantial drop on full USPTO is consistent with a mismatch.
[Evaluation protocol] Evaluation protocol (abstract): the extreme out-of-distribution split is described only at a high level; without explicit definitions of how the splits are constructed (e.g., reaction-class or template hold-out criteria) it is impossible to judge whether the 91.12% figure reflects genuine extrapolation or residual leakage from the USPTO-to-mechanistic mapping step.

minor comments (1)

The abstract states performance numbers to two decimal places but does not report the number of test examples per split or confidence intervals; adding these would strengthen the empirical claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments identify areas where additional clarity and analysis would strengthen the manuscript. We respond to each below and indicate the revisions we will make.

read point-by-point responses

Referee: Dataset construction (abstract and §3): the claim that the mapping 'produces a dataset that accurately represents realistic missing-data conditions' is load-bearing for the benchmark's utility, yet the manuscript provides no quantitative comparison of the induced distribution of missing byproducts, co-reactants, or stoichiometric omissions against statistics measured directly on uncurated USPTO; the reported substantial drop on full USPTO is consistent with a mismatch.

Authors: We agree that a quantitative comparison of incompleteness patterns would strengthen the claim. In the revised manuscript we will add a new subsection (or appendix) that reports the empirical distributions of missing byproducts, co-reactants, and stoichiometric omissions in CompleteRXN and directly compares them to the same statistics computed on the full uncurated USPTO. This analysis will also discuss the observed performance drop on uncurated data in light of any distributional differences. revision: yes
Referee: Evaluation protocol (abstract): the extreme out-of-distribution split is described only at a high level; without explicit definitions of how the splits are constructed (e.g., reaction-class or template hold-out criteria) it is impossible to judge whether the 91.12% figure reflects genuine extrapolation or residual leakage from the USPTO-to-mechanistic mapping step.

Authors: The construction details for all splits, including the extreme out-of-distribution split (reaction-class and template hold-out criteria), appear in §4 of the full manuscript. To address the concern that the abstract is insufficiently explicit, we will expand the abstract to include a concise but precise description of the split criteria and will add a short dedicated paragraph (with pseudocode) in §4 that makes the hold-out rules fully explicit, thereby allowing readers to assess potential leakage. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark construction with no self-referential derivations.

full rationale

The paper presents an empirical benchmark (CompleteRXN) built by mapping USPTO records to curated mechanistic reactions, then evaluates models including a novel CRB on random and OOD splits. No equations, parameter fits, or derivations are described that reduce to author-defined inputs by construction. The central claims are performance numbers on the constructed benchmark and a noted drop on uncurated USPTO; these are falsifiable measurements rather than tautological predictions. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing steps. This is a standard empirical evaluation whose validity rests on external data realism, not internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the validity of the USPTO-to-mechanistic mapping process and the assumption that the resulting splits test realistic incompleteness.

axioms (1)

domain assumption USPTO records can be reliably mapped to curated mechanistic reactions to produce aligned incomplete and atom-balanced pairs that represent real missing-data conditions.
This mapping is the foundation for the entire benchmark construction described in the abstract.

pith-pipeline@v0.9.1-grok · 5735 in / 1126 out tokens · 38409 ms · 2026-07-01T08:00:22.438094+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 1 canonical work pages

[1]

Blanco, N

C. Blanco, N. Pauliks, F. Donati, N. Engberg, and J. Weber. Machine learning to support prospective life cycle assessment of emerging chemical technologies.Current Opinion in Green and Sustainable Chemistry, 50:100979, 2024

2024
[2]

S. Chen, T. Langhorst, J. Nöhl, C. Oberschelp, M. Pillich, J. Schilling, and A. Bardow. Life cycle assessment for all organic chemicals.arXiv preprint arXiv:2603.15686, 2026

work page arXiv 2026
[3]

C. W. Coley, R. Barzilay, T. S. Jaakkola, W. H. Green, and K. F. Jensen. Prediction of organic reaction outcomes using machine learning.ACS central science, 3(5):434–443, 2017

2017
[4]

C. W. Coley, L. Rogers, W. H. Green, and K. F. Jensen. Computer-assisted retrosynthesis based on molecular similarity.ACS central science, 3(12):1237–1245, 2017

2017
[5]

Y . Du, A. R. Jamasb, J. Guo, T. Fu, C. Harris, Y . Wang, C. Duan, P. Liò, P. Schwaller, and T. L. Blundell. Machine learning-aided generative molecular design.Nature Machine Intelligence, 6(6):589–604, 2024

2024
[6]

Ioannou, S

I. Ioannou, S. C. D’Angelo, Á. Galán-Martín, C. Pozo, J. Pérez-Ramírez, and G. Guillén- Gosálbez. Process modelling and life cycle assessment coupled with experimental work to shape the future sustainable production of chemicals and fuels.Reaction Chemistry & Engineering, 6(7):1179–1194, 2021

2021
[7]

W. Jin, C. Coley, R. Barzilay, and T. Jaakkola. Predicting organic reaction outcomes with weisfeiler-lehman network.Advances in neural information processing systems, 30, 2017

2017
[8]

J. F. Joung, M. H. Fong, N. Casetti, J. P. Liles, N. S. Dassanayake, and C. W. Coley. Electron flow matching for generative reaction mechanism prediction.Nature, 645(8079):115–123, 2025

2025
[9]

D. M. Lowe.Extraction of chemical structures and reactions from the literature. PhD thesis, 2012

2012
[10]

NameRxn, 2026

NextMove Software. NameRxn, 2026. Accessed: 2026-04-05

2026
[11]

Phan, N.-N

T.-L. Phan, N.-N. Nguyen Song, and P. F. Stadler. Synrxn: An open benchmark and curated dataset for computational reaction modeling.Scientific Data, 13(1):625, 2026

2026
[12]

T.-L. Phan, K. Weinbauer, T. Gärtner, D. Merkle, J. L. Andersen, R. Fagerberg, and P. F. Stadler. Reaction rebalancing: a novel approach to curating reaction databases.Journal of Cheminformatics, 16(1):82, 2024. 10

2024
[13]

Probst, P

D. Probst, P. Schwaller, and J.-L. Reymond. Reaction classification and yield prediction using the differential reaction fingerprint drfp.Digital discovery, 1(2):91–97, 2022

2022
[14]

Schneider, D

N. Schneider, D. M. Lowe, R. A. Sayle, and G. A. Landrum. Development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity. Journal of chemical information and modeling, 55(1):39–53, 2015

2015
[15]

Schneider, N

N. Schneider, N. Stiefl, and G. A. Landrum. What’s what: The (nearly) definitive guide to reaction role assignment.Journal of chemical information and modeling, 56(12):2336–2346, 2016

2016
[16]

found in translation

P. Schwaller, T. Gaudin, D. Lanyi, C. Bekas, and T. Laino. “found in translation”: predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chemical science, 9(28):6091–6098, 2018

2018
[17]

Schwaller, T

P. Schwaller, T. Laino, T. Gaudin, P. Bolgar, C. A. Hunter, C. Bekas, and A. A. Lee. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction.ACS central science, 5(9):1572–1583, 2019

2019
[18]

Schwaller, D

P. Schwaller, D. Probst, A. C. Vaucher, V . H. Nair, D. Kreutter, T. Laino, and J.-L. Reymond. Mapping the space of chemical reactions using attention-based neural networks.Nature Machine Intelligence, 3(2):144–152, 2021

2021
[19]

M. H. Segler and M. P. Waller. Neural-symbolic machine learning for retrosynthesis and reaction prediction.Chemistry–A European Journal, 23(25):5966–5971, 2017

2017
[20]

van Wijngaarden, G

M. van Wijngaarden, G. V ogel, and J. M. Weber. Completing partial reaction equations with rule and language model-based methods. InComputer Aided Chemical Engineering, volume 53, pages 3139–3144. Elsevier, 2024

2024
[21]

J. M. Weber, Z. Guo, C. Zhang, A. M. Schweidtmann, and A. A. Lapkin. Chemical data intelligence for sustainable chemistry.Chemical Society Reviews, 50(21):12013–12036, 2021

2021
[22]

Zhang, A

C. Zhang, A. Arun, and A. A. Lapkin. Completing and balancing database excerpted chemical reactions with a hybrid mechanistic-machine learning approach.ACS omega, 9(16):18385– 18399, 2024

2024
[23]

Zipoli, Z

F. Zipoli, Z. Ayadi, P. Schwaller, T. Laino, and A. C. Vaucher. Completion of partial chemical equations.Machine Learning: Science and Technology, 5(2):025071, 2024. 11 A Details of Data Analysis of USPTO The USPTO reaction dataset is highly incomplete: across commonly used subsets, fewer than 10% of reactions are atom- and charge-balanced. Missing inform...

2024

[1] [1]

Blanco, N

C. Blanco, N. Pauliks, F. Donati, N. Engberg, and J. Weber. Machine learning to support prospective life cycle assessment of emerging chemical technologies.Current Opinion in Green and Sustainable Chemistry, 50:100979, 2024

2024

[2] [2]

S. Chen, T. Langhorst, J. Nöhl, C. Oberschelp, M. Pillich, J. Schilling, and A. Bardow. Life cycle assessment for all organic chemicals.arXiv preprint arXiv:2603.15686, 2026

work page arXiv 2026

[3] [3]

C. W. Coley, R. Barzilay, T. S. Jaakkola, W. H. Green, and K. F. Jensen. Prediction of organic reaction outcomes using machine learning.ACS central science, 3(5):434–443, 2017

2017

[4] [4]

C. W. Coley, L. Rogers, W. H. Green, and K. F. Jensen. Computer-assisted retrosynthesis based on molecular similarity.ACS central science, 3(12):1237–1245, 2017

2017

[5] [5]

Y . Du, A. R. Jamasb, J. Guo, T. Fu, C. Harris, Y . Wang, C. Duan, P. Liò, P. Schwaller, and T. L. Blundell. Machine learning-aided generative molecular design.Nature Machine Intelligence, 6(6):589–604, 2024

2024

[6] [6]

Ioannou, S

I. Ioannou, S. C. D’Angelo, Á. Galán-Martín, C. Pozo, J. Pérez-Ramírez, and G. Guillén- Gosálbez. Process modelling and life cycle assessment coupled with experimental work to shape the future sustainable production of chemicals and fuels.Reaction Chemistry & Engineering, 6(7):1179–1194, 2021

2021

[7] [7]

W. Jin, C. Coley, R. Barzilay, and T. Jaakkola. Predicting organic reaction outcomes with weisfeiler-lehman network.Advances in neural information processing systems, 30, 2017

2017

[8] [8]

J. F. Joung, M. H. Fong, N. Casetti, J. P. Liles, N. S. Dassanayake, and C. W. Coley. Electron flow matching for generative reaction mechanism prediction.Nature, 645(8079):115–123, 2025

2025

[9] [9]

D. M. Lowe.Extraction of chemical structures and reactions from the literature. PhD thesis, 2012

2012

[10] [10]

NameRxn, 2026

NextMove Software. NameRxn, 2026. Accessed: 2026-04-05

2026

[11] [11]

Phan, N.-N

T.-L. Phan, N.-N. Nguyen Song, and P. F. Stadler. Synrxn: An open benchmark and curated dataset for computational reaction modeling.Scientific Data, 13(1):625, 2026

2026

[12] [12]

T.-L. Phan, K. Weinbauer, T. Gärtner, D. Merkle, J. L. Andersen, R. Fagerberg, and P. F. Stadler. Reaction rebalancing: a novel approach to curating reaction databases.Journal of Cheminformatics, 16(1):82, 2024. 10

2024

[13] [13]

Probst, P

D. Probst, P. Schwaller, and J.-L. Reymond. Reaction classification and yield prediction using the differential reaction fingerprint drfp.Digital discovery, 1(2):91–97, 2022

2022

[14] [14]

Schneider, D

N. Schneider, D. M. Lowe, R. A. Sayle, and G. A. Landrum. Development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity. Journal of chemical information and modeling, 55(1):39–53, 2015

2015

[15] [15]

Schneider, N

N. Schneider, N. Stiefl, and G. A. Landrum. What’s what: The (nearly) definitive guide to reaction role assignment.Journal of chemical information and modeling, 56(12):2336–2346, 2016

2016

[16] [16]

found in translation

P. Schwaller, T. Gaudin, D. Lanyi, C. Bekas, and T. Laino. “found in translation”: predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chemical science, 9(28):6091–6098, 2018

2018

[17] [17]

Schwaller, T

P. Schwaller, T. Laino, T. Gaudin, P. Bolgar, C. A. Hunter, C. Bekas, and A. A. Lee. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction.ACS central science, 5(9):1572–1583, 2019

2019

[18] [18]

Schwaller, D

P. Schwaller, D. Probst, A. C. Vaucher, V . H. Nair, D. Kreutter, T. Laino, and J.-L. Reymond. Mapping the space of chemical reactions using attention-based neural networks.Nature Machine Intelligence, 3(2):144–152, 2021

2021

[19] [19]

M. H. Segler and M. P. Waller. Neural-symbolic machine learning for retrosynthesis and reaction prediction.Chemistry–A European Journal, 23(25):5966–5971, 2017

2017

[20] [20]

van Wijngaarden, G

M. van Wijngaarden, G. V ogel, and J. M. Weber. Completing partial reaction equations with rule and language model-based methods. InComputer Aided Chemical Engineering, volume 53, pages 3139–3144. Elsevier, 2024

2024

[21] [21]

J. M. Weber, Z. Guo, C. Zhang, A. M. Schweidtmann, and A. A. Lapkin. Chemical data intelligence for sustainable chemistry.Chemical Society Reviews, 50(21):12013–12036, 2021

2021

[22] [22]

Zhang, A

C. Zhang, A. Arun, and A. A. Lapkin. Completing and balancing database excerpted chemical reactions with a hybrid mechanistic-machine learning approach.ACS omega, 9(16):18385– 18399, 2024

2024

[23] [23]

Zipoli, Z

F. Zipoli, Z. Ayadi, P. Schwaller, T. Laino, and A. C. Vaucher. Completion of partial chemical equations.Machine Learning: Science and Technology, 5(2):025071, 2024. 11 A Details of Data Analysis of USPTO The USPTO reaction dataset is highly incomplete: across commonly used subsets, fewer than 10% of reactions are atom- and charge-balanced. Missing inform...

2024