ARMOR: An Agentic Framework for Reaction Feasibility Prediction via Adaptive Utility-aware Multi-tool Reasoning

Botao Yu; Daniel Adu-Ampratwum; Xia Ning; Xinyi Ling; Ye Liu

arxiv: 2605.07103 · v1 · submitted 2026-05-08 · 💻 cs.AI · cs.MA

Pith reviewed 2026-05-11 00:50 UTC · model grok-4.3

keywords reaction feasibility prediction · multi-tool reasoning · agentic framework · tool utility modeling · conflict resolution · computational chemistry

The pith

ARMOR uses an agentic setup to learn each tool's strengths and resolve conflicts for more accurate reaction feasibility predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ARMOR as a way to combine multiple AI tools for deciding if a chemical reaction is feasible. Individual tools perform unevenly across different reactions, so simple averaging or picking one tool often falls short. ARMOR instead builds a hierarchy that favors stronger tools for each case, tracks their specific patterns, and uses memory-augmented reasoning to settle disagreements. Experiments on a public dataset show consistent gains over single-tool and aggregation baselines, with the largest improvements occurring precisely when the tools disagree. This approach matters because reliable feasibility checks can speed up exploration in drug design and materials discovery without needing to run every reaction in the lab.

Core claim

ARMOR organizes tools into a hierarchy that prioritizes top-performing tools for each reaction, characterizes their strengths through tool-specific patterns, and resolves conflicts via memory-augmented reasoning to produce the final feasibility prediction.

What carries the argument

An agentic framework that models tool-specific utilities, adaptively prioritizes tools in a hierarchy, and resolves conflicts through memory-augmented reasoning.
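The mechanism described above can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `ToolVote` record, the per-reaction `utility` scores, the `top_l` cutoff, and the win-counting memory are all hypothetical stand-ins for the paper's utility modeling and memory-augmented reasoning.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ToolVote:
    tool: str        # hypothetical tool name
    feasible: bool   # the tool's binary feasibility prediction
    utility: float   # assumed utility of this tool for this reaction


def predict(votes, memory, top_l=2):
    """Return a feasibility label from tool votes.

    memory: list of sets, each holding the names of the tools that were
    correct on a past conflicting reaction (an assumed, simplified memory).
    """
    # Hierarchy: rank tools by utility and keep only the top-L.
    top = sorted(votes, key=lambda v: v.utility, reverse=True)[:top_l]

    # If the prioritized tools agree, accept their shared answer.
    if len({v.feasible for v in top}) == 1:
        return top[0].feasible

    # Conflict: count how often each prioritized tool was right in memory.
    wins = Counter()
    for right_tools in memory:
        for v in top:
            if v.tool in right_tools:
                wins[v.tool] += 1

    # Side with the tool that memory favors; break ties by utility.
    best = max(top, key=lambda v: (wins[v.tool], v.utility))
    return best.feasible
```

In this toy version, lower-ranked tools are simply ignored once the top-L tools are chosen, and memory is consulted only when those prioritized tools disagree. The paper's actual deferral and reasoning steps are richer than a win count.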

If this is right

Single-tool methods and simple aggregation approaches are outperformed on the evaluation dataset.
The accuracy gains are largest on reactions where the underlying tools produce conflicting feasibility signals.
Explicit utility modeling and memory-based conflict resolution allow complementary strengths across tools to be used systematically.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same adaptive hierarchy and conflict-resolution steps could be applied to other multi-tool decision tasks such as property prediction or synthesis planning.
If tool utilities prove stable across chemical domains, the framework might reduce the need to retrain or select tools for each new subfield.
Memory-augmented reasoning for conflicts may generalize to settings where tools provide probabilistic rather than binary outputs.

Load-bearing premise

The public dataset reflects the distribution of real-world reactions and the learned tool utilities and conflict patterns will hold for new reactions outside the test set.

What would settle it

Run ARMOR on a fresh collection of reactions drawn from a different source where tool predictions frequently conflict and check whether the accuracy advantage over baselines disappears.
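A check of this kind could be scripted roughly as follows. This is a minimal sketch: the function names and the list-based data layout are assumptions made for illustration, and the baseline is whatever aggregation method one chooses to compare against.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def conflict_mask(tool_preds):
    """Mark reactions where the tools do not all give the same answer.

    tool_preds: one list of tool predictions per reaction.
    """
    return [len(set(row)) > 1 for row in tool_preds]


def subset_gap(system_preds, baseline_preds, labels, mask):
    """Accuracy of the system minus accuracy of the baseline on masked rows."""
    idx = [i for i, keep in enumerate(mask) if keep]
    if not idx:
        raise ValueError("no conflicting reactions in this sample")

    def pick(xs):
        return [xs[i] for i in idx]

    return accuracy(pick(system_preds), pick(labels)) - accuracy(
        pick(baseline_preds), pick(labels)
    )
```

A gap near zero on the conflicting subset of a fresh dataset would count against the claim; a gap comparable to the one reported on the public dataset would support it.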

Figures

Figures reproduced from arXiv: 2605.07103 by Botao Yu, Daniel Adu-Ampratwum, Xia Ning, Xinyi Ling, Ye Liu.

**Figure 2.** Ablation study of ARMOR. In this subsection, we conduct ablation experiments to assess the effectiveness of different components in the ARMOR framework. We progressively remove the tool conflict resolution module (Section 3.3), utility-aware tool prioritization module (Section 3.2) and tool hierarchy construction module (Section 3.1), resulting in three variants: -w/o Conflict, -w/o Utility, and -w/o Hier…

**Figure 3.** Case study. To further illustrate the reasoning process of ARMOR, we present a representative case study in…

Original abstract

Reaction feasibility prediction, as a fundamental problem in computational chemistry, has benefited from diverse tools enabled by recent advances in artificial intelligence, particularly large language models. However, the performance of individual tools varies substantially across reactions, making it difficult for any single tool to consistently perform well across all cases. This raises a critical challenge: how to effectively leverage multiple tools to obtain more accurate feasibility predictions. To address this, we propose ARMOR, an agentic framework that explicitly models tool-specific utilities, adaptively prioritizes tools, and further resolves the potential tool conflicts to produce the final prediction for each reaction. Unlike existing approaches that rely on simple aggregation or heuristic assignment over various tools, ARMOR organizes tools into a hierarchy that prioritizes top-performing tools and defers others when needed, characterizes their strengths through tool-specific patterns, and resolves conflicts via memoryaugmented reasoning. Extensive experiments on a public dataset demonstrate that ARMOR consistently outperforms strong baselines, including single-tool methods as well as various tool aggregation and tool selection approaches. Further analysis shows that the improvements are particularly significant on reactions with conflicting tool predictions, highlighting the effectiveness of ARMOR in leveraging the complementary strengths of multiple tools. The code is available via https://anonymous.4open.science/r/ARMOR-E13F.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ARMOR's hierarchy plus memory-augmented conflict handling improves reaction feasibility predictions over simple aggregation, especially on disagreeing tools, with code released.

ARMOR models tool utilities explicitly, puts stronger tools higher in a priority list, and uses memory to sort out conflicts when the tools disagree on a reaction. The result is better accuracy than single tools or basic aggregation methods on the public dataset they tested, with the biggest edge showing up exactly on the conflicting cases. That combination of utility-aware prioritization and memory resolution is the concrete new piece; it is not a first-principles derivation but a practical assembly of existing multi-tool ideas tailored to this chemistry task. The code release lets anyone check the implementation and the exact baselines, which is helpful.

The main soft spots are the usual ones for this kind of applied work: whether the public dataset captures the distribution of real-world reactions that matter for drug or materials screening, and how well the learned utilities transfer to new reactions outside the evaluation split. The abstract claims consistent gains and highlights the conflict cases, but without the full tables and error bars it is still possible the baselines were not the strongest possible or that some post-selection occurred. No internal contradictions or circular definitions appear in the framing.

This paper is for computational chemists who already run several LLM-based or rule-based tools and want a structured way to combine them rather than for people looking for a new theoretical foundation. A reader who needs a working multi-tool system for feasibility screening would find the architecture and the conflict-resolution results useful. It deserves a serious referee because the experiments are on a public set, the code is out, and the central claim is falsifiable.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces ARMOR, an agentic framework for reaction feasibility prediction in computational chemistry. It explicitly models tool-specific utilities, organizes tools into a performance-based hierarchy for adaptive prioritization, and resolves conflicts among tools via memory-augmented reasoning. Experiments on a public dataset show consistent outperformance versus single-tool baselines and various aggregation/selection methods, with larger gains on reactions exhibiting conflicting tool predictions; code is released.

Significance. If the experimental results hold under scrutiny, the work could meaningfully advance multi-tool reasoning for scientific prediction tasks by replacing heuristic aggregation with utility-aware, conflict-resolving mechanisms. The focus on complementary tool strengths and the public code release are clear strengths that would support reproducibility and extension.

major comments (1)

[Experiments] Experiments section: the central claim of consistent outperformance (especially on conflicting cases) requires explicit reporting of baseline implementations, hyperparameter choices, dataset statistics (size, class balance, reaction types), and statistical significance tests with error bars; without these, it is impossible to verify that gains are not due to implementation differences or post-hoc selection.

minor comments (2)

[Abstract] Abstract: 'memoryaugmented' is missing a hyphen and should read 'memory-augmented'.
[Abstract] The anonymous code link is appropriate for review but should be updated to a permanent repository upon acceptance.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We address the major comment below and will revise the manuscript accordingly to improve clarity and verifiability.

Referee: [Experiments] Experiments section: the central claim of consistent outperformance (especially on conflicting cases) requires explicit reporting of baseline implementations, hyperparameter choices, dataset statistics (size, class balance, reaction types), and statistical significance tests with error bars; without these, it is impossible to verify that gains are not due to implementation differences or post-hoc selection.

Authors: We agree that these details are necessary to fully substantiate the claims and enable independent verification. In the revised manuscript, we will expand the Experiments section with a new subsection that explicitly describes: (i) the precise implementations of all baselines (including any adaptations from original sources), (ii) all hyperparameter values and selection procedures, (iii) full dataset statistics (total size, class balance, and distribution across reaction types), and (iv) statistical significance tests (e.g., paired t-tests or McNemar tests) with error bars on all reported metrics. These additions will be placed before the main results tables to address concerns about implementation differences or post-hoc selection.

Revision: yes
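For readers unfamiliar with the McNemar test named in the response, here is a minimal sketch of its exact (binomial) form in Python. It compares two classifiers on the same reactions using only the cases where exactly one of them is correct. The function name and the boolean-list inputs are assumptions for illustration; nothing here is taken from the paper's code.

```python
from math import comb


def mcnemar_exact(system_correct, baseline_correct):
    """Two-sided exact McNemar test on paired correctness flags.

    Returns (b, c, p_value), where b counts reactions only the system got
    right and c counts reactions only the baseline got right.
    """
    pairs = list(zip(system_correct, baseline_correct))
    b = sum(1 for s, t in pairs if s and not t)
    c = sum(1 for s, t in pairs if t and not s)
    n = b + c
    if n == 0:
        # The two methods never disagree, so there is no evidence either way.
        return b, c, 1.0

    # Under the null hypothesis, discordant pairs split like fair coin flips.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Applying it separately to the full test set and to the conflicting-prediction subset would speak directly to the referee's concern, since the paper's headline claim rests on the latter.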

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces an empirical agentic framework (ARMOR) that models tool utilities, prioritizes tools hierarchically, and resolves conflicts via memory-augmented reasoning for reaction feasibility prediction. Performance claims rest on experiments against baselines on a public dataset with released code; no equations, derivations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce any result to its inputs by construction. The argument is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the framework implicitly assumes that tool utilities can be characterized from patterns and that memory examples are sufficient for conflict resolution, but none are enumerated.

pith-pipeline@v0.9.0 · 5542 in / 1191 out tokens · 61870 ms · 2026-05-11T00:50:37.178031+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Pattern Extraction... Align(Ptj), Cov(Ptj), Conf(Ptj)... top-L tools... K=8 demonstrations

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

[1]

Llama 3 Model Card. (2024). https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md SM Aithal and D Upadhyay

work page 2024
[2]

Anthropic

Feasibility study of the potential use of chemistry based emission predictions for real-time control of modern diesel engines.Applied Energy91, 1 (2012), 475–482. Anthropic

work page 2012
[3]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

Dynamic selection of classifiers—a comprehensive review.Pattern recognition47, 11 (2014), 3665–3680. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

work page 2014
[4]

Yash Chainani, Zhuofu Ni, Kevin M Shebek, Linda J Broadbelt, and Keith EJ Tyo

Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901. Yash Chainani, Zhuofu Ni, Kevin M Shebek, Linda J Broadbelt, and Keith EJ Tyo

work page 2020
[5]

Davide Chicco and Giuseppe Jurman

DORA-XGB: an improved enzymatic reaction feasibility classifier trained using a novel synthetic data approach. Molecular Systems Design & Engineering 10, 2 (2025), 129–142. Davide Chicco and Giuseppe Jurman

work page 2025
[6]

The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation.BMC genomics21, 1 (2020),

work page 2020
[7]

DeepSeek-AI

DESlib: A Dynamic ensemble selection library in Python.Journal of Machine Learning Research21, 8 (2020), 1–5. DeepSeek-AI

work page 2020
[8]

https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. Accessed: 2026-04-29. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

work page 2026
[9]

InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)

Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186. Thomas G Dietterich

work page 2019
[10]

Quzhe Huang, Zhenwei An, Nan Zhuang, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Liwei Chen, Songfang Huang, and Yansong Feng

Deep learning for chemical reaction prediction.Molecular Systems Design & Engineering3, 3 (2018), 442–452. Quzhe Huang, Zhenwei An, Nan Zhuang, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Liwei Chen, Songfang Huang, and Yansong Feng

work page 2018
[11]

William L Jorgensen, Ellen R Laird, Alan J Gushurst, Jan M Fleischer, Scott A Gothe, Harold E Helson, Genevieve D Paderes, and Shenna Sinclair

Chemformer: a pre-trained transformer for computational chemistry.Machine Learning: Science and Technology 3, 1 (2022), 015022. William L Jorgensen, Ellen R Laird, Alan J Gushurst, Jan M Fleischer, Scott A Gothe, Harold E Helson, Genevieve D Paderes, and Shenna Sinclair

work page 2022
[12]

JungZoona

CAMEO: a program for the logical prediction of the products of organic reactions.Pure and Applied Chemistry62, 10 (1990), 1921–1932. JungZoona

work page 1990
[13]

https://huggingface.co/JungZoona/T3Q-qwen2.5-14b-v1.0-e3

T3Q-Qwen2.5-14B-v1.0-e3. https://huggingface.co/JungZoona/T3Q-qwen2.5-14b-v1.0-e3. Accessed: 2026-04-15. Albert HR Ko, Robert Sabourin, and Alceu Souza Britto Jr

work page 2026
[14]

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa

From dynamic classifier selection to dynamic ensemble selection.Pattern recognition41, 5 (2008), 1718–1731. Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa

work page 2008
[15]

Bhavyahshree Navaneetha Krishnan, Adel Heydarabadipour, and Herbert Sauro

Large language models are zero-shot reasoners.Advances in neural information processing systems35 (2022), 22199–22213. Bhavyahshree Navaneetha Krishnan, Adel Heydarabadipour, and Herbert Sauro

work page 2022
[16]

Kusuri Murakumo, Naruki Yoshikawa, Kentaro Rikimaru, Shogo Nakamura, Kairi Furui, Takamasa Suzuki, Hiroyuki Yamasaki, Yuki Nishigaya, Yuzo Takagi, and Masahito Ohue

BioModelsRAG: A Biological Modeling Assistant Using RAG (Retrieval Augmented Generation). arXiv preprint arXiv:2601.22684 (2026). Kusuri Murakumo, Naruki Yoshikawa, Kentaro Rikimaru, Shogo Nakamura, Kairi Furui, Takamasa Suzuki, Hiroyuki Yamasaki, Yuki Nishigaya, Yuzo Takagi, and Masahito Ohue

work page arXiv 2026
[17]

InAI for Accelerated Materials Design-NeurIPS 2023 Workshop

LLM drug discovery challenge: A contest as a feasibility study on the utilization of large language models in medicinal chemistry. InAI for Accelerated Materials Design-NeurIPS 2023 Workshop. Sanggil Park, Herim Han, Hyungjun Kim, and Sunghwan Choi

work page 2023
[18]

Daniel Probst, Philippe Schwaller, and Jean-Louis Reymond

Machine learning applications for chemical reactions.Chemistry–An Asian Journal17, 14 (2022), e202200203. Daniel Probst, Philippe Schwaller, and Jean-Louis Reymond

work page 2022
[19]

Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Xuanhe Zhou, Yufei Huang, Chaojun Xiao, et al

Reaction classification and yield prediction using the differential reaction fingerprint DRFP.Digital discovery1, 2 (2022), 91–97. Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Xuanhe Zhou, Yufei Huang, Chaojun Xiao, et al. 2024a. Tool learning with foundation models.Comput. Surveys57, 4 (2024), 1–40. Yujia Qin, Shihao...

work page 2022
[20]

Lior Rokach

Tool learning with large language models: A survey.Frontiers of Computer Science19, 8 (2025), 198343. Lior Rokach

work page 2025
[21]

Ohad Rubin, Jonathan Herzig, and Jonathan Berant

Ensemble-based classifiers.Artificial intelligence review33, 1 (2010), 1–39. Ohad Rubin, Jonathan Herzig, and Jonathan Berant

work page 2010
[22]

InProceedings of the 2022 conference of the North American chapter of the association for computational linguistics: human language technologies

Learning to retrieve prompts for in-context learning. InProceedings of the 2022 conference of the North American chapter of the association for computational linguistics: human language technologies. 2655–2671. Philippe Schwaller, Teodoro Laino, Théophile Gaudin, Peter Bolgar, Christopher A Hunter, Costas Bekas, and Alpha A Lee

work page 2022
[23]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean

Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction.ACS central science5, 9 (2019), 1572–1583. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean

work page 2019
[24]

OpenAI GPT-5 System Card

OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267 (2025). Rodrigo GF Soares, Alixandre Santana, Anne MP Canuto, and Marcílio Carlos Pereira de Souto

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

InThe 2006 IEEE international joint conference on neural network proceedings

Using accuracy and diversity to select classifiers to build ensembles. InThe 2006 IEEE international joint conference on neural network proceedings. IEEE, 1310–1316. Wendy A Warr

work page 2006
[26]

Tomasz Woloszynski, Marek Kurzynski, Pawel Podsiadlo, and Gwidon W Stachowiak

A short review of chemical reaction database systems, computer-aided synthesis design, reaction prediction and synthetic feasibility.Molecular informatics33, 6-7 (2014), 469–476. Tomasz Woloszynski, Marek Kurzynski, Pawel Podsiadlo, and Gwidon W Stachowiak

work page 2014
[27]

Feng Yang, Juan Liu, Qiang Zhang, Zhihui Yang, Jianghang Liu, and Guangsheng Wu

A measure of competence based on random classification for dynamic ensemble selection. Information Fusion 13, 3 (2012), 207–213. Feng Yang, Juan Liu, Qiang Zhang, Zhihui Yang, Jianghang Liu, and Guangsheng Wu

work page 2012
[28]

Towards global reaction feasibility and robustness prediction with high throughput data and bayesian deep learning.Nature Communications16, 1 (2025),

work page 2025
[29]

A Tool Set Construction and Training Details

Table A1: Dataset for constructing the tool set.

| Split | #Reactions | #Feasible | #Infeasible |
| --- | --- | --- | --- |
| Train | 200,000 | 40,000 | 160,000 |
| Validation | 25,000 | 5,000 | 20,000 |

Training Data. The individual tools are trained using the FREA dataset (Yu et al., 2026), which is derived from the U.S. Patent & Trademark Office (USPTO). The de...

work page 2026
[30]

(1) Classification-based Feasibility Predictors • BERT (Devlin et al., 2019)

and T3Q-Qwen-14B (JungZoona, 2025), respectively, as the base LLMs. (1) Classification-based Feasibility Predictors • BERT (Devlin et al., 2019). We fine-tune bert-base-uncased as a sequence classification baseline for reaction feasibility prediction. Each reaction is represented as a plain text string by concatenating the reactant and product SMILES with ...

work page 2025
[31]

These six retrieved reactions are provided to the LLM alongside the query, and the model is asked to make a feasibility judgment informed by the retrieved evidence

to retrieve the top-3 most similar feasible reactions and top-3 most similar infeasible reactions from the training set via FAISS binary index with Hamming distance. These six retrieved reactions are provided to the LLM alongside the query, and the model is asked to make a feasibility judgment informed by the retrieved evidence. • ChemformerLlama / Chemfo...

work page 2022
[32]

DES-KNN estimates these criteria within the k-nearest neighborhood of each test reaction, whereas DES-Clustering evaluates them within cluster-defined regions

selects tools by jointly considering tool accuracy and diversity. DES-KNN estimates these criteria within the k-nearest neighborhood of each test reaction, whereas DES-Clustering evaluates them within cluster-defined regions. In addition, we include HarderMoE (Huang et al., 2024), which performs dynamic expert routing based on input difficulty. • LLM-base...

work page 2024
[33]

We also include closed-source LLMs, including GPT-5.4-mini (Singh et al., 2025), DeepSeek-v4-flash (DeepSeek-AI,

employs LLM-based scoring to assess tool utility and select tools for each instance, where the same LLM backbone is adopted as in ARMOR for a fair comparison. We also include closed-source LLMs, including GPT-5.4-mini (Singh et al., 2025), DeepSeek-v4-flash (DeepSeek-AI,

work page 2025
[34]

C Impact of Demonstrations Table A2: Impact of the number of demonstrations (K) on accuracy (ACC (%))

and Claude-Sonnet-4.6 (Anthropic, 2026), which are prompted to evaluate tool utilities on the validation set and perform tool selection during testing.

C Impact of Demonstrations

Table A2: Impact of the number of demonstrations (K) on accuracy (ACC (%)).

| # Demonstrations | Overall | Feasible | Infeasible |
| --- | --- | --- | --- |
| K = 0 | 90.80 | 92.53 | 89.07 |
| K = 2 | 91.12 | 92.57 | 89.67 |
| K = 4 | 91.30 | 92.5... | |

work page arXiv 2026

[1] [1]

Llama 3 Model Card. (2024). https://github.com/meta-llama/llama3/ blob/main/MODEL_CARD.md SM Aithal and D Upadhyay

work page 2024

[2] [2]

Anthropic

Feasibility study of the potential use of chemistry based emission predictions for real-time control of modern diesel engines.Applied Energy91, 1 (2012), 475–482. Anthropic

work page 2012

[3] [3]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

Dynamic selection of classifiers—a comprehensive review.Pattern recognition47, 11 (2014), 3665–3680. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

work page 2014

[4] [4]

Yash Chainani, Zhuofu Ni, Kevin M Shebek, Linda J Broadbelt, and Keith EJ Tyo

Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901. Yash Chainani, Zhuofu Ni, Kevin M Shebek, Linda J Broadbelt, and Keith EJ Tyo

work page 2020

[5] [5]

Davide Chicco and Giuseppe Jurman

DORA- XGB: an improved enzymatic reaction feasibility classifier trained using a novel synthetic data approach.Molecular Systems Design & Engineering10, 2 (2025), 129–142. Davide Chicco and Giuseppe Jurman

work page 2025

[6] [6]

The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation.BMC genomics21, 1 (2020),

work page 2020

[7] [7]

DeepSeek-AI

DESlib: A Dynamic ensemble selection library in Python.Journal of Machine Learning Research21, 8 (2020), 1–5. DeepSeek-AI

work page 2020

[8] [8]

https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/ DeepSeek_V4.pdf

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/ DeepSeek_V4.pdf. Accessed: 2026-04-29. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

work page 2026

[9] [9]

InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)

Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186. Thomas G Dietterich

work page 2019

[10] [10]

Quzhe Huang, Zhenwei An, Nan Zhuang, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Liwei Chen, Songfang Huang, and Yansong Feng

Deep learning for chemical reaction prediction.Molecular Systems Design & Engineering3, 3 (2018), 442–452. Quzhe Huang, Zhenwei An, Nan Zhuang, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Liwei Chen, Songfang Huang, and Yansong Feng

work page 2018

[11] [11]

William L Jorgensen, Ellen R Laird, Alan J Gushurst, Jan M Fleischer, Scott A Gothe, Harold E Helson, Genevieve D Paderes, and Shenna Sinclair

Chemformer: a pre-trained transformer for computational chemistry.Machine Learning: Science and Technology 3, 1 (2022), 015022. William L Jorgensen, Ellen R Laird, Alan J Gushurst, Jan M Fleischer, Scott A Gothe, Harold E Helson, Genevieve D Paderes, and Shenna Sinclair

work page 2022

[12] [12]

JungZoona

CAMEO: a program for the logical prediction of the products of organic reactions.Pure and Applied Chemistry62, 10 (1990), 1921–1932. JungZoona

work page 1990

[13] [13]

https://huggingface.co/JungZoona/ T3Q-qwen2.5-14b-v1.0-e3

T3Q-Qwen2.5-14B-v1.0-e3. https://huggingface.co/JungZoona/ T3Q-qwen2.5-14b-v1.0-e3. Accessed: 2026-04-15. Albert HR Ko, Robert Sabourin, and Alceu Souza Britto Jr

work page 2026

[14] [14]

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa

From dynamic classifier selection to dynamic ensemble selection.Pattern recognition41, 5 (2008), 1718–1731. Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa

work page 2008

[15] [15]

Bhavyahshree Navaneetha Krishnan, Adel Heydarabadipour, and Herbert Sauro

Large language models are zero-shot reasoners.Advances in neural information processing systems35 (2022), 22199–22213. Bhavyahshree Navaneetha Krishnan, Adel Heydarabadipour, and Herbert Sauro

work page 2022

[16] [16]

Kusuri Murakumo, Naruki Yoshikawa, Kentaro Rikimaru, Shogo Nakamura, Kairi Furui, Takamasa Suzuki, Hiroyuki Yamasaki, Yuki Nishigaya, Yuzo Takagi, and Masahito Ohue

BioModel- sRAG: A Biological Modeling Assistant Using RAG (Retrieval Augmented Generation).arXiv preprint arXiv:2601.22684(2026). Kusuri Murakumo, Naruki Yoshikawa, Kentaro Rikimaru, Shogo Nakamura, Kairi Furui, Takamasa Suzuki, Hiroyuki Yamasaki, Yuki Nishigaya, Yuzo Takagi, and Masahito Ohue

work page arXiv 2026

[17] [17]

InAI for Accelerated Materials Design-NeurIPS 2023 Workshop

LLM drug discovery challenge: A contest as a feasibility study on the utilization of large language models in medicinal chemistry. InAI for Accelerated Materials Design-NeurIPS 2023 Workshop. Sanggil Park, Herim Han, Hyungjun Kim, and Sunghwan Choi

work page 2023

[18] [18]

Daniel Probst, Philippe Schwaller, and Jean-Louis Reymond

Machine learning applications for chemical reactions.Chemistry–An Asian Journal17, 14 (2022), e202200203. Daniel Probst, Philippe Schwaller, and Jean-Louis Reymond

work page 2022

[19] [19]

Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Xuanhe Zhou, Yufei Huang, Chaojun Xiao, et al

Reaction classification and yield prediction using the differential reaction fingerprint DRFP.Digital discovery1, 2 (2022), 91–97. Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Xuanhe Zhou, Yufei Huang, Chaojun Xiao, et al. 2024a. Tool learning with foundation models.Comput. Surveys57, 4 (2024), 1–40. Yujia Qin, Shihao...

work page 2022

[20] [20]

Lior Rokach

Tool learning with large language models: A survey.Frontiers of Computer Science19, 8 (2025), 198343. Lior Rokach

work page 2025

[21] [21]

Ohad Rubin, Jonathan Herzig, and Jonathan Berant

Ensemble-based classifiers.Artificial intelligence review33, 1 (2010), 1–39. Ohad Rubin, Jonathan Herzig, and Jonathan Berant

work page 2010

[22] [22]

InProceedings of the 2022 conference of the North American chapter of the association for computational linguistics: human language technologies

Learning to retrieve prompts for in-context learning. InProceedings of the 2022 conference of the North American chapter of the association for computational linguistics: human language technologies. 2655–2671. Philippe Schwaller, Teodoro Laino, Théophile Gaudin, Peter Bolgar, Christopher A Hunter, Costas Bekas, and Alpha A Lee

work page 2022

[23] [23]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean

Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction.ACS central science5, 9 (2019), 1572–1583. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean

[24]

OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267 (2025). Rodrigo GF Soares, Alixandre Santana, Anne MP Canuto, and Marcílio Carlos Pereira de Souto.

[25]

Using accuracy and diversity to select classifiers to build ensembles. In The 2006 IEEE international joint conference on neural network proceedings. IEEE, 1310–1316. Wendy A Warr.

[26]

A short review of chemical reaction database systems, computer-aided synthesis design, reaction prediction and synthetic feasibility. Molecular informatics 33, 6-7 (2014), 469–476. Tomasz Woloszynski, Marek Kurzynski, Pawel Podsiadlo, and Gwidon W Stachowiak.

[27]

A measure of competence based on random classification for dynamic ensemble selection. Information Fusion 13, 3 (2012), 207–213. Feng Yang, Juan Liu, Qiang Zhang, Zhihui Yang, Jianghang Liu, and Guangsheng Wu.

[28]

Towards global reaction feasibility and robustness prediction with high throughput data and Bayesian deep learning. Nature Communications 16, 1 (2025),

[29]

A Tool Set Construction and Training Details

Table A1: Dataset for constructing the tool set.

Split        #Reactions   #Feasible   #Infeasible
Train        200,000      40,000      160,000
Validation   25,000       5,000       20,000

Training Data. The individual tools are trained using the FREA dataset (Yu et al., 2026), which is derived from the U.S. Patent & Trademark Office (USPTO). The de...

[30]

and T3Q-Qwen-14B (JungZoona, 2025), respectively, as the base LLMs.

(1) Classification-based Feasibility Predictors

• BERT (Devlin et al., 2019). We fine-tune bert-base-uncased as a sequence classification baseline for reaction feasibility prediction. Each reaction is represented as a plain text string by concatenating the reactant and product SMILES with ...

[31]

to retrieve the top-3 most similar feasible reactions and the top-3 most similar infeasible reactions from the training set via a FAISS binary index with Hamming distance. These six retrieved reactions are provided to the LLM alongside the query, and the model is asked to make a feasibility judgment informed by the retrieved evidence.

• ChemformerLlama / Chemfo...
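The retrieval step described above (top-3 feasible and top-3 infeasible neighbors under Hamming distance) can be illustrated with a minimal sketch. It is not the paper's implementation: it replaces the FAISS binary index with a brute-force scan in plain Python, and the function names, the integer-encoded fingerprints, and the toy 8-bit data are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: each reaction is (identifier, fingerprint),
# where the binary fingerprint is stored as a Python int.
Reaction = Tuple[str, int]


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two integer-encoded fingerprints."""
    return bin(a ^ b).count("1")


def top_k_similar(query: int, pool: List[Reaction], k: int = 3) -> List[str]:
    """Return the identifiers of the k reactions closest to the query.

    Brute-force scan; ties keep the order in which reactions appear in `pool`.
    """
    ranked = sorted(pool, key=lambda item: hamming(query, item[1]))
    return [name for name, _ in ranked[:k]]


def retrieve_demonstrations(
    query: int, feasible: List[Reaction], infeasible: List[Reaction], k: int = 3
) -> Dict[str, List[str]]:
    """Retrieve k feasible and k infeasible neighbors (six in total for k=3)."""
    return {
        "feasible": top_k_similar(query, feasible, k),
        "infeasible": top_k_similar(query, infeasible, k),
    }


# Toy 8-bit fingerprints, purely illustrative.
feasible_pool = [("f1", 0b10110010), ("f2", 0b10110011),
                 ("f3", 0b00001111), ("f4", 0b10100010)]
infeasible_pool = [("i1", 0b01001101), ("i2", 0b10110000),
                   ("i3", 0b11110010), ("i4", 0b10010010)]

demos = retrieve_demonstrations(0b10110010, feasible_pool, infeasible_pool)
print(demos)
```

A brute-force scan is linear in the size of the training set per query, which is presumably why an index such as FAISS is used at the 200,000-reaction scale reported in Table A1.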

[32]

selects tools by jointly considering tool accuracy and diversity. DES-KNN estimates these criteria within the k-nearest neighborhood of each test reaction, whereas DES-Clustering evaluates them within cluster-defined regions. In addition, we include HarderMoE (Huang et al., 2024), which performs dynamic expert routing based on input difficulty.

• LLM-base...
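To make the DES-KNN baseline concrete, the sketch below shows one possible two-stage reading of "accuracy and diversity within the k-nearest neighborhood": shortlist the tools that are most accurate on the neighbors, then keep the ones that disagree most with the rest of the shortlist. The exact selection rule, the use of Hamming distance over integer fingerprints, the function name `des_knn_select`, and the toy data are assumptions for illustration, not details taken from the paper.

```python
from typing import Dict, List


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two integer-encoded fingerprints."""
    return bin(a ^ b).count("1")


def des_knn_select(
    query_fp: int,
    val_fps: List[int],
    val_labels: List[int],
    val_preds: Dict[str, List[int]],
    k: int = 3,
    n_accurate: int = 3,
    n_diverse: int = 2,
) -> List[str]:
    """Select tools for one query from their behavior on nearby validation data."""
    # Stage 0: indices of the k validation reactions closest to the query.
    order = sorted(range(len(val_fps)), key=lambda i: hamming(query_fp, val_fps[i]))
    neighbors = order[:k]

    # Stage 1: keep the tools with the highest accuracy on the neighborhood.
    def local_accuracy(tool: str) -> float:
        hits = sum(val_preds[tool][i] == val_labels[i] for i in neighbors)
        return hits / len(neighbors)

    accurate = sorted(val_preds, key=local_accuracy, reverse=True)[:n_accurate]

    # Stage 2: among those, prefer tools that disagree most with the others.
    def diversity(tool: str) -> float:
        others = [t for t in accurate if t != tool]
        if not others:
            return 0.0
        disagreements = sum(
            val_preds[tool][i] != val_preds[o][i] for o in others for i in neighbors
        )
        return disagreements / (len(others) * len(neighbors))

    return sorted(accurate, key=diversity, reverse=True)[:n_diverse]


# Toy validation set: 4-bit fingerprints, binary feasibility labels,
# and per-tool predictions for tools named A to D (all hypothetical).
val_fps = [0b0001, 0b0011, 0b0111, 0b1111, 0b1000]
val_labels = [1, 1, 0, 0, 1]
val_preds = {
    "A": [1, 1, 0, 1, 0],
    "B": [1, 0, 0, 0, 1],
    "C": [0, 0, 1, 0, 1],
    "D": [1, 1, 1, 0, 1],
}

selected = des_knn_select(0b0001, val_fps, val_labels, val_preds)
print(selected)
```

The selected tools' predictions on the query would then be combined, for example by majority vote. Note that in this toy run the diversity stage discards the locally most accurate tool, which illustrates the accuracy-diversity trade-off that such baselines must balance.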

[33]

employs LLM-based scoring to assess tool utility and select tools for each instance, using the same LLM backbone as ARMOR for a fair comparison. We also include closed-source LLMs: GPT-5.4-mini (Singh et al., 2025), DeepSeek-v4-flash (DeepSeek-AI,

[34]

and Claude-Sonnet-4.6 (Anthropic, 2026), which are prompted to evaluate tool utilities on the validation set and perform tool selection during testing.

C Impact of Demonstrations

Table A2: Impact of the number of demonstrations (K) on accuracy (ACC (%)).

# Demonstration   Overall   Feasible   Infeasible
K = 0             90.80     92.53      89.07
K = 2             91.12     92.57      89.67
K = 4             91.30     92.5...
