SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

Kelvin Kiu Wai Tam; Newt Nguyen Kim Hue Nam; Rui Wang; Tianqing Fang; Tianshi Zheng; Wei Fan; Xiyun Li; Yangqiu Song

arxiv: 2605.01489 · v2 · pith:474OF7OJnew · submitted 2026-05-02 · 💻 cs.AI · cs.CL

Tianshi Zheng, Rui Wang, Xiyun Li, Kelvin Kiu Wai Tam, Newt Nguyen Kim Hue Nam, Wei Fan, Yangqiu Song, Tianqing Fang

Pith reviewed 2026-07-01 00:15 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords SciResearcher · deep research agents · frontier scientific reasoning · agent foundation model · task synthesis · supervised fine-tuning · agentic reinforcement learning · scientific benchmarks

The pith

An automated framework synthesizes academic-grounded tasks to train an 8B agent that sets new benchmarks on frontier biology and chemistry reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SciResearcher as a method to automatically build training data for AI agents that handle frontier scientific problems. It creates conceptual and computational tasks drawn from academic papers to develop skills in gathering information, using tools, and reasoning over long sequences. These data are used for supervised fine-tuning followed by agentic reinforcement learning on an 8B model. The resulting SciResearcher-8B reaches 19.46 percent on the HLE-Bio/Chem-Gold benchmark and posts 13-15 point gains on SuperGPQA-Hard-Biology and TRQA-Literature. A reader would care because the work shows a route to capable scientific agents that does not require hand-curated data or models larger than 8B parameters.

Core claim

We introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks.

What carries the argument

SciResearcher, an automated agentic framework that synthesizes conceptual and computational tasks from academic sources to produce training data for information-seeking and long-horizon reasoning.

If this is right

Supervised fine-tuning plus agentic reinforcement learning on the synthesized tasks produces measurable gains on hard biology and literature benchmarks.
An 8B-scale model can exceed the performance of several larger proprietary agents on the reported science evaluations.
The framework provides a scalable alternative to knowledge-graph or web-browsing data pipelines for frontier scientific domains.
The same data-construction loop can be repeated to generate additional training examples without manual curation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same synthesis loop could be applied to physics or mathematics papers to test whether similar gains appear on those benchmarks.
Open models trained this way may narrow the gap with closed systems that rely on proprietary web-scale data.
If the tasks successfully train long-horizon tool use, the method could extend to multi-step experimental design agents.

Load-bearing premise

Tasks created by synthesizing academic evidence will produce capabilities that transfer to the held-out science benchmarks.

What would settle it

If an 8B model retrained on the same base data but without the synthesized academic tasks matched or exceeded 19.46% on HLE-Bio/Chem-Gold, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.01489 by Kelvin Kiu Wai Tam, Newt Nguyen Kim Hue Nam, Rui Wang, Tianqing Fang, Tianshi Zheng, Wei Fan, Xiyun Li, Yangqiu Song.

**Figure 1.** Performance comparison on HLE-Bio/Chem-Gold (…

**Figure 2.** Comparison of ontology and web presence be…

**Figure 2.** Overview of our SciResearcher data construction framework. … the eval_urls tool, which applies four metrics (model exclusiveness, search identifiability, computational complexity, and LLM unfamiliarity) to support comprehensive assessment. Third, subagents are deployed to conduct a deep dive into the final selected URLs, extracting the complete model specification together with the scenarios and constraints…

**Figure 3.** Overview of our SciResearcher data construction framework. … specific and concrete to support further evidence-grounded expansion. After selecting the best anchor, we invoke a new web agent instance to gather additional academic evidence about that anchor and generate a new question whose answer is exactly the anchor entity. This newly generated question is then fused back into the previous question by repla…

**Figure 3.** A running example of a question evolution pipeline for conceptual task curation. Question fusion and…

**Figure 4.** A running example of a question evolution pipeline for conceptual task curation. Question…

**Figure 4.** (a) Word clouds of the curated questions from the two pipelines. (b) Distribution and performance of…

**Figure 5.** (a) Word clouds of the curated questions from the two pipelines. (b) Distribution and…

**Figure 5.** (a) Distribution of trajectory lengths (in macro steps) for SFT and RL checkpoints. (b) Distribution of…

**Figure 6.** (a) Distribution of trajectory lengths (in macro steps) for SFT and RL checkpoints. (b)…

**Figure 6.** Dataset overlap analysis. (a) t-SNE projection of question embeddings, using 30 sampled questions per…
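The anchor-based question evolution described in the framework overview text above can be illustrated with a minimal Python sketch. It covers only two mechanical pieces: checking that a candidate anchor appears in the question stem but in none of the answer options, and fusing a replacement description back into the question in place of the anchor. The function names `is_valid_anchor` and `fuse_question`, the plain string substitution, and the placeholder question are all assumptions made for illustration; the paper's pipeline relies on web agents and LLM calls rather than string matching, and its actual fusion step is truncated in the source.

```python
import re


def _entity_pattern(entity: str) -> re.Pattern:
    # Match the entity as a whole token, even if it ends in punctuation.
    return re.compile(rf"(?<!\w){re.escape(entity)}(?!\w)", re.IGNORECASE)


def is_valid_anchor(entity: str, question: str, options: list[str]) -> bool:
    """Hypothetical check: the anchor is in the stem and in no option."""
    if not entity:
        return False
    pattern = _entity_pattern(entity)
    in_question = pattern.search(question) is not None
    in_options = any(pattern.search(opt) for opt in options)
    return in_question and not in_options


def fuse_question(question: str, anchor: str, description: str) -> str:
    """Hypothetical fusion: swap the first anchor mention for a description.

    `description` stands in for the newly generated sub-question whose
    answer is exactly the anchor entity.
    """
    pattern = _entity_pattern(anchor)
    # A callable replacement avoids backslash escapes being interpreted.
    fused, count = pattern.subn(lambda _m: description, question, count=1)
    if count == 0:
        raise ValueError("anchor not found in question")
    return fused
```

With a placeholder question such as "In the cited study, which downstream effect follows inhibition of AXL?", a valid anchor "AXL" would be replaced by the indirect description, so a solver must first resolve the description before answering the original question. That extra hop is what makes the evolved question longer-horizon.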

Original abstract

Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep research agents offer a promising approach to this challenge. These models develop robust problem-solving capabilities through post-training on information-seeking tasks, which are typically curated via knowledge graph construction or iterative web browsing. However, these strategies face inherent limitations in frontier science, where domain-specific knowledge is scattered across sparse and heterogeneous academic sources, and problem solving requires sophisticated computation and reasoning far beyond factual recall. To bridge this gap, we introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks. Overall, SciResearcher introduces a new paradigm for automated data construction for frontier scientific reasoning and offers a scalable path toward future scientific agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The abstract outlines a data synthesis loop for scientific agents but supplies zero ablations or validation that the generated tasks actually produce the claimed benchmark gains.

The core idea is an automated agentic pipeline that pulls academic papers and turns them into conceptual plus computational tasks meant to train information-seeking and tool-using behavior in bio/chem domains. That target is sensible; existing KG and web-browsing routes do struggle with sparse, computation-heavy frontier material.

What stands out is the explicit framing around long-horizon reasoning and tool integration rather than pure recall. The reported numbers (an 8B model at 19.46% on HLE-Bio/Chem-Gold and 13-15 point lifts on SuperGPQA-Hard-Biology and TRQA-Literature) are presented as new SOTA at that scale.

The weak link is exactly the one the stress-test flags: nothing shows that the synthesis step, rather than generic fine-tuning or data volume, drives the scores. No task-quality checks, no error rates on the generated problems, no ablation removing the agentic loop, and no comparison against simpler curation baselines. The abstract states the outcome without the supporting measurements.

A reader already building agent training pipelines for science might skim the high-level loop for ideas. Anyone needing reproducible methods or evidence that the tasks transfer will find the current version thin.

I would not send this to referees in its present form; the central performance claim rests on an untested premise.

Referee Report

2 major / 0 minor

Summary. The paper introduces SciResearcher, a fully automated agentic framework that synthesizes diverse conceptual and computational tasks grounded in academic evidence to train deep research agents. These tasks are designed to elicit information acquisition, tool-integrated reasoning, and long-horizon capabilities. The resulting SciResearcher-8B model, trained via supervised fine-tuning and agentic reinforcement learning on the curated data, achieves 19.46% on the HLE-Bio/Chem-Gold benchmark (new SOTA at its scale, surpassing some larger proprietary agents) and 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature.

Significance. If the performance improvements are shown to stem specifically from the automated synthesis of grounded tasks rather than generic post-training, the work could offer a scalable paradigm for constructing training data in sparse, heterogeneous scientific domains where traditional knowledge-graph or web-browsing approaches fall short. This would strengthen the case for agent foundation models in automated scientific discovery.

major comments (2)

[Abstract and §4 (Results)] The central performance claims (19.46% on HLE-Bio/Chem-Gold and 13-15% gains on the other two benchmarks) are presented as resulting from the SciResearcher synthesis method, yet no ablation studies isolate the contribution of the synthesized conceptual/computational tasks, no validation of task grounding accuracy is reported, and no comparison to generic fine-tuning baselines is provided. This leaves the transfer from synthesized tasks to benchmark gains unsupported.
[§3 (Method)] The description of the automated framework for task synthesis lacks any quantitative analysis or error analysis showing that the generated tasks correctly elicit information acquisition, tool use, and long-horizon reasoning without introducing factual or computational errors from the academic sources.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional empirical support would strengthen our claims regarding the SciResearcher framework. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Abstract and §4 (Results)] The central performance claims (19.46% on HLE-Bio/Chem-Gold and 13-15% gains on the other two benchmarks) are presented as resulting from the SciResearcher synthesis method, yet no ablation studies isolate the contribution of the synthesized conceptual/computational tasks, no validation of task grounding accuracy is reported, and no comparison to generic fine-tuning baselines is provided. This leaves the transfer from synthesized tasks to benchmark gains unsupported.

Authors: We agree that the current manuscript would benefit from explicit ablations to isolate the contribution of the agentic task synthesis. In the revised version we will add: (i) a generic fine-tuning baseline using standard SFT on raw academic passages without the conceptual/computational task synthesis step; (ii) human validation results on a random sample of 200 synthesized tasks measuring factual and computational grounding accuracy; and (iii) an analysis correlating specific task features (e.g., number of tool calls, horizon length) with downstream benchmark gains. These additions will provide direct evidence for the transfer from synthesized tasks to the reported improvements. revision: yes
Referee: [§3 (Method)] The description of the automated framework for task synthesis lacks any quantitative analysis or error analysis showing that the generated tasks correctly elicit information acquisition, tool use, and long-horizon reasoning without introducing factual or computational errors from the academic sources.

Authors: We acknowledge the absence of quantitative validation in the current §3. The revised manuscript will include: (i) aggregate error statistics from the synthesis pipeline (factual error rate via automated checks plus human review of 300 tasks, computational error rate on code-generation tasks); (ii) distributional statistics on elicited behaviors (e.g., average number of information-acquisition steps, tool invocations, and reasoning horizon length across the dataset); and (iii) a small-scale human study confirming that the generated tasks require the intended capabilities. These metrics will be reported alongside the existing framework description. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with no self-referential derivations or fitted predictions

The paper describes an automated data synthesis framework (SciResearcher) for training an 8B agent model via supervised fine-tuning and agentic RL, then reports benchmark scores. No equations, parameter-fitting procedures, uniqueness theorems, or self-citations appear in the abstract or described content. The performance claims rest on external benchmark evaluations rather than any reduction of outputs to inputs by construction. The derivation chain is therefore self-contained as an empirical pipeline without the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no information on free parameters, background axioms, or newly postulated entities.

pith-pipeline@v0.9.1-grok · 5793 in / 1145 out tokens · 31420 ms · 2026-07-01T00:15:42.009571+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SciLens: Multi-modal Scientific Claim Verification with Agentic Entailment and Grounding
cs.CL 2026-06 unverdicted novelty 5.0

SciLens introduces an evidence-conditioned atomic entailment framework that grounds claims to modality-specific witnesses in tables and figures, achieving 79.2% macro-F1 on SciClaimEval.

Reference graph

Works this paper leans on

49 extracted references · 2 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Scibench: Evaluating college-level scientific problem-solving abilities of large language models. Preprint, arXiv:2307.10635. Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. 2025. Browsecomp: A simple yet challenging benchmark for browsing agents.P...
[2] From automation to autonomy: A survey on large language models in scientific discovery. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17733–17750, Suzhou, China. Association for Computational Linguistics. Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng,...
[3] Include the seed entity or be directly grounded in it
[4] Be concise but scientifically meaningful
[5] Be answerable from a single authoritative academic source at this stage
[6] Prefer multiple-choice format with plausible confounders, while allowing short-answer format when more appropriate
[7] Avoid shortcuts that can be solved by trivia, superficial keyword matching, or generic web search without reading the academic evidence
[8] Be suitable as the semantic backbone for later anchor-based augmentation.> ## Pre-Action Protocol: Plan Before Searching <Metric Definition> <Before browsing, understand the seed entity and its scientific context. Plan 3-5 diverse search queries that target academic sources such as peer-reviewed papers, domain databases, preprints, and reputable scientif...
[9] Meticulousness and persistence in finding high-quality academic evidence
[10] Task decomposition: search -> evidence extraction -> question generation -> verification
[11] Adaptive error handling and reuse of progress state when searches fail or evidence is insufficient
[12] Multi-query scout search and URL selection based on relevance, venue quality, source diversity, and scientific specificity
[13] Use of the url2evidence sub-agent to access selected academic sources, extract key supporting evidence, and distinguish stand-alone scientific facts from study-specific artifacts
[14] Evidence quality checks, including source authority, evidence-answer entailment, and avoidance of unsupported assumptions
[15] Question formulation with plausible, unbiased, and challenging confounders for MCQs; clear expected answer for short-answer questions; and final quality checks
[16] Multi-tool coordination following the typical workflow: scout search -> source selection -> url2evidence -> question generation -> verification. ## Output Format The final output MUST be a JSON object with the following structure: '''json { "question": "The question text containing or directly grounded in the seed entity", "answer": "The correct answer co...
[17] **Domain-specific**: It is a concrete scientific entity, such as a gene, protein, pathway, compound, species, technique, disease, mutation, phenotype, material, model, or other scientific concept
[18] **Question-body only**: It appears in the question stem but does NOT appear in the correct answer or any confounder
[19] **Decisive**: The question becomes substantially harder or unanswerable if this entity is masked or removed
[20] **Specific and concrete**: It is sufficiently specific to support further evidence-grounded browsing and question generation. ## Your Task Given the question, correct answer(s), and confounders below, you must:
[21] Identify candidate anchor entities in the question body
[22] Verify that each candidate does NOT appear in the correct answer or any confounder
[23] Evaluate whether each candidate is decisive for deriving the final answer
[24] Select the most decisive, specific, and concrete entity
[25] If no valid anchor exists, return an empty string. ## Selection Criteria (in priority order)
[26] Prefer the MOST SPECIFIC entity, e.g., "AXL" over "receptor tyrosine kinase"
[27] Prefer entities that constrain the answer, such that removing them makes multiple answers plausible
[28] Prefer named entities, such as gene, protein, compound, disease, pathway, or model names, over generic scientific terms
[29] Prefer entities that are decoupled from the surface form of the answer options
[30] If multiple candidates exist, choose the one most central to the scientific claim. ## Output Format Return ONLY valid JSON: { "candidates": [ { "entity": "...", "in_question": true, "in_options": false, "is_decisive": true } ], "anchor_entity": "<the single valid anchor entity string, or empty string if none>", "entity_type": "<type: gene|protein|pathway|...
[31] Search identifiability
[32] Computational complexity
[33] LLM unfamiliarity. Also consider URL validity and whether the source clearly contains a usable computational or numerical model. ### Level 3: Detailed Model Extraction with url2evidence Use the url2evidence sub-agent to conduct a deep dive into the final selected source or sources. Extract the complete model specification, including:
[34] Model name and scientific purpose
[35] Variable definitions
[36] Parameter definitions and units
[37] Applicable scenario and constraints
[38] Any assumptions required for correct model use. ## Model Selection Criteria Select a model that satisfies as many of the following criteria as possible:
[39] The model supports calculable numerical outputs
[40] The model is described in a real, citable academic source
[41] The equations are nontrivial and not merely standard textbook formulas
[42] The computation requires meaningful model instantiation or numerical solving
[43] The model can support a realistic scenario-based scientific question
[44] The source is relatively recent, niche, or unlikely to be memorized by LLMs
[45] The model is clearly associated with the seed entity or its scientific domain. ## What Counts as a Frontier Numerical Model? <A model with explicit mathematical structure, such as governing equations, ODE/PDE systems, kinetic models, dose-response models, mechanistic simulations, quantitative biological or chemical models, or other computational formulati...
[46] Search for and identify the relevant model
[47] Extract the model equations and constraints from the paper
[48] Instantiate the model in a concrete scientific scenario
[49] Write and execute a Python solver to compute a numerical answer. First, perform preliminary validity checks. Then evaluate the article according to the four core metrics used for computational task curation. ## Preliminary Check 1: URL Validity <Metric Definition> <Determine whether the URL corresponds to a real and accessible academic source, such as a p...
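The URL assessment step named in the extracted prompt text (preliminary validity checks followed by four core metrics) could be organized roughly as in the sketch below. The four metric names and the `is_valid_url` key come from the page; everything else is an assumption made for illustration, including the `UrlAssessment` type, the `has_usable_model` flag, the 0.0 to 1.0 score scale, and the unweighted mean. The source does not state how the paper's eval_urls tool scores or combines the metrics.

```python
from dataclasses import dataclass

# Metric names as listed on the page; the snake_case spelling is assumed.
METRICS = (
    "model_exclusiveness",
    "search_identifiability",
    "computational_complexity",
    "llm_unfamiliarity",
)


@dataclass(frozen=True)
class UrlAssessment:
    """Hypothetical record for one candidate source."""
    url: str
    is_valid_url: bool       # preliminary check: real, accessible source
    has_usable_model: bool   # assumed flag: source contains a numerical model
    scores: dict[str, float]  # assumed scale: 0.0 (worst) to 1.0 (best)


def aggregate(a: UrlAssessment) -> float:
    """Mean of the four metric scores, or 0.0 if a preliminary check fails."""
    if not (a.is_valid_url and a.has_usable_model):
        return 0.0
    missing = [m for m in METRICS if m not in a.scores]
    if missing:
        raise KeyError(f"missing metric scores: {missing}")
    return sum(a.scores[m] for m in METRICS) / len(METRICS)


def select_urls(assessments: list[UrlAssessment], top_k: int = 1) -> list[str]:
    """Rank candidates by aggregate score and keep the top_k passing ones."""
    ranked = sorted(assessments, key=aggregate, reverse=True)
    return [a.url for a in ranked[:top_k] if aggregate(a) > 0.0]
```

Gating on the preliminary checks before averaging means an inaccessible source can never be selected, however high its metric scores. A weighted combination or per-metric thresholds would be equally plausible designs.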
