EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

Baiyu Huang; Boyu Zhu; Chao Chen; Fei Mi; Heyuan Deng; Lifeng Shang; Mengyi Deng; Minrui Xu; Xiao Zhu; Xingshan Zeng

arxiv: 2605.18703 · v1 · pith:P3UMIPJ2new · submitted 2026-05-18 · 💻 cs.CL · cs.LG


Minrui Xu, Zilin Wang, Mengyi DENG, Zhiwei Li, Zhicheng Yang, Xiao Zhu, Yinhong Liu, Boyu Zhu, Baiyu Huang, Chao Chen, Heyuan Deng, Fei Mi, Lifeng Shang, Xingshan Zeng, Zhijiang Guo


Pith reviewed 2026-05-20 10:14 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords envfactory · environments · agentic · robust · training · trajectories · challenges · depend


The pith

EnvFactory scales tool-use agents by synthesizing verified environments and natural trajectories for efficient Agentic RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EnvFactory as a way to overcome the lack of scalable environments and realistic data for training LLMs to use tools via reinforcement learning. It autonomously builds and checks stateful executable environments from real sources and creates multi-turn trajectories that mimic natural human queries with hidden intents. Using just 85 such environments in seven areas, it produces thousands of training examples that boost model scores on several benchmarks. This matters because it offers a practical path to larger-scale agent training without relying on costly or unreliable external systems.

Core claim

EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, it generates 2,575 SFT and RL trajectories that improve performance on tool-use and conversational benchmarks.

What carries the argument

The EnvFactory framework for autonomous environment verification and topology-aware trajectory synthesis in Agentic RL.

Load-bearing premise

The generated environments remain robust under RL training and the trajectories capture implicit human reasoning without artifacts that hurt generalization.

What would settle it

A drop in performance on held-out benchmarks when using the generated trajectories instead of real-world data would show that the synthetic environments and trajectories do not provide sufficient training signal.

Original abstract

Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, reducing their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using significantly fewer environments than prior work, which are often 5 times more, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including $\tau^2$-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

EnvFactory automates synthesis of stateful tool environments and implicit-intent trajectories from real sources, delivering reported benchmark gains with a small set of environments but thin details on verification robustness.


The main thing to know is that this paper automates the creation of executable, stateful tool environments from authentic resources and then generates multi-turn trajectories that aim for natural implicit human intents rather than explicit instructions. The authors verify 85 environments across 7 domains, produce 2,575 SFT and RL trajectories, and claim improvements of up to 15% on BFCLv3, 8.6% on MCP-Atlas, and 6% on conversational benchmarks for Qwen3 models. The efficiency angle stands out because prior work often uses far more environments or relies on costly APIs or LLM simulators.

Referee Report

2 major / 2 minor

Summary. The paper introduces EnvFactory, a fully automated framework that autonomously explores and verifies stateful executable tool environments from authentic resources and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement. Using only 85 verified environments across 7 domains, it generates 2,575 SFT and RL trajectories and reports performance improvements of up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks for Qwen3-series models, claiming better efficiency than prior work using more environments.

Significance. If the robustness and generalization claims hold, this work could meaningfully advance scalable Agentic RL for tool-use by automating environment construction and trajectory synthesis, reducing dependence on real-world APIs and manual data curation while capturing implicit human reasoning more effectively than over-specified synthetic data.

major comments (2)

[§3] §3 (Environment Synthesis and Verification): The autonomous verification process confirms initial executability and basic state transitions, but the manuscript provides no explicit mechanism for dynamic re-verification or failure recovery when RL policies reach novel state combinations or edge cases during training. This directly undermines the central claim that the 85 environments remain robust and stateful throughout Agentic RL, potentially invalidating the 2,575 trajectories and downstream gains on BFCLv3 and MCP-Atlas.
[§5] §5 (Experiments and Results): The reported benchmark improvements lack details on baseline implementations, statistical significance testing, variance across runs, or controls for data leakage and selection effects. Without these, the claim of superior training efficiency with fewer environments cannot be fully evaluated and risks being driven by unaccounted factors.

minor comments (2)

[Abstract] The notation τ²-Bench in the abstract and results should include a brief definition or citation to clarify whether it is a standard benchmark or one introduced in this work.
[Figure 2] Ensure all figures showing trajectory examples or environment topologies have clear captions explaining how they demonstrate implicit intents versus over-specified instructions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us clarify key aspects of robustness and experimental rigor. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our results.


Referee: [§3] §3 (Environment Synthesis and Verification): The autonomous verification process confirms initial executability and basic state transitions, but the manuscript provides no explicit mechanism for dynamic re-verification or failure recovery when RL policies reach novel state combinations or edge cases during training. This directly undermines the central claim that the 85 environments remain robust and stateful throughout Agentic RL, potentially invalidating the 2,575 trajectories and downstream gains on BFCLv3 and MCP-Atlas.

Authors: We appreciate the referee highlighting the importance of ongoing robustness during RL training. The verification procedure in §3 is intentionally designed around authentic resources to confirm both executability and core state transitions, and the topology-aware sampling strategy for trajectory generation systematically explores diverse state combinations within those verified environments. We acknowledge, however, that the original manuscript did not explicitly describe handling for entirely novel states that might arise under policy updates. In the revised manuscript, we have added a dedicated paragraph in §3.3 outlining a lightweight dynamic monitoring and recovery protocol: state transitions are logged during RL, and any unverified edge case triggers an automated re-exploration and verification step using the same autonomous pipeline. This addition directly supports the robustness claim for the 85 environments and the generated trajectories without changing the reported performance numbers. revision: yes
Referee: [§5] §5 (Experiments and Results): The reported benchmark improvements lack details on baseline implementations, statistical significance testing, variance across runs, or controls for data leakage and selection effects. Without these, the claim of superior training efficiency with fewer environments cannot be fully evaluated and risks being driven by unaccounted factors.

Authors: We agree that these experimental details are necessary for full reproducibility and to substantiate the efficiency claims. The revised §5 now includes: (i) explicit descriptions of how each baseline was re-implemented, including hyperparameter choices and any adaptations made for fair comparison; (ii) mean performance with standard deviation across five independent runs; (iii) statistical significance results using the Wilcoxon signed-rank test with reported p-values; and (iv) a new subsection on data-leakage controls that verifies zero environment overlap between our training set and the BFCLv3 / MCP-Atlas test splits, together with ablation studies isolating selection effects. These additions allow readers to evaluate the claim that EnvFactory achieves superior gains with substantially fewer environments than prior work. revision: yes
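
The monitoring and recovery protocol described in the first response (log state transitions during RL, and send any unverified edge case back through verification) can be pictured with a small sketch. Everything named here is hypothetical and not taken from the paper: the `TransitionMonitor` class, the `(environment id, state fingerprint, tool name)` key, and the `reverify` callback standing in for the autonomous verification pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

# Assumption: a transition is keyed by (environment id, state fingerprint, tool name).
Transition = Tuple[str, str, str]


@dataclass
class TransitionMonitor:
    """Logs transitions seen during RL and re-verifies unseen ones (illustrative only)."""

    verified: Set[Transition]
    reverify: Callable[[Transition], bool]  # hypothetical hook into the verification pipeline
    log: List[Transition] = field(default_factory=list)
    quarantined: Set[Transition] = field(default_factory=set)

    def observe(self, transition: Transition) -> bool:
        """Record a transition and return True if it is, or becomes, verified."""
        self.log.append(transition)
        if transition in self.verified:
            return True
        if transition in self.quarantined:
            # Already failed re-verification once; do not retry.
            return False
        # Unverified edge case: trigger re-verification exactly once.
        if self.reverify(transition):
            self.verified.add(transition)
            return True
        self.quarantined.add(transition)
        return False
```

A training loop could call `observe` after every tool call and discard rollouts that touch a quarantined transition; whether the authors handle failures this way is not stated in the text above.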

Circularity Check

0 steps flagged

No circularity; empirical framework evaluated on external benchmarks


The paper introduces EnvFactory as an automated system for synthesizing stateful tool environments and multi-turn trajectories from authentic resources, then reports performance gains on external benchmarks (BFCLv3, MCP-Atlas, τ²-Bench, VitaBench) after training Qwen3 models. No equations, derivations, fitted parameters, or predictions appear in the text. Claims rest on direct empirical comparisons rather than any quantity defined by construction from the paper's own inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present; the central contribution is a practical pipeline whose validity is assessed outside the paper via held-out benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review means specific hyperparameters for sampling and refinement are unknown; the central claims rest on the domain assumption that verified executable environments plus natural trajectories produce effective RL signals.

axioms (1)

domain assumption: Autonomously explored and verified stateful environments from authentic resources are sufficiently robust and representative for Agentic RL training.
Invoked when claiming superior efficiency and performance despite using far fewer environments than prior work.

pith-pipeline@v0.9.0 · 5832 in / 1262 out tokens · 35592 ms · 2026-05-20T10:14:42.554963+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

[1]

Traject-bench: a trajectory-aware benchmark for evaluating agentic tool use

URL https://arxiv.org/abs/2601.01498. P. He, Z. Dai, B. He, H. Liu, X. Tang, H. Lu, J. Li, J. Ding, S. Mukherjee, S. Wang, Y. Xing, J. Tang, and B. Dumoulin. Traject-bench: a trajectory-aware benchmark for evaluating agentic tool use, 2025a. URL https://arxiv.org/abs/2...

work page arXiv 2025
[2]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

URL https://proceedings.mlr.press/v267/patil25a.html. Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. In The twelfth international conference on lea...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1007/s11704-024-40678-2 2023
[3]

URL https://arxiv.org/abs/2506.11045. R. Wang, G. Todd, Z. Xiao, X. Yuan, M.-A. Côté, P. Clark, and P. Jansen. Can language models serve as text-based world simulators?, 2024. URL https://arxiv.org/abs/2406.06485. Z. Wang, X. Zeng, W. Liu, L. Li, Y. Wang, L. Shang, X. Jiang, Q. Liu, and K.-F. Wong. Toolflow: Boosting llm tool-calling through natural and coh...

work page arXiv 2024
[4]

URL https://arxiv.org/abs/2503.07826. Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, S. Song, and G. Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025. L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, et al. Sglang: Ef...

work page arXiv 2025
[5]

Output ONLY the final Python code wrapped in <tool_code> tags

[6]

NO explanations, NO markdown formatting outside the tags

[7]

Implementation Architecture

The code must be production-ready, strictly following the 4-Section structure. Implementation Architecture

[8]

• Section 2: Class: Main logic class

File Structure (Mandatory) • Section 1: Schema: Pydantic models (Entity models + 1 Scenario model). Scenario_Schema defines the internal state structure of the Class. • Section 2: Class: Main logic class. • Section 3: MCP Tools: FastMCP registration + Wrappers. • Sectio...

[9]

current time

Core Requirements 2.1. Pydantic Models • Use Pydantic v2 API throughout. Do NOT use deprecated v1 patterns, such as .dict(). Use model_dump() instead. • Define all data structures using Pydantic BaseModel classes. • Import: from pydantic import BaseModel, Field and from typing import Dict, List, Optional, Union, Any. • Each model must inherit from BaseModel, us...

[10]

For example, taxRatesMap: Dict[str, float] = Field(default={{...}}, description="...")

Scenario Model: Add reference data fields directly to the Scenario model as ordinary Pydantic fields. For example, taxRatesMap: Dict[str, float] = Field(default={{...}}, description="..."). Use Field(default={{...}}) or Field(default_factory=lambda: {{...}}) to set default values containing 10–20 entries. These fields are just like other Scenario fields, w...

[11]

For example, self.taxRatesMap: Dict[str, float] = {{}}

Class Init: Initialize corresponding class attributes. For example, self.taxRatesMap: Dict[str, float] = {{}}

[12]

If scenario provides the field, use the provided value; otherwise, use the default value from Field(default=...)

Load Scenario: Pydantic automatically handles default values. If scenario provides the field, use the provided value; otherwise, use the default value from Field(default=...)

[13]

Random seed for reproducible results

Tool Methods: Access reference data directly through class attributes, such as self.taxRatesMap. NEVER return hardcoded values. 5. Save Scenario: Return all fields in the dictionary, including reference data fields. 2.6. Random Number Generation (Reproducibility) • Avoid random when possible: Prefer deterministic logic based on state variables or input para...
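
The reproducibility requirement above is truncated, so the following is only a sketch of one common way to satisfy it: derive values deterministically where possible, and otherwise draw from an instance-local generator seeded from the scenario. The class `InventoryEnv`, its `seed` argument, and both method names are hypothetical.

```python
import random
from typing import List, Optional


def bulk_discount_percent(order_total_cents: int) -> int:
    """Deterministic logic: the result depends only on the input parameter."""
    return 10 if order_total_cents >= 10_000 else 0


class InventoryEnv:
    """Hypothetical environment class; not taken from the paper."""

    def __init__(self, seed: Optional[int] = 0) -> None:
        self.seed = seed
        # Instance-local generator: avoids touching the global random state,
        # so two environments with the same seed behave identically.
        self._rng = random.Random(seed)

    def shuffled_pick_order(self, items: List[str]) -> List[str]:
        """Return a reproducible shuffled copy of items."""
        shuffled = list(items)
        self._rng.shuffle(shuffled)
        return shuffled
```

Reloading a scenario with the same seed then replays the same sequence, which is what makes a failing rollout debuggable.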

[14]

item not found

MCP Tools 3.1. Error Handling & Empty Output • Class methods: MUST NOT contain try-except blocks or error detection logic. Directly perform operations. Return normal results or None for empty output. Let exceptions propagate naturally. • MCP wrapper functions: MUST use try-except blocks for all error detection and handling. – Simplified validation: MCP wrap...
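
The split described above (class methods raise freely, wrappers catch everything) can be sketched in plain Python. This is an illustration under assumptions: `InventoryManager`, `remove_item_tool`, and the shape of the returned dictionaries are hypothetical, and the FastMCP registration that a real server would need is deliberately omitted.

```python
from typing import Any, Dict, Optional


class InventoryManager:
    """Hypothetical logic class: no try/except, exceptions propagate."""

    def __init__(self) -> None:
        self.items: Dict[str, int] = {"widget": 3}

    def get_quantity(self, item_id: str) -> Optional[int]:
        # Returns None for empty output instead of raising.
        return self.items.get(item_id)

    def remove_item(self, item_id: str) -> int:
        # Raises KeyError naturally if the item does not exist.
        return self.items.pop(item_id)


manager = InventoryManager()


def remove_item_tool(item_id: str) -> Dict[str, Any]:
    """Wrapper function: all error handling lives here (result shape is assumed)."""
    try:
        removed = manager.remove_item(item_id)
        return {"success": True, "removed_quantity": removed}
    except KeyError:
        return {"success": False, "error": f"item not found: {item_id}"}
    except Exception as exc:  # any other failure is reported, not raised
        return {"success": False, "error": str(exc)}
```

Keeping the logic class free of error handling means a verifier that calls it directly sees the real exception, while an agent calling the wrapper always receives a structured result.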

[15]

key": {"field1

Analyze Tool Code Structure • Examine the provided tool_code to identify the main Pydantic scenario model, such as GoogleCalendarScenario, TwitterScenario, or InventoryScenario. • Understand all fields, their types, default values, and relationships. • Identify reference data fields, such as lookup tables like taxRatesMap and shippingZonesMap. • Understand the to...

[16]

Read ALL Pydantic class definitions in Section 1 (Schema) of the tool_code

[17]

Map each field in the main Scenario model to its actual type

[18]

For complex types, such as Dict or List with BaseModel, identify the nested structure

[19]

Generate data that exactly matches the nested structure

[20]

Match them precisely

DO NOT guess or simplify complex types. Match them precisely

[21]

pass": Normal success expected, such as empty collections or extreme but valid values. •

Generate Diverse Test Scenarios You must generate {n_scenarios} test scenarios with varying complexity levels. Complexity Levels 1. Simple (1–2 scenarios): Minimal data. • 1–2 main entities, such as 1 calendar with 1 event, or 2 items in inventory. • Basic fields populated. • Use default reference data if applicable. • Purpose: Test basic tool functionality. 2. M...

[22]

validation_error

Ensure Scenario Quality Each scenario must: • Be a complete, valid dictionary matching the scenario model structure. • Include ALL required fields from the Pydantic model. • Use realistic, coherent data, such as consistent date ranges and related IDs. • Have unique identifiers, such as different event IDs or item IDs. • Include reference data fields with their...

[23]

scenario_id

Output Format Your response must strictly follow this structure: <scenarios> [ { "scenario_id": "scenario_001", "complexity_level": "simple", "description": "Brief description of what this scenario tests", "expected_behavior": "pass", "scenario_data": { // Complete scenario dictionary matching the Pydantic model } }, { "scenario_id": "scenario_002", "comp...
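
A consumer of this output format would need to extract the `<scenarios>` block and check that each entry carries the listed keys. The sketch below assumes a closing `</scenarios>` tag and comment-free JSON inside it (the excerpt is truncated, and its `//` placeholder comment would not parse as JSON); `parse_scenarios` and `REQUIRED_KEYS` are hypothetical names.

```python
import json
import re
from typing import Any, Dict, List

# Keys visible in the output-format excerpt above.
REQUIRED_KEYS = {
    "scenario_id",
    "complexity_level",
    "description",
    "expected_behavior",
    "scenario_data",
}


def parse_scenarios(response: str) -> List[Dict[str, Any]]:
    """Extract and validate the scenario list from a model response."""
    match = re.search(r"<scenarios>(.*?)</scenarios>", response, re.DOTALL)
    if match is None:
        raise ValueError("no <scenarios> block found")
    scenarios = json.loads(match.group(1))
    if not isinstance(scenarios, list):
        raise ValueError("expected a JSON list of scenarios")
    for index, scenario in enumerate(scenarios):
        if not isinstance(scenario, dict):
            raise ValueError(f"scenario {index} is not an object")
        missing = REQUIRED_KEYS - set(scenario)
        if missing:
            raise ValueError(f"scenario {index} is missing keys: {sorted(missing)}")
    return scenarios
```

Rejecting malformed entries before any tool is executed keeps generation errors separate from genuine environment failures.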

[24]

Scenario Preparation You will receive: • mcp_server_name: Name of the MCP server • tool_code: MCP Tools section (Section 3) containing FastMCP registration and tool wrapper functions • tools_metadata: List of all available tools with their schemas • scenario_id: Unique identifier for this scenario • scenario_data: The test scenario data • request_id: For const...

[25]

{mcp_server_name}-{request_id}_{scenario_id}

Client ID Construction You must use this exact pattern: • client_id = "{mcp_server_name}-{request_id}_{scenario_id}" • Example: "GoogleMaps-abc123_scenario_001" • Use the SAME client_id for all operations in this scenario
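
The pattern above maps directly onto a format string. The helper name `build_client_id` is hypothetical; the pattern and the example value come from the excerpt.

```python
def build_client_id(mcp_server_name: str, request_id: str, scenario_id: str) -> str:
    """Join the three parts as {mcp_server_name}-{request_id}_{scenario_id}."""
    return f"{mcp_server_name}-{request_id}_{scenario_id}"


# Build the id once and reuse it for every call in the scenario.
client_id = build_client_id("GoogleMaps", "abc123", "scenario_001")
```

Computing the identifier once and passing it around, rather than rebuilding it per call, is the simplest way to honor the requirement that the same `client_id` be used throughout a scenario.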

[26]

pass"(default): Normal execution, tools should succeed •

Understanding Expected Behavior The scenario may include an expected_behavior field: • "pass" (default): Normal execution, tools should succeed • "validation_error": Scenario contains invalid data, tools should reject it with validation error When evaluating results: 1. P...

[27]

validation_error

Expected Failure: Tool correctly rejected invalid input with validation error when expected_behavior="validation_error". • THIS COUNTS AS PASSED. The tool is working correctly by rejecting bad data. 3. Unexpected Failure: • Tool raised error when success was expected, meaning expected_behavior="pass" but got error. • OR: Tool succeeded when validation error ...
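
The evaluation rules above amount to a small decision table. The sketch below encodes the cases that are legible in the excerpt; the enum and function names are hypothetical, the first category's label is truncated in the source so `EXPECTED_SUCCESS` is a placeholder name, and treating a non-validation error under `expected_behavior="validation_error"` as an unexpected failure is an assumption.

```python
from enum import Enum


class Outcome(Enum):
    EXPECTED_SUCCESS = "expected_success"  # placeholder name; source label is truncated
    EXPECTED_FAILURE = "expected_failure"
    UNEXPECTED_FAILURE = "unexpected_failure"


def classify(expected_behavior: str, tool_succeeded: bool,
             is_validation_error: bool = False) -> Outcome:
    """Map an observed tool result onto the outcome categories."""
    if expected_behavior == "pass":
        if tool_succeeded:
            return Outcome.EXPECTED_SUCCESS
        return Outcome.UNEXPECTED_FAILURE
    if expected_behavior == "validation_error":
        if not tool_succeeded and is_validation_error:
            return Outcome.EXPECTED_FAILURE
        # Succeeded on bad data, or failed for some other reason (assumption).
        return Outcome.UNEXPECTED_FAILURE
    raise ValueError(f"unknown expected_behavior: {expected_behavior!r}")


def counts_as_passed(outcome: Outcome) -> bool:
    """An expected failure counts as passed; only unexpected failures do not."""
    return outcome is not Outcome.UNEXPECTED_FAILURE
```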

[28]

{mcp_server_name}-load_scenario

Layered Validation Procedure Layer 1: Scenario Loading (Critical and Blocking) Call execute_mcp_tool with: • tool_name: "{mcp_server_name}-load_scenario" • tool_args: {"scenario": scenario_data} • client_id: as constructed above Record the result. Evaluate based on expected_behavior: • If expected_behavior="validation_error" and load_scenario fails with validation ...
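
Layer 1 can be sketched as a single guarded call. The excerpt names `execute_mcp_tool` and its three arguments but not its return value, so this sketch assumes it returns a dictionary with a boolean `"success"` and an optional `"error"`; the function is injected as a parameter so that no real MCP client is implied. `run_layer_one` and the keys of its result are hypothetical, and the check that a failure is specifically a validation error is left out because the excerpt is truncated at that point.

```python
from typing import Any, Callable, Dict

ToolCaller = Callable[..., Dict[str, Any]]


def run_layer_one(execute_mcp_tool: ToolCaller, mcp_server_name: str,
                  scenario_data: Dict[str, Any], client_id: str,
                  expected_behavior: str = "pass") -> Dict[str, Any]:
    """Load the scenario and decide whether later layers may run."""
    result = execute_mcp_tool(
        tool_name=f"{mcp_server_name}-load_scenario",
        tool_args={"scenario": scenario_data},
        client_id=client_id,
    )
    loaded = bool(result.get("success"))
    if expected_behavior == "validation_error":
        passed = not loaded  # rejecting bad data is the expected outcome
    else:
        passed = loaded
    # Blocking layer: tool-level checks only make sense on a loaded scenario.
    return {"passed": passed, "run_next_layers": loaded, "error": result.get("error")}
```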

[29]

Tool execution error

Error Diagnosis For any failures, provide: • Error type: For example, "Tool execution error", "State inconsistency", or "Schema mismatch" • Error location: Which tool/method failed • Error details: Actual error message, stack trace if available • Expected vs Actual: What...

[30]

scenario_id

Output Format Your response must strictly follow: <validation_result> { "scenario_id": "...", "passed": true/false, "load_scenario_result": { "success": true/false, "error": "..." // provide the error message if failed }, "tool_execution_results": [ { "tool_name": "...", "passed": true/false, "error": "..." // provide the error message if failed } ], "sav...

[31]

pass" but marked as

Error Categorization and Scenario Problem Detection Categorize errors into: • Pydantic Model Issues: Schema definition problems, field type mismatches • Load/Save Scenario Issues: State management problems, missing fields in save • Tool Logic Errors: Incorrect implementation, wrong return values, missing error handling • State Management Issues: Tools not rea...

[32]

Prioritized Fix Strategy The errors are categorized by severity. Fix issues in this order: 1. CRITICAL (Must Fix First): • load_scenario failures, since these block all testing • Pydantic model schema mismatches, including validation errors and type mismatches • These affect all scenarios and must be fixed before anything else 2. HIGH (Fix Next): • Tools that fa...

[33]

Prompts for ToolGraph Logical Refinement Prompt for ToolGraph Role You are an expert tool relationship analyst specializing in dependency inference

Fix Implementation Guidelines • Fix the root cause, not symptoms • Ensure fixes do not break currently passing scenarios • Maintain all original functionality and structure • Follow all MCP tool generation requirements • Test edge cases in your mental model before suggesting fixes ...

[34]

Can you login to the website with your account?

Current Adjacency Map: A dict tool_name → [list of successor tool names], representing existing dependencies (Tool A → Tool B means Tool B may depend on or follow Tool A). Guidelines For every candidate ordered pair (Tool A → Tool B) not already present, assess: • Semantic Complementarity: Do the tools solve parts of a shared task or pipeline? (e.g., preprocessing → analysis...
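
The adjacency map and the notion of a candidate ordered pair can be made concrete with a short sketch. It only enumerates the pairs that would be handed to the assessment step; the assessment criteria themselves are truncated in the excerpt and are not modeled. The function name and the example tool names are hypothetical.

```python
from itertools import permutations
from typing import Dict, Iterator, List, Tuple


def candidate_pairs(adjacency: Dict[str, List[str]]) -> Iterator[Tuple[str, str]]:
    """Yield every ordered pair (tool_a, tool_b) not already an edge a -> b."""
    # Collect all tools, including ones that appear only as successors.
    tools = sorted(set(adjacency) | {s for succ in adjacency.values() for s in succ})
    for tool_a, tool_b in permutations(tools, 2):
        if tool_b not in adjacency.get(tool_a, []):
            yield (tool_a, tool_b)


# Example adjacency map with made-up tool names.
adjacency = {"login": ["search"], "search": ["checkout"], "checkout": []}
pairs = list(candidate_pairs(adjacency))
```

Because the pairs are ordered, `("search", "login")` remains a candidate even though `login → search` already exists, matching the directional reading of the map.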

[35]

a user wanted information

Scenario Design Design a cohesive narrative that naturally motivates the observed tool sequence. Your scenario must: • Define a realistic user persona (name, age range, occupation, location, relevant traits) • Establish a concrete situation with time/place/context that explains why the user would perform these actions • Flow logically from initial need → actions ...


[1] [1]

12 EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL P

URLhttps://arxiv.org/abs/2601.01498. 12 EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL P. He, Z. Dai, B. He, H. Liu, X. Tang, H. Lu, J. Li, J. Ding, S. Mukherjee, S. Wang, Y. Xing, J. Tang, and B. Dumoulin. Traject-bench:a trajectory-aware benchmark for evaluating agentic tool use, 2025a. URLhttps://arxiv.org/abs/2...

work page arXiv 2025

[2] [2]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

URLhttps://proceedings.mlr.press/v267/patil25a.html. 13 EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. InThe twelfth international conference on lea...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1007/s11704-024-40678-2 2023

[3] [3]

URLhttps://arxiv.org/abs/2506.11045. R. Wang, G. Todd, Z. Xiao, X. Yuan, M.-A. Côté, P. Clark, and P. Jansen. Can language models serve as text-based world simulators?, 2024. URLhttps://arxiv.org/abs/2406.06485. Z. Wang, X. Zeng, W. Liu, L. Li, Y. Wang, L. Shang, X. Jiang, Q. Liu, and K.-F. Wong. Toolflow: Boosting llm tool-calling through natural and coh...

work page arXiv 2024

[4] [4]

URLhttps://arxiv.org/abs/2503.07826. Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, S. Song, and G. Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837, 2025. L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, et al. Sglang: Ef...

work page arXiv 2025

[5] [5]

Output ONLY the final Python code wrapped in<tool_code>tags

work page

[6] [6]

NO explanations, NO markdown formatting outside the tags

work page

[7] [7]

Implementation Architecture

The code must be production-ready, strictly following the 4-Section structure. Implementation Architecture

work page

[8] [8]

•Section 2: Class: Main logic class

File Structure (Mandatory) • Section 1: Schema: Pydantic models (Entity models + 1 Scenario model).Scenario_Schema defines the internal state structure of the Class. •Section 2: Class: Main logic class. •Section 3: MCP Tools: FastMCP registration + Wrappers. 20 EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL •Sectio...

work page

[9] [9]

current time

Core Requirements 2.1. Pydantic Models • Use Pydantic v2 API throughout. Do NOT use deprecated v1 patterns, such as.dict(). Use model_dump()instead. •Define all data structures using PydanticBaseModelclasses. • Import: from pydantic import BaseModel, Field and from typing import Dict, List, Optional, Union, Any. • Each model must inherit fromBaseModel, us...

work page

[10] [10]

For example, taxRatesMap: Dict[str, float] = Field(default={{...}}, description="...")

Scenario Model: Add reference data fields directly to the Scenario model as ordinary Pydantic fields. For example, taxRatesMap: Dict[str, float] = Field(default={{...}}, description="..."). Use Field(default={{...}}) or Field(default_factory=lambda: {{...}})to set default values containing 10–20 entries. These fields are just like other Scenario fields, w...

work page

[11] [11]

For example, self.taxRatesMap: Dict[str, float] = {{}}

Class Init: Initialize corresponding class attributes. For example, self.taxRatesMap: Dict[str, float] = {{}}

work page

[12] [12]

If scenario provides the field, use the provided value; otherwise, use the default value fromField(default=...)

Load Scenario: Pydantic automatically handles default values. If scenario provides the field, use the provided value; otherwise, use the default value fromField(default=...)

work page

[13] [13]

Random seed for reproducible results

Tool Methods: Access reference data directly through class attributes, such asself.taxRatesMap. NEVER return hardcoded values. 5.Save Scenario: Return all fields in the dictionary, including reference data fields. 2.6. Random Number Generation (Reproducibility) • Avoid random when possible: Prefer deterministic logic based on state variables or input para...

work page

[14] [14]

item not found

MCP Tools 3.1. Error Handling & Empty Output • Class methods: MUST NOT contain try-except blocks or error detection logic. Directly perform operations. Return normal results orNone, for empty output. Let exceptions propagate naturally. •MCP wrapper functions: MUST use try-except blocks for all error detection and handling. –Simplified validation: MCP wrap...

work page

[15] [15]

key": {"field1

Analyze Tool Code Structure • Examine the provided tool_code to identify the main Pydantic scenario model, such as GoogleCalendarScenario,TwitterScenario, orInventoryScenario. •Understand all fields, their types, default values, and relationships. •Identify reference data fields, such as lookup tables liketaxRatesMapandshippingZonesMap. •Understand the to...

work page

[16] [16]

Read ALL Pydantic class definitions in Section 1 (Schema) of thetool_code

work page

[17] [17]

Map each field in the main Scenario model to its actual type

work page

[18] [18]

For complex types, such asDictorListwithBaseModel, identify the nested structure

work page

[19] [19]

Generate data that exactly matches the nested structure

work page

[20] [20]

Match them precisely

DO NOT guess or simplify complex types. Match them precisely

work page

[21] [21]

pass": Normal success expected, such as empty collections or extreme but valid values. •

Generate Diverse Test Scenarios You must generate{n_scenarios}test scenarios with varying complexity levels. Complexity Levels 1.Simple (1–2 scenarios): Minimal data. •1–2 main entities, such as 1 calendar with 1 event, or 2 items in inventory. •Basic fields populated. •Use default reference data if applicable. •Purpose: Test basic tool functionality. 2.M...

work page

[22] [22]

validation_error

Ensure Scenario Quality Each scenario must: •Be a complete, valid dictionary matching the scenario model structure. •Include ALL required fields from the Pydantic model. •Use realistic, coherent data, such as consistent date ranges and related IDs. •Have unique identifiers, such as different event IDs or item IDs. •Include reference data fields with their...

[23]

Output Format
Your response must strictly follow this structure:
<scenarios>
[
  {
    "scenario_id": "scenario_001",
    "complexity_level": "simple",
    "description": "Brief description of what this scenario tests",
    "expected_behavior": "pass",
    "scenario_data": {
      // Complete scenario dictionary matching the Pydantic model
    }
  },
  {
    "scenario_id": "scenario_002",
    "comp...
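A consumer of this output format might extract and sanity-check the scenario list roughly as follows. This is a sketch under stated assumptions: the closing `</scenarios>` tag is assumed (the template is cut off before it appears), the generated payload is assumed to be valid JSON without the `//` placeholder comment, and `parse_scenarios` is a hypothetical helper name.

```python
import json
import re


def parse_scenarios(response: str) -> list:
    # Assumption: the block is closed by a matching </scenarios> tag.
    match = re.search(r"<scenarios>(.*?)</scenarios>", response, re.DOTALL)
    if match is None:
        raise ValueError("no <scenarios> block found")

    scenarios = json.loads(match.group(1))
    seen_ids = set()
    for scenario in scenarios:
        # Only the two fields every scenario needs are enforced here.
        for key in ("scenario_id", "scenario_data"):
            if key not in scenario:
                raise ValueError(f"scenario is missing required key: {key}")
        if scenario["scenario_id"] in seen_ids:
            raise ValueError(f"duplicate scenario_id: {scenario['scenario_id']}")
        seen_ids.add(scenario["scenario_id"])
        # "pass" is described as the default expected behavior.
        scenario.setdefault("expected_behavior", "pass")
    return scenarios
```

Checking identifier uniqueness at parse time catches a common generation error before any scenario is loaded into an environment.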

[24]

Scenario Preparation
You will receive:
• mcp_server_name: Name of the MCP server
• tool_code: MCP Tools section (Section 3) containing FastMCP registration and tool wrapper functions
• tools_metadata: List of all available tools with their schemas
• scenario_id: Unique identifier for this scenario
• scenario_data: The test scenario data
• request_id: For const...

[25]

Client ID Construction
You must use this exact pattern:
• client_id = "{mcp_server_name}-{request_id}_{scenario_id}"
• Example: "GoogleMaps-abc123_scenario_001"
• Use the SAME client_id for all operations in this scenario
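The pattern above amounts to a single string template. A minimal sketch, where `build_client_id` is a hypothetical helper name and the argument values are taken from the example in the prompt:

```python
def build_client_id(mcp_server_name: str, request_id: str, scenario_id: str) -> str:
    # Pattern from the prompt: "{mcp_server_name}-{request_id}_{scenario_id}"
    return f"{mcp_server_name}-{request_id}_{scenario_id}"


# Build the ID once and reuse it for every operation in the scenario.
client_id = build_client_id("GoogleMaps", "abc123", "scenario_001")
print(client_id)  # GoogleMaps-abc123_scenario_001
```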

[26]

Understanding Expected Behavior
The scenario may include an expected_behavior field:
• "pass" (default): Normal execution, tools should succeed
• "validation_error": Scenario contains invalid data, tools should reject it with validation error
When evaluating results:
1. P...

[27]

Expected Failure: Tool correctly rejected invalid input with validation error when expected_behavior="validation_error".
• THIS COUNTS AS PASSED. The tool is working correctly by rejecting bad data.
3. Unexpected Failure:
• Tool raised error when success was expected, meaning expected_behavior="pass" but got error.
• OR: Tool succeeded when validation error ...
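The outcome rules can be written as a small decision function. This is an illustrative sketch, not the authors' evaluator: the label strings and function names are hypothetical, and treating a non-validation error under expected_behavior="validation_error" as an unexpected failure is an assumption, since the visible text does not cover that case.

```python
def classify_outcome(expected_behavior: str, tool_succeeded: bool,
                     is_validation_error: bool = False) -> str:
    if expected_behavior == "pass":
        # Success was expected: any error is an unexpected failure.
        return "pass" if tool_succeeded else "unexpected_failure"
    if expected_behavior == "validation_error":
        # Rejecting bad data with a validation error is the correct behavior.
        if not tool_succeeded and is_validation_error:
            return "expected_failure"
        # Assumption: success, or some other kind of error, is unexpected.
        return "unexpected_failure"
    raise ValueError(f"unknown expected_behavior: {expected_behavior}")


def counts_as_passed(outcome: str) -> bool:
    # An expected failure counts as passed.
    return outcome in {"pass", "expected_failure"}
```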

[28]

Layered Validation Procedure
Layer 1: Scenario Loading (Critical and Blocking)
Call execute_mcp_tool with:
• tool_name: "{mcp_server_name}-load_scenario"
• tool_args: {"scenario": scenario_data}
• client_id: as constructed above
Record the result. Evaluate based on expected_behavior:
• If expected_behavior="validation_error" and load_scenario fails with validation ...
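The Layer 1 call can be sketched as below. The real signature of `execute_mcp_tool` is not given in the text, so it is passed in as a callable and assumed to accept keyword arguments named after the three fields listed in the prompt; `run_layer1` is a hypothetical helper name.

```python
from typing import Any, Callable, Dict


def run_layer1(execute_mcp_tool: Callable[..., Any], mcp_server_name: str,
               scenario_data: Dict[str, Any], client_id: str) -> Any:
    # Assumption: execute_mcp_tool takes tool_name, tool_args and client_id
    # as keyword arguments. The returned value is recorded by the caller
    # and judged against expected_behavior.
    return execute_mcp_tool(
        tool_name=f"{mcp_server_name}-load_scenario",
        tool_args={"scenario": scenario_data},
        client_id=client_id,
    )
```

Passing the executor in as a parameter keeps the sketch testable with a stub and avoids guessing how the actual tool is imported.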

[29]

Error Diagnosis
For any failures, provide:
• Error type: For example, "Tool execution error", "State inconsistency", or "Schema mismatch"
• Error location: Which tool/method failed
• Error details: Actual error message, stack trace if available
• Expected vs Actual: What...

[30]

Output Format
Your response must strictly follow:
<validation_result>
{
  "scenario_id": "...",
  "passed": true/false,
  "load_scenario_result": {
    "success": true/false,
    "error": "..." // provide the error message if failed
  },
  "tool_execution_results": [
    {
      "tool_name": "...",
      "passed": true/false,
      "error": "..." // provide the error message if failed
    }
  ],
  "sav...

[31]

pass" but marked as

Error Categorization and Scenario Problem Detection
Categorize errors into:
• Pydantic Model Issues: Schema definition problems, field type mismatches
• Load/Save Scenario Issues: State management problems, missing fields in save
• Tool Logic Errors: Incorrect implementation, wrong return values, missing error handling
• State Management Issues: Tools not rea...

[32]

Prioritized Fix Strategy
The errors are categorized by severity. Fix issues in this order:
1. CRITICAL (Must Fix First):
• load_scenario failures, since these block all testing
• Pydantic model schema mismatches, including validation errors and type mismatches
• These affect all scenarios and must be fixed before anything else
2. HIGH (Fix Next):
• Tools that fa...
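Ordering fixes by severity can be sketched as a stable sort. Only the CRITICAL and HIGH levels are visible in the text, so any other label is placed after them here as an assumption; the error record shape and the `order_fixes` name are hypothetical.

```python
# Only these two levels appear in the visible prompt text.
SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1}


def order_fixes(errors: list) -> list:
    # sorted() is stable, so errors of equal severity keep their input order.
    # Assumption: unknown severity labels sort after the known ones.
    return sorted(
        errors,
        key=lambda err: SEVERITY_ORDER.get(err["severity"], len(SEVERITY_ORDER)),
    )
```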

[33]

Prompts for ToolGraph Logical Refinement
Prompt for ToolGraph
Role
You are an expert tool relationship analyst specializing in dependency inference

Fix Implementation Guidelines
• Fix the root cause, not symptoms
• Ensure fixes do not break currently passing scenarios
• Maintain all original functionality and structure
• Follow all MCP tool generation requirements
• Test edge cases in your mental model before suggesting fixes
...

[34]

Can you login to the website with your account?

Current Adjacency Map: A dict tool_name → [list of successor tool names], representing existing dependencies (Tool A → Tool B means Tool B may depend on or follow Tool A).
Guidelines
For every candidate ordered pair (Tool A → Tool B) not already present, assess:
• Semantic Complementarity: Do the tools solve parts of a shared task or pipeline? (e.g., preprocessing → analysis...
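Enumerating the candidate pairs that still need assessment is a simple set difference over ordered pairs. A minimal sketch, assuming the adjacency map is a plain dict of lists; the tool names in the usage lines are made up for illustration.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def candidate_pairs(adjacency: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    # Collect every tool that appears as a key or as a successor.
    tools = sorted(set(adjacency) | {s for succ in adjacency.values() for s in succ})
    # Keep ordered pairs (A, B) for which the edge A -> B is not yet present.
    return [
        (a, b)
        for a, b in permutations(tools, 2)
        if b not in adjacency.get(a, [])
    ]


# Hypothetical two-tool map: only the reverse edge remains to be assessed.
existing = {"login": ["search"], "search": []}
print(candidate_pairs(existing))  # [('search', 'login')]
```

The number of candidates grows quadratically with the number of tools, so a real pipeline would likely filter pairs before sending them to a model for assessment.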

[35]

a user wanted information

Scenario Design
Design a cohesive narrative that naturally motivates the observed tool sequence. Your scenario must:
• Define a realistic user persona (name, age range, occupation, location, relevant traits)
• Establish a concrete situation with time/place/context that explains why the user would perform these actions
• Flow logically from initial need → actions ...
