EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL
Pith reviewed 2026-05-20 10:14 UTC · model grok-4.3
The pith
EnvFactory scales tool-use agents by synthesizing verified environments and natural trajectories for efficient Agentic RL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, it generates 2,575 SFT and RL trajectories that improve performance on tool-use and conversational benchmarks.
What carries the argument
The EnvFactory framework for autonomous environment verification and topology-aware trajectory synthesis in Agentic RL.
Load-bearing premise
The generated environments remain robust under RL training and the trajectories capture implicit human reasoning without artifacts that hurt generalization.
What would settle it
A drop in performance on held-out benchmarks when using the generated trajectories instead of real-world data would show that the synthetic environments and trajectories do not provide sufficient training signal.
Original abstract
Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, reducing their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using significantly fewer environments than prior work, which are often 5 times more, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including τ²-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EnvFactory, a fully automated framework that autonomously explores and verifies stateful executable tool environments from authentic resources and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement. Using only 85 verified environments across 7 domains, it generates 2,575 SFT and RL trajectories and reports performance improvements of up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks for Qwen3-series models, claiming better efficiency than prior work using more environments.
Significance. If the robustness and generalization claims hold, this work could meaningfully advance scalable Agentic RL for tool-use by automating environment construction and trajectory synthesis, reducing dependence on real-world APIs and manual data curation while capturing implicit human reasoning more effectively than over-specified synthetic data.
major comments (2)
- [§3] §3 (Environment Synthesis and Verification): The autonomous verification process confirms initial executability and basic state transitions, but the manuscript provides no explicit mechanism for dynamic re-verification or failure recovery when RL policies reach novel state combinations or edge cases during training. This directly undermines the central claim that the 85 environments remain robust and stateful throughout Agentic RL, potentially invalidating the 2,575 trajectories and downstream gains on BFCLv3 and MCP-Atlas.
- [§5] §5 (Experiments and Results): The reported benchmark improvements lack details on baseline implementations, statistical significance testing, variance across runs, or controls for data leakage and selection effects. Without these, the claim of superior training efficiency with fewer environments cannot be fully evaluated and risks being driven by unaccounted factors.
minor comments (2)
- [Abstract] The notation τ²-Bench in the abstract and results should include a brief definition or citation to clarify whether it is a standard benchmark or one introduced in this work.
- [Figure 2] Ensure all figures showing trajectory examples or environment topologies have clear captions explaining how they demonstrate implicit intents versus over-specified instructions.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which has helped us clarify key aspects of robustness and experimental rigor. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our results.
Point-by-point responses
-
Referee: [§3] §3 (Environment Synthesis and Verification): The autonomous verification process confirms initial executability and basic state transitions, but the manuscript provides no explicit mechanism for dynamic re-verification or failure recovery when RL policies reach novel state combinations or edge cases during training. This directly undermines the central claim that the 85 environments remain robust and stateful throughout Agentic RL, potentially invalidating the 2,575 trajectories and downstream gains on BFCLv3 and MCP-Atlas.
Authors: We appreciate the referee highlighting the importance of ongoing robustness during RL training. The verification procedure in §3 is intentionally designed around authentic resources to confirm both executability and core state transitions, and the topology-aware sampling strategy for trajectory generation systematically explores diverse state combinations within those verified environments. We acknowledge, however, that the original manuscript did not explicitly describe handling for entirely novel states that might arise under policy updates. In the revised manuscript, we have added a dedicated paragraph in §3.3 outlining a lightweight dynamic monitoring and recovery protocol: state transitions are logged during RL, and any unverified edge case triggers an automated re-exploration and verification step using the same autonomous pipeline. This addition directly supports the robustness claim for the 85 environments and the generated trajectories without changing the reported performance numbers.
Revision: yes
-
Referee: [§5] §5 (Experiments and Results): The reported benchmark improvements lack details on baseline implementations, statistical significance testing, variance across runs, or controls for data leakage and selection effects. Without these, the claim of superior training efficiency with fewer environments cannot be fully evaluated and risks being driven by unaccounted factors.
Authors: We agree that these experimental details are necessary for full reproducibility and to substantiate the efficiency claims. The revised §5 now includes: (i) explicit descriptions of how each baseline was re-implemented, including hyperparameter choices and any adaptations made for fair comparison; (ii) mean performance with standard deviation across five independent runs; (iii) statistical significance results using the Wilcoxon signed-rank test with reported p-values; and (iv) a new subsection on data-leakage controls that verifies zero environment overlap between our training set and the BFCLv3 / MCP-Atlas test splits, together with ablation studies isolating selection effects. These additions allow readers to evaluate the claim that EnvFactory achieves superior gains with substantially fewer environments than prior work.
Revision: yes
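The reporting described in items (ii) and (iii) of this response can be sketched in a few lines. This is an illustration only, not the authors' evaluation code: the scores are hypothetical, the function names are invented here, and the exact enumeration is one possible way to compute a Wilcoxon signed-rank p-value for a handful of paired values.

```python
from itertools import product
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) over one configuration's runs."""
    return mean(scores), stdev(scores)


def wilcoxon_signed_rank_exact(xs, ys):
    """Exact two-sided Wilcoxon signed-rank p-value for paired samples.

    Zero differences are dropped and tied absolute differences share their
    average rank. All 2**n sign patterns are enumerated, so keep n small.
    """
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return hits / 2 ** n


# Hypothetical accuracy (%) over five runs; NOT results from the paper.
baseline = [61.2, 60.8, 61.9, 60.5, 61.4]
envfactory = [66.0, 65.1, 66.8, 64.9, 66.3]
base_mean, base_std = summarize(baseline)
p_value = wilcoxon_signed_rank_exact(envfactory, baseline)
```

With these hypothetical scores every paired difference is positive, so the exact two-sided p-value is 2/32 = 0.0625, which is the smallest value five pairs can produce. The choice of paired unit (runs, benchmarks, or individual tasks) therefore affects what the test is able to show.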
Circularity Check
No circularity; empirical framework evaluated on external benchmarks
full rationale
The paper introduces EnvFactory as an automated system for synthesizing stateful tool environments and multi-turn trajectories from authentic resources, then reports performance gains on external benchmarks (BFCLv3, MCP-Atlas, τ²-Bench, VitaBench) after training Qwen3 models. No equations, derivations, fitted parameters, or predictions appear in the text. Claims rest on direct empirical comparisons rather than any quantity defined by construction from the paper's own inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present; the central contribution is a practical pipeline whose validity is assessed outside the paper via held-out benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Autonomously explored and verified stateful environments from authentic resources are sufficiently robust and representative for Agentic RL training.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction
unclear: Relation between the paper passage and the cited Recognition theorem.
EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement
-
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel
unclear: Relation between the paper passage and the cited Recognition theorem.
Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
URL https://arxiv.org/abs/2601.01498. P. He, Z. Dai, B. He, H. Liu, X. Tang, H. Lu, J. Li, J. Ding, S. Mukherjee, S. Wang, Y. Xing, J. Tang, and B. Dumoulin. Traject-bench: a trajectory-aware benchmark for evaluating agentic tool use, 2025a. URL https://arxiv.org/abs/2...
-
[2]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
URL https://proceedings.mlr.press/v267/patil25a.html. Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. In The twelfth international conference on lea...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1007/s11704-024-40678-2 2023
-
[3]
URL https://arxiv.org/abs/2506.11045. R. Wang, G. Todd, Z. Xiao, X. Yuan, M.-A. Côté, P. Clark, and P. Jansen. Can language models serve as text-based world simulators?, 2024. URL https://arxiv.org/abs/2406.06485. Z. Wang, X. Zeng, W. Liu, L. Li, Y. Wang, L. Shang, X. Jiang, Q. Liu, and K.-F. Wong. Toolflow: Boosting llm tool-calling through natural and coh...
-
[4]
URL https://arxiv.org/abs/2503.07826. Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, S. Song, and G. Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025. L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, et al. Sglang: Ef...
-
[5]
Output ONLY the final Python code wrapped in <tool_code> tags
-
[6]
NO explanations, NO markdown formatting outside the tags
-
[7]
The code must be production-ready, strictly following the 4-Section structure. Implementation Architecture
-
[8]
File Structure (Mandatory)
• Section 1: Schema: Pydantic models (Entity models + 1 Scenario model). Scenario_Schema defines the internal state structure of the Class.
• Section 2: Class: Main logic class.
• Section 3: MCP Tools: FastMCP registration + Wrappers.
• Sectio...
-
[9]
Core Requirements
2.1. Pydantic Models
• Use Pydantic v2 API throughout. Do NOT use deprecated v1 patterns, such as .dict(). Use model_dump() instead.
• Define all data structures using Pydantic BaseModel classes.
• Import: from pydantic import BaseModel, Field and from typing import Dict, List, Optional, Union, Any.
• Each model must inherit from BaseModel, us...
-
[10]
Scenario Model: Add reference data fields directly to the Scenario model as ordinary Pydantic fields. For example, taxRatesMap: Dict[str, float] = Field(default={{...}}, description="..."). Use Field(default={{...}}) or Field(default_factory=lambda: {{...}}) to set default values containing 10–20 entries. These fields are just like other Scenario fields, w...
-
[11]
Class Init: Initialize corresponding class attributes. For example, self.taxRatesMap: Dict[str, float] = {{}}
-
[12]
Load Scenario: Pydantic automatically handles default values. If scenario provides the field, use the provided value; otherwise, use the default value from Field(default=...)
-
[13]
Random seed for reproducible results
Tool Methods: Access reference data directly through class attributes, such as self.taxRatesMap. NEVER return hardcoded values.
5. Save Scenario: Return all fields in the dictionary, including reference data fields.
2.6. Random Number Generation (Reproducibility)
• Avoid random when possible: Prefer deterministic logic based on state variables or input para...
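The reproducibility guidance in this excerpt can be illustrated with a small sketch. It is not the paper's code: the function names, the flight-delay example, and the use of SHA-256 to derive a value from input parameters are all assumptions made for illustration.

```python
import hashlib
import random
from typing import List


def deterministic_delay_minutes(flight_id: str, date: str) -> int:
    """Derive a stable pseudo-value from the inputs instead of calling random.

    Hypothetical helper: the same (flight_id, date) always maps to the same
    delay in the range 0..59, on any machine and in any process.
    """
    digest = hashlib.sha256(f"{flight_id}|{date}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 60


def seeded_choice(options: List[str], seed: int) -> str:
    """When randomness is unavoidable, use a local generator with a fixed seed."""
    rng = random.Random(seed)  # local instance; global random state is untouched
    return rng.choice(options)
```

Hashing the inputs keeps a tool's output a pure function of its arguments, which is one way to read "deterministic logic based on state variables or input parameters"; the seeded generator is the fallback when a draw is genuinely needed.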
-
[14]
MCP Tools
3.1. Error Handling & Empty Output
• Class methods: MUST NOT contain try-except blocks or error detection logic. Directly perform operations. Return normal results, or None for empty output. Let exceptions propagate naturally.
• MCP wrapper functions: MUST use try-except blocks for all error detection and handling.
– Simplified validation: MCP wrap...
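The split between class methods and wrappers described in this excerpt can be sketched as follows. This is a plain-Python stand-in under stated assumptions: FastMCP registration is omitted so the block runs with the standard library alone, and the class name, tool names, and the success/error dictionary returned by the wrapper are hypothetical rather than taken from the paper.

```python
from typing import Any, Dict, Optional


class InventoryManager:
    """Hypothetical main logic class (Section 2 of the excerpted file layout)."""

    def __init__(self) -> None:
        self.prices: Dict[str, float] = {}

    def get_item_price(self, item_id: str) -> Optional[float]:
        # No try/except here: an unknown item simply yields None (empty output).
        return self.prices.get(item_id)

    def set_item_price(self, item_id: str, price: Any) -> float:
        # float() may raise ValueError or TypeError; the exception propagates.
        self.prices[item_id] = float(price)
        return self.prices[item_id]


manager = InventoryManager()


def set_item_price_tool(item_id: str, price: Any) -> Dict[str, Any]:
    """Hypothetical wrapper; a real file would register this with FastMCP."""
    try:
        return {"success": True, "result": manager.set_item_price(item_id, price)}
    except Exception as exc:  # all error handling lives in the wrapper
        return {"success": False, "error": f"{type(exc).__name__}: {exc}"}
```

Keeping the class free of error handling means a bug surfaces as an exception at exactly one layer, where the wrapper can turn it into a structured response for the agent.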
-
[15]
Analyze Tool Code Structure
• Examine the provided tool_code to identify the main Pydantic scenario model, such as GoogleCalendarScenario, TwitterScenario, or InventoryScenario.
• Understand all fields, their types, default values, and relationships.
• Identify reference data fields, such as lookup tables like taxRatesMap and shippingZonesMap.
• Understand the to...
-
[16]
Read ALL Pydantic class definitions in Section 1 (Schema) of the tool_code
-
[17]
Map each field in the main Scenario model to its actual type
-
[18]
For complex types, such as Dict or List with BaseModel, identify the nested structure
-
[19]
Generate data that exactly matches the nested structure
- [20]
-
[21]
"pass": Normal success expected, such as empty collections or extreme but valid values.
Generate Diverse Test Scenarios
You must generate {n_scenarios} test scenarios with varying complexity levels.
Complexity Levels
1. Simple (1–2 scenarios): Minimal data.
• 1–2 main entities, such as 1 calendar with 1 event, or 2 items in inventory.
• Basic fields populated.
• Use default reference data if applicable.
• Purpose: Test basic tool functionality.
2. M...
-
[22]
Ensure Scenario Quality
Each scenario must:
• Be a complete, valid dictionary matching the scenario model structure.
• Include ALL required fields from the Pydantic model.
• Use realistic, coherent data, such as consistent date ranges and related IDs.
• Have unique identifiers, such as different event IDs or item IDs.
• Include reference data fields with their...
-
[23]
Output Format
Your response must strictly follow this structure:
<scenarios>
[
  {
    "scenario_id": "scenario_001",
    "complexity_level": "simple",
    "description": "Brief description of what this scenario tests",
    "expected_behavior": "pass",
    "scenario_data": {
      // Complete scenario dictionary matching the Pydantic model
    }
  },
  {
    "scenario_id": "scenario_002",
    "comp...
-
[24]
Scenario Preparation
You will receive:
• mcp_server_name: Name of the MCP server
• tool_code: MCP Tools section (Section 3) containing FastMCP registration and tool wrapper functions
• tools_metadata: List of all available tools with their schemas
• scenario_id: Unique identifier for this scenario
• scenario_data: The test scenario data
• request_id: For const...
-
[25]
Client ID Construction
You must use this exact pattern:
• client_id = "{mcp_server_name}-{request_id}_{scenario_id}"
• Example: "GoogleMaps-abc123_scenario_001"
• Use the SAME client_id for all operations in this scenario
-
[26]
Understanding Expected Behavior
The scenario may include an expected_behavior field:
• "pass" (default): Normal execution, tools should succeed
• "validation_error": Scenario contains invalid data, tools should reject it with validation error
When evaluating results:
1. P...
-
[27]
Expected Failure: Tool correctly rejected invalid input with validation error when expected_behavior="validation_error".
• THIS COUNTS AS PASSED. The tool is working correctly by rejecting bad data.
3. Unexpected Failure:
• Tool raised error when success was expected, meaning expected_behavior="pass" but got error.
• OR: Tool succeeded when validation error ...
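The client-ID pattern from excerpt [25] and the pass/fail rules from excerpts [26] and [27] can be sketched together. The function names are hypothetical, and one reading is an assumption rather than something the truncated excerpt states: when a validation error is expected, a failure of any other kind is treated here as an unexpected failure.

```python
def build_client_id(mcp_server_name: str, request_id: str, scenario_id: str) -> str:
    """Compose the client ID using the exact pattern quoted in the excerpt."""
    return f"{mcp_server_name}-{request_id}_{scenario_id}"


def step_passed(
    expected_behavior: str,
    succeeded: bool,
    is_validation_error: bool = False,
) -> bool:
    """Decide whether one tool call counts as passed.

    expected_behavior: "pass" (default in the excerpt) or "validation_error".
    succeeded: True if the tool call returned without error.
    is_validation_error: True if the failure was a validation rejection.
    """
    if expected_behavior == "validation_error":
        # Expected failure: rejecting bad data counts as passed.
        # Assumption: other error types do not count as the expected rejection.
        return (not succeeded) and is_validation_error
    # expected_behavior == "pass": any error is an unexpected failure.
    return succeeded


client_id = build_client_id("GoogleMaps", "abc123", "scenario_001")
# client_id == "GoogleMaps-abc123_scenario_001", matching the quoted example
```

The same client_id would then be passed to every operation in the scenario, as the excerpt requires.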
-
[28]
Layered Validation Procedure
Layer 1: Scenario Loading (Critical and Blocking)
Call execute_mcp_tool with:
• tool_name: "{mcp_server_name}-load_scenario"
• tool_args: {"scenario": scenario_data}
• client_id: as constructed above
Record the result. Evaluate based on expected_behavior:
• If expected_behavior="validation_error" and load_scenario fails with validation ...
-
[29]
Error Diagnosis
For any failures, provide:
• Error type: For example, "Tool execution error", "State inconsistency", or "Schema mismatch"
• Error location: Which tool/method failed
• Error details: Actual error message, stack trace if available
• Expected vs Actual: What...
-
[30]
Output Format
Your response must strictly follow:
<validation_result>
{
  "scenario_id": "...",
  "passed": true/false,
  "load_scenario_result": {
    "success": true/false,
    "error": "..." // provide the error message if failed
  },
  "tool_execution_results": [
    {
      "tool_name": "...",
      "passed": true/false,
      "error": "..." // provide the error message if failed
    }
  ],
  "sav...
-
[31]
Error Categorization and Scenario Problem Detection
Categorize errors into:
• Pydantic Model Issues: Schema definition problems, field type mismatches
• Load/Save Scenario Issues: State management problems, missing fields in save
• Tool Logic Errors: Incorrect implementation, wrong return values, missing error handling
• State Management Issues: Tools not rea...
-
[32]
Prioritized Fix Strategy
The errors are categorized by severity. Fix issues in this order:
1. CRITICAL (Must Fix First):
• load_scenario failures, since these block all testing
• Pydantic model schema mismatches, including validation errors and type mismatches
• These affect all scenarios and must be fixed before anything else
2. HIGH (Fix Next):
• Tools that fa...
-
[33]
Fix Implementation Guidelines
• Fix the root cause, not symptoms
• Ensure fixes do not break currently passing scenarios
• Maintain all original functionality and structure
• Follow all MCP tool generation requirements
• Test edge cases in your mental model before suggesting fixes ...
-
[34]
Can you login to the website with your account?
Current Adjacency Map: A dict tool_name → [list of successor tool names], representing existing dependencies (Tool A → Tool B means Tool B may depend on or follow Tool A).
Guidelines
For every candidate ordered pair (Tool A → Tool B) not already present, assess:
• Semantic Complementarity: Do the tools solve parts of a shared task or pipeline? (e.g., preprocessing → analysis...
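The adjacency map and the candidate pairs mentioned in this excerpt can be made concrete with a short sketch. The tool names and the function name are hypothetical, and excluding self-pairs (Tool A → Tool A) is an assumption, since the truncated excerpt does not say how they are treated.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def candidate_pairs(adjacency: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """List ordered pairs (A, B) that are not yet edges in the adjacency map.

    adjacency maps a tool name to its successor tool names, so an entry
    "A": ["B"] means Tool B may depend on or follow Tool A.
    """
    # Collect every tool that appears as a key or as a successor.
    tools = sorted(set(adjacency) | {t for succ in adjacency.values() for t in succ})
    return [
        (a, b)
        for a, b in permutations(tools, 2)  # ordered pairs with a != b
        if b not in adjacency.get(a, [])
    ]


# Hypothetical travel tools, for illustration only.
adjacency = {
    "search_flights": ["book_flight"],
    "book_flight": ["send_receipt"],
    "send_receipt": [],
}
pairs = candidate_pairs(adjacency)
```

Each returned pair would then be judged against the guideline criteria, such as semantic complementarity, before an edge is added to the map.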
-
[35]
Scenario Design
Design a cohesive narrative that naturally motivates the observed tool sequence. Your scenario must:
• Define a realistic user persona (name, age range, occupation, location, relevant traits)
• Establish a concrete situation with time/place/context that explains why the user would perform these actions
• Flow logically from initial need → actions ...