pith. sign in

arxiv: 2605.18703 · v1 · pith:P3UMIPJ2new · submitted 2026-05-18 · 💻 cs.CL · cs.LG

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

Pith reviewed 2026-05-20 10:14 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords envfactoryenvironmentsagenticrobusttrainingtrajectorieschallengesdepend
0
0 comments X

The pith

EnvFactory scales tool-use agents by synthesizing verified environments and natural trajectories for efficient Agentic RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EnvFactory as a way to overcome the lack of scalable environments and realistic data for training LLMs to use tools via reinforcement learning. It autonomously builds and checks stateful executable environments from real sources and creates multi-turn trajectories that mimic natural human queries with hidden intents. Using just 85 such environments in seven areas, it produces thousands of training examples that boost model scores on several benchmarks. This matters because it offers a practical path to larger-scale agent training without relying on costly or unreliable external systems.

Core claim

EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents, and using only 85 verified environments across 7 domains generates 2,575 SFT and RL trajectories that improve performance on tool-use and conversational benchmarks.

What carries the argument

The EnvFactory framework for autonomous environment verification and topology-aware trajectory synthesis in Agentic RL.

Load-bearing premise

The generated environments remain robust under RL training and the trajectories capture implicit human reasoning without artifacts that hurt generalization.

What would settle it

A drop in performance on held-out benchmarks when using the generated trajectories instead of real-world data would show that the synthetic environments and trajectories do not provide sufficient training signal.

read the original abstract

Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, reducing their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using significantly fewer environments than prior work, which are often 5 times more, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including $\tau^2$-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EnvFactory, a fully automated framework that autonomously explores and verifies stateful executable tool environments from authentic resources and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement. Using only 85 verified environments across 7 domains, it generates 2,575 SFT and RL trajectories and reports performance improvements of up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks for Qwen3-series models, claiming better efficiency than prior work using more environments.

Significance. If the robustness and generalization claims hold, this work could meaningfully advance scalable Agentic RL for tool-use by automating environment construction and trajectory synthesis, reducing dependence on real-world APIs and manual data curation while capturing implicit human reasoning more effectively than over-specified synthetic data.

major comments (2)
  1. [§3] §3 (Environment Synthesis and Verification): The autonomous verification process confirms initial executability and basic state transitions, but the manuscript provides no explicit mechanism for dynamic re-verification or failure recovery when RL policies reach novel state combinations or edge cases during training. This directly undermines the central claim that the 85 environments remain robust and stateful throughout Agentic RL, potentially invalidating the 2,575 trajectories and downstream gains on BFCLv3 and MCP-Atlas.
  2. [§5] §5 (Experiments and Results): The reported benchmark improvements lack details on baseline implementations, statistical significance testing, variance across runs, or controls for data leakage and selection effects. Without these, the claim of superior training efficiency with fewer environments cannot be fully evaluated and risks being driven by unaccounted factors.
minor comments (2)
  1. [Abstract] The notation τ²-Bench in the abstract and results should include a brief definition or citation to clarify whether it is a standard benchmark or one introduced in this work.
  2. [Figure 2] Ensure all figures showing trajectory examples or environment topologies have clear captions explaining how they demonstrate implicit intents versus over-specified instructions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us clarify key aspects of robustness and experimental rigor. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [§3] §3 (Environment Synthesis and Verification): The autonomous verification process confirms initial executability and basic state transitions, but the manuscript provides no explicit mechanism for dynamic re-verification or failure recovery when RL policies reach novel state combinations or edge cases during training. This directly undermines the central claim that the 85 environments remain robust and stateful throughout Agentic RL, potentially invalidating the 2,575 trajectories and downstream gains on BFCLv3 and MCP-Atlas.

    Authors: We appreciate the referee highlighting the importance of ongoing robustness during RL training. The verification procedure in §3 is intentionally designed around authentic resources to confirm both executability and core state transitions, and the topology-aware sampling strategy for trajectory generation systematically explores diverse state combinations within those verified environments. We acknowledge, however, that the original manuscript did not explicitly describe handling for entirely novel states that might arise under policy updates. In the revised manuscript, we have added a dedicated paragraph in §3.3 outlining a lightweight dynamic monitoring and recovery protocol: state transitions are logged during RL, and any unverified edge case triggers an automated re-exploration and verification step using the same autonomous pipeline. This addition directly supports the robustness claim for the 85 environments and the generated trajectories without changing the reported performance numbers. revision: yes

  2. Referee: [§5] §5 (Experiments and Results): The reported benchmark improvements lack details on baseline implementations, statistical significance testing, variance across runs, or controls for data leakage and selection effects. Without these, the claim of superior training efficiency with fewer environments cannot be fully evaluated and risks being driven by unaccounted factors.

    Authors: We agree that these experimental details are necessary for full reproducibility and to substantiate the efficiency claims. The revised §5 now includes: (i) explicit descriptions of how each baseline was re-implemented, including hyperparameter choices and any adaptations made for fair comparison; (ii) mean performance with standard deviation across five independent runs; (iii) statistical significance results using the Wilcoxon signed-rank test with reported p-values; and (iv) a new subsection on data-leakage controls that verifies zero environment overlap between our training set and the BFCLv3 / MCP-Atlas test splits, together with ablation studies isolating selection effects. These additions allow readers to evaluate the claim that EnvFactory achieves superior gains with substantially fewer environments than prior work. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical framework evaluated on external benchmarks

full rationale

The paper introduces EnvFactory as an automated system for synthesizing stateful tool environments and multi-turn trajectories from authentic resources, then reports performance gains on external benchmarks (BFCLv3, MCP-Atlas, τ²-Bench, VitaBench) after training Qwen3 models. No equations, derivations, fitted parameters, or predictions appear in the text. Claims rest on direct empirical comparisons rather than any quantity defined by construction from the paper's own inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present; the central contribution is a practical pipeline whose validity is assessed outside the paper via held-out benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review means specific hyperparameters for sampling and refinement are unknown; the central claims rest on the domain assumption that verified executable environments plus natural trajectories produce effective RL signals.

axioms (1)
  • domain assumption Autonomously explored and verified stateful environments from authentic resources are sufficiently robust and representative for Agentic RL training
    Invoked when claiming superior efficiency and performance despite using far fewer environments than prior work.

pith-pipeline@v0.9.0 · 5832 in / 1262 out tokens · 35592 ms · 2026-05-20T10:14:42.554963+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

  1. [1]

    12 EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL P

    URLhttps://arxiv.org/abs/2601.01498. 12 EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL P. He, Z. Dai, B. He, H. Liu, X. Tang, H. Lu, J. Li, J. Ding, S. Mukherjee, S. Wang, Y. Xing, J. Tang, and B. Dumoulin. Traject-bench:a trajectory-aware benchmark for evaluating agentic tool use, 2025a. URLhttps://arxiv.org/abs/2...

  2. [2]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    URLhttps://proceedings.mlr.press/v267/patil25a.html. 13 EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. InThe twelfth international conference on lea...

  3. [3]

    URLhttps://arxiv.org/abs/2506.11045. R. Wang, G. Todd, Z. Xiao, X. Yuan, M.-A. Côté, P. Clark, and P. Jansen. Can language models serve as text-based world simulators?, 2024. URLhttps://arxiv.org/abs/2406.06485. Z. Wang, X. Zeng, W. Liu, L. Li, Y. Wang, L. Shang, X. Jiang, Q. Liu, and K.-F. Wong. Toolflow: Boosting llm tool-calling through natural and coh...

  4. [4]

    URLhttps://arxiv.org/abs/2503.07826. Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, S. Song, and G. Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837, 2025. L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, et al. Sglang: Ef...

  5. [5]

    Output ONLY the final Python code wrapped in<tool_code>tags

  6. [6]

    NO explanations, NO markdown formatting outside the tags

  7. [7]

    Implementation Architecture

    The code must be production-ready, strictly following the 4-Section structure. Implementation Architecture

  8. [8]

    •Section 2: Class: Main logic class

    File Structure (Mandatory) • Section 1: Schema: Pydantic models (Entity models + 1 Scenario model).Scenario_Schema defines the internal state structure of the Class. •Section 2: Class: Main logic class. •Section 3: MCP Tools: FastMCP registration + Wrappers. 20 EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL •Sectio...

  9. [9]

    current time

    Core Requirements 2.1. Pydantic Models • Use Pydantic v2 API throughout. Do NOT use deprecated v1 patterns, such as.dict(). Use model_dump()instead. •Define all data structures using PydanticBaseModelclasses. • Import: from pydantic import BaseModel, Field and from typing import Dict, List, Optional, Union, Any. • Each model must inherit fromBaseModel, us...

  10. [10]

    For example, taxRatesMap: Dict[str, float] = Field(default={{...}}, description="...")

    Scenario Model: Add reference data fields directly to the Scenario model as ordinary Pydantic fields. For example, taxRatesMap: Dict[str, float] = Field(default={{...}}, description="..."). Use Field(default={{...}}) or Field(default_factory=lambda: {{...}})to set default values containing 10–20 entries. These fields are just like other Scenario fields, w...

  11. [11]

    For example, self.taxRatesMap: Dict[str, float] = {{}}

    Class Init: Initialize corresponding class attributes. For example, self.taxRatesMap: Dict[str, float] = {{}}

  12. [12]

    If scenario provides the field, use the provided value; otherwise, use the default value fromField(default=...)

    Load Scenario: Pydantic automatically handles default values. If scenario provides the field, use the provided value; otherwise, use the default value fromField(default=...)

  13. [13]

    Random seed for reproducible results

    Tool Methods: Access reference data directly through class attributes, such asself.taxRatesMap. NEVER return hardcoded values. 5.Save Scenario: Return all fields in the dictionary, including reference data fields. 2.6. Random Number Generation (Reproducibility) • Avoid random when possible: Prefer deterministic logic based on state variables or input para...

  14. [14]

    item not found

    MCP Tools 3.1. Error Handling & Empty Output • Class methods: MUST NOT contain try-except blocks or error detection logic. Directly perform operations. Return normal results orNone, for empty output. Let exceptions propagate naturally. •MCP wrapper functions: MUST use try-except blocks for all error detection and handling. –Simplified validation: MCP wrap...

  15. [15]

    key": {"field1

    Analyze Tool Code Structure • Examine the provided tool_code to identify the main Pydantic scenario model, such as GoogleCalendarScenario,TwitterScenario, orInventoryScenario. •Understand all fields, their types, default values, and relationships. •Identify reference data fields, such as lookup tables liketaxRatesMapandshippingZonesMap. •Understand the to...

  16. [16]

    Read ALL Pydantic class definitions in Section 1 (Schema) of thetool_code

  17. [17]

    Map each field in the main Scenario model to its actual type

  18. [18]

    For complex types, such asDictorListwithBaseModel, identify the nested structure

  19. [19]

    Generate data that exactly matches the nested structure

  20. [20]

    Match them precisely

    DO NOT guess or simplify complex types. Match them precisely

  21. [21]

    pass": Normal success expected, such as empty collections or extreme but valid values. •

    Generate Diverse Test Scenarios You must generate{n_scenarios}test scenarios with varying complexity levels. Complexity Levels 1.Simple (1–2 scenarios): Minimal data. •1–2 main entities, such as 1 calendar with 1 event, or 2 items in inventory. •Basic fields populated. •Use default reference data if applicable. •Purpose: Test basic tool functionality. 2.M...

  22. [22]

    validation_error

    Ensure Scenario Quality Each scenario must: •Be a complete, valid dictionary matching the scenario model structure. •Include ALL required fields from the Pydantic model. •Use realistic, coherent data, such as consistent date ranges and related IDs. •Have unique identifiers, such as different event IDs or item IDs. •Include reference data fields with their...

  23. [23]

    scenario_id

    Output Format Your response must strictly follow this structure: <scenarios> [ { "scenario_id": "scenario_001", "complexity_level": "simple", "description": "Brief description of what this scenario tests", "expected_behavior": "pass", "scenario_data": { // Complete scenario dictionary matching the Pydantic model } }, { "scenario_id": "scenario_002", "comp...

  24. [24]

    Scenario Preparation You will receive: •mcp_server_name: Name of the MCP server •tool_code : MCP Tools section (Section 3) containing FastMCP registration and tool wrapper functions •tools_metadata: List of all available tools with their schemas •scenario_id: Unique identifier for this scenario •scenario_data: The test scenario data •request_id: For const...

  25. [25]

    {mcp_server_name}-{request_id}_{scenario_id}

    Client ID Construction You must use this exact pattern: •client_id = "{mcp_server_name}-{request_id}_{scenario_id}" •Example:"GoogleMaps-abc123_scenario_001" •Use the SAMEclient_idfor all operations in this scenario

  26. [26]

    pass"(default): Normal execution, tools should succeed •

    Understanding Expected Behavior The scenario may include anexpected_behaviorfield: •"pass"(default): Normal execution, tools should succeed •"validation_error": Scenario contains invalid data, tools should reject it with validation error When evaluating results: 27 EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL 1.P...

  27. [27]

    validation_error

    Expected Failure: Tool correctly rejected invalid input with validation error when expected_behavior="validation_error". •THIS COUNTS AS PASSED. The tool is working correctly by rejecting bad data. 3.Unexpected Failure: • Tool raised error when success was expected, meaningexpected_behavior="pass" but got error. • OR: Tool succeeded when validation error ...

  28. [28]

    {mcp_server_name}-load_scenario

    Layered Validation Procedure Layer 1: Scenario Loading (Critical and Blocking)Callexecute_mcp_toolwith: •tool_name:"{mcp_server_name}-load_scenario" •tool_args:{"scenario": scenario_data} •client_id: as constructed above Record the result. Evaluate based onexpected_behavior: • If expected_behavior="validation_error" andload_scenario fails with validation ...

  29. [29]

    Tool execution error

    Error Diagnosis For any failures, provide: • Error type: For example,"Tool execution error", "State inconsistency", or"Schema mismatch" 28 EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL •Error location: Which tool/method failed •Error details: Actual error message, stack trace if available •Expected vs Actual: What...

  30. [30]

    scenario_id

    Output Format Your response must strictly follow: <validation_result> { "scenario_id": "...", "passed": true/false, "load_scenario_result": { "success": true/false, "error": "..." // provide the error message if failed }, "tool_execution_results": [ { "tool_name": "...", "passed": true/false, "error": "..." // provide the error message if failed } ], "sav...

  31. [31]

    pass" but marked as

    Error Categorization and Scenario Problem Detection Categorize errors into: •Pydantic Model Issues: Schema definition problems, field type mismatches •Load/Save Scenario Issues: State management problems, missing fields in save •Tool Logic Errors: Incorrect implementation, wrong return values, missing error handling •State Management Issues: Tools not rea...

  32. [32]

    Prioritized Fix Strategy The errors are categorized by severity. Fix issues in this order: 1.CRITICAL (Must Fix First): •load_scenariofailures, since these block all testing •Pydantic model schema mismatches, including validation errors and type mismatches •These affect all scenarios and must be fixed before anything else 2.HIGH (Fix Next): •Tools that fa...

  33. [33]

    Prompts for ToolGraph Logical Refinement Prompt for ToolGraph Role You are an expert tool relationship analyst specializing in dependency inference

    Fix Implementation Guidelines •Fix the root cause, not symptoms •Ensure fixes do not break currently passing scenarios •Maintain all original functionality and structure •Follow all MCP tool generation requirements •Test edge cases in your mental model before suggesting fixes 30 EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and...

  34. [34]

    Can you login to the website with your account?

    Current Adjacency Map: A dicttool_name→[list of successor tool names] , repre- sentingexistingdependencies (Tool A → Tool B means Tool B may depend on or follow Tool A). Guidelines For everycandidateordered pair (Tool A→Tool B)not already present, assess: • SemanticComplementarity: Dothetoolssolvepartsofasharedtaskorpipeline? (e.g.,preprocessing →analysis...

  35. [35]

    a user wanted information

    Scenario Design Design a cohesive narrative thatnaturally motivatesthe observed tool sequence. Your scenario must: •Define a realistic user persona (name, age range, occupation, location, relevant traits) • Establish a concrete situation with time/place/context that explainswhythe user would perform these actions •Flow logically from initial need→actions ...