EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis

Guanting Dong; Haofei Chang; Ji-Rong Wen; Xiaoshuai Song; Yutao Zhu; Zhicheng Dou

arxiv: 2601.05808 · v2 · submitted 2026-01-09 · cs.CL · cs.AI · cs.LG

Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou

Pith reviewed 2026-05-16 16:01 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG

keywords LLM agents · tool interaction · environment synthesis · programmatic generation · supervised fine-tuning · reinforcement learning · multi-turn interactions · trajectory validation

The pith

EnvScaler uses programmatic synthesis to create hundreds of tool-interaction environments and thousands of validated scenarios for training LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes EnvScaler as an automated way to build rich training sandboxes for LLM agents that must use multiple tools across several turns. It addresses the shortage of scalable environments by first mining topics and modeling logic to create environment skeletons, then generating task scenarios along with rule-based functions that validate correct trajectories. The resulting collection of 191 environments and roughly 7,000 scenarios is used for both supervised fine-tuning and reinforcement learning on the Qwen3 model family. Experiments on three benchmarks show clear gains in the models' ability to solve tasks that require coordinated, multi-step tool use. A sympathetic reader would care because current agent training is bottlenecked by access to realistic, consistent interaction data, and this method offers a path to expand that data without manual construction or simulation artifacts.

Core claim

EnvScaler constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation, then generates multiple task scenarios and rule-based trajectory validation functions for each skeleton. This process produces 191 environments and about 7,000 scenarios that are applied to supervised fine-tuning and reinforcement learning of Qwen3 series models. The resulting agents demonstrate significantly stronger performance on three benchmarks that measure success in complex environments requiring multi-turn, multi-tool interactions.

What carries the argument

SkelBuilder for constructing environment skeletons and ScenGenerator for producing task scenarios together with rule-based trajectory validators.

If this is right

Agents trained this way can complete more multi-step tasks that require coordinated calls to several tools.
The rule-based validators provide reliable reward signals during reinforcement learning without external simulation errors.
The same synthesis pipeline can be rerun to produce additional environments whenever new tool APIs become available.
Training data volume scales linearly with compute rather than human effort, enabling larger agent models to be fine-tuned.
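
The reward-signal point in the list above can be made concrete with a minimal sketch. It assumes, hypothetically, that each scenario carries a list of boolean check functions over the final environment state and that the reward is the fraction of checks passed. The text here does not state the exact reward formula, so `trajectory_reward` and the state layout are illustrative only.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
CheckFn = Callable[[State], bool]


def trajectory_reward(final_state: State, checks: List[CheckFn]) -> float:
    """Return the fraction of rule-based checks that pass (assumed reward shape)."""
    if not checks:
        return 0.0
    passed = 0
    for check in checks:
        try:
            passed += bool(check(final_state))
        except Exception:
            # A check that crashes on an unexpected state counts as failed.
            pass
    return passed / len(checks)


# Hypothetical final state and checks, for illustration only.
final_state = {"messages": [{"to": "USR1", "status": "delivered"}]}
checks = [
    lambda s: len(s["messages"]) == 1,
    lambda s: s["messages"][0]["status"] == "delivered",
    lambda s: s["messages"][0]["to"] == "USR9",
]
reward = trajectory_reward(final_state, checks)
```

Because every check is a deterministic function of the state, the same trajectory always receives the same score, which is the property the bullet on reward signals relies on.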

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Validation functions created during synthesis could be reused at inference time to detect when an agent deviates from valid trajectories.
The approach might extend to other agent domains such as web navigation or code execution if similar logic models can be defined.
Because the environments are fully specified by code, they allow systematic variation of difficulty parameters to study scaling laws for agent performance.

Load-bearing premise

The programmatically generated environments and validation rules accurately reflect real-world tool behaviors and supply consistent training signals without hidden inconsistencies.

What would settle it

Training an LLM agent on the synthesized set and then measuring no improvement (or a decline) in success rate on held-out real-world multi-turn tool-use tasks compared with a baseline trained without them.

Figures

Figures reproduced from arXiv: 2601.05808 by Guanting Dong, Haofei Chang, Ji-Rong Wen, Xiaoshuai Song, Yutao Zhu, Zhicheng Dou.

**Figure 2.** The overview of EnvScaler. Adjacent text: "… with the pass rate indicating environment quality. To further synthesize multiple task scenarios for each environment, we propose ScenGenerator. To ensure task relevance and solvability within a given environment and scenario, ScenGenerator first synthesizes the environment’s initial database/state, and then derives challenging tasks from the current state. To achieve rule-based …"

**Figure 3.** The overall framework of SkelBuilder. Adjacent text: "… In this paper, we focus on general tool use across various domain-specific environments (Patil et al., 2025; Yao et al., 2025; Chen et al., 2025), rather than tool-integrated reasoning and web information access centered on Python or search tools (Dong et al., 2025; Li et al., 2025a). Some works have explored the training data and RL strategies from different p…"

**Figure 4.** The overall framework of ScenGenerator.

Adjacent table (per-environment statistics):

| Item | Avg. | Med. |
| --- | --- | --- |
| # Constraint Rules Per Env | 4.58 | 5 |
| # State Category Per Env: Level 1 (e.g., user, message, item) | 3.74 | 4 |
| # State Category Per Env: Level 2 per Level 1 (e.g., u_id, u_phone) | 5.72 | 5 |
| # State Category Per Env: Total | 21.38 | 21 |
| # Tools Per Env: Env Information Query (e.g., list_users) | 10.44 | 10 |
| # Tools Per Env: Env State Change (e.g., send_message) | 8.14 | 8 |
| # Tools Per Env: Total | 18.58 | 18 |

**Figure 5.** Pairwise comparison of different LLMs on a …

**Figure 6.** The RL training and validation curve of Qwen3 in synthetic environments after SFT. Adjacent text: "… gains across all datasets, whereas Qwen3-1.7B shows notable improvements on BFCL-MT and ACEBench-Agent but a slight drop on Tau-Bench. The main reason is that large-scale models possess stronger exploration capabilities during RL, enabling them to extract effective strategies. In contrast, small-scale models, with weaker fo…"

**Figure 7.** The change of Qwen3-4B’s performance with …

**Figure 8.** Diversity and statistical distributions of 191 synthesized environments.

**Figure 9.** The Direct RL training and validation curve …

**Figure 10.** The prompt for filtering tasks situated within a domain-specific, stateful environment.

**Figure 11.** The prompt for inferring environment description from existing tasks.

**Figure 12.** The prompt for planning state and rules of the environment.

**Figure 13.** The prompt for planning tool operations of the environment.

**Figure 14.** The prompt for programmatically converting states into the class definition.

**Figure 15.** The prompt for programmatically converting operation to class-method.

**Figure 16.** The prompt for initializing the testing agent during environments assessment.

**Figure 17.** The prompt for initializing the checking agent during environments assessment.

**Figure 18.** The prompt for generating environment’s initial state data.

**Figure 19.** The prompt for generating a task under the specific environment and state.

**Figure 20.** The prompt for generating verification checklist for a task.

**Figure 21.** The prompt for programmatically converting each checkpoint to a verification function.

**Figure 22.** The system prompt for prompting LLM agents under the …

**Figure 23.** The system prompt for prompting LLM agents under the …

**Figure 24.** The system prompt for prompting the LLM to act as a user under the …

**Figure 25.** An example of initial state data configuration for the environment.

**Figure 26.** An example task under the above state configuration.

Original abstract

Large language models (LLMs) are expected to be trained to act as agents in various real-world environments, but this process relies on rich and varied tool-interaction sandboxes. However, access to real systems is often restricted; LLM-simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool-interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs' ability to solve tasks in complex environments involving multi-turn, multi-tool interactions. We release our code and data at https://github.com/RUC-NLPIR/EnvScaler.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

EnvScaler gives a concrete pipeline to generate 191 tool environments and 7k scenarios for agent training, with reported gains on three benchmarks, but the lack of external fidelity checks is the main open question.

The core takeaway is that this paper delivers a working two-stage synthesis system (SkelBuilder for environment skeletons via topic mining and logic modeling, then ScenGenerator for scenarios plus rule-based validators), and they actually produced 191 environments and roughly 7k scenarios to train Qwen3 models with both SFT and RL. That scale is useful because real tool sandboxes are hard to get in volume. They release the code and data, which is the part that will matter most to people who want to try it themselves. The reported improvements on the three multi-turn, multi-tool benchmarks are the main empirical claim, and the abstract frames it as addressing the usual bottlenecks of restricted access, hallucinations in simulated envs, and manual scaling limits.

On the positive side, the method is fully programmatic and independent of downstream training loops, so there is no obvious circularity. The release of artifacts also lets others inspect or extend the generated data directly.

The soft spot is exactly the one the stress-test flags: there is no quantitative evidence in the available description that the synthetic environments match real API behavior or avoid systematic simplifications. The rule-based validators catch some inconsistencies, but without distribution comparisons to live tool logs, expert realism scores, or mismatch rates on actual executions, it is hard to know whether the benchmark gains reflect better agent capabilities or just better alignment to the generated distribution. If the paper includes those checks in the full text, they need to be front and center; if not, that is the section that would need the most work in revision.

This is the kind of paper that belongs in a reading group focused on LLM agents and tool use. Researchers who are already building training pipelines for multi-turn tool tasks will find the released environments and the synthesis recipe immediately usable, even if they end up adding their own validation layer. It is not a foundational theoretical advance, but it tackles a practical data bottleneck with a replicable method.

I would send it to peer review. The problem is real, the implementation is concrete, and the released resources give referees something tangible to evaluate. A referee could usefully press on the fidelity metrics and ask for clearer baseline numbers, but the work is worth the time.

Referee Report

2 major / 1 minor

Summary. The paper introduces EnvScaler, a framework for programmatically synthesizing scalable tool-interactive environments for training LLM agents. It features SkelBuilder, which uses topic mining, logic modeling, and quality evaluation to create environment skeletons, and ScenGenerator, which produces task scenarios and rule-based trajectory validators. The authors generate 191 environments and approximately 7,000 scenarios, apply them to SFT and RL training of Qwen3 models, and claim significant improvements on three benchmarks involving multi-turn, multi-tool interactions.

Significance. If the synthesized environments accurately capture real-world tool interactions without systematic artifacts, EnvScaler could substantially advance scalable training of LLM agents by reducing reliance on manual construction or restricted real-system access. The public release of code and data is a clear strength for reproducibility.

major comments (2)

[Abstract and §4 (Experiments)] The central claim of 'significant improvements' on three benchmarks is stated without any quantitative metrics, baseline comparisons, error bars, or statistical details. This is load-bearing because the reported gains cannot be assessed for magnitude, significance, or whether they exceed what would be expected from training on any large synthetic dataset.
[§3.1 (SkelBuilder) and §3.2 (ScenGenerator)] No quantitative fidelity metrics are reported (e.g., distribution distance to real tool-use logs, expert realism ratings, or execution mismatch rates against live APIs). The rule-based validators alone do not address the risk that benchmark gains reflect overfitting to synthetic simplifications rather than transferable multi-turn tool-use capability.

minor comments (1)

[Abstract] The abstract uses the imprecise phrase 'about 7K scenarios'; the exact count and breakdown by environment should be stated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the quantitative presentation of results and add fidelity analyses for the synthesized environments.

Referee: [Abstract and §4 (Experiments)] The central claim of 'significant improvements' on three benchmarks is stated without any quantitative metrics, baseline comparisons, error bars, or statistical details. This is load-bearing because the reported gains cannot be assessed for magnitude, significance, or whether they exceed what would be expected from training on any large synthetic dataset.

Authors: We agree that the abstract summarizes results qualitatively and that §4 would benefit from additional statistical details. The experiments section already contains tables comparing EnvScaler-trained models against baselines including the base Qwen3, standard SFT, and prior agent methods, with concrete accuracy and success-rate improvements on the three benchmarks. In the revision we will (1) insert key quantitative figures into the abstract, (2) add error bars from multiple random seeds and p-values for statistical significance in §4, and (3) include a new control experiment training on an equivalently sized unstructured synthetic dataset to isolate the contribution of EnvScaler’s structured synthesis.

Revision: yes

Referee: [§3.1 (SkelBuilder) and §3.2 (ScenGenerator)] No quantitative fidelity metrics are reported (e.g., distribution distance to real tool-use logs, expert realism ratings, or execution mismatch rates against live APIs). The rule-based validators alone do not address the risk that benchmark gains reflect overfitting to synthetic simplifications rather than transferable multi-turn tool-use capability.

Authors: We acknowledge that explicit fidelity metrics are absent from the submitted version. The rule-based validators guarantee executability, and SkelBuilder’s quality evaluation enforces logical coherence, yet these do not quantify distributional similarity to real tool-use data. In the revision we will add: (i) embedding-based distribution distances between generated scenarios and held-out real tool-use traces, (ii) execution mismatch rates measured against live APIs on a sampled subset, and (iii) an analysis showing that benchmark gains persist on tasks whose tool sequences differ from the synthetic training distribution. Full expert human ratings would require a new study; we will therefore report the above automated metrics and note the limitation.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity in synthesis-to-training pipeline

The paper's core chain is: (1) SkelBuilder performs topic mining + logic modeling + quality evaluation to produce environment skeletons; (2) ScenGenerator adds rule-based trajectory validators to produce scenarios; (3) the resulting 191 environments / ~7K scenarios are used as training data for SFT and RL on Qwen3 models; (4) performance is measured on three external benchmarks. None of these steps is defined in terms of its own outputs, no parameters are fitted on a subset and then relabeled as predictions, and no load-bearing premise rests on a self-citation whose content is itself unverified. The synthesis pipeline is presented as an independent, programmatic process whose fidelity is asserted rather than derived from the downstream benchmark numbers. Consequently the reported improvements are not forced by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based on abstract only; no explicit free parameters, axioms, or invented entities are described in the provided text. The synthesis process likely relies on implicit choices in topic mining and logic modeling, but these are not specified.

pith-pipeline@v0.9.0 · 5519 in / 1165 out tokens · 63000 ms · 2026-05-16T16:01:07.104146+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
cs.AI · 2026-05 · unverdicted · novelty 8.0
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
cs.AI · 2026-04 · unverdicted · novelty 6.0
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

Democratizing Tool Learning with Environments Fully Simulated by a Free 8B Language Model
cs.LG · 2026-04 · unverdicted · novelty 6.0
TRUSTEE uses an 8B LM to simulate complete dynamic environments for RL-based tool learning and outperforms baselines that require extra external resources.

Code as Agent Harness
cs.CL · 2026-05 · accept · novelty 5.0
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed ...

Scalable Environments Drive Generalizable Agents
cs.AI · 2026-05 · unverdicted · novelty 5.0
Generalizable agents require environment scaling via diverse executable rule-sets, distinguished from trajectory and task scaling in a new taxonomy.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 5 Pith papers · 2 internal anchors

[1]

Are: Scaling up agent environments and evaluations. arXiv preprint arXiv:2509.17158. Shihao Cai, Runnan Fang, Jialong Wu, Baixuan Li, Xinyu Wang, Yong Jiang, Liangcai Su, Liwen Zhang, Wenbiao Yin, Zhen Zhang, Fuli Feng, Pengjun Xie, and Xiaobin Wang. 2025. Autoforge: Automated environment synthesis for agentic reinforcement learning. Preprint, arXiv:2...

[2]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization. Preprint, arXiv:2501.03262. Yuchen Huang, Sijia Li, Zhiyuan Fan, Minghao LIU, Wei Liu, and Yi R. Fung. 2025. Scaling environments for LLM agents: Fundamentals, approaches, and future directions. In Workshop on Scaling Environments for Agents. Minghao Li, Yingx...

[3]

Reinforcement learning optimization for large-scale learning: An efficient and user-friendly scaling library. arXiv preprint arXiv:2506.06122, 2025a. Userbench: An interactive gym environment for user-centric agents. In Workshop on Scaling Environments for Agents. Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. 2025. Tool learning with large language models: A survey. Frontiers of Computer Science, 19(8):198343. Michael Sullivan, Mareike Hartmann,...

[4]

Qwen3 Technical Report. arXiv preprint arXiv:2505.09388. Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik R Narasimhan. 2025. τ-bench: A benchmark for Tool-Agent-User interaction in real-world domains. In The Thirteenth International Conference on Learning Representations. Junjie Ye, Changhao Jiang, Zhengyin D...
[5]

Persistent Environment - The query is about a domain where:
- There is a live, ongoing state that can be read or changed
- The environment supports both: a) information queries about current state (read operations); b) explicit state-changing actions (create, update, delete, move, cancel, etc.)

[6]

State Dependency - The task cannot be answered correctly without:
- Inspecting the actual current data or configuration in the environment, and/or
- Executing an operation that modifies that data

[7]

Domain Specificity - The environment is not general-purpose knowledge; it is a structured system such as: file management system with stored files/folders, calendar/scheduling system, other specialized platforms with records that persist over time

[8]

Actionability in Context - The query must correspond to an actionable operation or status check within the actual environment (not hypothetical).

Eligible Task Types:
- State queries: "Is invoice #1024 paid?" / "What meetings are scheduled for Wednesday?"
- State modification operations: "Upload the proposal.pdf to the project folder" / "Cancel order ...
[9]

Analysis:
- Explain the reasoning process used to connect the task to the chosen environment.
- Note any relevant entities, constraints, relationships, or dynamics implied by the task

[10]

Environment Summary:
- Provide a concise label for the environment type

[11]

Environment Introduction:
- Introduce the environment itself, without referring to the current task.
- Focus on its inherent structure, the nature of the state it maintains, typical operations it supports, and its general scope in real-world usage.
- Limit to approximately three sentences

[12]

all possible data in an e-commerce system

Metrics:
- Usefulness: 1-10. Reflects how broadly applicable and valuable this environment is in real-world scenarios. Higher scores indicate environments relevant to many contexts and industries.
- Modelability: 1-10. Indicates how straightforward it would be to represent this environment using a single Python class, with attributes holding state and meth...

[13]

In # Analysis, reason about:
- What entities/attributes are involved.
- Parameters needed.
- Expected outputs (queries return structured results, state modifications return success messages).
- Error/edge cases (e.g., invalid input, permission denied).
- Does it involve environmental constraints or rules
[14]

success": False,

In # Code, implement the Python method:
- Method name: `def <operation_name>(self, ...)`. Note: Cannot be an independent function, but rather a method function within an already implemented environment class.
- Add clear type hints.
- Add docstring describing inputs, outputs, constraints.
- Error handling: do not raise exceptions - return a dict like `{ "s...
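
The method-generation rules in entry [14] can be illustrated with a minimal sketch of a tool method that follows them: it lives on an environment class, carries type hints and a docstring, and reports failures through a returned dict instead of raising. The class name `MessagingEnv`, the state layout, the sample user record, and the `"error"` key are all assumptions made for illustration; the prompt text is truncated before the full return format is shown.

```python
from typing import Any, Dict


class MessagingEnv:
    """Hypothetical stateful environment; the schema is illustrative."""

    def __init__(self, users: Dict[str, Dict[str, Any]]) -> None:
        # State: user_id -> user record.
        self.users = users

    def get_user_by_phone(self, phone_number: str) -> Dict[str, Any]:
        """Look up a registered user by phone number.

        Returns {"success": True, "data": user} on a match, otherwise
        {"success": False, "error": message}. Never raises.
        """
        if not isinstance(phone_number, str) or not phone_number:
            return {"success": False, "error": "phone_number must be a non-empty string"}
        for user in self.users.values():
            if user.get("phone") == phone_number:
                # Return a copy so callers cannot mutate environment state.
                return {"success": True, "data": dict(user)}
        return {"success": False, "error": "No user with this phone number"}


# Invented sample data, for illustration only.
env = MessagingEnv({"USR1": {"user_id": "USR1", "name": "Alice Chan", "phone": "+17165550001"}})
found = env.get_user_by_phone("+17165550001")
missing = env.get_user_by_phone("+10000000000")
```

Returning errors as data keeps a failed tool call visible to the agent as an ordinary observation, so an episode can continue after a mistake.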
[15]

Each checklist item must be independent and not rely on the results of other items

[16]

Every checklist item must start with the exact phrase: "Has ..." followed by a clear description of the action or field to verify

[17]

Use precise fields and exact values; avoid vague wording

[18]

If the task requires checking multiple fields, split them into separate checklist items

[19]

List the items in logical order, ensuring each is self-contained

[20]

Output format:
- Use Markdown list syntax (`-`) for each checklist item.
- Each item must start with "Has ..." and be verifiable with a single boolean expression.

Figure 20: The prompt for generating verification checklist for a task.

ScenGenerator: Prompt for Converting Checkpoint to Verification Function. You are a Python verification function generation...
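
The output-format rules in entries [16] and [20] lend themselves to a mechanical check. The following is a minimal sketch, assuming the checklist arrives as a Markdown string; `parse_checklist` is a hypothetical helper, not a function from the released code.

```python
from typing import List


def parse_checklist(markdown: str) -> List[str]:
    """Extract '- ' list items and require each to start with 'Has '."""
    items = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue  # ignore blank lines and non-list text
        item = line[2:].strip()
        if not item.startswith("Has "):
            raise ValueError(f"Checklist item must start with 'Has ': {item!r}")
        items.append(item)
    return items


# Hypothetical model output, for illustration only.
raw = """
- Has Gabby Fields been added as a contact for user USR2?
- Has the new message's delivery status been set to "delivered"?
"""
items = parse_checklist(raw)
```

Rejecting malformed items early means each surviving item can be handed to the function-generation step one at a time.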
[21]

Always reference the data structure, field names, and value formats from `initial_state` when writing your verification logic. This ensures the code matches the actual system schema

[22]

If the check item involves randomly generated or time-dependent fields (e.g., `user_id`, `create_time`, `update_time`, UUIDs), do not validate against a fixed concrete value. Instead, check that the field exists in `final_state` and has the correct type/format (e.g., string)

[23]

If the check item describes a non-fixed target value (e.g., "add a remark"), only verify that the field exists and meets basic conditions (e.g., non-empty string, correct data type)

[24]

If the check item specifies an explicit target value, you must strictly match it (`==`)

[25]

Use `initial_state` as a reference only when necessary to determine changes - for example, "Has been added" means the entity didn't exist in `initial_state` but exists in `final_state`

[26]

The function must implement only the given single check item and return `True` if passed, `False` if failed

[27]

The function must not modify any input data and must perform no actions other than verification

[28]

Ensure the function signature is `def check_func(final_state)`, and that it only returns `True` or `False`.

Figure 21: The prompt for programmatically converting each checkpoint to a verification function.

System Prompt for Non-Conversation Agent: You are a helpful assistant. When given a specific task, your goal is to complete it in an interactive environmen...
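
Rules [21] to [28] can be read together as a recipe for a single verification function. Below is a minimal sketch for the checkpoint "Has Gabby Fields been added as a contact for user USR2 (Brandon Lee)?", which appears in the example later on this page. The state schema (`contacts`, `owner_id`, `contact_id`) is an assumption; the real schema comes from each environment's `initial_state`.

```python
def check_func(final_state):
    """Has Gabby Fields been added as a contact for user USR2?"""
    contacts = final_state.get("contacts", {})
    if not isinstance(contacts, dict):
        return False
    for contact in contacts.values():
        if not isinstance(contact, dict):
            continue
        # Explicit target values are matched strictly with ==.
        if (
            contact.get("owner_id") == "USR2"
            and contact.get("name") == "Gabby Fields"
            and contact.get("phone") == "+17165558888"
        ):
            # contact_id is system-generated: check type only, not value.
            return isinstance(contact.get("contact_id"), str)
    return False


# Hypothetical final state, for illustration only.
final_state = {
    "contacts": {
        "c-81f2": {
            "contact_id": "c-81f2",
            "owner_id": "USR2",
            "name": "Gabby Fields",
            "phone": "+17165558888",
        }
    }
}
passed = check_func(final_state)
failed = check_func({"contacts": {}})
```

The function only reads its input and always returns a plain boolean, which matches the no-side-effects requirement in [27] and the return contract in [28].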
[29]

If the task contains multiple sub-tasks, do not reveal all of them at once; provide relevant sub-tasks one by one as the Agent asks

[30]

If completing the task requires multiple pieces of information, do not disclose them all at once; provide partial information in response to the Agent's questions

[31]

All requests must remain strictly within the scope of the task; do not add extra requirements, intentions, or invent information that was not part of the original task

[32]

""Look up a user by their phone number. Constraints: - Only registered users are present in the system. - Phone number must match a user entry; otherwise, return error

Always keep the conversation focused on progressing toward the task, ensuring every sub-task or goal is covered and none are skipped.

Fidelity and Consistency Requirements:
- Always remain faithful to the original task wording throughout the conversation. Pay special attention to preserving exact keywords, names, and proper nouns - do not rephrase or al...
[36]

Send a short status update from Brandon to Alice Chan (USR1): “Update: re-sent Gabby’s invite and delivery confirmed.” Then mark Alice’s previously unread message from Brandon as “read.”

Figure 26: An example task under the above state configuration.

No. | Checkpoint | Check Function
1 | Has Gabby Fields been added as a contact for user USR2 (Brandon Lee)? | def ...

[37]

As Brandon Lee (USR2), add Gabby Fields as a contact using her mobile number +17165558888. Validate the number before sending

[38]

Re-send a corrected invite to Gabby from Brandon: “Hi Gabby, the meeting is now at 4:30pm ET. Please confirm.” Update the new message’s delivery status to “delivered.”

[39]

Link the new message to the existing Brandon–Gabby conversation and then archive that conversation. After successfully sending the new message, delete the old failed message Brandon previously sent to Gabby

[40]

Send a short status update from Brandon to Alice Chan (USR1): “Update: re-sent Gabby’s invite and delivery confirmed.” Then mark Alice’s previously unread message from Brandon as “read.”

Action: Function(name="validate_phone_number", arguments={"phone_number": "+17165558888"})
Observation: [Tool Result] {'success': True, 'data': {'valid': True, 'reason': 'P...
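
The Action/Observation pair in entry [40] suggests how an agent's tool call might be executed against a class-based environment. The sketch below routes a call to the method with the same name and returns its dict as the observation. `ToolCall`, `ContactEnv`, `execute`, the `"error"` key, and the phone-number rule are all hypothetical; the trace does not show how the released code performs dispatch or validation.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ToolCall:
    """Hypothetical container mirroring Function(name=..., arguments=...)."""
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


class ContactEnv:
    """Illustrative environment exposing one tool."""

    def validate_phone_number(self, phone_number: str) -> Dict[str, Any]:
        # Assumed rule: '+' followed by 8 to 15 digits.
        valid = bool(re.fullmatch(r"\+\d{8,15}", phone_number))
        return {"success": True, "data": {"valid": valid}}


def execute(env: Any, call: ToolCall) -> Dict[str, Any]:
    """Route a tool call to the matching public method; never raise."""
    method = getattr(env, call.name, None)
    if call.name.startswith("_") or not callable(method):
        return {"success": False, "error": f"Unknown tool: {call.name}"}
    try:
        return method(**call.arguments)
    except TypeError as exc:
        # Wrong or missing arguments are reported as an observation.
        return {"success": False, "error": str(exc)}


env = ContactEnv()
ok = execute(env, ToolCall("validate_phone_number", {"phone_number": "+17165558888"}))
bad = execute(env, ToolCall("delete_everything"))
```

Dispatching by method name keeps the tool list and the environment code in one place, so adding a tool only requires adding a method.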


[5]

Persistent Environment - The query is about a domain where:
- There is a live, ongoing state that can be read or changed
- The environment supports both:
  a) Information queries about current state (read operations)
  b) Explicit state-changing actions (create, update, delete, move, cancel, etc.)

[6]

State Dependency - The task cannot be answered correctly without:
- Inspecting the actual current data or configuration in the environment, and/or
- Executing an operation that modifies that data

[7]

Domain Specificity - The environment is not general-purpose knowledge; it is a structured system such as:
- File management system with stored files/folders
- Calendar/scheduling system
- Other specialized platforms with records that persist over time

[8]

Actionability in Context - The query must correspond to an actionable operation or status check within the actual environment (not hypothetical).

### Eligible Task Types
- State queries: "Is invoice #1024 paid?" / "What meetings are scheduled for Wednesday?"
- State modification operations: "Upload the proposal.pdf to the project folder" / "Cancel order ...

[9]

# Analysis
- Explain the reasoning process used to connect the task to the chosen environment.
- Note any relevant entities, constraints, relationships, or dynamics implied by the task.

[10]

# Environment Summary
- Provide a concise label for the environment type.

[11]

# Environment Introduction
- Introduce the environment itself, without referring to the current task.
- Focus on its inherent structure, the nature of the state it maintains, typical operations it supports, and its general scope in real-world usage.
- Limit to approximately three sentences.

[12]

all possible data in an e-commerce system

# Metrics
- Usefulness: 1-10. Reflects how broadly applicable and valuable this environment is in real-world scenarios. Higher scores indicate environments relevant to many contexts and industries.
- Modelability: 1-10. Indicates how straightforward it would be to represent this environment using a single Python class, with attributes holding state and meth...

[13]

In # Analysis, reason about:
- What entities/attributes are involved.
- Parameters needed.
- Expected outputs (queries return structured results, state modifications return success messages).
- Error/edge cases (e.g., invalid input, permission denied).
- Does it involve environmental constraints or rules

[14]

success": False,

In # Code, implement the Python method:
- Method name: `def <operation_name>(self, ...)`. Note: Cannot be an independent function, but rather a method function within an already implemented environment class.
- Add clear type hints.
- Add docstring describing inputs, outputs, constraints.
- Error handling: do not raise exceptions - return a dict like `{ "s...
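The method requirements above (a method on an existing environment class, type hints, a docstring, and errors returned as a dict instead of raised) can be sketched as follows. This is an illustration only: `ContactEnv`, the state layout, the `"error"` key, and the sample phone number for Brandon Lee are assumptions, since the exact error dict is truncated in the prompt text.

```python
from typing import Any, Dict


class ContactEnv:
    """Hypothetical environment class; the state layout is an assumption."""

    def __init__(self, users: Dict[str, Dict[str, Any]]) -> None:
        # State: user_id -> user record containing a "phone_number" field.
        self.users = users

    def get_user_by_phone(self, phone_number: str) -> Dict[str, Any]:
        """Look up a registered user by phone number.

        Returns {"success": True, "data": <user record>} on a match,
        otherwise {"success": False, "error": <reason>}. Never raises.
        """
        if not isinstance(phone_number, str) or not phone_number:
            return {
                "success": False,
                "error": "phone_number must be a non-empty string",
            }
        for user in self.users.values():
            if user.get("phone_number") == phone_number:
                # Return a copy so callers cannot mutate environment state.
                return {"success": True, "data": dict(user)}
        return {"success": False, "error": "no user with this phone number"}


# Sample data below is made up for the example.
env = ContactEnv(
    {"USR2": {"user_id": "USR2", "name": "Brandon Lee", "phone_number": "+17165550000"}}
)
found = env.get_user_by_phone("+17165550000")
missing = env.get_user_by_phone("+10000000000")
```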

[15]

Each checklist item must be independent and not rely on the results of other items

[16]

Every checklist item must start with the exact phrase: "Has ..." followed by a clear description of the action or field to verify

[17]

Use precise fields and exact values; avoid vague wording

[18]

If the task requires checking multiple fields, split them into separate checklist items

[19]

List the items in logical order, ensuring each is self-contained

[20]

Output format:
- Use Markdown list syntax (`-`) for each checklist item.
- Each item must start with "Has ..." and be verifiable with a single boolean expression.

Figure 20: The prompt for generating verification checklist for a task.

ScenGenerator: Prompt for Converting Checkpoint to Verification Function

You are a Python verification function generation...
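The checklist output format described above (one Markdown `-` item per line, each starting with "Has ") is simple enough to validate mechanically before the items are passed on. A minimal sketch follows. `parse_checklist` is a hypothetical helper, not something the paper describes, and rejecting malformed output with `ValueError` is an assumed design choice.

```python
from typing import List


def parse_checklist(markdown: str) -> List[str]:
    """Extract items from a Markdown '-' list and enforce the 'Has ' prefix."""
    items: List[str] = []
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # Skip blank lines between items.
        if not line.startswith("- "):
            raise ValueError(f"not a '-' list item: {line!r}")
        item = line[2:].strip()
        if not item.startswith("Has "):
            raise ValueError(f"item must start with 'Has ': {item!r}")
        items.append(item)
    return items


checklist = parse_checklist(
    "- Has Gabby Fields been added as a contact for user USR2 (Brandon Lee)?\n"
)
```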

[21]

Always reference the data structure, field names, and value formats from `initial_state` when writing your verification logic. This ensures the code matches the actual system schema.

[22]

If the check item involves randomly generated or time-dependent fields (e.g., `user_id`, `create_time`, `update_time`, UUIDs), do not validate against a fixed concrete value. Instead, check that the field exists in `final_state` and has the correct type/format (e.g., string).
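The rule above on randomly generated or time-dependent fields can be illustrated with a check that matches the task-specified message text exactly but only type-checks the generated fields. This is a sketch under an assumed schema: the `messages` mapping and the `content`, `message_id`, and `create_time` field names are hypothetical and would in practice be taken from `initial_state`.

```python
from typing import Any, Dict

# Explicit target value taken from the example task wording.
TARGET_CONTENT = "Hi Gabby, the meeting is now at 4:30pm ET. Please confirm."


def check_func(final_state: Dict[str, Any]) -> bool:
    """Has the corrected invite been stored with well-formed generated fields?"""
    for message in final_state.get("messages", {}).values():
        if message.get("content") != TARGET_CONTENT:
            continue
        # Generated fields: verify presence and type, never a concrete value.
        for field in ("message_id", "create_time"):
            value = message.get(field)
            if not isinstance(value, str) or not value:
                return False
        return True
    return False


example_state = {
    "messages": {
        "m1": {
            "message_id": "m1",
            "create_time": "2026-01-09T10:00:00Z",
            "content": TARGET_CONTENT,
        }
    }
}
passed = check_func(example_state)
```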

[23]

If the check item describes a non-fixed target value (e.g., "add a remark"), only verify that the field exists and meets basic conditions (e.g., non-empty string, correct data type)

[24]

If the check item specifies an explicit target value, you must strictly match it (`==`)

[25]

Use `initial_state` as a reference only when necessary to determine changes - for example, "Has been added" means the entity didn't exist in `initial_state` but exists in `final_state`

[26]

The function must implement only the given single check item and return `True` if passed, `False` if failed

[27]

The function must not modify any input data and must perform no actions other than verification

[28]

Ensure the function signature is `def check_func(final_state)`, and that it only returns `True` or `False`.

Figure 21: The prompt for programmatically converting each checkpoint to a verification function.

System Prompt for Non-Conversation Agent

You are a helpful assistant. When given a specific task, your goal is to complete it in an interactive environmen...
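A check function following the rules above (single check item, `def check_func(final_state)` signature, boolean result, no mutation) might look like the sketch below for the checkpoint "Has Gabby Fields been added as a contact for user USR2 (Brandon Lee)?". The `contacts` schema is an assumption. The `run_checks` and `pass_rate` helpers are hypothetical, and scoring by the fraction of passed checks is an assumed aggregation that this section does not specify.

```python
import copy
from typing import Any, Callable, Dict, List


def check_func(final_state: Dict[str, Any]) -> bool:
    """Has Gabby Fields been added as a contact for user USR2 (Brandon Lee)?"""
    # Assumes initial_state was inspected when writing this check and
    # contained no such contact, so presence in final_state means "added".
    contacts = final_state.get("contacts", {}).get("USR2", [])
    return any(
        c.get("name") == "Gabby Fields" and c.get("phone_number") == "+17165558888"
        for c in contacts
    )


def run_checks(
    final_state: Dict[str, Any],
    checks: List[Callable[[Dict[str, Any]], bool]],
) -> List[bool]:
    """Run each check independently; a crashing check counts as failed."""
    results: List[bool] = []
    for check in checks:
        try:
            # Deep copy so one check cannot alter the state seen by another.
            results.append(bool(check(copy.deepcopy(final_state))))
        except Exception:
            results.append(False)
    return results


def pass_rate(results: List[bool]) -> float:
    """Fraction of passed checks (assumed aggregation)."""
    return sum(results) / len(results) if results else 0.0


state = {"contacts": {"USR2": [{"name": "Gabby Fields", "phone_number": "+17165558888"}]}}
results = run_checks(state, [check_func, lambda s: s["missing_key"]])
```

Running every check on its own copy of the state reflects the requirement that checklist items be independent of one another.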

[29]

If the task contains multiple sub-tasks, do not reveal all of them at once; provide relevant sub-tasks one by one as the Agent asks

[30]

If completing the task requires multiple pieces of information, do not disclose them all at once; provide partial information in response to the Agent's questions

[31]

All requests must remain strictly within the scope of the task - do not add extra requirements, intentions, or invent information that was not part of the original task

[32]

""Look up a user by their phone number. Constraints: - Only registered users are present in the system. - Phone number must match a user entry; otherwise, return error

Always keep the conversation focused on progressing toward the task, ensuring every sub-task or goal is covered and none are skipped.

Fidelity and Consistency Requirements:
- Always remain faithful to the original task wording throughout the conversation. Pay special attention to preserving exact keywords, names, and proper nouns - do not rephrase or al...
