The World Won't Stay Still: Programmable Evolution for Agent Benchmarks

Bingrou Zhou; Binit Jha; Chen Luo; Dawn Song; Guangrui Li; Jason Choi; Michael J. Morais; Monica Xiao Cheng; Shaunak Mishra; Tianqi Zheng

arxiv: 2603.05910 · v2 · pith:5K53UDX6new · submitted 2026-03-06 · 💻 cs.AI

Guangrui Li, Yaochen Xie, Yi Liu, Ziwei Dong, Xingyuan Pan, Tianqi Zheng, Jason Choi, Michael J. Morais, Binit Jha, Shaunak Mishra, Bingrou Zhou, Chen Luo, Monica Xiao Cheng, Dawn Song

Pith reviewed 2026-05-21 12:42 UTC · model grok-4.3

classification 💻 cs.AI

keywords environment evolution · tool-calling agents · graph transformations · benchmarking LLM agents · programmable benchmarks

The pith

ProEvolve makes environment evolution programmable for agent benchmarks by using graph transformations on a typed relational graph.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ProEvolve, a framework that addresses the limits of static benchmarks for LLM tool-calling agents. It represents the environment as a typed relational graph that unifies data, tools, and schema. Evolution is then expressed through explicit graph transformations that update all related components consistently. This enables automatic generation of evolved environments and task sandboxes for testing how agents adapt to changes such as added or deprecated capabilities. Validation in the e-commerce and airline domains shows that the framework can be applied in practice to diagnose agent behavior under evolution.

Core claim

At its core, ProEvolve uses a typed relational graph to provide a unified explicit representation of the environment, allowing adding, removing, or modifying capabilities to be expressed as graph transformations that coherently propagate updates across tools, schemas, and data access.
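
A minimal Python sketch of this idea follows, under an encoding that is assumed here and not taken from the paper: attribute nodes are typed by their owning entity, each tool is an edge from input attributes to output attributes, and a capability is added by one function that updates schema nodes and the tool edge together. The names `Attr`, `Tool`, `EnvGraph`, `add_capability`, and `deprecate_tool` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Attr:
    """A typed node: one attribute of one entity, e.g. Order.order_id."""
    entity: str
    name: str


@dataclass(frozen=True)
class Tool:
    """A typed edge: a READ or WRITE tool mapping input attributes to outputs."""
    name: str
    kind: str  # "READ" or "WRITE"
    inputs: frozenset
    outputs: frozenset


@dataclass
class EnvGraph:
    """Hypothetical container: schema nodes plus tool edges keyed by tool name."""
    attrs: set = field(default_factory=set)
    tools: dict = field(default_factory=dict)


def add_capability(g, new_attrs, tool):
    """Return a new graph with the schema nodes and the tool edge added together."""
    attrs = g.attrs | set(new_attrs)
    missing = (tool.inputs | tool.outputs) - attrs
    if missing:
        # Refuse edits that would leave a tool pointing at a non-existent attribute.
        raise ValueError(f"tool {tool.name} references unknown attributes: {missing}")
    return EnvGraph(attrs=attrs, tools={**g.tools, tool.name: tool})


def deprecate_tool(g, tool_name):
    """Return a new graph without the tool edge; schema nodes are kept (a simplification)."""
    tools = {n: t for n, t in g.tools.items() if n != tool_name}
    return EnvGraph(attrs=set(g.attrs), tools=tools)


# Illustrative seed graph and two successive versions.
order_id = Attr("Order", "order_id")
refund = Attr("ExchangeReturnRequest", "refund_amount")
g0 = EnvGraph(attrs={order_id})
g1 = add_capability(
    g0,
    [refund],
    Tool("get_order_refund_summary", "READ", frozenset({order_id}), frozenset({refund})),
)
g2 = deprecate_tool(g1, "get_order_refund_summary")
```

Each function returns a new graph instead of mutating its input, so earlier environment versions stay available for comparison. How the paper translates such a graph into executable tool code is not covered by this sketch.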

What carries the argument

Typed relational graph as unified representation with graph transformations for coherent updates and subgraph sampling for task sandboxes.
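
Subgraph sampling for a task sandbox could look roughly like the sketch below. It assumes the environment graph is available as a plain adjacency mapping and uses a randomized breadth-first expansion from a seed node; the paper's actual sampling procedure is not described on this page, and `sample_subgraph` and the example node names are hypothetical. The instantiation step (filling the sampled nodes with concrete data and a user intent) is not shown.

```python
import random
from collections import deque


def sample_subgraph(adjacency, seed_node, max_nodes, rng):
    """Expand breadth-first from seed_node, visiting neighbors in random order,
    and stop once max_nodes nodes have been collected."""
    visited = [seed_node]
    seen = {seed_node}
    queue = deque([seed_node])
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        neighbors = list(adjacency.get(node, ()))
        rng.shuffle(neighbors)
        for nxt in neighbors:
            if nxt not in seen:
                seen.add(nxt)
                visited.append(nxt)
                queue.append(nxt)
                if len(visited) == max_nodes:
                    break
    # Keep only edges whose endpoints both landed in the sample.
    edges = [(u, v) for u in visited for v in adjacency.get(u, ()) if v in seen]
    return visited, edges


# Illustrative e-commerce fragment (names are assumptions for the example).
adjacency = {
    "User.user_id": ["Order.order_id"],
    "Order.order_id": ["Order.status", "ExchangeReturnRequest.request_id"],
    "ExchangeReturnRequest.request_id": ["ExchangeReturnRequest.refund_amount"],
}
nodes, edges = sample_subgraph(adjacency, "User.user_id", 3, random.Random(0))
```

Passing an explicit `random.Random` instance keeps sampling reproducible, which matters when the same task has to be regenerated against several environment versions.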

If this is right

Agents can be tested in sequences of evolved environments to measure adaptation.
New benchmarks can be generated automatically without manual redesign for each version.
Failure modes in agent behavior become identifiable when environments change systematically.
Supports diagnostic studies on representative agents in structured evolution scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such programmable evolution could extend to other domains like robotics or web navigation where environments change dynamically.
Integrating this with agent training might allow agents to learn from evolving settings during development.
Future work could explore automated detection of inconsistencies in transformations if the graph assumption weakens.

Load-bearing premise

That representing the environment as a typed relational graph allows transformations to update tools, schemas, and data without introducing inconsistencies that need manual fixes.

What would settle it

Running the framework on a new domain and checking if all generated environments execute without errors and maintain data consistency after multiple transformations.
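
One way such a check could be automated is sketched below. It assumes a simplified environment made of a schema (entity to field names), tool specifications that reference `Entity.field` strings, and records stored as dictionaries; none of these structures come from the paper, and `check_environment` is a hypothetical helper.

```python
def check_environment(schema, tools, records):
    """Return a list of human-readable inconsistencies; an empty list means none found."""
    problems = []
    # Every tool input/output must point at a field that exists in the schema.
    for tool_name, spec in tools.items():
        for ref in spec["inputs"] + spec["outputs"]:
            entity, _, field_name = ref.partition(".")
            if field_name not in schema.get(entity, set()):
                problems.append(f"{tool_name}: unknown field {ref}")
    # Every stored record must carry exactly the fields its entity declares.
    for entity, rows in records.items():
        expected = schema.get(entity)
        if expected is None:
            problems.append(f"data for unknown entity {entity}")
            continue
        for i, row in enumerate(rows):
            if set(row) != expected:
                problems.append(f"{entity}[{i}]: fields {sorted(row)} do not match schema")
    return problems


# Version 1: schema, tools, and data agree.
schema = {"User": {"user_id", "membership"}}
tools = {"get_membership": {"inputs": ["User.user_id"], "outputs": ["User.membership"]}}
records = {"User": [{"user_id": "u1", "membership": "gold"}]}
problems_v1 = check_environment(schema, tools, records)

# Version 2: a field and a tool were added, but the data was not backfilled.
schema_v2 = {"User": {"user_id", "membership", "miles_balance"}}
tools_v2 = {
    **tools,
    "get_miles_balance": {"inputs": ["User.user_id"], "outputs": ["User.miles_balance"]},
}
problems_v2 = check_environment(schema_v2, tools_v2, records)
```

Running a check like this after every transformation in a sequence would show whether inconsistencies accumulate. It covers only structural agreement, not whether the generated tool code executes correctly.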

Figures

Figures reproduced from arXiv: 2603.05910 by Bingrou Zhou, Binit Jha, Chen Luo, Dawn Song, Guangrui Li, Jason Choi, Michael J. Morais, Monica Xiao Cheng, Shaunak Mishra, Tianqi Zheng, Xingyuan Pan, Yaochen Xie, Yi Liu, Ziwei Dong.

**Figure 1.** End-to-end workflow of programmable environment evolution and graph-grounded task instantiation. Environment graphs are evolved via programmable graph edits and translated into executable code (left). Tasks are then generated by sampling subgraphs and materializing state-wise user intents and data into runnable sandbox instances (right), enabling controlled evaluation under evolving environments.

**Figure 2.** Programmable environment evolution via graph transformations. Starting from a seed environment graph G_0, we generate a curriculum of environments G_1, G_2, G_3 by applying explicit edit operators (arrows; e.g., component onboarding, schema/tool updates, and dependency rewiring), which add/remove nodes and edges in a coherent manner. This yields controlled environment dynamics while preserving a unifi…

**Figure 3.** Context as subgraph expansion in a tool-mediated conversation. At each turn, the environment exposes a reachable context subgraph (gray nodes; dashed arrows for reachable tool transitions), while the agent activates a subset of nodes (green) by executing tools/actions (solid arrows) conditioned on the dialogue. As the conversation progresses, executed transitions expand the active subgraph, enabling retrie…

**Figure 4.** Performance–efficiency trade-off of replay strategies. Each point corresponds to a model, plotted by average tool calls (x-axis) and success rate / mean completeness (y-axis) over the evolving episode. Faint markers denote the Baseline strategy, while bold markers denote the replay strategy; arrows connect the same model before and after replay. Top: Baseline → History Replay. Bottom: Baseline → Reflection…

**Figure 5.** Efficiency breakdown by task difficulty. We report average tool calls, estimated cost, conversation turns, and reward for each model on easy vs. hard tasks. Harder tasks generally require longer trajectories and more tool usage, i.e., 6.9 → 9.0/12.1. This suggests that DeepSeek can effectively leverage prior interaction traces to improve execution reliability under environmental changes, albeit…

read the original abstract

LLM-powered tool-calling agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet most existing benchmarks evaluate these systems under static environment interfaces, with fixed schemas and toolsets, making it difficult to assess how agents behave as environments evolve, that is, when capabilities are added, reorganized, or deprecated across successive environment versions. In this paper, we study structured environment evolution as a benchmark-construction problem for tool-calling agents. We propose ProEvolve, a graph-based framework that makes environment evolution programmable. At its core, a typed relational graph provides a unified, explicit representation of the environment: data, tools, and schema. Under this formalism, adding, removing, or modifying capabilities is expressed as graph transformations that coherently propagate updates across tools, schemas, and data access. Building on this, ProEvolve supports (1) automatic generation of evolved executable environments through explicit graph transformations, and (2) graph-grounded construction of task sandboxes via subgraph sampling and instantiation. We validate ProEvolve in two tool-calling domains, e-commerce and airline booking, in terms of quality, implementation validity, and failure modes. Finally, we use the generated benchmark as a downstream diagnostic to study how representative agents behave under structured environment evolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ProEvolve gives a graph-based way to make environment evolution explicit in agent benchmarks, but the executability guarantees look thin on the evidence shown.

The main point is that this paper builds a framework called ProEvolve to turn environment changes into programmable graph transformations instead of leaving benchmarks static. They represent tools, schemas, and data as one typed relational graph, then apply rewrite rules that update everything at once when capabilities get added, removed, or altered. From there they generate new executable environments and pull task sandboxes out of subgraphs. They run this in e-commerce and airline booking to check quality and some failure modes.

That setup directly tackles the problem that most current agent tests freeze the interface while real deployments keep shifting. The formalism looks workable for creating controlled variation without rebuilding from scratch each time. The validation effort in two concrete domains is a plus, even if it stays at the level of describing what they checked rather than showing big performance deltas.

The soft spot is the executability claim. The transformations are supposed to keep generated environments runnable, but the abstract gives no quantitative results on how often that holds or how they verified cross-layer invariants like tool preconditions and data constraints. If a dependency sits outside the relational model, the output could be well-formed on paper yet broken in practice. That part feels preliminary.

This is for people who build or run tool-calling agent evaluations and want more dynamic test suites. A reader focused on robustness would get concrete ideas from the graph construction even if they adapt only pieces of it. I would send it to peer review so the authors can add the missing numbers and checks on whether the generated cases actually expose new agent weaknesses.

Circularity Check

0 steps flagged

No significant circularity; constructive framework without self-referential reductions

The paper proposes ProEvolve as a methodological framework for programmable environment evolution in tool-calling agent benchmarks. It defines a typed relational graph as a unified explicit representation of data, tools, and schema, then expresses environment changes as graph transformations that propagate updates, enabling automatic generation of evolved executable environments and graph-grounded task sandboxes via subgraph sampling. No equations, fitted parameters, predictions of derived quantities, or load-bearing self-citations appear in the provided text. The central claims are presented as a construction procedure rather than a derivation that reduces to prior inputs by definition or statistical forcing. Validation is described in terms of quality and implementation validity in e-commerce and airline domains without circular reductions to the framework's own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central proposal rests on the assumption that environments can be faithfully captured by a single typed relational graph and that transformations on this graph remain coherent; no free parameters or invented physical entities are described.

axioms (1)

domain assumption: A typed relational graph provides a unified explicit representation of data, tools, and schema.
Stated as the core formalism enabling coherent propagation of updates.

invented entities (1)

ProEvolve graph-transformation framework · no independent evidence
purpose: To make environment evolution programmable and automatically generate evolved benchmarks
Newly introduced construction method with no independent evidence supplied in the abstract.

pith-pipeline@v0.9.0 · 5801 in / 1207 out tokens · 40471 ms · 2026-05-21T12:42:56.697232+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "a typed relational graph provides a unified, explicit representation of the environment - data, tools, and schema. Under this formalism, adding, removing, or modifying capabilities are expressed as graph transformations that coherently propagate updates across tools, schemas, and data access."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "environment evolution is formulated as graph transformations... G(0) → G(1) → ... via Δ(k)"
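
The chain G(0) → G(1) → ... in the quoted passage can be read as repeatedly applying an edit to the previous version. The sketch below illustrates that reading with a toy graph of nodes and edges; the operators `add_edge` and `remove_node` and the function `evolve` are hypothetical stand-ins, not the paper's Δ(k).

```python
def add_edge(src, dst):
    """Build an edit that adds an edge and, if needed, its endpoints."""
    def delta(graph):
        nodes, edges = graph
        return nodes | {src, dst}, edges | {(src, dst)}
    return delta


def remove_node(node):
    """Build an edit that removes a node together with every edge touching it."""
    def delta(graph):
        nodes, edges = graph
        return nodes - {node}, frozenset(e for e in edges if node not in e)
    return delta


def evolve(g0, deltas):
    """Return [G(0), G(1), ..., G(K)], where G(k) is deltas[k-1] applied to G(k-1)."""
    versions = [g0]
    for delta in deltas:
        versions.append(delta(versions[-1]))
    return versions


# A graph is a pair (nodes, edges) of frozensets, so earlier versions are never mutated.
g0 = (frozenset({"A"}), frozenset())
versions = evolve(g0, [add_edge("A", "B"), add_edge("B", "C"), remove_node("B")])
```

Removing a node also removes its incident edges, which is the smallest example of an edit that has to propagate to stay coherent.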

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

