arxiv: 2605.18548 · v1 · pith:ZXVUILDWnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI

STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

Pith reviewed 2026-05-20 10:55 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM agents · tool use · spatio-temporal dynamics · adaptive replanning · benchmark · dynamic environments · error modes

The pith

Frontier LLMs achieve under 40% accuracy on a benchmark of 227 tasks requiring replanning after sudden spatio-temporal disruptions in tool use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a new benchmark called STT-Arena to evaluate how well language models can detect mid-task changes in realistic executable environments and then adapt their plans accordingly. Existing tests focus mainly on noticing temporal shifts, but this work adds spatial dimensions and forces models to revise execution strategies when triggers invalidate prior decisions. Results show that even top proprietary models fall below 40 percent overall success, and analysis of failures reveals three repeated patterns that the authors then target with a training method combining trajectory refinement and online reinforcement learning to create a stronger 4B-parameter agent.

Core claim

STT-Arena provides 227 high-quality interactive tasks grounded in executable environments, covering nine spatio-temporal conflict types and four solvability levels, with injected triggers that abruptly invalidate ongoing plans. Frontier models exhibit three recurring error modes: continuing with stale state information, misidentifying the nature of a dynamic trigger, and failing to verify outcomes after adaptation. Refining training trajectories to remove these patterns and applying online RL yields STT-Agent-4B, which surpasses larger frontier models on the benchmark.

What carries the argument

STT-Arena benchmark of 227 tasks with nine spatio-temporal conflict types and injected triggers that force detection of state shifts followed by construction of revised execution strategies.
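
As a rough illustration of what an injected trigger could look like, the sketch below models a toy flight-booking environment as a Python class and mutates a price the first time the agent inspects the currently cheapest flight. It is not the paper's code: `FlightBookingEnv`, its method names, the flight IDs, the prices, and the trigger condition are all hypothetical and chosen only for brevity.

```python
from typing import Any, Dict, Optional


class FlightBookingEnv:
    """Hypothetical toy environment with one injected, fire-once trigger."""

    def __init__(self) -> None:
        self.prices: Dict[str, float] = {"FL100": 320.0, "FL200": 350.0}
        self.booked: Optional[str] = None
        self._conflict_triggered = False  # bookkeeping: the trigger fires at most once

    def search_flights(self) -> Dict[str, Any]:
        """Read operation: return a copy of the current price table."""
        return {"success": True, "data": dict(self.prices)}

    def check_availability(self, flight_id: str) -> Dict[str, Any]:
        """Read operation that also hosts the trigger (an assumption of this sketch)."""
        if flight_id not in self.prices:
            return {"success": False, "error": f"unknown flight {flight_id}"}
        cheapest = min(self.prices, key=lambda k: self.prices[k])
        # The condition depends on state (is this the cheapest flight?), not on call counts.
        if not self._conflict_triggered and flight_id == cheapest:
            self.prices[flight_id] = 410.0  # mid-task price change invalidates the plan
            self._conflict_triggered = True
        return {
            "success": True,
            "data": {"flight_id": flight_id, "price": self.prices[flight_id]},
        }

    def book_flight(self, flight_id: str) -> Dict[str, Any]:
        """State-changing operation: book at whatever the price is now."""
        if flight_id not in self.prices:
            return {"success": False, "error": f"unknown flight {flight_id}"}
        self.booked = flight_id
        price = self.prices[flight_id]
        return {"success": True, "message": f"booked {flight_id} at {price:.2f}"}
```

In this toy setup, an agent that books `FL100` based on its first search result pays the mutated price, while an agent that re-queries after `check_availability` sees that `FL200` has become the cheaper option.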

If this is right

  • Models require explicit mechanisms to monitor for state changes and to generate new plans once a prior strategy is invalidated.
  • Training data must include examples that correct stale-state execution and omitted post-adaptation checks.
  • Smaller models trained with targeted refinement and online RL can exceed the performance of much larger proprietary systems on dynamic tool-use problems.
  • Real-world agent deployments in changing environments need benchmarks that test both detection and adaptive replanning rather than detection alone.
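
The first implication, read together with the three reported error modes, can be sketched as a small control loop: re-read state immediately before acting, replan if it changed, and verify the outcome afterwards. This is a minimal sketch under stated assumptions, not the authors' agent; `book_with_replanning`, the callback names, and the toy price table are hypothetical, and the demo stub changes state on its second read purely to simulate a disruption.

```python
from typing import Callable, Dict, Optional

State = Dict[str, float]


def pick_cheapest(state: State) -> str:
    """Plan step: choose the lowest-priced option in the observed state."""
    return min(state, key=lambda k: state[k])


def book_with_replanning(
    observe: Callable[[], State],
    commit: Callable[[str], None],
    confirm: Callable[[], Optional[str]],
    max_replans: int = 3,
) -> str:
    snapshot = observe()
    choice = pick_cheapest(snapshot)
    for _ in range(max_replans + 1):
        fresh = observe()  # re-read right before acting (guards against stale state)
        if fresh != snapshot:  # state shift detected: discard the old plan
            snapshot, choice = fresh, pick_cheapest(fresh)
            continue
        commit(choice)
        if confirm() != choice:  # post-adaptation verification
            raise RuntimeError("booking not reflected in environment state")
        return choice
    raise RuntimeError("state kept changing; replanning budget exhausted")


# Toy demo (hypothetical data): the price of FL100 jumps on the second read.
prices: State = {"FL100": 320.0, "FL200": 350.0}
booked: Dict[str, str] = {}
reads = {"count": 0}


def observe() -> State:
    reads["count"] += 1
    if reads["count"] == 2:
        prices["FL100"] = 410.0  # simulated mid-task disruption
    return dict(prices)


def commit(flight_id: str) -> None:
    booked["flight"] = flight_id


def confirm() -> Optional[str]:
    return booked.get("flight")


result = book_with_replanning(observe, commit, confirm)
```

Here the loop initially plans for `FL100`, notices the changed price on its pre-action read, switches to `FL200`, and only returns after confirming that the booking is reflected in state.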

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Environments such as robotics navigation or real-time inventory systems would likely expose the same failure modes observed here.
  • The identified error patterns suggest that architectural additions for explicit state tracking could complement the proposed data refinement approach.
  • Extending the conflict types to include multi-agent interactions or physical constraints could further test the limits of current replanning capabilities.

Load-bearing premise

The 227 tasks and nine conflict types sufficiently represent the range of mid-task spatio-temporal disruptions that occur in realistic executable environments.

What would settle it

A model that reaches above 70 percent success across all 227 tasks while also handling standard static tool-use benchmarks at high accuracy would indicate either that the reported difficulty is not fundamental or that the task set does not capture the intended challenge.

Figures

Figures reproduced from arXiv: 2605.18548 by Chunxiao Liu, Hao Xu, Hongsheng Xin, Kun Zhan, Ning Miao, Pengyu Zhu, Sen Su, Tingfeng Hui.

Figure 1: Adaptive replanning in spatio-temporal environments: A mid-task price change invalidates the plan, prompting detection of updated prices and reselection of the optimal flight.
    To systematically characterize the environmental dynamics that necessitate such replanning, we identify three fundamental axes along which real-world conditions evolve. Temporal evolution refers to state changes that unfold over ti…
Figure 2: Overview of the STT-Arena construction pipeline. The pipeline consists of three stages: …
Figure 3: Overall Pass@1 performance of all evaluated models on STT-Arena. Results are grouped …
Figure 5: Performance gap between dynamic (STT-Arena) and non-dynamic environments.
    … competitive but consistently lower performance, with Deepseek-V3.2 (32.16%) being the strongest open-source contender yet still trailing the closed-source leaders by a non-trivial margin. This gap suggests that frontier closed-source models retain meaningful advantages in instruction following and adaptive decision-making under dynam…
Figure 6: Test-time scaling via Pass@k rate.
    [Bar-chart residue captured with this caption: y-axis Performance (%), ticks 0 to 35; x-axis models Deepseek-V3.2, Qwen-3.5-397B, GLM-5, MiniMax-M2.5; legend Qwen-3.5-397B, GPT-4.1, Deepseek-V3.2, No User; bar values as extracted: 32.2, 28.8, 24.2, 21.9, 30.8, 27.8, 25.6, 21.1, 31.7, 28.2, 24.7, 22.0, 26.4, 20.3, 21.1, 19.8.]
Figure 8: Effect of reasoning content on Pass@1 performance. Reasoning helps but design matters
Figure 9: Distribution of STT-Arena instances across difficulty levels and spatio-temporal subtypes.
Figure 10: Solvable and impossible instance distribution across the nine spatio-temporal subtypes.
Figure 11: Distribution of SFT trajectories across difficulty levels and spatio-temporal subtypes.
Figure 12: Distribution of RL tasks across difficulty levels and spatio-temporal subtypes.
Figure 13: Example of static environment, we implement the environment through Python class.
Figure 14: Example of blueprint, we design the blueprint based on the conflict types, difficulty levels, and environment information.
Figure 15: Example of user profile, our user simulator is configured by the profile and interact with …
Figure 16: Example of checklist which is the evaluation mechanism of our tasks.
Figure 17: Example Python function that validates one checklist item.
Figure 18: Example of dynamic environment which injects the spatio-temporal triggers into the static …
Figure 19: A representative Stale-State Execution failure: the agent makes 20+ blind retries on …
Figure 20: A representative Misdiagnosis of Dynamic Triggers failure: a Schengen transit restriction …
Figure 21: A representative Missing Post-Adaptation Verification failure: after a warehouse outage, …
Figure 22: System prompt for the stateful task filter (Stage 1, Step 1). This prompt instructs the LLM …
Figure 23: System prompt for the spatio-temporal sensitivity filter (Stage 1, Step 1). This prompt …
Figure 24: System prompt for inferring the latent environment context from a seed query (Stage …
Figure 26: System prompt for inferring the tool operation list (Stage 1, Step 2). The LLM generates a …
Figure 28: System prompt for implementing individual tool methods as Python code (Stage 1, Step 2).
Figure 29: System prompt for generating test configurations for functional validation (Stage 1, Step …
Figure 30: System prompt for generating the sequence of validation tool calls (Stage 1, Step 3). A …
Figure 31: System prompt for conflict type assignment (Stage 2, Step 1). Given a static environment’s …
Figure 32: System prompt for conflict blueprint design (Stage 2, Step 1). The LLM generates a …
Figure 33: System prompt for injecting spatio-temporal triggers into a static environment (Stage 2, …
Figure 34: System prompt for generating the user query, initial configuration, and concrete mutations …
Figure 35: System prompt for generating the user profile (Stage 2, Step 2). The LLM creates a …
Figure 36: System prompt for generating the evaluation checklist and check functions for solvable …
Figure 37: System prompt for the plan agent in dual-agent verification (Stage 3, Step 2). The planning …
Figure 38: System prompt for the check agent in dual-agent verification (Stage 3, Step 2). The …
Figure 39: System prompt for the consistency auditor (Stage 3, Step 3). An LLM-based auditor …
Figure 40: System prompt for the evaluated LLM during STT-Arena benchmarking. The model is …
Figure 41: System prompt for the passive user simulator during evaluation. The simulator responds …
Figure 42: System prompt for the LLM-as-a-judge on impossible tasks. The judge determines …
Original abstract

Large language models (LLMs) deployed in real-world agentic applications must be capable of replanning and adapting when mid-task disruptions invalidate their prior decisions. Existing dynamic benchmarks primarily measure whether LLMs can detect temporal changes in a timely manner, leaving the complementary challenge of adaptive replanning under spatio-temporal dynamics largely unexplored. We introduce STT-Arena (Spatio-Temporal Tool-Use Arena), a benchmark of 227 high-quality interactive tasks spanning nine spatio-temporal conflict types and four solvability levels. Each task is grounded in a realistic, executable environment equipped with injected spatio-temporal triggers that can abruptly invalidate an ongoing plan, forcing the model to detect the state shift and construct a revised execution strategy. Extensive evaluation of frontier LLMs reveals that even the SOTA proprietary models, including Claude-4.6-Opus, achieve less than 40% overall accuracy, highlighting the fundamental difficulty of spatio-temporal dynamic reasoning. Systematic analysis of failure trajectories uncovers three recurring error modes of existing models: Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification. Guided by these findings, we propose an iterative trajectory refinement technique that eliminates these failure patterns from training data, and combine it with online RL to produce STT-Agent-4B, which outperforms frontier LLMs on STT-Arena.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces STT-Arena, a benchmark of 227 interactive tasks across nine spatio-temporal conflict types and four solvability levels, grounded in an executable environment with injected triggers that invalidate ongoing plans. Frontier LLMs including Claude-4.6-Opus achieve under 40% overall accuracy, which the authors interpret as evidence of fundamental difficulty in spatio-temporal dynamic reasoning. The work identifies three recurring error modes (Stale-State Execution, Misdiagnosis of Dynamic Triggers, Missing Post-Adaptation Verification), then uses iterative trajectory refinement on observed failures plus online RL to train STT-Agent-4B, which outperforms the evaluated frontier models on the same benchmark.

Significance. If the tasks validly probe replanning under abrupt state invalidation, the sub-40% ceiling for current SOTA models identifies a concrete and practically relevant limitation for agentic tool use. The error-mode taxonomy supplies diagnostic value, and the refinement-plus-RL pipeline offers a replicable recipe for targeted improvement. These elements could steer future agent benchmarks and training regimes toward greater robustness in dynamic settings.

major comments (2)
  1. [§3] §3 (Benchmark Construction): the description of how the 227 tasks and nine conflict types are generated, how triggers are injected into the executable environment, and how solvability levels are assigned remains high-level. Without these operational details it is difficult to judge whether the tasks constitute a representative sample of realistic mid-task spatio-temporal disruptions or whether the reported difficulty is partly an artifact of task design.
  2. [§6] §6 (STT-Agent-4B Training): the iterative trajectory refinement step explicitly uses failure trajectories collected on STT-Arena itself to curate training data. This introduces a direct dependence between the evaluation distribution and the training distribution for the proposed model, which undermines the claim that STT-Agent-4B demonstrates generalizable superiority rather than benchmark-specific adaptation.
minor comments (2)
  1. [§4.2] §4.2 and associated tables: accuracies are reported as point estimates without standard errors, number of runs, or statistical tests; adding these would strengthen the comparative claims.
  2. [Figure 3] Figure 3 (error-mode trajectories): axis labels and legend entries could be enlarged for readability when printed.
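
To make the first minor comment concrete, the sketch below estimates the sampling uncertainty of a single pass rate on 227 tasks. It assumes, hypothetically, that Pass@1 is a simple fraction of tasks solved in one run, so the 32.16% quoted for Deepseek-V3.2 in the Figure 5 excerpt would correspond to 73 of 227 tasks; the function names are illustrative, and the Wilson interval is one common choice rather than anything the paper reports.

```python
import math
from typing import Tuple


def binomial_se(successes: int, n: int) -> float:
    """Standard error of a proportion under a simple binomial model."""
    p = successes / n
    return math.sqrt(p * (1.0 - p) / n)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


n_tasks = 227
passes = 73  # assumption: 73 / 227 rounds to the quoted 32.16%
se = binomial_se(passes, n_tasks)
low, high = wilson_interval(passes, n_tasks)
print(f"pass rate {passes / n_tasks:.4f}, SE {se:.4f}, 95% Wilson CI [{low:.4f}, {high:.4f}]")
```

Under these assumptions the standard error is about 3 percentage points, so gaps of a few points between models on 227 tasks may not be distinguishable from sampling noise without repeated runs or a significance test.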

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, indicating where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): the description of how the 227 tasks and nine conflict types are generated, how triggers are injected into the executable environment, and how solvability levels are assigned remains high-level. Without these operational details it is difficult to judge whether the tasks constitute a representative sample of realistic mid-task spatio-temporal disruptions or whether the reported difficulty is partly an artifact of task design.

    Authors: We agree that the current description in §3 is high-level and that more operational details are needed for readers to assess the benchmark's construction and realism. In the revised manuscript we will expand this section with concrete examples of task generation, the mechanism for injecting spatio-temporal triggers into the executable environment, and the criteria used to assign solvability levels. These additions will include pseudocode outlines and validation steps to show that the 227 tasks are intended to reflect realistic mid-task disruptions rather than artifacts of the design. revision: yes

  2. Referee: [§6] §6 (STT-Agent-4B Training): the iterative trajectory refinement step explicitly uses failure trajectories collected on STT-Arena itself to curate training data. This introduces a direct dependence between the evaluation distribution and the training distribution for the proposed model, which undermines the claim that STT-Agent-4B demonstrates generalizable superiority rather than benchmark-specific adaptation.

    Authors: We acknowledge the validity of this concern: collecting failure trajectories directly from STT-Arena for iterative refinement does create a dependence between the training data and the evaluation distribution. This approach was chosen to systematically eliminate the three identified error modes, but it does limit strong claims of generalizability beyond the benchmark. In the revision we will add explicit discussion in §6 and a dedicated limitations paragraph clarifying that STT-Agent-4B demonstrates targeted improvement on STT-Arena via this pipeline and that broader generalization claims would require separate evaluation on other dynamic environments. We will adjust the language around 'outperforms frontier LLMs' to be scoped to the present benchmark. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in core derivation

Full rationale

The paper's central claim—that frontier models achieve under 40% accuracy on spatio-temporal replanning—rests on direct evaluation of existing LLMs against a newly constructed set of 227 tasks with explicitly enumerated conflict types and solvability levels in an executable environment. This evaluation does not reduce to any self-definition, fitted parameter renamed as prediction, or self-citation chain. The subsequent error-mode analysis and iterative refinement to produce STT-Agent-4B constitute a downstream application that uses observed failures for training-data curation; it is not a prerequisite for the difficulty result and does not alter the independence of the benchmark scores reported for external models. No load-bearing step in the reported chain collapses by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the premise that the constructed tasks and injected triggers form a realistic proxy for real-world disruptions; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption The nine spatio-temporal conflict types and four solvability levels adequately sample the space of plan-invalidating events in executable environments.
    Invoked when designing the 227 tasks and claiming coverage of dynamic reasoning challenges.

pith-pipeline@v0.9.0 · 5787 in / 1312 out tokens · 34377 ms · 2026-05-20T10:55:52.425035+00:00 · methodology
