Hierarchical Experimentalist Agents
Pith reviewed 2026-06-30 07:27 UTC · model grok-4.3
The pith
Hierarchical Experimentalist Agents let LLMs learn reusable skills through active experimentation, raising success on hard physics tasks from 2 percent to 77 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HExA iteratively designs and refines query-relevant experiments, learns a reusable library of composable skills from experience, and integrates experimental evidence to answer queries or take actions. On the Interphyre benchmark the method raises the success rate of the same base model from 2 percent to as high as 77 percent on the hardest levels. Skills acquired only from easier levels transfer to achieve 44 percent success on harder levels without any further active experimentation.
What carries the argument
The hierarchical loop that alternates between experiment design in simulation, extraction of composable skills into a growing library, and integration of evidence to solve the original query.
If this is right
- Agents can transfer skills learned on easier simulation levels to solve harder levels without additional experimentation.
- The same base model reaches substantially higher success rates on tasks that require discovering physical interactions rather than recalling them.
- The framework improves both closed-source and open-weight models and outperforms standard agent baselines such as ReAct and Reflexion.
- No offline data, oracles, or external supervision are required for the performance gains.
Where Pith is reading between the lines
- The same loop could be applied to other interactive simulators such as chemistry or robotics environments where hypotheses must be tested through actions.
- Larger skill libraries built over many sessions might enable cumulative improvement across entirely new task families.
- The reusability result suggests that once a modest set of skills is acquired, further gains may come from composition rather than repeated experimentation.
Load-bearing premise
The performance gains are attributable to the hierarchical experimentalist loop and skill library rather than to prompt length, specific tool-calling format, or other unablated factors in the agent scaffolding.
What would settle it
An ablation study that disables the skill library or replaces the experiment-design step with random interventions and measures whether success rates on the hardest Interphyre levels fall back near the 2 percent baseline would settle the claim.
Figures
read the original abstract
Large language models (LLMs) are increasingly used to take actions in the real world and support human decision-making, yet most agents rely on parametric knowledge, fixed post-training data, retrieval, or search. This paradigm breaks down in novel domains and for sophisticated queries that cannot be answered from prior knowledge alone. Knowing the laws of physics, for instance, does not by itself enable LLMs to answer queries or complete long-horizon tasks in a complex physical system. To address this, we introduce Hierarchical Experimentalist Agents (HExA), an in-context self-improvement framework to learn from active experimentation. HExA iteratively designs and refines query-relevant experiments, learns a reusable library of composable skills from experience, and integrates experimental evidence to answer queries or take actions. HExA is training-free, compatible with any black-box model, and does not require external supervision, oracles, or offline data. To evaluate active experimentation, we introduce Interphyre, a tool-calling benchmark built on the PHYRE 2D procedural physics environment, where agents propose interventions and test hypotheses through simulation APIs. Experiments show that current LLM agents struggle in these settings, especially on the hardest levels of Interphyre. Claude Sonnet 4.6 achieves only 2% success, while HExA improves the same model to up to 77% success. HExA also improves open-weight models and outperforms agentic baselines such as ReAct and Reflexion. Moreover, using only skills learned from easier levels and transferred without active experimentation, HExA achieves 44% success, demonstrating the reusability and generalization of its learned skills. Overall, HExA shows that learning through active experimentation can help agents discover useful knowledge, acquire reusable skills, and make efficient progress on novel long-horizon tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Hierarchical Experimentalist Agents (HExA), a training-free in-context framework in which LLMs iteratively design and refine experiments, acquire a reusable library of composable skills from simulation feedback, and integrate evidence to solve novel long-horizon tasks. On the newly introduced Interphyre benchmark (built on PHYRE), the method is reported to raise Claude Sonnet 4.6 success from 2 % to as high as 77 % on the hardest levels and to achieve 44 % success via zero-shot transfer of skills learned only on easier levels; it also outperforms ReAct and Reflexion baselines across several models.
Significance. If the reported gains prove robust and causally attributable to the hierarchical experimentalist loop plus skill library, the work would constitute a meaningful step toward agents that can discover domain-specific knowledge through active experimentation rather than relying solely on parametric memory or retrieval. The Interphyre benchmark itself supplies a concrete, tool-calling testbed for physics-based hypothesis testing that is currently absent from most agent evaluations.
major comments (2)
- [Abstract / Results] Abstract and results sections: the headline performance figures (2 % o 77 % and 44 % transfer) are presented without any report of the number of independent trials, standard errors, or variance across random seeds or level instantiations, so the statistical reliability of the central empirical claim cannot be assessed from the manuscript.
- [Evaluation] Evaluation / experimental setup: no ablation is described that keeps total prompt length, tool-calling schema, and context-management format identical while removing only the hierarchical planning loop and the skill-library acquisition mechanism; therefore the attribution of gains specifically to HExA rather than to changes in scaffolding remains unverified.
minor comments (1)
- [Introduction] A short paragraph clarifying how Interphyre extends or differs from the original PHYRE task distribution would help readers situate the new benchmark.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our empirical results and strengthen the attribution of gains to HExA. We address each point below and commit to revisions where appropriate.
read point-by-point responses
-
Referee: [Abstract / Results] Abstract and results sections: the headline performance figures (2 % to 77 % and 44 % transfer) are presented without any report of the number of independent trials, standard errors, or variance across random seeds or level instantiations, so the statistical reliability of the central empirical claim cannot be assessed from the manuscript.
Authors: We agree that statistical details are necessary to evaluate reliability. All reported figures were obtained from 5 independent random seeds per level across the full Interphyre test set. In the revised manuscript we will report means, standard errors, and per-level variance both in the abstract and in the results tables. revision: yes
-
Referee: [Evaluation] Evaluation / experimental setup: no ablation is described that keeps total prompt length, tool-calling schema, and context-management format identical while removing only the hierarchical planning loop and the skill-library acquisition mechanism; therefore the attribution of gains specifically to HExA rather than to changes in scaffolding remains unverified.
Authors: We acknowledge the absence of a tightly controlled ablation that holds prompt length, tool schema, and context management fixed while ablating only the hierarchical loop and skill library. We will add this ablation in the revision to isolate the contribution of those two components. revision: yes
Circularity Check
No circularity: purely empirical claims on new benchmark
full rationale
The paper introduces the HExA framework and the Interphyre benchmark, then reports empirical success rates (e.g., Claude Sonnet 4.6 improving from 2% to 77% on hardest levels, plus 44% zero-shot transfer). No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. All load-bearing claims are direct experimental outcomes rather than quantities that reduce to their own inputs by construction, self-citation chains, or renamed ansatzes. The evaluation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs can design and refine query-relevant experiments in-context using only simulator feedback
Reference graph
Works this paper leans on
-
[1]
61 Anoop Cherian, Radu Corcodel, Siddarth Jain, and Diego Romeres
URLhttps://arxiv.org/abs/2507.15550. 61 Anoop Cherian, Radu Corcodel, Siddarth Jain, and Diego Romeres. LLMPhy: Complex physical reasoning using large language models and world models, 2024. URL https://arxiv.org/ abs/2411.08027. 61 Wei Chow, Jiajun Mao, Bowen Li, Daniel Seita, Vitor Guizilini, and Yue Wang. PhysBench: Benchmarking and enhancing vision-la...
-
[2]
Retrieval-Augmented Generation for Large Language Models: A Survey
URLhttps://arxiv.org/abs/2312.10997. 1 Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025. 5 15 Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, ...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-025-09422-z 2025
-
[3]
24, 58 OpenAI. Gpt-5 technical report. 2025. URL https://cdn.openai.com/gpt-5-system-card. pdf. 1 Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christ...
-
[4]
Spies, William Edwards, Michael I
2 Alex F. Spies, William Edwards, Michael I. Ivanitskiy, et al. Transformers use causal world models in maze-solving tasks. InInternational Conference on Learning Representations (ICLR), 2025. URLhttps://arxiv.org/abs/2412.11867. 61 Adaptive Agent Team, Jakob Bauer, Kate Baumli, Satinder Baveja, Feryal Behbahani, Avishkar Bhoopchand, Nathalie Bradley-Schm...
-
[5]
URLhttps://arxiv.org/abs/2107.12808. 2, 3 Qwen Team. Qwen3. 5: Towards native multimodal agents.URL: https://qwen. ai/blog, 2026. 1 Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U. Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning env...
-
[6]
identify which source skills encode principles that are structurally relevant to the target (based on shared physical or procedural primitives),
-
[7]
re-ground each selected principle in the entities and mechanics of the target scene, and
-
[8]
No contact events recorded
calibrate reward based on how directly the principle transfers (skills corroborated by multiple source banks receive higher reward). The synthesised bank Kℓ∗ is then injected into the actor’s context via the sameRETRIEVER mecha- nism as within-task skills. This enables zero-shot transfer: the actor attempts the target task with the benefit of cross-task s...
-
[9]
Write the reflection now
{prior_reflection_1} ... Write the reflection now. Be specific and concise (<=5 sentences). Trial 2 conditioning.The returned reflection text is appended to the level’s system prompt under a heading## Reflexion memory before trial 2 starts. The actor sees this block exactly once at trial 2’s first turn, in addition to the unchanged level system prompt and...
-
[10]
The returned object table (positions, sizes, dynamic flags, and any level-specific geometry) is captured verbatim
A single scripted call toget_level_state via the same INTERPHYREtoolkit used by REACT. The returned object table (positions, sizes, dynamic flags, and any level-specific geometry) is captured verbatim
-
[11]
x": <X>,
One non-streaming Claude Sonnet 4.6 call. The conversation has a level-specific system prompt and a single user message that embeds the scene table from step 1 and demands the final (x, y, r) placement as JSON. The harness parses the JSON, runs simulate_action once, and records success/failure. The model is given no read tools, no probe tools (simulate_pa...
2024
-
[15]
Arguments: None Usage: Action: get_contact_log
get_contact_log 39 Description: After running a simulation, returns the contact events: which objects touched and when. Arguments: None Usage: Action: get_contact_log
-
[17]
x": 1.2,
trace_green_ball Description: Lightweight trajectory probe --- only the green ball is sampled. Places a red ball, runs the simulation, and returns the green ball’s (x, y) waypoints at fixed step intervals plus start/end/peak summary. Stops early once the green ball comes to rest (capped at ~600 steps). Use this when you only care about WHERE the green bal...
-
[19]
x": 0.5,
finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
2000
-
[22]
x": 0.5,
simulate_action Description: Place a red ball at (x, y) with the given radius and run the full physics simulation to completion. Returns whether the goal was achieved, final positions of all objects, and total simulation steps. If the placement is invalid (out of bounds or overlaps), returns a detailed error with how far to move the ball. Arguments: x (fl...
-
[25]
x": 1.2,
trace_green_ball Description: Lightweight trajectory probe --- only the green ball is sampled. Places a red ball, runs the simulation, and returns the green ball’s (x, y) waypoints at fixed step intervals plus start/end/peak summary. Stops early once the green ball comes to rest (capped at ~600 steps). Use this when you only care about WHERE the green bal...
-
[26]
x": 1.2,
predict_first_contact Description: Cheap pre-simulation check (<=90 physics steps, ~1.5s of sim time). Runs just long enough to find the FIRST object the red ball touches after it is released, and reports: placement validity, the other object’s name, the step of impact, approach speed, approximate contact point, and surface normal. Use this to verify that...
-
[27]
x": 0.5,
finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
-
[28]
(2) short range -> increase y to 0.9 or shift x to 0.4--0.45
[Strategy: Three-Tier Fallback Sequence for Failed Primary Placement] When primary (x=0.5, y=0.4, r=1.5) fails: (1) ceiling hit -> immediately try x=0.3, y=0.4, r=1.5 (gray-ball-safe first choice); if still ceiling hit, try x=0.3, y=-0.3, r=2.0 but verify no overlap first. (2) short range -> increase y to 0.9 or shift x to 0.4--0.45. (3) persistent failur...
-
[29]
[Strategy: x=0.3 Is the Ceiling-Escape x-Position; y=0.4 Is the Gray-Ball-Safe Default Height] Shifting x from 0.5 to 0.3 flattens the green ball’s launch arc by changing the arm contact point --- this x-shift is the primary ceiling-escape mechanism, not lowering y. When x=0.3, y=-0.3, r=2.0 is invalid due to gray_ball pivot overlap (the pivot at r~=0.70 ...
-
[30]
The contact point is on the right side of the pivot, creating sufficient lever arm for strong rotation while staying in the geometrically stable zone
[Strategy: x~=0.5 Is the Primary Catapult Launch Sweet Spot] Placing the red ball at x=0.5 on the catapult arm produces a consistent rightward launch arc. The contact point is on the right side of the pivot, creating sufficient lever arm for strong rotation while staying in the geometrically stable zone. Only deviate to x=0.3 when a ceiling hit is observe...
-
[31]
Below r~=1.2, the arm rotates too slowly to launch the green ball across the ~7-unit gap to the basket
[Strategy: Use Large Radius (>=1.5) for Sufficient Launch Energy] The catapult arm is a lever --- the red ball’s mass (proportional tor^3) determines angular momentum imparted to the arm. Below r~=1.2, the arm rotates too slowly to launch the green ball across the ~7-unit gap to the basket. r=1.5 is the minimum reliable threshold; r=2.0 adds negligible ad...
-
[32]
When the arm sits higher (y > -1.5), increase to y=0.9
[Strategy: y=0.4 Is Default Drop Height; Scale Up to y=0.9 When Arm Sits Higher] When the arm y <= -1.5, y=0.4 gives adequate fall distance for the red ball to build approach speed. When the arm sits higher (y > -1.5), increase to y=0.9. Critically, y=0.4 also serves as the safe ceiling-escape fallback height --- even when using x=0.3 for arc flattening, ...
-
[33]
predict_first_contact verifies both placement validity and that the red ball contacts gray_platform (not another object), confirming the catapult mechanism activates
[Strategy: predict_first_contact Is Essential for Variable-Arm Seeds] The catapult arm’s y-position varies across seeds, making the overlap boundary unpredictable. predict_first_contact verifies both placement validity and that the red ball contacts gray_platform (not another object), confirming the catapult mechanism activates. It also catches gray_ball ...
-
[34]
When the ceiling blocker is at x > -2.5, primary placement causes a ceiling hit
[Strategy: Ceiling Blocker Lethality Depends on Its x-Position and Arc Angle] The static black ball near y~=4.6 varies in x-position across seeds (x~=-3.85 to x~=-1.14). When the ceiling blocker is at x > -2.5, primary placement causes a ceiling hit. When the blocker is at x < -3.5 (far left), x=0.3 launches allow the green ball to arc up and even bounce ...
-
[35]
The stable launch zone is x~=0.0 to 0.5
[Strategy: x-Position Fine-Tunes Landing Range; Bifurcation Zones at x<=-0.3 and x=0.7--1.5] Small x shifts (0.1--0.2 units) cause large changes in green ball landing position. The stable launch zone is x~=0.0 to 0.5. Positions x<=-0.3 and x=0.7--1.5 are bifurcation zones where trajectory is chaotic and small changes produce unpredictable outcomes. Within...
-
[36]
Increasing mass beyond 1.5 at this x cannot convert to additional angular momentum because the arm hits its geometric rotation limit
[Strategy: Radius Plateau at x=0.5: r Beyond 1.5 Yields No Additional Range] At x~=0.5, the catapult arm’s rotation saturates --- r=1.5 and r=2.0 produce identical green ball landing positions. Increasing mass beyond 1.5 at this x cannot convert to additional angular momentum because the arm hits its geometric rotation limit. Intermediate radii (r=1.1--1....
-
[37]
This occurs because ceiling clearance is determined by launch angle (set by x-position), not energy level
[Strategy: When Ceiling-Range Tradeoff Is Unsolvable by Radius Tuning, Shift x to 0.1--0.3] When r=1.5 hits the ceiling and r=1.0 falls short (Deltax gap > 1.0 unit), interpolating intermediate radii (r=1.1--1.4) does not resolve the conflict --- they also hit the ceiling or remain short. This occurs because ceiling clearance is determined by launch angle...
-
[38]
After 2-3 failures, 54 agents try small perturbations instead of qualitatively different placements
**Agent fixates on a single launch mechanism (catapult arm) and micro-tunes x/y/radius around the same narrow region without escaping the local solution space.** Why it happens: The catapult arm is the most obvious mechanism. After 2-3 failures, 54 agents try small perturbations instead of qualitatively different placements. How to avoid: After 2 failures...
-
[39]
But lever rotation saturates before additional mass converts to green ball velocity at x=0.5
**Agent increases radius at fixed x=0.5 expecting more range, but arm rotation is saturated and result is identical.** Why it happens: Linear energy intuition: more mass = more energy = more range. But lever rotation saturates before additional mass converts to green ball velocity at x=0.5. How to avoid: When r=1.5 fails at x=0.5, never try r=2.0 at the s...
-
[40]
But more energy produces a higher-arcing launch, worsening the ceiling hit
**Agent adjusts y in the wrong direction after ceiling hits: increases y to 2.0+ (higher drop) instead of shifting x to 0.3 or using y=0.4 at x=0.3.** Why it happens: Higher drop = more kinetic energy at impact. But more energy produces a higher-arcing launch, worsening the ceiling hit. How to avoid: After a ceiling hit, shift x left to 0.3 (keeping y=0.4...
-
[41]
How to avoid: If any attempt lands in x<=-0.3 or x=0.7--1.5, immediately reset to the stable zone x=0.2--0.4
**Agent binary-searches in the x=-0.3 to -0.5 or x=0.8--1.5 bifurcation zones where green ball trajectory is chaotic.** Why it happens: Small x adjustments feel like fine-tuning, but in bifurcation zones they produce wildly different outcomes that appear to need more fine-tuning. How to avoid: If any attempt lands in x<=-0.3 or x=0.7--1.5, immediately res...
-
[42]
But basket physics are unpredictable and deflectors rarely produce sustained green-blue contact
**Agent places deflector balls near the basket (x~=3.5--4.5) hoping to push the blue ball toward the green ball’s landing zone.** Why it happens: When the green ball consistently lands short, moving the target (blue ball) seems logical. But basket physics are unpredictable and deflectors rarely produce sustained green-blue contact. How to avoid: Fix the g...
-
[43]
Review the LEARNED PHYSICS SKILLS listed in the system prompt above
-
[44]
Identify which skill applies to the failure you just observed
-
[45]
Explicitly state which skill you are applying and why in your next Thought. E.6 HEXA trajectory (catapult seed 45, 6 iterations, SUCCESS) --- Step 1 --- Thought: Let me start by describing the scene geometry to understand the layout before attempting any placement. Action: describe_scene_geometry({}) Observation: === Scene Geometry === World bounds: x in ...
2000
-
[46]
Paired counterfactual evaluation follows: the same agent, on the same scene, under matched and perturbed physical conditions
An intervention API with snapshot/restore semantics.Any running simulation can be snapshot- ted at an arbitrary point (on a physics event, a contact trigger, or a fixed timestep) and restored into two or more branches that diverge under controlled perturbations (impulses, forces, parame- ter changes, object additions or removals). Paired counterfactual ev...
-
[47]
Extensible level authoring as Python code.Levels are Python classes built from a composable object API, with tunable physics parameters (gravity, friction, restitution, density) and user-defined success conditions. Like Toybox’s reimplementation of Atari games (Foley et al., 2018), Interphyre reimplements the 2D physics puzzle paradigm established by PHYR...
2018
-
[48]
sample, simulate, update
LLM-native interfaces and interpretability-ready data generation.The simulator and inter- vention API are exposed as a tool-call surface that an LLM agent can invoke directly. The same interfaces support standalone generation of paired (factual, counterfactual) trajectory data for downstream interpretability and offline reinforcement learning pipelines. A...
2019
-
[49]
green_ball
uses LLM agents to actively probe interactive physics simulations and formulate hypotheses, the closest existing paradigm to Interphyre, though PhysGym targets discovery of physical equations rather than intuitive task-solving. LLMPhy (Cherian et al., 2024) demonstrated a simulator-in-the- loop framework where LLMs iteratively estimate physical parameters...
2024
-
[53]
x": 0.5,
simulate_partial Description: Place a red ball and run the simulation only up to the specified step. Returns object positions and velocities at that point. Useful for observing mid-simulation dynamics. Arguments: x (float), y (float), radius (float), stop_step (int) Usage: Action: simulate_partial Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6, "stop_st...
-
[54]
Returns the left gap and right gap, and whether the green ball can fit through each gap
compute_gap_analysis Description: Analyze the gaps on each side of the platform. Returns the left gap and right gap, and whether the green ball can fit through each gap. Arguments: None Usage: Action: compute_gap_analysis
-
[55]
x": 0.5,
finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
-
[60]
Returns their coordinates, distance, on which side the blue ball is relative to green, and recommended red ball placement direction
compute_relative_positions Description: Analyze the positions of the green and blue balls. Returns their coordinates, distance, on which side the blue ball is relative to green, and recommended red ball placement direction. Arguments: None Usage: Action: compute_relative_positions
-
[61]
x": 0.5,
finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
-
[66]
Returns the center of the ramp
get_ramp_center Description: Analyze the pass_the_parcel setup. Returns the center of the ramp. 70 Arguments: None Usage: Action: get_ramp_center
-
[67]
x": 0.5,
finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
-
[68]
No prescriptive advice; you interpret the layout to form a strategy
describe_scene_geometry Description: Return strategy-neutral geometry: every ball (position, radius, dynamic flag), every bar (position, angle, length, dynamic flag), every basket (position, dynamic flag), and the key distance (green <-> blue). No prescriptive advice; you interpret the layout to form a strategy. Arguments: None Usage: Action: describe_sce...
-
[73]
x": 1.2,
trace_green_ball Description: Lightweight trajectory probe --- only the green ball is sampled. Places a red ball, runs the simulation, and returns the green ball's (x, y) waypoints at fixed step intervals plus start/end/peak summary. Stops early once the green ball comes to rest ( capped at ~600 steps). Use this when you only care about WHERE the green ba...
-
[74]
x": 1.2,
predict_first_contact Description: Cheap pre-simulation check (<=90 physics steps, ~1.5s of sim time). Runs just long enough to find the FIRST object the red ball touches after it is released, and reports: placement validity, the other object's name, the step of impact, approach speed, approximate contact point, and surface normal. Use this to verify that...
-
[75]
green_ball
simulate_with_trace Description: Place a red ball and run the simulation. Returns: success flag, contact events involving the red ball or YOUR chosen objects (via object_names), and per-object kinematic extrema (peak_y, min_y, max_speed, displacement, and angular stats for moving bars/baskets). You choose which objects to trace---e.g., ["green_ball"] to s...
-
[76]
x": 0.5,
finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
-
[81]
compute_intercept_setup Description: Computes intercept geometry for the falling_into_place level. Returns which platform the green ball is on, which direction it must travel to reach the jar, the platform edge it must cross, the gap center, and the estimated time before the jar reaches platform height. Arguments: None Usage: Action: compute_intercept_setup
-
[82]
x": 0.5,
finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
-
[87]
Returns the green ball position, basket position and scale, purple ground position, and recommended push direction to deflect the green ball away from the basket
compute_basket_analysis Description: Analyze the basket case setup. Returns the green ball position, basket position and scale, purple ground position, and recommended push direction to deflect the green ball away from the basket. Arguments: None Usage: Action: compute_basket_analysis 75
-
[88]
x": 0.5,
finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
-
[89]
Arguments: None Usage: Action: get_level_state
get_level_state 76 Description: Get the current level layout including all object positions, sizes, and properties. Arguments: None Usage: Action: get_level_state
-
[93]
compute_cliffhanger_analysis Description: Analyse the cliffhanger geometry. Returns the green bar's centre, length, and the (x, y) coordinates of its bottom point (resting on the platform) and top point ( opposite end); the platform's left/right extents and top-surface y; the ceiling y and purple-ground y; the bar's distance to each platform edge; which e...
-
[94]
x": 0.5,
finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
-
[95]
Arguments: None Usage: Action: get_level_state
get_level_state Description: Get the current level layout including all object positions, sizes, and properties. Arguments: None Usage: Action: get_level_state
-
[96]
x": 0.5,
simulate_action Description: Place a red ball at (x, y) with the given radius and run the full physics simulation to completion. Returns whether the goal was achieved, final positions of all objects, and total simulation steps. If the placement is invalid (out of bounds or overlaps ), returns a detailed error with how far to move the ball. Arguments: x (f...
-
[97]
Arguments: None Usage: Action: get_contact_log
get_contact_log Description: After running a simulation, returns the contact events: which objects touched and when. Arguments: None Usage: Action: get_contact_log
-
[98]
x": 0.5,
simulate_partial Description: Place a red ball and run the simulation only up to the specified step. Returns object positions and velocities at that point. Useful for observing mid-simulation dynamics. Arguments: x (float), y (float), radius (float), stop_step (int) Usage: Action: simulate_partial Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6, "stop_step": 50}
-
[99]
compute_tipping_point_analysis Description: Analyse tipping_point geometry. Returns the green bar's centre, length, angle, and the (x, y) coordinates of its top and bottom endpoints; the basket's centre and floor; the purple wall's x position and its top/bottom y; the purple wall's side relative to the green bar (LEFT or RIGHT); the horizontal distance fr...
-
[100]
x": 0.5,
finish Description: Submit your final answer. Use this when you are confident in your solution. Arguments: x (float), y (float), radius (float) Usage: Action: finish Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6} To solve this puzzle, you will reason step-by-step and use tools to test your ideas. At each step, you MUST follow this exact format: Thought...
-
[103]
good steps within bad trajectories
A concrete actionable fix **Part 2 -- Partial insights**: Even in failed trajectories, some individual steps show CORRECT physics reasoning or useful discoveries (e.g., the agent found a valid placement region but then abandoned it, or correctly identified a mechanism but applied it with wrong parameters). Extract 1-2 skills from these "good steps within ...
-
[105]
Hard constraints: - Maximum {max_skills} total skills for this level - Maximum {max_mistakes} total mistakes for this level
-
[106]
is_new": false - If it's a NEW skill extracted from the new trajectories: set
For each skill you include: - If it's a RETAINED skill from the existing bank: set "is_new": false - If it's a NEW skill extracted from the new trajectories: set "is_new": true - Include "source_seeds" listing seed numbers where this skill was observed (required for confidence calibration) - Include "confidence": a float in [0.1, 1.0] representing your co...
-
[108]
For new skills, estimate confidence based on: - Success rate among source trajectories (high success = high confidence) - Universality (applies to multiple seed conditions = higher confidence) - Clarity and actionability of the principle
-
[109]
If a new trajectory confirms an existing skill, keep the existing one (possibly with slightly higher confidence)
Do not include duplicate skills. If a new trajectory confirms an existing skill, keep the existing one (possibly with slightly higher confidence)
-
[110]
- Contradicted by the new trajectories - Too specific or rarely applicable - Low confidence (< 0.3) and not directly observed in new trajectories
Remove skills that are: - Redundant or subsumed by other skills. - Contradicted by the new trajectories - Too specific or rarely applicable - Low confidence (< 0.3) and not directly observed in new trajectories
-
[111]
skills": [ {{
Do not remove mistakes unless the new trajectories show they're no longer common. OUTPUT JSON OBJECT: {{ "skills": [ {{ "title": "<short name of skill>", "principle": "<2-3 sentence physics insight>", "when_to_apply": "<condition for applicability>", "example": "<optional concrete coordinate example>", "source_seeds": [<seed numbers>], "confidence": <floa...
-
[112]
What exactly the agent did wrong
-
[113]
WHY the agent made this error (what broken causal belief led to it)
-
[114]
good steps within bad trajectories
A concrete actionable fix **Part 2 -- Partial insights**: Even in failed trajectories, some individual steps show CORRECT physics reasoning or useful discoveries (e.g., the agent found a promising placement region but then abandoned it, or correctly identified a mechanism but applied it with wrong parameters). Extract 1-2 skills from these "good steps wit...
-
[115]
Output the COMPLETE FINAL skill bank (not a diff) -- include both retained existing skills and any new ones
-
[116]
- Maximum {max_mistakes} total mistakes for this level
Hard constraints: - Maximum {max_skills} total skills for this level. - Maximum {max_mistakes} total mistakes for this level
-
[117]
is_new": false. - If it's a NEW skill extracted from the new trajectories: set
For each skill you include: - If it's a RETAINED skill from the existing bank: set "is_new": false. - If it's a NEW skill extracted from the new trajectories: set "is_new": true. - Include "source_seeds" listing seed numbers where this skill was observed (required for confidence calibration). - Include "confidence": a float in [0.1, 1.0] representing your...
-
[118]
For retained skills, preserve their existing confidence values (they've been validated)
-
[119]
- Universality (applies across multiple seed conditions = higher confidence)
For new skills, estimate confidence based on: - Success rate among source trajectories (high success = high confidence). - Universality (applies across multiple seed conditions = higher confidence). - Clarity and actionability of the principle
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.