One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration

Archiki Prasad; Elias Stengel-Eskin; Jaemin Cho; Mohit Bansal; Zaid Khan

arxiv: 2510.12088 · v2 · submitted 2025-10-14 · 💻 cs.AI · cs.CL· cs.LG

One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration

Zaid Khan , Archiki Prasad , Elias Stengel-Eskin , Jaemin Cho , Mohit Bansal This is my paper

Pith reviewed 2026-05-18 07:49 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG

keywords symbolic world modelsstochastic environmentsunguided explorationprobabilistic programmingCrafter environmentprogrammatic lawsdynamic computation graphstate prediction

0 comments

The pith

OneLife learns key environment dynamics from minimal unguided interaction by representing stochastic transitions as conditionally-activated programmatic laws.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OneLife as a way to build executable symbolic models of complex stochastic environments when an agent gets only one life of unguided exploration and no human help. It encodes dynamics as programmatic laws, each with a precondition and an effect, that activate only in relevant states inside a probabilistic programming setup. This produces a dynamic computation graph that routes all inference and learning through just the active laws, sidestepping the cost of evaluating every law at every step. The authors test the approach on their Crafter-OO reimplementation, which gives a clean symbolic state and transition function, and introduce two new metrics: state ranking and state fidelity. On these measures OneLife beats a strong baseline in 16 of 23 scenarios and also produces rollouts that support better planning.

Core claim

By representing an environment's transitional dynamics as a collection of conditionally-activated programmatic laws inside a probabilistic programming framework, an agent can infer an executable symbolic world model from sparse, unguided interaction even when the environment is stochastic, hierarchical, and hostile.

What carries the argument

Conditionally-activated programmatic laws, each with a precondition-effect structure, that form a dynamic computation graph routing inference and optimization only through relevant laws.

If this is right

The dynamic computation graph lets the model scale to complex hierarchical states without evaluating every law on every prediction.
Sparse activation supports learning of stochastic dynamics even when most rules are inactive at any given time.
Simulated rollouts from the learned model can identify superior action strategies without further real interaction.
The state-ranking and state-fidelity metrics provide a standardized way to compare future world-model methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same precondition-effect structure could be applied to physical robotics domains where interaction cost is high and stochasticity is common.
Extending the framework to automatically propose new laws when prediction error rises would remove the need for a fixed law library.
Testing whether the learned laws transfer to procedurally varied Crafter levels would reveal how compositional the discovered dynamics are.

Load-bearing premise

An environment's transitional dynamics can be faithfully captured by a sparse collection of programmatic laws whose activation depends on world state.

What would settle it

A test in which OneLife's generated future states receive lower state-ranking scores than a random baseline on a new Crafter-OO scenario with held-out stochastic rules would falsify the central claim.

Figures

Figures reproduced from arXiv: 2510.12088 by Archiki Prasad, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal, Zaid Khan.

**Figure 1.** Figure 1: ONELIFE synthesizes world laws from a single unguided (no environment-specific rewards / goals) episode in a hostile, stochastic environment. ONELIFE models the world as mixture of laws written in code with a precondition-effect structure, each governing an aspect of the world, and infers parameters for the mixture that best explain the observed dynamics of the world. The resulting world model (WM) provid… view at source ↗

**Figure 2.** Figure 2: Illustration of the inference process. The active laws for each observable (defined by [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Two evaluation metric categories described in Sec. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Per-scenario state ranking performance of [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: We show an example of plan execution within ONELIFE’s world model for the “Stone Miner” scenario. The task is to mine stone, and can only be successfully completed if a wooden pickaxe is obtained before attempting to mine stone. We simulate two plans within the world model. The effective plan carries out a multi-step sequence of gathering wood, crafting a wooden pickaxe, and then attempting to mine. The in… view at source ↗

**Figure 6.** Figure 6: The functional cycle for state transition. A declarative state snapshot is reconstructed into [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

read the original abstract

Symbolic world modeling requires inferring and representing an environment's transitional dynamics as an executable program. Prior work has focused on largely deterministic environments with abundant interaction data, simple mechanics, and human guidance. We address a more realistic and challenging setting, learning in a complex, stochastic environment where the agent has only "one life" to explore a hostile environment without human guidance. We introduce OneLife, a framework that models world dynamics through conditionally-activated programmatic laws within a probabilistic programming framework. Each law operates through a precondition-effect structure, activating in relevant world states. This creates a dynamic computation graph that routes inference and optimization only through relevant laws, avoiding scaling challenges when all laws contribute to predictions about a complex, hierarchical state, and enabling the learning of stochastic dynamics even with sparse rule activation. To evaluate our approach under these demanding constraints, we introduce a new evaluation protocol that measures (a) state ranking, the ability to distinguish plausible future states from implausible ones, and (b) state fidelity, the ability to generate future states that closely resemble reality. We develop and evaluate our framework on Crafter-OO, our reimplementation of the Crafter environment that exposes a structured, object-oriented symbolic state and a pure transition function that operates on that state alone. OneLife can successfully learn key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 out of 23 scenarios tested. We also test OneLife's planning ability, with simulated rollouts successfully identifying superior strategies. Our work establishes a foundation for autonomously constructing programmatic world models of unknown, complex environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OneLife gives a workable method for programmatic world models under one-life stochastic constraints but the joint-dependency handling needs closer checks.

read the letter

The paper's core contribution is a probabilistic programming setup that learns precondition-effect laws for environment transitions, with only sparse activation to keep the computation graph manageable. This lets the model handle stochastic dynamics from minimal unguided data in a reimplemented Crafter environment they call Crafter-OO. They report clear gains over a baseline on 16 of 23 test scenarios using two new metrics: how well the model ranks plausible next states and how closely generated states match reality. They also show the learned model supports better planning through simulated rollouts. That combination of one-life limits, stochasticity, and dynamic law activation is not in the prior work they cite, so the framing is fresh. The implementation appears reproducible enough on the surface, with the object-oriented state and pure transition function making the setup concrete. The dynamic graph idea is a practical way to avoid blowing up when many laws could apply to a hierarchical state with objects, inventory, and positions. The main soft spot is whether the independently learned preconditions actually capture correlated effects across multiple objects or status changes at once. With only one life of data, overlapping preconditions could lead to under-estimated variance or biased rollouts if the probabilistic program does not route uncertainty properly across simultaneously active laws. The abstract claims the approach works, but the results section would need to show explicit checks on joint predictions rather than marginal ones. This paper is aimed at researchers building symbolic or neurosymbolic world models for model-based agents. Anyone already working on programmatic representations or planning in partially observable stochastic settings will find the evaluation protocol and Crafter-OO setup useful to build on. It is worth sending to peer review because the empirical comparison is reported on a non-trivial number of scenarios and the one-life constraint is a realistic stress test, even though the joint-dependency question will probably require additional experiments or analysis in revision.

Referee Report

2 major / 3 minor

Summary. The paper introduces OneLife, a framework for inferring symbolic world models of stochastic environments from minimal unguided ('one life') exploration. Dynamics are represented as conditionally-activated programmatic laws inside a probabilistic programming system; sparse precondition-based activation produces a dynamic computation graph that routes inference only through relevant laws. The authors reimplement Crafter as Crafter-OO (object-oriented symbolic state and pure transition function), introduce state-ranking and state-fidelity metrics, and report that OneLife outperforms a strong baseline on 16 of 23 scenarios while also enabling superior planning via simulated rollouts.

Significance. If the central empirical claims hold, the work is significant because it moves symbolic world-model learning into a realistic stochastic regime with severe data limits and no human guidance. The Crafter-OO environment and the two new evaluation protocols are useful contributions that could become community standards. The probabilistic-programming formulation with dynamic routing is a technically attractive way to scale programmatic models to hierarchical object states.

major comments (2)

[§4.3 and Table 2] §4.3 and Table 2: the claim that OneLife 'outperforms a strong baseline on 16 out of 23 scenarios' is load-bearing for the central result, yet the manuscript provides neither per-scenario breakdowns, error bars across random seeds, nor a precise description of the baseline implementation and hyper-parameters. Without these, it is impossible to judge whether the reported margin is robust or sensitive to post-hoc choices in the Crafter-OO reimplementation.
[§3.2 and §3.3] §3.2 (Dynamic Computation Graph) and §3.3 (Learning): the core modeling assumption that a collection of independently inferred precondition-effect laws can faithfully capture jointly correlated stochastic transitions (e.g., tool-use simultaneously affecting inventory counts, agent health, and object positions) is not directly tested. Under the one-life data regime, overlapping preconditions may remain under-constrained, leading to under-estimated predictive variance; the paper should supply a diagnostic that measures calibration of joint marginals on held-out multi-object transitions.

minor comments (3)

[Abstract and §4.1] The abstract and §4.1 refer to '23 scenarios' without enumerating them or stating the selection criteria; a short appendix table would improve reproducibility.
[§3] Notation for the probabilistic program (precondition predicates, effect distributions, and the dynamic graph construction) is introduced piecemeal; a single running example with explicit equations early in §3 would aid readability.
[Figure 3] Figure 3 (rollout examples) would benefit from explicit uncertainty bands or multiple sampled futures to illustrate how the probabilistic laws propagate stochasticity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of our empirical evaluation and modeling assumptions. We address each major comment below and describe the revisions we will make.

read point-by-point responses

Referee: [§4.3 and Table 2] §4.3 and Table 2: the claim that OneLife 'outperforms a strong baseline on 16 out of 23 scenarios' is load-bearing for the central result, yet the manuscript provides neither per-scenario breakdowns, error bars across random seeds, nor a precise description of the baseline implementation and hyper-parameters. Without these, it is impossible to judge whether the reported margin is robust or sensitive to post-hoc choices in the Crafter-OO reimplementation.

Authors: We agree that additional details are required to substantiate the central claim. In the revised manuscript we will expand §4.3 with a per-scenario breakdown (either as an extended Table 2 or a supplementary table), report results with error bars over multiple random seeds for both OneLife and the baseline, and include a precise description of the baseline implementation together with all hyperparameters and Crafter-OO reimplementation choices. These changes will allow readers to evaluate robustness directly. revision: yes
Referee: [§3.2 and §3.3] §3.2 (Dynamic Computation Graph) and §3.3 (Learning): the core modeling assumption that a collection of independently inferred precondition-effect laws can faithfully capture jointly correlated stochastic transitions (e.g., tool-use simultaneously affecting inventory counts, agent health, and object positions) is not directly tested. Under the one-life data regime, overlapping preconditions may remain under-constrained, leading to under-estimated predictive variance; the paper should supply a diagnostic that measures calibration of joint marginals on held-out multi-object transitions.

Authors: Our state-ranking and state-fidelity metrics already evaluate predictive accuracy on complete state transitions and therefore incorporate joint effects across objects and attributes. We nevertheless agree that an explicit calibration diagnostic for joint marginals would strengthen the validation of the modeling assumption. In the revision we will add such a diagnostic, reporting calibration error on held-out multi-object transitions drawn from the one-life trajectories, and present the results in §4. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results are empirical evaluations on a new environment

full rationale

The paper's central claims rest on introducing Crafter-OO and evaluating OneLife via state ranking and state fidelity metrics against a baseline across 23 scenarios, with success defined by outperforming the baseline on 16/23 cases. No equations or derivations are presented that reduce a 'prediction' or 'first-principles result' to fitted parameters or self-citations by construction. The framework description (conditionally-activated programmatic laws in a probabilistic program) is an architectural choice justified by scaling arguments rather than by re-deriving the target dynamics from the same data. Self-citations, if present, are not load-bearing for the reported empirical outcomes. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Review is based solely on the abstract; therefore the ledger is populated only with elements explicitly named in the abstract. The framework introduces conditionally-activated programmatic laws as the core modeling device.

invented entities (1)

conditionally-activated programmatic laws no independent evidence
purpose: Represent environment transitional dynamics as executable precondition-effect rules that activate only in relevant states
Abstract states that each law operates through a precondition-effect structure creating a dynamic computation graph

pith-pipeline@v0.9.0 · 5836 in / 1253 out tokens · 33943 ms · 2026-05-18T07:49:32.146989+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1]

Input:WorldStateobjects t

work page
[2]

Reconstruct live game engine instance

work page
[3]

Execute single update tick with given action

work page
[4]

Export resulting state ass t+1

work page
[5]

grass") 11 12# Place the target resource in a specific location 13world_utils.set_tile_material(world, (6, 5),

Return newWorldStateobject 22 This ensures every transition is a pure function of the explicit state, making the environment suitable for symbolic reasoning and program synthesis. B.5 UTILITIES FORPROGRAMMATICSTATEINTERACTION A key contribution of Crafter-OO is a rich set of utilities that enable programmatic interaction with the world state. These functi...

work page
[6]

"" 15Discriminative function: Computes the log-probability of a specific 16next_state given the current state and action. 17

-> float: 14""" 15Discriminative function: Computes the log-probability of a specific 16next_state given the current state and action. 17""" 27 Table 5: Complete list of evaluation scenarios used to test world models in Crafter-OO. Category Scenario Name Description Movementrandom movementTests basic player movement in the cardinal directions. Collection ...

work page 2013
[7]

The initial states t and resulting states t+1, serialized to a structured format (JSON)

work page
[8]

The actiona t that caused the transition

work page
[9]

A textual ‘diff‘ that highlights the exact changes betweens t ands t+1. 33

work page
[10]

A human-readable 2D ASCII rendering of the local environment around the player for both states, providing spatial context

work page
[11]

player inventory

The name of the aspect (e.g., “player inventory”) that changed, which instructs the LLM to focus its analysis. This structured presentation of the transition allows the LLM to ground its reasoning in the specific, observed changes. The full prompt template is provided in Fig. F.2. Figure F.2|Synthesis Prompt 1## Role 2You are a **World Law Synthesizer** -...

work page
[12]

tree")[1] 9) 10interact_option = PlayerInteractAdjacentOption( 11lambda s: find_closest_material_of_type(s,

-> WorldState: 6trees_chopped = 0 7pathfind_option = PlayerPathfindOption( 8lambda s: find_closest_material_of_type(s, "tree")[1] 9) 10interact_option = PlayerInteractAdjacentOption( 11lambda s: find_closest_material_of_type(s, "tree")[1] 12) 13 14# Gather wood by iterating between pathfinding and interaction 15while trees_chopped < num_trees: 16try: 17ac...

work page
[13]

-> WorldState: 37for zombie_id in zombie_ids: 38combat_option = CombatFixedEntityOption(entity_id=zombie_id) 39 40for _ in range(max_steps_per_zombie): 41try: 42action = combat_option.action(state) 43state = transition_fn(state, action) 44except TerminationCondition: 45break # Zombie defeated 46 47return state 48 49 50def sword_then_zombies_plan( 51state:...

work page
[14]

"" 56High-level plan: Craft weapon before engaging in combat. 57Composes two sub-plans into a complete strategy. 58

-> WorldState: 55""" 56High-level plan: Craft weapon before engaging in combat. 57Composes two sub-plans into a complete strategy. 58""" 59# Sub-plan 1: Obtain weapon 60state = craft_wooden_sword_plan(state, transition_fn, num_trees=3) 61 62# Sub-plan 2: Defeat enemies 63state = defeat_zombies_plan(state, transition_fn, zombie_ids) 64 65return state 39

work page

[1] [1]

Input:WorldStateobjects t

work page

[2] [2]

Reconstruct live game engine instance

work page

[3] [3]

Execute single update tick with given action

work page

[4] [4]

Export resulting state ass t+1

work page

[5] [5]

grass") 11 12# Place the target resource in a specific location 13world_utils.set_tile_material(world, (6, 5),

Return newWorldStateobject 22 This ensures every transition is a pure function of the explicit state, making the environment suitable for symbolic reasoning and program synthesis. B.5 UTILITIES FORPROGRAMMATICSTATEINTERACTION A key contribution of Crafter-OO is a rich set of utilities that enable programmatic interaction with the world state. These functi...

work page

[6] [6]

"" 15Discriminative function: Computes the log-probability of a specific 16next_state given the current state and action. 17

-> float: 14""" 15Discriminative function: Computes the log-probability of a specific 16next_state given the current state and action. 17""" 27 Table 5: Complete list of evaluation scenarios used to test world models in Crafter-OO. Category Scenario Name Description Movementrandom movementTests basic player movement in the cardinal directions. Collection ...

work page 2013

[7] [7]

The initial states t and resulting states t+1, serialized to a structured format (JSON)

work page

[8] [8]

The actiona t that caused the transition

work page

[9] [9]

A textual ‘diff‘ that highlights the exact changes betweens t ands t+1. 33

work page

[10] [10]

A human-readable 2D ASCII rendering of the local environment around the player for both states, providing spatial context

work page

[11] [11]

player inventory

The name of the aspect (e.g., “player inventory”) that changed, which instructs the LLM to focus its analysis. This structured presentation of the transition allows the LLM to ground its reasoning in the specific, observed changes. The full prompt template is provided in Fig. F.2. Figure F.2|Synthesis Prompt 1## Role 2You are a **World Law Synthesizer** -...

work page

[12] [12]

tree")[1] 9) 10interact_option = PlayerInteractAdjacentOption( 11lambda s: find_closest_material_of_type(s,

-> WorldState: 6trees_chopped = 0 7pathfind_option = PlayerPathfindOption( 8lambda s: find_closest_material_of_type(s, "tree")[1] 9) 10interact_option = PlayerInteractAdjacentOption( 11lambda s: find_closest_material_of_type(s, "tree")[1] 12) 13 14# Gather wood by iterating between pathfinding and interaction 15while trees_chopped < num_trees: 16try: 17ac...

work page

[13] [13]

-> WorldState: 37for zombie_id in zombie_ids: 38combat_option = CombatFixedEntityOption(entity_id=zombie_id) 39 40for _ in range(max_steps_per_zombie): 41try: 42action = combat_option.action(state) 43state = transition_fn(state, action) 44except TerminationCondition: 45break # Zombie defeated 46 47return state 48 49 50def sword_then_zombies_plan( 51state:...

work page

[14] [14]

"" 56High-level plan: Craft weapon before engaging in combat. 57Composes two sub-plans into a complete strategy. 58

-> WorldState: 55""" 56High-level plan: Craft weapon before engaging in combat. 57Composes two sub-plans into a complete strategy. 58""" 59# Sub-plan 1: Obtain weapon 60state = craft_wooden_sword_plan(state, transition_fn, num_trees=3) 61 62# Sub-plan 2: Defeat enemies 63state = defeat_zombies_plan(state, transition_fn, zombie_ids) 64 65return state 39

work page