One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration
Pith reviewed 2026-05-18 07:49 UTC · model grok-4.3
The pith
OneLife learns key environment dynamics from minimal unguided interaction by representing stochastic transitions as conditionally-activated programmatic laws.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By representing an environment's transitional dynamics as a collection of conditionally-activated programmatic laws inside a probabilistic programming framework, an agent can infer an executable symbolic world model from sparse, unguided interaction even when the environment is stochastic, hierarchical, and hostile.
What carries the argument
Conditionally-activated programmatic laws, each with a precondition-effect structure, that form a dynamic computation graph routing inference and optimization only through relevant laws.
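A minimal sketch of the precondition-effect idea may help make this concrete. It assumes a flat dictionary state and a weighted mixture for combining laws; both are illustrative assumptions, not OneLife's actual implementation. The names `Law`, `active_laws`, and `predict`, the two toy laws, and their probabilities are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]   # hypothetical flat state; the paper's state is hierarchical
Dist = Dict[int, float]  # distribution over the next value of one attribute


@dataclass
class Law:
    name: str
    precondition: Callable[[State, str], bool]
    effect: Callable[[State, str], Dict[str, Dist]]
    weight: float = 1.0  # assumed learnable weight


def active_laws(laws: List[Law], state: State, action: str) -> List[Law]:
    """Keep only the laws whose precondition holds in this state."""
    return [law for law in laws if law.precondition(state, action)]


def predict(laws: List[Law], state: State, action: str) -> Dict[str, Dist]:
    """Weighted mixture, per attribute, over the active laws only."""
    mixed: Dict[str, Dist] = {}
    totals: Dict[str, float] = {}
    for law in active_laws(laws, state, action):
        for attr, dist in law.effect(state, action).items():
            totals[attr] = totals.get(attr, 0.0) + law.weight
            slot = mixed.setdefault(attr, {})
            for value, p in dist.items():
                slot[value] = slot.get(value, 0.0) + law.weight * p
    return {a: {v: p / totals[a] for v, p in d.items()} for a, d in mixed.items()}


# Two toy laws with made-up probabilities.
chop_tree = Law(
    name="chop_tree",
    precondition=lambda s, a: a == "collect" and s["facing_tree"] == 1,
    effect=lambda s, a: {"wood": {s["wood"] + 1: 0.9, s["wood"]: 0.1}},
)
zombie_attack = Law(
    name="zombie_attack",
    precondition=lambda s, a: s["zombie_adjacent"] == 1,
    effect=lambda s, a: {"health": {s["health"] - 2: 0.5, s["health"]: 0.5}},
)

state = {"wood": 0, "health": 9, "facing_tree": 1, "zombie_adjacent": 0}
prediction = predict([chop_tree, zombie_attack], state, "collect")
```

In this toy state only `chop_tree` is active, so the prediction touches `wood` and leaves `health` out of the computation entirely, which is the routing behavior the summary describes.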
If this is right
- The dynamic computation graph lets the model scale to complex hierarchical states without evaluating every law on every prediction.
- Sparse activation supports learning of stochastic dynamics even when most rules are inactive at any given time.
- Simulated rollouts from the learned model can identify superior action strategies without further real interaction.
- The state-ranking and state-fidelity metrics provide a standardized way to compare future world-model methods.
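The two metrics named in the last bullet can be illustrated with simple stand-ins. The sketch below assumes that state ranking is scored by where the true next state falls among distractor states under a model's log-probability, and that state fidelity is the fraction of attributes a generated state gets right. These are assumptions for illustration; the paper's exact definitions are not given here, and `rank_of_true_state`, `fidelity`, and `toy_log_prob` are hypothetical names.

```python
from typing import Callable, Dict, List

State = Dict[str, int]


def rank_of_true_state(
    log_prob: Callable[[State], float],
    true_next: State,
    distractors: List[State],
) -> int:
    """1-based rank of the true next state (1 is best).

    Only distractors scoring strictly higher push the true state down,
    so ties are resolved in the true state's favor.
    """
    true_score = log_prob(true_next)
    return 1 + sum(1 for d in distractors if log_prob(d) > true_score)


def fidelity(predicted: State, actual: State) -> float:
    """Fraction of attributes on which the two states agree."""
    keys = set(predicted) | set(actual)
    if not keys:
        return 1.0
    return sum(predicted.get(k) == actual.get(k) for k in keys) / len(keys)


# Toy model: prefers next states in which wood equals 1.
def toy_log_prob(s: State) -> float:
    return -abs(s["wood"] - 1)


true_next = {"wood": 1, "health": 9}
distractors = [{"wood": 5, "health": 9}, {"wood": 0, "health": 9}]
rank = rank_of_true_state(toy_log_prob, true_next, distractors)
score = fidelity({"wood": 1, "health": 9}, {"wood": 1, "health": 7})
```

A ranking-style score only needs a discriminative log-probability, while a fidelity-style score needs the model to generate a full next state, which is why the two are reported separately.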
Where Pith is reading between the lines
- The same precondition-effect structure could be applied to physical robotics domains where interaction cost is high and stochasticity is common.
- Extending the framework to automatically propose new laws when prediction error rises would remove the need for a fixed law library.
- Testing whether the learned laws transfer to procedurally varied Crafter levels would reveal how compositional the discovered dynamics are.
Load-bearing premise
An environment's transitional dynamics can be faithfully captured by a sparse collection of programmatic laws whose activation depends on world state.
What would settle it
A test in which OneLife's generated future states receive lower state-ranking scores than a random baseline on a new Crafter-OO scenario with held-out stochastic rules would falsify the central claim.
Original abstract
Symbolic world modeling requires inferring and representing an environment's transitional dynamics as an executable program. Prior work has focused on largely deterministic environments with abundant interaction data, simple mechanics, and human guidance. We address a more realistic and challenging setting, learning in a complex, stochastic environment where the agent has only "one life" to explore a hostile environment without human guidance. We introduce OneLife, a framework that models world dynamics through conditionally-activated programmatic laws within a probabilistic programming framework. Each law operates through a precondition-effect structure, activating in relevant world states. This creates a dynamic computation graph that routes inference and optimization only through relevant laws, avoiding scaling challenges when all laws contribute to predictions about a complex, hierarchical state, and enabling the learning of stochastic dynamics even with sparse rule activation. To evaluate our approach under these demanding constraints, we introduce a new evaluation protocol that measures (a) state ranking, the ability to distinguish plausible future states from implausible ones, and (b) state fidelity, the ability to generate future states that closely resemble reality. We develop and evaluate our framework on Crafter-OO, our reimplementation of the Crafter environment that exposes a structured, object-oriented symbolic state and a pure transition function that operates on that state alone. OneLife can successfully learn key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 out of 23 scenarios tested. We also test OneLife's planning ability, with simulated rollouts successfully identifying superior strategies. Our work establishes a foundation for autonomously constructing programmatic world models of unknown, complex environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OneLife, a framework for inferring symbolic world models of stochastic environments from minimal unguided ('one life') exploration. Dynamics are represented as conditionally-activated programmatic laws inside a probabilistic programming system; sparse precondition-based activation produces a dynamic computation graph that routes inference only through relevant laws. The authors reimplement Crafter as Crafter-OO (object-oriented symbolic state and pure transition function), introduce state-ranking and state-fidelity metrics, and report that OneLife outperforms a strong baseline on 16 of 23 scenarios while also enabling superior planning via simulated rollouts.
Significance. If the central empirical claims hold, the work is significant because it moves symbolic world-model learning into a realistic stochastic regime with severe data limits and no human guidance. The Crafter-OO environment and the two new evaluation protocols are useful contributions that could become community standards. The probabilistic-programming formulation with dynamic routing is a technically attractive way to scale programmatic models to hierarchical object states.
Major comments (2)
- [§4.3 and Table 2] The claim that OneLife 'outperforms a strong baseline on 16 out of 23 scenarios' is load-bearing for the central result, yet the manuscript provides no per-scenario breakdowns, no error bars across random seeds, and no precise description of the baseline implementation and hyperparameters. Without these, it is impossible to judge whether the reported margin is robust or sensitive to post-hoc choices in the Crafter-OO reimplementation.
- [§3.2 and §3.3] §3.2 (Dynamic Computation Graph) and §3.3 (Learning): the core modeling assumption that a collection of independently inferred precondition-effect laws can faithfully capture jointly correlated stochastic transitions (e.g., tool-use simultaneously affecting inventory counts, agent health, and object positions) is not directly tested. Under the one-life data regime, overlapping preconditions may remain under-constrained, leading to under-estimated predictive variance; the paper should supply a diagnostic that measures calibration of joint marginals on held-out multi-object transitions.
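The reporting requested in the first major comment could look roughly like the sketch below, which assumes one score per random seed for each method and scenario. The scenario names, the numbers, and the helpers `summarize` and `count_wins` are hypothetical placeholders, not results or code from the paper.

```python
import math
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean across seeds."""
    m = mean(scores)
    se = stdev(scores) / math.sqrt(len(scores)) if len(scores) > 1 else 0.0
    return m, se


def count_wins(results: Dict[str, Dict[str, List[float]]]) -> int:
    """Number of scenarios where OneLife's mean score beats the baseline's."""
    return sum(
        1
        for by_method in results.values()
        if mean(by_method["onelife"]) > mean(by_method["baseline"])
    )


# Placeholder numbers only, to show the shape of the table.
results = {
    "scenario_a": {"onelife": [0.8, 0.7, 0.9], "baseline": [0.6, 0.5, 0.7]},
    "scenario_b": {"onelife": [0.4, 0.5, 0.3], "baseline": [0.6, 0.6, 0.6]},
}
for name, by_method in results.items():
    for method, scores in by_method.items():
        m, se = summarize(scores)
        print(f"{name:12s} {method:9s} {m:.3f} +/- {se:.3f}")
wins = count_wins(results)
```

Reporting the per-scenario mean with its standard error next to the win count lets a reader see whether a "win" is larger than the seed-to-seed noise.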
Minor comments (3)
- [Abstract and §4.1] The abstract and §4.1 refer to '23 scenarios' without enumerating them or stating the selection criteria; a short appendix table would improve reproducibility.
- [§3] Notation for the probabilistic program (precondition predicates, effect distributions, and the dynamic graph construction) is introduced piecemeal; a single running example with explicit equations early in §3 would aid readability.
- [Figure 3] Figure 3 (rollout examples) would benefit from explicit uncertainty bands or multiple sampled futures to illustrate how the probabilistic laws propagate stochasticity.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of our empirical evaluation and modeling assumptions. We address each major comment below and describe the revisions we will make.
Point-by-point responses
- Referee: [§4.3 and Table 2] The claim that OneLife 'outperforms a strong baseline on 16 out of 23 scenarios' is load-bearing for the central result, yet the manuscript provides no per-scenario breakdowns, no error bars across random seeds, and no precise description of the baseline implementation and hyperparameters. Without these, it is impossible to judge whether the reported margin is robust or sensitive to post-hoc choices in the Crafter-OO reimplementation.
  Authors: We agree that additional details are required to substantiate the central claim. In the revised manuscript we will expand §4.3 with a per-scenario breakdown (either as an extended Table 2 or a supplementary table), report results with error bars over multiple random seeds for both OneLife and the baseline, and include a precise description of the baseline implementation together with all hyperparameters and Crafter-OO reimplementation choices. These changes will allow readers to evaluate robustness directly.
  Revision: yes
- Referee: [§3.2 and §3.3] §3.2 (Dynamic Computation Graph) and §3.3 (Learning): the core modeling assumption that a collection of independently inferred precondition-effect laws can faithfully capture jointly correlated stochastic transitions (e.g., tool-use simultaneously affecting inventory counts, agent health, and object positions) is not directly tested. Under the one-life data regime, overlapping preconditions may remain under-constrained, leading to under-estimated predictive variance; the paper should supply a diagnostic that measures calibration of joint marginals on held-out multi-object transitions.
  Authors: Our state-ranking and state-fidelity metrics already evaluate predictive accuracy on complete state transitions and therefore incorporate joint effects across objects and attributes. We nevertheless agree that an explicit calibration diagnostic for joint marginals would strengthen the validation of the modeling assumption. In the revision we will add such a diagnostic, reporting calibration error on held-out multi-object transitions drawn from the one-life trajectories, and present the results in §4.
  Revision: yes
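One common way to measure the calibration error discussed in the second exchange is expected calibration error (ECE) over binned predictions. The sketch below assumes each held-out transition yields a pair: the probability the model assigned to a joint event (for example, "wood increases and health is unchanged") and whether that event occurred. ECE is a standard diagnostic swapped in here for illustration, not necessarily the one the authors will use, and the function name and binning choice are assumptions.

```python
from typing import List, Tuple


def expected_calibration_error(
    predictions: List[Tuple[float, bool]], n_bins: int = 10
) -> float:
    """ECE: bin-size-weighted gap between mean confidence and hit rate."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for p, hit in predictions:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, hit))

    total = len(predictions)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        confidence = sum(p for p, _ in b) / len(b)
        accuracy = sum(1 for _, hit in b if hit) / len(b)
        ece += len(b) / total * abs(confidence - accuracy)
    return ece


# A model that says 0.8 but is right only half the time is overconfident.
overconfident = [(0.8, True)] * 5 + [(0.8, False)] * 5
gap = expected_calibration_error(overconfident)
```

A large gap on joint events alongside small gaps on single-attribute events would point to the independence concern the referee raises, since each law could be well calibrated on its own attribute while the combination is not.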
Circularity Check
No significant circularity; results are empirical evaluations on a new environment
Full rationale
The paper's central claims rest on introducing Crafter-OO and evaluating OneLife via state ranking and state fidelity metrics against a baseline across 23 scenarios, with success defined by outperforming the baseline on 16/23 cases. No equations or derivations are presented that reduce a 'prediction' or 'first-principles result' to fitted parameters or self-citations by construction. The framework description (conditionally-activated programmatic laws in a probabilistic program) is an architectural choice justified by scaling arguments rather than by re-deriving the target dynamics from the same data. Self-citations, if present, are not load-bearing for the reported empirical outcomes. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Invented entities (1)
- conditionally-activated programmatic laws: no independent evidence