Beyond State Consistency: Behavior Consistency in Text-Based World Models
Pith reviewed 2026-05-10 13:51 UTC · model grok-4.3
The pith
Training world models with behavior consistency rewards improves their long-term alignment with real text environments compared to state matching.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In text-based environments, world models trained to optimize the Behavior Consistency Reward (BehR) achieve improved long-term behavioral alignment with the real environment. BehR measures the difference in a frozen reference agent's likelihood of taking the logged next action when the input state is the model's prediction versus the true state. This training preserves or improves single-step prediction quality in most cases, reduces false positives in offline surrogate evaluations, and yields modest improvements in lookahead planning.
What carries the argument
The Behavior Consistency Reward (BehR), a tractable step-level metric that quantifies the change in next-action likelihood under a frozen Reference Agent between the real state and the world-model-predicted state.
If this is right
- World models support more reliable multi-step rollouts for agent planning and evaluation.
- Offline surrogate evaluations of agents become more accurate with fewer false positives.
- Single-step state prediction accuracy is maintained or enhanced in three of four tested settings.
- Modest improvements appear in inference-time lookahead planning using the trained models.
Where Pith is reading between the lines
- Extending BehR to use multiple reference agents or learnable ones could further strengthen consistency guarantees.
- The method might apply to other domains like robotics or game environments where functional behavior matters more than exact state replication.
- Longer-term, this could shift world model training away from pure reconstruction losses toward behavior-aware objectives.
Load-bearing premise
That making the world model's predicted states produce similar action likelihoods for a fixed reference agent at each step will keep multi-step trajectories behaviorally consistent with the real environment.
What would settle it
If long-horizon rollouts in the world model lead to action distributions or success rates that differ substantially from those observed in the real environment, even when BehR is high during training.
Figures
read the original abstract
World models have been emerging as critical components for assessing the consequences of actions generated by interactive agents in online planning and offline evaluation. In text-based environments, world models are typically evaluated and trained with single-step metrics such as Exact Match, aiming to improve the similarity between predicted and real-world states, but such metrics have been shown to be insufficient for capturing actual agent behavior. To address this issue, we introduce a new behavior-aligned training paradigm aimed at improving the functional consistency between the world model and the real environment. This paradigm focuses on optimizing a tractable step-level metric named Behavior Consistency Reward (BehR), which measures how much the likelihood of a logged next action changes between the real state and the world-model-predicted state under a frozen Reference Agent. Experiments on WebShop and TextWorld show that BehR-based training improves long-term alignment in several settings, with the clearest gains in WebShop and less movement in near-ceiling regimes, while preserving or improving single-step prediction quality in three of four settings. World models trained with BehR also achieve lower false positives in offline surrogate evaluation and show modest but encouraging gains in inference-time lookahead planning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Behavior Consistency Reward (BehR), a step-level training objective for text-based world models that measures the change in log-likelihood of a frozen reference agent's next action when the real state is replaced by the model's prediction. It claims that BehR-based training yields world models with improved long-term behavioral alignment (most clearly in WebShop), lower false positives in offline surrogate evaluation, and preserved or better single-step prediction quality in three of four settings, while also showing modest gains in inference-time lookahead planning.
Significance. If the central claim holds, the work offers a tractable alternative to state-similarity metrics that directly targets functional consistency with agent behavior. This could meaningfully improve the reliability of world models for planning and offline evaluation in interactive text environments, where exact-match metrics have known limitations.
major comments (3)
- [Experiments and §4 (BehR definition)] The central claim requires that step-level BehR optimization produces multi-step rollout consistency, yet the training objective only penalizes per-step divergence under the reference agent. No theoretical argument or empirical ablation (e.g., horizon-dependent degradation curves or direct comparison of multi-step trajectory distributions) is provided to show that local improvements prevent compounding errors over long horizons. This is load-bearing for the long-term alignment results reported in the abstract and experiments.
- [Method section on BehR and reference agent] The reference agent is described as frozen but its training data, objective, and independence from the world-model training distribution are not specified. If the reference agent was trained on overlapping trajectories, the BehR signal risks partial circularity rather than providing an external behavioral anchor. This directly affects the interpretation of the consistency gains.
- [Results tables and experimental setup] Quantitative claims of improvement (long-term alignment, false-positive reduction) are presented without error bars, number of runs, statistical tests, or ablations on reference-agent choice. In near-ceiling regimes this makes it impossible to distinguish signal from noise or to assess robustness of the WebShop gains.
minor comments (3)
- [§3] Notation for BehR (change in log-likelihood) should be formalized with an explicit equation, including the precise conditioning on state vs. predicted state.
- [Experiments] The distinction between 'near-ceiling' and 'WebShop' regimes is described post-hoc; pre-specifying the regimes and metrics for each would improve clarity.
- [Related work] Missing citations to prior work on behavior-aware evaluation of world models or surrogate metrics in RL.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We have carefully reviewed each major comment and provide point-by-point responses below. Where the comments identify opportunities to strengthen the manuscript, we have incorporated revisions or additional experiments in the updated version.
read point-by-point responses
-
Referee: [Experiments and §4 (BehR definition)] The central claim requires that step-level BehR optimization produces multi-step rollout consistency, yet the training objective only penalizes per-step divergence under the reference agent. No theoretical argument or empirical ablation (e.g., horizon-dependent degradation curves or direct comparison of multi-step trajectory distributions) is provided to show that local improvements prevent compounding errors over long horizons. This is load-bearing for the long-term alignment results reported in the abstract and experiments.
Authors: We acknowledge that the manuscript lacks an explicit theoretical argument connecting per-step BehR optimization to multi-step consistency, as well as targeted ablations such as horizon-dependent degradation curves or direct multi-step trajectory distribution comparisons. The empirical results in Section 5 do demonstrate improved long-term behavioral alignment through multi-step rollout evaluations in WebShop and TextWorld. To strengthen the evidence, the revised manuscript will include new ablations: performance curves as a function of rollout horizon and comparisons of full trajectory distributions between BehR-trained models and baselines. These additions will provide clearer empirical support that step-level improvements mitigate compounding errors. revision: yes
-
Referee: [Method section on BehR and reference agent] The reference agent is described as frozen but its training data, objective, and independence from the world-model training distribution are not specified. If the reference agent was trained on overlapping trajectories, the BehR signal risks partial circularity rather than providing an external behavioral anchor. This directly affects the interpretation of the consistency gains.
Authors: We will expand the method section in the revised manuscript to fully specify the reference agent. It is a frozen policy trained via behavior cloning on a held-out set of expert trajectories that are completely disjoint from the data used to train the world models. Its objective is standard next-action prediction, and it shares no parameters or training overlap with the world model. This clarification will confirm that BehR provides an external behavioral anchor and eliminate any concern of circularity. revision: yes
-
Referee: [Results tables and experimental setup] Quantitative claims of improvement (long-term alignment, false-positive reduction) are presented without error bars, number of runs, statistical tests, or ablations on reference-agent choice. In near-ceiling regimes this makes it impossible to distinguish signal from noise or to assess robustness of the WebShop gains.
Authors: We agree that the results presentation would be improved by including measures of variability and statistical support. The revised manuscript will report all quantitative results with error bars (standard deviation across 5 independent random seeds), explicitly state the number of runs, include statistical tests (e.g., paired t-tests) for key comparisons, and add an ablation varying the reference agent (different training seeds and architectures). These changes will allow better assessment of signal versus noise and robustness, particularly for the WebShop results. revision: yes
Circularity Check
No significant circularity; empirical training pipeline is self-contained
full rationale
The paper introduces BehR as an auxiliary training objective computed from a frozen reference agent, optimizes the world model to maximize this per-step quantity, and then reports separate empirical measurements of long-horizon alignment on WebShop and TextWorld. No equation or claim reduces the reported long-term gains to the definition of BehR by construction; the multi-step results are presented as measured outcomes rather than algebraic consequences of the local objective. The reference agent is treated as an external benchmark, and no self-citation chain or ansatz smuggling is required to support the central training paradigm. This is a standard supervised fine-tuning setup whose validity rests on the experimental results, not on internal definitional equivalence.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A frozen reference agent provides a stable and representative measure of action likelihood under real versus predicted states.
Reference graph
Works this paper leans on
-
[1]
Textworld: A learning environment for text- based games. InWorkshop on Computer Games, pages 41–75. Springer, Springer. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.2178...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Qwen3 technical report.arXiv preprint arXiv:2505.09388. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jian- hong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 22 oth- ers. 2024. Qwen2.5 technical report.arXiv preprint ar...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854. A Environment and Dataset Details We evaluate on two representative text-based in- teractive environments. Task-level evaluation uses 200 held-out initial tasks per domain, drawn from the AGENTEVALbenchmark suite of Xi et al. (2024), which provides standa...
work page Pith review arXiv 2024
-
[4]
Reference-Agent prompt construction: Build an agent-perspective prompt from the dialogue history ht and candidate state ˆst+1, ending with the logged expert action a∗ t+1 (see §D.3)
-
[5]
All Qwen3 agents span both roles where applicable
Log-probability computation: Query the frozen Reference Agent (Qwen3-8B) via the 12 Model Role(s) Source World-Model Backbones (open-weight) Qwen2.5-7B Primary WM (SFT + BehR)Qwen/Qwen2.5-7B LLaMA3.1-8B Cross-architecture WM (SFT + BehR)meta-llama/Llama-3.1-8B Evaluation Agents (open-weight, Qwen3 series) Qwen3-0.6B Surrogate eval agentQwen/Qwen3-0.6B Qwe...
work page 2024
-
[6]
Thought: <your reasoning> Action: <the single action to take>
Reward mapping: Compute the Reference- Agent likelihood difference ∆ = ¯ℓpred − ¯ℓreal and apply the exponential form: Rbeh = exp(−coef× |∆|),coef= 1.0 Real-state log-probability caching.A key ef- ficiency optimization: under GRPO with n=5 rollouts per prompt, all 5 candidates share the same real state st+1 and thus the same ¯ℓreal. We cache ¯ℓreal per pr...
work page 2023
-
[7]
Candidate proposal.The planner LLM re- ceives the current observation and the full list of admissible actions, and outputs the top- K actions ranked by estimated promise (Stage 1 prompt below). This costs one LLM call
-
[8]
WM rollout.For each of the K candidate ac- tions, the world model generates a predicted next state ˆs(k) t+1, k= 1, . . . , K . This costs K WM calls (batched)
-
[9]
Best-action selection.The planner LLM re- ceives all K (action, predicted state) pairs and selects the action whose predicted outcome best advances the task goal (Stage 2 prompt below). This costs one LLM call
-
[10]
Total inference cost per step: K+2 LLM calls (K WM + 2 planner)
Execution.The selected action is sent to the real environment (or WM); the returned observation becomes the context for stept+1. Total inference cost per step: K+2 LLM calls (K WM + 2 planner). With K=5, this is 7× the cost of standard ReAct, which motivates keeping K small. Stage 1: Candidate proposal. Lookahead Candidate Proposal You are currently in th...
-
[11]
You open the antique trunk, revealing an old key
action_here . . . Only output the numbered list, nothing else. Stage 2: Best-action selection. Lookahead Best-Action Selection [System] You are a decision-making assistant. You will be given a current state and multiple action options with their predicted outcomes. Select the BEST option. [User]CURRENT STATE: {observation} AVAILABLE OPTIONS (with predicte...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.