arXiv preprint arXiv:2601.08955
3 Pith papers cite this work.
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
citing papers explorer
- Policy and World Modeling Co-Training for Language Agents
  PaW co-trains policy and world modeling on standard RL rollouts using action-entropy data selection, noise-tolerant loss, and reward-adaptive balancing, yielding consistent gains on three agent benchmarks.
- COMAP: Co-Evolving World Models and Agent Policies for LLM Agents
  COMAP co-evolves textual world models and agent policies for LLMs through on-policy self-distillation, yielding up to 16.75% relative gains on embodied planning, web navigation, and tool-use tasks.
- Self-Evolving World Models for LLM Agent Planning
  WorldEvolver uses episodic memory, semantic memory, and selective foresight to self-evolve world models at test time, achieving top prediction accuracy and agent success on ALFWorld and ScienceWorld benchmarks.