Recognition: unknown
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
Pith reviewed 2026-05-10 00:09 UTC · model grok-4.3
The pith
A co-evolving skill bank and decision agent framework enables LLMs to better handle long-horizon tasks in games.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
COSPLAY is a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent-managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to form and update the skill bank. This setup improves the decision agent's skill retrieval and action generation while the skill bank continually extracts, refines, and updates skills with their contracts, leading to better performance in long-horizon game environments.
What carries the argument
The co-evolution between the LLM decision agent and the skill bank agent, where skills are retrieved for decision making and extracted from rollouts for bank updates.
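Read as machinery, the loop alternates two updates: the decision agent acts while conditioning on skills retrieved from the bank, and the skill pipeline mines candidate skills from the resulting rollouts to grow and revise that bank. A minimal sketch of the interaction follows; every interface name (`retrieve`, `act`, `extract`, `refine`, `update`) is an editorial assumption for illustration, not the paper's API.

```python
# Illustrative sketch of a COSPLAY-style co-evolution loop, not the paper's implementation.
# `env`, `decision_agent`, `skill_bank`, and `skill_pipeline` are assumed objects.

def co_evolve(env, decision_agent, skill_bank, skill_pipeline,
              n_iterations=10, episodes_per_iteration=8):
    for _ in range(n_iterations):
        rollouts = []
        # 1) The decision agent acts, conditioning each step on skills retrieved from the bank.
        for _ in range(episodes_per_iteration):
            obs, done, trajectory = env.reset(), False, []
            while not done:
                skills = skill_bank.retrieve(obs, top_k=3)   # skill retrieval
                action = decision_agent.act(obs, skills)     # skill-conditioned action
                obs, reward, done, _ = env.step(action)
                trajectory.append((skills, action, reward))
            rollouts.append(trajectory)

        # 2) The skill pipeline mines reusable skills (with their contracts) from the
        #    unlabeled rollouts, and the bank is refined and updated with them.
        candidates = skill_pipeline.extract(rollouts)
        skill_bank.update(skill_pipeline.refine(candidates, skill_bank))

        # 3) The decision agent is improved on its own skill-conditioned rollouts;
        #    delayed episode rewards are the only supervision signal assumed here.
        decision_agent.update(rollouts)
    return decision_agent, skill_bank
```

The load-bearing coupling is that step 2 consumes exactly the rollouts produced under step 1's retrieval, so a better bank should yield richer rollouts and, in turn, better skills.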
If this is right
- The decision agent learns better skill retrieval and action generation through interaction with the skill bank.
- The skill bank agent extracts, refines, and updates skills, together with their usage contracts, from unlabeled rollouts, enabling reuse across episodes.
- Experiments across six game environments demonstrate over 25.1 percent average reward improvement with an 8B base model against frontier LLM baselines on single-player benchmarks.
- Competitive performance is maintained on multi-player social reasoning games.
Where Pith is reading between the lines
- This mutual bootstrapping could reduce the need for extensive human-labeled data or supervision in agent training.
- The approach might extend to other long-horizon domains such as robotics or planning tasks if the skill extraction generalizes.
- Smaller models enhanced this way could become more efficient alternatives to scaling up model size for interactive tasks.
Load-bearing premise
That the skill pipeline can reliably extract and refine genuinely reusable skills, with valid usage contracts, from unlabeled rollouts without supervision, and that this produces transferable improvements rather than environment-specific overfitting.
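What "skills with their contracts" look like concretely is not specified in the material above. One plausible reading, offered purely as an illustration (the field names are editorial assumptions, not the paper's schema), is a record that pairs a distilled procedure with explicit applicability and success conditions that both retrieval and refinement can check:

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """Hypothetical skill record; field names are illustrative, not the paper's schema."""
    name: str                     # e.g. "secure_the_second_quest" (invented example)
    description: str              # natural-language procedure distilled from rollouts
    preconditions: list[str]      # contract: observations under which the skill applies
    postconditions: list[str]     # contract: what should hold if the skill succeeds
    source_episodes: list[int] = field(default_factory=list)  # provenance in the rollouts
    success_rate: float = 0.0     # running estimate maintained during refinement
```

Under this reading, the premise above amounts to claiming that such records, mined without labels, remain valid when their preconditions are checked in episodes other than the ones they were extracted from.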
What would settle it
The claim would be undermined if the skills extracted by the pipeline fail to improve the decision agent's performance when used in new episodes or environments, and reinforced if removing the co-evolution loop eliminates the observed reward gains.
read the original abstract
Long-horizon interactive environments are a testbed for evaluating agents' skill-usage abilities. These environments demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision making under delayed rewards and partial observability; games are a good testbed for evaluating agent skill usage in such environments. Large Language Models (LLMs) offer a promising alternative as game-playing agents, but they often struggle with consistent long-horizon decision making because they lack a mechanism to discover, retain, and reuse structured skills across episodes. We present COSPLAY, a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent-managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to form the skill bank. Our framework improves both sides: the decision agent learns better skill retrieval and action generation, while the skill bank agent continually extracts, refines, and updates skills together with their contracts. Experiments across six game environments show that COSPLAY with an 8B base model achieves over 25.1 percent average reward improvement against four frontier LLM baselines on single-player game benchmarks while remaining competitive on multi-player social-reasoning games.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces COSPLAY, a co-evolution framework for LLM agents in long-horizon interactive environments such as games. A decision agent retrieves skills from a learnable skill bank to guide multi-step actions under partial observability and delayed rewards, while a separate skill pipeline agent extracts and refines reusable skills, together with their usage contracts, from the decision agent's unlabeled rollouts to populate and update the bank. The paper reports that, across six game environments, COSPLAY instantiated with an 8B base model yields over 25.1% average reward improvement versus four frontier LLM baselines on single-player benchmarks while remaining competitive on multi-player social-reasoning games.
Significance. If the empirical gains prove robust, the framework would represent a meaningful advance for LLM agents by enabling unsupervised, iterative skill discovery and reuse without human supervision or hand-crafted skill libraries. The co-evolution loop between decision and skill agents directly targets the long-horizon consistency problem that current prompting and retrieval methods struggle with. The fact that an 8B model reportedly outperforms larger frontier baselines on single-player tasks would be noteworthy if supported by proper controls and ablations.
major comments (3)
- [Abstract, §4 Experiments] The headline claim of a 25.1% average reward improvement is presented without any description of the six environments, the four frontier baselines, the number of evaluation episodes, variance across runs, or statistical tests. This information is load-bearing for the central empirical claim, and its absence prevents verification that the gains arise from the skill bank rather than from prompting artifacts or environment-specific overfitting. (A sketch of one statistically grounded way to report such gains follows this list.)
- [§3 Method, Skill Pipeline] The unsupervised extraction and refinement of skills, and the formation of their usage contracts, from unlabeled rollouts are described at a high level with no concrete criteria, similarity metric, or validation step for determining reusability or transferability. Without such mechanisms, or accompanying ablations that isolate the skill bank's contribution from the base 8B model's retrieval, it is impossible to rule out that the reported gains reflect environment-specific correlations rather than genuinely reusable skills.
- [§4 Experiments] No cross-environment transfer tests or ablation studies (e.g., skill bank disabled, random skills, or a fixed bank) are reported. Such controls are necessary to substantiate that the co-evolution produces transferable improvements rather than per-environment overfitting, which bears directly on the weakest assumption identified in the manuscript.
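The reporting asked for in the first major comment need not be heavy. As a purely illustrative sketch (the paired-episode assumption, reward arrays, and function name are the editor's, not the paper's), a paired bootstrap over per-episode rewards yields the mean relative improvement together with a 95% confidence interval:

```python
import numpy as np

def paired_bootstrap_improvement(cosplay_rewards, baseline_rewards, n_boot=10_000, seed=0):
    """Relative mean-reward improvement over a baseline, with a 95% bootstrap CI.

    Assumes the two arrays hold rewards from paired evaluation episodes; this is an
    editorial sketch of the requested reporting, not the paper's protocol.
    """
    rng = np.random.default_rng(seed)
    cosplay = np.asarray(cosplay_rewards, dtype=float)
    baseline = np.asarray(baseline_rewards, dtype=float)
    assert cosplay.shape == baseline.shape, "episodes must be paired"

    n = len(cosplay)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample episodes with replacement
        draws.append((cosplay[idx].mean() - baseline[idx].mean()) / abs(baseline[idx].mean()))
    low, high = np.percentile(draws, [2.5, 97.5])
    point = (cosplay.mean() - baseline.mean()) / abs(baseline.mean())
    return point, (low, high)
```

Reporting the point estimate alongside the interval, per environment and per baseline, would let readers judge whether the headline 25.1% figure survives run-to-run variance.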
minor comments (2)
- [Abstract] The abstract and introduction use both 'co evolution' and 'co-evolution'; standardize on the hyphenated form throughout.
- [Figures and Tables] Figure captions and table headers should explicitly state the base model size (8B) and the exact reward metric used for the 25.1% figure to improve readability.
Simulated Author's Rebuttal
We thank the referee for their detailed feedback on our manuscript. We believe the suggested clarifications and additional analyses will strengthen the presentation of our co-evolution framework. Below we respond to each major comment and indicate the revisions we will make.
read point-by-point responses
-
Referee: [Abstract, §4 Experiments] The headline claim of a 25.1% average reward improvement is presented without any description of the six environments, the four frontier baselines, the number of evaluation episodes, variance across runs, or statistical tests. This information is load-bearing for the central empirical claim, and its absence prevents verification that the gains arise from the skill bank rather than from prompting artifacts or environment-specific overfitting.
Authors: We agree that the abstract and experimental summary should be more self-contained to allow readers to assess the claims immediately. In the revised manuscript, we will expand the abstract to briefly name the six game environments and the four frontier LLM baselines. In §4, we will add explicit details on the number of evaluation episodes per environment, report standard deviations or variances across multiple runs, and include results from statistical significance tests comparing COSPLAY to baselines. These additions will help confirm that the reported gains are attributable to the co-evolution mechanism rather than other factors. revision: yes
-
Referee: [§3 Method, Skill Pipeline] The unsupervised extraction and refinement of skills, and the formation of their usage contracts, from unlabeled rollouts are described at a high level with no concrete criteria, similarity metric, or validation step for determining reusability or transferability. Without such mechanisms, or accompanying ablations that isolate the skill bank's contribution from the base 8B model's retrieval, it is impossible to rule out that the reported gains reflect environment-specific correlations rather than genuinely reusable skills.
Authors: The description in §3 was intentionally high-level to focus on the overall co-evolution loop, but we recognize the need for concreteness. We will revise §3 to provide the concrete criteria for skill extraction, the similarity metric used for refinement and contract formation, and the validation steps for determining reusability and transferability. We will also incorporate ablations that isolate the skill bank's contribution from the base 8B model's retrieval to rule out environment-specific correlations. A sketch of one such similarity-based refinement step follows these responses. revision: yes
-
Referee: [§4 Experiments] No cross-environment transfer tests or ablation studies (e.g., skill bank disabled, random skills, or a fixed bank) are reported. Such controls are necessary to substantiate that the co-evolution produces transferable improvements rather than per-environment overfitting, which bears directly on the weakest assumption identified in the manuscript.
Authors: We acknowledge that the current experiments primarily demonstrate in-environment performance improvements. To address concerns about overfitting versus transferable skills, we will include in the revised §4 additional ablation experiments such as COSPLAY with the skill bank disabled, using randomly generated skills, and a fixed skill bank without updates. We will also report cross-environment transfer tests, where skills discovered in one game environment are applied to another, to demonstrate reusability across tasks. These will be added as new tables or figures. revision: yes
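For the similarity metric promised in the second response, one simple instantiation, offered as an editor's sketch rather than the paper's method, is a greedy cosine-similarity merge over skill descriptions that keeps the stronger of any pair of near-duplicates; `embed` stands in for any sentence embedder and is an assumption here.

```python
import numpy as np

def merge_similar_skills(skills, embed, threshold=0.85):
    """Greedy deduplication of near-identical skills by cosine similarity of descriptions.

    `skills` are records exposing `.description` and `.success_rate`;
    `embed(text) -> np.ndarray` is any sentence embedder. Illustrative only.
    """
    kept, kept_vecs = [], []
    # Visit stronger skills first so they absorb weaker near-duplicates.
    for skill in sorted(skills, key=lambda s: s.success_rate, reverse=True):
        vec = embed(skill.description)
        vec = vec / np.linalg.norm(vec)
        if any(float(vec @ other) >= threshold for other in kept_vecs):
            continue  # near-duplicate of an already-kept skill: drop it from the bank
        kept.append(skill)
        kept_vecs.append(vec)
    return kept
```

The ablations promised in the third response would then compare the full loop against a bank frozen after the first iteration, a bank filled with randomly sampled trajectory snippets, and no bank at all.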
Circularity Check
No significant circularity: empirical claims rest on external benchmarks
full rationale
The paper describes an empirical co-evolution framework (COSPLAY) for LLM agents and skill banks, with central claims consisting of reported reward improvements (e.g., 25.1% average on single-player benchmarks) across six game environments. No derivation chain, equations, fitted parameters, or self-referential definitions exist; the method is presented as a proposed architecture whose performance is evaluated via external benchmarks rather than reducing to its own inputs by construction. Self-citations, if any, are not load-bearing for the core results, which are falsifiable against frontier LLM baselines. This is a standard experimental paper with no circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs can improve decision making by retrieving and applying structured skills from an external bank
- domain assumption: Reusable skills with usage contracts can be discovered from unlabeled agent rollouts
invented entities (2)
- Skill bank: no independent evidence
- Skill pipeline agent: no independent evidence
Forward citations
Cited by 1 Pith paper
-
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
Reference graph
Works this paper leans on
-
[1]
Dota 2 with Large Scale Deep Reinforcement Learning
Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
-
[2]
OpenAI Gym
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540.
-
[3]
Can VLMs Play Action Role-Playing Games? Take Black Myth Wukong as a Study Case
Peng Chen, Pi Bu, Jun Song, Yuan Gao, and Bo Zheng. Can vlms play action role-playing games? take black myth wukong as a study case. arXiv preprint arXiv:2409.12889.
-
[4]
Group-in-Group Policy Optimization for LLM Agent Training
Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. arXiv preprint arXiv:2505.10978.
-
[5]
Visplay: Self-Evolving Vision-Language Models from Images
Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, and Yonghui Yang. Visplay: Self-evolving vision-language models from images. arXiv preprint arXiv:2511.15661.
-
[6]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.
-
[7]
lmgame-bench: How Good Are LLMs at Playing Games?
Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P Xing, Ion Stoica, Tajana Rosing, Haojian Jin, and Hao Zhang. lmgame-bench: How good are llms at playing games? arXiv preprint arXiv:2505.15146.
-
[8]
A Survey on Large Language Model-Based Game Agents
Sihao Hu, Tiansheng Huang, Gaowen Liu, Ramana Rao Kompella, Fatih Ilhan, Selim Furkan Tekin, Yichang Xu, Zachary Yahn, and Ling Liu. A survey on large language model-based game agents. arXiv preprint arXiv:2404.02039.
-
[9]
R-Zero: Self-Evolving Reasoning LLM from Zero Data
Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data. arXiv preprint arXiv:2508.05004, 2025a.
-
[10]
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. Skillsbench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670, 2026a.
-
[11]
Self-Rewarding Vision-Language Model via Reasoning Decomposition
Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, et al. Self-rewarding vision-language model via reasoning decomposition. arXiv preprint arXiv:2508.19652.
-
[12]
Mm-zero: Self-Evolving Multi-Model Vision Language Models from Zero Data
Zongxia Li, Hongyang Du, Chengsong Huang, Xiyang Wu, Lantao Yu, Yicheng He, Jing Xie, Xiaomin Wu, Zhichao Liu, Jiarui Zhang, et al. Mm-zero: Self-evolving multi-model vision language models from zero data. arXiv preprint arXiv:2603.09206, 2026b.
-
[13]
From Text to Tactic: Evaluating LLMs Playing the Game of Avalon
Jonathan Light, Min Cai, Sheng Shen, and Ziniu Hu. Avalonbench: Evaluating llms playing the game of avalon. arXiv preprint arXiv:2310.05036.
-
[14]
Let's Verify Step by Step
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050.
-
[15]
Agentic Reinforcement Learning with Implicit Step Rewards
Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, and Jianbin Jiao. Agentic reinforcement learning with implicit step rewards. arXiv preprint arXiv:2509.19199.
-
[16]
AVA: Attentive VLM Agent for Mastering StarCraft II
Weiyu Ma, Yuqian Fu, Zecheng Zhang, Guohao Li, and Bernard Ghanem. Vlms play starcraft ii: A benchmark and multimodal decision method. arXiv preprint arXiv:2503.05383.
-
[17]
Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents
Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, and Jun Wang. Procmem: Learning reusable procedural memory from experience via non-parametric ppo for llm agents. arXiv preprint arXiv:2602.01869.
-
[18]
Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
-
[19]
Balrog: Benchmarking Agentic LLM and VLM Reasoning on Games
Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, et al. Balrog: Benchmarking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543.
-
[20]
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
Dongmin Park, Minkyu Kim, Beongjun Choi, Junhyuck Kim, Keon Lee, Jonghyun Lee, Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, et al. Orak: A foundational benchmark for training and evaluating llm agents on diverse video games. arXiv preprint arXiv:2506.03610.
-
[21]
Scaling Instructable Agents Across Many Simulated Worlds
Maria Abi Raad, Arun Ahuja, Catarina Barros, Frederic Besse, Andrew Bolt, Adrian Bolton, Bethanie Brownfield, Gavin Buttimore, Max Cant, Sarah Chakera, et al. Scaling instructable agents across many simulated worlds. arXiv preprint arXiv:2404.10179.
-
[22]
Bayesian Social Deduction with Graph-Informed Language Models
Shahab Rahimirad, Guven Gergerli, Lucia Romero, Angela Qian, Matthew Lyle Olson, Simon Stepputtis, and Joseph Campbell. Bayesian social deduction with graph-informed language models. arXiv preprint arXiv:2506.17788.
-
[23]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
-
[24]
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
-
[25]
Voyager: An Open-Ended Embodied Agent with Large Language Models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
-
[26]
Reinforcement Learning for Self-Improving Agent with Skill Library
Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and Lin Lee Cheong. Reinforcement learning for self-improving agent with skill library. arXiv preprint arXiv:2512.17102, 2025a.
-
[27]
Visgym: Diverse, Customizable, Scalable Environments for Multimodal Agents
Zirui Wang, Junyi Zhang, Jiaxin Ge, Long Lian, Letian Fu, Lisa Dunlap, Ken Goldberg, XuDong Wang, Ion Stoica, David M Chan, et al. Visgym: Diverse, customizable, scalable environments for multimodal agents. arXiv preprint arXiv:2601.16973.
-
[28]
SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning
Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234.
-
[29]
Ui-mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents
Han Xiao, Guozhi Wang, Hao Wang, Shilong Liu, Yuxiang Chai, Yue Pan, Yufeng Zhou, Xiaoxin Chen, Yafei Wen, and Hongsheng Li. Ui-mem: Self-evolving experience memory for online reinforcement learning in mobile gui agents. arXiv preprint arXiv:2602.05832.
-
[30]
Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward
Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward. arXiv preprint arXiv:2602.12430.
-
[31]
Vs-bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments
Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Xinlei Chen, Yi Wu, Chao Yu, and Yu Wang. Vs-bench: Evaluating vlms for strategic reasoning and decision-making in multi-agent environments. 2025a.
-
[32]
Memweaver: A Hierarchical Memory from Textual Interactive Behaviors for Personalized Generation
Shuo Yu, Mingyue Cheng, Daoyu Wang, Qi Liu, Zirui Liu, Ze Guo, and Xiaoyu Tao. Memweaver: A hierarchical memory from textual interactive behaviors for personalized generation. arXiv preprint arXiv:2510.07713, 2025a.
-
[33]
Agentevolver: Towards Efficient Self-Evolving Agent System
Yunpeng Zhai, Shuchang Tao, Cheng Chen, Anni Zou, Ziqian Chen, Qingxu Fu, Shinji Mai, Li Yu, Jiaji Deng, Zouying Cao, et al. Agentevolver: Towards efficient self-evolving agent system. arXiv preprint arXiv:2511.10395.
-
[34]
Videogamebench: Can Vision-Language Models Complete Popular Video Games?
Alex L Zhang, Thomas L Griffiths, Karthik R Narasimhan, and Ofir Press. Videogamebench: Can vision-language models complete popular video games? arXiv preprint arXiv:2505.18134.
-
[35]
MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. Memskill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474, 2026a.
-
[36]
C. Key Hyperparameters (appendix excerpt)
We summarize the main hyperparameters used in co-evolution training for the six game environments in the main paper. Table 3 lists the game-specific settings used in our main experiments. All training runs are conducted on an 8×A100 GPU cluster. For games without explicit GRPO overrides, we report the default values directly: GRPO c...
-
[37]
Avalon is a team-based competitive game in which only one side can win. It is structurally harder for the Good side, since Good players must infer hidden roles from sparse signals such as proposals, votes, and quest outcomes, while Evil players begin with full coordination and can strategically hide or sabotage (Light et al., 2023). As shown in Table 1, o...