arxiv: 2605.18181 · v1 · pith:XGGNAC52new · submitted 2026-05-18 · 💻 cs.AI · cs.CL

Scalable Environments Drive Generalizable Agents

Pith reviewed 2026-05-20 10:16 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords: generalizable agents, environment scaling, executable rule-sets, distribution shift, scalable environments, reinforcement learning, agent adaptation

The pith

Generalizable agents require scaling the distribution of executable rule-sets across environments rather than only adding trajectories or tasks within fixed benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that agents fail to generalize when environments change their core interaction rules, even if more data or tasks are provided under the same rules. It separates environment scaling from trajectory and task scaling and argues that only environment scaling tackles a world-level shift in interfaces, dynamics, observations, and feedback. A taxonomy is introduced to organize these distinctions, followed by methods for building varied environments through controllable programmatic generators or open-ended generative world models. The authors then link environment scaling to stateful learning that lets agents update across different rule-sets. If correct, this supplies the substrate needed for measurable progress toward agents that handle genuinely new worlds.

Core claim

The authors argue that generalization beyond training requires environment scaling, which expands the distribution of executable rule-sets that agents interact with, rather than only increasing trajectories or tasks within fixed benchmarks. Current practices leave agents brittle to changes in interfaces, dynamics, observations, or feedback signals because these constitute a distinct world-level distribution shift. The paper offers a unified taxonomy distinguishing the three scaling types by what changes in the rule-set and by their primary deliverables, then synthesizes construction paradigms that contrast programmatic generators with generative world models and outlines coupling to learned,

What carries the argument

The unified taxonomy that classifies scaling approaches by the specific changes they induce in the executable rule-set and by the primary deliverable each produces.

If this is right

  • Agents trained this way can adapt to new dynamics and observation spaces without retraining from scratch.
  • Benchmarks can be designed around controlled rule-set variation to track genuine generalization.
  • Programmatic generators enable precise testing while generative models supply broader, open-ended coverage.
  • Stateful learning with learned update rules becomes necessary to carry knowledge across differing environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks could be built that deliberately vary rule-sets in systematic ways to quantify how much environment scaling improves transfer.
  • The same logic might apply to planning domains where the underlying physics or reward structure is altered between episodes.
  • An immediate test would be to generate families of environments differing only in one rule component and measure whether cross-family performance rises only when that component is varied during training.
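The last test in the list above can be sketched in code. This is a minimal illustration, assuming a rule-set is represented by the four components the paper names (interface, dynamics, observation, feedback). The class `RuleSet`, the helpers `single_component_family` and `differing_components`, and every option value are hypothetical names introduced here, not taken from the paper.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RuleSet:
    """Hypothetical stand-in for an executable rule-set (assumed four components)."""
    interface: str
    dynamics: str
    observation: str
    feedback: str


def single_component_family(base, component, options):
    """Return variants of `base` that differ from it only in `component`."""
    valid = {f.name for f in fields(RuleSet)}
    if component not in valid:
        raise ValueError(f"unknown component: {component}")
    return [
        replace(base, **{component: value})
        for value in options
        if value != getattr(base, component)  # skip the base value itself
    ]


def differing_components(a, b):
    """List the component names on which two rule-sets disagree."""
    return [f.name for f in fields(RuleSet) if getattr(a, f.name) != getattr(b, f.name)]


# Illustrative values only; a real study would plug in executable environments.
base = RuleSet("keyboard", "gravity", "pixels", "dense_reward")
dynamics_family = single_component_family(
    base, "dynamics", ["gravity", "low_gravity", "ice"]
)
```

Each family isolates one component, so any change in cross-family performance can be attributed to that component rather than to a mixture of changes.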

Load-bearing premise

That alterations to executable rule-sets create a distinct world-level distribution shift that cannot be overcome by scaling trajectories or tasks inside fixed interaction rules.

What would settle it

Train one set of agents with large increases in trajectories and tasks inside a single fixed rule-set and another set with explicit variation across many different rule-sets, then measure whether both reach comparable success on entirely novel rule combinations never seen in training.
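One way to set up the two training arms and the held-out evaluation set is sketched below. It is a hedged illustration, not a protocol from the paper: the function `build_arms`, its parameters, and the component values are assumptions made for this example, and training and evaluation themselves are left out.

```python
import random
from itertools import product


def build_arms(component_options, n_holdout, seed=0):
    """Split all rule combinations into two training arms and a held-out set.

    component_options: dict mapping a component name to its possible values.
    Returns a dict with a single-rule-set arm, a varied arm, and evaluation
    combinations that appear in neither training arm.
    """
    names = sorted(component_options)
    combos = [
        dict(zip(names, values))
        for values in product(*(component_options[n] for n in names))
    ]
    if not 0 < n_holdout < len(combos):
        raise ValueError("n_holdout must leave at least one training combination")
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(combos)
    holdout, remaining = combos[:n_holdout], combos[n_holdout:]
    return {
        "fixed_arm_train": remaining[:1],  # one rule-set, scaled by trajectories/tasks
        "varied_arm_train": remaining,     # many rule-sets
        "eval": holdout,                   # combinations never seen in training
    }


# Hypothetical component values for illustration.
options = {
    "dynamics": ["gravity", "ice"],
    "observation": ["pixels", "text"],
    "feedback": ["dense", "sparse"],
}
arms = build_arms(options, n_holdout=2)
```

For the comparison to be informative, the two arms would presumably need matched interaction budgets; that is an assumption of this sketch rather than something the summary specifies.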

Figures

Figures reproduced from arXiv: 2605.18181 by Bang Liu, Chenglin Wu, Fanqi Kong, Guibin Zhang, Jianhao Ruan, Jiayi Zhang, Jinyu Xiang, Maojia Song, Yuyu Luo, Zhaoyang Yu.

Figure 1: Bottleneck shifts as we scale agents: from static data to interactive worlds. Different scaling targets yield different primary gains: LM data improves language and knowledge; trajectories improve policy competence; tasks improve environment generalization within a fixed rule set; environments enable cross-environment adaptation under changing rules and interfaces.
Figure 2: Scalable environments as a substrate for scaling agents. We organize scaling into trajectory, task, and environment regimes (A) with corresponding deliverables (B), relate them to learning mechanisms and capability (C), and summarize the generate–interact–learn–scale loop (D).
Environment as an executable rule set. We model an environment E as an executable specification of interaction rules, which determi…

Generalizable agents should adapt to diverse tasks and unseen environments beyond their training distribution. This position paper argues that such generalization requires environment scaling: expanding the distribution of executable rule-sets that agents interact with, rather than only increasing trajectories or tasks within fixed benchmarks. Current scaling practices largely focus on collecting more experience or broader task sets under fixed interaction rules, leaving agents brittle when underlying interfaces, dynamics, observations, or feedback signals change. The core challenge is therefore a world-level distribution shift: agents need systematic exposure to environments with meaningfully different executable rule-sets. To clarify this challenge, we propose a unified taxonomy that separates trajectory scaling, task scaling, and environment scaling by their primary deliverables and by what changes in the executable rule-set. Building on this taxonomy, we synthesize construction paradigms for scalable environments, contrasting programmatic generators that prioritize controllability and verifiability with generative world models that offer broader coverage and open-endedness. We further outline how environment scaling can be coupled with stateful learning mechanisms, emphasizing learned update rules for cross-environment adaptation. We conclude by discussing alternative perspectives and argue that scalable environments provide the essential substrate for measurable and controllable progress toward robust general agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper is a position piece arguing that generalizable agents require environment scaling—expanding the distribution over executable rule-sets (interfaces, dynamics, observations, feedback signals)—rather than only trajectory or task scaling within fixed benchmarks. It introduces a taxonomy distinguishing the three scaling types by their primary deliverables and rule-set changes, contrasts programmatic generators with generative world models for constructing scalable environments, and sketches how environment scaling can be paired with stateful update rules for cross-environment adaptation.

Significance. If the reframing holds, the work could usefully redirect attention in agent research from data volume within static rule-sets toward controllable expansion of environment distributions, providing a shared vocabulary and synthesis of construction approaches that may help organize future empirical efforts on robust generalization.

major comments (1)
  1. [Abstract and taxonomy section] The central claim that changes to executable rule-sets constitute a distinct world-level shift not addressable by task scaling rests on an unformalized distinction; without explicit criteria or illustrative mappings from existing benchmarks (e.g., how a change in dynamics differs from a task reparameterization), the taxonomy remains difficult to apply or falsify.
minor comments (2)
  1. [Construction paradigms section] The synthesis of programmatic versus generative paradigms would benefit from a short table comparing controllability, verifiability, and coverage properties.
  2. [Stateful learning section] A few forward references to concrete environment generators (e.g., specific procedural or learned simulators) would help ground the discussion of coupling with stateful learning mechanisms.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of the position paper to help organize research on robust generalization. We address the major comment on the taxonomy below, agreeing that greater explicitness will strengthen the contribution. The revised manuscript incorporates clarifications and examples as described.

  1. Referee: [Abstract and taxonomy section] The central claim that changes to executable rule-sets constitute a distinct world-level shift not addressable by task scaling rests on an unformalized distinction; without explicit criteria or illustrative mappings from existing benchmarks (e.g., how a change in dynamics differs from a task reparameterization), the taxonomy remains difficult to apply or falsify.

    Authors: We agree that the taxonomy benefits from more explicit criteria and concrete mappings to make the distinctions operational and falsifiable. The original manuscript defines the three scaling regimes by their primary deliverables and by which elements of the executable rule-set are altered, with environment scaling specifically involving changes to interfaces, dynamics, observations, or feedback signals. To address the concern, the revised version adds a dedicated subsection that formalizes the rule-set components and provides decision criteria: a modification counts as environment scaling if it alters the underlying transition or observation model in a way that cannot be reduced to reparameterizing goals or rewards within the original model. We include illustrative mappings, for example contrasting (i) reparameterizing reward functions or goal locations inside a fixed physics engine (task scaling) with (ii) changing the physics engine itself or the available actions (environment scaling). A new table maps several existing benchmarks (Atari variants, Procgen, Meta-World) to the taxonomy categories, showing how dynamics changes cross the boundary while task reparameterizations do not. These additions preserve the position-paper character while rendering the framework easier to apply and test.

    Revision: yes
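The decision criterion described in the response can be approximated in a few lines. The sketch below is a crude proxy under stated assumptions: the `EnvSpec` fields, the `Scaling` enum, and `classify_change` are hypothetical names, components are compared by simple equality, and the harder question of whether a change is reducible to a goal or reward reparameterization is not checked.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Scaling(Enum):
    TRAJECTORY = "trajectory scaling"
    TASK = "task scaling"
    ENVIRONMENT = "environment scaling"


@dataclass(frozen=True)
class EnvSpec:
    """Hypothetical summary of an environment; all fields are illustrative labels."""
    transition_model: str
    observation_model: str
    action_set: frozenset
    goal: str
    reward: str


def classify_change(base, variant):
    """Classify a modification relative to `base` (equality-based approximation)."""
    rules_changed = (
        variant.transition_model != base.transition_model
        or variant.observation_model != base.observation_model
        or variant.action_set != base.action_set
    )
    if rules_changed:
        return Scaling.ENVIRONMENT  # dynamics, observations, or actions differ
    if variant.goal != base.goal or variant.reward != base.reward:
        return Scaling.TASK  # same rules, reparameterized goal or reward
    return Scaling.TRAJECTORY  # identical spec: only more experience is collected


base = EnvSpec("physics_v1", "rgb", frozenset({"left", "right"}), "reach_A", "sparse")
moved_goal = replace(base, goal="reach_B")                 # example (i) in the response
new_engine = replace(base, transition_model="physics_v2")  # example (ii) in the response
```

Here `classify_change(base, moved_goal)` falls under task scaling and `classify_change(base, new_engine)` under environment scaling, mirroring the two contrasting examples the authors give.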

Circularity Check

0 steps flagged

No significant circularity; position paper is self-contained


This is a position paper that advances a conceptual taxonomy distinguishing trajectory scaling, task scaling, and environment scaling according to which elements of executable rule-sets are varied. The central claim—that generalization requires explicit expansion of the distribution over rule-sets—is presented as an argumentative reframing motivated directly by the stated goal of addressing world-level distribution shift. No equations, fitted parameters, derivations, or load-bearing self-citations exist that reduce the claims to their own inputs by construction. The distinctions and synthesis of construction paradigms are motivated independently by the paper's own stated objectives without self-referential definitions or unverified prior results from the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

This is a position paper without new empirical results or formal derivations. It relies on domain assumptions about agent brittleness and the necessity of rule-set diversity.

axioms (2)
  • domain assumption Current scaling practices focused on trajectories or tasks within fixed benchmarks leave agents brittle to changes in interfaces, dynamics, observations, or feedback signals.
    Invoked directly in the abstract as the core challenge motivating environment scaling.
  • domain assumption Stateful learning mechanisms with learned update rules can enable cross-environment adaptation when coupled with environment scaling.
    Outlined in the abstract as a necessary complement to environment scaling.

pith-pipeline@v0.9.0 · 5756 in / 1364 out tokens · 39576 ms · 2026-05-20T10:16:11.615003+00:00 · methodology


