
arxiv: 2602.09514 · v3 · submitted 2026-02-10 · 💻 cs.CL · cs.AI

EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

Pith reviewed 2026-05-16 05:50 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM evaluation, long-horizon planning, interactive economies, agent benchmarks, plan-and-execute, economic decision making, partial observability, stochastic environments

The pith

No single large language model excels across all long-horizon economic planning scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EcoGym, a benchmark with three interactive economy environments for testing continuous plan-and-execute decisions by LLMs over long horizons. Evaluation uses business metrics such as net worth and daily active users under partial observability and stochastic conditions. Tests on eleven leading models show no single model leads in every scenario, with each exhibiting clear weaknesses in either high-level strategy formulation or efficient action execution. This matters for anyone building autonomous agents because it identifies where current LLMs fall short in persistent, real-world-like economic tasks rather than short episodic ones.

Core claim

EcoGym supplies a unified framework of three environments (Vending, Freelance, and Operation) with standardized interfaces and effectively unbounded horizons, then measures eleven LLMs on outcomes including net worth, income, and DAU. The experiments indicate that models display systematic suboptimality: each one underperforms in either strategic coherence or execution efficiency, so dominance in one environment does not transfer to the others.

What carries the argument

The EcoGym benchmark itself: three environments with budgeted actions and business-relevant scoring over horizons of 1000+ steps.

If this is right

  • LLM agents require separate improvements in high-level planning and low-level execution to handle extended economic interactions.
  • Evaluation of autonomous agents must shift from short episodic tasks to persistent stochastic environments to reveal true capability gaps.
  • Business-aligned metrics such as net worth and DAU provide clearer signals for practical utility than generic task success rates.
  • Open release of the environments enables direct comparison of new models and study of controllability versus performance trade-offs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Specialized or ensemble models tuned to individual economic domains may outperform any single general model.
  • Incorporating economic simulation data during training could reduce the observed suboptimality in strategy and execution.
  • Extending the benchmark to multi-agent settings would test whether the same limitations appear in competitive or cooperative economic dynamics.

Load-bearing premise

The three chosen environments and their metrics capture the essential difficulties of real long-horizon economic decisions under uncertainty.

What would settle it

Observation of one LLM achieving top scores on net worth, income, and DAU simultaneously across the Vending, Freelance, and Operation environments would disprove the claimed systematic tension.
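
The falsification test above can be sketched as a small dominance check. This is a minimal illustration, not code from the paper: the function name `dominant_models`, the model names, and every score are hypothetical, and it assumes one headline metric per environment where higher is better.

```python
from typing import Dict, List

# Hypothetical layout: environment name -> {model name -> headline score}.
# All values below are illustrative placeholders, not results from the paper.
Scores = Dict[str, Dict[str, float]]


def dominant_models(scores: Scores) -> List[str]:
    """Return the models that hold the top score (ties allowed) in every environment."""
    candidates = None
    for by_model in scores.values():
        best = max(by_model.values())
        top = {model for model, value in by_model.items() if value == best}
        # Keep only models that were also on top in all earlier environments.
        candidates = top if candidates is None else candidates & top
    return sorted(candidates or [])


example: Scores = {
    "Vending": {"model_a": 1200.0, "model_b": 900.0, "model_c": 1100.0},
    "Freelance": {"model_a": 300.0, "model_b": 450.0, "model_c": 400.0},
    "Operation": {"model_a": 5000.0, "model_b": 5200.0, "model_c": 6100.0},
}

if __name__ == "__main__":
    # An empty list matches the "no single model dominates" pattern;
    # a non-empty list would be the counterexample described above.
    print(dominant_models(example))
```

An empty result is consistent with the paper's claim, while any returned model name would be the single dominating model that the claim rules out.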

Original abstract

Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks suffer from being largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments: Vending (adapted from the closed-source Vending-Bench, with full open-source release), Freelance (new), and Operation (new), implemented in a unified decision-making process with standardized interfaces, and budgeted actions over an effectively unbounded horizon (1000+ steps if 365 day-loops for evaluation). The evaluation of EcoGym is based on business-relevant outcomes (e.g., net worth, income, and DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategies or efficient actions executions. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability utility trade-offs in economic settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EcoGym, a benchmark for long-horizon plan-and-execute decision making in interactive economies consisting of three environments (Vending adapted from Vending-Bench, plus new Freelance and Operation environments) with standardized interfaces, budgeted actions, and 1000+ step horizons under partial observability and stochasticity. Experiments on eleven LLMs using business-relevant metrics (net worth, income, DAU) claim that no single model dominates across scenarios and that models exhibit suboptimality in either high-level strategies or efficient action execution.

Significance. If the environments and metrics isolate long-horizon planning deficits, EcoGym would offer a valuable open, extensible testbed for LLM agent evaluation in persistent economic settings and highlight strategy-execution trade-offs. The open-source release of the environments is a concrete strength that supports reproducibility and community use.

major comments (2)
  1. §3 (Environments): No ablation studies or baseline comparisons (e.g., myopic/greedy policies or short-horizon rollouts) are reported to show that the chosen metrics (net worth, income, DAU) actually penalize immediate-reward maximization and require coherent multi-step planning. Without this, the central claim of planning suboptimality cannot be distinguished from local execution errors.
  2. §4-5 (Experimental Setup and Results): The manuscript provides no statistical details (confidence intervals, significance tests), data exclusion rules, or sensitivity analysis on prompt templates and action budgets for the eleven models. This undermines verification of the 'no single model dominates' finding and the strategy-versus-execution tension.
minor comments (2)
  1. Abstract: The phrase '365 day-loops for evaluation' is unclear; the full text should explicitly define the evaluation protocol and how it maps to the 1000+ step horizon.
  2. Implementation details: The standardized interfaces and partial-observability mechanisms should be illustrated with a concrete example from one environment to improve reproducibility.
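
To make the first major comment concrete, the sketch below shows the kind of myopic-versus-lookahead baseline comparison the referee asks for. It runs on a toy shop that is entirely hypothetical: `ToyShop`, its parameters, and both policies are assumptions made for illustration and are not EcoGym environments, interfaces, or results. The point is only that a final-cash metric separates a policy that maximizes immediate reward from one that accepts a short-term loss for a long-term gain.

```python
import random
from dataclasses import dataclass

DEMANDS = range(3, 13)  # assumed daily demand: uniform integer in [3, 12]


def expected_sales(capacity: int) -> float:
    """Expected units sold in one day at the given capacity."""
    return sum(min(d, capacity) for d in DEMANDS) / len(DEMANDS)


@dataclass
class ToyShop:
    """Hypothetical toy economy for illustration; NOT an EcoGym environment."""
    cash: float = 100.0
    capacity: int = 5           # units that can be sold per day
    price: float = 3.0          # revenue per unit sold
    upgrade_cost: float = 60.0  # one-off cost of an upgrade
    upgrade_gain: int = 5       # capacity added by an upgrade

    def step(self, action: str, rng: random.Random) -> float:
        """Apply one daily action and return the immediate change in cash."""
        before = self.cash
        if action == "upgrade" and self.cash >= self.upgrade_cost:
            self.cash -= self.upgrade_cost
            self.capacity += self.upgrade_gain
        demand = rng.randint(DEMANDS[0], DEMANDS[-1])
        self.cash += self.price * min(demand, self.capacity)
        return self.cash - before


def extra_revenue_per_day(shop: ToyShop) -> float:
    """Expected daily revenue added by one more upgrade."""
    gain = expected_sales(shop.capacity + shop.upgrade_gain) - expected_sales(shop.capacity)
    return shop.price * gain


def myopic_policy(shop: ToyShop, days_left: int) -> str:
    """Upgrade only if it pays for itself within the current day."""
    return "upgrade" if extra_revenue_per_day(shop) > shop.upgrade_cost else "wait"


def lookahead_policy(shop: ToyShop, days_left: int) -> str:
    """Upgrade if the expected gain over the remaining horizon exceeds the cost."""
    affordable = shop.cash >= shop.upgrade_cost
    pays_off = extra_revenue_per_day(shop) * days_left > shop.upgrade_cost
    return "upgrade" if affordable and pays_off else "wait"


def run(policy, days: int = 365, seed: int = 0) -> float:
    """Simulate one episode and return final cash as a stand-in for net worth."""
    rng = random.Random(seed)
    shop = ToyShop()
    for day in range(days):
        shop.step(policy(shop, days - day), rng)
    return shop.cash


if __name__ == "__main__":
    print("myopic:", run(myopic_policy))
    print("lookahead:", run(lookahead_policy))
```

Both policies see the same demand sequence for a given seed, so any gap in final cash comes from the investment decision alone. A benchmark metric that fails to separate such baselines would not be measuring long-horizon planning.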

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate the changes we will make in the revised manuscript.

  1. Referee: §3 (Environments): No ablation studies or baseline comparisons (e.g., myopic/greedy policies or short-horizon rollouts) are reported to show that the chosen metrics (net worth, income, DAU) actually penalize immediate-reward maximization and require coherent multi-step planning. Without this, the central claim of planning suboptimality cannot be distinguished from local execution errors.

    Authors: We agree that ablation studies with baselines such as myopic or greedy policies would strengthen the evidence that the metrics capture long-horizon planning deficits rather than just execution errors. We will add these baseline comparisons in the revised manuscript to better isolate the planning suboptimality. revision: yes

  2. Referee: §4-5 (Experimental Setup and Results): The manuscript provides no statistical details (confidence intervals, significance tests), data exclusion rules, or sensitivity analysis on prompt templates and action budgets for the eleven models. This undermines verification of the 'no single model dominates' finding and the strategy-versus-execution tension.

    Authors: We acknowledge the lack of statistical rigor in the current presentation. In the revision, we will report confidence intervals, conduct appropriate significance tests, detail data exclusion rules, and perform sensitivity analyses on prompt templates and action budgets. This will provide stronger support for our findings regarding model performance across scenarios. revision: yes
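
One way the promised confidence intervals could be computed is a percentile bootstrap over per-seed differences between two models. The sketch below is an assumption about method, not the authors' stated procedure: the function `bootstrap_ci`, the number of seeds, and all scores are hypothetical placeholders.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_ci(
    samples: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical final net worth for two models over eight shared seeds.
model_a = [1200.0, 1150.0, 1320.0, 990.0, 1275.0, 1180.0, 1240.0, 1100.0]
model_b = [1010.0, 1190.0, 1105.0, 940.0, 1060.0, 1150.0, 1020.0, 985.0]

# Pair runs by seed so environment stochasticity cancels within each pair.
diffs = [a - b for a, b in zip(model_a, model_b)]

if __name__ == "__main__":
    low, high = bootstrap_ci(diffs)
    print(f"mean difference: {statistics.fmean(diffs):.1f}")
    print(f"95% CI: [{low:.1f}, {high:.1f}]")
```

If the interval excludes zero, the two models are distinguishable at that level on this metric; if it straddles zero, a ranking between them is not supported by the runs. Repeating the check per environment would show whether "no single model dominates" survives run-to-run noise.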

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct performance measurements


The paper introduces EcoGym as an open benchmark consisting of three environments (Vending, Freelance, Operation) and reports empirical results from running eleven LLMs on long-horizon tasks. Central claims rest on observed outcomes (net worth, income, DAU) rather than any derivation, equation, or fitted parameter that reduces to inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the evaluation chain; results are obtained by direct simulation against the defined environments and metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the three environments and their outcome metrics are representative of long-horizon economic planning; no free parameters are described in the abstract, and no new physical or mathematical entities are postulated.

axioms (1)
  • Domain assumption: Business-relevant outcomes such as net worth, income, and DAU are appropriate proxies for strategic coherence and robustness in interactive economies.
    Invoked when defining evaluation targets in the abstract.

pith-pipeline@v0.9.0 · 5571 in / 1259 out tokens · 29012 ms · 2026-05-16T05:50:17.791274+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V

    cs.AI · 2026-04 · unverdicted · novelty 6.0

    CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
