EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

Boyu Feng; Guibin Zhang; He Zhu; Jiaheng Liu; Jincheng Ren; Jinxiang Xia; Kangqi Song; Li Lu; Minghao Liu; Shengze Xu

arxiv: 2602.09514 · v3 · submitted 2026-02-10 · 💻 cs.CL · cs.AI

Xavier Hu, Jinxiang Xia, Shengze Xu, Kangqi Song, Yishuo Yuan, Guibin Zhang, Jincheng Ren, Boyu Feng, Li Lu, Tieyong Zeng, Jiaheng Liu, Minghao Liu, He Zhu, Yuchen Eleanor Jiang, Wei Wang, Wangchunshu Zhou

Pith reviewed 2026-05-16 05:50 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords LLM evaluation · long-horizon planning · interactive economies · agent benchmarks · plan-and-execute · economic decision making · partial observability · stochastic environments

The pith

No single large language model excels across all long-horizon economic planning scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EcoGym, a benchmark with three interactive economy environments for testing continuous plan-and-execute decisions by LLMs over long horizons. Evaluation uses business metrics such as net worth and daily active users under partial observability and stochastic conditions. Tests on eleven leading models show no single model leads in every scenario, with each exhibiting clear weaknesses in either high-level strategy formulation or efficient action execution. This matters for anyone building autonomous agents because it identifies where current LLMs fall short in persistent, real-world-like economic tasks rather than short episodic ones.

Core claim

EcoGym supplies a unified framework of three environments—Vending, Freelance, and Operation—with standardized interfaces and effectively unbounded horizons, then measures eleven LLMs on outcomes including net worth, income, and DAU. The experiments establish that models display systematic suboptimality: each one underperforms in either strategic coherence or execution efficiency, so that dominance in one environment does not transfer to the others.

What carries the argument

EcoGym benchmark consisting of three environments with budgeted actions and business-relevant scoring over 1000+ step horizons.

If this is right

LLM agents require separate improvements in high-level planning and low-level execution to handle extended economic interactions.
Evaluation of autonomous agents must shift from short episodic tasks to persistent stochastic environments to reveal true capability gaps.
Business-aligned metrics such as net worth and DAU provide clearer signals for practical utility than generic task success rates.
Open release of the environments enables direct comparison of new models and study of controllability versus performance trade-offs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Specialized or ensemble models tuned to individual economic domains may outperform any single general model.
Incorporating economic simulation data during training could reduce the observed suboptimality in strategy and execution.
Extending the benchmark to multi-agent settings would test whether the same limitations appear in competitive or cooperative economic dynamics.

Load-bearing premise

The three chosen environments and their metrics capture the essential difficulties of real long-horizon economic decisions under uncertainty.

What would settle it

Observation of one LLM achieving top scores on net worth, income, and DAU simultaneously across the Vending, Freelance, and Operation environments would disprove the claimed systematic tension.
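
The falsification test above can be sketched in a few lines. This is only an illustration: the model names and scores are invented placeholders, not results from the paper, and the helper names `best_per_environment` and `dominant_model` are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical data: model -> environment -> headline metric (higher is better).
# These values are invented for illustration; they are NOT the paper's results.
example_scores: Dict[str, Dict[str, float]] = {
    "model_a": {"vending": 1200.0, "freelance": 300.0, "operation": 5000.0},
    "model_b": {"vending": 900.0, "freelance": 450.0, "operation": 4000.0},
    "model_c": {"vending": 1000.0, "freelance": 200.0, "operation": 7000.0},
}


def best_per_environment(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each environment to its top-scoring model (ties go to the first model listed)."""
    environments = sorted({env for per_env in scores.values() for env in per_env})
    return {env: max(scores, key=lambda m: scores[m][env]) for env in environments}


def dominant_model(scores: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Return the model that leads every environment, or None if leadership is split."""
    winners = set(best_per_environment(scores).values())
    return winners.pop() if len(winners) == 1 else None


print(best_per_environment(example_scores))
print(dominant_model(example_scores))
```

With the placeholder numbers, each environment has a different leader, so `dominant_model` returns `None`. A non-`None` result on real scores would be the observation that contradicts the paper's claim.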

Original abstract

Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks suffer from being largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments: Vending (adapted from the closed-source Vending-Bench, with full open-source release), Freelance (new), and Operation (new), implemented in a unified decision-making process with standardized interfaces, and budgeted actions over an effectively unbounded horizon (1000+ steps if 365 day-loops for evaluation). The evaluation of EcoGym is based on business-relevant outcomes (e.g., net worth, income, and DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategies or efficient actions executions. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability utility trade-offs in economic settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

EcoGym adds two new economic environments plus an open Vending-Bench under one interface, but the experiments do not yet show that the reported model gaps come from planning shortfalls rather than local execution errors.

The paper's real contribution is the testbed itself. It ships Freelance and Operation as fresh environments, releases an open version of Vending-Bench, and puts all three behind the same decision loop with 1000-plus step horizons and metrics that track net worth, income, and DAU. That setup is more persistent than most episodic agent benchmarks, and the authors run it on eleven models to show that no single one leads across the board while some fall short on strategy and others on execution. Those are useful data points for anyone trying to build agents that stay coherent over long economic runs. The open release and unified interface make it straightforward for others to extend or compare against.

The citation list is standard and does not lean on self-reinforcing loops. The central claim about a strategy-versus-execution tension is plausible on the surface, but the abstract gives no ablations against myopic or greedy baselines. Without that, it is unclear whether the environments actually punish short-horizon policies or whether the observed suboptimality is mostly local action mistakes. The stress-test note on this point lands because the paper does not report whether immediate-reward maximization already produces competitive scores. Minor gaps include missing statistical details on the model comparisons and no explicit discussion of how partial observability is enforced in practice.

This work is aimed at groups building and evaluating long-horizon LLM agents who need grounded economic testbeds. It is worth sending to peer review because the environments are new and the interface is reusable; the results section will need tighter controls on what the metrics actually isolate before the tension claim can be taken as settled.

Referee Report

2 major / 2 minor

Summary. The paper introduces EcoGym, a benchmark for long-horizon plan-and-execute decision making in interactive economies consisting of three environments (Vending adapted from Vending-Bench, plus new Freelance and Operation environments) with standardized interfaces, budgeted actions, and 1000+ step horizons under partial observability and stochasticity. Experiments on eleven LLMs using business-relevant metrics (net worth, income, DAU) claim that no single model dominates across scenarios and that models exhibit suboptimality in either high-level strategies or efficient action execution.

Significance. If the environments and metrics isolate long-horizon planning deficits, EcoGym would offer a valuable open, extensible testbed for LLM agent evaluation in persistent economic settings and highlight strategy-execution trade-offs. The open-source release of the environments is a concrete strength that supports reproducibility and community use.

major comments (2)

§3 (Environments): No ablation studies or baseline comparisons (e.g., myopic/greedy policies or short-horizon rollouts) are reported to show that the chosen metrics (net worth, income, DAU) actually penalize immediate-reward maximization and require coherent multi-step planning. Without this, the central claim of planning suboptimality cannot be distinguished from local execution errors.
§4-5 (Experimental Setup and Results): The manuscript provides no statistical details (confidence intervals, significance tests), data exclusion rules, or sensitivity analysis on prompt templates and action budgets for the eleven models. This undermines verification of the 'no single model dominates' finding and the strategy-versus-execution tension.
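
To make the first major comment concrete, here is a minimal sketch of the kind of myopic baseline the referee asks for. Everything in it is an assumption for illustration: the toy "invest or hold" economy, the constants `INVEST_COST` and `INCOME_GAIN`, and the policy names are hypothetical and are not taken from EcoGym.

```python
INVEST_COST = 50.0  # assumed one-off cost of an upgrade
INCOME_GAIN = 1.0   # assumed permanent increase in daily income per upgrade


def myopic_policy(day, cash, income, horizon):
    """Pick the action with the highest immediate (same-day) cash change."""
    immediate = {"hold": income, "invest": income + INCOME_GAIN - INVEST_COST}
    return max(immediate, key=immediate.get)


def lookahead_policy(day, cash, income, horizon):
    """Invest whenever the upgrade pays for itself over the remaining days."""
    remaining = horizon - day
    return "invest" if remaining * INCOME_GAIN > INVEST_COST else "hold"


def final_net_worth(policy, horizon=365, cash=100.0, income=1.0):
    """Run one deterministic episode of the toy economy and return final cash."""
    for day in range(horizon):
        wants_invest = policy(day, cash, income, horizon) == "invest"
        if wants_invest and cash >= INVEST_COST:
            cash -= INVEST_COST
            income += INCOME_GAIN
        cash += income  # daily income is collected after any upgrade
    return cash


myopic = final_net_worth(myopic_policy)
planned = final_net_worth(lookahead_policy)
print(f"myopic: {myopic:.0f}, lookahead: {planned:.0f}")
```

In this toy setting the myopic policy never invests, because every upgrade lowers same-day cash, so the gap between the two scores reflects planning rather than execution. Reporting the same comparison on the real environments would show whether their metrics penalize short-horizon behavior in a similar way.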

minor comments (2)

Abstract: The phrase '365 day-loops for evaluation' is unclear; the full text should explicitly define the evaluation protocol and how it maps to the 1000+ step horizon.
Implementation details: The standardized interfaces and partial-observability mechanisms should be illustrated with a concrete example from one environment to improve reproducibility.
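
As a rough picture of what both minor comments are asking for, the sketch below shows a toy environment with a per-day action budget, a hidden demand parameter, and an explicit day loop. It is not the EcoGym API: the class `ToyVendingEnv`, its methods, the prices, and the figure of three actions per day are all assumptions chosen so that 365 day-loops yield more than 1000 steps.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What the agent sees; the hidden demand level is deliberately absent."""
    day: int
    cash: float
    stock: int
    actions_left: int


class ToyVendingEnv:
    """Hypothetical toy environment for illustration; not the EcoGym API."""

    UNIT_COST = 1.0   # assumed wholesale cost per item
    UNIT_PRICE = 2.0  # assumed retail price per item

    def __init__(self, days: int = 365, actions_per_day: int = 3, seed: int = 0):
        self.days = days
        self.actions_per_day = actions_per_day
        self._rng = random.Random(seed)
        self._mean_demand = self._rng.uniform(5.0, 15.0)  # hidden from the agent
        self.day = 0
        self.cash = 100.0
        self.stock = 0
        self.actions_left = actions_per_day
        self.steps = 0

    @property
    def done(self) -> bool:
        return self.day >= self.days

    def observe(self) -> Observation:
        return Observation(self.day, self.cash, self.stock, self.actions_left)

    def step(self, action: str, quantity: int = 0) -> Observation:
        """Spend one unit of the daily action budget on `action`."""
        if self.done:
            raise RuntimeError("episode finished")
        if action == "restock":
            affordable = int(self.cash // self.UNIT_COST)
            bought = max(0, min(quantity, affordable))
            self.cash -= bought * self.UNIT_COST
            self.stock += bought
        elif action != "wait":
            raise ValueError(f"unknown action: {action}")
        self.steps += 1
        self.actions_left -= 1
        if self.actions_left == 0:
            self._close_day()
        return self.observe()

    def _close_day(self) -> None:
        """Sample stochastic demand, settle sales, and start the next day."""
        demand = max(0, round(self._rng.gauss(self._mean_demand, 2.0)))
        sold = min(self.stock, demand)
        self.stock -= sold
        self.cash += sold * self.UNIT_PRICE
        self.day += 1
        self.actions_left = self.actions_per_day


env = ToyVendingEnv()
while not env.done:
    obs = env.observe()
    if obs.stock < 10:
        env.step("restock", quantity=10)
    else:
        env.step("wait")
print(env.steps)  # 365 days * 3 actions per day = 1095
```

Under these assumed settings the episode lasts 1095 steps, which is one way "365 day-loops" could map to a "1000+ step" horizon. The paper's actual budget per day may differ.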

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate the changes we will make in the revised manuscript.

Referee: §3 (Environments): No ablation studies or baseline comparisons (e.g., myopic/greedy policies or short-horizon rollouts) are reported to show that the chosen metrics (net worth, income, DAU) actually penalize immediate-reward maximization and require coherent multi-step planning. Without this, the central claim of planning suboptimality cannot be distinguished from local execution errors.

Authors: We agree that ablation studies with baselines such as myopic or greedy policies would strengthen the evidence that the metrics capture long-horizon planning deficits rather than just execution errors. We will add these baseline comparisons in the revised manuscript to better isolate the planning suboptimality. Revision: yes.

Referee: §4-5 (Experimental Setup and Results): The manuscript provides no statistical details (confidence intervals, significance tests), data exclusion rules, or sensitivity analysis on prompt templates and action budgets for the eleven models. This undermines verification of the 'no single model dominates' finding and the strategy-versus-execution tension.

Authors: We acknowledge the lack of statistical rigor in the current presentation. In the revision, we will report confidence intervals, conduct appropriate significance tests, detail data exclusion rules, and perform sensitivity analyses on prompt templates and action budgets. This will provide stronger support for our findings regarding model performance across scenarios. Revision: yes.
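
One simple way to produce the confidence intervals promised here is a percentile bootstrap over repeated runs. The sketch below assumes each model is evaluated on several random seeds. The per-seed net-worth values are invented placeholders, and `bootstrap_ci` is a hypothetical helper, not something described in the paper.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(values: List[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = int((1 - alpha / 2) * n_resamples) - 1
    return resampled_means[lower_index], resampled_means[upper_index]


# Hypothetical final net worth of one model across five seeds (placeholder values).
net_worth_by_seed = [512.0, 430.5, 610.2, 389.9, 555.0]
low, high = bootstrap_ci(net_worth_by_seed)
print(f"mean={mean(net_worth_by_seed):.1f}, 95% CI=({low:.1f}, {high:.1f})")
```

If the intervals of two models overlap heavily in an environment, a ranking between them there is weak evidence, which bears directly on the "no single model dominates" finding.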

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct performance measurements

The paper introduces EcoGym as an open benchmark consisting of three environments (Vending, Freelance, Operation) and reports empirical results from running eleven LLMs on long-horizon tasks. Central claims rest on observed outcomes (net worth, income, DAU) rather than any derivation, equation, or fitted parameter that reduces to inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the evaluation chain; results are obtained by direct simulation against the defined environments and metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the three environments and their outcome metrics are representative of long-horizon economic planning; no free parameters are described in the abstract, and no new physical or mathematical entities are postulated.

axioms (1)

Domain assumption: Business-relevant outcomes such as net worth, income, and DAU are appropriate proxies for strategic coherence and robustness in interactive economies.
Invoked when defining evaluation targets in the abstract.

pith-pipeline@v0.9.0 · 5571 in / 1259 out tokens · 29012 ms · 2026-05-16T05:50:17.791274+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
cs.AI 2026-04 unverdicted novelty 6.0

CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 1 Pith paper · 15 internal anchors

[1]

HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds

Petr Anokhin, Roman Khalikov, Stefan Rebrikov, Viktor Volkov, Artyom Sorokin, and Vincent Bissonnette. Herobench: A benchmark for long-horizon planning and structured reasoning in virtual worlds.arXiv preprint arXiv:2508.12782, 2025

[2]

Vending-bench: A benchmark for long-term coherence of autonomous agents.arXiv preprint arXiv:2502.15840, 2025

Axel Backlund and Lukas Petersson. Vending-bench: A benchmark for long-term coherence of autonomous agents. arXiv preprint arXiv:2502.15840, 2025

[3]

Vending-bench 2, 2025

Axel Backlund and Lukas Petersson. Vending-bench 2, 2025. URL https://andonlabs.com/evals/vending-bench-2

[4]

xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations, 2025

Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, YuanGong, ChenSun, HanHou, HuiYang, JamesPan, JiananLou, JiayiMao, JizhengLiu, JinpengLi, Kangyi Liu, Kenkun Liu, Rui Wang, Run Li, Tong Niu, Wenlong Zhang, Wenqi Yan, Xuanzheng Wang, Yuchen Zhang, Yi- Hsin Hung, Yuan Jiang, Zexuan Liu, Zihan Y...

[5]

Pca-bench: Evaluating multimodal large language models in perception-cognition-action chain, 2024

Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Xiangdi Meng, Tianyu Liu, and Baobao Chang. Pca-bench: Evaluating multimodal large language models in perception-cognition-action chain, 2024. URLhttps://arxiv.org/abs/2402.15527

[6]

StockBench: Can LLM agents trade stocks profitably in real-world markets?arXiv preprint arXiv:2510.02209, 2025

Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, and Juanzi Li. Stockbench: Can llm agents trade stocks profitably in real-world markets?, 2025. URLhttps://arxiv.org/abs/2510.02209

[7]

Finqa: A dataset of numerical reasoning over financial data

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan R Routledge, et al. Finqa: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711, 2021

[8]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

[9]

Lexam: Benchmarking legal reasoning on 340 law exams.arXiv preprint arXiv:2505.12864, 2025

Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Etienne Salimbeni, Florian Geering, Oliver Dreyer, et al. Lexam: Benchmarking legal reasoning on 340 law exams.arXiv preprint arXiv:2505.12864, 2025

[10]

Large language models empowered agent-based modeling and simulation: A survey and perspectives, 2023

Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, and Yong Li. Large language models empowered agent-based modeling and simulation: A survey and perspectives, 2023. URL https://arxiv.org/abs/2312.11970

[11]

Agent-based simulation of a financial market with large language models, 2025

Ryuji Hashimoto, Takehiro Takayanagi, Masahiro Suzuki, and Kiyoshi Izumi. Agent-based simulation of a financial market with large language models, 2025. URLhttps://arxiv.org/abs/2510.12189

[12]

Pile of law: Learning responsible data filtering from the law and a 256gb open-source legal dataset.Advances in Neural Information Processing Systems, 35:29217–29234, 2022

Peter Henderson, Mark Krass, Lucia Zheng, Neel Guha, Christopher D Manning, Dan Jurafsky, and Daniel Ho. Pile of law: Learning responsible data filtering from the law and a 256gb open-source legal dataset.Advances in Neural Information Processing Systems, 35:29217–29234, 2022

[13]

Step-DeepResearch technical report.arXiv preprint arXiv:2512.20491, 2025

Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, Wei Yuan, Wenjin Deng, Xiaojian Yuan, Xiaoyun Zhang, Xiangyu Liu, Xikai Liu, Yanming Xu, Yicheng Cao, Yifei Zhang, Yongyao Wang, Yubo Shu, Yurong Zhang, Yuxiang Zhang, Zheng Gong, Zhichao Chang, Binyan Li, Dan Ma, Furong Jia, Hongyuan Wang, Jiayu L...

[14]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

[15]

AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, and Xinggang Wang. Alphadrive: Unleashing the power of vlms in autonomous driving via reinforcement learning and reasoning, 2025. URL https://arxiv.org/abs/2503.07608

[16]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

[17]

Agent-oriented planning in multi-agent systems

Ao Li, Yuexiang Xie, Songze Li, Fugee Tsung, Bolin Ding, and Yaliang Li. Agent-oriented planning in multi-agent systems. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=EqcLAU6gyU

[18]

Econagent: large language model-empowered agents for simulating macroeconomic activities

Nian Li, Chen Gao, Mingyu Li, Yong Li, and Qingmin Liao. Econagent: large language model-empowered agents for simulating macroeconomic activities. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15523–15536, 2024

[19]

Quantagents: Towards multi-agent financial system via simulated trading, 2025

Xiangyu Li, Yawen Zeng, Xiaofen Xing, Jin Xu, and Xiangmin Xu. Quantagents: Towards multi-agent financial system via simulated trading, 2025. URLhttps://arxiv.org/abs/2510.04643

[20]

Towards personalized deep research: Benchmarks and evaluations.arXiv preprint arXiv:2509.25106, 2025

Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, et al. Towards personalized deep research: Benchmarks and evaluations.arXiv preprint arXiv:2509.25106, 2025

[21]

BizFinBench: A business-driven real-world financial benchmark for evaluating LLMs

Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, and Ji Liu. Bizfinbench: A business-driven real-world financial benchmark for evaluating llms. arXiv preprint arXiv:2505.19457, 2025

[22]

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, and Ming Zhang. Large language model agent: A surve...

[23]

AgentBoard: An analytical evaluation board of multi-turn LLM agents

Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi-turn llm agents, 2024. URL https://arxiv.org/abs/2401.13178

[24]

Remote labor index: Measuring AI automation of remote work

Mantas Mazeika, Alice Gatti, Cristina Menghini, Udari Madhushani Sehwag, Shivam Singhal, Yury Orlovskiy, Steven Basart, Manasi Sharma, Denis Peskoff, Elaine Lau, et al. Remote labor index: Measuring ai automation of remote work. arXiv preprint arXiv:2510.26787, 2025

[25]

StockSim: A dual-mode order-level simulator for evaluating multi-agent LLMs in financial markets

Charidimos Papadakis, Giorgos Filandrianos, Angeliki Dimitriou, Maria Lymperaiou, Konstantinos Thomas, and Giorgos Stamou. Stocksim: A dual-mode order-level simulator for evaluating multi-agent llms in financial markets. URL https://arxiv.org/abs/2507.09255

[27]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

[28]

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, et al. Gdpval: Evaluating ai model performance on real-world economically valuable tasks.arXiv preprint arXiv:2510.04374, 2025

[29]

Sciq: an invitation and recommendations to combine science and inuit qaujimajatuqangit for meaningful engagement of inuit communities in research.Arctic Science, 6(3):326–339, 2020

C Pedersen, M Otokiak, I Koonoo, J Milton, E Maktar, A Anaviapik, M Milton, G Porter, A Scott, C Newman, et al. Sciq: an invitation and recommendations to combine science and inuit qaujimajatuqangit for meaningful engagement of inuit communities in research.Arctic Science, 6(3):326–339, 2020

[30]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. Androidworld: A dynamic benchmarking environment for autonomous agents, 2025. URLhttps://arxiv.org/abs/2405.14573

[31]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning, 2021. URL https://arxiv.org/abs/2010.03768

[32]

Tongyi DeepResearch Technical Report

Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Ga...

[33]

O-mem: Omni memory system for personalized, long horizon, self-evolving agents

Piaohong Wang, Motong Tian, Jiaxian Li, Yuan Liang, Yuqing Wang, Qianben Chen, Tiannan Wang, Zhicong Lu, Jiawei Ma, Yuchen Eleanor Jiang, et al. O-mem: Omni memory system for personalized, long horizon, self-evolving agents. arXiv preprint arXiv:2511.13593, 2025

[34]

Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. ScienceWorld: Is your agent smarter than a 5th grader? In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, Abu Dhabi, United Arab Emirates, December 2022. Association...

doi:10.18653/v1/2022.emnlp-main.775
[35]

Scienceworld: Is your agent smarter than a 5th grader?arXiv preprint arXiv:2203.07540,

Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. Scienceworld: Is your agent smarter than a 5th grader?, 2022. URLhttps://arxiv.org/abs/2203.07540

[36]

The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents

Xingyao Wang, Simon Rosenberg, Juan Michelini, Calvin Smith, Hoang Tran, Engel Nyst, Rohit Malhotra, Xuhui Zhou, Valerie Chen, Robert Brennan, and Graham Neubig. The openhands software agent sdk: A composable and extensible foundation for production agents, 2025. URLhttps://arxiv.org/abs/2511.03690

[37]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025. URLhttps://arxiv.org/abs/2504.12516

[38]

Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts.arXiv preprint arXiv:2411.15114, 2024

Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, et al. Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts.arXiv preprint arXiv:2411.15114, 2024

[39]

Mobile-env: Building qualified evaluation benchmarks for LLM-GUI interaction

Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, and Kai Yu. Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction. URL https://arxiv.org/abs/2305.08144

[41]

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita-Velez, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Jun Wang, Shuicheng Yan, Philip Torr, and Lei Bai. The landscape of agentic reinfor...

[42]

MemEvolve: Meta-Evolution of Agent Memory Systems

Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. Memevolve: Meta-evolution of agent memory systems.arXiv preprint arXiv:2512.18746, 2025

[43]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URLhttps://arxiv.org/abs/2307.13854

[44]

OAgents: An empirical study of building effective agents

He Zhu, Tianrui Qin, King Zhu, Heyuan Huang, Yeyi Guan, Jinxiang Xia, Yi Yao, Hanhao Li, Ningning Wang, Pai Liu, et al. Oagents: An empirical study of building effective agents. arXiv preprint arXiv:2506.15741, 2025


[1] [1]

HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds

Petr Anokhin, Roman Khalikov, Stefan Rebrikov, Viktor Volkov, Artyom Sorokin, and Vincent Bissonnette. Herobench: A benchmark for long-horizon planning and structured reasoning in virtual worlds.arXiv preprint arXiv:2508.12782, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Vending-bench: A benchmark for long-term coherence of autonomous agents.arXiv preprint arXiv:2502.15840, 2025

Axel Backlund and Lukas Petersson. Vending-bench: A benchmark for long-term coherence of autonomous agents. arXiv preprint arXiv:2502.15840, 2025

work page arXiv 2025

[3] [3]

Vending-bench 2, 2025

Axel Backlund and Lukas Petersson. Vending-bench 2, 2025. URL https://andonlabs.com/evals/ vending-bench-2

work page 2025

[4] [4]

xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations, 2025

Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, YuanGong, ChenSun, HanHou, HuiYang, JamesPan, JiananLou, JiayiMao, JizhengLiu, JinpengLi, Kangyi Liu, Kenkun Liu, Rui Wang, Run Li, Tong Niu, Wenlong Zhang, Wenqi Yan, Xuanzheng Wang, Yuchen Zhang, Yi- Hsin Hung, Yuan Jiang, Zexuan Liu, Zihan Y...

work page arXiv 2025

[5] [5]

Pca-bench: Evaluating multimodal large language models in perception-cognition-action chain, 2024

Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Xiangdi Meng, Tianyu Liu, and Baobao Chang. Pca-bench: Evaluating multimodal large language models in perception-cognition-action chain, 2024. URLhttps://arxiv.org/abs/2402.15527

work page arXiv 2024

[6] [6]

StockBench: Can LLM agents trade stocks profitably in real-world markets?arXiv preprint arXiv:2510.02209, 2025

Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, and Juanzi Li. Stockbench: Can llm agents trade stocks profitably in real-world markets?, 2025. URLhttps://arxiv.org/abs/2510.02209

[7]

Finqa: A dataset of numerical reasoning over financial data

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan R Routledge, et al. Finqa: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711, 2021

[8]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

[9]

Lexam: Benchmarking legal reasoning on 340 law exams

Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Etienne Salimbeni, Florian Geering, Oliver Dreyer, et al. Lexam: Benchmarking legal reasoning on 340 law exams. arXiv preprint arXiv:2505.12864, 2025

[10]

Large language models empowered agent-based modeling and simulation: A survey and perspectives

Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, and Yong Li. Large language models empowered agent-based modeling and simulation: A survey and perspectives, 2023. URL https://arxiv.org/abs/2312.11970

[11]

Agent-based simulation of a financial market with large language models

Ryuji Hashimoto, Takehiro Takayanagi, Masahiro Suzuki, and Kiyoshi Izumi. Agent-based simulation of a financial market with large language models, 2025. URL https://arxiv.org/abs/2510.12189

[12]

Pile of law: Learning responsible data filtering from the law and a 256gb open-source legal dataset

Peter Henderson, Mark Krass, Lucia Zheng, Neel Guha, Christopher D Manning, Dan Jurafsky, and Daniel Ho. Pile of law: Learning responsible data filtering from the law and a 256gb open-source legal dataset. Advances in Neural Information Processing Systems, 35:29217–29234, 2022

[13]

Step-DeepResearch technical report. arXiv preprint arXiv:2512.20491, 2025

Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, Wei Yuan, Wenjin Deng, Xiaojian Yuan, Xiaoyun Zhang, Xiangyu Liu, Xikai Liu, Yanming Xu, Yicheng Cao, Yifei Zhang, Yongyao Wang, Yubo Shu, Yurong Zhang, Yuxiang Zhang, Zheng Gong, Zhichao Chang, Binyan Li, Dan Ma, Furong Jia, Hongyuan Wang, Jiayu L...

[14]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

[15]

AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, and Xinggang Wang. Alphadrive: Unleashing the power of vlms in autonomous driving via reinforcement learning and reasoning, 2025. URL https://arxiv.org/abs/2503.07608

[16]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

[17]

Agent-oriented planning in multi-agent systems

Ao Li, Yuexiang Xie, Songze Li, Fugee Tsung, Bolin Ding, and Yaliang Li. Agent-oriented planning in multi-agent systems. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=EqcLAU6gyU

[18]

Econagent: large language model-empowered agents for simulating macroeconomic activities

Nian Li, Chen Gao, Mingyu Li, Yong Li, and Qingmin Liao. Econagent: large language model-empowered agents for simulating macroeconomic activities. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15523–15536, 2024

[19]

Quantagents: Towards multi-agent financial system via simulated trading

Xiangyu Li, Yawen Zeng, Xiaofen Xing, Jin Xu, and Xiangmin Xu. Quantagents: Towards multi-agent financial system via simulated trading, 2025. URL https://arxiv.org/abs/2510.04643

[20]

Towards personalized deep research: Benchmarks and evaluations

Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, et al. Towards personalized deep research: Benchmarks and evaluations. arXiv preprint arXiv:2509.25106, 2025

[21]

Bizfinbench: A business-driven real-world financial benchmark for evaluating llms

Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, and Ji Liu. Bizfinbench: A business-driven real-world financial benchmark for evaluating llms. arXiv preprint arXiv:2505.19457, 2025

[22]

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, and Ming Zhang. Large language model agent: A surve...

[23]

Agentboard: An analytical evaluation board of multi-turn llm agents (NeurIPS / arXiv preprint 2401.13178)

Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi-turn llm agents, 2024. URL https://arxiv.org/abs/2401.13178

[24]

Remote labor index: Measuring ai automation of remote work

Mantas Mazeika, Alice Gatti, Cristina Menghini, Udari Madhushani Sehwag, Shivam Singhal, Yury Orlovskiy, Steven Basart, Manasi Sharma, Denis Peskoff, Elaine Lau, et al. Remote labor index: Measuring ai automation of remote work. arXiv preprint arXiv:2510.26787, 2025

[25]

Stocksim: A dual-mode order-level simulator for evaluating multi-agent llms in financial markets

Charidimos Papadakis, Giorgos Filandrianos, Angeliki Dimitriou, Maria Lymperaiou, Konstantinos Thomas, and Giorgos Stamou. Stocksim: A dual-mode order-level simulator for evaluating multi-agent llms in financial markets, URL https://arxiv.org/abs/2507.09255

[27]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

[28]

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, et al. Gdpval: Evaluating ai model performance on real-world economically valuable tasks. arXiv preprint arXiv:2510.04374, 2025

[29]

Sciq: an invitation and recommendations to combine science and inuit qaujimajatuqangit for meaningful engagement of inuit communities in research

C Pedersen, M Otokiak, I Koonoo, J Milton, E Maktar, A Anaviapik, M Milton, G Porter, A Scott, C Newman, et al. Sciq: an invitation and recommendations to combine science and inuit qaujimajatuqangit for meaningful engagement of inuit communities in research. Arctic Science, 6(3):326–339, 2020

[30]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. Androidworld: A dynamic benchmarking environment for autonomous agents, 2025. URL https://arxiv.org/abs/2405.14573

[31]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning, 2021. URL https://arxiv.org/abs/2010.03768

[32]

Tongyi DeepResearch Technical Report

Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Ga...

[33]

O-mem: Omni memory system for personalized, long horizon, self-evolving agents

Piaohong Wang, Motong Tian, Jiaxian Li, Yuan Liang, Yuqing Wang, Qianben Chen, Tiannan Wang, Zhicong Lu, Jiawei Ma, Yuchen Eleanor Jiang, et al. O-mem: Omni memory system for personalized, long horizon, self-evolving agents. arXiv preprint arXiv:2511.13593, 2025

[34]

Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. ScienceWorld: Is your agent smarter than a 5th grader? In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, Abu Dhabi, United Arab Emirates, December 2022. Association...

doi:10.18653/v1/2022.emnlp-main.775

[35]

Scienceworld: Is your agent smarter than a 5th grader?

Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. Scienceworld: Is your agent smarter than a 5th grader?, 2022. URL https://arxiv.org/abs/2203.07540

[36]

The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents

Xingyao Wang, Simon Rosenberg, Juan Michelini, Calvin Smith, Hoang Tran, Engel Nyst, Rohit Malhotra, Xuhui Zhou, Valerie Chen, Robert Brennan, and Graham Neubig. The openhands software agent sdk: A composable and extensible foundation for production agents, 2025. URL https://arxiv.org/abs/2511.03690

[37]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025. URL https://arxiv.org/abs/2504.12516

[38]

Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts

Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, et al. Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts. arXiv preprint arXiv:2411.15114, 2024

[39]

Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction

Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, and Kai Yu. Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction, URL https://arxiv.org/abs/2305.08144

[41]

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita-Velez, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Jun Wang, Shuicheng Yan, Philip Torr, and Lei Bai. The landscape of agentic reinfor...

[42]

MemEvolve: Meta-Evolution of Agent Memory Systems

Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. Memevolve: Meta-evolution of agent memory systems. arXiv preprint arXiv:2512.18746, 2025

[43]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URL https://arxiv.org/abs/2307.13854

[44]

OAgents: An empirical study of building effective agents

He Zhu, Tianrui Qin, King Zhu, Heyuan Huang, Yeyi Guan, Jinxiang Xia, Yi Yao, Hanhao Li, Ningning Wang, Pai Liu, et al. Oagents: An empirical study of building effective agents. arXiv preprint arXiv:2506.15741, 2025.

A Model API Detailed Information

For proprietary models, we utilize specific version snapshots where available to account for the “model d...


Response: Respond to the Agent’s argument. If they claim "complexity," verify it in the trace. Output Format Return strictly a JSON object: { "internal_assessment": "Evaluate the quality/value of the trajectory content...", "proposed_money": <float, precision 2>, "reasoning": "Message to the agent explaining your valuation based on the trace audit..." } C...
