EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
Pith reviewed 2026-05-16 05:50 UTC · model grok-4.3
The pith
No single large language model excels across all long-horizon economic planning scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EcoGym supplies a unified framework of three environments (Vending, Freelance, and Operation) with standardized interfaces and effectively unbounded horizons, then measures eleven LLMs on outcomes including net worth, income, and daily active users (DAU). The experiments establish that models display systematic suboptimality: each one underperforms in either strategic coherence or execution efficiency, so that dominance in one environment does not transfer to the others.
What carries the argument
The EcoGym benchmark itself: three environments with budgeted actions and business-relevant scoring over horizons of 1000+ steps.
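To make "budgeted actions over day-loops" concrete, here is a minimal sketch of what such an interface could look like. Everything in it is an assumption for illustration: `ToyEconomyEnv`, its `reset`/`step` methods, the action names, and the per-day budget of 3 are hypothetical and are not taken from the EcoGym code base. With that illustrative budget, 365 day-loops yield 1095 steps, which is one way a "1000+ step" horizon could arise.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyEconomyEnv:
    """Hypothetical stand-in for an EcoGym-style environment (not the real API)."""
    days: int = 365          # number of day-loops in one evaluation run
    action_budget: int = 3   # illustrative per-day budget, an assumption
    seed: int = 0
    cash: float = 100.0
    day: int = 0
    steps: int = 0
    _left: int = field(default=0, init=False)
    _rng: random.Random = field(default=None, init=False, repr=False)

    def reset(self):
        self._rng = random.Random(self.seed)
        self.cash, self.day, self.steps = 100.0, 0, 0
        self._left = self.action_budget
        return self._observe()

    def _observe(self):
        # Partial observability: the agent sees its books, not the payoff process.
        return {"day": self.day, "cash": round(self.cash, 2), "actions_left": self._left}

    def step(self, action):
        if self.day >= self.days:
            raise RuntimeError("episode finished; call reset()")
        if action == "work":
            self.cash += self._rng.uniform(0.0, 2.0)  # stochastic payoff
        elif action != "wait":
            raise ValueError(f"unknown action: {action}")
        self.steps += 1
        self._left -= 1
        if self._left == 0:          # budget spent: close the day-loop
            self.day += 1
            self._left = self.action_budget
        return self._observe(), self.day >= self.days


def run_episode(env, policy):
    """Roll one policy through all day-loops and report a business-style outcome."""
    obs, done = env.reset(), False
    while not done:
        obs, done = env.step(policy(obs))
    return {"net_worth": env.cash, "steps": env.steps}


result = run_episode(ToyEconomyEnv(), lambda obs: "work")
print(result["steps"])  # 365 day-loops x 3 actions = 1095 steps
```

The point of the sketch is the shape of the loop, not the economics: a fixed action budget per day turns a one-year evaluation into a four-digit step count.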
If this is right
- LLM agents require separate improvements in high-level planning and low-level execution to handle extended economic interactions.
- Evaluation of autonomous agents must shift from short episodic tasks to persistent stochastic environments to reveal true capability gaps.
- Business-aligned metrics such as net worth and DAU provide clearer signals for practical utility than generic task success rates.
- Open release of the environments enables direct comparison of new models and study of controllability versus performance trade-offs.
Where Pith is reading between the lines
- Specialized or ensemble models tuned to individual economic domains may outperform any single general model.
- Incorporating economic simulation data during training could reduce the observed suboptimality in strategy and execution.
- Extending the benchmark to multi-agent settings would test whether the same limitations appear in competitive or cooperative economic dynamics.
Load-bearing premise
The three chosen environments and their metrics capture the essential difficulties of real long-horizon economic decisions under uncertainty.
What would settle it
Observation of one LLM achieving top scores on net worth, income, and DAU simultaneously across the Vending, Freelance, and Operation environments would disprove the claimed systematic tension.
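The falsification test above amounts to a simple check over a score table. A minimal sketch follows, assuming scores are arranged as environment, then model, then score, with higher being better. The function name `dominant_models` and all numbers are made up for illustration and are not results from the paper.

```python
def dominant_models(scores):
    """Return models that rank first (ties allowed) in every environment.

    scores: {environment: {model: score}}, higher is better.
    """
    models = set.intersection(*(set(per_env) for per_env in scores.values()))
    winners = set(models)
    for per_env in scores.values():
        best = max(per_env[m] for m in models)
        winners &= {m for m in models if per_env[m] == best}
    return sorted(winners)


# Made-up numbers purely for illustration.
toy_scores = {
    "Vending":   {"model_a": 1200.0, "model_b": 950.0, "model_c": 400.0},
    "Freelance": {"model_a": 300.0, "model_b": 720.0, "model_c": 650.0},
    "Operation": {"model_a": 5000.0, "model_b": 4100.0, "model_c": 8800.0},
}
print(dominant_models(toy_scores))  # -> []
```

An empty list matches the paper's claim. A non-empty list on the real leaderboard would be the counterexample described above.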
Original abstract
Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks suffer from being largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments: Vending (adapted from the closed-source Vending-Bench, with full open-source release), Freelance (new), and Operation (new), implemented in a unified decision-making process with standardized interfaces and budgeted actions over an effectively unbounded horizon (1000+ steps if 365 day-loops for evaluation). The evaluation of EcoGym is based on business-relevant outcomes (e.g., net worth, income, and DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategies or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in economic settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EcoGym, a benchmark for long-horizon plan-and-execute decision making in interactive economies consisting of three environments (Vending adapted from Vending-Bench, plus new Freelance and Operation environments) with standardized interfaces, budgeted actions, and 1000+ step horizons under partial observability and stochasticity. Experiments on eleven LLMs using business-relevant metrics (net worth, income, DAU) claim that no single model dominates across scenarios and that models exhibit suboptimality in either high-level strategies or efficient action execution.
Significance. If the environments and metrics isolate long-horizon planning deficits, EcoGym would offer a valuable open, extensible testbed for LLM agent evaluation in persistent economic settings and highlight strategy-execution trade-offs. The open-source release of the environments is a concrete strength that supports reproducibility and community use.
major comments (2)
- [§3] (Environments): No ablation studies or baseline comparisons (e.g., myopic/greedy policies or short-horizon rollouts) are reported to show that the chosen metrics (net worth, income, DAU) actually penalize immediate-reward maximization and require coherent multi-step planning. Without this, the central claim of planning suboptimality cannot be distinguished from local execution errors.
- [§4-5] (Experimental Setup and Results): The manuscript provides no statistical details (confidence intervals, significance tests), data exclusion rules, or sensitivity analysis on prompt templates and action budgets for the eleven models. This undermines verification of the 'no single model dominates' finding and the strategy-versus-execution tension.
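The myopic-baseline check requested in the first major comment can be illustrated on a toy problem. The sketch below is not EcoGym's Vending environment: the `simulate` function, its prices, and both policies are hypothetical. It only shows the kind of gap a metric should expose if it really penalizes immediate-reward maximization.

```python
def simulate(policy, days=30, cash=20.0, stock=5, price=3.0, unit_cost=1.0, batch=5):
    """Toy vending loop, one action per day. Returns net worth (cash + stock at cost)."""
    for _ in range(days):
        action = policy(cash, stock)
        if action == "restock" and cash >= unit_cost * batch:
            cash -= unit_cost * batch   # negative immediate reward
            stock += batch
        elif action == "sell" and stock > 0:
            cash += price               # positive immediate reward
            stock -= 1
        # any other case wastes the day
    return cash + stock * unit_cost


def myopic(cash, stock):
    # Always takes the action with the best immediate reward; never pays to restock.
    return "sell"


def restocking(cash, stock):
    # Accepts a short-term loss once the shelf is empty.
    return "sell" if stock > 0 else "restock"


print(simulate(myopic), simulate(restocking))  # 35.0 75.0
```

In this toy the myopic policy sells out after five days and idles for the rest of the month. If an LLM agent's net worth sits near such a baseline, the deficit is more plausibly a planning failure than an execution slip.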
minor comments (2)
- [Abstract] The phrase '365 day-loops for evaluation' is unclear; the full text should explicitly define the evaluation protocol and how it maps to the 1000+ step horizon.
- [Implementation details] The standardized interfaces and partial-observability mechanisms should be illustrated with a concrete example from one environment to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate the changes we will make in the revised manuscript.
Point-by-point responses
- Referee: [§3] (Environments): No ablation studies or baseline comparisons (e.g., myopic/greedy policies or short-horizon rollouts) are reported to show that the chosen metrics (net worth, income, DAU) actually penalize immediate-reward maximization and require coherent multi-step planning. Without this, the central claim of planning suboptimality cannot be distinguished from local execution errors.
  Authors: We agree that ablation studies with baselines such as myopic or greedy policies would strengthen the evidence that the metrics capture long-horizon planning deficits rather than just execution errors. We will add these baseline comparisons in the revised manuscript to better isolate the planning suboptimality.
  Revision: yes
- Referee: [§4-5] (Experimental Setup and Results): The manuscript provides no statistical details (confidence intervals, significance tests), data exclusion rules, or sensitivity analysis on prompt templates and action budgets for the eleven models. This undermines verification of the 'no single model dominates' finding and the strategy-versus-execution tension.
  Authors: We acknowledge the lack of statistical rigor in the current presentation. In the revision, we will report confidence intervals, conduct appropriate significance tests, detail data exclusion rules, and perform sensitivity analyses on prompt templates and action budgets. This will provide stronger support for our findings regarding model performance across scenarios.
  Revision: yes
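One common way to produce the promised confidence intervals is a percentile bootstrap over repeated runs. The sketch below is an assumption about how that could be done, not the authors' stated procedure. The helper `bootstrap_ci` and the per-seed net worths are hypothetical.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and collect the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Made-up per-seed net worths for one model in one environment.
runs = [812.0, 640.5, 905.2, 733.1, 788.4]
low, high = bootstrap_ci(runs)
print(f"mean={statistics.fmean(runs):.1f}, 95% CI=({low:.1f}, {high:.1f})")
```

If the intervals of two models overlap heavily within an environment, a ranking between them, and therefore any "dominance" claim, is weakly supported.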
Circularity Check
No circularity: empirical benchmark with direct performance measurements
Full rationale
The paper introduces EcoGym as an open benchmark consisting of three environments (Vending, Freelance, Operation) and reports empirical results from running eleven LLMs on long-horizon tasks. Central claims rest on observed outcomes (net worth, income, DAU) rather than any derivation, equation, or fitted parameter that reduces to inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the evaluation chain; results are obtained by direct simulation against the defined environments and metrics.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Business-relevant outcomes such as net worth, income, and DAU are appropriate proxies for strategic coherence and robustness in interactive economies.
Forward citations
Cited by 1 Pith paper
- CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V. CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.