Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

Liyou Gao; Shengchao Qin; Wensheng Tang; Yuandao Cai; Yuzhang Zhu

arxiv: 2605.23574 · v1 · pith:3VP3X4RFnew · submitted 2026-05-22 · 💻 cs.LG · cs.SE


Pith reviewed 2026-05-25 04:44 UTC · model grok-4.3


keywords LLM agents · quantitative goals · goal persistence · long-horizon tasks · PushBench · agent benchmarking · tool use · progress tracking


The pith

LLM agents require explicit progress tracking to persist until quantitative goals are verifiably complete.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that long-horizon LLM agents can execute plausible local tool calls yet stop short of completing a requested count of distinct items. It defines Quantitative Goal Persistence as the need to continue until an external verifier confirms the exact number of valid artifacts, and introduces PushBench to measure duplicates, repeated work, false completion, and progress drift directly. Matched tests find that a state-tracking retrieval controller reaches 69-78 percent success with no duplicates, while a backlog-tracking controller succeeds in 25-50 percent of cases where standard controllers reach zero; frontier agents also drop from many successes at 50 artifacts to only three out of nine at 100. The distinction matters because it isolates a reliability requirement separate from local task competence.

Core claim

Quantitative goals stress a different reliability requirement from local task competence: agents must maintain verified progress and stop only when the requested work is complete, as measured by PushBench on repository-artifact collection where state-tracking and backlog-tracking controllers outperform standard and completion-gated baselines.

What carries the argument

PushBench benchmark that converts quantitative goals into verifier-backed work units for repository-artifact collection, directly exposing repeated work, duplicate submissions, false completion, and progress drift.
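
To make these failure measures concrete, here is a minimal Python sketch of how a verifier-side log could be scored. The `Event` format, the `score_log` function, and the exact metric definitions are assumptions made for illustration; they are not PushBench's actual schema or scoring code. Repeated work (for example, re-fetching the same page) is left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One logged agent action (hypothetical format, not PushBench's schema)."""
    kind: str            # "submit" or "claim_done"
    item: str = ""       # artifact identifier for "submit" events
    valid: bool = False  # verifier verdict for "submit" events
    claimed: int = 0     # progress count the agent reports after this event


def score_log(events: list[Event], target: int) -> dict:
    """Replay a log and tally duplicates, false completion, and drift."""
    verified: set[str] = set()
    duplicates = 0
    false_completion = False
    max_drift = 0
    for ev in events:
        if ev.kind == "submit":
            if ev.item in verified:
                duplicates += 1          # already-verified item resubmitted
            elif ev.valid:
                verified.add(ev.item)
        elif ev.kind == "claim_done" and len(verified) < target:
            false_completion = True      # agent stopped before the count was met
        # Drift: gap between the agent's self-reported count and the verified count.
        max_drift = max(max_drift, abs(ev.claimed - len(verified)))
    return {
        "verified": len(verified),
        "duplicates": duplicates,
        "false_completion": false_completion,
        "max_drift": max_drift,
        "success": len(verified) >= target,
    }


log = [
    Event("submit", "a.py", True, claimed=1),
    Event("submit", "a.py", True, claimed=2),   # duplicate that the agent still counts
    Event("submit", "b.py", True, claimed=3),
    Event("claim_done", claimed=3),
]
result = score_log(log, target=3)
```

In this toy log the agent believes it has three items, but only two distinct ones are verified, so the run records one duplicate, a drift of one, and a false completion. A final success flag alone would hide the first two.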

If this is right

State-tracking retrieval controllers eliminate duplicate submissions and reach 69-78 percent success on the benchmark tasks.
Backlog-tracking work-unit controllers achieve 25-50 percent success in conditions where standard controllers complete no instances.
Black-box frontier agents solve many 50-artifact tasks but fall to three successes out of nine at 100 artifacts.
Quantitative goals demand maintenance of verified progress rather than reliance on local competence alone.
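
One way to read the first and last points above is as a control loop that keeps its own record of submitted and verified items and stops only when the verified count reaches the target. The sketch below is a hypothetical illustration in Python: `fetch_page`, `verify`, and `run_state_tracking` are stand-ins invented here, not the paper's controller or any real agent framework.

```python
from typing import Callable, Iterable


def run_state_tracking(
    fetch_page: Callable[[int], Iterable[str]],  # hypothetical retrieval tool
    verify: Callable[[str], bool],               # hypothetical external verifier
    target: int,
    max_pages: int = 100,
) -> tuple[list[str], int]:
    """Collect `target` distinct verified items; return (items, submissions)."""
    seen: set[str] = set()       # every candidate already submitted
    verified: list[str] = []     # items the verifier accepted
    submissions = 0
    page = 0                     # page memory: never re-read an earlier page
    while len(verified) < target and page < max_pages:
        for item in fetch_page(page):
            if item in seen:
                continue         # duplicate filter: skip before submitting
            seen.add(item)
            submissions += 1
            if verify(item):
                verified.append(item)
                if len(verified) == target:
                    break        # stop only once the verified count is met
        page += 1
    return verified, submissions


# Toy environment: each page lists three entries, one of which is a repeat,
# and the verifier rejects every item whose number is divisible by 5.
def toy_fetch(page: int) -> list[str]:
    base = page * 2
    return [f"item{base}", f"item{base + 1}", f"item{base}"]


def toy_verify(item: str) -> bool:
    return int(item.removeprefix("item")) % 5 != 0


items, submissions = run_state_tracking(toy_fetch, toy_verify, target=5)
```

The stopping condition depends on the verifier's verdicts rather than on how many tool calls were made, and rejected items do not count toward the target. The loop therefore has to keep going past rejections to reach five distinct verified items.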

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Agent architectures may need dedicated progress-monitoring modules decoupled from action generation to handle countable objectives reliably.
The same persistence gap is likely to appear in other countable long-horizon domains such as iterative data gathering or repeated simulation runs.
Evaluating the same controllers against verifiers with changing rules or noisy feedback would test whether the measured reliability requirement generalizes.

Load-bearing premise

The external verifier and benchmark tasks accurately capture real-world persistence challenges without measurement artifacts from task design or verifier rules.

What would settle it

Standard controllers reaching high success rates on 100-artifact PushBench tasks without specialized state or backlog tracking would falsify the claim that quantitative goals require separate persistence mechanisms.

Figures

Figures reproduced from arXiv: 2605.23574 by Liyou Gao, Shengchao Qin, Wensheng Tang, Yuandao Cai, Yuzhang Zhu.

**Figure 2.** QGP-DataOps-lite success rates. Native and LangGraph (LG) use standard, verifier-gated (VG), and …

**Figure 3.** QGP-RepoScan target scaling. Each point aggregates the nine task instances in one target bucket from …

**Figure 4.** Focused gpt-4.1 LangGraph stateful ablation. Success and average verified count rise most sharply only when duplicate filtering, page memory, and buffered verifier-aligned progress state are combined in full STATEQGP. Page memory improves success to 44.4%, but the full controller reaches 72.2%. The paired success delta between full STATEQGP and the no-buffer dedupe+page variant is 0.278 with a 95% bootstra…

**Figure 5.** Black-box frontier-agent QGP-RepoScan high-target evaluation. Each condition aggregates nine task …

Original abstract

Long-horizon language agents can make many plausible local tool calls yet fail to persist until a requested count is actually complete. We study this gap as Quantitative Goal Persistence (QGP): whether an agent keeps working until an external verifier confirms enough distinct valid items. PushBench turns this into a benchmark for repository-artifact collection and verifier-backed work units, so repeated work, duplicate submissions, false completion, and progress drift are measured directly rather than hidden behind a final success flag. In matched controller comparisons, a state-tracking retrieval controller reaches 69-78% success while eliminating duplicate submissions, and a backlog-tracking work-unit controller reaches 25-50% success in settings where standard and completion-gated controllers complete no task instances. Black-box frontier-agent evaluations with Claude Code (Sonnet 4.6) and Codex CLI (gpt-5.4) solve many 50-artifact tasks but drop to 3 out of 9 successes per condition at 100 artifacts. The results show that quantitative goals stress a different reliability requirement from local task competence: agents must maintain verified progress and stop only when the requested work is complete.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines QGP and ships PushBench to measure verified count completion in agents, with controller comparisons that show measurable gaps, but the evaluation leaves open whether scale effects explain the 100-artifact drop.


The main takeaway is that this work isolates a persistence requirement for hitting exact verified counts rather than just succeeding locally. It introduces Quantitative Goal Persistence as the property of continuing until an external verifier confirms the target number of distinct items, and builds PushBench around repository-artifact collection tasks to track duplicates, false stops, and progress drift directly.

Referee Report

3 major / 1 minor

Summary. The paper claims that long-horizon LLM agents frequently fail at Quantitative Goal Persistence (QGP), defined as continuing tool use until an external verifier confirms completion of a requested count of distinct valid items rather than stopping on local plausibility. It introduces PushBench, a benchmark based on repository-artifact collection with verifier-backed work units that directly measures duplicates, repeated work, false completion, and progress drift. Matched controller comparisons show a state-tracking retrieval controller achieving 69-78% success and a backlog-tracking work-unit controller reaching 25-50% success in regimes where standard and completion-gated controllers achieve zero successes; black-box evaluations of Claude Code (Sonnet 4.6) and Codex CLI (gpt-5.4) succeed on many 50-artifact tasks but drop to 3/9 successes at 100 artifacts. The central claim is that quantitative goals impose a distinct reliability requirement centered on maintaining verified progress.

Significance. If the results hold after methodological clarification, the work usefully isolates a persistence failure mode that is not reducible to local task competence and supplies a verifier-backed benchmark plus controller baselines that could guide future agent design. The explicit measurement of duplicates and false completion via an external verifier, together with the controller ablation, are concrete strengths that go beyond aggregate success rates.

major comments (3)

[Abstract] Abstract: the performance drop from many 50-artifact successes to 3/9 at 100 artifacts is presented as evidence that quantitative goals stress a distinct persistence requirement, yet the abstract supplies no description of how 50- versus 100-artifact tasks are constructed to hold per-item difficulty, context length, or verifier behavior constant; without such controls the gap could arise from ordinary scaling limits rather than QGP-specific failures.
[Abstract] Abstract: the claims rest on 'matched controller comparisons' and 'verifier-backed work units,' but the abstract provides no information on task construction, verifier implementation, statistical controls, or number of runs, rendering it impossible to verify that the reported success rates (69-78%, 25-50%, 3/9) support the central distinction between QGP and local competence.
[Abstract] Abstract: the frontier-agent results are reported only as 'many 50-artifact tasks' versus '3 out of 9 successes per condition' with no breakdown by condition, no error bars, and no indication of how many total trials underlie the fractions, which directly affects the load-bearing inference that the drop demonstrates a unique reliability requirement.
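
The concern about missing error bars in the last comment can be made concrete with a little arithmetic. As a rough illustration (not an analysis taken from the paper), the Python sketch below computes a 95% Wilson score interval for 3 successes in 9 trials; the helper name `wilson_interval` is introduced here for the example.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


low, high = wilson_interval(3, 9)
print(f"3/9 successes: 95% interval roughly {low:.2f} to {high:.2f}")
```

With only nine trials the interval spans roughly 0.12 to 0.65, which is why a per-condition breakdown and trial counts matter for interpreting the 100-artifact result.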

minor comments (1)

[Abstract] Model identifiers 'Claude Code (Sonnet 4.6)' and 'Codex CLI (gpt-5.4)' are non-standard; clarify whether these refer to specific released versions, internal builds, or hypothetical future models.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for your review and for highlighting both the potential value of the verifier-backed benchmark and the need for clearer abstract-level reporting. We agree that the abstract must supply sufficient methodological context to support the central claims about QGP. We will revise the abstract to incorporate concise descriptions of task construction, controls, statistical reporting, and trial counts while preserving its brevity. Point-by-point responses to the major comments appear below.


Referee: [Abstract] Abstract: the performance drop from many 50-artifact successes to 3/9 at 100 artifacts is presented as evidence that quantitative goals stress a distinct persistence requirement, yet the abstract supplies no description of how 50- versus 100-artifact tasks are constructed to hold per-item difficulty, context length, or verifier behavior constant; without such controls the gap could arise from ordinary scaling limits rather than QGP-specific failures.

Authors: We accept that the abstract omits an explicit statement of these controls. Section 3 of the manuscript describes the task generator, which fixes per-artifact difficulty, repository size, and verifier rules independently of the target count; context length is managed by the same retrieval mechanism across conditions. We will add a single sentence to the abstract summarizing these controls so that readers can see the drop is measured under matched per-item conditions rather than ordinary scaling. revision: yes

Referee: [Abstract] Abstract: the claims rest on 'matched controller comparisons' and 'verifier-backed work units,' but the abstract provides no information on task construction, verifier implementation, statistical controls, or number of runs, rendering it impossible to verify that the reported success rates (69-78%, 25-50%, 3/9) support the central distinction between QGP and local competence.

Authors: The abstract is intentionally compact, yet we agree it should reference the key methodological elements. Section 3 details the verifier implementation and work-unit definition; Section 4 reports that all controller comparisons use identical task sets, the same number of runs per condition, and the same statistical aggregation. We will revise the abstract to include a brief clause noting matched construction, verifier-backed units, and multi-run evaluation so the reported rates can be assessed in context. revision: yes

Referee: [Abstract] Abstract: the frontier-agent results are reported only as 'many 50-artifact tasks' versus '3 out of 9 successes per condition' with no breakdown by condition, no error bars, and no indication of how many total trials underlie the fractions, which directly affects the load-bearing inference that the drop demonstrates a unique reliability requirement.

Authors: We acknowledge the abstract's summary is imprecise on these points. The main text (Section 5.2 and Table 5) provides per-condition breakdowns, total trial counts (9 per condition), and error bars. We will update the abstract to state the exact trial count and note that the reported drop is consistent across the two frontier agents and both conditions, directing readers to the detailed tables for error bars and per-condition data. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with external verifier


The paper introduces PushBench as an empirical benchmark for Quantitative Goal Persistence, reporting success rates from direct agent evaluations against an external verifier on repository-artifact tasks. No equations, derivations, fitted parameters, or self-citation chains appear in the provided text; claims rest on observed performance differences (e.g., 69-78% vs. 0% in matched controllers, drop at 100 artifacts) rather than any reduction of outputs to inputs by construction. The evaluation is self-contained against external outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, axioms, or invented entities; work is empirical benchmark evaluation.

pith-pipeline@v0.9.0 · 5745 in / 1132 out tokens · 27085 ms · 2026-05-25T04:44:56.248945+00:00 · methodology



[1] [1]

AgentBench: Evaluating LLMs as Agents

Xiao Liu and Hao Yu and Hanchen Zhang and Yifan Xu and Xuanyu Lei and Hanyu Lai and Yu Gu and Hangliang Ding and Kaiwen Men and Kejuan Yang and Shudan Zhang and Xiang Deng and Aohan Zeng and Zhengxiao Du and Chenhui Zhang and Sheng Shen and Tianjun Zhang and Yu Su and Huan Sun and Minlie Huang and Yuxiao Dong and Jie Tang , title =. CoRR , volume =. 2023 ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.03688 2023

[2] [2]

Xu and Hao Zhu and Xuhui Zhou and Robert Lo and Abishek Sridhar and Xianyi Cheng and Tianyue Ou and Yonatan Bisk and Daniel Fried and Uri Alon and Graham Neubig , title =

Shuyan Zhou and Frank F. Xu and Hao Zhu and Xuhui Zhou and Robert Lo and Abishek Sridhar and Xianyi Cheng and Tianyue Ou and Yonatan Bisk and Daniel Fried and Uri Alon and Graham Neubig , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

2024

[3] [3]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments , booktitle =

Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu , editor =. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Co...

2024

[4] [4]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik Narasimhan , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2406.12045 , eprinttype =. 2406.12045 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.12045 2024

[5] [5]

Parameswaran , title =

Ruiying Ma and Shreya Shankar and Ruiqi Chen and Yiming Lin and Sepanta Zeighami and Rajoshi Ghosh and Abhinav Gupta and Anushrut Gupta and Tanmai Gopal and Aditya G. Parameswaran , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2603.20576 , eprinttype =. 2603.20576 , timestamp =

work page doi:10.48550/arxiv.2603.20576 2026

[6] [6]

AgentBoard: An Analytical Evaluation Board of Multi-turn

Chang Ma and Junlei Zhang and Zhihao Zhu and Cheng Yang and Yujiu Yang and Yaohui Jin and Zhenzhong Lan and Lingpeng Kong and Junxian He , editor =. AgentBoard: An Analytical Evaluation Board of Multi-turn. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Ca...

2024

[7] [7]

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , booktitle =

Shunyu Yao and Howard Chen and John Yang and Karthik Narasimhan , editor =. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , booktitle =. 2022 , url =

2022

[8] [8]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning , booktitle =

Mohit Shridhar and Xingdi Yuan and Marc. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning , booktitle =. 2021 , url =

2021

[9] [9]

Jansen and Marc

Ruoyao Wang and Peter A. Jansen and Marc. ScienceWorld: Is your Agent Smarter than a 5th Grader? , booktitle =. 2022 , url =. doi:10.18653/V1/2022.EMNLP-MAIN.775 , timestamp =

work page doi:10.18653/v1/2022.emnlp-main.775 2022

[10] [10]

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? , booktitle =

Ori Yoran and Samuel Joseph Amouyal and Chaitanya Malaviya and Ben Bogin and Ofir Press and Jonathan Berant , editor =. AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? , booktitle =. 2024 , url =. doi:10.18653/V1/2024.EMNLP-MAIN.505 , timestamp =

work page doi:10.18653/v1/2024.emnlp-main.505 2024

[11] [11]

Mind2Web: Towards a Generalist Agent for the Web , booktitle =

Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samual Stevens and Boshi Wang and Huan Sun and Yu Su , editor =. Mind2Web: Towards a Generalist Agent for the Web , booktitle =. 2023 , url =

2023

[12] [12]

Narasimhan and Yuan Cao , title =

Shunyu Yao and Jeffrey Zhao and Dian Yu and Nan Du and Izhak Shafran and Karthik R. Narasimhan and Yuan Cao , title =. The Eleventh International Conference on Learning Representations,. 2023 , url =

2023

[13] [13]

Toolformer: Language Models Can Teach Themselves to Use Tools , booktitle =

Timo Schick and Jane Dwivedi. Toolformer: Language Models Can Teach Themselves to Use Tools , booktitle =. 2023 , url =

2023

[14] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. 2023.

[15] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, Peter Clark. Self-Refine: Iterative Refinement with Self-Feedback. 2023.

[16] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun. In The Twelfth International Conference on Learning Representations, 2024.

[17] Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, Yang Liu. StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models. 2024. doi:10.18653/v1/2024.findings-acl.664

[18] Minghao Li, Feifan Song, Bowen Yu, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. CoRR, 2023. doi:10.48550/arxiv.2304.08244

[19] Shishir G. Patil, Tianjun Zhang, Xin Wang, Joseph E. Gonzalez. Gorilla: Large Language Model Connected with Massive APIs. 2024.

[20] Zhiyuan Chen, Shiqi Shen, Guangyao Shen, Gong Zhi, Xu Chen, Yankai Lin. Towards Tool Use Alignment of Large Language Models. 2024. doi:10.18653/v1/2024.emnlp-main.82

[21] Junjie Ye, Zhengyin Du, Xuesong Yao, Weijian Lin, Yufei Xu, Zehui Chen, Zaiyuan Wang, Sining Zhu, Zhiheng Xi, Siyu Yuan, Tao Gui, Qi Zhang, Xuanjing Huang, Jiecao Chen. ToolHop:. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025.

[22] Yu Gu, Yiheng Shu, Hao Yu, Xiao Liu, Yuxiao Dong, Jie Tang, Jayanth Srinivasa, Hugo Latapie, Yu Su. Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments. 2024. doi:10.18653/v1/2024.emnlp-main.436

[23] Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su. Trans. Mach. Learn. Res., 2025.

[24] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pond. Evaluating Large Language Models Trained on Code. arXiv:2107.03374, 2021.

[25] Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, Charles Sutton. Program Synthesis with Large Language Models. CoRR, 2021. arXiv:2108.07732

[26] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu. 2021.

[27] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt. Measuring Coding Challenge Competence With. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Ben..., 2021.

[28] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik R. Narasimhan. In The Twelfth International Conference on Learning Representations, 2024.

[29] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. 2024.

[30] John Yang, Akshara Prabhakar, Karthik Narasimhan, Shunyu Yao. InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback. 2023.

[31] Tianyang Liu, Canwen Xu, Julian J. McAuley. In The Twelfth International Conference on Learning Representations, 2024.

[32] Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, Joseph E. Gonzalez. MemGPT: Towards LLMs as Operating Systems. CoRR, 2023. doi:10.48550/arxiv.2310.08560

[33] Joon Sung Park, Joseph C. O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein. Generative Agents: Interactive Simulacra of Human Behavior. 2023. doi:10.1145/3586183.3606763

[34] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar. Trans. Mach. Learn. Res., 2024.

[35] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Jirong Wen. Frontiers Comput. Sci., 2024. doi:10.1007/s11704-024-40231-1

[36] Requests. 2026.

[37] pytest. 2026.

[38] Flask. 2026.

[39] Aider. 2026.

[40] Vega Datasets: Cars. 2026.

[41] FiveThirtyEight Data: Airline Safety. 2026.