OpenRath: Session-Centered Runtime State for Agent Systems

Fukang Wen; Ruilin Xu; Zhijie Wang

arxiv: 2606.19409 · v1 · pith:Z6N6BOCXnew · submitted 2026-06-17 · 💻 cs.SE · cs.PL


Pith reviewed 2026-06-26 20:16 UTC · model grok-4.3


keywords: session · runtime state · agent systems · auditable composition · multi-agent systems · programming model · fork merge replay

The pith

Session provides agent systems with a first-class runtime value for auditable composition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that modern agent systems suffer from fragmented runtime state where transcripts, tool effects, memory events, and other elements are recorded separately, making inspection and reproduction difficult. It proposes Session as the central runtime value passed between agents and workflows in a PyTorch-like programming model for multi-agent systems. Session records conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence while defining memory entry points. Because this state travels with the execution value itself, fork, merge, and replay become explicit runtime operations rather than reconstructions from external traces. A sympathetic reader would care if this approach simplifies auditable composition without relying on separate logging systems.

Core claim

The central claim is that Session provides agent systems with a first-class runtime value for auditable composition. Session is branchable, inspectable, replayable, backend-aware, and composable. It records conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence while defining where memory interactions enter the runtime record. This makes fork, merge, and replay explicit runtime operations rather than states reconstructed from external traces. The model further defines Sandbox, Tool, Agent, Memory, Workflow, and Selector abstractions, with Selector turning control flow into runtime-routed decisions.
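The paper itself supplies no code, so the following is only a minimal sketch of what a Session-like value with explicit fork and merge might look like. Every name here (`Session`, `Chunk`, `fork`, `merge`, the field names, and the merge policy of keeping the shared prefix once) is a hypothetical assumption for illustration, not OpenRath's API.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Chunk:
    """One conversation chunk plus its token cost (illustrative)."""
    text: str
    tokens: int


@dataclass(frozen=True)
class Session:
    """Hypothetical immutable runtime value; not taken from the paper."""
    session_id: str
    parents: Tuple[str, ...] = ()      # lineage metadata
    chunks: Tuple[Chunk, ...] = ()     # conversation chunks
    sandbox: str = "local"             # sandbox placement
    pending: Tuple[str, ...] = ()      # pending work
    evidence: Tuple[str, ...] = ()     # tool evidence

    @property
    def tokens_used(self) -> int:
        # Token usage is derived from the recorded chunks.
        return sum(c.tokens for c in self.chunks)

    def append(self, text: str, tokens: int) -> Session:
        return replace(self, chunks=self.chunks + (Chunk(text, tokens),))

    def fork(self) -> Session:
        # A fork is a new value whose lineage points at its parent.
        return replace(self, session_id=uuid.uuid4().hex,
                       parents=(self.session_id,))


def _shared_prefix(xs: tuple, ys: tuple) -> int:
    n = 0
    for x, y in zip(xs, ys):
        if x != y:
            break
        n += 1
    return n


def merge(a: Session, b: Session) -> Session:
    # Assumed policy: keep the common history once, then a's and b's suffixes.
    k = _shared_prefix(a.chunks, b.chunks)
    e = _shared_prefix(a.evidence, b.evidence)
    return Session(
        session_id=uuid.uuid4().hex,
        parents=(a.session_id, b.session_id),
        chunks=a.chunks + b.chunks[k:],
        sandbox=a.sandbox,
        pending=tuple(dict.fromkeys(a.pending + b.pending)),
        evidence=a.evidence + b.evidence[e:],
    )


root = Session(session_id="root").append("user: summarize the repo", 5)
left = root.fork().append("agent A: draft", 7)
right = root.fork().append("agent B: critique", 4)
joined = merge(left, right)
```

In this sketch the lineage of `joined` is readable from the value itself (`joined.parents`), which is the property the core claim depends on; a real merge policy would need to handle conflicts that this one ignores.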

What carries the argument

Session, the runtime value passed between agents and workflows that is branchable, inspectable, replayable, backend-aware, and composable while recording conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence.

If this is right

Fork, merge, and replay become explicit runtime operations.
Control flow decisions are routed at runtime through the Selector abstraction.
Memory interactions enter the runtime record at explicitly defined points.
The overall system gains a unified value for carrying state across multi-session executions.
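The Selector consequence can be illustrated with a small sketch. It assumes, hypothetically, that a Selector reads pending work from the Session and writes its routing choice back into the same value; the `decisions` field, the `choose` rule, and the `planner` and `executor` agents are invented for this example and do not come from the paper.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Session:
    """Minimal stand-in; field names are illustrative assumptions."""
    pending: Tuple[str, ...] = ()
    evidence: Tuple[str, ...] = ()
    decisions: Tuple[str, ...] = ()  # routing choices kept in the value itself


Agent = Callable[[Session], Session]


def planner(s: Session) -> Session:
    # Adds one unit of pending work.
    return replace(s, pending=s.pending + ("run_tests",))


def executor(s: Session) -> Session:
    # Consumes the first pending item and records evidence for it.
    task, rest = s.pending[0], s.pending[1:]
    return replace(s, pending=rest, evidence=s.evidence + (f"{task}: ok",))


class Selector:
    """Hypothetical router: picks an agent from Session state at run time."""

    def __init__(self, routes: Dict[str, Agent]) -> None:
        self.routes = routes

    def choose(self, s: Session) -> str:
        return "executor" if s.pending else "planner"

    def __call__(self, s: Session) -> Session:
        name = self.choose(s)
        # Record the decision before dispatching so it travels with the value.
        recorded = replace(s, decisions=s.decisions + (name,))
        return self.routes[name](recorded)


select = Selector({"planner": planner, "executor": executor})
state = select(select(Session()))
```

Because each routing choice is appended to `state.decisions`, a later reader can see which branch ran without consulting a separate trace. Whether OpenRath records Selector decisions this way is not stated in the source.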

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Existing agent frameworks could adopt Session to reduce reliance on separate tracing tools for reproducibility.
Branching sessions might support parallel exploration of agent behaviors without duplicating external logs.
The model could extend to non-agent workflows where state provenance must remain tied to execution.

Load-bearing premise

That recording conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence inside the same runtime value passed during execution will make fork, merge, and replay explicit operations rather than states reconstructed from external traces.

What would settle it

An implementation of the Session model in which fork or replay still requires reconstructing state from external traces or logs would falsify the central claim.
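One way to picture this test is a check that replay reads only the Session. The sketch below is an assumption-laden illustration: `ToolEvidence`, `run_live`, `replay`, and the deliberately nondeterministic `flaky_clock` tool are all hypothetical names, not part of OpenRath.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class ToolEvidence:
    """One recorded tool call and its result (illustrative)."""
    tool: str
    arg: str
    result: str


@dataclass(frozen=True)
class Session:
    evidence: Tuple[ToolEvidence, ...] = ()


def run_live(calls: Sequence[Tuple[str, str]],
             tools: Dict[str, Callable[[str], str]]) -> Session:
    # Live run: invoke each tool once and store the result in the Session.
    return Session(evidence=tuple(
        ToolEvidence(tool, arg, tools[tool](arg)) for tool, arg in calls
    ))


def replay(session: Session) -> Tuple[str, ...]:
    # Replay: reads only the Session; no tool is invoked, no log is opened.
    return tuple(e.result for e in session.evidence)


calls_made: List[str] = []


def flaky_clock(arg: str) -> str:
    # Nondeterministic stand-in: the output depends on how often it has run.
    calls_made.append(arg)
    return f"{arg}@tick{len(calls_made)}"


session = run_live([("clock", "a"), ("clock", "b")], {"clock": flaky_clock})
first = replay(session)
second = replay(session)
```

If `replay` had to call `flaky_clock` again or read an external trace to produce `first`, the implementation would fall on the falsifying side of the condition above.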

Figures

Figures reproduced from arXiv: 2606.19409 by Fukang Wen, Ruilin Xu, Zhijie Wang.

**Figure 1.** OpenRath’s core boundary: side-channel state around an agent loop is promoted into a branchable Session value that can produce release evidence artifacts.

**Figure 2.** OpenRath’s ecosystem role is a crossing-object boundary. It can work with specialized agent APIs, graph runtimes, tracing SDKs, tool protocols, sandbox providers, and evaluation harnesses by making their effects visible in one Session.

**Figure 3.** The PyTorch lens. Each agent-runtime concern maps onto one OpenRath object, with Session as the flowing value (the tensor of the runtime) and Selector routing control flow at run time. The mapping is a teaching device, not a claim that agent systems are neural networks.

**Figure 4.** Session lifecycle as a single runtime value: the same object is placed, transformed, branched, merged, persisted, and replayed rather than replaced by a separate orchestration state.

**Figure 5.** Tool execution boundary: schemas are visible to the model, side effects run through the session’s sandbox and backend, and results return as session evidence.

**Figure 6.** Multi-session runtime and multi-agent workflow share the same boundary: agents route, hand off, and compose work by reading and returning Session state rather than introducing a second runtime object.

**Figure 7.** Claim-to-evidence protocol: report claims pass through a ledger, evidence packets, and a smoke suite before becoming supported text, scoped text, or explicit limitations.

Original abstract

Modern agent systems often suffer from fragmented runtime state: transcripts, tool effects, memory events, workspace placement, branch provenance, and replay evidence are recorded separately and become difficult to inspect or reproduce. OpenRath addresses this issue with a PyTorch-like programming model for multi-agent, multi-session systems. The analogy concerns the role of a central first-class runtime abstraction, not tensor computation. Its core abstraction is Session, the runtime value passed between agents and workflows. A Session is branchable, inspectable, replayable, backend-aware, and composable. It records conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence, while defining where memory interactions enter the runtime record. Since this state is carried by the same value used in program execution, fork, merge, and replay become explicit runtime operations rather than states reconstructed from external traces. OpenRath further defines Sandbox, Tool, Agent, Memory, Workflow, and Selector, with Selector turning control flow into runtime-routed decisions. This report presents the programming model, architecture, audited milestones, and evidence protocol. Its claims are limited to controlled runtime properties, while broad quantitative comparisons, live-provider quality, optional-backend availability, and memory quality are left for follow-on evaluation. The central thesis is that Session provides agent systems with a first-class runtime value for auditable composition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OpenRath defines a Session as a first-class runtime value that bundles agent state so fork, merge, and replay become explicit operations, but the paper stays at the level of a programming model without code or data.

The paper's main contribution is to define Session as a first-class runtime value that bundles agent state so that fork, merge, and replay become explicit operations. The analogy is to how PyTorch centers on tensors, applied here to conversation chunks, lineage, token use, sandbox placement, and tool evidence.

The paper does a clean job of laying out the full model with related pieces like Sandbox, Selector, Workflow, and Memory, and it turns control flow into runtime-routed decisions. The authors are straightforward about limiting their claims to controlled runtime properties, and they leave quantitative comparisons and live-provider issues for later work.

The soft spot is that nothing is shown beyond the definitions. There is no code, no pseudocode, no example runs, and no evidence protocol in action, so it is impossible to check whether carrying all that state in one passed value actually reduces fragmentation or creates overhead when agents hit real nondeterministic tools. The central thesis rests on the abstraction doing the work by itself.

This is for engineers and researchers building multi-agent systems who already care about reproducibility and auditability. Someone working on runtime state or workflow composition could borrow the structure even if they do not adopt the full set of components.

I would send it to peer review because the problem it targets is real in agent engineering and the framing is coherent, though the authors will need to add concrete support before it can be evaluated properly.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes OpenRath, a PyTorch-analogous programming model for multi-agent, multi-session systems whose core abstraction is the Session: a first-class, branchable, inspectable, replayable, backend-aware, and composable runtime value passed between agents and workflows. Session records conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence while defining memory interaction points; this design is claimed to turn fork, merge, and replay into explicit runtime operations. The model further defines Sandbox, Tool, Agent, Memory, Workflow, and Selector (the latter routing control flow at runtime). The report presents the programming model, architecture, audited milestones, and evidence protocol, with all claims explicitly limited to controlled runtime properties.

Significance. If the Session-centered model can be realized without hidden external dependencies, it would offer a unified mechanism for state management that directly supports auditability and reproducibility in agent systems. The explicit treatment of fork/merge/replay and the evidence protocol constitute concrete strengths that could serve as a reference point for subsequent implementations, even though the current manuscript supplies no code, data, or formal semantics.

major comments (2)

[Abstract] The central thesis that carrying conversation chunks, sandbox placement, lineage, token usage, pending work, and tool evidence inside the same Session value makes fork, merge, and replay 'explicit runtime operations rather than states reconstructed from external traces' is load-bearing yet rests solely on the definitional claim; no pseudocode, workflow fragment, or operation signature is supplied to show how an agent would invoke these operations on the Session object.
[Abstract / architecture description] Selector is said to turn control flow into runtime-routed decisions, but the manuscript supplies no account of how Selector consults the recorded pending work and tool evidence to guarantee replay determinism; this interaction is essential to the claim that all listed state elements remain inside the runtime value.

minor comments (1)

[Abstract] The PyTorch analogy is stated but never elaborated (e.g., which tensor-like operations or autograd-style tracking are being paralleled), leaving the intended developer interface unclear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below.

Referee: [Abstract] The central thesis that carrying conversation chunks, sandbox placement, lineage, token usage, pending work, and tool evidence inside the same Session value makes fork, merge, and replay 'explicit runtime operations rather than states reconstructed from external traces' is load-bearing yet rests solely on the definitional claim; no pseudocode, workflow fragment, or operation signature is supplied to show how an agent would invoke these operations on the Session object.

Authors: We agree that the manuscript supplies no pseudocode, workflow fragments, or operation signatures. The presentation centers on the architectural role of Session as a first-class, branchable runtime value that unifies state elements, thereby rendering fork, merge, and replay explicit by definition rather than through external reconstruction. To make this claim more concrete, we will add pseudocode examples and a workflow fragment illustrating the relevant Session operations in the revised version.

Revision: yes

Referee: [Abstract / architecture description] Selector is said to turn control flow into runtime-routed decisions, but the manuscript supplies no account of how Selector consults the recorded pending work and tool evidence to guarantee replay determinism; this interaction is essential to the claim that all listed state elements remain inside the runtime value.

Authors: We acknowledge that the manuscript does not detail the mechanism by which Selector consults pending work and tool evidence. Selector is defined at the architectural level as the component that routes control flow using runtime state carried in the Session. In revision we will expand this section to describe the consultation process and its role in preserving replay determinism while ensuring all state elements remain within the Session value.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; definitional proposal of new abstraction

The manuscript proposes Session as a first-class runtime value that carries conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence. This is presented as a programming model definition (PyTorch-like analogy for state passing), not a derivation from equations or fitted parameters. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the core claim. Fork/merge/replay are defined as explicit operations precisely because the state is carried by the Session value passed at runtime; this is definitional rather than a reduction to prior inputs. The paper explicitly limits claims to controlled runtime properties and leaves quantitative evaluation for follow-on work. No load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The paper introduces multiple new runtime entities and a programming model without empirical validation or formal proof in the provided abstract; the central claim rests on the assumption that a unified Session value solves fragmentation.

axioms (1)

domain assumption: Fragmented runtime state in agent systems is a problem that a single first-class runtime value can solve.
Invoked in the opening description of the problem and the thesis that Session enables auditable composition.

invented entities (3)

Session (no independent evidence)
purpose: Central branchable, inspectable, replayable runtime value that carries all state
Newly defined as the value passed between agents and workflows; no independent evidence supplied.

Sandbox (no independent evidence)
purpose: Execution environment tied to Session
Defined as part of the model alongside Session.

Selector (no independent evidence)
purpose: Turns control flow into runtime-routed decisions
New component defined in the model.

pith-pipeline@v0.9.1-grok · 5773 in / 1344 out tokens · 28678 ms · 2026-06-26T20:16:24.808751+00:00 · methodology

Reference graph

Works this paper leans on

55 extracted references · 37 linked inside Pith

[1]

ReAct: Synergizing Reasoning and Acting in Language Models, 2023

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models, 2023. URL https://arxiv.org/abs/2210.03629

Pith/arXiv arXiv 2023
[2]

PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performa...

Pith/arXiv arXiv 2019
[3]

White, Doug Burger, and Chi Wang

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation,
[4]

URLhttps://arxiv.org/abs/2308.08155

Pith/arXiv arXiv
[5]

LangGraph: Persistence, 2025

LangChain. LangGraph: Persistence, 2025. URL https://docs.langchain.com/oss/ python/langgraph/persistence

2025
[6]

OpenAI Agents SDK: Tracing, 2025

OpenAI. OpenAI Agents SDK: Tracing, 2025. URL https://openai.github.io/ openai-agents-python/tracing/

2025
[7]

What is the Model Context Protocol (MCP)?, 2025

Model Context Protocol. What is the Model Context Protocol (MCP)?, 2025. URLhttps: //modelcontextprotocol.io/docs/getting-started/intro

2025
[8]

Introducing the Model Context Protocol, 2024

Anthropic. Introducing the Model Context Protocol, 2024. URLhttps://www.anthropic. com/news/model-context-protocol

2024
[9]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2023

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2023. URLhttps://arxiv.org/abs/2201.11903

Pith/arXiv arXiv 2023
[10]

Self-Consistency Improves Chain of Thought Reasoning in Language Models, 2023

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models, 2023. URLhttps://arxiv.org/abs/2203.11171

Pith/arXiv arXiv 2023
[11]

MRKL Systems: A Modular, Neuro-Symbolic Architecture that Combines Large Language Models, External Knowledge Sources and Discrete Reasoning, 2022

Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, Dor Muhlgay, Noam Rozen, Erez Schwartz, Gal Shachaf, Shai Shalev-Shwartz, Amnon Shashua, and Moshe Tenenholtz. MRKL Systems: A Modular, Neuro-Symbolic Architecture that Combines Large Language Models, External Knowledg...

Pith/arXiv arXiv 2022
[12]

Toolformer: Language Models Can Teach Themselves to Use Tools, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettle- moyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language Models Can Teach Themselves to Use Tools, 2023. URLhttps://arxiv.org/abs/2302.04761

Pith/arXiv arXiv 2023
[13]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, 2023

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, 2023. URL https://arxiv.org/abs/2303.17580

Pith/arXiv arXiv 2023
[14]

Patil, Tianjun Zhang, Xin Wang, and Joseph E

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large Language Model Connected with Massive APIs, 2023. URLhttps://arxiv.org/abs/2305.15334. 16 OpenRath: Session-Centered Runtime State

Pith/arXiv arXiv 2023
[15]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, 2023

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, 2023. URLhttps://arxiv.org/abs/2307.16789

Pith/arXiv arXiv 2023
[16]

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs,

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs,
[17]

URLhttps://arxiv.org/abs/2304.08244

Pith/arXiv arXiv
[18]

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases, 2023

Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases, 2023. URLhttps://arxiv.org/abs/2306.05301

Pith/arXiv arXiv 2023
[19]

Griffiths, Yuan Cao, and Karthik Narasimhan

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models, 2023. URLhttps://arxiv.org/abs/2305.10601

Pith/arXiv arXiv 2023
[20]

CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society, 2023

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society, 2023. URLhttps://arxiv.org/abs/2303.17760

Pith/arXiv arXiv 2023
[21]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework, 2024

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework, 2024. URLhttps://arxiv.org/abs/2308.00352

Pith/arXiv arXiv 2024
[22]

ChatDev: Communicative Agents for Software Development, 2024

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. ChatDev: Communicative Agents for Software Development, 2024. URLhttps://arxiv.org/abs/2307. 07924

2024
[23]

AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors, 2023

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors, 2023. URLhttps://arxiv.org/abs/2308.10848

Pith/arXiv arXiv 2023
[24]

LangGraph: Use Time Travel, 2025

LangChain. LangGraph: Use Time Travel, 2025. URLhttps://docs.langchain.com/oss/ python/langgraph/use-time-travel

2025
[25]

OpenAI Agents SDK: Agents, 2025

OpenAI. OpenAI Agents SDK: Agents, 2025. URL https://openai.github.io/ openai-agents-python/agents/

2025
[26]

Traces, 2025

OpenTelemetry. Traces, 2025. URLhttps://opentelemetry.io/docs/concepts/signals/ traces/

2025
[27]

OpenAPI Specification Version 3.1.0, 2021

OpenAPI Initiative. OpenAPI Specification Version 3.1.0, 2021. URLhttps://swagger.io/ specification/

2021
[28]

Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A System for Large...

Pith/arXiv arXiv 2016
[29]

Reflexion: Language Agents with Verbal Reinforcement Learning, 2023

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language Agents with Verbal Reinforcement Learning, 2023. URL https://arxiv.org/abs/2303.11366

Pith/arXiv arXiv 2023
[30]

O’Brien, Carrie J

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative Agents: Interactive Simulacra of Human Behavior, 2023. URLhttps://arxiv.org/abs/2304.03442

Pith/arXiv arXiv 2023
[31]

Patil, Ion Stoica, and Joseph E

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as Operating Systems, 2024. URLhttps: //arxiv.org/abs/2310.08560

Pith/arXiv arXiv 2024
[32]

Voyager: An Open-Ended Embodied Agent with Large Language Models, 2023

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An Open-Ended Embodied Agent with Large Language Models, 2023. URLhttps://arxiv.org/abs/2305.16291

Pith/arXiv arXiv 2023
[33]

AgentBench: Evaluating LLMs as Agents, 2025

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as Agents, 2025. URLhttps://arxiv.org/ abs/2308.03688

Pith/arXiv arXiv 2025
[34]

URLhttps://arxiv.org/ abs/2406.12045

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan.τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains, 2024. URLhttps://arxiv.org/ abs/2406.12045

Pith/arXiv arXiv 2024
[35]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?,
[36]

URLhttps://arxiv.org/abs/2310.06770

Pith/arXiv arXiv
[37]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering, 2024. URLhttps://arxiv.org/abs/2405.15793

Pith/arXiv arXiv 2024
[38]

SWE-bench Verified, 2024

OpenAI. SWE-bench Verified, 2024. URLhttps://www.swebench.com/verified.html

2024
[39]

Merrill, Alexander G

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, et al. Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces, 2026. URLhttps://arxiv. org/abs/2601.11868

Pith/arXiv arXiv 2026
[40]

Barr, Mark Harman, Federica Sarro, and He Ye

Zhaoyang Chu, Jiarui Hu, Xingyu Jiang, Pengyu Zou, Han Li, Chao Peng, Peter O’Hearn, Earl T. Barr, Mark Harman, Federica Sarro, and He Ye. TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks, 2026. URLhttps://arxiv.org/abs/2605.22535

Pith/arXiv arXiv 2026
[41]

No More, No Less: Task Alignment in Terminal Agents, 2026

Sina Mavali, David Pape, Jonathan Evertz, Samira Abedini, Devansh Srivastav, Thorsten Eisenhofer, Sahar Abdelnabi, and Lea Schönherr. No More, No Less: Task Alignment in Terminal Agents, 2026. URLhttps://arxiv.org/abs/2605.12233

Pith/arXiv arXiv 2026
[42]

Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A Realistic Web Environment for Building Autonomous Agents, 2024. URLhttps://arxiv. org/abs/2307.13854. 18 OpenRath: Session-Centered Runtime State

Pith/arXiv arXiv 2024
[43]

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks, 2024

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks, 2024. URLhttps://arxiv. org/abs/2401.13649

Pith/arXiv arXiv 2024
[44]

Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?, 2024. URLhttps://arxiv.org/abs/2403.07718

Pith/arXiv arXiv 2024
[45]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, 2024. URL https://arxiv.org/abs/2404.07972

Pith/arXiv arXiv 2024
[46]

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents, 2023

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents, 2023. URLhttps: //arxiv.org/abs/2207.01206

arXiv 2023
[47]

Mind2Web: Towards a Generalist Agent for the Web, 2023

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a Generalist Agent for the Web, 2023. URL https: //arxiv.org/abs/2306.06070

Pith/arXiv arXiv 2023
[48]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning, 2021

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning, 2021. URLhttps://arxiv.org/abs/2010.03768

Pith/arXiv arXiv 2021
[49]

Science- World: Is your Agent Smarter than a 5th Grader?, 2022

Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. Science- World: Is your Agent Smarter than a 5th Grader?, 2022. URLhttps://arxiv.org/abs/ 2203.07540

arXiv 2022
[50]

GAIA: a Benchmark for General AI Assistants, 2023

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a Benchmark for General AI Assistants, 2023. URLhttps://arxiv.org/ abs/2311.12983

Pith/arXiv arXiv 2023
[51]

Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z

Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig. TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks, 2025....

Pith/arXiv arXiv 2025
[52]

Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection, 2023

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection, 2023. URL https://arxiv.org/abs/2302. 12173

[53]

Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. Agent-SafetyBench: Evaluating the Safety of LLM Agents, 2025. URL https://arxiv.org/abs/2412.14470

[54]

Ada Defne Tur, Nicholas Meade, Xing Han Lù, Alejandra Zambrano, Arkil Patel, Esin Durmus, Spandana Gella, Karolina Stańczak, and Siva Reddy. SafeArena: Evaluating the Safety of Autonomous Web Agents, 2025. URL https://arxiv.org/abs/2503.04957

[55]

Sheng Yin, Xianghe Pang, Yuanzhuo Ding, Menglan Chen, Yutong Bi, Yichen Xiong, Wenhao Huang, Zhen Xiang, Jing Shao, and Siheng Chen. SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents, 2025. URL https://arxiv.org/abs/2412.13178

