Process-Centric Analysis of Agentic Software Systems
Pith reviewed 2026-05-17 03:10 UTC · model grok-4.3
The pith
Representing agent trajectories as Graphectory graphs enables real-time monitoring that improves resolution rates by 6.9 to 23.5 percent on problematic instances.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Encoding the temporal and semantic relations in agent execution trajectories as Graphectory graphs shows that agent strategies depend on model strength and problem difficulty: resolved issues tend to follow a structured sequence of localization, patching, and validation, while unresolved ones show backtracking or disorder. Even successful agents often take inefficient routes. Building and analyzing Graphectory at runtime makes it possible to flag problematic trajectories and intervene with diagnostic messages or rollbacks; in the reported experiments this raises resolution rates by 6.9 to 23.5 percent across models on problematic instances while shortening trajectories.
What carries the argument
Graphectory, a graph-based encoding of temporal and semantic relations within agent trajectories that supports both offline analysis and real-time monitoring.
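As a rough illustration only, a graph encoding of this kind could look like the sketch below. The paper's actual node and edge rules are not given in this summary, so everything here is an assumption: the `Step` type, the `build_graph` helper, and the two edge kinds (`"next"` for temporal order, `"same_target"` as a stand-in for a semantic relation) are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from collections import defaultdict


@dataclass(frozen=True)
class Step:
    """One agent action; both fields are illustrative assumptions."""
    action: str   # e.g. "view", "edit", "test"
    target: str   # e.g. a file path


def build_graph(trajectory):
    """Return {(src, dst, kind): count} for a list of Steps.

    "next" edges record temporal order between consecutive steps.
    "same_target" edges are a hypothetical semantic relation linking
    distinct steps that act on the same target.
    """
    edges = defaultdict(int)
    # Temporal edges between consecutive steps.
    for src, dst in zip(trajectory, trajectory[1:]):
        edges[(src, dst, "next")] += 1
    # Group distinct steps by target, preserving first-seen order.
    by_target = defaultdict(list)
    for step in dict.fromkeys(trajectory):
        by_target[step.target].append(step)
    # Semantic edges between every pair of distinct steps on one target.
    for steps in by_target.values():
        for i, a in enumerate(steps):
            for b in steps[i + 1:]:
                edges[(a, b, "same_target")] += 1
    return dict(edges)


trajectory = [
    Step("view", "a.py"),
    Step("edit", "a.py"),
    Step("test", "tests/test_a.py"),
    Step("edit", "a.py"),   # revisits an earlier node
]
graph = build_graph(trajectory)
```

Because repeated steps map to the same node, a revisit shows up as an edge back into an existing node rather than as a new node, which is what lets a graph view expose loops that a flat step list hides.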
If this is right
- Stronger LLMs or richer prompts result in more complex Graphectory structures indicating broader exploration and validation.
- Resolved issues typically follow coherent sequences of localization, patching, and validation, unlike the chaotic or backtracking patterns in unresolved cases.
- Even successful trajectories can contain inefficiencies that the graph representation highlights.
- Real-time monitoring with interventions shortens trajectories and boosts resolution rates for problematic instances with negligible overhead.
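The monitoring idea in the last bullet can be sketched as a small online checker. This is a minimal sketch under stated assumptions, not the paper's method: it assumes each step has already been labeled with one of the three phases named above, and it uses a hypothetical rule (count transitions back to an earlier phase, flag once a threshold is exceeded). The class name, the threshold, and the diagnostic wording are all invented for illustration.

```python
# Phase ordering taken from the localization-patching-validation sequence.
PHASE_ORDER = {"localization": 0, "patching": 1, "validation": 2}


class OnlineMonitor:
    """Counts backward phase transitions as steps arrive.

    The threshold and the diagnostic text are illustrative assumptions.
    """

    def __init__(self, max_backtracks=2):
        self.max_backtracks = max_backtracks
        self.backtracks = 0
        self.last_phase = None

    def observe(self, phase):
        """Record one step; return a diagnostic string or None."""
        if self.last_phase is not None:
            # A move to an earlier phase counts as one backtrack.
            if PHASE_ORDER[phase] < PHASE_ORDER[self.last_phase]:
                self.backtracks += 1
        self.last_phase = phase
        if self.backtracks > self.max_backtracks:
            return (f"Detected {self.backtracks} backward phase transitions; "
                    "consider re-localizing before editing again.")
        return None


monitor = OnlineMonitor(max_backtracks=1)
phases = ["localization", "patching", "localization",
          "patching", "validation", "patching"]
messages = [monitor.observe(p) for p in phases]
```

Each call to `observe` does constant work, which is consistent with the low-overhead claim in spirit, though the real cost depends on how phases are assigned. A practical monitor would also need to reset or cool down after an intervention so that it does not flag every later step.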
Where Pith is reading between the lines
- This method could be extended to other domains where agents perform multi-step reasoning, such as web navigation or scientific discovery.
- Patterns identified in Graphectory might inform the design of new prompting techniques that encourage coherent strategies from the start.
- Integrating Graphectory into the agent's own decision-making loop could allow agents to self-monitor and correct mid-execution without external intervention.
Load-bearing premise
The automatically built Graphectory graphs and chosen diagnostic rules capture the key reasoning failures accurately enough that interventions help more than they hurt.
What would settle it
Applying the online monitoring system to a fresh collection of agent trajectories on SWE-bench and measuring whether the reported gains in resolution rate and reductions in trajectory length still appear.
Original abstract
Agentic systems are modern software systems: they consist of orchestrated modules, expose interfaces, and are deployed in software pipelines. Unlike conventional programs, their execution, i.e., trajectories, is inherently stochastic and adaptive to the problems they solve. Evaluation of such systems is often outcome-centric. This narrow focus overlooks detailed insights, failing to explain how agents reason, plan, act, or change their strategies. Inspired by the structured representation of conventional software systems as graphs, we introduce Graphectory to systematically encode the temporal and semantic relations in such systems. Using Graphectory, we automatically analyze 4000 trajectories of two dominant agentic programming workflows, SWE-agent and OpenHands, with four backbone Large Language Models (LLMs), attempting to resolve SWE-bench issues. Our automated analyses (completed within four minutes) reveal that: (1) agents using richer prompts or stronger LLMs exhibit more complex Graphectory, reflecting deeper exploration, broader context gathering, and more thorough validation; (2) agents' strategies vary with problem difficulty and the underlying LLM - for resolved issues, strategies often follow coherent localization-patching-validation steps, while unresolved ones exhibit chaotic or backtracking behaviors; and (3) even successful agentic systems often display inefficient processes. We also implement a novel technique for real-time construction and analysis of Graphectory and Langutory during agent execution to flag trajectory issues. Upon detecting such issues, the technique notifies the agent with a diagnostic message and, when applicable, rolls back the trajectory. Experiments show that online monitoring and interventions improve resolution rates by 6.9%-23.5% across models for problematic instances, while significantly shortening trajectories with near-zero overhead.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Graphectory, a graph-based representation to encode temporal and semantic relations in stochastic trajectories of agentic systems. It automatically analyzes 4000 trajectories from SWE-agent and OpenHands (four LLMs) on SWE-bench, reporting that richer prompts/stronger models yield more complex graphs, resolved issues show coherent localization-patching-validation steps while unresolved ones exhibit chaotic backtracking, and even successful runs contain inefficiencies. The paper also describes real-time Graphectory/Langutory construction for online flagging of issues, with interventions (diagnostic messages or rollbacks) claimed to raise resolution rates by 6.9%-23.5% on problematic instances while shortening trajectories at near-zero overhead.
Significance. If the Graphectory construction rules and diagnostic interventions are shown to be reliable, the work supplies a concrete process-centric lens for agentic systems that complements outcome-only evaluation on benchmarks such as SWE-bench. The scale of the automated analysis and the low-overhead online technique constitute practical contributions that could inform both empirical studies of LLM-agent behavior and runtime monitoring tools in software engineering.
major comments (2)
- §5 (Online monitoring and intervention): The diagnostic rules that map Graphectory features (e.g., chaotic backtracking) to intervention triggers are not specified, nor is any validation against human-labeled reasoning failures or ablation against generic prompts/random rollbacks reported. Because the 6.9%-23.5% resolution gains rest on these rules correctly identifying meaningful failures, the absence of such validation is load-bearing for the central empirical claim.
- §3-4 (Graphectory construction and analysis): The precise node/edge construction rules that turn raw trajectories into Graphectory graphs are not detailed, and no inter-rater reliability or statistical controls (confidence intervals, multiple-comparison correction) accompany the reported strategy patterns or resolution improvements. These omissions prevent assessment of whether the observed differences in complexity and coherence are robust.
minor comments (2)
- Abstract: 'Langutory' is mentioned without definition or relation to Graphectory; a brief parenthetical gloss would aid readability.
- Throughout: Ensure that first use of 'Graphectory' includes an explicit definition and that subsequent references maintain consistent capitalization and abbreviation.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major concern point by point below and have revised the manuscript accordingly to improve clarity and rigor.
- Referee: §5 (Online monitoring and intervention): The diagnostic rules that map Graphectory features (e.g., chaotic backtracking) to intervention triggers are not specified, nor is any validation against human-labeled reasoning failures or ablation against generic prompts/random rollbacks reported. Because the 6.9%-23.5% resolution gains rest on these rules correctly identifying meaningful failures, the absence of such validation is load-bearing for the central empirical claim.
  Authors: We agree that the diagnostic rules and supporting validation were insufficiently detailed. In the revised manuscript we will expand Section 5 with the exact feature-to-trigger mapping (including thresholds for chaotic backtracking and other indicators), a description of how interventions are selected, and an ablation study comparing our approach against random rollbacks and generic diagnostic prompts. We will also report results from a small human-labeled validation set of trajectories to confirm that the automated rules align with observable reasoning failures. These additions directly address the load-bearing concern for the reported gains.
  Revision: yes
- Referee: §3-4 (Graphectory construction and analysis): The precise node/edge construction rules that turn raw trajectories into Graphectory graphs are not detailed, and no inter-rater reliability or statistical controls (confidence intervals, multiple-comparison correction) accompany the reported strategy patterns or resolution improvements. These omissions prevent assessment of whether the observed differences in complexity and coherence are robust.
  Authors: We acknowledge that the construction rules require greater precision. The revised Section 3 will include a complete algorithmic specification with pseudocode for node and edge creation, along with explicit handling of temporal and semantic relations. Because the entire Graphectory extraction process is deterministic and rule-based, inter-rater reliability does not apply; we will instead emphasize reproducibility by releasing the extraction code. We will also add confidence intervals for all key metrics (graph complexity, coherence scores, resolution rates) and apply Bonferroni correction for multiple comparisons across the four models and two agent frameworks. These changes will allow readers to assess the robustness of the reported patterns.
  Revision: yes
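The statistical controls promised in this response can be illustrated with a short sketch. The rebuttal does not say which interval method will be used, so the Wilson score interval below is one common choice picked for the example, not the authors' stated method. Treating four models times two frameworks as eight comparisons is likewise an assumption about how the correction would be counted, and the success counts are made up, not results from the paper.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, alpha=0.05):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def bonferroni_alpha(alpha, num_comparisons):
    """Per-comparison significance level under Bonferroni correction."""
    return alpha / num_comparisons


# Assumed count: 4 models x 2 frameworks = 8 comparisons.
per_test_alpha = bonferroni_alpha(0.05, 4 * 2)

# Hypothetical counts, not results from the paper.
plain_low, plain_high = wilson_interval(successes=30, n=100, alpha=0.05)
low, high = wilson_interval(successes=30, n=100, alpha=per_test_alpha)
```

The corrected interval is wider than the uncorrected one for the same counts, which is the practical cost of controlling the family-wise error rate across several model and framework combinations.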
Circularity Check
No significant circularity; central claims are empirically grounded
The paper introduces Graphectory as an independent graph-based representation for encoding agent trajectories and applies it to automated analysis of 4000 SWE-bench runs plus real-time intervention experiments. The reported 6.9%-23.5% resolution gains and trajectory shortening are measured as external performance deltas against baseline agent executions, not as quantities defined in terms of Graphectory parameters, diagnostic rules, or fitted values. No equations, self-definitional steps, or load-bearing self-citations reduce the claims to inputs by construction; the modeling choices and rules are tested for utility via observable outcomes rather than being tautological with those outcomes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Agent trajectories in LLM-based systems exhibit temporal and semantic relations that can be systematically encoded as graphs.
invented entities (1)
- Graphectory: no independent evidence
Forward citations
Cited by 4 Pith papers
- Evaluating Plan Compliance in Autonomous Programming Agents
  Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade i...
- ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories
  ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 11...
- AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits
  AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.
- Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure
  Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task diffic...