arxiv: 2512.02393 · v3 · submitted 2025-12-02 · 💻 cs.SE · cs.AI · cs.CL

Process-Centric Analysis of Agentic Software Systems

Pith reviewed 2026-05-17 03:10 UTC · model grok-4.3

classification 💻 cs.SE cs.AI cs.CL
keywords agentic systems · trajectory analysis · Graphectory · LLM agents · software engineering · process monitoring · SWE-bench · online intervention

The pith

Representing agent trajectories as Graphectory graphs enables real-time monitoring that improves resolution rates by 6.9% to 23.5% on problematic instances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that agentic software systems, which adaptively solve problems using large language models, benefit from shifting evaluation from final outcomes to their internal processes. It introduces Graphectory as a way to model these processes as graphs showing how agents explore, plan, and validate over time. Analysis of thousands of trajectories from common agent frameworks reveals patterns like more complex reasoning in stronger setups and coherent steps in successful cases versus chaotic ones in failures. The authors then use this representation for online detection of issues during execution, allowing the system to notify the agent or roll back steps. This leads to better success on fixing code issues and shorter execution paths without much extra computation.

Core claim

By encoding the temporal and semantic relations in agent execution trajectories as Graphectory graphs, the analysis shows that agent strategies depend on model strength and problem difficulty. Resolved issues tend to follow structured localization, patching, and validation steps, while unresolved ones show backtracking or disorder, and even successful agents often take inefficient routes. Constructing and analyzing Graphectory in real time during execution lets the system flag problematic trajectories and intervene with diagnostic messages or rollbacks. The reported experiments show that this raises resolution rates by 6.9% to 23.5% across models on problematic instances while shortening trajectories.

What carries the argument

Graphectory, a graph-based encoding of temporal and semantic relations within agent trajectories that supports both offline analysis and real-time monitoring.
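
As a rough illustration of the idea, the sketch below stores a trajectory as a small graph with two edge kinds, named after the temporal (TE) and structural (SE) edges mentioned in the paper's Figure 2 caption. The paper's actual node and edge construction rules are not given on this page, so the `Step` fields, the `build_graph` helper, and the rule "link steps that touch the same target" are all hypothetical assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One agent action. All fields are illustrative assumptions."""
    index: int
    action: str   # e.g. "view", "edit", "run_test"
    target: str   # e.g. a file path


@dataclass
class TrajectoryGraph:
    nodes: list = field(default_factory=list)
    temporal_edges: list = field(default_factory=list)    # consecutive steps in execution order
    structural_edges: list = field(default_factory=list)  # steps that share a target


def build_graph(steps):
    """Hypothetical construction, not the paper's rules."""
    graph = TrajectoryGraph(nodes=list(steps))
    last_seen = {}  # target -> index of the most recent step that touched it
    for pos, step in enumerate(steps):
        if pos > 0:
            graph.temporal_edges.append((steps[pos - 1].index, step.index))
        if step.target in last_seen:
            graph.structural_edges.append((last_seen[step.target], step.index))
        last_seen[step.target] = step.index
    return graph


# Made-up trajectory for illustration only.
steps = [
    Step(0, "view", "client.py"),
    Step(1, "create", "reproduce.py"),
    Step(2, "edit", "client.py"),
    Step(3, "run_test", "reproduce.py"),
]
graph = build_graph(steps)
print(graph.temporal_edges)    # [(0, 1), (1, 2), (2, 3)]
print(graph.structural_edges)  # [(0, 2), (1, 3)]
```

Keeping the two edge lists separate is one simple way to let later analyses ask temporal questions (what happened next) and semantic questions (what else touched this file) independently.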

If this is right

  • Stronger LLMs or richer prompts result in more complex Graphectory structures indicating broader exploration and validation.
  • Resolved issues typically follow coherent sequences of localization, patching, and validation, unlike the chaotic or backtracking patterns in unresolved cases.
  • Even successful trajectories can contain inefficiencies that the graph representation highlights.
  • Real-time monitoring with interventions shortens trajectories and boosts resolution rates for problematic instances with negligible overhead.
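
The contrast between coherent and backtracking trajectories can be sketched with a phase string over the alphabet {L, P, V} (Localization, Patching, Validation) named in the paper's Figure 2 caption. The action-to-phase table and the "backward transition" count below are hypothetical assumptions for illustration, not the paper's metrics.

```python
# Hypothetical mapping from tool actions to phases; the paper's own mapping is not given here.
PHASE_OF = {"search": "L", "view": "L", "edit": "P", "create": "P", "run_test": "V"}
ORDER = {"L": 0, "P": 1, "V": 2}


def phase_sequence(actions):
    """Map actions to phases and collapse consecutive repeats, e.g. L L P -> LP."""
    phases = []
    for action in actions:
        phase = PHASE_OF[action]
        if not phases or phases[-1] != phase:
            phases.append(phase)
    return "".join(phases)


def backward_transitions(sequence):
    """Count moves to an earlier phase (P->L, V->L, V->P)."""
    return sum(1 for a, b in zip(sequence, sequence[1:]) if ORDER[b] < ORDER[a])


# Made-up action lists for illustration only.
coherent = phase_sequence(["search", "view", "edit", "run_test"])
messy = phase_sequence(["view", "edit", "view", "edit", "run_test", "edit", "view"])
print(coherent, backward_transitions(coherent))  # LPV 0
print(messy, backward_transitions(messy))        # LPLPVPL 3
```

A nonzero count is not automatically a failure: returning to patching after a failed test is ordinary debugging. The count is only a crude proxy for the backtracking behavior described above.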

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could be extended to other domains where agents perform multi-step reasoning, such as web navigation or scientific discovery.
  • Patterns identified in Graphectory might inform the design of new prompting techniques that encourage coherent strategies from the start.
  • Integrating Graphectory into the agent's own decision-making loop could allow agents to self-monitor and correct mid-execution without external intervention.

Load-bearing premise

The automatically built Graphectory graphs and chosen diagnostic rules capture the key reasoning failures accurately enough that interventions help more than they hurt.
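
To make the premise concrete, here is a minimal sketch of an online check of the kind described: it watches steps as they arrive and returns a diagnostic message when one rule fires. The rule (the same action on the same target failing three times in a row), the threshold, and all names are hypothetical; the paper's actual diagnostic rules are not specified on this page.

```python
from collections import deque


class RetryMonitor:
    """Flags repeated identical failing steps. Rule and threshold are assumptions."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.recent = deque(maxlen=threshold)  # sliding window of the latest steps

    def observe(self, action, target, succeeded):
        """Record one step; return a diagnostic string if the rule fires, else None."""
        self.recent.append((action, target, succeeded))
        if len(self.recent) < self.threshold:
            return None
        first = self.recent[0]
        all_same = all(step == first for step in self.recent)
        if all_same and not succeeded:
            self.recent.clear()  # avoid re-firing on every following step
            return (f"'{action}' on {target} failed {self.threshold} times in a row; "
                    "consider a different approach.")
        return None


# Made-up step stream for illustration only.
monitor = RetryMonitor(threshold=3)
stream = [
    ("edit", "client.py", False),
    ("edit", "client.py", False),
    ("edit", "client.py", False),
    ("view", "client.py", True),
]
messages = [monitor.observe(*step) for step in stream]
print(messages)
```

Whether a rule like this helps depends on how often it fires on trajectories that were about to recover on their own, which is the false-positive risk the premise above assumes is small.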

What would settle it

Applying the online monitoring system to a fresh collection of agent trajectories on SWE-bench and measuring whether the reported gains in resolution rate and reductions in trajectory length still appear.

Figures

Figures reproduced from arXiv: 2512.02393 by Jatin Ganhotra, Rahul Krishna, Reyhan Jabbarvand, Saurabh Sinha, Shuyang Liu, Yang Chen.

Figure 1: Raw trajectories of (a) SWE-agent Dev and (b) SWE-agent DSK-V3 when resolving django-10973 i.e., the sequence of reasoning about how to solve the problem and taking the appropriate actions, are different: SWE-agent Dev starts by localizing the bug to client.py file (steps 1–4), creating a reproduction test (step 5), and editing multiple locations of the client.py (steps 6–8). After two additional repetitiv…
Figure 2: Graphectory of SWE-agent Dev (a) and SWE-agent DSK-V3 (b) for problem django-10973 dashed lines demonstrate temporal edges (𝑇𝐸) and structural edges (𝑆𝐸), respectively. The alphabet of Langutory is Φ = {𝐿, 𝑃, 𝑉}, representing the initial of unique logical phases in program repair: Localization, Patching, and Validation. Graphectory and Langutory promptly provide the following insights about the agents’ be…
Figure 3: Process-centric metrics by agent–model pair. Columns are grouped by…
Figure 4: Graphectory of OpenHands DSK-V3 (a) and SWE-agent DSK-V3 (b) for problem django-13820
Figure 5: p-values of statistical tests on (a) issue repair status and (b) human difficulty alignment. Cells with ★ indicates significant difference (𝑝 ≤ 0.05) we perform the Mann-Whitney U test [34] for each metric across all ⟨agent, model⟩ pairs. This non-parametric statistical test ranks all observations from both groups (a process-centric metric and repair status) and compares the sum of ranks between the two gr…
Figure 6: Graphectory of SWE-agent DSK-V3 for (a) problem sympy-13480 (easy) and (b) problem scikit-learn-14053 (medium) trajectories, we observed many early terminations due to runtime issues caused by its failure to provide model responses in the correct format, which is a known issue. 3.2.4 Analysis Across Problem Difficulty. Finally, we assess, using the process-centric metrics, how problem difficulty impacts a…
Figure 7: Phase transition sequences (cut-off = 10)
Figure 8: Distribution of terminal trajectory phases
Figure 9: Phase change distribution across agents and models: (a) Resolved and (b) Unresolved
Figure 10: Strategy changes of SWE-agent DSK-V3 while attempting to resolve sympy-19783 problem dominant problem-solving strategy. Overall, resolved instances of easy problems exhibit structured and well-ordered phase transitions, typically following concise ⟨𝐿, 𝑃, 𝑉⟩ cycles that reflect a disciplined pipeline of localization, patch generation, and validation. As task difficulty increases, these sequences extend mo…
Figure 11: Example of anti-patterns Scroll (a) and OverlyDeepZoom (b)
Figure 12: Example of anti-patterns UnresolvedRetry (a) and EditReversion (b)
Figure 13: Editor failure modes of (a) StrNotFound, (b) NoEffectEdit, and (c) AmbiguousTarget StrNotFound occurs when an edit fails because the specified old string cannot be found in the file (…
Figure 14: Localization inefficiency patterns among Resolved (a) and Unresolved (b) instances
Figure 15: Patching inefficiency patterns among Resolved (a) and Unresolved (b) instances
Original abstract

Agentic systems are modern software systems: they consist of orchestrated modules, expose interfaces, and are deployed in software pipelines. Unlike conventional programs, their execution, i.e., trajectories, is inherently stochastic and adaptive to the problems they solve. Evaluation of such systems is often outcome-centric. This narrow focus overlooks detailed insights, failing to explain how agents reason, plan, act, or change their strategies. Inspired by the structured representation of conventional software systems as graphs, we introduce Graphectory to systematically encode the temporal and semantic relations in such systems. Using Graphectory, we automatically analyze 4000 trajectories of two dominant agentic programming workflows, SWE-agent and OpenHands, with four backbone Large Language Models (LLMs), attempting to resolve SWE-bench issues. Our automated analyses (completed within four minutes) reveal that: (1) agents using richer prompts or stronger LLMs exhibit more complex Graphectory, reflecting deeper exploration, broader context gathering, and more thorough validation; (2) agents' strategies vary with problem difficulty and the underlying LLM - for resolved issues, strategies often follow coherent localization-patching-validation steps, while unresolved ones exhibit chaotic or backtracking behaviors; and (3) even successful agentic systems often display inefficient processes. We also implement a novel technique for real-time construction and analysis of Graphectory and Langutory during agent execution to flag trajectory issues. Upon detecting such issues, the technique notifies the agent with a diagnostic message and, when applicable, rolls back the trajectory. Experiments show that online monitoring and interventions improve resolution rates by 6.9%-23.5% across models for problematic instances, while significantly shortening trajectories with near-zero overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Graphectory, a graph-based representation to encode temporal and semantic relations in stochastic trajectories of agentic systems. It automatically analyzes 4000 trajectories from SWE-agent and OpenHands (four LLMs) on SWE-bench, reporting that richer prompts/stronger models yield more complex graphs, resolved issues show coherent localization-patching-validation steps while unresolved ones exhibit chaotic backtracking, and even successful runs contain inefficiencies. The paper also describes real-time Graphectory/Langutory construction for online flagging of issues, with interventions (diagnostic messages or rollbacks) claimed to raise resolution rates by 6.9%-23.5% on problematic instances while shortening trajectories at near-zero overhead.

Significance. If the Graphectory construction rules and diagnostic interventions are shown to be reliable, the work supplies a concrete process-centric lens for agentic systems that complements outcome-only evaluation on benchmarks such as SWE-bench. The scale of the automated analysis and the low-overhead online technique constitute practical contributions that could inform both empirical studies of LLM-agent behavior and runtime monitoring tools in software engineering.

major comments (2)
  1. §5 (Online monitoring and intervention): The diagnostic rules that map Graphectory features (e.g., chaotic backtracking) to intervention triggers are not specified, nor is any validation against human-labeled reasoning failures or ablation against generic prompts/random rollbacks reported. Because the 6.9%-23.5% resolution gains rest on these rules correctly identifying meaningful failures, the absence of such validation is load-bearing for the central empirical claim.
  2. §3-4 (Graphectory construction and analysis): The precise node/edge construction rules that turn raw trajectories into Graphectory graphs are not detailed, and no inter-rater reliability or statistical controls (confidence intervals, multiple-comparison correction) accompany the reported strategy patterns or resolution improvements. These omissions prevent assessment of whether the observed differences in complexity and coherence are robust.
minor comments (2)
  1. Abstract: 'Langutory' is mentioned without definition or relation to Graphectory; a brief parenthetical gloss would aid readability.
  2. Throughout: Ensure that the first use of 'Graphectory' includes an explicit definition and that subsequent references maintain consistent capitalization and abbreviation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major concern point by point below and have revised the manuscript accordingly to improve clarity and rigor.

Point-by-point responses
  1. Referee: §5 (Online monitoring and intervention): The diagnostic rules that map Graphectory features (e.g., chaotic backtracking) to intervention triggers are not specified, nor is any validation against human-labeled reasoning failures or ablation against generic prompts/random rollbacks reported. Because the 6.9%-23.5% resolution gains rest on these rules correctly identifying meaningful failures, the absence of such validation is load-bearing for the central empirical claim.

    Authors: We agree that the diagnostic rules and supporting validation were insufficiently detailed. In the revised manuscript we will expand Section 5 with the exact feature-to-trigger mapping (including thresholds for chaotic backtracking and other indicators), a description of how interventions are selected, and an ablation study comparing our approach against random rollbacks and generic diagnostic prompts. We will also report results from a small human-labeled validation set of trajectories to confirm that the automated rules align with observable reasoning failures. These additions directly address the load-bearing concern for the reported gains. revision: yes

  2. Referee: §3-4 (Graphectory construction and analysis): The precise node/edge construction rules that turn raw trajectories into Graphectory graphs are not detailed, and no inter-rater reliability or statistical controls (confidence intervals, multiple-comparison correction) accompany the reported strategy patterns or resolution improvements. These omissions prevent assessment of whether the observed differences in complexity and coherence are robust.

    Authors: We acknowledge that the construction rules require greater precision. The revised Section 3 will include a complete algorithmic specification with pseudocode for node and edge creation, along with explicit handling of temporal and semantic relations. Because the entire Graphectory extraction process is deterministic and rule-based, inter-rater reliability does not apply; we will instead emphasize reproducibility by releasing the extraction code. We will also add confidence intervals for all key metrics (graph complexity, coherence scores, resolution rates) and apply Bonferroni correction for multiple comparisons across the four models and two agent frameworks. These changes will allow readers to assess the robustness of the reported patterns. revision: yes
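
    The statistical controls discussed in this exchange can be sketched in a few lines. The example below computes a two-sided Mann-Whitney U test (the test named in the paper's Figure 5 caption) with a plain normal approximation and then applies a Bonferroni threshold. The data, the function names, and the omission of tie and continuity corrections are assumptions made for brevity; this is not the authors' analysis code.

```python
import math


def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U via the normal approximation (no tie or continuity correction)."""
    pooled = sorted((value, group) for group, values in enumerate((xs, ys)) for value in values)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j shared by tied values
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(xs), len(ys)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean) / std
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability of a standard normal
    return u1, p


def bonferroni(p_values, alpha=0.05):
    """Return, per test, whether p clears the Bonferroni-adjusted threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Made-up metric values for two groups of trajectories, for illustration only.
resolved = [1, 2, 3, 4, 5]
unresolved = [6, 7, 8, 9, 10]
u_stat, p_value = mann_whitney_u(resolved, unresolved)
print(u_stat, round(p_value, 4))
print(bonferroni([p_value, 0.03, 0.2]))
```

    For real analyses a library routine such as SciPy's `scipy.stats.mannwhitneyu` would normally be preferable, since it handles ties and small samples more carefully than this approximation.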

Circularity Check

0 steps flagged

No significant circularity; central claims are empirically grounded

Full rationale

The paper introduces Graphectory as an independent graph-based representation for encoding agent trajectories and applies it to automated analysis of 4000 SWE-bench runs plus real-time intervention experiments. The reported 6.9%-23.5% resolution gains and trajectory shortening are measured as external performance deltas against baseline agent executions, not as quantities defined in terms of Graphectory parameters, diagnostic rules, or fitted values. No equations, self-definitional steps, or load-bearing self-citations reduce the claims to inputs by construction; the modeling choices and rules are tested for utility via observable outcomes rather than being tautological with those outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that agent behavior can be faithfully captured by a graph of temporal and semantic relations and that the chosen diagnostic rules identify actionable failures; Graphectory itself is an invented modeling construct without external validation cited in the abstract.

axioms (1)
  • domain assumption: Agent trajectories in LLM-based systems exhibit temporal and semantic relations that can be systematically encoded as graphs.
    This premise underpins the entire Graphectory construction and is stated in the abstract as the motivation for moving beyond outcome-centric evaluation.
invented entities (1)
  • Graphectory (no independent evidence)
    purpose: To encode temporal and semantic relations in agent trajectories for automated analysis and real-time monitoring.
    New representation introduced by the paper; no independent prior evidence or external validation is mentioned in the abstract.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evaluating Plan Compliance in Autonomous Programming Agents

    cs.SE 2026-04 unverdicted novelty 7.0

    Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade i...

  2. ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories

    cs.SE 2026-04 unverdicted novelty 7.0

    ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 11...

  3. AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits

    cs.SE 2026-04 conditional novelty 7.0

    AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.

  4. Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure

    cs.SE 2026-04 accept novelty 7.0

    Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task diffic...

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 4 Pith papers · 9 internal anchors

  1. [1]

    Aider: AI pair programming in your terminal

    2024. Aider: AI pair programming in your terminal. https://aider.chat/

  2. [2]

    Composio’s SWE agent advances open-source on SweBench with a 48.6% score using LangGraph and LangSmith

    2024. Composio’s SWE agent advances open-source on SweBench with a 48.6% score using LangGraph and LangSmith. https://blog.langchain.com/composio-swekit/

  3. [3]

    Moatless tools

    2024. Moatless tools. https://github.com/aorwall/moatless-tools

  4. [4]

    Trae Agent reached #1 on SWE-bench Verified, with Claude 4

    2025. Trae Agent reached #1 on SWE-bench Verified, with Claude 4. https://www.trae.ai/blog/product_update_0625

  5. [5]

    Michael Ahn et al. 2022. Do As I Can, Not As I Say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691 (2022)

  6. [6]

    Mistral AI. 2025. Introducing the best open-source model for coding agents. https://mistral.ai/news/devstral

  7. [7]

    Mistral AI. 2025. Upgrading agentic coding capabilities with the new Devstral models. https://mistral.ai/news/devstral-2507

  8. [8]

    Anthropic. 2025. How we built our multi-agent research system. https://www.anthropic.com/engineering/multi-agent-research-system Article

  9. [9]

    Anthropic. 2025. System Card: Claude Opus 4 & Claude Sonnet 4. https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf

  10. [10]

    Andres Bran et al. 2023. ChemCrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376 (2023)

  11. [11]

    Ira Ceka, Saurabh Pujar, Shyam Ramji, Luca Buratti, Gail Kaiser, and Baishakhi Ray. 2025. Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study. arXiv preprint arXiv:2506.08311 (2025)

  12. [12]

    Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. 2025. Why do multi-agent llm systems fail? arXiv preprint arXiv:2503.13657 (2025)

  13. [13]

    Dong Chen, Shaoxin Lin, Muhan Zeng, Daoguang Zan, Jian-Gang Wang, Anton Cheshkov, Jun Sun, Hao Yu, Guoliang Dong, Artem Aliev, et al. 2024. Coder: Issue resolving with multi-agent and task graphs. arXiv preprint arXiv:2406.01304 (2024)

  14. [14]

    DeepSeek-V3 Technical Report

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

  15. [15]

    Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, and Rebecca Qian. 2025. TRAIL: Trace Reasoning and Agentic Issue Localization. arXiv preprint arXiv:2505.08638 (2025)

  16. [16]

    Jatin Ganhotra. 2025. Do SWE-Agents Solve Multi-File Issues Like Humans? A Deep Dive into SWE-Bench Verified. https://jatinganhotra.dev/blog/swe-agents/2025/01/05/swe-bench-mutliple-files/ Blog post

  17. [17]

    Jatin Ganhotra. 2025. The Multi-File Frontier: Why SWE-Bench Verified Doesn’t Reflect Real-World Programming Challenges. https://jatinganhotra.dev/blog/swe-agents/2025/03/30/swe-bench-verified-single-file-saturation/ Blog post

  18. [18]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

  19. [19]

    Izzeddin Gur et al. 2023. Survey of Conversational Agents Powered by Large Language Models. arXiv preprint arXiv:2309.06641 (2023)

  20. [20]

    All Hands. 2024. OpenHands tools for viewing, creating and editing files. https://github.com/All-Hands-AI/OpenHands/blob/main/openhands/agenthub/codeact_agent/tools/str_replace_editor.py.

  21. [21]

    Tatsuro Inaba, Hirokazu Kiyomaru, Fei Cheng, and Sadao Kurohashi. 2023. MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting. In ACL (Short). https://aclanthology.org/2023.acl-short.130/

  22. [22]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770 (2023)

  23. [23]

    NVJK Kartik, Garvit Sapra, Rishav Hada, and Nikhil Pareek. 2025. AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production. arXiv preprint arXiv:2509.14647 (2025)

  24. [24]

    Maurice G Kendall. 1938. A new measure of rank correlation. Biometrika 30, 1-2 (1938), 81–93

  25. [25]

    Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013. Automatic patch generation learned from human-written patches. In 2013 35th international conference on software engineering (ICSE). IEEE, 802–811

  26. [26]

    R. Uday Kiran and P. Krishna Reddy. 2010. Mining periodic-frequent patterns with maximum items’ support constraints. In Proceedings of the Third Annual ACM Bangalore Conference (Bangalore, India) (COMPUTE ’10). Association for Computing Machinery, New York, NY, USA, Article 1, 8 pages. https://doi.org/10.1145/1754288.1754289

  27. [27]

    Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2011. Genprog: A generic method for automatic software repair. IEEE Transactions on Software Engineering 38, 1 (2011), 54–72

  28. [28]

    Kui Liu, Shangwen Wang, Anil Koyuncu, Kisub Kim, Tegawendé F Bissyandé, Dongsun Kim, Peng Wu, Jacques Klein, Xiaoguang Mao, and Yves Le Traon. 2020. On the efficiency of test suite based program repair: A systematic assessment of 16 automated repair systems for java programs. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineer...

  29. [29]

    Simiao Liu, Fang Liu, Liehao Li, Xin Tan, Yinghao Zhu, Xiaoli Lian, and Li Zhang. 2025. An Empirical Study on Failures in Automated Issue Solving. arXiv preprint arXiv:2509.13941 (2025)

  30. [30]

    Xiao Liu et al. 2023. AgentBench: Evaluating LLMs as Agents. In NeurIPS

  31. [31]

    Yizhou Liu, Pengfei Gao, Xinchen Wang, Jie Liu, Yexuan Shi, Zhao Zhang, and Chao Peng. 2024. MarsCode Agent: AI-native Automated Bug Fixing. CoRR (2024)

  32. [32]

    Fan Long and Martin Rinard. 2016. Automatic patch generation by learning correct code. In Proceedings of the 43rd annual ACM SIGPLAN-SIGACT symposium on principles of programming languages. 298–312

  33. [33]

    Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao

  34. [34]

    Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems 36 (2023), 43447–43478

  35. [35]

    Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics (1947), 50–60

  36. [36]

    Thomas J. McCabe. 1976. A Complexity Measure. IEEE Transactions on Software Engineering SE-2, 4 (1976), 308–320. https://doi.org/10.1109/TSE.1976.233837

  37. [37]

    Reiichiro Nakano et al. 2021. WebGPT: Browser-assisted question-answering with human feedback. In NeurIPS

  38. [38]

    OpenAI. 2024. OpenAI: Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/

  39. [39]

    OpenHands. 2025. OpenHands System Prompt. github.com/All-Hands-AI/OpenHands/blob/08118d742b564add3e970921ac8910c265ece975/evaluation/benchmarks/swe_bench/prompts/swe_default.j2

  40. [40]

    Joon Sung Park et al. 2023. Generative Agents: Interactive Simulacra of Human Behavior. In ACM CHI

  41. [41]

    Refact.ai. 2025. Refact.ai, Your Open-Source, Autonomous AI Agent. https://refact.ai/

  42. [42]

    Noah Shinn et al. 2023. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS

  43. [43]

    Mohit Shridhar et al. 2023. SayCan: Grounding language models with affordances for robotic control. In ICRA

  44. [44]

    Ramakrishnan Srikant and Rakesh Agrawal. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology (EDBT ’96). Springer-Verlag, Berlin, Heidelberg, 3–17

  45. [45]

    Dídac Surís, Sachit Menon, and Carl Vondrick. 2023. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF international conference on computer vision. 11888–11898

  46. [46]

    SWE-agent. 2024. SWE-agent System Prompt. https://github.com/SWE-agent/SWE-agent/blob/main/config/default.yaml

  47. [47]

    SWE-agent. 2024. SWE-agent tools for viewing, creating and editing files. https://github.com/SWE-agent/SWE-agent/blob/main/tools/edit_anthropic/config.yaml

  48. [48]

    Hongyuan Tao, Ying Zhang, Zhenhao Tang, Hongen Peng, Xukun Zhu, Bingchang Liu, Yingguang Yang, Ziyin Zhang, Zhaogui Xu, Haipeng Zhang, et al. 2025. Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks. arXiv preprint arXiv:2505.16901 (2025)

  49. [49]

    Wei Tao, Yucheng Zhou, Yanlin Wang, Wenqiang Zhang, Hongyu Zhang, and Yu Cheng. 2024. Magis: Llm-based multi-agent framework for github issue resolution. Advances in Neural Information Processing Systems 37 (2024), 51963–51993

  50. [50]

    Nalin Wadhwa, Atharv Sonwane, Daman Arora, Abhav Mehrotra, Saiteja Utpala, Ramakrishna B Bairi, Aditya Kanade, and Nagarajan Natarajan. 2024. MASAI: Modular architecture for software-engineering AI agents. In NeurIPS 2024 Workshop on Open-World Agents

  51. [51]

    Shunyu Wang et al. 2023. A Survey on Large Language Model based Autonomous Agents. arXiv preprint arXiv:2309.07864 (2023)

  52. [52]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2025. OpenHands: An Open Platform for A...

  53. [53]

    Westley Weimer, ThanhVu Nguyen, Claire Le Goues, and Stephanie Forrest. 2009. Automatically finding patches using genetic programming. In 2009 IEEE 31st International Conference on Software Engineering. IEEE, 364–374

  54. [54]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2024. Agentless: Demystifying llm-based software engineering agents. arXiv preprint arXiv:2407.01489 (2024)

  55. [55]

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2025. SWE-agent: agent-computer interfaces enable automated software engineering. In Proceedings of the 38th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS ’24). Curran Associates Inc., Red Hook, NY, ...

  56. [56]

    Shunyu Yao et al. 2022. ReAct: Synergizing reasoning and acting in language models. In NeurIPS

  57. [57]

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. arXiv preprint arXiv:2404.05427 (April 2024). Autonomous program repair agent demonstrating code generation and debugging capabilities