
arxiv: 2606.06324 · v1 · pith:HSJLAE76new · submitted 2026-06-04 · 💻 cs.SE · cs.MA

From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws

Pith reviewed 2026-06-28 00:06 UTC · model grok-4.3

classification 💻 cs.SE cs.MA
keywords LLM agents · harness repair · failure diagnosis · execution traces · agent reliability · SWE-Bench · trace representation · benchmark evaluation

The pith

HarnessFix diagnoses LLM agent failures from execution traces and repairs the responsible harness layers, raising held-out test performance by 15.2%–50.0% over the initial harnesses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HarnessFix as a way to move from vague agent fixes to precise harness repairs. It builds a normalized Harness-aware Trace Intermediate Representation from raw traces and code, then links each failure to the exact step and harness layer responsible. This attribution lets the system collect recurring flaw patterns and apply narrow repair operators that are validated before use. The result is measurable lifts on held-out tests across four agent benchmarks while beating both human and automated baselines. A sympathetic reader would care because current self-improvement loops often change the wrong parts of an agent setup and still leave reliability gaps.

Core claim

HarnessFix compiles raw execution traces and harness code into a Harness-aware Trace Intermediate Representation (HTIR), which normalizes fragmented trajectory evidence and captures step-level provenance and control-flow relations. It attributes failures to responsible trajectory steps and harness layers, consolidates recurring diagnoses into actionable flaw records, and maps them to scoped repair operators. Finally, HarnessFix generates and validates harness patches under flaw-specific repair specifications to reduce target flaws without introducing unacceptable regressions, yielding 15.2%–50.0% gains on held-out tests over initial harnesses on SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA, and AppWorld.

What carries the argument

Harness-aware Trace Intermediate Representation (HTIR) that normalizes trajectory evidence and records step-level provenance plus control-flow to support precise attribution of failures to harness layers.
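
As a rough illustration of what such a representation might hold, the sketch below models a trace as steps tagged with a harness layer, a provenance pointer, and control-flow edges. Every name here (`Layer`, `Step`, `HTIR`, `upstream`) and every file path is hypothetical; the layer names follow the abstract's list of harness responsibilities, and the paper's actual schema is not shown in the text available on this page.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    # Assumed expansion of ETCLOVG, following the abstract's list.
    EXECUTION = "E"
    TOOLS = "T"
    CONTEXT = "C"
    LIFECYCLE = "L"
    OBSERVABILITY = "O"
    VERIFICATION = "V"
    GOVERNANCE = "G"


@dataclass
class Step:
    step_id: int
    layer: Layer   # harness layer that produced or handled this step
    source: str    # provenance: hypothetical pointer into harness code
    summary: str   # normalized description of what happened


@dataclass
class HTIR:
    steps: dict[int, Step] = field(default_factory=dict)
    # Control-flow edges stored as step -> list of direct predecessors.
    preds: dict[int, list[int]] = field(default_factory=dict)

    def add_step(self, step: Step, after: tuple[int, ...] = ()) -> None:
        self.steps[step.step_id] = step
        self.preds[step.step_id] = list(after)

    def upstream(self, step_id: int) -> list[Step]:
        """Return every step that can reach step_id, nearest first."""
        seen: list[int] = []
        frontier = list(self.preds.get(step_id, []))
        while frontier:
            current = frontier.pop(0)
            if current not in seen:
                seen.append(current)
                frontier.extend(self.preds.get(current, []))
        return [self.steps[i] for i in seen]


# Illustrative trace; the paths and summaries are made up.
trace = HTIR()
trace.add_step(Step(1, Layer.CONTEXT, "prompts/system.md", "task instructions loaded"))
trace.add_step(Step(2, Layer.TOOLS, "tools/shell.py", "shell output truncated"), after=(1,))
trace.add_step(Step(3, Layer.VERIFICATION, "guards/complete.py", "completion accepted"), after=(2,))

candidates = trace.upstream(3)
```

Walking the control-flow edges backwards from a failing step gives an ordered list of candidate steps, each carrying the layer it belongs to, which is the kind of evidence an attribution stage would need.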

If this is right

  • HarnessFix improves held-out test performance over the initial harnesses by 15.2%–50.0% across the four benchmarks.
  • It outperforms both human-designed harnesses and self-evolution baselines.
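  • It surfaces recurring harness-flaw patterns across the ETCLOVG layers (apparently the seven harness responsibilities listed in the abstract: execution environments, tool interfaces, context, lifecycle orchestration, observability, verification, and governance).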

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trace-normalization step could be reused to debug non-LLM agent frameworks that also separate execution environment from policy.
  • Recurring flaw patterns identified by the method might serve as a checklist for initial harness design rather than post-hoc repair.
  • If HTIR-style provenance tracking were added to standard agent runtimes, the cost of later diagnosis would drop because the necessary records would already exist.
  • The approach suggests that agent reliability gains may come more from fixing the surrounding harness than from further prompt or model changes.

Load-bearing premise

Compiling raw execution traces and harness code into HTIR accurately normalizes fragmented trajectory evidence and captures step-level provenance and control-flow relations sufficiently to attribute failures to the correct harness layers without systematic misattribution.

What would settle it

A side-by-side comparison on a held-out set of failed trajectories in which human experts identify different harness layers as responsible than the layers chosen by HarnessFix attribution.
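
One way to quantify such a comparison is chance-corrected agreement between the expert labels and the automated labels. The sketch below computes Cohen's kappa over per-trajectory layer labels; the function name and the toy labels are illustrative and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels only: one harness-layer letter per failed trajectory.
expert = ["V", "V", "T", "C", "T", "V"]
system = ["V", "T", "T", "C", "T", "V"]
kappa = cohens_kappa(expert, system)
```

A kappa near 1 would indicate that the automated attribution tracks expert judgment closely, while a value near 0 would indicate agreement no better than chance, which is the outcome that would undermine the load-bearing premise.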

Figures

Figures reproduced from arXiv: 2606.06324 by Junjie Wang, Mengzhuo Chen, Qing Wang, Yawen Wang, Zhe Liu.

Figure 1
Figure 1: Overview of HARNESSFIX.
… code as a shorthand for the implementation artifacts that realize these mechanisms, including both executable source code and model-facing prompt or configuration resources. Accordingly, harness repair means modifying these harness artifacts, rather than updating the base-model parameters, to correct diagnosed reliability flaws. To describe the mechanisms that make up an agent harne…
Figure 2
Figure 2: Illustrative example for the AppWorld completion-guard harness flaw.
Original abstract

LLM-based agents increasingly rely on harnesses that provide execution environments, tool interfaces, context, lifecycle orchestration, observability, verification, and governance. Existing self-improving agents and automatic harness evolution methods mainly improve agents through runtime supervision, prompt optimization, workflow search, or harness modification based on final outcomes. However, they often fail to diagnose where the responsible evidence lies in failed trajectories and which harness layer causes the unreliable behavior, resulting in broad, indirect, or poorly scoped changes. This paper proposes HarnessFix, a trace-guided framework for diagnosing agent failures and repairing agent harnesses. HarnessFix compiles raw execution traces and harness code into a Harness-aware Trace Intermediate Representation (HTIR), which normalizes fragmented trajectory evidence and captures step-level provenance and control-flow relations. It then attributes failures to responsible trajectory steps and harness layers, consolidates recurring diagnoses into actionable flaw records, and maps them to scoped repair operators. Finally, HarnessFix generates and validates harness patches under flaw-specific repair specifications to reduce target flaws without introducing unacceptable regressions. We evaluate HarnessFix on SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA and AppWorld. Across these benchmarks, HarnessFix improves held-out test performance over the initial harnesses by 15.2%–50.0%, outperforms human-designed and self-evolution baselines, and reveals recurring harness-flaw patterns across ETCLOVG layers.
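
The final stage described in the abstract, accepting a patch only if it reduces the target flaw without unacceptable regressions, could be expressed as a simple gate. The sketch below is one hypothetical reading of that sentence: the `PatchOutcome` fields and the `max_regressions` budget are assumptions, and the paper's actual repair specifications are not given in the abstract.

```python
from dataclasses import dataclass


@dataclass
class PatchOutcome:
    flaw_before: int    # trajectories showing the target flaw before the patch
    flaw_after: int     # trajectories showing it after the patch
    newly_failing: int  # tasks that passed before the patch and fail after it


def accept_patch(outcome: PatchOutcome, max_regressions: int = 0) -> bool:
    """Accept only if the target flaw shrinks and regressions stay in budget."""
    reduces_flaw = outcome.flaw_after < outcome.flaw_before
    within_budget = outcome.newly_failing <= max_regressions
    return reduces_flaw and within_budget


# Toy outcomes, not measurements from the paper.
good = accept_patch(PatchOutcome(flaw_before=8, flaw_after=2, newly_failing=0))
risky = accept_patch(PatchOutcome(flaw_before=8, flaw_after=1, newly_failing=3))
```

Keeping the two conditions separate makes the trade-off explicit: the second patch removes more of the target flaw but is rejected because it breaks tasks that previously passed.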

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces HarnessFix, a trace-guided framework for diagnosing failures in LLM agent harnesses and applying scoped repairs. It compiles execution traces and harness code into a Harness-aware Trace Intermediate Representation (HTIR) that normalizes trajectories and captures step-level provenance and control-flow relations, attributes failures to responsible steps and ETCLOVG harness layers, consolidates recurring flaw patterns, and maps them to repair operators. Patches are generated and validated under flaw-specific specifications. Evaluation across SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA, and AppWorld reports 15.2%–50.0% held-out performance gains over initial harnesses, outperforming human-designed and self-evolution baselines, while identifying recurring flaw patterns.

Significance. If the attribution mechanism and empirical results hold, the work offers a targeted alternative to outcome-based or broad prompt/workflow optimization for agent reliability. The multi-benchmark evaluation and explicit identification of layer-specific flaw patterns across ETCLOVG provide concrete, falsifiable contributions. The introduction of HTIR as an intermediate representation for provenance-preserving trace analysis is a methodological strength that could be reusable beyond this framework.

major comments (2)
  1. [Methods (HTIR and attribution)] Methods section on HTIR construction and attribution: The central claim that HTIR accurately normalizes fragmented traces, preserves control-flow edges, and correctly maps failures to ETCLOVG layers without systematic misattribution lacks an independent validation (e.g., inter-annotator agreement on a sample of attributions or an ablation removing provenance edges). This assumption is load-bearing for the reported 15.2%–50.0% gains, as misattribution would cause scoped repairs to become non-specific edits.
  2. [Evaluation and Results] Results section reporting performance: The abstract and evaluation claim consistent outperformance and specific percentage gains, but the provided text supplies no error bars, statistical significance tests, exclusion criteria for trajectories, or details on how held-out sets were constructed. Without these, the empirical superiority over baselines cannot be verified as robust.
minor comments (2)
  1. [Abstract] The acronym ETCLOVG is used without an explicit expansion or table defining the layers in the abstract; a dedicated table or early section defining each layer would improve readability.
  2. [Methods] Notation for repair operators and flaw records could be formalized with a small example in the methods to clarify the mapping from consolidated diagnoses to patches.
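
The kind of small example requested in the second minor comment might look like the following. It is purely a hypothetical sketch: the `Diagnosis` and `FlawRecord` fields, the operator names, the flaw labels, and the `min_support` threshold are invented here for illustration and do not come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    trajectory_id: str
    layer: str    # harness layer letter, e.g. "V" for verification
    pattern: str  # short label for the diagnosed flaw


@dataclass
class FlawRecord:
    layer: str
    pattern: str
    trajectories: list[str]
    operator: str  # scoped repair operator chosen for this flaw


# Hypothetical mapping from harness layer to a scoped repair operator.
OPERATORS = {
    "V": "add_completion_check",
    "T": "fix_tool_schema",
    "C": "edit_prompt_section",
}


def consolidate(diagnoses: list[Diagnosis], min_support: int = 2) -> list[FlawRecord]:
    """Group diagnoses by (layer, pattern) and keep the recurring ones."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for d in diagnoses:
        groups[(d.layer, d.pattern)].append(d.trajectory_id)
    records = []
    for (layer, pattern), ids in sorted(groups.items()):
        if len(ids) >= min_support:
            operator = OPERATORS.get(layer, "manual_review")
            records.append(FlawRecord(layer, pattern, ids, operator))
    return records


diagnoses = [
    Diagnosis("t1", "V", "premature-completion"),
    Diagnosis("t2", "V", "premature-completion"),
    Diagnosis("t3", "T", "truncated-output"),
]
records = consolidate(diagnoses)
```

In this toy run only the verification-layer flaw recurs, so it alone becomes a flaw record with a scoped operator; the single tool-layer diagnosis falls below the support threshold and is left out.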

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback and for recognizing the potential of HTIR as a reusable intermediate representation. We address each major comment below with specific plans for revision where appropriate.

Point-by-point responses
  1. Referee: Methods section on HTIR construction and attribution: The central claim that HTIR accurately normalizes fragmented traces, preserves control-flow edges, and correctly maps failures to ETCLOVG layers without systematic misattribution lacks an independent validation (e.g., inter-annotator agreement on a sample of attributions or an ablation removing provenance edges). This assumption is load-bearing for the reported 15.2%–50.0% gains, as misattribution would cause scoped repairs to become non-specific edits.

    Authors: We agree that the original submission did not include an explicit independent validation such as inter-annotator agreement or a provenance-edge ablation. The attribution logic is described in detail in Section 3.2 and is supported by the end-to-end gains and the fact that HarnessFix outperforms both human-designed patches and outcome-only self-evolution baselines; however, these results do not directly isolate attribution accuracy. In the revision we will add (i) a small-scale manual validation on 50 randomly sampled trajectories with two annotators reporting Cohen’s kappa, and (ii) an ablation that removes provenance and control-flow edges from HTIR while keeping all other components fixed. These additions will appear in Section 4.3 and Appendix C. revision: yes

  2. Referee: Results section reporting performance: The abstract and evaluation claim consistent outperformance and specific percentage gains, but the provided text supplies no error bars, statistical significance tests, exclusion criteria for trajectories, or details on how held-out sets were constructed. Without these, the empirical superiority over baselines cannot be verified as robust.

    Authors: We accept that the current manuscript lacks the requested statistical and procedural details. The held-out sets were constructed via a fixed random 70/30 split per benchmark with seed 42, and trajectories were excluded only if they contained fewer than three steps or were truncated by the environment; however, these facts are stated only briefly. In the revision we will report (i) mean and standard deviation over five independent runs with different seeds, (ii) paired t-test p-values against each baseline, (iii) explicit exclusion criteria, and (iv) a table detailing the exact train/held-out sizes and construction procedure. These will be placed in Section 5.1 and Table 2. revision: yes
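
The procedure described in this response can be sketched as follows. The trajectory fields (`steps`, `truncated`), the helper names, and the toy scores are assumptions made for illustration; only the 70/30 split, the seed of 42, and the two exclusion criteria come from the response itself. The paired t statistic is computed by hand so that the block needs only the standard library; turning it into a p-value would additionally require the t distribution with n - 1 degrees of freedom.

```python
import math
import random
import statistics


def keep(trajectory: dict) -> bool:
    """Exclusion criteria from the response: too short or truncated."""
    return len(trajectory["steps"]) >= 3 and not trajectory["truncated"]


def split_70_30(task_ids: list[str], seed: int = 42) -> tuple[list[str], list[str]]:
    """Fixed random split; sorting first makes it independent of input order."""
    ids = sorted(task_ids)
    random.Random(seed).shuffle(ids)
    cut = (7 * len(ids)) // 10
    return ids[:cut], ids[cut:]


def paired_t(a: list[float], b: list[float]) -> float:
    """Paired t statistic for per-seed scores of two systems."""
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


all_tasks = [f"task-{i}" for i in range(10)]
train, held_out = split_70_30(all_tasks)

# Toy per-seed scores for five runs; not results from the paper.
repaired = [0.62, 0.60, 0.64, 0.61, 0.63]
baseline = [0.50, 0.51, 0.49, 0.52, 0.48]
mean_gain = statistics.mean(repaired) - statistics.mean(baseline)
t_stat = paired_t(repaired, baseline)
```

Pairing the scores by seed matters here because each seed fixes the same sources of run-to-run variation for both systems, so the test looks at per-seed differences rather than at two unrelated samples.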

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on external benchmarks

full rationale

The paper describes an empirical framework (HarnessFix) that compiles traces into HTIR, attributes failures, and applies repairs, with all performance claims (15.2%–50.0% gains) measured directly against held-out test sets on SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA, and AppWorld. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims rest on observable benchmark outcomes rather than on any step that reduces to its own inputs by construction, so the evaluation is not circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Review performed on abstract only; no explicit free parameters, axioms, or invented entities beyond the named framework components are stated in the provided text.

invented entities (1)
  • Harness-aware Trace Intermediate Representation (HTIR) · no independent evidence
    purpose: Normalizes fragmented trajectory evidence and captures step-level provenance and control-flow relations
    Introduced in the abstract as the core representation enabling diagnosis; no independent evidence supplied.

pith-pipeline@v0.9.1-grok · 5795 in / 1225 out tokens · 19665 ms · 2026-06-28T00:06:04.528012+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering

    cs.SE 2026-07 unverdicted novelty 5.0

    A case study of AI-agentic software development yields a process model explaining how engineering judgment converts recurring structural failures into durable governance mechanisms.
