
arxiv: 2605.27238 · v1 · pith:BDY5VQLQnew · submitted 2026-05-26 · 💻 cs.SE

EviACT: An Evidence-to-Action Framework for Agentic Program Repair

Pith reviewed 2026-06-29 15:33 UTC · model grok-4.3

classification 💻 cs.SE
keywords automated program repair · LLM agents · agentic APR · evidence-driven guardrails · program repair · retrieval scaffold · test-driven repair

The pith

Coordinating retrieval, compile, and test evidence across repair stages is reported to raise agentic program repair resolve rates by 1.6-6.0 percentage points over the strongest comparable baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM-based agents for program repair often fail to use execution evidence effectively for localization, patch generation, and validation. EviACT introduces three guardrails that turn evidence into action: a retrieval scaffold that grounds the repair context, a compile gate that filters invalid edits, and a test-driven gate that verifies target-test recovery before full regression. Across four benchmarks, the framework exceeds the strongest reported comparable baselines by 1.6 to 6.0 percentage points in resolve rate and lowers reported per-bug API cost by 70.1 to 88.6 percent where baseline costs are available. Ablations suggest the gains are associated with this coordinated chain. The result matters if it makes repository-level automated repair more reliable and affordable.

Core claim

EviACT is an agentic APR framework that coordinates three evidence-driven guardrails across repair stages. The retrieval scaffold grounds repair context, the compile gate filters invalid edits, and the test-driven gate checks target-test recovery before full regression. Across four benchmarks, EviACT improves resolve rate over the strongest reported comparable baselines by 1.6-6.0 percentage points and shows 70.1-88.6% lower reported per-bug API cost where baseline costs are available. Ablations and diagnostics suggest that these gains are associated with the coordinated evidence-to-action chain, making agentic APR more effective and efficient.

What carries the argument

The evidence-to-action chain of three guardrails: retrieval scaffold for context, compile gate for validity, and test-driven gate for recovery.
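The gating order described here can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `first_accepted_patch`, the `GateLog` counters, and the three callables are hypothetical stand-ins, and the retrieval scaffold is assumed to have already produced the candidate patches upstream.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class GateLog:
    """Counters for where candidates were stopped (hypothetical bookkeeping)."""
    compile_rejected: int = 0
    green_rejected: int = 0
    regression_runs: int = 0


def first_accepted_patch(
    candidates: Iterable[str],
    compiles: Callable[[str], bool],
    passes_target_test: Callable[[str], bool],
    passes_regression: Callable[[str], bool],
    log: GateLog,
) -> Optional[str]:
    """Return the first candidate that clears all gates, cheapest check first."""
    for patch in candidates:
        # Compile gate: drop edits that do not build.
        if not compiles(patch):
            log.compile_rejected += 1
            continue
        # Test-driven gate: the originally failing target test must now pass.
        if not passes_target_test(patch):
            log.green_rejected += 1
            continue
        # Only survivors pay for the full regression run.
        log.regression_runs += 1
        if passes_regression(patch):
            return patch
    return None


# Toy usage with string stand-ins for patches (illustrative only).
log = GateLog()
result = first_accepted_patch(
    ["syntax_error", "wrong_fix", "good_fix"],
    compiles=lambda p: p != "syntax_error",
    passes_target_test=lambda p: p == "good_fix",
    passes_regression=lambda p: True,
    log=log,
)
```

In this toy run only one of three candidates reaches the expensive regression step, which is the cost argument the chain relies on.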

If this is right

  • Agentic systems can better localize bugs using grounded retrieval evidence.
  • Invalid patches are filtered earlier by compile checks, saving resources.
  • Target test recovery is verified before full regression testing.
  • Overall resolve rates increase while API costs decrease on benchmarks.
  • Agentic APR becomes more effective and efficient through coordinated evidence use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The guardrail approach might apply to other agentic coding tasks such as feature addition or refactoring.
  • Similar evidence coordination could reduce token usage in long-context agent interactions.
  • Testing the framework on additional benchmarks or with different LLMs would further validate the chain's contribution.
  • Adapting the gates for dynamic or non-deterministic environments could extend its use.

Load-bearing premise

The improvements are due to the coordinated evidence-to-action chain rather than differences in prompt design, model versions, or baseline implementations.

What would settle it

An experiment that fixes the underlying model, prompts, and baseline code while varying only the presence of the three guardrails and measures resolve rates and costs.
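One way to lay out such an experiment is a full factorial grid over the three guardrails with every other setting held constant. The sketch below only enumerates run configurations; the field names and the values in `fixed` are hypothetical placeholders, not settings taken from the paper.

```python
from itertools import product

# Guardrail names follow the paper's terminology; the keys are hypothetical.
GUARDRAILS = ("retrieval", "compile_gate", "td_gate")


def ablation_grid(fixed: dict) -> list:
    """Build one run config per on/off combination of the guardrails."""
    runs = []
    for flags in product((False, True), repeat=len(GUARDRAILS)):
        run = dict(fixed)  # identical model, prompts, and budget in every run
        run["guardrails"] = dict(zip(GUARDRAILS, flags))
        runs.append(run)
    return runs


# Placeholder values: whatever is chosen must stay the same across all runs.
fixed = {
    "model": "example-model",
    "temperature": 0.0,
    "max_iterations": 20,
    "prompt_version": "v1",
}
grid = ablation_grid(fixed)
```

The grid has eight entries, from the no-guardrail variant to the full chain, so each guardrail's contribution and their interaction can be read off runs that differ in nothing else.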

Figures

Figures reproduced from arXiv: 2605.27238 by Joost Visser, Qianru Meng, Xiao Zhang, Zhaochun Ren.

Figure 1
Figure 1. Problems in workflow-based agentic APR systems, including a mislocalization cascade (E1, E2, and E3 denote accumulated errors across different repair stages), invalid patch attempts, and high validation cost.
Figure 2
Figure 2. Overview of EviACT, a four-stage agentic APR pipeline (Setup → Localize → Patch → Verify) that converts traceable execution evidence into repair decisions. The retrieval scaffold derives suspects based on RED failure logs (produced by the TD gate); the compile gate filters invalid patch candidates; and the TD gate applies the GREEN test to pre-verify the valid patch before full regression. The Localize and Patch a…
Figure 3
Figure 3. Cross-LLM comparison of EviACT on Defects4J 2.0 from four perspectives. (a) Localization performance, measured by Hit@3 against resolve rate. (b) Patch performance, measured by COMPFAIL, the average number of candidates rejected by the compile gate per bug, against resolve rate. (c) Runtime–token efficiency, measured by average runtime and token usage. (d) Tool-use behavior, measured by tool-call count an…
Figure 4
Figure 4. Overall component ablation of EviACT. (a) Changes in resolve rate, token usage, and runtime relative to the no-guardrail baseline. (b) Tool-call distribution across the Localize and Patch stages.
Figure 5
Figure 5. Stage-level ablation analysis of EviACT. (a) Localization performance, comparing Hit@3 against localization-token usage. (b) Patch performance, comparing validation failures against patch-token usage.

5.3 Diagnostic Analysis. We compare the no-guardrail variant and full EviACT on JacksonDatabind-18 from Defects4J 2.0, where a timezone configuration error causes TestConfig.testDateFormatConfig to fail. Th…
Original abstract

LLM-based agents have moved automated program repair (APR) from fixed-context patch generation to interactive repository-level repair. However, existing agentic APR systems still struggle to use execution evidence to guide localization, patch generation, and validation. We propose EviACT (Evidence-to-Action), an agentic APR framework that coordinates three evidence-driven guardrails across repair stages. The retrieval scaffold grounds repair context, the compile gate filters invalid edits, and the test-driven gate checks target-test recovery before full regression. Across four benchmarks, EviACT improves resolve rate over the strongest reported comparable baselines by 1.6-6.0 percentage points and shows 70.1-88.6% lower reported per-bug API cost where baseline costs are available. Ablations and diagnostics suggest that these gains are associated with the coordinated evidence-to-action chain, making agentic APR more effective and efficient.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes EviACT, an agentic APR framework that coordinates three evidence-driven guardrails (retrieval scaffold, compile gate, test-driven gate) across localization, patch generation, and validation stages. It reports that this yields 1.6-6.0 percentage point gains in resolve rate over the strongest reported comparable baselines across four benchmarks, together with 70.1-88.6% lower per-bug API cost where baselines report costs, and presents ablations and diagnostics as evidence that the gains are associated with the coordinated evidence-to-action chain.

Significance. If the reported deltas can be shown to arise from the proposed guardrails under controlled re-implementations, the work would supply a concrete coordination pattern that improves both effectiveness and efficiency in repository-level APR. The explicit use of ablations to link mechanism to outcome is a methodological strength that strengthens the contribution relative to purely empirical claims.

major comments (2)
  1. [§4 (Experimental Setup)] §4 (Experimental Setup) and the abstract: the manuscript does not state whether the 'strongest reported comparable baselines' were re-executed under identical conditions (same LLM backend and version, temperature/top-p, iteration budget, and prompt templates outside the new guardrails). Without this control the 1.6-6.0 pp resolve-rate claim cannot be attributed to the three guardrails rather than prompt or model differences.
  2. [§5 (Ablations)] Results tables and §5 (Ablations): no statistical significance tests, confidence intervals, or variance estimates are reported for the resolve-rate deltas across the four benchmarks; this is load-bearing for the cross-benchmark claim that the coordinated chain produces consistent gains.
minor comments (1)
  1. The abstract could briefly indicate the four benchmarks by name and the number of bugs per benchmark to allow readers to assess scope without consulting the full evaluation section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below.

  1. Referee: [§4 (Experimental Setup)] §4 (Experimental Setup) and the abstract: the manuscript does not state whether the 'strongest reported comparable baselines' were re-executed under identical conditions (same LLM backend and version, temperature/top-p, iteration budget, and prompt templates outside the new guardrails). Without this control the 1.6-6.0 pp resolve-rate claim cannot be attributed to the three guardrails rather than prompt or model differences.

    Authors: We acknowledge the point. The abstract and evaluation explicitly compare against the strongest reported results from the literature rather than re-executions under matched conditions. The manuscript therefore does not control for LLM backend, temperature, prompt templates, or iteration budget, and the observed deltas cannot be attributed solely to the guardrails. We will revise the abstract and §4 to state this limitation explicitly and to qualify the claims accordingly. revision: yes

  2. Referee: [§5 (Ablations)] Results tables and §5 (Ablations): no statistical significance tests, confidence intervals, or variance estimates are reported for the resolve-rate deltas across the four benchmarks; this is load-bearing for the cross-benchmark claim that the coordinated chain produces consistent gains.

    Authors: We agree that the lack of statistical tests weakens the cross-benchmark claims. In the revision we will add bootstrap confidence intervals for the resolve-rate deltas and, where sample sizes permit, paired significance tests (e.g., McNemar’s test) to supply variance estimates and assess consistency of the gains. revision: yes
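The two analyses named in this response can be sketched with the standard library alone. This is an illustrative sketch, not the authors' analysis code: the function names are hypothetical, per-bug outcomes are assumed to be 0/1 lists aligned by bug, and the toy vectors at the bottom are made up rather than taken from any benchmark.

```python
import random
from math import comb


def paired_bootstrap_ci(base, new, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the resolve-rate delta (new - base).

    Bugs are resampled with replacement, keeping each bug's pair of
    outcomes together so the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(base)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(new[i] - base[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def mcnemar_exact(base, new):
    """Two-sided exact McNemar p-value computed from the discordant pairs."""
    b = sum(1 for x, y in zip(base, new) if x == 1 and y == 0)  # only baseline resolves
    c = sum(1 for x, y in zip(base, new) if x == 0 and y == 1)  # only new system resolves
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up per-bug outcomes for ten bugs (1 = resolved), for illustration only.
base = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
new = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
lo, hi = paired_bootstrap_ci(base, new)
p_value = mcnemar_exact(base, new)
```

Both functions need per-bug outcomes for the two systems on the same bugs, which is only available when the baseline is re-executed or publishes per-bug results; aggregate resolve rates from the literature are not enough.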

Circularity Check

0 steps flagged

No circularity in empirical evaluation chain


The manuscript presents an empirical framework (EviACT) with three guardrails evaluated on four benchmarks via resolve-rate and cost metrics against external baselines, plus ablations. No derivation chain, equations, fitted parameters presented as predictions, or self-citation load-bearing steps appear in the abstract or described content. Central claims rest on direct benchmark comparisons rather than any reduction to self-defined inputs or prior author results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.1-grok · 5687 in / 1082 out tokens · 50305 ms · 2026-06-29T15:33:25.645233+00:00 · methodology


Reference graph

Works this paper leans on

25 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    Yiwei Hu, Zhen Liu, Kedie Shu, Shenghua Guan, Deqing Zou, Shouhuai Xu, Bin Yuan, and Hai Jin

    Tdflow: Agentic workflows for test driven software engineering. arXiv e-prints, pages arXiv–2510. Yiwei Hu, Zhen Liu, Kedie Shu, Shenghua Guan, Deqing Zou, Shouhuai Xu, Bin Yuan, and Hai Jin. 2025. SoK: Automated vulnerability repair: Methods, tools, and assessments. In 34th USENIX Security Symposium (USENIX Security 25), pages 4421–4440. Carlos E J...

  2. [2]

    Qwen2.5 Technical Report

    Trident: Controlling side effects in automated program repair. IEEE Transactions on Software Engineering, 48(12):4717–4732. Qwen Team. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Haifeng Ruan, Yuntong Zhang, and Abhik Roychoudhury. 2024. Specrover: Code intent extraction via llms. arXiv preprint arXiv:2408.02232. Xunzhu Tang, Jiechao...

  3. [3]

    Setup: check out the buggy revision and prepare the environment

  4. [4]

    RED: run the originally failing target test and collect failure evidence

  5. [5]

    Localize: use RED evidence and the repository index to identify suspicious files, symbols, and line ranges

  6. [6]

    Patch: generate one minimal candidate edit over the localized working set

  7. [7]

    Compile: apply the edit and run syntax or build checks

  8. [8]

    GREEN: rerun the originally failing target test and require it to pass

  9. [9]

    When asked to localize, output only the requested localization JSON

    Validate: run full regression validation before accepting the patch. When asked to localize, output only the requested localization JSON. When asked to patch, output only the requested edit JSON or unified diff. Do not include explanations unless explicitly requested by the controller. B.2 Localization Prompt Localization Prompt Goal: localize the bug usi...

  10. [10]

    Read the RED log first

  11. [11]

    Extract the failing test, assertion or exception message, and relevant stack frames

  12. [12]

    Derive primary symbols from stack frames, test names, assertion text, and error messages

  13. [13]

    Query the repository index using symbol lookup before textual search

  14. [14]

    Read matched AST spans before selecting suspects

  15. [15]

    Inspect caller, callee, import, inheritance, or reference spans only when they are structurally connected to the RED evidence

  16. [16]

    red_test

    Keep the working set small and evidence-grounded. Output only valid JSON in the following schema: { "red_test": "name of the failing target test", "red_failure_summary": "one-sentence failure summary", "failure_contract": [ "behavior that should hold but is violated" ], "primary_symbols": [ "symbol names derived from RED evidence" ], "suspects": [ { "file...

  17. [17]

    Patch only files that are supported by the suspect spans unless additional context is necessary

  18. [18]

    Preserve unrelated behavior and avoid broad rewrites

  19. [19]

    Do not hardcode behavior only for the observed failing test

  20. [20]

    Ensure that the edit applies cleanly and preserves syntax

  21. [21]

    path": "relative/path/to/file

    If the previous candidate failed to apply, compile, or pass GREEN, use the diagnostic context to revise only the relevant part of the patch. Output only valid JSON in the following schema: [ { "path": "relative/path/to/file", "ops": [ { "type": "replace", "start_line": 10, "end_line": 12, "text": "replacement code with preserved indentation\n" } ] } ] The...

  22. [22]

    Setup Process: Run test suite Process: Run test suite Output: test.trigger.log Output: red.log [TD RED]

  23. [23]

    Exception

    Localize Process: • Read log file • Search for "Exception" keyword • Read MappingIterator.java Process: • Read red.log [TD RED] • Lookup ObjectMapper#setDateFormat • Read span of BaseSettings.java [Retrieval] Output: Located file: MappingIterator.java × Output: Located file: BaseSettings.java ✓ API calls: 11, Tokens: 35.4K API calls: 6, Tokens: 20.6K

  24. [24]

    Patch Process: • Read MappingIterator.java • Generate 1 attempt • Generate 2 attempt • Generate 3 attempt • Generate 4 attempt • ... Process: • Read span of TestConfig.java and BaseSettings.java [Retrieval] • Generate 1 patch • Compile check [Compile] • Verify with green.log [TD GREEN] Output: 19 attempts Output: 1 attempt, success API calls: 20, Tokens: 203.9K, Comp...

  25. [25]

    Verify Process: • Run full test suite • 19 failures remain Process: • Run full test suite • All tests passed Output: Result: Failed, Time: 1129s, Total tokens: 239.3K Output: Result: Success, Time: 153.6s, Total tokens: 50.9K
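The totals in the Verify row above can be turned into relative savings with a few lines of arithmetic. The sketch assumes the first Output (Failed, 1129s, 239.3K tokens) belongs to the no-guardrail variant and the second (Success, 153.6s, 50.9K tokens) to full EviACT, which matches the guardrail tags in the trace; the helper name `relative_reduction` is hypothetical. These figures describe one diagnostic bug, not benchmark averages.

```python
def relative_reduction(baseline: float, treated: float) -> float:
    """Fractional reduction of `treated` relative to `baseline`."""
    return 1.0 - treated / baseline


# Totals quoted in the Verify row for the single diagnostic bug.
no_guardrail = {"time_s": 1129.0, "tokens_k": 239.3}
eviact = {"time_s": 153.6, "tokens_k": 50.9}

for key in ("time_s", "tokens_k"):
    pct = 100 * relative_reduction(no_guardrail[key], eviact[key])
    print(f"{key}: {pct:.1f}% lower")
# prints:
# time_s: 86.4% lower
# tokens_k: 78.7% lower
```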