arxiv: 2604.17464 · v1 · submitted 2026-04-19 · 💻 cs.SE · cs.AI

Recognition: unknown

Project Prometheus: Bridging the Intent Gap in Agentic Program Repair via Reverse-Engineered Executable Specifications

Yongchao Wang , Zhiqiu Huang


Pith reviewed 2026-05-10 05:32 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords automated program repair · agentic workflows · executable specifications · Gherkin · intent gap · Defects4J · behavior-driven development · multi-agent systems


The pith

Reverse-engineering executable Gherkin specifications from runtime failures lets agentic program repair achieve 93.97% correct patches and rescue 74.4% of bugs that blind agents miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that the intent gap in agentic automated program repair arises mainly from misalignment between generated patches and developer goals, and that this can be closed by inferring executable specifications first rather than generating code directly. It uses a multi-agent system to derive Gherkin contracts from failure reports and validates them through an RQA loop that treats ground-truth code as a proxy oracle. A sympathetic reader would care because the approach claims dramatically higher repair rates on the Defects4J benchmark without needing larger models, producing minimal instead of over-engineered fixes. The work positions executable specifications, whether supplied or reverse-engineered, as the key to reliable future APR.

Core claim

Prometheus prioritizes specification inference over code generation by reverse-engineering Gherkin executable contracts from runtime failure reports via a multi-agent architecture grounded in Behavior-Driven Development. A Requirement Quality Assurance loop validates the inferred specifications against ground-truth code as a proxy oracle to prevent hallucination of intent. On 680 Defects4J defects this yields a 93.97% correct patch rate overall and a 74.4% rescue rate, repairing 119 complex bugs that a strong blind agent could not resolve. Qualitative results show that explicit intent steers agents toward precise, minimal corrections.
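To make "executable contract" concrete, the sketch below pairs a hypothetical Gherkin scenario with a tiny Python runner. The scenario text, the `clamp` functions, and the step-matching logic are all illustrative assumptions, not taken from the paper or from any Defects4J bug. A real project would normally use an existing BDD tool rather than hand-rolled parsing.

```python
import re

# Hypothetical Gherkin scenario; not taken from the paper or Defects4J.
SCENARIO = """\
Feature: Clamping a value to a range
  Scenario: Value above the upper bound
    Given the range 0 to 10
    When I clamp 15
    Then the result is 10
"""

def clamp_buggy(value, low, high):
    # Illustrative bug: the upper bound is never enforced.
    return max(value, low)

def clamp_fixed(value, low, high):
    return min(max(value, low), high)

def run_scenario(text, clamp):
    """Execute the Given/When/Then steps against `clamp`; return True on pass."""
    state = {}
    for line in (raw.strip() for raw in text.splitlines()):
        if m := re.fullmatch(r"Given the range (-?\d+) to (-?\d+)", line):
            state["low"], state["high"] = int(m[1]), int(m[2])
        elif m := re.fullmatch(r"When I clamp (-?\d+)", line):
            state["result"] = clamp(int(m[1]), state["low"], state["high"])
        elif m := re.fullmatch(r"Then the result is (-?\d+)", line):
            if state["result"] != int(m[1]):
                return False
    return True

buggy_passes = run_scenario(SCENARIO, clamp_buggy)
fixed_passes = run_scenario(SCENARIO, clamp_fixed)
print(buggy_passes, fixed_passes)
```

The scenario fails on the buggy version and passes on the fixed one, which is the property a specification needs if it is to pin down the intended behavior rather than merely describe it.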

What carries the argument

The Requirement Quality Assurance (RQA) Loop, a validation mechanism that uses ground-truth code as a proxy oracle to confirm that reverse-engineered Gherkin specifications accurately capture developer intent before patch generation proceeds.

If this is right

Agentic APR systems can reach high repair rates by aligning patches to verified executable specifications instead of relying on direct generation.
Explicit specifications guide agents toward minimal, intent-preserving corrections rather than structurally invasive over-engineering.
The future of APR depends on the ability to obtain or infer executable specifications, shifting emphasis away from ever-larger language models.
Rescue rates on hard bugs improve substantially when intent is made explicit through executable contracts derived from failures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reverse-engineering-plus-validation pattern could be tested on requirements-driven code generation tasks outside of bug repair.
Real-world deployment would require alternative validation methods once ground-truth oracles are unavailable.
Many current agent failures in software tasks may trace to missing explicit intent representations rather than insufficient reasoning capacity.
The framework suggests that pre-existing executable specifications could be even more effective than reverse-engineered ones if widely adopted.

Load-bearing premise

Ground-truth code can serve as an unbiased proxy oracle for validating reverse-engineered specifications when measuring rescue rates on known benchmark bugs.

What would settle it

Apply the full Prometheus pipeline to a fresh benchmark of defects where no ground-truth implementations are supplied to the RQA loop and measure whether the rescue rate on previously unrepairable bugs drops sharply or remains near 74%.
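The proposed experiment amounts to computing the same rescue rate under two validation regimes. The sketch below assumes that rescue rate means the share of baseline-unrepaired bugs that the pipeline repairs, which matches the pith's phrasing but is not formally defined in the text shown here. The bug IDs and outcomes are made-up toy data.

```python
def rescue_rate(baseline_failed, pipeline_fixed):
    """Share of bugs the baseline failed on that the pipeline fixed."""
    if not baseline_failed:
        raise ValueError("baseline_failed must be non-empty")
    rescued = baseline_failed & pipeline_fixed
    return len(rescued) / len(baseline_failed)

# Toy data only: hypothetical bug IDs, not results from the paper.
baseline_failed = {"bug-1", "bug-2", "bug-3", "bug-4"}
fixed_with_oracle = {"bug-1", "bug-2", "bug-3", "bug-9"}
fixed_without_oracle = {"bug-1", "bug-9"}

with_oracle = rescue_rate(baseline_failed, fixed_with_oracle)
without_oracle = rescue_rate(baseline_failed, fixed_without_oracle)
drop = with_oracle - without_oracle
print(f"with oracle: {with_oracle:.2f}, without: {without_oracle:.2f}, drop: {drop:.2f}")
```

A large `drop` on real data would indicate that the oracle, rather than specification inference alone, carries much of the reported gain.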

Figures

Figures reproduced from arXiv: 2604.17464 by Yongchao Wang, Zhiqiu Huang.

**Figure 1.** The Prometheus Framework. Phase 1: The Architect (Gemini-3.0-Pro) performs root cause analysis and synthesizes a Gherkin specification S. Phase 2: The Engineer validates S through Sandwich Verification: S must fail on C_buggy and pass on C_fixed (or human review in production). If verification fails, S is regenerated. Phase 3: The Fixer (Qwen-3.0-Coder) performs a specification-guided surgical repair, produc…
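The Phase 2 check in the caption can be sketched as a small loop. Everything below is a hypothetical stand-in: `generate_spec`, the two runner callables, and the retry limit are assumptions for illustration, and the paper's actual interfaces are not shown in this excerpt.

```python
from typing import Callable, Optional

Spec = str
Runner = Callable[[Spec], bool]  # returns True when the spec passes

def sandwich_verify(spec: Spec, passes_on_buggy: Runner, passes_on_fixed: Runner) -> bool:
    """Accept a spec only if it fails on the buggy code and passes on the fixed code."""
    return (not passes_on_buggy(spec)) and passes_on_fixed(spec)

def rqa_loop(generate_spec: Callable[[int], Spec],
             passes_on_buggy: Runner,
             passes_on_fixed: Runner,
             max_attempts: int = 3) -> Optional[Spec]:
    """Regenerate specs until one survives verification or attempts run out."""
    for attempt in range(max_attempts):
        spec = generate_spec(attempt)
        if sandwich_verify(spec, passes_on_buggy, passes_on_fixed):
            return spec
    return None  # assumed fallback; the caption does not say what happens here

# Toy stand-ins: specs are labels, and the runners look up canned outcomes.
candidates = ["too-weak", "wrong-intent", "discriminating"]
buggy_outcomes = {"too-weak": True, "wrong-intent": False, "discriminating": False}
fixed_outcomes = {"too-weak": True, "wrong-intent": False, "discriminating": True}

accepted = rqa_loop(lambda i: candidates[i],
                    buggy_outcomes.__getitem__,
                    fixed_outcomes.__getitem__)
print(accepted)
```

Note that `passes_on_fixed` is the step that requires the known-correct code, which is the input a blind baseline never sees.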

Original abstract

The transition from neural machine translation to agentic workflows has revolutionized Automated Program Repair (APR). However, existing agents, despite their advanced reasoning capabilities, frequently suffer from the "Intent Gap": the misalignment between the generated patch and the developer's original intent. Current solutions relying on natural language summaries or adversarial sampling often fail to provide the deterministic constraints required for surgical repairs. In this paper, we introduce Prometheus, a novel framework that bridges this gap by prioritizing Specification Inference over code generation. We employ Behavior-Driven Development (BDD) as an executable contract, utilizing a multi-agent architecture to reverse-engineer Gherkin specifications from runtime failure reports. To resolve the "Hallucination of Intent," we propose a Requirement Quality Assurance (RQA) Loop, a mechanism that leverages ground-truth code as a proxy oracle to validate inferred specifications. We evaluated Prometheus on 680 defects from the Defects4J benchmark. The results are transformative: our framework achieved a total correct patch rate of 93.97% (639/680). More significantly, it demonstrated a Rescue Rate of 74.4%, successfully repairing 119 complex bugs that a strong blind agent failed to resolve. Qualitative analysis reveals that explicit intent guides agents away from structurally invasive over-engineering toward precise, minimal corrections. Our findings suggest that the future of APR lies not in larger models, but in the capability to align code with verified Executable Specifications, whether pre-existing or reverse-engineered.
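The headline percentages can be sanity-checked with a few lines of arithmetic. The first figure follows directly from the stated 639/680. The abstract does not state the denominator behind the 74.4% rescue rate, so the second calculation only derives the denominator that would be implied if 119 is the count of rescued bugs; that reading is an assumption.

```python
correct, total = 639, 680
patch_rate = 100 * correct / total
print(f"correct patch rate: {patch_rate:.2f}%")

# Assumption: 119 is the number of rescued bugs and
# 74.4% is rescued / baseline failures.
rescued, stated_rate = 119, 74.4
implied_baseline_failures = rescued / (stated_rate / 100)
print(f"implied baseline failures: {implied_baseline_failures:.1f}")
```

Under that reading the blind baseline would have failed on roughly 160 of the 680 bugs, but nothing in the text shown here confirms that count.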

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Prometheus claims 94% patch rate and 74% rescue on Defects4J by reverse-engineering Gherkin specs then validating them against ground-truth code, but the RQA loop makes the comparison to the blind baseline look circular.


The main takeaway is that this paper pushes executable specs as the fix for the intent gap in agentic APR. It uses a multi-agent setup to turn failure reports into Gherkin specs, then runs an RQA loop that checks those specs against the ground-truth code before the repair agent uses them. On 680 Defects4J bugs it reports 93.97% correct patches and rescues 119 bugs the baseline misses.

That direction makes sense on its face: giving the agent deterministic constraints instead of just natural-language hints or adversarial sampling should reduce over-engineering and hallucinations of intent. The qualitative observation that explicit specs lead to smaller, more precise patches is also worth noting if it holds in the full results. The multi-agent inference pipeline for specs is the piece that feels freshest relative to earlier APR work that either assumes specs exist or generates tests after the fact. The paper does a clean job laying out why current agentic repair still drifts from developer intent and why BDD-style contracts could help. The numbers are presented plainly and the framing is practical rather than hype-heavy.

The soft spot is the evaluation. The RQA loop explicitly uses ground-truth code as a proxy oracle to certify the inferred specs. On a benchmark where the correct patches are already known, this step gives the Prometheus agent an oracle signal that the strong blind baseline never receives. That means the rescue-rate comparison is not like-for-like; the specs are not purely reverse-engineered from the failure report alone. The abstract does not spell out how the same validation would work in a setting where the fix is unknown, nor does it report whether removing the ground-truth step drops the rescue numbers. Without methods details on baselines, statistical tests, or leakage controls, the headline figures are hard to interpret at face value.

This is aimed at people working on agentic repair, spec-based APR, or multi-agent code workflows. A reader who wants concrete ideas for executable intent constraints will find something usable here even if they have to discount the raw percentages. The core proposal is coherent enough and the problem is real enough that it deserves a serious referee, though any review should focus first on whether the RQA loop can be made non-circular or replaced with a ground-truth-free validation method. I would send it to review with that specific request rather than desk-reject.

Referee Report

1 major / 3 minor

Summary. The manuscript presents Prometheus, a multi-agent framework for automated program repair (APR) that addresses the 'Intent Gap' by reverse-engineering executable Gherkin specifications from runtime failure reports using Behavior-Driven Development (BDD). It introduces a Requirement Quality Assurance (RQA) Loop that uses ground-truth code as a proxy oracle to validate these specifications. Evaluated on 680 defects from Defects4J, the framework claims a 93.97% correct patch rate (639/680) and a 74.4% rescue rate, repairing 119 bugs that a strong blind agent could not fix.

Significance. If the reported results can be shown to hold without circular evaluation, the work would be significant for the APR field by demonstrating that prioritizing specification inference over direct code generation can lead to higher repair rates and more precise patches. The focus on executable specifications as contracts is a promising direction, and the rescue of complex bugs highlights potential for handling cases where current agents fail. However, the current evaluation leaves open questions about reproducibility and fairness of comparisons.

major comments (1)

[§4 (Evaluation) and RQA Loop description] The RQA Loop leverages ground-truth code as a proxy oracle to validate inferred specifications. This setup risks circularity in the rescue rate calculation because the specifications are refined against the known correct behavior on Defects4J bugs, information unavailable to the baseline 'strong blind agent'. The 74.4% rescue rate (119 bugs) and overall 93.97% patch rate may not be directly comparable without demonstrating that the performance holds when RQA validation is performed without access to ground-truth (e.g., using only failure reports or alternative oracles).

minor comments (3)

[Abstract] The abstract reports strong quantitative outcomes but omits details on the baseline agent, statistical tests, number of runs, or how data was handled, making it difficult to assess the claims' robustness.
[Methodology] Missing explicit description of the 'strong blind agent' baseline, including its architecture, prompting strategy, and whether it had access to any specifications.
[Results] No ablation studies on the contribution of the RQA Loop components or sensitivity analysis to the oracle usage.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed feedback on our manuscript, especially the concerns about the evaluation setup in Section 4 and the RQA Loop. We address these points below and outline planned revisions to enhance the rigor of our claims.


Referee: The RQA Loop leverages ground-truth code as a proxy oracle to validate inferred specifications. This setup risks circularity in the rescue rate calculation because the specifications are refined against the known correct behavior on Defects4J bugs, information unavailable to the baseline 'strong blind agent'. The 74.4% rescue rate (119 bugs) and overall 93.97% patch rate may not be directly comparable without demonstrating that the performance holds when RQA validation is performed without access to ground-truth (e.g., using only failure reports or alternative oracles).

Authors: We acknowledge that the use of ground-truth code in the RQA Loop introduces a potential circularity, as it provides validation information not available to the baseline agent or in typical real-world APR scenarios. The RQA mechanism is specifically designed to address the 'Hallucination of Intent' by ensuring that the reverse-engineered Gherkin specifications align with the correct program behavior, using the ground-truth as a reliable proxy oracle during our benchmark evaluation. That said, the specification inference process itself is performed exclusively from the runtime failure reports and does not directly access the ground-truth code. The RQA acts as a post-inference filter to discard low-quality specifications. We agree that this makes direct comparison to the blind agent less straightforward, as the blind agent lacks both the specification guidance and the validation step.

In the revised manuscript, we will add a dedicated discussion on this limitation and include an ablation experiment. Specifically, we will report results where RQA validation is conducted without ground-truth access, relying instead on execution against the original failing test cases and LLM-based consistency checks. This will allow us to quantify how much of the performance gain is attributable to the oracle versus the executable specification approach. We will also update the rescue rate analysis to note the methodological differences explicitly. These changes will be reflected in a partial revision of the evaluation section.

revision: partial
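The ablation the authors promise could take roughly the following shape. This is a sketch under stated assumptions, not the authors' design: the acceptance rule (the spec must fail on the buggy code, and a strict majority of independent consistency judgments must agree that it matches the failure report) and all names are hypothetical, with the LLM judgments replaced by plain booleans.

```python
from typing import Sequence

def validate_without_ground_truth(spec_fails_on_buggy: bool,
                                  consistency_votes: Sequence[bool]) -> bool:
    """Accept a spec using only signals available when the fix is unknown."""
    if not spec_fails_on_buggy:
        return False  # a spec the buggy code already satisfies cannot expose the bug
    if not consistency_votes:
        return False  # no judgments, no acceptance
    agree = sum(consistency_votes)
    return agree * 2 > len(consistency_votes)  # strict majority

accepted = validate_without_ground_truth(True, [True, True, False])
rejected_no_repro = validate_without_ground_truth(False, [True, True, True])
rejected_split = validate_without_ground_truth(True, [True, False])
print(accepted, rejected_no_repro, rejected_split)
```

Unlike Sandwich Verification, nothing here confirms that the spec passes on a correct fix, which is the signal the referee asks to see removed.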

Circularity Check

1 step flagged

RQA Loop's ground-truth oracle renders reverse-engineered specs equivalent to target behavior by construction

specific steps

self-definitional [Abstract]
"To resolve the 'Hallucination of Intent,' we propose a Requirement Quality Assurance (RQA) Loop, a mechanism that leverages ground-truth code as a proxy oracle to validate inferred specifications."

The framework claims to reverse-engineer Gherkin specs solely from runtime failure reports, yet the RQA validation step uses ground-truth code (the benchmark's correct implementation) as oracle. On Defects4J, where ground-truth patches are known, this allows specs to be confirmed or refined to encode the exact target behavior. The reported patch and rescue rates are therefore achieved with specifications that are equivalent to the correct intent by construction, rather than derived independently. The blind-agent baseline lacks this oracle, so the comparison is not like-for-like.

full rationale

The paper's headline results (93.97% patch rate; 74.4% rescue rate, with 119 Defects4J bugs rescued) rest on the RQA Loop. The abstract explicitly states that this loop 'leverages ground-truth code as a proxy oracle to validate inferred specifications' after reverse-engineering from runtime failure reports. Because Defects4J supplies known ground-truth patches, the validation step can confirm or adjust specs to match the exact correct behavior. This makes the 'inferred' specs non-independent of the target; the subsequent agentic repair is guided by an oracle signal unavailable to the blind baseline agent. The rescue-rate comparison is therefore not a like-for-like comparison and not a pure test of specification inference. No equations, self-citations, or uniqueness theorems are present, but the central performance claim is partly defined by this evaluation structure, producing moderate circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that Gherkin specifications accurately capture developer intent when reverse-engineered and that the RQA mechanism provides unbiased validation. No explicit numerical free parameters are stated in the abstract.

axioms (1)

domain assumption: Behavior-Driven Development Gherkin specifications can be reliably reverse-engineered from runtime failure reports by a multi-agent system to represent original developer intent.
Invoked as the foundation for prioritizing specification inference over direct code generation.

invented entities (1)

Requirement Quality Assurance (RQA) Loop (no independent evidence)
purpose: Validate inferred specifications by treating ground-truth code as a proxy oracle to prevent hallucination of intent.
New mechanism introduced to close the intent gap before repair generation.


