Project Prometheus: Bridging the Intent Gap in Agentic Program Repair via Reverse-Engineered Executable Specifications
Pith reviewed 2026-05-10 05:32 UTC · model grok-4.3
The pith
Reverse-engineering executable Gherkin specifications from runtime failures lets agentic program repair achieve 93.97% correct patches and rescue 74.4% of bugs that blind agents miss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prometheus prioritizes specification inference over code generation by reverse-engineering Gherkin executable contracts from runtime failure reports via a multi-agent architecture grounded in Behavior-Driven Development. A Requirement Quality Assurance loop validates the inferred specifications against ground-truth code as a proxy oracle to prevent hallucination of intent. On 680 Defects4J defects this yields a 93.97% correct patch rate overall and a 74.4% rescue rate, repairing 119 complex bugs that a strong blind agent could not resolve, with qualitative results showing that explicit intent steers agents toward precise minimal corrections.
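To make "executable contract" concrete, here is a minimal sketch of what a reverse-engineered Gherkin scenario might look like and how its steps could be separated for execution. The scenario text, the `Step` type, and the `parse_scenario` helper are hypothetical illustrations written for this page; none of them come from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical scenario, written for illustration only (not from the paper).
SCENARIO = """\
Feature: Parsing numeric strings
  Scenario: Leading plus sign is accepted
    Given the input string "+42"
    When the string is parsed as an integer
    Then the result is 42
"""

# Standard Gherkin step keywords.
STEP_KEYWORDS = ("Given", "When", "Then", "And", "But")


@dataclass
class Step:
    keyword: str  # one of STEP_KEYWORDS
    text: str     # the rest of the line after the keyword


def parse_scenario(source: str) -> list[Step]:
    """Extract the step lines from a single-scenario Gherkin text."""
    steps = []
    for raw in source.splitlines():
        line = raw.strip()
        for keyword in STEP_KEYWORDS:
            if line.startswith(keyword + " "):
                steps.append(Step(keyword, line[len(keyword) + 1:]))
                break
    return steps


steps = parse_scenario(SCENARIO)
for step in steps:
    print(f"{step.keyword:5} | {step.text}")
```

In a real BDD toolchain each step would be bound to a step definition that drives the code under repair; the sketch stops at parsing to show only the shape of the contract.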
What carries the argument
The Requirement Quality Assurance (RQA) Loop, a validation mechanism that uses ground-truth code as a proxy oracle to confirm that reverse-engineered Gherkin specifications accurately capture developer intent before patch generation proceeds.
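The validation step described above can be sketched as a small loop. This is a simplified illustration under stated assumptions, not the paper's implementation: a specification is modeled as a list of input/expected-output examples, `infer_spec` stands in for the multi-agent inference step, `oracle` stands in for the ground-truth code, and feeding mismatches back into the next inference round is an assumption (the source describes validation but does not detail the retry mechanics).

```python
from __future__ import annotations

from typing import Callable, Optional

# A "spec" is modeled as (input, expected_output) examples, a deliberate
# simplification of a Gherkin scenario's Given/Then pairs.
Spec = list[tuple[int, int]]


def rqa_loop(
    infer_spec: Callable[[str, list[str]], Spec],
    oracle: Callable[[int], int],
    failure_report: str,
    max_rounds: int = 3,
) -> Optional[Spec]:
    """Return the first inferred spec that agrees with the oracle, else None."""
    feedback: list[str] = []
    for _ in range(max_rounds):
        spec = infer_spec(failure_report, feedback)
        mismatches = [
            f"input {x}: spec expects {want}, oracle gives {oracle(x)}"
            for x, want in spec
            if oracle(x) != want
        ]
        if not mismatches:
            return spec        # spec is consistent with ground-truth behavior
        feedback = mismatches  # assumed: mismatches inform the next attempt
    return None                # no validated spec within the round budget


# Toy stand-ins (hypothetical): the oracle is a correct abs(), and the
# "inferrer" hallucinates on its first attempt, then corrects itself.
def toy_oracle(x: int) -> int:
    return abs(x)


def toy_inferrer(report: str, feedback: list[str]) -> Spec:
    if not feedback:
        return [(-3, -3), (2, 2)]
    return [(-3, 3), (2, 2)]


validated = rqa_loop(toy_inferrer, toy_oracle, "AssertionError: abs(-3) returned -3")
print(validated)
```

The sketch also makes the circularity concern visible: `oracle` is exactly the information a blind baseline never sees.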
If this is right
- Agentic APR systems can reach high repair rates by aligning patches to verified executable specifications instead of relying on direct generation.
- Explicit specifications guide agents toward minimal, intent-preserving corrections rather than structurally invasive over-engineering.
- The future of APR depends on the ability to obtain or infer executable specifications, shifting emphasis away from ever-larger language models.
- Rescue rates on hard bugs improve substantially when intent is made explicit through executable contracts derived from failures.
Where Pith is reading between the lines
- The same reverse-engineering-plus-validation pattern could be tested on requirements-driven code generation tasks outside of bug repair.
- Real-world deployment would require alternative validation methods once ground-truth oracles are unavailable.
- Many current agent failures in software tasks may trace to missing explicit intent representations rather than insufficient reasoning capacity.
- The framework suggests that pre-existing executable specifications could be even more effective than reverse-engineered ones if widely adopted.
Load-bearing premise
Ground-truth code can serve as an unbiased proxy oracle for validating reverse-engineered specifications when measuring rescue rates on known benchmark bugs.
What would settle it
Apply the full Prometheus pipeline to a fresh benchmark of defects where no ground-truth implementations are supplied to the RQA loop and measure whether the rescue rate on previously unrepairable bugs drops sharply or remains near 74%.
Original abstract
The transition from neural machine translation to agentic workflows has revolutionized Automated Program Repair (APR). However, existing agents, despite their advanced reasoning capabilities, frequently suffer from the "Intent Gap": the misalignment between the generated patch and the developer's original intent. Current solutions relying on natural language summaries or adversarial sampling often fail to provide the deterministic constraints required for surgical repairs. In this paper, we introduce Prometheus, a novel framework that bridges this gap by prioritizing Specification Inference over code generation. We employ Behavior-Driven Development (BDD) as an executable contract, utilizing a multi-agent architecture to reverse-engineer Gherkin specifications from runtime failure reports. To resolve the "Hallucination of Intent," we propose a Requirement Quality Assurance (RQA) Loop, a mechanism that leverages ground-truth code as a proxy oracle to validate inferred specifications. We evaluated Prometheus on 680 defects from the Defects4J benchmark. The results are transformative: our framework achieved a total correct patch rate of 93.97% (639/680). More significantly, it demonstrated a Rescue Rate of 74.4%, successfully repairing 119 complex bugs that a strong blind agent failed to resolve. Qualitative analysis reveals that explicit intent guides agents away from structurally invasive over-engineering toward precise, minimal corrections. Our findings suggest that the future of APR lies not in larger models, but in the capability to align code with verified Executable Specifications, whether pre-existing or reverse-engineered.
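The headline figures can be sanity-checked with a few lines of arithmetic. The 639/680 fraction is stated in the abstract; the denominator of the rescue rate is not, so the second half of this sketch is an inference under the assumption that 119 is the count of rescued bugs and that 74.4% is rounded to one decimal place. It is not a figure reported by the source.

```python
def pct(numerator: int, denominator: int, places: int = 2) -> float:
    """Percentage rounded to the given number of decimal places."""
    return round(100 * numerator / denominator, places)


# Stated in the abstract: 639 correct patches out of 680 defects.
correct_rate = pct(639, 680)
unresolved = 680 - 639

# Assumption: 119 is the number of rescued bugs. Find every pool size n
# (at most the full benchmark) for which 119/n displays as 74.4%.
# Working in tenths of a percent keeps the comparison in integers.
candidate_pools = [n for n in range(119, 681) if round(1000 * 119 / n) == 744]

print(correct_rate, unresolved, candidate_pools)
```

Under those assumptions only one pool size is consistent with the rounded rate, which suggests the blind agent's failure set is somewhat larger than the 119 bugs that were rescued.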
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Prometheus, a multi-agent framework for automated program repair (APR) that addresses the 'Intent Gap' by reverse-engineering executable Gherkin specifications from runtime failure reports using Behavior-Driven Development (BDD). It introduces a Requirement Quality Assurance (RQA) Loop that uses ground-truth code as a proxy oracle to validate these specifications. Evaluated on 680 defects from Defects4J, the framework claims a 93.97% correct patch rate (639/680) and a 74.4% rescue rate, repairing 119 bugs that a strong blind agent could not fix.
Significance. If the reported results can be shown to hold without circular evaluation, the work would be significant for the APR field by demonstrating that prioritizing specification inference over direct code generation can lead to higher repair rates and more precise patches. The focus on executable specifications as contracts is a promising direction, and the rescue of complex bugs highlights potential for handling cases where current agents fail. However, the current evaluation leaves open questions about reproducibility and fairness of comparisons.
major comments (1)
- [§4 (Evaluation) and RQA Loop description] The RQA Loop leverages ground-truth code as a proxy oracle to validate inferred specifications. This setup risks circularity in the rescue rate calculation because the specifications are refined against the known correct behavior on Defects4J bugs, information unavailable to the baseline 'strong blind agent'. The 74.4% rescue rate (119 bugs) and overall 93.97% patch rate may not be directly comparable without demonstrating that the performance holds when RQA validation is performed without access to ground-truth (e.g., using only failure reports or alternative oracles).
minor comments (3)
- [Abstract] The abstract reports strong quantitative outcomes but omits details on the baseline agent, statistical tests, number of runs, or how data was handled, making it difficult to assess the claims' robustness.
- [Methodology] Missing explicit description of the 'strong blind agent' baseline, including its architecture, prompting strategy, and whether it had access to any specifications.
- [Results] No ablation studies on the contribution of the RQA Loop components or sensitivity analysis to the oracle usage.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback on our manuscript, especially the concerns about the evaluation setup in Section 4 and the RQA Loop. We address these points below and outline planned revisions to enhance the rigor of our claims.
Point-by-point responses
- Referee: The RQA Loop leverages ground-truth code as a proxy oracle to validate inferred specifications. This setup risks circularity in the rescue rate calculation because the specifications are refined against the known correct behavior on Defects4J bugs, information unavailable to the baseline 'strong blind agent'. The 74.4% rescue rate (119 bugs) and overall 93.97% patch rate may not be directly comparable without demonstrating that the performance holds when RQA validation is performed without access to ground-truth (e.g., using only failure reports or alternative oracles).
Authors: We acknowledge that the use of ground-truth code in the RQA Loop introduces a potential circularity, as it provides validation information not available to the baseline agent or in typical real-world APR scenarios. The RQA mechanism is specifically designed to address the 'Hallucination of Intent' by ensuring that the reverse-engineered Gherkin specifications align with the correct program behavior, using the ground-truth as a reliable proxy oracle during our benchmark evaluation. That said, the specification inference process itself is performed exclusively from the runtime failure reports and does not directly access the ground-truth code. The RQA acts as a post-inference filter to discard low-quality specifications. We agree that this makes direct comparison to the blind agent less straightforward, as the blind agent lacks both the specification guidance and the validation step. In the revised manuscript, we will add a dedicated discussion on this limitation and include an ablation experiment. Specifically, we will report results where RQA validation is conducted without ground-truth access, relying instead on execution against the original failing test cases and LLM-based consistency checks. This will allow us to quantify how much of the performance gain is attributable to the oracle versus the executable specification approach. We will also update the rescue rate analysis to note the methodological differences explicitly. These changes will be reflected in a partial revision of the evaluation section.
Revision: partial
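The ablation the authors propose amounts to computing the same rescue rate under several validation configurations and comparing them. A minimal sketch of that bookkeeping follows; the configuration names, the bug identifiers, and every outcome in it are toy placeholders invented for illustration, not results from the paper or from Defects4J.

```python
from __future__ import annotations


def rescue_rate(baseline_failed: set[str], fixed: set[str]) -> float:
    """Fraction of the baseline's failures that a configuration repairs."""
    if not baseline_failed:
        raise ValueError("baseline failure set is empty")
    return len(baseline_failed & fixed) / len(baseline_failed)


# Toy data only: bugs the blind baseline failed on.
baseline_failed = {"bug-1", "bug-2", "bug-3", "bug-4"}

# Toy data only: bugs fixed under each hypothetical configuration.
# "bug-9" was not a baseline failure, so it does not count as a rescue.
fixed_by_config = {
    "rqa_with_ground_truth": {"bug-1", "bug-2", "bug-3", "bug-9"},
    "rqa_failing_tests_only": {"bug-1", "bug-2"},
    "no_rqa": {"bug-1"},
}

rates = {
    name: rescue_rate(baseline_failed, fixed)
    for name, fixed in fixed_by_config.items()
}

# Share of the rescue rate that disappears once the oracle is removed.
oracle_contribution = (
    rates["rqa_with_ground_truth"] - rates["rqa_failing_tests_only"]
)

for name, rate in rates.items():
    print(f"{name:24} {rate:.0%}")
print(f"attributable to oracle   {oracle_contribution:.0%}")
```

Holding the baseline failure set fixed across configurations is the design choice that matters here: it keeps the denominators identical, so the difference between rows can be read as the oracle's contribution.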
Circularity Check
RQA Loop's ground-truth oracle renders reverse-engineered specs equivalent to target behavior by construction
specific steps
- Self-definitional [Abstract]: "To resolve the 'Hallucination of Intent,' we propose a Requirement Quality Assurance (RQA) Loop, a mechanism that leverages ground-truth code as a proxy oracle to validate inferred specifications."
The framework claims to reverse-engineer Gherkin specs solely from runtime failure reports, yet the RQA validation step uses ground-truth code (the benchmark's correct implementation) as oracle. On Defects4J, where ground-truth patches are known, this allows specs to be confirmed or refined to encode the exact target behavior. The reported patch and rescue rates are therefore achieved with specifications that are equivalent to the correct intent by construction, rather than derived independently; the blind-agent baseline lacks this oracle, rendering the comparison non-isomorphic.
full rationale
The paper's headline results (93.97% patch rate; 74.4% rescue rate, with 119 Defects4J bugs rescued) rest on the RQA Loop. The abstract explicitly states that this loop 'leverages ground-truth code as a proxy oracle to validate inferred specifications' after reverse-engineering from runtime failure reports. Because Defects4J supplies known ground-truth patches, the validation step can confirm or adjust specs to match the exact correct behavior. This makes the 'inferred' specs non-independent of the target; the subsequent agentic repair is guided by an oracle signal unavailable to the blind baseline agent. The rescue-rate comparison therefore reduces to a non-isomorphic setup rather than a pure test of specification inference. No equations, self-citations, or uniqueness theorems are present, but the central performance claim is partly defined by this evaluation structure, producing moderate circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Behavior-Driven Development Gherkin specifications can be reliably reverse-engineered from runtime failure reports by a multi-agent system to represent original developer intent.
invented entities (1)
- Requirement Quality Assurance (RQA) Loop: no independent evidence