DebugHarness: Emulating Human Dynamic Debugging for Autonomous Program Repair

Baowen Xu; Maolin Sun; Xuanlin Liu; Yibiao Yang; Yuming Zhou

arxiv: 2604.03610 · v1 · submitted 2026-04-04 · 💻 cs.SE

Maolin Sun, Yibiao Yang, Xuanlin Liu, Yuming Zhou, Baowen Xu

Pith reviewed 2026-05-13 17:37 UTC · model grok-4.3

classification 💻 cs.SE

keywords: debugharness, debugging, dynamic, static, memory, program, automated, autonomous

The pith

DebugHarness patches approximately 90% of the real-world C/C++ security bugs in SEC-bench by emulating interactive human debugging, a relative improvement of over 30% compared to state-of-the-art baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Software often contains severe security flaws, such as use-after-free errors and memory corruption, that are hard to fix automatically. Fuzzers can find these bugs, but fixing them usually requires expert manual analysis of how the program actually runs. Most current LLM-based repair tools only read the code without executing it, so they miss the execution context needed for diagnosis. DebugHarness is an autonomous agent that emulates how human engineers debug. Starting from a reproducible crash, it follows a pattern-guided investigation strategy, queries the live runtime for memory states and execution paths, forms hypotheses, and synthesizes patches that are checked in a closed validation loop. On SEC-bench, a dataset of real-world C/C++ vulnerabilities, it patched approximately 90 percent of the evaluated bugs, a relative improvement of more than 30 percent over state-of-the-art baselines. The result suggests that access to live program behavior helps an LLM diagnose and repair low-level faults better than static code generation alone. In this way it bridges static LLM reasoning and the dynamic nature of low-level systems programming.
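The probe, hypothesize, patch, and validate cycle described above can be pictured as a small control loop. The following is a minimal sketch, not the authors' implementation: `repair_loop`, `Attempt`, and the three callbacks (`probe`, `propose`, `validate`) are hypothetical names chosen for illustration, and the round limit is an assumption. In a real system the callbacks would wrap a debugger session, an LLM call, and a rebuild-and-rerun step respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    """One round of the loop: what was believed, what was tried, whether it held."""
    hypothesis: str
    patch: str
    validated: bool


def repair_loop(
    crash_report: str,
    probe: Callable[[List[str]], str],                  # hypothetical: query the live runtime
    propose: Callable[[List[str]], Tuple[str, str]],    # hypothetical: return (hypothesis, patch)
    validate: Callable[[str], bool],                    # hypothetical: rebuild and re-run the crash
    max_rounds: int = 3,                                # assumed budget, not from the paper
) -> Tuple[Optional[str], List[Attempt]]:
    """Return the first patch that passes validation, plus the attempt history."""
    observations: List[str] = [crash_report]
    history: List[Attempt] = []
    for _ in range(max_rounds):
        # Gather dynamic evidence before reasoning about a fix.
        observations.append(probe(observations))
        hypothesis, patch = propose(observations)
        ok = validate(patch)
        history.append(Attempt(hypothesis, patch, ok))
        if ok:
            return patch, history
        # Feed the failure back so the next round can revise the hypothesis.
        observations.append(f"patch rejected: {patch}")
    return None, history
```

The design point the sketch tries to capture is that a rejected patch is not discarded: it becomes one more observation, which is what closes the loop.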

Core claim

DebugHarness successfully patches approximately 90% of the evaluated bugs. This yields a relative improvement of over 30% compared to state-of-the-art baselines, demonstrating that dynamic debugging significantly enhances LLM diagnostic capabilities.
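As a rough reading aid, the two figures in the claim together bound the best baseline's repair rate. This page does not report the baseline numbers, so the following is only an implied bound, assuming "relative improvement" means $(r - b)/b$, with $r$ the DebugHarness repair rate and $b$ the baseline repair rate (both symbols are introduced here for illustration):

\[
\frac{r - b}{b} > 0.30
\quad\Longrightarrow\quad
b < \frac{r}{1.30} \approx \frac{0.90}{1.30} \approx 0.69 .
\]

Under that assumption, the strongest baseline would patch somewhat under 70% of the evaluated bugs.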

Load-bearing premise

That the pattern-guided investigation strategy and closed-loop validation cycle can reliably diagnose and fix intricate memory safety violations using only LLM-driven interactive runtime probes without human intervention.
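One way to picture a pattern-guided strategy is a lookup from the crash class to an ordered list of investigation steps. The sketch below is an illustration under stated assumptions, not the paper's method: it assumes the crash arrives as an AddressSanitizer-style report (the page does not say which sanitizer or debugger is used), and the `PLAYBOOK` entries, `classify`, and `plan` are hypothetical.

```python
import re
from typing import List, Optional

# AddressSanitizer reports name the bug class after "ERROR: AddressSanitizer: ".
ASAN_CLASS = re.compile(r"ERROR: AddressSanitizer: ([a-z-]+)")

# Hypothetical playbook: bug class -> ordered investigation steps.
PLAYBOOK = {
    "heap-use-after-free": [
        "locate the free site",
        "locate the stale access",
        "inspect pointer lifetime",
    ],
    "heap-buffer-overflow": [
        "inspect allocation size",
        "inspect index or length at the faulting access",
    ],
}

# Fallback when the crash class is unknown or not listed.
DEFAULT_PLAN = ["inspect the crashing frame", "walk the backtrace"]


def classify(report: str) -> Optional[str]:
    """Extract the lowercase bug class from a sanitizer-style report, if present."""
    match = ASAN_CLASS.search(report)
    return match.group(1) if match else None


def plan(report: str) -> List[str]:
    """Pick the investigation steps for a crash report."""
    return PLAYBOOK.get(classify(report), DEFAULT_PLAN)
```

Whether such a table is sufficient for intricate memory-safety bugs is exactly what the premise above leaves open; the sketch only shows where the pattern knowledge would sit.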

Original abstract

Patching severe security flaws in complex software remains a major challenge. While automated tools like fuzzers efficiently discover bugs, fixing deep-rooted low-level faults (e.g., use-after-free and memory corruption) still requires labor-intensive manual analysis by experts. Emerging Large Language Model (LLM) agents attempt to automate this pipeline, but they typically treat bug fixing as a purely static code-generation task. Relying solely on static artifacts, these methods miss the dynamic execution context strictly necessary for diagnosing intricate memory safety violations. To overcome these limitations, we introduce DebugHarness, an autonomous LLM-powered debugging agent harness that resolves complex vulnerabilities by emulating the interactive debugging practices of human systems engineers. Instead of merely examining static code, DebugHarness actively queries the live runtime environment. Driven by a reproducible crash, it utilizes a pattern-guided investigation strategy to formulate hypotheses, interactively probes program memory states and execution paths, and synthesizes patches via a closed-loop validation cycle. We evaluate DebugHarness on SEC-bench, a rigorous dataset of real-world C/C++ security vulnerabilities. DebugHarness successfully patches approximately 90% of the evaluated bugs. This yields a relative improvement of over 30% compared to state-of-the-art baselines, demonstrating that dynamic debugging significantly enhances LLM diagnostic capabilities. Overall, DebugHarness establishes a novel paradigm for automated program repair, bridging the gap between static LLM reasoning and the dynamic intricacies of low-level systems programming.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach relies on standard assumptions about LLM capabilities and runtime access in software engineering; no new free parameters, axioms, or invented entities are introduced beyond the described system.

