Probe-and-Refine Tuning of Repository Guidance for Coding Agents

Asa Shepard; Jeannie Albrecht

arXiv: 2606.20512 · v2 · pith:FWMKZJWLnew · submitted 2026-06-18 · cs.SE · cs.LG


Pith reviewed 2026-06-26 15:56 UTC · model grok-4.3


keywords: coding agents · repository guidance · LLM tuning · SWE-bench · probe-and-refine · AGENTS.md · agent performance · synthetic probes


The pith

Probe-and-refine tuning raises coding agent resolve rates on repositories by refining guidance files with synthetic probes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the way repository guidance for LLM coding agents is created determines whether it helps or hurts performance. It presents probe-and-refine tuning, a method that generates synthetic bug-fix probes, runs single-shot LLM calls to spot gaps in an existing guidance file, and updates the file iteratively without any agent execution or tool use. Across four trials on SWE-bench Verified using Qwen3.5-35B-A3B, the tuned guidance reaches a 33.0 percent mean resolve rate versus 28.3 percent for the starting static file and 25.5 percent with no guidance. The gains come from agents reaching the right files more often rather than from better edits once there, and the same process reveals that guidance becomes especially useful when agents are given larger step budgets.

Core claim

Probe-and-refine tuning uses synthetic bug-fix probes to iteratively diagnose and patch a repository's guidance file through single-shot LLM calls with no agent loop or tool use during tuning. On SWE-bench Verified across four independent trials with Qwen3.5-35B-A3B at 200 steps, it achieves 33.0 percent mean resolve rate versus 28.3 percent for the static knowledge base used to initialize it and 25.5 percent for an unguided baseline. The improvement comes from coverage rather than precision: refined guidance produces evaluable patches for 14.5 percentage points more instances while per-patch precision remains statistically constant at approximately 59 percent.
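The loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the prompt wording, the `probes_per_iter` parameter, and the early-stop rule are all assumptions introduced here. The default of 5 iterations mirrors the count mentioned in the Figure 2 caption. The one property taken directly from the paper is that every step is a single-shot LLM call, with no agent loop and no tool use.

```python
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def probe_and_refine(guidance: str, repo_summary: str, llm: LLM,
                     iterations: int = 5, probes_per_iter: int = 3) -> str:
    """Iteratively patch a guidance file using single-shot LLM calls only."""
    for _ in range(iterations):
        gaps: List[str] = []
        for i in range(probes_per_iter):
            # 1. Generate a synthetic bug-fix probe for this repository.
            probe = llm(
                f"PROBE\nRepository:\n{repo_summary}\n"
                f"Write synthetic bug report #{i}."
            )
            # 2. Single-shot diagnosis: no agent execution, no tools.
            diagnosis = llm(
                f"DIAGNOSE\nGuidance:\n{guidance}\nProbe:\n{probe}\n"
                "What is missing from the guidance to locate the fix? "
                "Reply NONE if nothing."
            )
            if diagnosis.strip().upper() != "NONE":
                gaps.append(diagnosis.strip())
        if not gaps:
            break  # Assumed stopping rule: no diagnosed gaps remain.
        # 3. Patch the guidance file with the diagnosed gaps.
        guidance = llm(
            "REFINE\nGuidance:\n" + guidance + "\nGaps:\n" + "\n".join(gaps)
        )
    return guidance
```

Passing the model as a plain callable keeps the sketch independent of any particular inference API. It also makes the cross-model finding easy to picture: if the diagnosis call returns uninformative text, the refine step has nothing useful to patch in.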

What carries the argument

The probe-and-refine tuning procedure that diagnoses gaps in a repository guidance file using synthetic bug-fix probes and updates the file through single-shot LLM calls.

If this is right

Guidance produced by the method lets agents make productive use of larger step budgets.
The tuning loop degrades when the model cannot generate sufficiently diagnostic output about guidance gaps.
Per-patch precision remains constant even when the tuning process itself degrades.
Refined guidance primarily increases the number of problems for which the agent reaches the correct file.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could start from minimal or empty guidance files on repositories that lack any AGENTS.md.
It may scale to larger or private codebases because tuning avoids running full agent trajectories.
Cross-model results suggest that the diagnostic quality of the tuning model is the main limit on further gains.

Load-bearing premise

Synthetic bug-fix probes generated by the same LLM family are representative enough of real repository issues for the tuning model to diagnose and correct guidance gaps.

What would settle it

An independent replication that applies probe-and-refine to SWE-bench Verified and finds no statistically significant rise in the fraction of instances receiving evaluable patches would undercut the coverage claim.

Figures

Figures reproduced from arXiv: 2606.20512 by Asa Shepard, Jeannie Albrecht.

**Figure 1.** Probe-and-refine tuning pipeline. The static_kb artifact feeds both the static_kb condition directly and the refinement loop, which transforms it into the refined guidance using synthetic probes and single-shot diagnosis. No SWE-bench evaluation instances are used during refinement. All three conditions feed the same fixed coding agent. Guidance length. The static_kb artifacts average 1,687 characters (ran…

**Figure 2.** Guidance evolution for django/django. Generic rules (left) are replaced with repo-specific strategies (right) over 5 iterations. The procedure independently converges on a reproduce-first workflow with subsystem-specific tracing instructions, test paths, and middleware dependencies. 3.4 Coding Agent and Fallback The coding agent operates in a ReAct-style loop (Yao et al., 2023): at each step it emits a bas…

**Figure 3.** Mean resolve rate across four independent trials on SWE-bench Verified (500 instances per trial). Bars…

**Figure 4.** Mean evaluation coverage across four trials at 200 steps. Probe-and-refine produces evaluable patches for…

**Figure 5.** Patch timing at 200 steps, pooled across four trials. The unguided agent produces the bulk of its patches in the…

**Figure 6.** Probe-refined-only consistent solves per repository, expressed as a ratio of observed to expected (base-rate).

**Figure 7.** Resolve rate across step budgets. At 25 steps, all conditions are equivalent. As budget increases, the…

Original abstract

LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test suite, which workflows have historically led to wrong fixes) that does not exist in the code itself. Engineers typically maintain AGENTS.md files to supply this context as instructions for coding agents, but whether they help is contested: recent studies disagree on whether LLM-generated guidance improves or harms agent performance. In this paper we show that how the guidance is produced is the decisive variable, and introduce probe-and-refine tuning: a procedure that uses synthetic bug-fix probes to iteratively diagnose and patch a repository's guidance file through single-shot LLM calls, with no agent loop or tool use during tuning. On SWE-bench Verified across four independent trials with Qwen3.5-35B-A3B at 200 steps, probe-and-refine achieves 33.0% mean resolve rate vs. 28.3% for the static knowledge base used to initialize it and 25.5% for an unguided baseline (p < 0.001 for both probe-and-refine contrasts). The improvement comes from coverage rather than precision: refined guidance produces evaluable patches for 14.5 percentage points (pp) more instances while per-patch precision remains statistically constant (~59%, p = 0.119), showing that improved guidance helps agents reach the correct file rather than improving the quality of the changes they make. Further, a step-budget experiment shows that guidance is what lets the agent use a larger step budget productively, and a cross-model experiment with NVIDIA-Nemotron-3-Nano-30B-A3B finds that the tuning loop degrades when the model cannot generate sufficiently diagnostic output, though per-patch precision remains constant even then.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Probe-and-refine gives a modest 4.7pp resolve-rate lift on SWE-bench by tuning guidance with synthetic probes, mainly via better coverage, but the gains rest on how well those probes match real task distributions.


The main thing to know is that this paper shows a concrete tuning loop for repository guidance files that raises agent resolve rates on SWE-bench Verified from 28.3% to 33.0% across four trials, with the improvement coming from higher coverage rather than better per-patch precision.

What is new is the probe-and-refine procedure itself: it generates synthetic bug-fix probes, runs single-shot LLM calls to diagnose gaps in the AGENTS.md file, and patches the guidance iteratively without any agent execution or tool use during the tuning phase. The paper does a solid job separating coverage from precision, reporting that refined guidance produces evaluable patches for 14.5pp more instances while precision stays flat at ~59%. It also includes a step-budget experiment and a cross-model check with NVIDIA-Nemotron-3-Nano-30B-A3B, which shows that the tuning loop degrades when the model cannot produce sufficiently diagnostic output.
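The coverage-versus-precision split can be made concrete with a small sketch. It assumes, as a reading of the abstract rather than a definition quoted from the paper, that coverage is the fraction of instances with an evaluable patch and precision is the fraction of evaluable patches that resolve the issue, so that resolve rate equals coverage times precision. The function name and the counts below are hypothetical toy values, not results from the paper.

```python
def decompose(total: int, evaluable: int, resolved: int) -> dict:
    """Split a resolve rate into coverage and per-patch precision.

    Assumed definitions (not quoted from the paper):
      coverage  = evaluable / total
      precision = resolved / evaluable
      resolve   = coverage * precision = resolved / total
    """
    coverage = evaluable / total
    precision = resolved / evaluable if evaluable else 0.0
    return {
        "coverage": coverage,
        "precision": precision,
        "resolve_rate": coverage * precision,
    }


# Toy numbers only: precision is held at 0.6 while coverage rises.
baseline = decompose(total=100, evaluable=40, resolved=24)
refined = decompose(total=100, evaluable=55, resolved=33)
```

In this toy case the resolve rate moves from roughly 0.24 to roughly 0.33 even though precision is identical in both conditions, which is the pattern the paper reports: more instances reach an evaluable patch, while the quality of each patch is unchanged.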

The soft spot is the assumption that the synthetic probes are representative enough of the actual SWE-bench tasks. The reported gains depend on the probes exposing guidance gaps that matter for real instances; if the probes over-weight single-file or model-specific cases, the coverage boost could be narrower than claimed. The abstract gives p-values and trial counts but no direct validation of probe file locations or error categories against the test distribution, so that needs checking in the full methods.

This is for people working on coding agents who already maintain or generate repo guidance and want a low-cost way to iterate on it. A reader focused on SWE-bench or practical context engineering would get usable ideas from the method and the coverage breakdown. It has enough empirical grounding and a replicable procedure to deserve serious referee time, even if revisions should strengthen the probe validation.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces probe-and-refine tuning, a procedure that iteratively diagnoses and patches repository guidance files (e.g., AGENTS.md) using synthetic bug-fix probes via single-shot LLM calls, without any agent loop or tool use during tuning. On SWE-bench Verified across four independent trials with Qwen3.5-35B-A3B at 200 steps, it reports a mean resolve rate of 33.0% versus 28.3% for the static knowledge base initializer and 25.5% for an unguided baseline (p < 0.001 for both contrasts). The lift is attributed to improved coverage (14.5pp more evaluable patches) while per-patch precision remains statistically unchanged (~59%, p = 0.119). Additional experiments examine step-budget utilization and cross-model behavior with NVIDIA-Nemotron-3-Nano-30B-A3B.

Significance. If the empirical result holds, the work supplies a low-overhead method for improving repository guidance that demonstrably increases coverage without altering per-patch precision, and shows that better guidance enables productive use of larger step budgets. Credit is due for the multi-trial design with reported p-values, the explicit coverage-vs-precision breakdown, and the boundary-condition experiment on model diagnostic capability. These elements make the central performance claim more falsifiable than typical single-run agent evaluations.

major comments (2)

[Probe generation and results paragraphs] The claim that the 4.7pp resolve-rate improvement (33.0% vs 28.3%) arises from refined guidance that generalizes to real SWE-bench Verified tasks depends on synthetic probes being representative of actual bug distributions. No quantitative comparison is provided of probe file locations, multi-file complexity, or error categories against the test distribution (see the probe-generation description and the results paragraph reporting the coverage gain).
[Methods and experimental setup] The reported mean rates, p-values, and coverage-vs-precision breakdown cannot be fully verified because the text omits full methods details: exact prompts and sampling parameters for probe generation, how the four trials were made independent, dataset splits between tuning probes and evaluation, and any error analysis of failed probes.

minor comments (2)

[Abstract and model descriptions] The abstract and results text use 'A3B' in model names without defining the suffix or its relation to model architecture.
[Step-budget experiment paragraph] The step-budget experiment is mentioned but lacks a figure or table reference in the provided text, making it difficult to assess how guidance interacts with step count.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the manuscript's claims and reproducibility. We address each major comment below and will revise accordingly.


Referee: [Probe generation and results paragraphs] The claim that the 4.7pp resolve-rate improvement (33.0% vs 28.3%) arises from refined guidance that generalizes to real SWE-bench Verified tasks depends on synthetic probes being representative of actual bug distributions. No quantitative comparison is provided of probe file locations, multi-file complexity, or error categories against the test distribution (see the probe-generation description and the results paragraph reporting the coverage gain).

Authors: We agree that a quantitative comparison of probe characteristics to the test distribution would make the generalization argument more robust. The current manuscript relies on the empirical lift observed on held-out real tasks as evidence of utility, but does not include such a distributional analysis. In the revised manuscript we will add a dedicated subsection (with accompanying table) that reports file-location overlap, multi-file complexity statistics, and error-category distributions for the synthetic probes versus the SWE-bench Verified test set, using appropriate overlap metrics and statistical tests.

revision: yes

Referee: [Methods and experimental setup] The reported mean rates, p-values, and coverage-vs-precision breakdown cannot be fully verified because the text omits full methods details: exact prompts and sampling parameters for probe generation, how the four trials were made independent, dataset splits between tuning probes and evaluation, and any error analysis of failed probes.

Authors: We acknowledge that the submitted manuscript does not supply the level of methodological detail required for full verification. The four trials used distinct random seeds for both probe generation and downstream evaluation; probes were generated from the repository without using any evaluation-task instances. In the revision we will expand the Methods section and add an appendix containing the exact prompts, sampling parameters (temperature, top-p, max tokens), a precise description of trial independence and dataset splits, and a brief error analysis of any probes that failed to produce usable diagnostic output.

revision: yes
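One way the file-location overlap requested in the first major comment could be measured is a Jaccard index over directory prefixes. This is only a sketch of one candidate metric: the rebuttal does not say which overlap metric would be used, and the helper names, the `depth` parameter, and the example paths are all hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, Set


def top_dirs(paths: Iterable[str], depth: int = 2) -> Set[str]:
    """Reduce file paths to their leading `depth` path components."""
    return {"/".join(PurePosixPath(p).parts[:depth]) for p in paths}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical paths, for illustration only.
probe_files = ["pkg/db/models/query.py", "pkg/forms/fields.py"]
test_files = ["pkg/db/backends/base.py", "pkg/contrib/admin/options.py"]

overlap = jaccard(top_dirs(probe_files), top_dirs(test_files))
```

Comparing at the directory level rather than the exact file level is a design choice: synthetic probes are unlikely to hit the same files as real issues, but they should exercise the same subsystems if the representativeness premise holds.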

Circularity Check

0 steps flagged

No circularity: empirical benchmark results independent of internal definitions


The paper reports direct empirical measurements of resolve rates on the external SWE-bench Verified benchmark across multiple trials and models. The 33.0% vs 28.3% contrast is a measured outcome of running the tuned guidance on real tasks, not a quantity derived by construction from fitted parameters, self-citations, or renamed inputs. No equations, uniqueness theorems, or ansatzes are present that reduce the central claim to the tuning procedure itself. The method uses synthetic probes for tuning but evaluates on held-out real tasks, keeping the reported lift falsifiable and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described in sufficient detail to populate the ledger.


