SWE-Edit: Rethinking Code Editing for Efficient SWE-Agent
Pith reviewed 2026-05-07 15:42 UTC · model grok-4.3
The pith
SWE-Edit splits code editing into a Viewer subagent for on-demand inspection and an Editor subagent for applying changes from plans, freeing the main agent to reason in cleaner context windows.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By separating code editing into a Viewer that extracts task-relevant code on demand and an Editor that executes modifications from high-level plans, SWE-Edit lets the main agent reason in clean, focused context windows. Training the editor with GRPO for adaptive mode selection further reduces errors compared with fixed formats. The result: a 2.1 percent higher resolved rate and 17.9 percent lower inference cost on SWE-bench Verified, plus a new code-editing benchmark intended to predict downstream agent performance.
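GRPO, as commonly described, scores each sampled completion against the other completions drawn for the same prompt, normalizing rewards within the group. A minimal sketch of that group-relative advantage computation (the reward scheme here is an illustrative assumption, not the paper's):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: each sampled edit's reward is
    centered and scaled by its group's mean and std deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four candidate edits sampled for one task, rewarded 1.0
# if the edit applies cleanly and passes checks, else 0.0.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful edits receive positive advantage and failed ones negative, so the policy is pushed toward whichever editing mode succeeded within each group.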
What carries the argument
The dual-subagent decomposition in which a Viewer extracts only task-relevant code and an Editor performs modifications from abstract plans, combined with GRPO-trained adaptive selection among editing formats.
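The decomposition can be caricatured in a few lines. Everything below is a hypothetical sketch (class names, the substring-match Viewer, and the uniqueness check are all assumptions for illustration, not the paper's implementation): the main agent sees only the Viewer's summaries and hands the Editor a concrete plan, so full files never enter its context.

```python
from dataclasses import dataclass

@dataclass
class Viewer:
    """Hypothetical Viewer: returns only query-relevant snippets,
    keeping whole files out of the main agent's context."""
    repo: dict  # path -> source text

    def view(self, path: str, query: str) -> str:
        lines = self.repo[path].splitlines()
        hits = [l for l in lines if query in l]
        return "\n".join(hits) or "(no matching lines)"

@dataclass
class Editor:
    """Hypothetical Editor: applies a plan as a concrete
    search-and-replace in its own clean context."""
    repo: dict

    def apply(self, path: str, search: str, replace: str) -> bool:
        src = self.repo[path]
        if src.count(search) != 1:  # ambiguous or missing match
            return False
        self.repo[path] = src.replace(search, replace)
        return True

# Main-agent loop in miniature: inspect via the Viewer, then
# delegate the edit, never loading the full file into context.
repo = {"calc.py": "def total(items):\n    return sum(items)\n"}
viewer, editor = Viewer(repo), Editor(repo)
snippet = viewer.view("calc.py", "total")
ok = editor.apply("calc.py", "return sum(items)",
                  "if not items:\n        return 0\n    return sum(items)")
```

The uniqueness check in `apply` mirrors the usual failure mode of find-and-replace edits: an ambiguous match is rejected rather than applied blindly.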
If this is right
- Higher resolved rates become possible on software engineering benchmarks without increasing overall token budget.
- Editing models can be screened and improved using a lightweight benchmark that correlates with end-to-end agent performance.
- Context windows stay smaller across multi-turn interactions because irrelevant code is never loaded into the main agent's state.
- Adaptive format selection reduces the frequency of malformed edits compared with always using a single find-and-replace template.
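The last point can be made concrete with a toy mode selector. This is not the paper's trained policy (GRPO learns the choice; the heuristic and its `ratio_cutoff` below are assumptions for illustration): prefer a targeted search-replace when the target is small and uniquely matched, and fall back to a whole-file rewrite otherwise.

```python
def choose_edit_mode(search: str, source: str, ratio_cutoff: float = 0.5) -> str:
    """Illustrative heuristic selector: search-replace only when the
    target is uniquely matched and small relative to the file."""
    if search and source.count(search) == 1:
        if len(search) / max(len(source), 1) <= ratio_cutoff:
            return "search_replace"
    return "full_rewrite"

src = "import os\n\ndef main():\n    print('hi')\n"
mode_small = choose_edit_mode("print('hi')", src)              # unique, small target
mode_dup = choose_edit_mode("import", src + "import sys\n")    # ambiguous match
```

An always-search-replace policy would emit a malformed edit in the ambiguous case; a selector that can escape to a rewrite avoids that failure class entirely.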
Where Pith is reading between the lines
- The same split of viewing from acting could be tested on other context-heavy agent tasks such as repository-scale refactoring or long-document editing.
- Measuring subagent communication tokens separately would reveal whether the efficiency gain holds when the main agent must repeatedly query the Viewer.
- The new editing benchmark could be used to pre-train or select models before they are plugged into any larger agent framework.
- If coordination cost proves low, the Viewer and Editor could be reused across multiple independent main agents running in parallel.
Load-bearing premise
Dividing inspection and editing across separate subagents will not create coordination overhead or new error propagation that cancels the reported gains in resolution rate and cost.
What would settle it
A full run of SWE-Edit on SWE-bench Verified in which the sum of tokens used by the main agent plus both subagents is measured and found to exceed the cost of the original single-context baseline while the resolved rate stays the same or drops.
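The settling experiment reduces to one accounting identity: sum every token the main agent and both subagents consume, including handoff messages, and compare against the single-context baseline. A sketch with made-up numbers (the figures below are illustrative only):

```python
def net_cost_check(main_tokens: int, viewer_tokens: int,
                   editor_tokens: int, baseline_tokens: int) -> dict:
    """Compare the summed token budget of all three contexts
    (including coordination messages billed to each subagent)
    against the single-context baseline."""
    total = main_tokens + viewer_tokens + editor_tokens
    return {"total": total,
            "baseline": baseline_tokens,
            "saves_tokens": total < baseline_tokens}

# Hypothetical per-instance budgets, for illustration only.
report = net_cost_check(main_tokens=40_000, viewer_tokens=12_000,
                        editor_tokens=8_000, baseline_tokens=73_000)
```

If `saves_tokens` came out false at equal or lower resolved rate across the benchmark, the load-bearing premise above would fail.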
Original abstract
Large language model agents have achieved remarkable progress on software engineering tasks, yet current approaches suffer from a fundamental context coupling problem: the standard code editing interface conflates code inspection, modification planning, and edit execution within a single context window, forcing agents to interleave exploratory viewing with strictly formatted edit generation. This causes irrelevant information to accumulate and degrades agent performance. To address this, we propose SWE-Edit, which decomposes code editing into two specialized subagents: a Viewer that extracts task-relevant code on demand, and an Editor that executes modifications from high-level plans--allowing the main agent to focus on reasoning while delegating context-intensive operations to clean context windows. We further investigate what makes an effective editing model: observing that the prevalent find-and-replace format is error-prone, we train Qwen3-8B with GRPO to adaptively select editing modes, yielding improved editing efficiency over single-format baselines. On SWE-bench Verified, SWE-Edit improves resolved rate by 2.1% while reducing inference cost by 17.9%. We additionally propose a code editing benchmark that reliably predicts downstream agentic performance, providing practical guidance for editing model selection. Our code is publicly available at https://github.com/microsoft/SWE-Edit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SWE-Edit to address context coupling in LLM agents for software engineering tasks by decomposing editing into a Viewer subagent (for on-demand code extraction) and an Editor subagent (for executing modifications from high-level plans). It trains Qwen3-8B with GRPO for adaptive editing mode selection instead of fixed find-and-replace, and introduces a new code editing benchmark claimed to predict downstream agent performance. On SWE-bench Verified, the approach reports a 2.1% higher resolved rate and 17.9% lower inference cost, with code released publicly.
Significance. If the empirical gains hold after proper controls, the decomposition and adaptive training could meaningfully improve efficiency and performance of SWE agents by reducing irrelevant context accumulation. The public code release and the predictive benchmark (if validated) would be useful contributions for guiding editing model choices in the field.
Major comments (3)
- Abstract: the headline claims of +2.1% resolved rate and -17.9% inference cost are presented as direct measurements but without error bars, standard deviations across runs, or statistical significance tests, which is required to determine whether these differences are reliable given the high variance typical of LLM agent evaluations on SWE-bench.
- Abstract: no ablation is reported that isolates the net effect of the Viewer-Editor split (including all view-request, plan-handoff, and result-return messages) from the GRPO adaptive mode selection; without a single-context baseline that preserves equivalent viewing/editing capability, it is impossible to confirm that coordination overhead does not offset or reverse the reported cost reduction.
- Abstract: the claim that the new code editing benchmark 'reliably predicts downstream agentic performance' is asserted without any correlation analysis, cross-validation results, or quantitative evidence linking benchmark scores to SWE-bench outcomes, making this a load-bearing assertion for the paper's practical guidance contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the comments identify areas for strengthening the empirical support, we have revised the manuscript accordingly.
Point-by-point responses
Referee: Abstract: the headline claims of +2.1% resolved rate and -17.9% inference cost are presented as direct measurements but without error bars, standard deviations across runs, or statistical significance tests, which is required to determine whether these differences are reliable given the high variance typical of LLM agent evaluations on SWE-bench.
Authors: We agree that measures of variability and statistical testing are essential given the stochastic nature of LLM agent evaluations. The reported improvements are based on multiple runs, but error bars and significance tests were not included in the abstract for brevity. In the revised manuscript we will report standard deviations across three independent runs and include paired t-test p-values for the key metrics, both in the abstract and the main results. [Revision: yes]
Referee: Abstract: no ablation is reported that isolates the net effect of the Viewer-Editor split (including all view-request, plan-handoff, and result-return messages) from the GRPO adaptive mode selection; without a single-context baseline that preserves equivalent viewing/editing capability, it is impossible to confirm that coordination overhead does not offset or reverse the reported cost reduction.
Authors: The manuscript already contains separate ablations for the subagent decomposition and for GRPO-based mode selection. To isolate the coordination overhead as requested, we will add a new single-context baseline that provides equivalent viewing and editing tools. This experiment will quantify the net effect of the split versus the overhead of the additional messages and will be reported in the revised results section. [Revision: yes]
Referee: Abstract: the claim that the new code editing benchmark 'reliably predicts downstream agentic performance' is asserted without any correlation analysis, cross-validation results, or quantitative evidence linking benchmark scores to SWE-bench outcomes, making this a load-bearing assertion for the paper's practical guidance contribution.
Authors: We acknowledge that the predictive claim requires quantitative backing. The current manuscript supports the claim with qualitative alignment and case studies. In the revision we will add a dedicated validation subsection containing Pearson and Spearman correlation coefficients between benchmark scores and SWE-bench resolved rates across multiple models, together with cross-validation results, to provide the requested evidence. [Revision: yes]
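The promised validation is a standard computation. A self-contained sketch using stdlib only (the model scores below are hypothetical numbers, not the paper's data; Spearman is Pearson over ranks, and this simple ranker assumes no ties):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson over ranks (no tie handling)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))

# Hypothetical benchmark edit scores vs SWE-bench resolved rates
# for five candidate editing models.
bench = [0.62, 0.71, 0.55, 0.80, 0.67]
resolved = [0.31, 0.36, 0.27, 0.41, 0.33]
r_p, r_s = pearson(bench, resolved), spearman(bench, resolved)
```

High values of both coefficients across enough models would turn the "reliably predicts" claim from an assertion into evidence; `scipy.stats` offers the same statistics with tie handling and p-values.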
Circularity Check
No circularity: empirical measurements and architectural proposal remain independent of self-referential definitions or fitted predictions
Full rationale
The paper's central claims consist of an architectural decomposition (Viewer/Editor subagents plus GRPO-trained adaptive mode selection) and direct empirical results on SWE-bench Verified (+2.1% resolved rate, -17.9% cost). These are presented as experimental outcomes rather than quantities derived from equations or first-principles arguments inside the paper. The additional code-editing benchmark is introduced and evaluated for predictive correlation with agent performance, but this is an empirical observation, not a self-defining loop or a parameter fitted to the target metric and then renamed as a prediction. No load-bearing step reduces by construction to its own inputs, and no self-citation chain is invoked to justify uniqueness or necessity of the approach. The derivation chain is therefore self-contained experimental work.