RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations

Hang Su; Hanyu Li; Jun Zhu; Speed Zhu; Yichi Zhang; Yinpeng Dong

arxiv: 2605.26177 · v1 · pith:Z2IMSEUSnew · submitted 2026-05-25 · 💻 cs.SE · cs.AI

Hanyu Li, Yichi Zhang, Speed Zhu, Hang Su, Jun Zhu, Yinpeng Dong

Pith reviewed 2026-06-29 20:51 UTC · model grok-4.3

keywords: code agents · repository context reasoning · perturbations · SWE-Bench · software engineering benchmarks · AI coding agents · structure-aware workflows

The pith

Code agents lose most of their issue-solving ability when repository structure is perturbed while meaning stays intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Code agents achieve high scores on repository-level tasks such as issue resolution, yet it is unclear whether this reflects genuine ability to locate and reason over information spread across multiple files. The paper applies three kinds of semantics-preserving perturbations to the repository and measures the resulting performance change on SWE-Bench Verified. When the perturbations force agents to use broader context, success rates fall sharply. A second stage converts the same structural bottlenecks into standalone tasks, driving average performance from 66.8 percent down to 25.3 percent and exposing an exploration drift in which agents reach more files but extract little usable structure. The authors respond with a structure-first workflow that separates exploration from problem solving and records clear gains.

Core claim

RepoMirage shows that current code agents exhibit a significant deficiency in repository context reasoning. Semantics-preserving repository-level perturbations cause clear performance drops when correct solutions require wider context access, and converting the targeted structural bottlenecks into explicit tasks reduces average success from 66.8 percent to 25.3 percent. Trajectory analysis further reveals exploration drift, where agents access broader context yet fail to convert it into effective structure information, while a structure-first prototype workflow yields notable gains.
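To put the headline numbers in proportion, the short sketch below works out the absolute and relative size of the reported drop. Only the two percentages come from the paper; the helper name `drop_summary` is introduced here purely for illustration.

```python
def drop_summary(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct * 100
    return absolute, relative


# 66.8 and 25.3 are the average success rates quoted in the core claim.
absolute_drop, relative_drop = drop_summary(66.8, 25.3)
print(f"{absolute_drop:.1f} points absolute, {relative_drop:.1f}% relative")
```

On these figures the agents lose 41.5 percentage points, which is a little over three fifths of their original success rate.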

What carries the argument

RepoMirage, a two-stage evaluation suite built on SWE-Bench Verified that uses semantics-preserving repository-level perturbations to increase the demand for context reasoning.

If this is right

Agents that solve issues under the original repository layout lose most of that capability once structural access patterns are altered.
Trajectory logs show agents reach more files after perturbation but still fail to extract usable structural relations.
A workflow that first builds explicit structural scaffolding and only then solves the problem produces measurable gains over standard end-to-end approaches.
The gap appears not only in issue resolution but in any task whose solution depends on cross-file relations.
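The structure-first idea in the third point can be sketched as a two-stage pipeline. This is a toy illustration and not the paper's RepoAnchor implementation: a static import scan stands in for agent-driven exploration, the function names are hypothetical, and only the `INSTRUCTIONS.md` hand-off file is taken from the Figure 6 caption.

```python
import ast
import tempfile
from pathlib import Path


def collect_imports(repo_root: Path) -> dict[str, list[str]]:
    """Map each Python file (relative path) to the modules it imports."""
    relations = {}
    for path in sorted(repo_root.rglob("*.py")):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        imported = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                imported.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                imported.add(node.module)
        relations[path.relative_to(repo_root).as_posix()] = sorted(imported)
    return relations


def write_structure_notes(repo_root: Path) -> Path:
    """Stage 1 (assumed): explore the repository and persist what was learned."""
    lines = ["# Repository structure notes", ""]
    for file, imports in collect_imports(repo_root).items():
        lines.append(f"- {file} imports: {', '.join(imports) or '(nothing)'}")
    notes = repo_root / "INSTRUCTIONS.md"
    notes.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return notes


def build_solver_prompt(notes_path: Path, issue: str) -> str:
    """Stage 2 (assumed): start problem solving from the recorded structure."""
    return f"{notes_path.read_text(encoding='utf-8')}\n# Issue\n{issue}\n"


# Tiny throwaway repository used only to exercise the two stages.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "core.py").write_text("def area(r):\n    return 3.14159 * r * r\n", encoding="utf-8")
    (repo / "cli.py").write_text("from core import area\nprint(area(2))\n", encoding="utf-8")
    notes_path = write_structure_notes(repo)
    prompt = build_solver_prompt(notes_path, "area() returns the wrong value for r=0")

print(prompt)
```

The point of the split is that the solving stage never has to rediscover cross-file relations: whatever the exploration stage recorded arrives as explicit text at the top of its prompt.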

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Benchmarks that omit structural perturbations may systematically overstate agent capability on real repositories.
Methods that treat repository navigation as a distinct, first-class step could transfer to other long-context code tasks such as refactoring or test generation.
The performance gap may shrink if agents are trained with explicit rewards for recovering file-relation graphs rather than only for final patch correctness.

Load-bearing premise

The observed performance drops are caused specifically by insufficient repository context reasoning rather than by side effects of the perturbations, task reformulation, or agent exploration behavior.

What would settle it

Run the same agents on the perturbed repositories after supplying an explicit, complete structural map of file relations; if success rates recover to near the original 66.8 percent, the claim that the drop stems from missing context reasoning would be supported.

Figures

Figures reproduced from arXiv: 2605.26177 by Hang Su, Hanyu Li, Jun Zhu, Speed Zhu, Yichi Zhang, Yinpeng Dong.

**Figure 1.** Overview of REPOMIRAGE. REPOMIRAGE is an evaluation suite for probing repository context reasoning in code agents. A. REPOMIRAGE-Perturb. To test whether issue-resolution performance remains stable under higher repository-context demands, it applies three semantics-preserving perturbations while keeping the original task and evaluation unchanged. B. REPOMIRAGE-Extend. To make repository context reasoning …

**Figure 2.** Numbers of accessed files per instance resolved in SWE-Bench Verified for GPT-5 (avg. 2.15) and DeepSeek-V3.2 (avg. 3.97). While prior work shows that more than 80% of SWE-Bench Verified instances ultimately require edits to only a single file [24], and that advanced models can identify a buggy file path with up to 76% accuracy even without repository structure [25], these findings are descriptive and not…

**Figure 3.** Ablation study of different repository perturbations across models. We further examine whether the performance drop is driven by a single perturbation type or by their combined effect.

**Figure 4.** Task distribution in REPOMIRAGE-Extend. Instead of converting each benchmark instance into all four tasks, we assign every original instance to exactly one task family according to the structural signal it most naturally instantiates. Instances with multi-file gold patches are retained for Multi-File Issue Resolution, while the remaining instances are distributed across the three perturbation-derived tasks…

**Figure 5.** Behavior shifts under REPOMIRAGE-Extend. $\Delta$ measures the change from SWE-Bench Verified to REPOMIRAGE-Extend. Files Inspected counts distinct opened files; Explore Stage Proportion is the pre-edit step ratio; transition metrics report action-transition changes. $\bar{\Delta}$ and $\Delta_{0.5}$ denote the mean and median, respectively.

**Figure 6.** Pipeline of REPOANCHOR. A normal agent mixes its actions in one trajectory, failing to retrieve usable information. REPOANCHOR separates the process into structure understanding and problem solving, using an intermediate INSTRUCTIONS.md file to pass repository context forward.

**Figure 7.** Performance gains from REPOANCHOR. Resolved rates of four representative models before and after applying REPOANCHOR. The consistent gains across task types and models show that structure-first repository understanding can improve downstream task solving.

**Figure 8.** Prompt for repository structural hints. We use the same hint template for all models under the corresponding task setting. The only instance-specific part is the original problem statement, which is inserted into the prompt. This design makes the hint experiment a controlled diagnostic intervention. It approximates the structural understanding that an ideal exploration process should recover, while keeping…

**Figure 9.** Prompt for exploration stage of REPOANCHOR.

**Figure 10.** Prompt for problem solving stage of REPOANCHOR.

**Figure 11.** Behavior shifts under REPOMIRAGE-Perturb. $\Delta$ measures the change from SWE-Bench Verified to REPOMIRAGE-Perturb. Files Inspected counts distinct opened files; Explore Stage Proportion is the pre-edit step ratio; transition metrics report action-transition changes. $\bar{\Delta}$ and $\Delta_{0.5}$ denote the mean and median, respectively.

**Figure 12.** Task-level transition probability differences. Each heatmap corresponds to one REPOMIRAGE-Extend task for GPT-5. Rows denote the current action state, columns denote the next action state, and each cell reports the transition-probability difference between the REPOMIRAGE-Extend setting and the SWE-Bench setting. Red cells indicate positive shifts, blue cells indicate negative shifts, and the annotated values…

**Figure 13.** An example of INSTRUCTION.md generated by REPOANCHOR. The released code contains two main components. RepoMirage_Perturb/ builds REPOMIRAGE-Perturb repository images, applies repository-level perturbations, and exports per-instance metadata. RepoMirage_Extend/ uses this metadata to assign instances to task families, generate REPOMIRAGE-Extend task images, and provide deterministic validation scripts for a…
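The behavior metrics named in the Figure 5, Figure 11, and Figure 12 captions can be made concrete with a small sketch. It assumes a trajectory is a list of (action, target) steps with a made-up action vocabulary; the paper's actual log format and action taxonomy are not given here, so every name below is hypothetical.

```python
from collections import Counter, defaultdict

# A trajectory is a list of (action, target) steps. The action names
# ("search", "open", "edit", "test") are a stand-in for real agent logs.
Trajectory = list[tuple[str, str]]


def files_inspected(traj: Trajectory) -> int:
    """Count distinct files opened anywhere in the trajectory."""
    return len({target for action, target in traj if action == "open"})


def explore_stage_proportion(traj: Trajectory) -> float:
    """Fraction of steps taken before the first edit (1.0 if no edit occurs)."""
    if not traj:
        return 0.0
    first_edit = next(
        (i for i, (action, _) in enumerate(traj) if action == "edit"), len(traj)
    )
    return first_edit / len(traj)


def transition_probs(traj: Trajectory) -> dict[str, dict[str, float]]:
    """Row-normalised probabilities of moving from one action to the next."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for (current, _), (following, _) in zip(traj, traj[1:]):
        counts[current][following] += 1
    return {
        current: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for current, row in counts.items()
    }


def transition_delta(base, shifted):
    """Per-cell difference (shifted minus base), treating missing cells as 0."""
    cells = {(c, n) for probs in (base, shifted) for c, row in probs.items() for n in row}
    return {
        (c, n): shifted.get(c, {}).get(n, 0.0) - base.get(c, {}).get(n, 0.0)
        for c, n in cells
    }


# Invented trajectories: the second one opens more files before editing.
baseline = [("search", "bug"), ("open", "a.py"), ("edit", "a.py"), ("test", "")]
harder = [("search", "bug"), ("open", "a.py"), ("open", "b.py"),
          ("open", "c.py"), ("edit", "a.py"), ("test", "")]

delta_files = files_inspected(harder) - files_inspected(baseline)
delta_explore = explore_stage_proportion(harder) - explore_stage_proportion(baseline)
delta_transitions = transition_delta(transition_probs(baseline), transition_probs(harder))
print(delta_files, round(delta_explore, 3), round(delta_transitions[("open", "open")], 3))
```

In this invented pair the harder trajectory opens two more files, spends a larger share of its steps before the first edit, and shifts probability mass from open-then-edit toward open-then-open, which is the kind of pattern the exploration-drift description points at.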

Original abstract

Code agents are currently having skillful performance on repository-level software engineering benchmarks, but it remains unclear whether success on end-to-end tasks such as issue resolution truly reflects repository context reasoning, the ability to identify the task-relevant information across multiple files and reason over the relations among them. To investigate this question, we introduce RepoMirage, a two-stage evaluation suite built on SWE-Bench Verified that adopts perturbation as a diagnostic tool to increase the demand for context reasoning by transforming how the repository is exposed. First, RepoMirage-Perturb applies three types of semantics-preserving repository-level perturbations, revealing a clear performance drop when correct solving requires broader context access. RepoMirage-Extend further turns perturbation-targeted structural bottlenecks into explicit tasks beyond issue resolution, where the average performance declines from 66.8% in the original setting to 25.3%, indicating a significant deficiency in repository context reasoning. Further trajectory analysis reveals an exploration drift, where agents access broader repository context but fail to turn it into effective structure information. Motivated by this observation, we propose RepoAnchor, a structure-first prototype workflow that separates repository exploration from downstream problem solving, and show that explicit structural scaffolding yields notable gains. These results uncover an previously overlooked gap in repository context reasoning for code agents and suggest that stronger structure-aware methods are potential to improve them.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Code agents drop sharply under semantics-preserving repo perturbations, but the drops may reflect task or search artifacts rather than a clean isolation of context-reasoning failure.

The one or two things to know: this paper reports large performance drops on SWE-Bench when repositories are perturbed in semantics-preserving ways, and an even steeper fall from 66.8% to 25.3% when those structural issues are turned into explicit tasks. The trajectory analysis and the RepoAnchor prototype are secondary but useful additions.

What is new is the two-stage diagnostic built on an existing benchmark. RepoMirage-Perturb applies three perturbation types to force broader context access, and RepoMirage-Extend converts the resulting bottlenecks into standalone structure-extraction tasks. The exploration-drift observation and the simple structure-first workflow that reportedly improves results are straightforward extensions of the main idea.

The work does a decent job of turning an existing benchmark into a probe for a plausible limitation. The numbers are large and the setup is easy to understand. It engages directly with real agent behavior on multi-file tasks rather than inventing a new benchmark from scratch.

The soft spot is the causal claim. The abstract and the stress-test note both leave open whether the observed drops come specifically from weak context reasoning or from side effects of the perturbations themselves, changes in agent search costs, or the way tasks are reformulated. Without explicit controls that hold non-context factors constant, the attribution to "significant deficiency in repository context reasoning" is not fully secured. The paper would be stronger with more detail on those controls.

This is for researchers working on code agents and evaluation methods in AI for software engineering. Readers who care about diagnosing agent limitations on real repositories will get value from the perturbation approach and the quantified gaps. It deserves a serious referee because the empirical signal is direct and the issue is relevant to the subfield, even if the interpretation needs tightening.

Referee Report

3 major / 2 minor

Summary. The paper introduces RepoMirage, a two-stage benchmark built on SWE-Bench Verified that uses semantics-preserving repository perturbations (RepoMirage-Perturb) followed by task reformulation (RepoMirage-Extend) to probe whether code agents' success on issue resolution reflects genuine repository context reasoning. It reports a drop from 66.8% to 25.3% average performance on the extended tasks, trajectory analysis showing exploration drift, and a structure-first prototype (RepoAnchor) that yields gains, concluding that current agents have a significant deficiency in repository context reasoning.

Significance. If the perturbations and reformulations cleanly isolate context-reasoning demand without introducing confounding task hardness or exploration costs, the work would identify a previously under-examined limitation in repository-level code agents and motivate structure-aware scaffolding methods. The empirical framing on an external benchmark and the proposal of an explicit workflow are positive features.

major comments (3)

[§4] §4 (RepoMirage-Extend): the claim that the 66.8% → 25.3% drop demonstrates a 'significant deficiency in repository context reasoning' assumes the task reformulation affects only structural access demand. No ablation is described that holds task formulation, exploration budget, and non-context difficulty constant while varying only the perturbation-induced structural bottlenecks; without such controls the attribution does not follow.
[§3] §3 (RepoMirage-Perturb): the three semantics-preserving perturbations are presented as increasing context-reasoning demand, yet the manuscript provides no quantitative verification that the perturbations leave file-access costs, search-heuristic compatibility, and implicit task hardness unchanged. If any of these factors shift, the observed performance decline cannot be isolated to context reasoning.
[§5] Trajectory analysis (mentioned in abstract and §5): the reported 'exploration drift' is offered as supporting evidence, but the paper does not report statistical controls or baseline comparisons that rule out changes in agent search heuristics or file-access costs as the primary driver of the drift.

minor comments (2)

[Abstract] Abstract: 'an previously overlooked gap' should read 'a previously overlooked gap'.
The manuscript should include explicit data-exclusion rules, statistical significance tests on the reported drops, and the precise definition of 'average performance' (e.g., pass@1, success rate across instances) to allow verification.
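One way the requested significance test could look, since the same instances are evaluated before and after perturbation, is an exact McNemar test on paired outcomes. This is a suggestion sketched here, not a procedure taken from the manuscript or the report; the function name and the ten example outcomes are invented.

```python
from math import comb


def paired_drop_test(before: list[bool], after: list[bool]) -> tuple[int, int, float]:
    """Exact two-sided McNemar test on paired resolved/unresolved outcomes.

    Returns (lost, gained, p_value), where `lost` counts instances resolved
    only before the change and `gained` counts those resolved only after.
    """
    if len(before) != len(after):
        raise ValueError("outcome lists must cover the same instances")
    lost = sum(1 for b, a in zip(before, after) if b and not a)
    gained = sum(1 for b, a in zip(before, after) if a and not b)
    discordant = lost + gained
    if discordant == 0:
        return lost, gained, 1.0
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(discordant, k) for k in range(min(lost, gained) + 1)) / 2**discordant
    return lost, gained, min(1.0, 2 * tail)


# Made-up outcomes for ten instances, not data from the paper.
original = [True] * 8 + [False] * 2
perturbed = [True] * 2 + [False] * 6 + [True, False]
lost, gained, p_value = paired_drop_test(original, perturbed)
print(lost, gained, p_value)
```

A paired test of this kind uses only the instances whose outcome changed, so it stays valid even when most instances are resolved, or unresolved, in both settings.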

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that additional quantitative controls and ablations would strengthen the isolation of repository context reasoning effects in our experiments. We will revise the manuscript to incorporate these elements as outlined below.

Referee: [§4] §4 (RepoMirage-Extend): the claim that the 66.8% → 25.3% drop demonstrates a 'significant deficiency in repository context reasoning' assumes the task reformulation affects only structural access demand. No ablation is described that holds task formulation, exploration budget, and non-context difficulty constant while varying only the perturbation-induced structural bottlenecks; without such controls the attribution does not follow.

Authors: We acknowledge that an explicit ablation isolating only the structural access demand would provide stronger evidence. In the revised manuscript, we will add such an ablation study in §4, where we compare performance under controlled conditions that vary the structural bottlenecks while keeping task formulation, exploration budget, and other difficulty factors constant. This will help confirm that the performance drop is attributable to context reasoning demands.
Revision: yes

Referee: [§3] §3 (RepoMirage-Perturb): the three semantics-preserving perturbations are presented as increasing context-reasoning demand, yet the manuscript provides no quantitative verification that the perturbations leave file-access costs, search-heuristic compatibility, and implicit task hardness unchanged. If any of these factors shift, the observed performance decline cannot be isolated to context reasoning.

Authors: The perturbations were constructed to preserve semantics and task solutions while altering repository structure to necessitate broader context access. However, we agree that quantitative verification of unchanged factors would be valuable. We will revise §3 to include quantitative comparisons, such as metrics on file-access costs, search compatibility, and task hardness before and after perturbations, to demonstrate that these remain consistent.
Revision: yes

Referee: [§5] Trajectory analysis (mentioned in abstract and §5): the reported 'exploration drift' is offered as supporting evidence, but the paper does not report statistical controls or baseline comparisons that rule out changes in agent search heuristics or file-access costs as the primary driver of the drift.

Authors: The trajectory analysis in §5 illustrates that agents access more files but fail to effectively use the structural information. To strengthen this, we will incorporate statistical controls and baseline comparisons in the revised manuscript to rule out alternative drivers such as changes in search heuristics or file-access costs.
Revision: yes

Circularity Check

0 steps flagged

Empirical evaluation on external benchmark; no circular derivations

The paper reports experimental results from applying semantics-preserving perturbations to the external SWE-Bench Verified benchmark and measuring agent performance drops (66.8% to 25.3%). No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. Central claims rest on direct empirical observations and trajectory analysis rather than any self-definitional, fitted-input, or self-citation reductions. The work is therefore self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the assumption that SWE-Bench Verified remains a valid measure of agent capability after perturbation and that performance differences can be attributed to context reasoning. No free parameters or invented entities are introduced.

axioms (1)

Domain assumption: SWE-Bench Verified is a valid benchmark for measuring repository-level code agent performance. The entire evaluation suite is constructed on top of this benchmark.

Reference graph

Works this paper leans on

80 extracted references · 25 canonical work pages · 16 internal anchors

[1] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. Evaluating large language models in class-level code generation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024.
[2] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[3] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
[4] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps. NeurIPS, 2021.
[5] Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, et al. Multi-swe-bench: A multilingual benchmark for issue resolving. arXiv preprint arXiv:2504.02605, 2025.
[6] Qinyun Wu, Chao Peng, Pengfei Gao, Ruida Hu, Haoyu Gan, Bo Jiang, Jinhe Tang, Zhiwen Deng, Zhanming Guan, Cuiyun Gao, et al. Repomastereval: Evaluating code completion via real-world repositories. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 3672–3683. IEEE, 2025.
[7] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024.
[8] Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems. In The Twelfth International Conference on Learning Representations, 2023.
[9] Ahmed Fawzy, Amjed Tahir, and Kelly Blincoe. Vibe coding in practice: Motivations, challenges, and a future outlook – a grey literature review. arXiv preprint arXiv:2510.00328, 2025.
[10] HanXiang Xu, ShenAo Wang, Ningke Li, Kailong Wang, Yanjie Zhao, Kai Chen, Ting Yu, Yang Liu, and HaoYu Wang. Large language models for cyber security: A systematic literature review. ACM Transactions on Software Engineering and Methodology, 2024.
[11] Kevin Zheyuan Cui, Mert Demirer, Sonia Jaffe, Leon Musolff, Sida Peng, and Tobias Salz. The effects of generative ai on high-skilled work: Evidence from three field experiments with software developers. Management Science, 2026.
[12] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1592–1604, 2024.
[13] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint arXiv:2407.01489, 2024.
[14] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024.
[15] Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Tim Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja Sivaprasad, Woo Jung Kim, et al. Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents. arXiv preprint arXiv:2504.08703, 2025.
[16] Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941, 2025.
[17] Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry. Introducing SWE-bench verified. URL https://openai.com/index/introducing-swe-bench-verified/
[19] Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, and Dongmei Zhang. Swe-bench goes live! arXiv preprint arXiv:2505.23419, 2025.
[20] John Yang, Kilian Lieret, Carlos E Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. Swe-smith: Scaling data for software engineering agents. arXiv preprint arXiv:2504.21798, 2025.
[21] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9185–9193, 2018.
[22] R Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 3428–3448, 2019.
[23] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025.
[24] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
[25] Mahir Labib Dihan, Faria Binta Awal, and Md Ishrak Ahsan. Patchrecall: Patch-driven retrieval for automated program repair. arXiv preprint arXiv:2604.10481, 2026.
[26] Shanchao Liang, Spandan Garg, and Roshanak Zilouchian Moghaddam. The swe-bench illusion: When state-of-the-art llms remember instead of reason. arXiv preprint arXiv:2506.12286, 2025.
[27] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. Advances in neural information processing systems, 32, 2019.
[29] OpenAI. Introducing GPT-4.1 in the api, April 2025. URL https://openai.com/index/gpt-4-1/
[30] Google. Gemini-3.1-Pro model card, February 2025. URL https://deepmind.google/models/model-cards/gemini-3-1-pro/
[31] Anthropic. Claude system card, 2026. URL https://www.anthropic.com/system-cards
[32] Google. MiniMax-M2.7 model card, March 2026. URL https://www.minimax.io/models/text/m27
[33] Qwen Team. Qwen3-coder-next technical report. Technical report, February. URL https://github.com/QwenLM/Qwen3-Coder/blob/main/qwen3_coder_next_tech_report.pdf
[35] Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026. URL https://qwen.ai/blog?id=qwen3.6-35b-a3b.
[36] Islem Bouzenia and Michael Pradel. Understanding software engineering agents: A study of thought-action-result trajectories. arXiv preprint arXiv:2506.18824, 2025.
[37] Ira Ceka, Saurabh Pujar, Shyam Ramji, Luca Buratti, Gail Kaiser, and Baishakhi Ray. Understanding software engineering agents through the lens of traceability: An empirical study. arXiv preprint arXiv:2506.08311, 2025. (Pith work-page title: Understanding Automated Program Repair Agents Through the Lens of Traceability: An Empirical Study.)
[38] Myeongsoo Kim, Dingmin Wang, Siwei Cui, Farima Farmahinifarahani, Shweta Garg, Baishakhi Ray, Terry Yue Zhuo, Rajdeep Mukherjee, and Varun Kumar. Trajeval: Decomposing code agent trajectories for fine-grained diagnosis. arXiv preprint arXiv:2603.24631, 2026. (Pith work-page title: Coherence Collapse: Diagnosing Why Code Agents Fail After Reaching the Right Code.)
[39] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
[40]

Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in neural information processing systems, 36:21558–21572, 2023

[41]

Competition-level code generation with alphacode

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022
[42]

Ds-1000: A natural and reliable benchmark for data science code generation

Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319–18345. PMLR, 2023
[43]

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. Cruxeval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065, 2024
[44]

Repocoder: Repository-level code completion through iterative retrieval and generation

Fengji Zhang, Bei Chen, Yue Zhang, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation. arXiv preprint arXiv:2303.12570, 2023
[45]

Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion

Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. Advances in Neural Information Processing Systems, 36:46701–46723, 2023
[46]

Swe-bench++: A framework for the scalable generation of software engineering benchmarks from open-source repositories

Lilin Wang, Lucas Ramalho, Alan Celestino, Phuc Anthony Pham, Yu Liu, Umang Kumar Sinha, Andres Portillo, Onassis Osunwa, and Gabriel Maduekwe. Swe-bench++: A framework for the scalable generation of software engineering benchmarks from open-source repositories. arXiv preprint arXiv:2512.17419, 2025

[47]

Measuring robustness to natural distribution shifts in image classification

Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33:18583–18599, 2020
[48]

Intriguing properties of neural networks

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013
[49]

Explaining and Harnessing Adversarial Examples

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014
[50]

Beyond accuracy: Behavioral testing of nlp models with checklist

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of nlp models with checklist. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 4902–4912, 2020
[51]

Evaluating models’ local decision boundaries via contrast sets

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, et al. Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, 2020.
[52]

Recode: Robustness evaluation of code generation models

Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, et al. Recode: Robustness evaluation of code generation models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13818–13843, 2023
[53]

Variable renaming-based adversarial test generation for code model: Benchmark and enhancement

Jin Wen, Qiang Hu, Yuejun Guo, Maxime Cordy, and Yves Le Traon. Variable renaming-based adversarial test generation for code model: Benchmark and enhancement. ACM Transactions on Software Engineering and Methodology, 35(1):1–28, 2025
[54]

Cctest: Testing and repairing code completion systems

Zongjie Li, Chaozheng Wang, Zhibo Liu, Haoxuan Wang, Dong Chen, Shuai Wang, and Cuiyun Gao. Cctest: Testing and repairing code completion systems. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1238–1250. IEEE, 2023
[55]

Dip: Dead code insertion based black-box attack for programming language model

CheolWon Na, YunSeok Choi, and Jee-Hyong Lee. Dip: Dead code insertion based black-box attack for programming language model. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7777–7791, 2023
[56]

Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?

Pedro Orvalho and Marta Kwiatkowska. Are large language models robust in understanding code against semantics-preserving mutations? arXiv preprint arXiv:2505.10443, 2025
[57]

What can large language models capture about code functional equivalence?

Nickil Maveli, Antonio Vergari, and Shay B Cohen. What can large language models capture about code functional equivalence? In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6865–6903, 2025
[58]

Man Ho Lam, Chaozheng Wang, Jen-tse Huang, and Michael R. Lyu. Codecrash: Stress testing llm reasoning under structural and semantic perturbations. arXiv preprint arXiv:2504.14119, 2025
[59]

Gistify! codebase-level understanding via runtime execution

Hyunji Lee, Minseon Kim, Chinmay Singh, Matheus Pereira, Atharv Sonwane, Isadora White, Elias Stengel-Eskin, Mohit Bansal, Zhengyan Shi, Alessandro Sordoni, et al. Gistify! codebase-level understanding via runtime execution. arXiv preprint arXiv:2510.26790, 2025
[60]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

A Experimental Details

A.1 Compute Resources

All experime...
Renamed Files: Some files that contain the real runtime logic are renamed, we call them "real runtime files" below

Decoy Files: For each renamed real file, here are also distracting files. These files are decoy only, may be very similar to the real file

Proxy Files: Files whose names start with `proxy_` are proxy import files. Together, they form a four-layer import graph. The real runtime files do not directly import the final Python library; instead, it reaches that library through this proxy chain

JSON Files: Files whose names start with `dependency_` are external dependency files. Some constants originally belonging to the real runtime files have been extracted into these dependency files, and the real runtime files reads those values from them

Wrapper Folder: The real runtime files are not directly imported by the rest of the repository under its own filename. Instead, at the same directory level, there is a folder whose name is the original runtime name of that code unit. Inside that folder, an `__init__.py` file re-exports or imports the original file. In other words, the repository wraps som...
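
The five perturbation types described above can be pictured with a toy layout. The following is a minimal Python sketch, not the authors' generator: every file and module name (`toy_geometry`, `toy_geometry_impl`, `toy_geometry_legacy`, `dependency_toy_geometry.json`, `proxy_1` to `proxy_4`) is hypothetical, `math` stands in for the final Python library, and the way the real runtime file reads its JSON constants is an assumption.

```python
import importlib
import json
import sys
import tempfile
from pathlib import Path

# Hypothetical module names used only for this illustration.
TOY_MODULES = ("toy_geometry", "toy_geometry_impl",
               "proxy_1", "proxy_2", "proxy_3", "proxy_4")


def build_toy_repo(root: Path) -> None:
    """Write a tiny flat repository that mimics the five perturbation types."""
    # Proxy files: proxy_1 -> proxy_2 -> proxy_3 -> proxy_4 -> math.
    (root / "proxy_4.py").write_text("import math as lib\n")
    for i in (3, 2, 1):
        (root / f"proxy_{i}.py").write_text(f"from proxy_{i + 1} import lib\n")

    # JSON file: a constant extracted out of the real runtime file.
    (root / "dependency_toy_geometry.json").write_text(json.dumps({"SCALE": 2}))

    # Renamed file: the real runtime logic lives under a non-obvious name.
    (root / "toy_geometry_impl.py").write_text(
        "import json\n"
        "from pathlib import Path\n"
        "from proxy_1 import lib\n"
        "_cfg = Path(__file__).with_name('dependency_toy_geometry.json')\n"
        "SCALE = json.loads(_cfg.read_text())['SCALE']\n"
        "def hypot_scaled(a, b):\n"
        "    return SCALE * lib.hypot(a, b)\n"
    )

    # Decoy file: similar surface, never imported at runtime.
    (root / "toy_geometry_legacy.py").write_text(
        "def hypot_scaled(a, b):\n"
        "    return 0.0\n"
    )

    # Wrapper folder: carries the original name and re-exports the real file.
    wrapper = root / "toy_geometry"
    wrapper.mkdir()
    (wrapper / "__init__.py").write_text(
        "from toy_geometry_impl import hypot_scaled\n"
    )


def call_through_perturbed_layout() -> float:
    """Import the wrapper by its original name and call the real logic."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_toy_repo(root)
        sys.path.insert(0, str(root))
        importlib.invalidate_caches()
        try:
            toy_geometry = importlib.import_module("toy_geometry")
            return toy_geometry.hypot_scaled(3, 4)
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(str(root))
            for name in TOY_MODULES:
                sys.modules.pop(name, None)


if __name__ == "__main__":
    print(call_through_perturbed_layout())  # 2 * hypot(3, 4) = 10.0
```

In this sketch a caller that writes `import toy_geometry` never names the file holding the logic: the call passes through the wrapper folder, the renamed file, the JSON constant, and four proxy hops before reaching `math.hypot`, while the decoy sits beside them unused.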
1. Locate the directory or directories most relevant to the task
2. Thoroughly explore and understand the structure inside that area
3. Infer relative file roles
4. Write a structured summary into /testbed/INSTRUCTION.md inside the target directory
5. Stop after finishing /testbed/INSTRUCTION.md. A second agent will later perform editing based on your output.

## Guide
1. First make a reasonable guess about the aiming directory based on the task description and repository layout. The aiming may be multiple
2. Enter the guessed aiming directory and list the local files and subdirectories
3. Read the files that you think are relevant to the task or necessary for understanding the structure
4. Read the files related to the candidate files carefully
5. Understanding the structure and logic, you should use small validation testing commands that follows the same loading or resolution logic as the code whenever possible. Your final answer must based on runtime evidence, do not only trust the searching and reading, do not trust the filename or the other surface patterns. You should explore deeply into the a...
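
The guide's request for runtime evidence, as opposed to trusting filenames, can be illustrated with a small validation command. This is a hedged sketch of one possible check, not a command taken from the paper: the helper `runtime_origin`, the demo layout, and the module names `toy_router`, `toy_router_v2`, and `toy_router_legacy` are all hypothetical. It relies only on the standard-library calls `importlib.util.find_spec`, `importlib.import_module`, and `inspect.getsourcefile`.

```python
import importlib
import importlib.util
import inspect
import sys
import tempfile
from pathlib import Path


def runtime_origin(module_name: str, attr: str, search_root: Path) -> dict:
    """Report where `module_name` resolves and which file defines `attr`."""
    sys.path.insert(0, str(search_root))
    importlib.invalidate_caches()
    try:
        # Where the import system says the name resolves (no guessing by filename).
        spec = importlib.util.find_spec(module_name)
        module = importlib.import_module(module_name)
        obj = getattr(module, attr)
        return {
            "entry_point": "/".join(Path(spec.origin).parts[-2:]),
            # The file whose code object actually backs the attribute.
            "defined_in": Path(inspect.getsourcefile(obj)).name,
        }
    finally:
        sys.path.remove(str(search_root))


def demo() -> dict:
    """Build a hypothetical wrapper/real/decoy trio and trace it at runtime."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "toy_router_legacy.py").write_text(
            "def route(x):\n    return 'decoy'\n"
        )
        (root / "toy_router_v2.py").write_text(
            "def route(x):\n    return 'real'\n"
        )
        (root / "toy_router").mkdir()
        (root / "toy_router" / "__init__.py").write_text(
            "from toy_router_v2 import route\n"
        )
        try:
            return runtime_origin("toy_router", "route", root)
        finally:
            for name in ("toy_router", "toy_router_v2"):
                sys.modules.pop(name, None)


if __name__ == "__main__":
    print(demo())
```

Under these assumptions the check reports `toy_router/__init__.py` as the entry point and `toy_router_v2.py` as the defining file, so the similarly named `toy_router_legacy.py` is ruled out by resolution logic and not by how its name looks.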
1. The directories and all files you have read, and your runtime evidence
   - filename1: XXX; which file uses it: XXX; which file is used by it: XXX; description: XXX.
   - filename2: XXX; which file uses it: XXX; which file is used by it: XXX; description: XXX
2. The detailed struture of the directory. The relationships between files. And your runtime evidence
3. Other unclear points (Do not mention any unclear points that is obviously irrelative to the task)

Figure 9: Prompt for exploration stage of REPOANCHOR.

B Additional Analysis and Examples

B.1 Behavior Shifts on REPOMIRAGE-Perturb

Fig. 11 shows the behavior shifts on REPOMIRAGE-Perturb, where agents still solve the original issue-resolution task but under p...
1. Your colleague has written an INSTRUCTION.md in the /teatbed directory, detailing the location of the task target directory and files, the structure of task related files in the target directory
2. You should read and utilize this document reasonably. The purpose of this document is to reduce the time required for you to understand the directory structure
3. You can also focus on any unclear points in the document, since the content here may be wrong and it may be an important point.

Figure 10: Prompt for problem solving stage of REPOANCHOR.

Figure 11: Behavior shifts under REPOMIRAGE-Perturb. ∆ measures the change from SWE-Bench Verified to REPOMIRAGE-Perturb. Files Ins...

[30] [31]

Claude system card, 2026

Anthropic. Claude system card, 2026. URL https://www.anthropic.com/system-cards

2026

[31] [32]

MiniMax-M2.7 model card, March 2026

Google. MiniMax-M2.7 model card, March 2026. URL https://www.minimax.io/models/ text/m27

2026

[32] [33]

Qwen3-coder-next technical report

Qwen Team. Qwen3-coder-next technical report. Technical report, February

[33] [34]

URL https://github.com/QwenLM/Qwen3-Coder/blob/main/qwen3_coder_ next_tech_report.pdf

[34] [35]

Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026

Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026. URL https://qwen.ai/blog?id=qwen3.6-35b-a3b. 11

2026

[35] [36]

Understanding software engineering agents: A study of thought-action-result trajectories.arXiv preprint arXiv:2506.18824, 2025

Islem Bouzenia and Michael Pradel. Understanding software engineering agents: A study of thought-action-result trajectories.arXiv preprint arXiv:2506.18824, 2025

work page arXiv 2025

[36] [37]

Understanding Automated Program Repair Agents Through the Lens of Traceability: An Empirical Study

Ira Ceka, Saurabh Pujar, Shyam Ramji, Luca Buratti, Gail Kaiser, and Baishakhi Ray. Under- standing software engineering agents through the lens of traceability: An empirical study.arXiv preprint arXiv:2506.08311, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [38]

Coherence Collapse: Diagnosing Why Code Agents Fail After Reaching the Right Code

Myeongsoo Kim, Dingmin Wang, Siwei Cui, Farima Farmahinifarahani, Shweta Garg, Baishakhi Ray, Terry Yue Zhuo, Rajdeep Mukherjee, and Varun Kumar. Trajeval: Decomposing code agent trajectories for fine-grained diagnosis.arXiv preprint arXiv:2603.24631, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[38] [39]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[39] [40]

Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in neural information processing systems, 36:21558–21572, 2023

2023

[40] [41]

Competition-level code generation with alphacode.Science, 378(6624):1092–1097, 2022

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode.Science, 378(6624):1092–1097, 2022

2022

[41] [42]

Ds-1000: A natural and reliable benchmark for data science code generation

Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen- tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation. InInternational Conference on Machine Learning, pages 18319–18345. PMLR, 2023

2023

[42] [43]

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. Cruxeval: A benchmark for code reasoning, understanding and execution.arXiv preprint arXiv:2401.03065, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[43] [44]

Repocoder: Repository-level code completion through iterative retrieval and generation.arXiv preprint arXiv:2303.12570, 2023

Fengji Zhang, Bei Chen, Yue Zhang, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation.arXiv preprint arXiv:2303.12570, 2023

work page arXiv 2023

[44] [45]

Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion.Advances in Neural Information Processing Systems, 36:46701–46723, 2023

Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Kr- ishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion.Advances in Neural Information Processing Systems, 36:46701–46723, 2023

2023

[45] [46]

Swe-bench++: A framework for the scalable generation of software engineering benchmarks from open-source repositories

Lilin Wang, Lucas Ramalho, Alan Celestino, Phuc Anthony Pham, Yu Liu, Umang Kumar Sinha, Andres Portillo, Onassis Osunwa, and Gabriel Maduekwe. Swe-bench++: A framework for the scalable generation of software engineering benchmarks from open-source repositories. arXiv preprint arXiv:2512.17419, 2025

work page arXiv 2025

[46] [47]

Measuring robustness to natural distribution shifts in image classification.Advances in Neural Information Processing Systems, 33:18583–18599, 2020

Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification.Advances in Neural Information Processing Systems, 33:18583–18599, 2020

2020

[47] [48]

Intriguing properties of neural networks

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfel- low, and Rob Fergus. Intriguing properties of neural networks.arXiv preprint arXiv:1312.6199, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[48] [49]

Explaining and Harnessing Adversarial Examples

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversar- ial examples.arXiv preprint arXiv:1412.6572, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[49] [50]

Beyond accuracy: Behavioral testing of nlp models with checklist

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of nlp models with checklist. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 4902–4912, 2020

2020

[50] [51]

Evaluating models’ local deci- sion boundaries via contrast sets

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, et al. Evaluating models’ local deci- sion boundaries via contrast sets. InFindings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, 2020. 12

2020

[51] [52]

Recode: Robustness evaluation of code generation models

Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, et al. Recode: Robustness evaluation of code generation models. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13818–13843, 2023

2023

[52] [53]

Variable renaming-based adversarial test generation for code model: Benchmark and enhancement.ACM Transactions on Software Engineering and Methodology, 35(1):1–28, 2025

Jin Wen, Qiang Hu, Yuejun Guo, Maxime Cordy, and Yves Le Traon. Variable renaming-based adversarial test generation for code model: Benchmark and enhancement.ACM Transactions on Software Engineering and Methodology, 35(1):1–28, 2025

2025

[53] [54]

Cctest: Testing and repairing code completion systems

Zongjie Li, Chaozheng Wang, Zhibo Liu, Haoxuan Wang, Dong Chen, Shuai Wang, and Cuiyun Gao. Cctest: Testing and repairing code completion systems. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1238–1250. IEEE, 2023

2023

[54] [55]

Dip: Dead code insertion based black-box attack for programming language model

CheolWon Na, YunSeok Choi, and Jee-Hyong Lee. Dip: Dead code insertion based black-box attack for programming language model. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7777–7791, 2023

2023

[55] [56]

Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?

Pedro Orvalho and Marta Kwiatkowska. Are large language models robust in understanding code against semantics-preserving mutations?arXiv preprint arXiv:2505.10443, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[56] [57]

What can large language models capture about code functional equivalence? InFindings of the Association for Computational Linguistics: NAACL 2025, pages 6865–6903, 2025

Nickil Maveli, Antonio Vergari, and Shay B Cohen. What can large language models capture about code functional equivalence? InFindings of the Association for Computational Linguistics: NAACL 2025, pages 6865–6903, 2025

2025

[57] [58]

Man Ho Lam, Chaozheng Wang, Jen-tse Huang, and Michael R. Lyu. Codecrash: Stress testing llm reasoning under structural and semantic perturbations.arXiv preprint arXiv:2504.14119, 2025

work page arXiv 2025

[58] [59]

Gistify! codebase- level understanding via runtime execution.arXiv preprint arXiv:2510.26790, 2025

Hyunji Lee, Minseon Kim, Chinmay Singh, Matheus Pereira, Atharv Sonwane, Isadora White, Elias Stengel-Eskin, Mohit Bansal, Zhengyan Shi, Alessandro Sordoni, et al. Gistify! codebase- level understanding via runtime execution.arXiv preprint arXiv:2510.26790, 2025

work page arXiv 2025

[59] [60]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large lan- guage model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 13 A Experimental Details A.1 Compute Resources All experime...

[60] [61]

Renamed Files: Some files that contain the real runtime logic are renamed; we call them "real runtime files" below

[61] [62]

Decoy Files: For each renamed real file, there are also distracting files. These files are decoys only and may be very similar to the real file

[62] [63]

Proxy Files: Files whose names start with `proxy_` are proxy import files. Together, they form a four-layer import graph. The real runtime files do not directly import the final Python library; instead, they reach that library through this proxy chain

[63] [64]

JSON Files: Files whose names start with `dependency_` are external dependency files. Some constants originally belonging to the real runtime files have been extracted into these dependency files, and the real runtime files read those values from them

[64] [65]

Wrapper Folder: The real runtime files are not directly imported by the rest of the repository under their own filenames. Instead, at the same directory level, there is a folder whose name is the original runtime name of that code unit. Inside that folder, an `__init__.py` file re-exports or imports the original file. In other words, the repository wraps som...
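The layout these items describe can be reproduced in miniature. The following is a minimal sketch under stated assumptions, not the authors' perturbation tooling: every file and module name (`demo_core`, `real_demo_core.py`, `demo_core_impl.py`, `dependency_demo_core.json`), the constant `SCALE`, and the choice of `math` as the "final Python library" are hypothetical and chosen only for illustration.

```python
import importlib
import json
import sys
import tempfile
from pathlib import Path

# All file and module names below are hypothetical, chosen for illustration.
MODULES = ["demo_core", "real_demo_core", "proxy_1", "proxy_2", "proxy_3", "proxy_4"]


def build_perturbed_repo(root: Path) -> None:
    """Write a tiny repository that mimics the layout described above."""
    # Proxy files: proxy_1 -> proxy_2 -> proxy_3 -> proxy_4 -> math.
    for i in range(1, 4):
        (root / f"proxy_{i}.py").write_text(f"from proxy_{i + 1} import sqrt\n")
    (root / "proxy_4.py").write_text("from math import sqrt\n")

    # JSON file: a constant moved out of the real runtime file.
    (root / "dependency_demo_core.json").write_text(json.dumps({"SCALE": 4}))

    # Renamed real runtime file: reads the constant and uses the proxy chain.
    (root / "real_demo_core.py").write_text(
        "import json\n"
        "from pathlib import Path\n"
        "from proxy_1 import sqrt\n"
        "_cfg = Path(__file__).with_name('dependency_demo_core.json')\n"
        "SCALE = json.loads(_cfg.read_text())['SCALE']\n"
        "def scaled_root(x):\n"
        "    return SCALE * sqrt(x)\n"
    )

    # Decoy file: similar name and signature, never imported at runtime.
    (root / "demo_core_impl.py").write_text("def scaled_root(x):\n    return x\n")

    # Wrapper folder: carries the original name and re-exports the real file.
    wrapper = root / "demo_core"
    wrapper.mkdir()
    (wrapper / "__init__.py").write_text("from real_demo_core import scaled_root\n")


def load_and_locate() -> tuple:
    """Import through the wrapper and report where the function really lives."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_perturbed_repo(root)
        sys.path.insert(0, str(root))
        importlib.invalidate_caches()
        try:
            wrapper = importlib.import_module("demo_core")
            result = wrapper.scaled_root(9.0)
            defining_module = wrapper.scaled_root.__module__
        finally:
            # Undo the path and module-cache changes made for the demo.
            sys.path.remove(str(root))
            for name in MODULES:
                sys.modules.pop(name, None)
    return result, defining_module
```

Calling `load_and_locate()` imports the code unit under its original name, yet the reported defining module is the renamed file, which is the gap between surface names and runtime behavior that the perturbations are meant to create.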

[65] [66]

Locate the directory or directories most relevant to the task

[66] [67]

Thoroughly explore and understand the structure inside that area

[67] [68]

Infer relative file roles

[68] [69]

Write a structured summary into /testbed/INSTRUCTION.md inside the target directory

[69] [70]

Stop after finishing /testbed/INSTRUCTION.md. A second agent will later perform editing based on your output.

## Guide

[70] [71]

First make a reasonable guess about the target directory based on the task description and repository layout. There may be more than one target directory

[71] [72]

Enter the guessed target directory and list the local files and subdirectories

[72] [73]

Read the files that you think are relevant to the task or necessary for understanding the structure

[73] [74]

Read the files related to the candidate files carefully

[74] [75]

To understand the structure and logic, you should use small validation commands that follow the same loading or resolution logic as the code whenever possible. Your final answer must be based on runtime evidence; do not rely only on searching and reading, and do not trust filenames or other surface patterns. You should explore deeply into the a...
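A "small validation command" of the kind this step asks for can be as short as one import followed by an inspection call. The sketch below is an assumption about what such a check might look like, not part of the paper's prompt or tooling: the helper `locate_definition` and the `Evidence` record are hypothetical names, and only standard-library calls (`importlib.import_module`, `inspect.getsourcefile`) are used.

```python
import importlib
import inspect
from typing import NamedTuple, Optional


class Evidence(NamedTuple):
    """Runtime evidence about one symbol (hypothetical record type)."""
    requested: str               # the name as it appears at the import site
    defining_module: str         # the module that actually defines the object
    source_file: Optional[str]   # source path if Python can report one


def locate_definition(module_name: str, attr: str) -> Evidence:
    """Resolve `module_name.attr` the way the interpreter does and report
    where the object is really defined, instead of guessing from filenames."""
    module = importlib.import_module(module_name)
    obj = getattr(module, attr)
    # Constants have no __module__, so fall back to the importing module.
    defining_module = getattr(obj, "__module__", None) or module_name
    try:
        source_file = inspect.getsourcefile(obj)
    except (TypeError, OSError):
        # Built-in or otherwise source-less objects have no file to report.
        source_file = None
    return Evidence(f"{module_name}.{attr}", defining_module, source_file)


# Standard-library illustration: `os.path` is an alias for a platform module,
# so the defining module differs from the name used at the import site.
evidence = locate_definition("os.path", "join")
print(evidence.requested, "is defined in", evidence.defining_module)
```

The same check applied to a wrapper folder or proxy chain would report the renamed file rather than the name under which the rest of the repository imports it.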

[75] [76]

The directories and all files you have read, and your runtime evidence
- filename1: XXX; which file uses it: XXX; which file is used by it: XXX; description: XXX
- filename2: XXX; which file uses it: XXX; which file is used by it: XXX; description: XXX

[76] [77]

The detailed structure of the directory. The relationships between files. And your runtime evidence

[77] [78]

Other unclear points (do not mention any unclear points that are obviously irrelevant to the task)

Figure 9: Prompt for exploration stage of REPOANCHOR.

B Additional Analysis and Examples

B.1 Behavior Shifts on REPOMIRAGE-Perturb

Fig. 11 shows the behavior shifts on REPOMIRAGE-Perturb, where agents still solve the original issue-resolution task but under p...

[78] [79]

Your colleague has written an INSTRUCTION.md in the /testbed directory, detailing the location of the task target directory and files and the structure of the task-related files in the target directory

[79] [80]

You should read and utilize this document reasonably. The purpose of this document is to reduce the time required for you to understand the directory structure

[80] [81]

You can also focus on any unclear points in the document, since the content here may be wrong and it may be an important point.

Figure 10: Prompt for problem solving stage of REPOANCHOR.

Figure 11: Behavior shifts under REPOMIRAGE-Perturb. ∆ measures the change from SWE-Bench Verified to REPOMIRAGE-Perturb. Files Ins...