{"total":15,"items":[{"citing_arxiv_id":"2605.22526","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"\"Refactoring Runaway\": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution","primary_cat":"cs.SE","submitted_at":"2026-05-21T14:18:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Empirical study finds coding agents produce fewer and less intense tangled refactorings than humans on Multi-SWE-bench; a refactoring-aware refinement improves compilability from 19.34% to 38.33% and resolves 2.79% more issues.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17526","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering","primary_cat":"cs.SE","submitted_at":"2026-05-17T16:15:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04637","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies","primary_cat":"cs.MA","submitted_at":"2026-05-06T08:30:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SWE-WebDevBench finds that AI app builders commonly fail at translating business needs into complete, secure, 
production-ready software due to specification bottlenecks, frontend-backend decoupling, low engineering quality, and security weaknesses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00433","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning","primary_cat":"cs.SE","submitted_at":"2026-05-01T06:10:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23822","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant","primary_cat":"cs.SE","submitted_at":"2026-04-26T17:59:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"KISS Sorcar introduces a simple layered agent framework and VS Code IDE that reaches 62.2% pass rate on Terminal Bench 2.0 by combining ReAct execution, summarization-based continuation, parallel tools, persistent history, and git worktree isolation while self-validating outputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11270","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluating LLM Agents on Automated Software Analysis 
Tasks","primary_cat":"cs.SE","submitted_at":"2026-04-13T10:24:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"LLM-based software engineering agents have attracted considerable attention in recent years. Issue solving and program repair are among the most prominent tasks, with notable examples including SWE-Agent [62], RepairAgent [5], AutoCodeRover [67], OpenHands [60], FixAgent [33], AgentCoder [30], Magis [55], MarsCode Agent [36], and Trae Agent [18]. Other tasks addressed by software engineering agents include generating issue-reproducing tests [1, 10, 43, 44], root cause analysis [46], and debugging computational notebooks [22]. Our work targets automated software analysis, introduces a new benchmark, and proposes an agent design tailored to this task. 
Agent design and prompt engineering. ReAct [64] improves agent"},{"citing_arxiv_id":"2604.09515","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation","primary_cat":"cs.SE","submitted_at":"2026-04-10T17:37:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"5-Coder [16], Deepseek-Coder [13], and CodeGeeX [55]. These models are trained on massive code corpora collected from software repositories, enabling strong performance across a wide range of programming tasks. With the rapid development of coding-oriented LLMs, a variety of approaches have been proposed, ranging from function-level synthesis [10, 19, 46, 51] to repository-level code generation [9, 23, 43, 50, 52]. For example, Zan et al. proposed DiffCoder, which improves function-level code generation involving API usage by modeling the differences (diffs) between related coding tasks, mimicking human analogical learning [51]. At the repository level, Zhang et al. 
introduced CODEAGENT, an LLM-based agent framework that"},{"citing_arxiv_id":"2604.06861","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"REAgent: Requirement-Driven LLM Agents for Software Issue Resolution","primary_cat":"cs.SE","submitted_at":"2026-04-08T09:22:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"2 Motivating Example To illustrate the importance of issue-oriented requirements, we present a real-world case demonstrating the motivation of REAgent. Figure 1 shows an example from the SWE-bench Verified [47] dataset with instance_id django__django-16642. We first employ an advanced LLM (DeepSeek-V3.2 [24]) within the state-of-the-art Trae-agent [18] framework to generate a patch directly from the original issue description. However, due to the incompleteness and ambiguity of the issue description, the generated patch is incorrect. 
Specifically, the issue description fails to specify the encoding associated with the \".Z\" file in mimetypes.guess_type(), leading the agent to incorrectly assume that the encoding name is \"Z\", which"},{"citing_arxiv_id":"2604.05481","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Role of Fault Localization Context for LLM-Based Program Repair","primary_cat":"cs.SE","submitted_at":"2026-04-07T06:21:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"More fault localization context does not consistently improve LLM-based program repair; file-level context gives 15-17x gains, optimal around 6-10 files, while line-level context often degrades performance from noise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04580","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints","primary_cat":"cs.SE","submitted_at":"2026-04-06T10:26:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Search-driven techniques including SWE-Search [5] and multi-agent debate frameworks [23] further guide patch exploration, while experience-based systems [11, 32] leverage memory to transfer prior repair knowledge across tasks. Recent studies also highlight the importance of inference-time compute scaling. 
Approaches such as Thinking Longer, Not Larger [29] and Trae Agent [15] show that deeper test-time reasoning and reflection substantially improve repair success. Complementary work like BugPilot [43] focuses on constructing more complex bug scenarios to stress-test these systems. Across these directions, the common assumption is that behavioral constraints are externally provided and remain fixed during repair, and progress primarily comes from improving repository"},{"citing_arxiv_id":"2604.01905","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers","primary_cat":"cs.CR","submitted_at":"2026-04-02T11:22:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Presents a component-centric PoC dataset of malicious MCP servers and a two-stage behavioral deviation detector Connor achieving 94.6% F1-score.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06231","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Automating Database-Native Function Code Synthesis with LLMs","primary_cat":"cs.DB","submitted_at":"2026-04-02T02:56:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DBCooker automates synthesis of database native functions via LLM-guided characterization, coding plans, hybrid filling, and progressive validation, delivering 34.55% higher accuracy than baselines on SQLite, PostgreSQL, and DuckDB while generating functions absent from SQLite 3.50.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"For example, as shown in Figure 1 (b), PostgreSQL functions nearly tripled from 237 (v11) to 630 (v18) [14], DuckDB grew from 60 
(v0.3.3) to 666 (v1.4.0) [6], and SQLite increased from 52 (v3.8.0) to 143 (v3.50.0) [16]. This expansion is driven by new scenario support (e.g., BI analysis [18, 52, 54, 55] and geometric processing [35]) and business migration [24, 49, 53]. Specifically, in legacy migration scenarios (e.g., Oracle to PostgreSQL), implementing proprietary functions is a major bottleneck [11, 12, 56], with code refactoring accounting for 30%-60% of migration budgets and requiring 40-80 hours per 1,000 code lines [10]. Synthesizing database native functions is a critical task for extending system capabilities, as"},{"citing_arxiv_id":"2602.07900","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents","primary_cat":"cs.SE","submitted_at":"2026-02-08T10:26:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Agent-generated tests mainly act as observational feedback channels and do not meaningfully improve issue resolution success in current LLM software engineering agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.16858","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Investigating Test Overfitting on SWE-bench","primary_cat":"cs.SE","submitted_at":"2025-11-20T23:55:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The first empirical study of test overfitting shows that auto-generated tests from issues can lead to code that passes observed tests but misses important cases or breaks functionality in SWE-bench issue 
resolution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.18270","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Can Old Tests Do New Tricks for Resolving SWE Issues?","primary_cat":"cs.SE","submitted_at":"2025-10-21T03:42:28+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TestPrune minimizes regression test suites to improve bug reproduction and patch validation in LLM-based agentic repair pipelines, delivering 6-13% relative gains on SWE-Bench benchmarks at low API cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}