Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling

2025 · arXiv 2507.23370

Canonical reference. Cited by 20 Pith papers; 80% of classified citations cite this work as background.

citation-role summary

background: 4 · baseline: 1

citation-polarity summary

background: 4 · baseline: 1

representative citing papers

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

cs.MA · 2026-05-06 · conditional · novelty 7.0

SWE-WebDevBench finds that AI app builders commonly fail at translating business needs into complete, secure, production-ready software due to specification bottlenecks, frontend-backend decoupling, low engineering quality, and security weaknesses.

Evaluating LLM Agents on Automated Software Analysis Tasks

cs.SE · 2026-04-13 · unverdicted · novelty 7.0

A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.

From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers

cs.CR · 2026-04-02 · unverdicted · novelty 7.0

Presents a component-centric PoC dataset of malicious MCP servers and Connor, a two-stage behavioral deviation detector that achieves a 94.6% F1-score.

Automating Database-Native Function Code Synthesis with LLMs

cs.DB · 2026-04-02 · conditional · novelty 7.0

DBCooker automates synthesis of database native functions via LLM-guided characterization, coding plans, hybrid filling, and progressive validation, delivering 34.55% higher accuracy than baselines on SQLite, PostgreSQL, and DuckDB while generating functions absent from SQLite 3.50.

Investigating Test Overfitting on SWE-bench

cs.SE · 2025-11-20 · unverdicted · novelty 7.0

The first empirical study of test overfitting shows that auto-generated tests from issues can lead to code that passes observed tests but misses important cases or breaks functionality in SWE-bench issue resolution.

Beyond Textual Repository Exploration: Dual-Modal Structural Reasoning for Agentic Issue Resolution

cs.SE · 2026-07-02 · unverdicted · novelty 6.0

DUALVIEW is a dual-modal framework using Module Coupling, Function Call, Class Hierarchy, and Program Dependence graphs to enable persistent structural reasoning for agentic issue resolution, reporting gains on SWE-bench Pro and Verified.

LLVM-Bench: Benchmarking and Advancing Large Language Models for LLVM Compiler Issue Resolution

cs.SE · 2026-07-01 · unverdicted · novelty 6.0

LLVM-Bench supplies 423 validated LLVM issues and LLVM-Gym automates evaluation, showing LLMs are limited but an ensemble reaches 21.99% resolution.

CLIP: Lightweight Cosine-Law-Based Inverted-List Pruning for IVF-Based Vector Search

cs.DB · 2026-06-29 · unverdicted · novelty 6.0

CLIP proposes a cosine-law-based pruning method for IVF vector search enabling O(1) cluster and log-time vector pruning with guarantees, plus variants for hierarchical and dynamic settings, showing up to 78% pruning and 69% efficiency gains.

Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning

cs.SE · 2026-05-01 · unverdicted · novelty 6.0

REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.

When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation

cs.SE · 2026-04-10 · unverdicted · novelty 6.0

LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by a further 11% with reasoning strategies, yet outdated patterns persist.

REAgent: Requirement-Driven LLM Agents for Software Issue Resolution

cs.SE · 2026-04-08 · unverdicted · novelty 6.0

REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.

On the Role of Fault Localization Context for LLM-Based Program Repair

cs.SE · 2026-04-07 · unverdicted · novelty 6.0

More fault localization context does not consistently improve LLM-based program repair; file-level context gives 15-17x gains, optimal around 6-10 files, while line-level context often degrades performance from noise.

Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints

cs.SE · 2026-04-06 · unverdicted · novelty 6.0

Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.

Can Old Tests Do New Tricks for Resolving SWE Issues?

cs.SE · 2025-10-21 · conditional · novelty 6.0

TestPrune minimizes regression test suites to improve bug reproduction and patch validation in LLM-based agentic repair pipelines, delivering 6-13% relative gains on SWE-Bench benchmarks at low API cost.

CP-Agent: A Calibrated Risk-Controlled Agent for Feedback-Driven Competitive Programming

cs.CL · 2026-05-23 · unverdicted · novelty 5.0

CP-Agent improves LLM competitive programming performance via calibrated feedback mechanisms that target false-admission risk, evidence against bad programs, and success hazard.

"Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution

cs.SE · 2026-05-21 · unverdicted · novelty 5.0

Empirical study finds coding agents produce fewer and less intense tangled refactorings than humans on Multi-SWE-bench; a refactoring-aware refinement improves compilability from 19.34% to 38.33% and resolves 2.79% more issues.

KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant

cs.SE · 2026-04-26 · unverdicted · novelty 5.0 · 2 refs

The paper introduces KISS Sorcar, a simple open-source AI agent framework with a five-layer hierarchy and git worktree isolation to address context limits, error propagation, and reviewability in software engineering tasks.

Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents

cs.SE · 2026-02-08 · unverdicted · novelty 5.0

Agent-generated tests mainly act as observational feedback channels and do not meaningfully improve issue resolution success in current LLM software engineering agents.

From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws

cs.SE · 2026-06-04

citing papers explorer

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering cs.SE · 2026-05-17 · unverdicted · none · ref 14
SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.
SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies cs.MA · 2026-05-06 · conditional · none · ref 17
SWE-WebDevBench finds that AI app builders commonly fail at translating business needs into complete, secure, production-ready software due to specification bottlenecks, frontend-backend decoupling, low engineering quality, and security weaknesses.
Evaluating LLM Agents on Automated Software Analysis Tasks cs.SE · 2026-04-13 · unverdicted · none · ref 18
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers cs.CR · 2026-04-02 · unverdicted · none · ref 19
Presents a component-centric PoC dataset of malicious MCP servers and Connor, a two-stage behavioral deviation detector that achieves a 94.6% F1-score.
Automating Database-Native Function Code Synthesis with LLMs cs.DB · 2026-04-02 · conditional · none · ref 24
DBCooker automates synthesis of database native functions via LLM-guided characterization, coding plans, hybrid filling, and progressive validation, delivering 34.55% higher accuracy than baselines on SQLite, PostgreSQL, and DuckDB while generating functions absent from SQLite 3.50.
Investigating Test Overfitting on SWE-bench cs.SE · 2025-11-20 · unverdicted · none · ref 9
The first empirical study of test overfitting shows that auto-generated tests from issues can lead to code that passes observed tests but misses important cases or breaks functionality in SWE-bench issue resolution.
Beyond Textual Repository Exploration: Dual-Modal Structural Reasoning for Agentic Issue Resolution cs.SE · 2026-07-02 · unverdicted · none · ref 8
DUALVIEW is a dual-modal framework using Module Coupling, Function Call, Class Hierarchy, and Program Dependence graphs to enable persistent structural reasoning for agentic issue resolution, reporting gains on SWE-bench Pro and Verified.
LLVM-Bench: Benchmarking and Advancing Large Language Models for LLVM Compiler Issue Resolution cs.SE · 2026-07-01 · unverdicted · none · ref 32
LLVM-Bench supplies 423 validated LLVM issues and LLVM-Gym automates evaluation, showing LLMs are limited but an ensemble reaches 21.99% resolution.
CLIP: Lightweight Cosine-Law-Based Inverted-List Pruning for IVF-Based Vector Search cs.DB · 2026-06-29 · unverdicted · none · ref 20
CLIP proposes a cosine-law-based pruning method for IVF vector search enabling O(1) cluster and log-time vector pruning with guarantees, plus variants for hierarchical and dynamic settings, showing up to 78% pruning and 69% efficiency gains.
Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning cs.SE · 2026-05-01 · unverdicted · none · ref 11
REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.
When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation cs.SE · 2026-04-10 · unverdicted · none · ref 10
LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by a further 11% with reasoning strategies, yet outdated patterns persist.
REAgent: Requirement-Driven LLM Agents for Software Issue Resolution cs.SE · 2026-04-08 · unverdicted · none · ref 18
REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.
On the Role of Fault Localization Context for LLM-Based Program Repair cs.SE · 2026-04-07 · unverdicted · none · ref 7
More fault localization context does not consistently improve LLM-based program repair; file-level context gives 15-17x gains, optimal around 6-10 files, while line-level context often degrades performance from noise.
Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints cs.SE · 2026-04-06 · unverdicted · none · ref 15
Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.
Can Old Tests Do New Tricks for Resolving SWE Issues? cs.SE · 2025-10-21 · conditional · none · ref 15
TestPrune minimizes regression test suites to improve bug reproduction and patch validation in LLM-based agentic repair pipelines, delivering 6-13% relative gains on SWE-Bench benchmarks at low API cost.
CP-Agent: A Calibrated Risk-Controlled Agent for Feedback-Driven Competitive Programming cs.CL · 2026-05-23 · unverdicted · none · ref 15
CP-Agent improves LLM competitive programming performance via calibrated feedback mechanisms that target false-admission risk, evidence against bad programs, and success hazard.
"Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution cs.SE · 2026-05-21 · unverdicted · none · ref 15
Empirical study finds coding agents produce fewer and less intense tangled refactorings than humans on Multi-SWE-bench; a refactoring-aware refinement improves compilability from 19.34% to 38.33% and resolves 2.79% more issues.
KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant cs.SE · 2026-04-26 · unverdicted · none · ref 7 · 2 links
The paper introduces KISS Sorcar, a simple open-source AI agent framework with a five-layer hierarchy and git worktree isolation to address context limits, error propagation, and reviewability in software engineering tasks.
Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents cs.SE · 2026-02-08 · unverdicted · none · ref 15
Agent-generated tests mainly act as observational feedback channels and do not meaningfully improve issue resolution success in current LLM software engineering agents.
From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws cs.SE · 2026-06-04 · unreviewed · ref 33
