hub Mixed citations

Zhang et al

Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang · 2025 · arXiv 2505.23419

Mixed citation behavior. Most common role is background (60%).

14 Pith papers citing it

Background 60% of classified citations

read on arXiv browse 14 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 dataset 1 method 1 other 1

citation-polarity summary

background 3 unclear 1 use method 1

representative citing papers

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

cs.SE · 2026-05-13 · conditional · novelty 7.0

10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

cs.MA · 2026-05-06 · conditional · novelty 7.0

SWE-WebDevBench finds that AI app builders commonly fail at translating business needs into complete, secure, production-ready software due to specification bottlenecks, frontend-backend decoupling, low engineering quality, and security weaknesses.

PlayCoder: Making LLM-Generated GUI Code Playable

cs.SE · 2026-04-21 · conditional · novelty 7.0

PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.

Debug2Fix: Can Interactive Debugging Help Coding Agents Fix More Bugs?

cs.SE · 2026-02-20 · conditional · novelty 7.0

Debug2Fix integrates interactive debugging via subagents into coding agents, delivering >20% gains on GitBug-Java and SWE-Bench-Live while enabling weaker models to match stronger ones.

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

cs.SE · 2025-12-20 · unverdicted · novelty 7.0

SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.

Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems

cs.SE · 2025-11-02 · unverdicted · novelty 7.0

Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.

From Patches to Trajectories: Privileged Process Supervision for Software-Engineering Agents

cs.SE · 2026-05-21 · unverdicted · novelty 6.0

P2T distills reference patches into a latent process graph and uses it to select shortest effective trajectory segments from teacher rollouts, yielding up to 10.8 point Pass@1 gains on SWE-bench Verified with 15% lower inference cost using only 1.8k instances.

SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

cs.SE · 2026-05-13 · unverdicted · novelty 6.0

SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.

Toward Scalable Terminal Task Synthesis via Skill Graphs

cs.AI · 2026-04-28 · unverdicted · novelty 6.0

SkillSynth uses a scenario-mediated skill graph to sample workflow paths and generate executable terminal tasks, enabling controlled diversity in training trajectories for agents.

You Don't Need Public Tests to Generate Correct Code

cs.SE · 2026-04-23 · unverdicted · novelty 6.0

DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or external signals.

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lower cost.

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

cs.SE · 2025-09-21 · conditional · novelty 6.0

SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.

PITMuS: A Tool for Automated Bug Dataset Generation via Source-Level Mutant Reconstruction

cs.SE · 2026-05-21 · conditional · novelty 5.0

PITMuS automates source-level bug dataset generation by mapping PIT bytecode mutants back to Java source using debug information, producing structured pairs and metadata evaluated on eight open-source systems.

GLM-5: from Vibe Coding to Agentic Engineering

cs.LG · 2026-02-17 · unverdicted · novelty 5.0

GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

citing papers explorer

Showing 14 of 14 citing papers.

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation cs.SE · 2026-05-13 · conditional · none · ref 61
10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies cs.MA · 2026-05-06 · conditional · none · ref 6
SWE-WebDevBench finds that AI app builders commonly fail at translating business needs into complete, secure, production-ready software due to specification bottlenecks, frontend-backend decoupling, low engineering quality, and security weaknesses.
PlayCoder: Making LLM-Generated GUI Code Playable cs.SE · 2026-04-21 · conditional · none · ref 83
PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
Debug2Fix: Can Interactive Debugging Help Coding Agents Fix More Bugs? cs.SE · 2026-02-20 · conditional · none · ref 16
Debug2Fix integrates interactive debugging via subagents into coding agents, delivering >20% gains on GitBug-Java and SWE-Bench-Live while enabling weaker models to match stronger ones.
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios cs.SE · 2025-12-20 · unverdicted · none · ref 65
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems cs.SE · 2025-11-02 · unverdicted · none · ref 67
Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.
From Patches to Trajectories: Privileged Process Supervision for Software-Engineering Agents cs.SE · 2026-05-21 · unverdicted · none · ref 29
P2T distills reference patches into a latent process graph and uses it to select shortest effective trajectory segments from teacher rollouts, yielding up to 10.8 point Pass@1 gains on SWE-bench Verified with 15% lower inference cost using only 1.8k instances.
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle cs.SE · 2026-05-13 · unverdicted · none · ref 16
SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
Toward Scalable Terminal Task Synthesis via Skill Graphs cs.AI · 2026-04-28 · unverdicted · none · ref 15
SkillSynth uses a scenario-mediated skill graph to sample workflow paths and generate executable terminal tasks, enabling controlled diversity in training trajectories for agents.
You Don't Need Public Tests to Generate Correct Code cs.SE · 2026-04-23 · unverdicted · none · ref 22
DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or external signals.
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents cs.AI · 2026-04-20 · unverdicted · none · ref 98
ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lower cost.
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? cs.SE · 2025-09-21 · conditional · none · ref 19
SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.
PITMuS: A Tool for Automated Bug Dataset Generation via Source-Level Mutant Reconstruction cs.SE · 2026-05-21 · conditional · none · ref 13
PITMuS automates source-level bug dataset generation by mapping PIT bytecode mutants back to Java source using debug information, producing structured pairs and metadata evaluated on eight open-source systems.
GLM-5: from Vibe Coding to Agentic Engineering cs.LG · 2026-02-17 · unverdicted · none · ref 59
GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

Zhang et al

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer