Nl2repo-bench: Towards long-horizon repository generation evaluation of coding agents. CoRR, abs/2512.12730

[Online] · 2025 · arXiv 2512.12730

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset: 2 · background: 1

citation-polarity summary

background: 2 · baseline: 1

representative citing papers

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Workflow-GYM is a new benchmark for long-horizon professional GUI agent tasks where state-of-the-art models reach only slightly above 30% success.

RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

RealClawBench turns 281 real OpenClaw sessions into reproducible tasks that preserve the original distribution and shows the best of 14 models solves only 65.8 percent.

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

cs.SE · 2026-05-20 · unverdicted · novelty 7.0

SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.

Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

cs.SE · 2026-05-07 · unverdicted · novelty 7.0

LLM agents exhibit constraint decay: assertion pass rates drop substantially as structural requirements increase in multi-file backend code generation across web frameworks.

Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios

cs.SE · 2026-04-08 · unverdicted · novelty 7.0

A new benchmark for 0-to-1 CLI tool generation shows that state-of-the-art LLMs achieve a success rate under 43% in black-box equivalence testing against real oracles.

Toward Executable Repository-Level Code Generation via Environment Alignment

cs.SE · 2026-04-04 · unverdicted · novelty 7.0

EnvGraph improves executable repository-level code generation by jointly modeling external dependencies and internal references through a dual-layer environment representation and targeted iterative alignment.

SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

cs.SE · 2026-05-13 · unverdicted · novelty 6.0

The SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.

Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering

cs.SE · 2026-07-01 · unverdicted · novelty 5.0

A case study of AI-agentic software development yields a process model explaining how engineering judgment converts recurring structural failures into durable governance mechanisms.

ATM: CID-Brokered Pre-Write Admission for Multi-Agent Code Co-Synthesis

cs.SE · 2026-06-29 · unverdicted · novelty 5.0

ATM is a CID-brokered governance framework that maps write intents to semantic atoms for pre-admission control, validation, and neutral-steward application in single-domain multi-agent code synthesis.

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

cs.CL · 2026-06-10 · unverdicted · novelty 5.0

This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.

Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity

cs.AI · 2026-06-30 · unverdicted · novelty 2.0

The Seed2.0 model series reports gains in reasoning, visual understanding, search, and reliability on intricate long-horizon tasks via an internal evaluation system.

citing papers explorer

Showing 13 of 13 citing papers.

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields · cs.AI · 2026-06-09 · unverdicted · none · ref 3
RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions · cs.CL · 2026-06-02 · unverdicted · none · ref 3
SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents · cs.SE · 2026-05-20 · unverdicted · none · ref 15
SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering · cs.SE · 2026-05-17 · unverdicted · none · ref 11
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades · cs.SE · 2026-05-14 · unverdicted · none · ref 66
Constraint Decay: The Fragility of LLM Agents in Backend Code Generation · cs.SE · 2026-05-07 · unverdicted · none · ref 31
Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios · cs.SE · 2026-04-08 · unverdicted · none · ref 7
Toward Executable Repository-Level Code Generation via Environment Alignment · cs.SE · 2026-04-04 · unverdicted · none · ref 4
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle · cs.SE · 2026-05-13 · unverdicted · none · ref 10
Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering · cs.SE · 2026-07-01 · unverdicted · none · ref 21
ATM: CID-Brokered Pre-Write Admission for Multi-Agent Code Co-Synthesis · cs.SE · 2026-06-29 · unverdicted · none · ref 28
Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application · cs.CL · 2026-06-10 · unverdicted · none · ref 145
Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity · cs.AI · 2026-06-30 · unverdicted · none · ref 36
