{"total":14,"items":[{"citing_arxiv_id":"2607.01087","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering","primary_cat":"cs.SE","submitted_at":"2026-07-01T15:44:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A case study of AI-agentic software development yields a process model explaining how engineering judgment converts recurring structural failures into durable governance mechanisms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00248","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity","primary_cat":"cs.AI","submitted_at":"2026-06-30T22:57:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Seed2.0 model series reports gains in reasoning, visual understanding, search, and reliability on intricate long-horizon tasks via an internal evaluation system.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00041","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ATM: CID-Brokered Pre-Write Admission for Multi-Agent Code Co-Synthesis","primary_cat":"cs.SE","submitted_at":"2026-06-29T16:02:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ATM is a CID-brokered governance framework that maps write intents to semantic atoms for pre-admission control, validation, and neutral-steward application in single-domain multi-agent code 
synthesis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19613","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns","primary_cat":"cs.SE","submitted_at":"2026-06-17T21:36:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"StaminaBench evaluates coding agents over 100 procedurally generated change requests to a REST API, finding that tested models fail within 5-6 turns without feedback but improve up to 12x with test feedback and good harnesses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12191","ref_index":145,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application","primary_cat":"cs.CL","submitted_at":"2026-06-10T15:15:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":",CSR-Bench [136], SWT-Bench [137], Terminal-Bench [42], KernelBench [43], MBPP [41],etc. Code Debugging (§4.6.4)e.g.,InterCode [138], SWE-Bench [16], SWE-Bench Pro [139], DebugBench [140],etc. Domain-Specific(§4.7) Biomedical and Healthcare(§4.7.1) e.g.,MedAgentBench [44], MedAgentGym [141], BioAgent Bench [142], Biocoder [143],etc. 
Science and Technology(§4.7.2) e.g.,DSEval [144], DSBench [46], ScienceAgentBench [45], MLE-bench [145], MLE-Dojo [146],etc. Finance and Investment(§4.7.3) e.g.,FinDeepResearch [147], StockBench [148], Finance Agent Benchmark [149],etc. Cross-Domain (§4.8)Cross-Domain (§4.8)e.g.,OpenAI Gym [47], HuggingGPT [150], AgentBench [48], AgentBoard [151], GEM [3],etc. EnvironmentSynthesis (§5) Symbolic Synthesis(§5.1) Task-Driven Synthesis(§5.1.1) e.g.,SWE-Gym [152], SWE-smith [153], AgentScaler [154], Agent2World [155], Text2World [156],WorldCoder [157], LLM-in-Sandbox [158], R2E-Gym [159], Scale-SWE [160], EnvScaler [161],etc."},{"citing_arxiv_id":"2606.11042","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields","primary_cat":"cs.AI","submitted_at":"2026-06-09T16:10:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Workflow-GYM is a new benchmark for long-horizon professional GUI agent tasks where state-of-the-art models reach only slightly above 30% success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03889","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions","primary_cat":"cs.CL","submitted_at":"2026-06-02T16:51:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RealClawBench turns 281 real OpenClaw sessions into reproducible tasks that preserve the original distribution and shows the best of 14 models solves only 65.8 
percent.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21384","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents","primary_cat":"cs.SE","submitted_at":"2026-05-20T16:41:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17526","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering","primary_cat":"cs.SE","submitted_at":"2026-05-17T16:15:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14415","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades","primary_cat":"cs.SE","submitted_at":"2026-05-14T06:04:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents 
resolve 44.8% of tasks on average and struggle to preserve functionality across releases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13139","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle","primary_cat":"cs.SE","submitted_at":"2026-05-13T08:05:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.","context_count":1,"top_context_role":"dataset","top_context_polarity":"baseline","context_text":"Table 1:Comparison between SWE-Cycle and current software engineering benchmarks. Benchmark Scenario Task End-to-End Evaluation Protocol Env. Impl. TestGen. Judger Partial Scoring SWE-bench & Variants [15, 5, 7, 35] Issue✗ ✓ ✗ ✗UnitTest✗EnvBench [9] Env Setup✓ ✗ ✗ ✗UnitTest✗TestEval [28] Existing Code✗ ✗ ✓ ✗UnitTest✗DevBench [12] Greenfield✗ ✓ ✓ ✗UnitTest+LLM✗PRDBench [10] Greenfield✗ ✓ ✗ ✓Agent✓NL2Repo [8] Greenfield✗ ✗ ✗ ✓UnitTest✗ SWE-Cycle (Ours) Issue ✓ ✓ ✓ ✓ Agent ✓ proficiency in environment reconstruction, code implementation, and verification test generation to provide a holistic view of their true autonomous potential. 
2 Related Work Code Agent Benchmarks.SWE-bench [ 15] has established the standard for evaluating code agents"},{"citing_arxiv_id":"2605.06445","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Constraint Decay: The Fragility of LLM Agents in Backend Code Generation","primary_cat":"cs.SE","submitted_at":"2026-05-07T15:44:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM agents exhibit constraint decay with assertion pass rates dropping substantially as structural requirements increase in multi-file backend code generation across web frameworks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06742","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios","primary_cat":"cs.SE","submitted_at":"2026-04-08T07:09:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03622","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward Executable Repository-Level Code Generation via Environment Alignment","primary_cat":"cs.SE","submitted_at":"2026-04-04T07:37:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EnvGraph improves executable repository-level code generation by jointly modeling external dependencies and internal references through a dual-layer environment representation and targeted iterative 
alignment.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2026. ACM ISBN 978-x-xxxx-xxxx-x/YYYY/MM https://doi.org/10.1145/nnnnnnn.nnnnnnn from function completion to repository-level code completion [11, 18, 19, 33]. As code generation moves beyond single functions and files toward repository-level generation, models are increasingly required to construct complete multi-file repositories from high-level requirements [4]. Under this evaluation setting, the objective is no longer merely to generate plausible code fragments, but to deliver an executable repository that can be successfully installed, satisfy its external dependencies, resolve its internal references, be launched, and be validated in a real execution environment. Therefore, ensuring repository executability becomes a fundamental prerequisite for"}],"limit":50,"offset":0}