RefactorBench: Evaluating stateful reasoning in language agents through code

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

baseline: 2 · background: 1 · dataset: 1

citation-polarity summary

years

2026: 4 · 2025: 2

representative citing papers

SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

cs.SE · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

SmellBench is the first benchmark showing LLM agents resolve 47.7% of architectural code smells while accurately spotting false positives, but aggressive repairs often introduce new smells and degrade overall quality.

SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.
