arxiv: 2512.04111 · 2 revisions
CentaurEval: Benchmarking Human-in-the-Loop Value in Agentic Coding