Review history

arXiv: 2512.04111 · 2 revisions

CentaurEval: Benchmarking Human-in-the-Loop Value in Agentic Coding

  1. 2026-05-22 UNVERDICTED LOW v0.9.0 novelty 7.0
    45534 ms 5768 in 1260 out 2026-05-22T11:55:20.207362+00:00
  2. 2026-05-21 UNVERDICTED LOW v0.9.0 novelty 7.0
    36068 ms 5880 in 1220 out 2026-05-21T18:05:07.835330+00:00