Review history

arxiv: 2510.10930 · 2 revisions

Evaluating Language Models' Evaluations of Games

  1. 2026-05-21 UNVERDICTED LOW v0.9.0 novelty 6.0
    68635 ms 5832 in 1295 out 2026-05-21T20:52:03.862431+00:00
  2. 2026-05-18 UNVERDICTED LOW v0.9.0 novelty 7.0
    34579 ms 5832 in 1202 out 2026-05-18T08:26:06.648592+00:00