pith. machine review for the scientific record.

Review history

arXiv: 2605.12483 · 2 revisions

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

  1. 2026-05-15 UNVERDICTED LOW v0.9.0 novelty 6.0
    56827 ms 5675 in 1235 out 2026-05-15T05:20:20.794990+00:00
  2. 2026-05-13 UNVERDICTED LOW v0.9.0 novelty 5.0
    38297 ms 5678 in 1205 out 2026-05-13T05:01:24.708241+00:00