Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
-
2026-05-21
UNVERDICTED
LOW
v0.9.0
novelty 5.0
57431 ms
8560 in
1406 out
2026-05-21T07:52:35.152181+00:00
-
2026-05-19
UNVERDICTED
LOW
v0.9.0
novelty 5.0
32058 ms
5849 in
1135 out
2026-05-19T16:37:05.873569+00:00
-
2026-05-15
UNVERDICTED
LOW
v0.9.0
novelty 6.0
56827 ms
5675 in
1235 out
2026-05-15T05:20:20.794990+00:00
-
2026-05-13
UNVERDICTED
LOW
v0.9.0
novelty 5.0
38297 ms
5678 in
1205 out
2026-05-13T05:01:24.708241+00:00