JudgmentBench supplies the first public paired rubric and preference annotations from legal experts on the same LLM outputs, showing comparative judgments outperform rubrics in recovering quality orderings.
Pairwise or pointwise? evaluating feedback protocols for bias in llm-based evaluation
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4verdicts
UNVERDICTED 4representative citing papers
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
AdaptSim is an adaptive user simulator for CRS evaluation that combines automatic prompt generation, open actions, controlled text generation, and BFS-based pairwise comparison to produce realistic dialogues and assess system robustness across domains.
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
citing papers explorer
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
Trust Region On-Policy Distillation
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.