ARL-RR alternates optimization over rubric meta-classes with dynamic selection to avoid fixed scalarization, outperforming baselines on HealthBench.
Language models that think, chat better
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
baseline 1polarities
baseline 1representative citing papers
A reasoning-driven problem generator plans synthesis directions with CoT and uses solver performance feedback to adapt difficulty, producing complementary problems that yield a 3.4% average improvement across 10 reasoning benchmarks.
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
citing papers explorer
-
Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy
ARL-RR alternates optimization over rubric meta-classes with dynamic selection to avoid fixed scalarization, outperforming baselines on HealthBench.
-
Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis
A reasoning-driven problem generator plans synthesis directions with CoT and uses solver performance feedback to adapt difficulty, producing complementary problems that yield a 3.4% average improvement across 10 reasoning benchmarks.
-
A Survey of Reinforcement Learning for Large Reasoning Models
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
- SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions