FastCuRL: Curriculum Reinforcement Learning with Stage-wise Context Scaling for Efficient Training R1-like Reasoning Models
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2025 (2)
Citing papers
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
  RLVR with a single training example raises LLM math-reasoning accuracy from 17.6% to 35.7%, averaged across six benchmarks.
- Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR
  Archer introduces response-level entropy normalization and differentiated clipping and KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.