Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.
hub
Reasoninggym: Reasoningenvironmentsforreinforcementlearningwithverifiable rewards
10 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 10representative citing papers
K2V extends RLVR to knowledge-intensive domains by synthesizing verifiable data and verifying reasoning processes, yielding improved domain reasoning with preserved general capabilities.
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.
QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.
Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
SCALER creates adaptive synthetic environments for RL-based LLM reasoning training that outperforms fixed-dataset baselines with more stable long-term progress.
SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.
Gym-V supplies 179 visual environments showing that observation scaffolding like captions and rules matters more for training success than the choice of RL algorithm.
TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.
citing papers explorer
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.
-
Knowledge-to-Verification: Exploring RLVR for LLMs in Knowledge-Intensive Domains
K2V extends RLVR to knowledge-intensive domains by synthesizing verifiable data and verifying reasoning processes, yielding improved domain reasoning with preserved general capabilities.
-
AIPO: Learning to Reason from Active Interaction
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.
-
QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks
QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.
-
Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR
Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
-
SCALER:Synthetic Scalable Adaptive Learning Environment for Reasoning
SCALER creates adaptive synthetic environments for RL-based LLM reasoning training that outperforms fixed-dataset baselines with more stable long-term progress.
-
SPHINX: A Synthetic Environment for Visual Perception and Reasoning
SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.
-
Gym-V: A Unified Vision Environment System for Agentic Vision Research
Gym-V supplies 179 visual environments showing that observation scaffolding like captions and rules matters more for training success than the choice of RL algorithm.
-
TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning
TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.