Pure Python
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3
representative citing papers
citing papers explorer
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
  This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR
  Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
- From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models
  Tool use in LLMs improves final-answer accuracy but degrades reasoning quality through Tool-Induced Myopia, with the effect worsening as tool calls increase and shifting errors toward logic and assumption failures.