AgentHPO: Large language model agent for hyperparameter optimization. arXiv preprint arXiv:2402.11427.
1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.AI (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Self-Improvement Can Self-Regress: The Rise-and-Collapse Failure Mode of LLM Self-Training
REINFORCE self-training on competitive programming tasks exhibits robust rise-then-collapse in pass@1; CARE, ES, and GRPO mitigate it in model-size-dependent ways across Qwen-2.5-3B/7B and a Gemma pilot.