APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
arXiv preprint arXiv:2407.15186 , year=
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
LogCopilot is an LLM framework that builds a hierarchical knowledge base from logs and generates/executes LogQL queries from natural language instructions, reporting 76.8% average accuracy across four datasets.
citing papers explorer
-
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
-
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
-
LogCopilot: Automating Log Aggregation Analysis through Large Language Models
LogCopilot is an LLM framework that builds a hierarchical knowledge base from logs and generates/executes LogQL queries from natural language instructions, reporting 76.8% average accuracy across four datasets.