arXiv preprint arXiv:2502.01715

Yufan Ye, Ting Zhang, Wenbin Jiang, Hua Huang · 2025 · arXiv 2502.01715

4 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

GDCR assigns step-level rewards via distance to the answer node in a training-time ER graph and SAPO combines these with trajectory advantages for credit assignment in agentic search.

TestDecision: Sequential Test Suite Generation via Greedy Optimization and Reinforcement Learning

cs.SE · 2026-04-02 · unverdicted · novelty 6.0

By proving that test suite coverage is monotone submodular and training LLMs with RL to maximize marginal gains, TestDecision improves branch coverage by 38-52% and bug detection by up to 95% over base models on ULT and LiveCodeBench.

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

cs.CL · 2026-04-11 · unverdicted · novelty 5.0

APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

cs.CL · 2026-04-11 · unverdicted · novelty 5.0

FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

citing papers explorer

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

cs.CL · 2026-04-11 · unverdicted · none · ref 38
