arXiv: 2605.19416 · 2 revisions
LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models