ECPO improves GiGPO by shrinking low-count action advantages and suppressing noisy anchor states, yielding +5.2/+7.3 success gains on ALFWorld/WebShop with Qwen2.5-1.5B models at negligible extra cost.
arXiv preprint arXiv:2603.03078
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training
SIRI trains LLM agents to discover, validate, and internalize reusable skills from their own rollouts without external generators or inference-time skill banks, yielding gains on ALFWorld and WebShop.