ECHO is a clipped policy-gradient method that uses posterior-sensitive rewards to give turn-level epistemic credit in multi-turn information-seeking tasks, outperforming trajectory-level GRPO on a new Clue Selector Game benchmark.
Owen-Shapley Policy Optimization: A Principled RL Algorithm for Generative Search LLMs
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
Large language models are increasingly trained via reinforcement learning for personalized recommendation tasks, but standard methods like GRPO rely on sparse, sequence-level rewards. These obscure which tokens actually contribute to high-quality outputs, creating a credit assignment gap. This gap is especially problematic when models must infer latent user intent from under-specified language without ground truth labels, which is a reasoning pattern rarely seen during pretraining but commonly required in deployment. We introduce Owen-Shapley Policy Optimization (OSPO), a framework that redistributes sequence-level advantages based on tokens' marginal contributions to outcomes. OSPO transforms task feedback into potential-based reward shaping via Shapley-Owen attributions to assign segment-level credit while preserving the optimal policy, all without parametric value models. By forming coalitions of semantically coherent units (e.g., phrases describing product attributes or sentences capturing preferences), OSPO identifies which response parts drive performance. Experiments on Amazon ESCI and H&M Fashion datasets including controlled generation tasks show consistent gains over baselines and notable test-time robustness to out-of-distribution retrievers unseen during training.
fields
cs.MA 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
ECHO: Learning Epistemically Adaptive Language Agents with Turn-Level Credit
ECHO is a clipped policy-gradient method that uses posterior-sensitive rewards to give turn-level epistemic credit in multi-turn information-seeking tasks, outperforming trajectory-level GRPO on a new Clue Selector Game benchmark.