InfoTree casts intermediate state selection in tree search as monotone submodular maximization under fixed rollout budgets, yielding closed-form UUCB terms and lifting mixed-outcome ratios while outperforming flat GRPO and prior tree variants on nine benchmarks.
At 2po: Agentic turn-based policy optimization via tree search
4 Pith papers cite this work. Polarity classification is still indexing.
4
Pith papers citing it
citation-role summary
baseline 1
citation-polarity summary
years
2026 4verdicts
UNVERDICTED 4roles
baseline 1polarities
baseline 1representative citing papers
PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.
A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges per turn's normalized IG.
Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.
citing papers explorer
No citing papers match the current filters.