T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.
Codebase Building upon the existing codebase verl (Sheng et al., 2024), our codebase introduces targeted modifications to both the vLLM (Kwon et al
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.