PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 3roles
background 1polarities
unclear 1representative citing papers
Iterative Reward Calibration with MT-GRPO and GTPO enables effective multi-turn RL for tool-calling agents, raising Tau-Bench success from 63.8% to 66.7% for a 4B model and from 58.0% to 69.5% for a 30B model.
Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.
citing papers explorer
-
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
-
Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration
Iterative Reward Calibration with MT-GRPO and GTPO enables effective multi-turn RL for tool-calling agents, raising Tau-Bench success from 63.8% to 66.7% for a 4B model and from 58.0% to 69.5% for a 30B model.
-
Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing
Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.