arxiv: 2605.03862 · 2 revisions
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards