arxiv: 2603.18859 · v2 · pith:NI66TQHPnew · submitted 2026-03-19 · 💻 cs.AI · cs.CL · cs.LG

RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

classification 💻 cs.AI cs.CL cs.LG
keywords rewardflow, agentic, reasoning, reward, rewards, state, across, graphs

Reinforcement learning (RL) shows promise for enhancing LLM agentic reasoning, yet sparse terminal rewards hinder fine-grained optimization. Process reward modeling offers an alternative but incurs high computational costs, reward hacking risks, and annotation bottlenecks. We introduce RewardFlow, a lightweight method for estimating state-level rewards in agentic reasoning. By constructing state graphs that capture the intrinsic topological structure of trajectories, RewardFlow performs topology-aware propagation to estimate each state's contribution to success, yielding principled, annotation-free dense rewards. Used for RL optimization, RewardFlow substantially outperforms prior baselines across four agentic benchmarks: +6.2% average success rate on text-based tasks, +29.7% on visual reasoning over the strongest baseline across three model scales, and +10% accuracy on DeepResearch, with superior robustness and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
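
To make the idea of propagating a sparse terminal reward over a state graph more concrete, here is a minimal Python sketch. It is not the RewardFlow algorithm: the abstract does not specify the propagation rule, so this sketch assumes a simple stand-in (breadth-first search backward from successful terminal states, with a per-hop discount `gamma`). The function names, the trajectory format, and the discounting scheme are all hypothetical and chosen only for illustration; the authors' actual method is in the linked repository.

```python
from collections import defaultdict, deque


def build_state_graph(trajectories):
    """Merge trajectories into one graph keyed by state.

    Each trajectory is (path, success), where path is a list of hashable
    state identifiers. Identical states across trajectories share a node.
    Returns the state set, a reversed adjacency map, and the set of
    terminal states reached by successful trajectories.
    """
    states = set()
    predecessors = defaultdict(set)
    goals = set()
    for path, success in trajectories:
        states.update(path)
        for src, dst in zip(path, path[1:]):
            predecessors[dst].add(src)
        if success and path:
            goals.add(path[-1])
    return states, predecessors, goals


def propagate_rewards(trajectories, gamma=0.9):
    """Assign each state gamma ** (hops to the nearest successful terminal).

    ASSUMPTION: hop-distance discounting is an illustrative stand-in,
    not the propagation rule used by RewardFlow.
    States that cannot reach a successful terminal state keep reward 0.0.
    """
    states, predecessors, goals = build_state_graph(trajectories)
    rewards = {state: 0.0 for state in states}
    queue = deque()
    for goal in goals:
        rewards[goal] = 1.0
        queue.append((goal, 0))
    seen = set(goals)
    # Multi-source BFS over reversed edges visits states in order of
    # increasing distance to success, so the first visit is the shortest.
    while queue:
        state, dist = queue.popleft()
        for prev in predecessors[state]:
            if prev not in seen:
                seen.add(prev)
                rewards[prev] = gamma ** (dist + 1)
                queue.append((prev, dist + 1))
    return rewards
```

For example, given one successful trajectory `s0 → s1 → s2 → goal` and one failed trajectory `s0 → s1 → dead`, calling `propagate_rewards` with `gamma=0.5` gives the shared states `s0` and `s1` nonzero credit because they lie on a path to success, while `dead` stays at zero. This is the sense in which graph structure, rather than a per-trajectory outcome alone, determines the dense signal in this sketch.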
