
arxiv: 2603.19310 · v4 · pith:XQWV44UTnew · submitted 2026-03-13 · 💻 cs.LG · cs.AI

MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

Pith reviewed 2026-05-25 06:22 UTC · model grok-4.3

keywords: LLM reinforcement learning, reward prediction, graph neural networks, semi-supervised learning, experience memory, limited labels, policy optimization, rollout evaluation

The pith

MemReward builds a heterogeneous graph of LLM rollouts and uses a GNN to predict rewards on the unlabeled majority from a small labeled set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that reinforcement learning for LLMs can remain effective even when ground-truth rewards are available for only a small fraction of sampled rollouts. It constructs a graph with nodes representing queries, thinking processes, and answers, linking them through similarity and structural edges so that reward signals can spread from verified examples to unverified ones. A graph neural network is first warmed up on the labeled portion and then applied online during policy updates to supply predicted rewards for the rest. This hybrid labeling strategy is tested on models of 1.5B and 3B parameters across mathematics, question answering, and code generation. A reader would care because obtaining reliable rewards often requires costly human or expert effort, and the approach claims to recover nearly all of the performance obtained when every rollout is labeled.

Core claim

MemReward stores rollouts from an initial LLM policy as nodes in a heterogeneous graph connected by similarity and structural edges, over which a GNN propagates rewards from labeled to unlabeled rollouts. To train the framework, the GNN is first warmed up on labeled rollouts to predict rewards via heterogeneous aggregation over query, thinking, and answer nodes. During online RL fine-tuning, unlabeled rollouts are attached to the graph by query similarity, and the GNN predicts their rewards, yielding a hybrid reward acquisition strategy that combines ground-truth and GNN-predicted rewards.
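
The hybrid reward acquisition described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `hybrid_rewards`, `mask_labels`, and the constant stand-in predictor are hypothetical names introduced here, and in the paper's setting the predictor would be the warmed-up GNN.

```python
import random
from typing import Callable, List, Optional, Sequence


def mask_labels(labels: Sequence[float], labeled_fraction: float, seed: int = 0) -> List[Optional[float]]:
    """Keep ground truth for a random subset of rollouts and hide the rest (None)."""
    rng = random.Random(seed)
    n_keep = round(labeled_fraction * len(labels))
    keep = set(rng.sample(range(len(labels)), n_keep))
    return [label if i in keep else None for i, label in enumerate(labels)]


def hybrid_rewards(
    rollouts: Sequence[str],
    ground_truth: Sequence[Optional[float]],
    predict_reward: Callable[[str], float],
) -> List[float]:
    """Use the ground-truth reward when it exists, otherwise the predicted one."""
    return [
        label if label is not None else predict_reward(rollout)
        for rollout, label in zip(rollouts, ground_truth)
    ]


# Toy data (made up for illustration): ten rollouts with binary rewards.
rollouts = [f"rollout-{i}" for i in range(10)]
true_rewards = [float(i % 2) for i in range(10)]

# Only 20% of the rollouts keep their ground-truth label.
partial = mask_labels(true_rewards, labeled_fraction=0.2)

# Assumption: a constant predictor stands in for the GNN.
mixed = hybrid_rewards(rollouts, partial, predict_reward=lambda rollout: 0.5)
```

The policy update then consumes `mixed` exactly as it would consume a fully labeled reward vector, which is what lets the prediction step sit inside the online loop.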

What carries the argument

Heterogeneous graph whose nodes are queries, thinking processes, and answers connected by similarity-based edges, allowing a graph neural network to propagate reward labels from the labeled subset to the unlabeled rollouts.
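
A toy version of this structure might look like the sketch below. It assumes cosine similarity between query embeddings and a top-k neighbor rule, neither of which is specified in the abstract. The `RolloutGraph` class is hypothetical, and the similarity-weighted average in `predict` is a non-parametric stand-in for the learned GNN, swapped in to keep the example dependency-free.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class RolloutGraph:
    """Hypothetical graph with one query, thinking, and answer node per rollout."""

    def __init__(self, top_k=2):
        self.top_k = top_k
        self.nodes = {}                 # (node type, rollout id) -> embedding
        self.edges = defaultdict(list)  # node id -> [(neighbor id, edge type, weight)]
        self.rewards = {}               # rollout id -> ground-truth reward

    def _link(self, a, b, kind, weight):
        self.edges[a].append((b, kind, weight))
        self.edges[b].append((a, kind, weight))

    def add_rollout(self, rid, query_emb, thinking_emb, answer_emb, reward=None):
        # Rank existing query nodes by similarity before inserting the new one.
        ranked = sorted(
            ((cosine(query_emb, emb), nid)
             for nid, emb in self.nodes.items() if nid[0] == "query"),
            reverse=True,
        )[: self.top_k]
        q, t, a = ("query", rid), ("thinking", rid), ("answer", rid)
        self.nodes.update({q: query_emb, t: thinking_emb, a: answer_emb})
        for sim, other in ranked:
            self._link(q, other, "similar", sim)   # similarity edge
        self._link(q, t, "structural", 1.0)        # query -> thinking
        self._link(t, a, "structural", 1.0)        # thinking -> answer
        if reward is not None:
            self.rewards[rid] = reward

    def predict(self, rid):
        """Similarity-weighted mean reward over labeled neighbor queries, or None."""
        num = den = 0.0
        for (_, other_rid), kind, weight in self.edges[("query", rid)]:
            label = self.rewards.get(other_rid)
            if kind == "similar" and weight > 0 and label is not None:
                num += weight * label
                den += weight
        return num / den if den else None


graph = RolloutGraph(top_k=2)
graph.add_rollout(0, [1.0, 0.0], [1.0, 0.0], [1.0, 0.0], reward=1.0)  # labeled
graph.add_rollout(1, [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], reward=0.0)  # labeled
graph.add_rollout(2, [0.9, 0.1], [0.8, 0.2], [0.9, 0.1])              # unlabeled
predicted = graph.predict(2)
```

Here the unlabeled rollout sits much closer to the rewarded query than to the unrewarded one, so its predicted reward lands near the rewarded end. A trained GNN would additionally use the thinking and answer nodes, which this stand-in ignores.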

If this is right

  • With ground-truth rewards on only 20% of rollouts the method reaches 96.6% of oracle performance on the 1.5B model and 97.3% on the 3B model.
  • The same setting yields performance close to oracle on out-of-domain tasks.
  • The approach works across mathematics, question answering, and code generation benchmarks.
  • Reward prediction is integrated directly into the online policy optimization loop rather than applied only as a preprocessing step.
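
To make the headline numbers concrete, the sketch below assumes that "percent of oracle" means the ratio of the method's benchmark score to the fully labeled baseline's score, which the abstract implies but does not define. The scores and rollout counts are placeholders chosen for illustration, not figures from the paper.

```python
def percent_of_oracle(method_score: float, oracle_score: float) -> float:
    """Method score expressed as a percentage of the oracle score."""
    if oracle_score <= 0:
        raise ValueError("oracle_score must be positive")
    return 100.0 * method_score / oracle_score


def labels_avoided(n_rollouts: int, labeled_fraction: float) -> int:
    """Number of rollouts that never need a ground-truth check."""
    return n_rollouts - round(labeled_fraction * n_rollouts)


# Placeholder values: an oracle score of 50.0 and a method score of 48.3.
ratio = percent_of_oracle(48.3, 50.0)

# With 20% labeling, most of a hypothetical 1000-rollout batch goes unverified.
avoided = labels_avoided(1000, 0.2)
```

Under this reading, a 96.6% ratio leaves a gap of 3.4% of the oracle score while skipping verification on four out of every five rollouts.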

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the graph construction generalizes, similar memory structures could reduce label requirements in other sequential decision tasks that rely on expensive feedback.
  • The method implicitly suggests that structural similarity among reasoning traces can substitute for direct verification in many cases.
  • Extending the graph to include cross-task edges might further improve out-of-domain transfer without additional labeling.

Load-bearing premise

Similarity-based edges between rollout nodes let the graph neural network accurately predict rewards for unlabeled examples from the small set of ground-truth labels.
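
One way to probe this premise on a labeled validation set is to correlate pairwise embedding similarity with reward agreement. The sketch below is a hedged illustration: the function names are hypothetical, cosine similarity is an assumed metric, and the four embeddings and rewards are synthetic.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def similarity_agreement_correlation(embeddings, rewards):
    """Correlate pairwise similarity with a 0/1 'same reward' indicator."""
    sims, agree = [], []
    for i, j in combinations(range(len(rewards)), 2):
        sims.append(cosine(embeddings[i], embeddings[j]))
        agree.append(1.0 if rewards[i] == rewards[j] else 0.0)
    return pearson(sims, agree)


# Synthetic example: two clusters whose members share a reward.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
rewards = [1, 1, 0, 0]
r = similarity_agreement_correlation(embeddings, rewards)
```

On this clean toy data the correlation is strongly positive. A value near zero on real rollouts, for instance when near-identical traces differ by a single arithmetic slip, would indicate that similarity edges carry little reward information.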

What would settle it

Running the same online RL procedure on a new domain and finding that the 20%-label version ends more than a few percent below the full-oracle baseline in final policy performance.

Original abstract

Reinforcement learning has emerged as a powerful paradigm for improving large language model (LLM) reasoning, where rollouts are sampled from the policy and reward signals computed on those rollouts are used to update the policy. However, in data-scarce scenarios, obtaining ground-truth labels to verify rollouts at scale often requires expensive human annotation or labor-intensive expert verification. For instance, evaluating mathematical proofs demands expert review, and open-ended question answering lacks definitive ground truth. When ground-truth labels are scarce, the effectiveness of reinforcement learning fine-tuning is constrained. Inspired by the success of semi-supervised learning in propagating labels from labeled to unlabeled samples, we propose MemReward, a graph-based experience memory framework that integrates reward propagation directly into online policy optimization. MemReward stores rollouts (thinking processes and final answers) from an initial LLM policy as nodes in a heterogeneous graph connected by similarity and structural edges, over which a GNN propagates rewards from labeled to unlabeled rollouts. To train such a framework, we first warm up the GNN on labeled rollouts to predict rewards via heterogeneous aggregation over query, thinking, and answer nodes. During online RL fine-tuning, unlabeled rollouts are attached to the graph by query similarity, and the GNN predicts their rewards, yielding a hybrid reward acquisition strategy that combines ground-truth and GNN-predicted rewards. Experiments on Qwen2.5-1.5B and 3B in mathematics, question answering, and code generation demonstrate that MemReward, with ground-truth rewards on only 20% of rollouts, achieves 96.6% of Oracle performance on 1.5B and 97.3% on 3B, and closely approaches Oracle on out-of-domain tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MemReward, a graph-based experience memory framework for predicting rewards on LLM rollouts with limited ground-truth labels. Rollouts are stored as nodes in a heterogeneous graph connected by similarity and structural edges; a GNN is warmed up on labeled data and then used during online RL to predict rewards for unlabeled rollouts attached via query similarity. Experiments on Qwen2.5-1.5B and 3B models in mathematics, QA, and code generation report that using ground-truth rewards on only 20% of rollouts yields 96.6% and 97.3% of full-oracle performance, respectively, with comparable out-of-domain results.

Significance. If the core mechanism holds, the work could meaningfully reduce annotation costs for RL-based LLM fine-tuning in verification-heavy domains by enabling reliable semi-supervised reward propagation.

major comments (2)
  1. [Abstract] The headline performance figures (96.6% of Oracle on 1.5B, 97.3% on 3B with 20% labels) rest on the claim that similarity edges between query, thinking, and answer nodes allow the GNN to recover ground-truth rewards on the remaining 80%. The paper references no correlation analysis, no ablation removing similarity edges, and no quantitative check that embedding similarity predicts reward agreement, even though semantically similar thinking traces can differ in correctness (e.g., a single arithmetic slip or an off-by-one bug).
  2. [Experiments] Results tables: the reported percentages of Oracle performance are presented without accompanying variance, number of random seeds, or statistical tests, making it impossible to judge whether the 96.6–97.3% figures are robust or could be explained by variance in the 20% labeled subset.
minor comments (2)
  1. The description of GNN warmup and online attachment of new rollouts is high-level; concrete details on similarity metric, edge weighting, and GNN architecture would improve reproducibility.
  2. Notation for the heterogeneous graph (query, thinking-process, answer nodes) is introduced in the abstract but never formalized with an equation or diagram reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to strengthen the supporting analyses and statistical reporting.

  1. Referee: [Abstract] The headline performance figures (96.6% of Oracle on 1.5B, 97.3% on 3B with 20% labels) rest on the claim that similarity edges between query, thinking, and answer nodes allow the GNN to recover ground-truth rewards on the remaining 80%. The paper references no correlation analysis, no ablation removing similarity edges, and no quantitative check that embedding similarity predicts reward agreement, even though semantically similar thinking traces can differ in correctness (e.g., a single arithmetic slip or an off-by-one bug).

    Authors: We agree that the manuscript would benefit from explicit evidence linking embedding similarity to reward agreement. The current version describes the heterogeneous graph but does not report ablations or correlation metrics. We will add (1) an ablation removing similarity edges while retaining structural edges, (2) a quantitative correlation study between cosine similarity of node embeddings and reward label agreement on a held-out validation set, and (3) error-case analysis of semantically similar but reward-differing traces. These additions will directly address the concern.

    Revision: yes

  2. Referee: [Experiments] Results tables: the reported percentages of Oracle performance are presented without accompanying variance, number of random seeds, or statistical tests, making it impossible to judge whether the 96.6–97.3% figures are robust or could be explained by variance in the 20% labeled subset.

    Authors: We concur that variance, seed counts, and statistical tests are necessary for assessing robustness. In the revision we will report results averaged over at least five random seeds (including different 20% labeled-subset samplings), include standard deviations, and add paired statistical tests comparing MemReward against the oracle baseline to confirm the reported percentages are not attributable to sampling variance.

    Revision: yes
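
The seed-level reporting promised in the second response could be computed along the lines of the sketch below. It assumes one score per seed for each system, paired by seed, and uses a paired t-statistic as one possible choice of test. The function name is hypothetical and every number is a synthetic placeholder, not a result from the paper.

```python
import math
import statistics


def paired_t_statistic(method_scores, baseline_scores):
    """Paired t-statistic and degrees of freedom for per-seed score differences."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("scores must be paired by seed")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two seeds")
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("t-statistic is undefined when all differences are equal")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


# Synthetic placeholder scores for five seeds (not taken from the paper).
oracle = [50.1, 49.8, 50.4, 50.0, 49.7]
memreward = [48.5, 48.0, 48.9, 48.2, 48.1]

summary = {
    "oracle_mean": statistics.mean(oracle),
    "oracle_std": statistics.stdev(oracle),
    "memreward_mean": statistics.mean(memreward),
    "memreward_std": statistics.stdev(memreward),
}
t_stat, dof = paired_t_statistic(memreward, oracle)
```

The t-statistic would then be compared against a Student's t distribution with `dof` degrees of freedom. Pairing by seed matters here because each seed also fixes which 20% of rollouts receive ground-truth labels.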

Circularity Check

0 steps flagged

No significant circularity; method relies on external labels and standard semi-supervised propagation


The paper describes a GNN trained on a subset of ground-truth labeled rollouts to predict rewards for the remainder via similarity-based graph edges. Performance is evaluated empirically against an Oracle baseline that uses full ground-truth labels. No equations, derivations, or self-citations are presented that reduce the reported accuracy (96.6–97.3% of Oracle) to a quantity defined by construction from the method's own fitted parameters or prior outputs. The approach is self-contained against external benchmarks because label propagation depends on independently obtained ground-truth data for the 20% warmup set, not on re-using the target predictions themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on free parameters, axioms, or invented entities.

