Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning
Pith reviewed 2026-05-16 14:13 UTC · model grok-4.3
The pith
Adding group-relative evidence rewards and co-evolving the reward model with the policy improves long-context LLM reasoning over sparse-outcome baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling that precise evidence extraction is the decisive bottleneck for long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward, providing dense process supervision to explicitly improve evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism. This mechanism iteratively refines the reward model using outcome-consistent rollouts, sharpening its discriminative capability to ensure precise process guidance. EAPO shows
What carries the argument
Group-Relative Evidence Reward together with Adaptive Reward-Policy Co-Evolution, which supplies dense process-level signals by scoring evidence quality relative to other rollouts and refines the scorer using only outcome-consistent data.
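To make "scoring evidence quality relative to other rollouts" concrete, here is a minimal sketch. It assumes the group-relative step is a GRPO-style mean and standard-deviation normalization within one group of rollouts for the same prompt. The summary above does not give the paper's formula, so the function name `group_relative_rewards`, the `eps` term, and the example scores are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_rewards(evidence_scores, eps=1e-6):
    """Normalize raw evidence scores within one rollout group.

    Assumption: the reward model returns one scalar per rollout, and
    "group-relative" means subtracting the group mean and dividing by
    the group standard deviation (not confirmed by the source).
    """
    mu = mean(evidence_scores)
    sigma = pstdev(evidence_scores)
    # eps keeps the division defined when every rollout scores the same.
    return [(s - mu) / (sigma + eps) for s in evidence_scores]


# Hypothetical reward-model scores for four rollouts of the same prompt.
scores = [0.9, 0.4, 0.4, 0.1]
advantages = group_relative_rewards(scores)
```

Under this assumption, a rollout is rewarded only for extracting better evidence than its siblings, so a group in which every rollout scores identically contributes no gradient signal.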
Load-bearing premise
Refining the reward model only on outcome-consistent rollouts lets it reliably judge evidence-retrieval quality without introducing bias or overfitting to the training distribution.
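The premise is easier to examine with the filtering step written out. The sketch below shows one plausible reading of "outcome-consistent": keep a rollout only if its final answer matches the ground truth after light normalization. The `Rollout` fields, the `normalize` helper, and the exact-match criterion are assumptions for illustration, not the paper's procedure.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    evidence: str  # text the policy extracted as supporting evidence
    answer: str    # final answer produced by the policy


def normalize(text):
    """Lowercase and collapse whitespace (a hypothetical matching rule)."""
    return " ".join(text.lower().split())


def outcome_consistent(rollouts, gold_answer):
    """Keep only rollouts whose final answer matches the ground truth."""
    gold = normalize(gold_answer)
    return [r for r in rollouts if normalize(r.answer) == gold]


rollouts = [
    Rollout(evidence="The treaty was signed in Paris.", answer="Paris"),
    Rollout(evidence="Lyon hosts a film festival.", answer=" paris "),
    Rollout(evidence="The treaty was signed in Paris.", answer="Lyon"),
]
kept = outcome_consistent(rollouts, "Paris")
```

Note that the second rollout survives the filter even though its evidence is irrelevant to the answer. That is the kind of "lucky guess" the premise has to rule out by other means.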
What would settle it
A separate measurement of evidence-retrieval precision on held-out long documents showing that EAPO produces no measurable gain in retrieval quality or final accuracy compared with standard outcome-reward training.
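One way such a measurement could be set up is sketched below. It assumes the held-out documents carry gold evidence annotations as sentence IDs and that each rollout's extracted evidence can be mapped to the same IDs; both are assumptions, and the function name is hypothetical.

```python
def evidence_precision_recall(predicted_ids, gold_ids):
    """Set-based precision and recall of extracted evidence sentences.

    Assumption: evidence is identified by sentence IDs, and gold
    annotations exist for the held-out documents.
    """
    predicted, gold = set(predicted_ids), set(gold_ids)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: the policy cites sentences 1-4, gold is {2, 4, 7}.
precision, recall = evidence_precision_recall([1, 2, 3, 4], [2, 4, 7])
```

Comparing these numbers for an EAPO-trained policy and an outcome-reward baseline on the same held-out set would separate gains in retrieval quality from gains in final-answer accuracy.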
Original abstract
While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by sparsity of outcome rewards. This limitation fails to penalize ungrounded "lucky guesses," leaving the critical process of needle-in-a-haystack evidence retrieval largely unsupervised. To address this, we propose EAPO (Evidence-Augmented Policy Optimization). We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling that precise evidence extraction is the decisive bottleneck for long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward, providing dense process supervision to explicitly improve evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism. This mechanism iteratively refines the reward model using outcome-consistent rollouts, sharpening its discriminative capability to ensure precise process guidance. Comprehensive evaluations across eight benchmarks demonstrate that EAPO significantly enhances long-context reasoning performance compared to SOTA baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EAPO (Evidence-Augmented Policy Optimization) for RL-based long-context LLM reasoning. It first validates an Evidence-Augmented Reasoning paradigm via Tree-Structured Evidence Sampling, establishing precise evidence extraction as the decisive bottleneck. EAPO then deploys a Group-Relative Evidence Reward for dense process supervision and an Adaptive Reward-Policy Co-Evolution step that iteratively refines the reward model on outcome-consistent rollouts. Experiments across eight benchmarks report significant gains over SOTA baselines.
Significance. If the central claims hold, the work would meaningfully advance RL for long-context reasoning by converting sparse outcome rewards into dense, evidence-focused signals. The Tree-Structured Evidence Sampling result and the co-evolution mechanism could supply a reusable template for process supervision in needle-in-a-haystack settings, provided the supervision remains grounded rather than proxy-driven.
major comments (1)
- [Adaptive Reward-Policy Co-Evolution mechanism] Adaptive Reward-Policy Co-Evolution (described after the Group-Relative Evidence Reward): the reward model is refined exclusively on rollouts whose final answer matches ground truth. In long-context regimes where multiple non-evidence paths can still produce the correct outcome, this creates a supervision signal that can reinforce lexical overlap, position heuristics, or answer-format cues rather than genuine evidence retrieval fidelity, directly undermining the claim of 'precise process guidance.'
minor comments (1)
- The abstract and method description do not specify the exact architecture or training objective of the reward model (e.g., whether it is a separate LLM head or a fine-tuned copy of the policy), making it difficult to assess reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The concern about the Adaptive Reward-Policy Co-Evolution mechanism is substantive, and we respond point-by-point below while clarifying how the Group-Relative Evidence Reward anchors the supervision in evidence quality.
Point-by-point responses
- Referee: [Adaptive Reward-Policy Co-Evolution mechanism] Adaptive Reward-Policy Co-Evolution (described after the Group-Relative Evidence Reward): the reward model is refined exclusively on rollouts whose final answer matches ground truth. In long-context regimes where multiple non-evidence paths can still produce the correct outcome, this creates a supervision signal that can reinforce lexical overlap, position heuristics, or answer-format cues rather than genuine evidence retrieval fidelity, directly undermining the claim of 'precise process guidance.'
Authors: We agree that training the reward model solely on outcome-consistent rollouts carries a risk of capturing spurious correlations in long-context settings. However, this risk is mitigated by the Group-Relative Evidence Reward, which supplies dense, within-group comparative signals explicitly on evidence retrieval quality rather than final-answer correctness alone. The co-evolution step then uses these refined evidence scores to iteratively sharpen the reward model's discrimination, as supported by our Tree-Structured Evidence Sampling results establishing evidence extraction as the primary bottleneck. We will revise the manuscript to add a dedicated limitations subsection discussing potential spurious signals, together with new ablation results on evidence-quality metrics (e.g., evidence coverage and grounding scores) that demonstrate EAPO's gains exceed those obtainable from outcome rewards or lexical heuristics. We therefore maintain that the mechanism delivers precise process guidance, but acknowledge the referee's point warrants explicit discussion.
revision: partial
Circularity Check
No significant circularity in derivation chain
full rationale
The paper proposes the EAPO algorithm and its components (Tree-Structured Evidence Sampling for validation, Group-Relative Evidence Reward, and Adaptive Reward-Policy Co-Evolution) as a new RL method for long-context reasoning. The co-evolution step refines the reward model on outcome-consistent rollouts, but this is an explicit training procedure rather than a mathematical reduction where the output is defined as equivalent to the input by construction. No equations or claims reduce the central performance claims to fitted parameters or self-referential definitions. The validation is presented as empirical sampling results, and the overall derivation relies on benchmark evaluations rather than self-citation chains or imported uniqueness theorems. The method is self-contained against external benchmarks with no load-bearing circular steps identified.