
arxiv: 2601.10306 · v2 · submitted 2026-01-15 · 💻 cs.AI · cs.CL

Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning

Pith reviewed 2026-05-16 14:13 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords long-context reasoning · reinforcement learning · evidence retrieval · policy optimization · reward co-evolution · process supervision · LLM

The pith

Adding group-relative evidence rewards and co-evolving the reward model with the policy improves long-context LLM reasoning over sparse-outcome baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard reinforcement learning for large language models fails in long contexts because final-answer rewards are too sparse to supervise the step of locating relevant evidence amid large amounts of text. It validates this by using tree-structured evidence sampling to isolate evidence extraction as the decisive performance bottleneck rather than the reasoning process itself. EAPO then supplies dense supervision through a Group-Relative Evidence Reward and keeps that supervision accurate by iteratively refining the reward model only on rollouts whose final answers match the ground truth. The result is higher accuracy on long-context benchmarks without changing the underlying model architecture.

Core claim

We establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling that precise evidence extraction is the decisive bottleneck for long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward, providing dense process supervision to explicitly improve evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism. This mechanism iteratively refines the reward model using outcome-consistent rollouts, sharpening its discriminative capability to ensure precise process guidance. EAPO shows

What carries the argument

Group-Relative Evidence Reward together with Adaptive Reward-Policy Co-Evolution, which supplies dense process-level signals by scoring evidence quality relative to other rollouts and refines the scorer using only outcome-consistent data.
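The abstract does not give the formula for the Group-Relative Evidence Reward, so the following is only a minimal sketch of one plausible reading: reward-model scores for rollouts of the same question are centered and scaled within their group, in the style of group-relative advantage estimation. The function name, the `eps` constant, and the example scores are assumptions for illustration, not the paper's definitions.

```python
from statistics import mean, pstdev


def group_relative_evidence_rewards(scores, eps=1e-6):
    """Center and scale reward-model evidence scores within one rollout group.

    Assumption: z-score normalization over the group. The paper may use a
    different normalization or combine this with an outcome reward.
    """
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation over the group
    return [(s - mu) / (sigma + eps) for s in scores]


# Hypothetical evidence scores for four rollouts answering the same question.
scores = [0.9, 0.4, 0.4, 0.1]
rewards = group_relative_evidence_rewards(scores)
```

Under this reading, a rollout is rewarded for extracting better evidence than its siblings rather than for reaching an absolute score, which is what makes the signal dense even when most rollouts share the same final answer.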

Load-bearing premise

Refining the reward model only on outcome-consistent rollouts lets it reliably judge evidence-retrieval quality without introducing bias or overfitting to the training distribution.
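To make the premise concrete, here is a hedged sketch of the filtering step it depends on: keeping only rollouts whose final answer matches the ground truth before they are used to refine the reward model. The `Rollout` type, the whitespace-and-case normalization, and the example data are hypothetical; the paper's actual matching rule and refinement objective are not described in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    evidence: str  # text the policy extracted from the long context
    answer: str    # final answer produced by the policy


def normalize(text):
    """Assumed matching rule: case-insensitive, whitespace-collapsed exact match."""
    return " ".join(text.lower().split())


def outcome_consistent(rollouts, gold_answer):
    """Keep rollouts whose final answer matches the ground-truth answer."""
    gold = normalize(gold_answer)
    return [r for r in rollouts if normalize(r.answer) == gold]


# Hypothetical rollouts for one question with gold answer "1648".
rollouts = [
    Rollout(evidence="The treaty was signed in 1648.", answer="1648"),
    Rollout(evidence="(no supporting passage quoted)", answer=" 1648 "),
    Rollout(evidence="The war began in 1618.", answer="1618"),
]
kept = outcome_consistent(rollouts, "1648")
```

The second rollout passes the filter despite quoting no supporting passage. A filter of this kind checks only the outcome, so whether the refined reward model learns evidence quality depends on what else constrains it.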

What would settle it

A separate measurement of evidence-retrieval precision on held-out long documents showing that EAPO produces no measurable gain in retrieval quality or final accuracy compared with standard outcome-reward training.
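One simple way such a measurement could be set up, assuming held-out documents come with annotated gold evidence passages, is set-based precision and recall over passage identifiers. The function and the passage IDs below are hypothetical and are not the paper's evaluation protocol.

```python
def evidence_precision_recall(predicted_ids, gold_ids):
    """Set-based precision and recall of retrieved evidence passages.

    Assumption: each passage has a stable identifier and gold evidence
    annotations exist for the held-out documents.
    """
    predicted, gold = set(predicted_ids), set(gold_ids)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: 3 passages retrieved, 4 annotated as gold, 2 overlap.
precision, recall = evidence_precision_recall(
    ["p3", "p7", "p9"], ["p7", "p9", "p12", "p15"]
)
```

Comparing these numbers between an EAPO-trained policy and an outcome-reward baseline on the same held-out set would speak directly to whether retrieval quality, and not only final accuracy, improved.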

Original abstract

While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by sparsity of outcome rewards. This limitation fails to penalize ungrounded "lucky guesses," leaving the critical process of needle-in-a-haystack evidence retrieval largely unsupervised. To address this, we propose EAPO (Evidence-Augmented Policy Optimization). We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling that precise evidence extraction is the decisive bottleneck for long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward, providing dense process supervision to explicitly improve evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism. This mechanism iteratively refines the reward model using outcome-consistent rollouts, sharpening its discriminative capability to ensure precise process guidance. Comprehensive evaluations across eight benchmarks demonstrate that EAPO significantly enhances long-context reasoning performance compared to SOTA baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces EAPO (Evidence-Augmented Policy Optimization) for RL-based long-context LLM reasoning. It first validates an Evidence-Augmented Reasoning paradigm via Tree-Structured Evidence Sampling, establishing precise evidence extraction as the decisive bottleneck. EAPO then deploys a Group-Relative Evidence Reward for dense process supervision and an Adaptive Reward-Policy Co-Evolution step that iteratively refines the reward model on outcome-consistent rollouts. Experiments across eight benchmarks report significant gains over SOTA baselines.

Significance. If the central claims hold, the work would meaningfully advance RL for long-context reasoning by converting sparse outcome rewards into dense evidence-focused signals. The Tree-Structured Evidence Sampling result and the co-evolution mechanism could supply a reusable template for process supervision in needle-in-haystack settings, provided the supervision remains grounded rather than proxy-driven.

major comments (1)
  1. [Adaptive Reward-Policy Co-Evolution mechanism] Adaptive Reward-Policy Co-Evolution (described after the Group-Relative Evidence Reward): the reward model is refined exclusively on rollouts whose final answer matches ground truth. In long-context regimes where multiple non-evidence paths can still produce the correct outcome, this creates a supervision signal that can reinforce lexical overlap, position heuristics, or answer-format cues rather than genuine evidence retrieval fidelity, directly undermining the claim of 'precise process guidance.'
minor comments (1)
  1. The abstract and method description do not specify the exact architecture or training objective of the reward model (e.g., whether it is a separate LLM head or a fine-tuned copy of the policy), making it difficult to assess reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The concern about the Adaptive Reward-Policy Co-Evolution mechanism is substantive, and we respond point-by-point below while clarifying how the Group-Relative Evidence Reward anchors the supervision in evidence quality.

Point-by-point responses
  1. Referee: [Adaptive Reward-Policy Co-Evolution mechanism] Adaptive Reward-Policy Co-Evolution (described after the Group-Relative Evidence Reward): the reward model is refined exclusively on rollouts whose final answer matches ground truth. In long-context regimes where multiple non-evidence paths can still produce the correct outcome, this creates a supervision signal that can reinforce lexical overlap, position heuristics, or answer-format cues rather than genuine evidence retrieval fidelity, directly undermining the claim of 'precise process guidance.'

    Authors: We agree that training the reward model solely on outcome-consistent rollouts carries a risk of capturing spurious correlations in long-context settings. However, this risk is mitigated by the Group-Relative Evidence Reward, which supplies dense, within-group comparative signals explicitly on evidence retrieval quality rather than final-answer correctness alone. The co-evolution step then uses these refined evidence scores to iteratively sharpen the reward model's discrimination, as supported by our Tree-Structured Evidence Sampling results establishing evidence extraction as the primary bottleneck. We will revise the manuscript to add a dedicated limitations subsection discussing potential spurious signals, together with new ablation results on evidence-quality metrics (e.g., evidence coverage and grounding scores) that demonstrate EAPO's gains exceed those obtainable from outcome rewards or lexical heuristics. We therefore maintain that the mechanism delivers precise process guidance, but acknowledge the referee's point warrants explicit discussion.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper proposes the EAPO algorithm and its components (Tree-Structured Evidence Sampling for validation, Group-Relative Evidence Reward, and Adaptive Reward-Policy Co-Evolution) as a new RL method for long-context reasoning. The co-evolution step refines the reward model on outcome-consistent rollouts, but this is an explicit training procedure rather than a mathematical reduction where the output is defined as equivalent to the input by construction. No equations or claims reduce the central performance claims to fitted parameters or self-referential definitions. The validation is presented as empirical sampling results, and the overall derivation relies on benchmark evaluations rather than self-citation chains or imported uniqueness theorems. The method is self-contained against external benchmarks with no load-bearing circular steps identified.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that a learned reward model can serve as accurate process supervision and that outcome-consistent rollouts suffice to keep it calibrated.

pith-pipeline@v0.9.0 · 5485 in / 1204 out tokens · 39069 ms · 2026-05-16T14:13:22.209813+00:00 · methodology
