Data-Efficient RLVR via Off-Policy Influence Guidance
Pith reviewed 2026-05-18 02:41 UTC · model grok-4.3
The pith
Off-policy influence estimation lets RLVR select the most useful data and train 2.66 times faster while using only 10 percent of the data per stage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CROPI is a multi-stage curriculum RL framework that iteratively selects the data points with highest influence on the current policy's objective. Influence is estimated off-policy from pre-collected trajectories rather than fresh rollouts, and high-dimensional gradients are compressed via sparse random projection. On a 1.5B model this produces a 2.66x step-level acceleration while retaining only 10 percent of the data at each stage relative to training on the full set.
What carries the argument
Off-policy influence estimation from pre-collected trajectories, combined with sparse random projection, to rank and select data for each stage of the RLVR curriculum.
If this is right
- Training proceeds with only 10 percent of the usual data per stage yet reaches target performance in roughly one-third the number of steps on 1.5B models.
- The same selection procedure scales at least to 7B-parameter models while preserving the reported efficiency gains.
- Data selection acquires a theoretical basis grounded in the learning objective rather than relying on heuristic scores.
- The multi-stage loop allows the selected dataset to adapt as the policy improves, creating an implicit curriculum.
Where Pith is reading between the lines
- If the off-policy approximation remains stable across policy updates, the same technique could reduce data needs in other reinforcement-learning settings that currently require fresh rollouts for every candidate example.
- The method opens the possibility of starting RLVR from smaller seed datasets and letting influence guidance grow the effective training set over stages.
- Combining influence-guided selection with other efficiency tools such as gradient checkpointing or quantization could produce compounding reductions in total compute.
Load-bearing premise
The off-policy influence scores computed from fixed trajectories remain a close enough proxy for the true influence that the same data points would have under online rollouts from the evolving policy.
What would settle it
A controlled comparison on a small model in which the top-k data points chosen by the off-policy estimator differ substantially from those chosen by exact online influence computation would show the approximation introduces large bias.
read the original abstract
Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic-based, lacking theoretical guarantees and generalizability. This work proposes a theoretically-grounded approach using influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop \textbf{C}urriculum \textbf{R}L with \textbf{O}ff-\textbf{P}olicy \text{I}nfluence guidance (\textbf{CROPI}), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training. On a 1.5B model, it achieves a 2.66x step-level acceleration while using only 10\% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CROPI, a multi-stage curriculum RL framework for data-efficient RLVR in LLMs. It estimates each data point's influence on the learning objective via influence functions, approximates this off-policy using pre-collected trajectories to avoid online rollouts, applies sparse random projections to manage high-dimensional LLM gradients, and iteratively selects the top 10% most influential data per stage. On a 1.5B model it reports a 2.66x step-level training acceleration relative to full-dataset training.
Significance. If the off-policy approximation is shown to be low-bias and the influence estimates remain predictive across stages, the method would supply a principled, theoretically motivated alternative to heuristic data selection in RLVR, with clear potential to reduce compute and data requirements for reasoning-focused LLM training.
major comments (1)
- Abstract: the central 2.66x acceleration claim depends on the off-policy influence estimator accurately approximating the true online influence of each trajectory on the current policy without large bias; the provided text contains no derivation, error analysis, or empirical validation of this approximation.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for highlighting the importance of validating the off-policy influence approximation that underpins our acceleration claims. We address the major comment below and will revise the manuscript to strengthen the presentation.
read point-by-point responses
-
Referee: Abstract: the central 2.66x acceleration claim depends on the off-policy influence estimator accurately approximating the true online influence of each trajectory on the current policy without large bias; the provided text contains no derivation, error analysis, or empirical validation of this approximation.
Authors: We agree that the abstract, due to length constraints, does not contain the full derivation or error analysis. The manuscript body derives the off-policy estimator by replacing online policy rollouts with influence scores computed on pre-collected trajectories from earlier stages, and provides a bias bound that holds when the trajectory distribution remains sufficiently close to the current policy (via a Lipschitz continuity argument on the influence function). Empirical validation appears in the experiments, where we report that the selected top-10% subsets produce training curves that closely track full-data training while achieving the stated 2.66x step acceleration on the 1.5B model. To make the abstract self-contained, we will add one sentence summarizing the off-policy approximation and its observed low bias. revision: yes
Circularity Check
No significant circularity identified
full rationale
The abstract describes a method for data selection in RLVR using influence functions with an off-policy approximation based on pre-collected trajectories and sparse random projection for dimensionality reduction. No equations, derivations, proofs, or self-citations appear in the available text, so no load-bearing steps can be examined that reduce by construction to fitted inputs or prior self-referential definitions. The central claims rest on experimental comparisons to full-dataset training rather than a closed mathematical chain, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
off-policy influence estimation ... bgβ(θ,s0)≈1/K Σ ∇θ ρπθ k,t bAβ k,t (Eq. 4)
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Curriculum RL with Off-Policy Influence guidance (CROPI)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration
NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.
-
Data Attribution in Adaptive Learning
Occurrence-level attribution in finite-horizon adaptive learning is defined via a conditional interventional target, shown to be unrecoverable from replay data in general but identifiable in a specific structural clas...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.