
arxiv: 2506.11480 · v4 · submitted 2025-06-13 · 💻 cs.LG · cs.AI

LearnAlign: Data Selection for LLM Reinforcement Learning with Improved Gradient Alignment

Pith reviewed 2026-05-19 09:32 UTC · model grok-4.3

keywords: data selection · reinforcement learning · LLM post-training · gradient alignment · reasoning models · data efficiency · success rate metric

The pith

LearnAlign selects learnable reasoning data for RL by using success rates to fix gradient length bias, enabling fewer examples with equal or better performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LearnAlign to address the data inefficiency of reinforcement learning with verifiable rewards (RLVR) for improving LLM reasoning. It uses the success rate of responses as a measure of data learnability, correcting the bias whereby longer responses have larger gradient norms. By selecting data that is both representative (via gradient alignment) and learnable (via this success rate), the method trains models on much smaller datasets. Experiments show it can cut the number of training examples by up to 1,000 while reaching 77.5% accuracy on GSM8K, compared with 77.0% on the full dataset. Similar data savings are reported on other math and code benchmarks, with at most minor performance degradation.

Core claim

LearnAlign defines a data learnability metric based on the model's success rate on each reasoning example. This metric adjusts the gradient norm to remove response-length bias. The method then chooses a subset of examples that maximize alignment with the full gradient while having high learnability scores. Training on this subset produces models that match or exceed the performance of training on all available data.
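
One way to picture this selection rule is the sketch below. It is a hedged illustration, not the paper's formula: the text here does not give the exact learnability function or how it enters the score, so the mapping `p * (1 - p)`, the use of cosine similarity against the mean gradient, and the names `success_rate`, `learnability`, and `select_top_k` are all assumptions made for illustration.

```python
from math import sqrt


def success_rate(rewards):
    """Fraction of sampled responses that earned a positive verifiable reward."""
    return sum(1 for r in rewards if r > 0) / len(rewards)


def learnability(p):
    # ASSUMPTION: p * (1 - p) is one plausible mapping from success rate to
    # learning potential (zero when the model always fails or always succeeds).
    # The source text does not state the paper's exact form.
    return p * (1.0 - p)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_top_k(grads, rewards_per_example, k):
    """Rank examples by learnability * alignment with the mean gradient.

    ASSUMPTION: alignment is measured by cosine similarity, which discards
    gradient magnitude and therefore the length-dependent norm.
    """
    n = len(grads)
    dim = len(grads[0])
    mean_grad = [sum(g[d] for g in grads) / n for d in range(dim)]
    scores = [
        learnability(success_rate(rewards)) * cosine(g, mean_grad)
        for g, rewards in zip(grads, rewards_per_example)
    ]
    # Stable sort: ties keep their original order.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

Because the cosine term ignores gradient magnitude, an example with a very large gradient norm (for instance from a long response) does not outrank a shorter one on norm alone; in this sketch the ranking is driven by direction and by the success-rate weight.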

What carries the argument

Data learnability score derived from success rate, used within a gradient-alignment selection procedure to identify high-potential training examples.
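
Written out, the success rate behind the learnability score might look as follows. The notation is assumed here for illustration ($G$ sampled responses $y_{i,1}, \dots, y_{i,G}$ for prompt $x_i$, and a binary verifiable reward $r \in \{0, 1\}$); the function $\ell$ that maps success rate to learnability is left abstract because this page does not state its form.

```latex
p_i = \frac{1}{G} \sum_{j=1}^{G} \mathbb{1}\!\left[ r(x_i, y_{i,j}) = 1 \right],
\qquad
\text{learnability}_i = \ell(p_i)
```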

If this is right

  • On GSM8K, training can use up to 1,000 fewer data points while accuracy rises from 77.0% (full dataset) to 77.5%.
  • Comparable efficiency gains appear on mathematical and code benchmarks when training on a much smaller subset of the DAPO-MATH-17K dataset.
  • The selection maintains or enhances final model performance relative to full data training.
  • Focus shifts to learnable and representative examples rather than using every available reasoning problem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Periodic re-computation of success rates as the model improves could refine the selection further over the course of training.
  • Similar learnability proxies might help in other RL applications where data collection is costly.
  • The results imply that standard reasoning datasets contain many examples with low learning value that can be filtered out early.

Load-bearing premise

That the success rate on a data point provides an unbiased proxy for its learning potential and corrects length bias without creating new harmful selection effects.

What would settle it

If training on the LearnAlign-selected data yields lower accuracy than training on the full dataset for the same number of training steps, across multiple reasoning benchmarks, the method's claimed advantage would be refuted.

Original abstract

Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLMs' reasoning abilities, yet its data inefficiency remains a major bottleneck. To address this critical yet challenging issue, we present a novel gradient-alignment-based method, named LearnAlign, which intelligently selects the learnable and representative training reasoning data for RLVR post-training. To overcome the well-known response-length bias in gradient norms, we introduce the data learnability based on the success rate, which indicates the learning potential of each data point. Experiments across five reasoning benchmarks show that our method significantly reduces training data requirements while achieving minor performance degradation or even improving performance compared to full-data training. Specifically, it reduces data requirements by up to 1,000 data points with better performance (77.5%) than that on the full dataset on the GSM8K benchmark (77.0%). Furthermore, its efficiency is demonstrated on both mathematical and code benchmarks by using much less data from the DAPO-MATH-17K dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents LearnAlign, a gradient-alignment-based data selection method for reinforcement learning with verifiable rewards (RLVR) that aims to improve data efficiency in LLM post-training. It defines data learnability from success rates to mitigate response-length bias in gradient norms, then selects the top-k learnable and representative data points. Experiments on five reasoning benchmarks, including GSM8K, with further training data drawn from DAPO-MATH-17K, show that the method can reduce the number of training data points by up to 1,000 while matching or exceeding full-dataset performance, for example 77.5% vs 77.0% on GSM8K.

Significance. This work addresses the data inefficiency in RLVR, which is a critical issue for scaling reasoning capabilities in LLMs. If the claims hold, LearnAlign could lead to more efficient training pipelines, potentially lowering the computational cost and data requirements for post-training LLMs on mathematical and coding tasks. The introduction of a success-rate-based learnability proxy is a practical contribution that builds on existing gradient-based selection ideas.

major comments (3)
  1. [Method] The precise definition of the learnability score based on success rate and its integration with gradient alignment is not fully specified with equations; this is load-bearing for the claim that it corrects length bias without introducing new effects.
  2. [Experiments] The headline result on GSM8K (77.5% with ~1000 fewer points vs 77.0% full) lacks details on the exact subset size, how the selection threshold is determined, and crucially, an ablation comparing to random selection or length-matched subsets to rule out residual bias.
  3. [§4.1] No statistical tests or variance estimates across multiple seeds are reported for the benchmark gains, making it hard to determine if the 0.5% improvement is meaningful or due to selection effects.
minor comments (2)
  1. [Abstract] The abstract mentions 'minor performance degradation or even improving performance' but does not quantify the range of outcomes across the five benchmarks.
  2. Ensure that all acronyms like RLVR and DAPO-MATH-17K are defined at first use in the main text.
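
The length-matched ablation requested in major comment 2 could be constructed roughly as follows. This is a minimal sketch under stated assumptions: `length_matched_random_subset`, the `bin_width` of 64 tokens, and the choice to draw controls only from unselected examples are hypothetical and not taken from the paper.

```python
import random


def length_matched_random_subset(lengths, selected, bin_width=64, seed=0):
    """Draw a random control subset whose response lengths mirror `selected`.

    lengths:  response length (e.g. in tokens) for every example in the pool
    selected: indices chosen by the method under test
    Returns indices of unselected examples, one per selected example, taken
    from the same length bin. Selected examples with no same-bin candidate
    are skipped, so the control set may be smaller than `selected`.
    """
    rng = random.Random(seed)
    chosen = set(selected)

    # Group unselected examples by length bin (ASSUMPTION: fixed-width bins).
    pool = {}
    for i, n in enumerate(lengths):
        if i not in chosen:
            pool.setdefault(n // bin_width, []).append(i)
    for bucket in pool.values():
        rng.shuffle(bucket)

    control = []
    for i in selected:
        bucket = pool.get(lengths[i] // bin_width, [])
        if bucket:
            control.append(bucket.pop())  # sample without replacement
    return control
```

Training on the control subset and on the selected subset under the same budget would then separate the effect of the selection criterion from the effect of response length alone.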

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for clarification and strengthening of our manuscript. We appreciate the recognition of LearnAlign's potential contribution to improving data efficiency in RLVR. We address each major comment below and commit to revisions that enhance the paper's rigor without altering its core claims.

Point-by-point responses
  1. Referee: [Method] The precise definition of the learnability score based on success rate and its integration with gradient alignment is not fully specified with equations; this is load-bearing for the claim that it corrects length bias without introducing new effects.

    Authors: We agree that the method description would be strengthened by explicit equations. In the revised manuscript, we will add the formal definition of the success-rate-based learnability score (as the fraction of successful responses for a given data point) and its precise integration with the gradient alignment term to produce the final selection score. This addition will explicitly show how the success-rate proxy addresses response-length bias in gradient norms while preserving the alignment objective.

    revision: yes

  2. Referee: [Experiments] The headline result on GSM8K (77.5% with ~1000 fewer points vs 77.0% full) lacks details on the exact subset size, how the selection threshold is determined, and crucially, an ablation comparing to random selection or length-matched subsets to rule out residual bias.

    Authors: We will revise the experimental section to report the precise subset sizes used for the GSM8K results, clarify the top-k selection procedure and any threshold criteria, and add ablations against random selection as well as length-matched random subsets. These additions will provide direct evidence that performance differences arise from the proposed selection criteria rather than residual length or selection artifacts.

    revision: yes

  3. Referee: [§4.1] No statistical tests or variance estimates across multiple seeds are reported for the benchmark gains, making it hard to determine if the 0.5% improvement is meaningful or due to selection effects.

    Authors: We acknowledge the value of statistical reporting. In the revision we will include variance estimates obtained from multiple random seeds for the key benchmarks and apply appropriate statistical tests (e.g., paired t-tests) to evaluate whether the observed gains are statistically significant. Where additional runs are computationally feasible, we will perform them; otherwise we will note the limitation and report available variance.

    revision: partial
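
The paired t-test mentioned in the third response can be sketched as below. This is an illustrative helper only: `paired_t_statistic` and its argument names are hypothetical, and it assumes each seed was run once with the selected subset and once with the full dataset so that accuracies pair up by seed.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(subset_acc, full_acc):
    """Paired t statistic and degrees of freedom for per-seed accuracies.

    subset_acc[i] and full_acc[i] must come from the same seed.
    """
    if len(subset_acc) != len(full_acc):
        raise ValueError("accuracy lists must have the same length")
    n = len(subset_acc)
    if n < 2:
        raise ValueError("need at least two seeds")

    diffs = [a - b for a, b in zip(subset_acc, full_acc)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("differences have zero variance")

    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1
```

Turning the statistic into a p-value needs the Student t distribution with the returned degrees of freedom; `scipy.stats.ttest_rel` performs both steps if SciPy is available.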

Circularity Check

0 steps flagged

No circularity: selection rule uses independent observables

Full rationale

The paper defines data learnability via per-point success rate (an observable training outcome) to debias gradient norms from length effects, then ranks and selects top-k examples by alignment score. This construction does not reduce any claimed prediction or performance gain to a fitted parameter or to the selection rule itself; the reported GSM8K and DAPO-MATH gains are measured on held-out benchmarks after selection, not tautological with the ranking. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the derivation. The method remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim rests on two domain assumptions about gradient bias and success-rate learnability; no explicit free parameters or new entities are named in the abstract, though selection thresholds are likely implicit.

axioms (2)
  • domain assumption: Gradient norms in RLVR exhibit response-length bias that must be corrected for fair data selection.
    Described as a well-known issue that the success-rate learnability metric is introduced to address.
  • domain assumption: Success rate on a data point indicates its learning potential for RLVR.
    Used to define data learnability for selecting representative and learnable examples.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning

    cs.AI · 2026-01 · unverdicted · novelty 6.0

    SCALER creates adaptive synthetic environments for RL-based LLM reasoning training that outperforms fixed-dataset baselines with more stable long-term progress.