pith. machine review for the scientific record.

arxiv: 2604.07809 · v1 · submitted 2026-04-09 · 💻 cs.LG · cs.AI

Recognition: no theorem link

PolicyLong: Towards On-Policy Context Extension

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:19 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords long-context extension · on-policy learning · data synthesis · LLM training · entropy screening · self-curriculum · context window

The pith

Making long-context data construction dynamic by re-screening examples with the current model creates a self-curriculum that improves LLM performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current methods for building long-context training data run a single offline pass with a fixed model, selecting contexts that lower predictive entropy. This produces a static dataset that drifts away from the model's improving capabilities during training. PolicyLong instead loops the entire screening process—entropy computation, retrieval, and verification—using the model at each step, so the chosen positive examples and hard negatives both come from the model's present uncertainty landscape. The loop generates an emergent curriculum where the training distribution tracks what the model can now handle and what it still resists. Experiments on standard long-context benchmarks show this on-policy approach yields consistent gains that widen as context length increases.
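
As a concrete reading of that loop, here is a minimal Python sketch; the callables (sample_targets, entropy, retrieve, train_stage), the staging, and the acceptance threshold are placeholders assumed for illustration, not the paper's actual interface.

from typing import Callable

def on_policy_data_loop(
    model,
    sample_targets: Callable,   # model -> iterable of target positions to screen
    entropy: Callable,          # (model, target, context or None) -> predictive entropy
    retrieve: Callable,         # target -> list of candidate long-range contexts
    train_stage: Callable,      # (model, batch) -> updated model after one stage
    num_stages: int = 3,
    min_entropy_drop: float = 0.0,
):
    """Sketch of on-policy long-context data construction: every stage re-runs
    screening, retrieval, and verification with the *current* model."""
    for _ in range(num_stages):
        batch = []
        for target in sample_targets(model):
            base_h = entropy(model, target, None)  # uncertainty without added context
            positives, negatives = [], []
            for ctx in retrieve(target):
                drop = base_h - entropy(model, target, ctx)  # verification signal
                (positives if drop > min_entropy_drop else negatives).append(ctx)
            if positives:
                # Positives and hard negatives come from the same current
                # entropy landscape, so they co-evolve with the model.
                batch.append((target, positives, negatives))
        # The next stage re-screens with the updated model: the self-curriculum.
        model = train_stage(model, batch)
    return model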

Core claim

PolicyLong shifts data construction toward a dynamic on-policy paradigm by iteratively re-executing entropy-based screening, retrieval, and verification with the current model, ensuring the training distribution tracks evolving capabilities and both positive and hard negative contexts derive from the model's entropy landscape.

What carries the argument

The iterative on-policy loop that reapplies entropy computation and verification using the model being trained to generate co-evolving positive and negative contexts.

If this is right

  • Gains from the method increase as context length grows to 128K tokens and beyond.
  • Positive examples and hard negatives co-evolve from the same current entropy map.
  • The training distribution remains aligned with the model's changing capabilities throughout optimization.
  • The approach can be applied on top of existing base models such as Qwen2.5-3B.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar iterative loops might reduce reliance on large static synthetic datasets for other LLM capabilities.
  • The frequency of re-screening iterations could be tuned to balance curriculum freshness against compute cost.
  • Extending the loop to multi-stage training might allow gradual scaling to contexts longer than those tested.

Load-bearing premise

Iteratively re-executing the screening steps with the current model produces a stable self-curriculum that follows the model's progress without training instability or collapse.

What would settle it

Applying the iterative screening loop to a base model and finding no improvement, or even degradation, on long-context benchmarks relative to single-pass methods would indicate the on-policy alignment does not hold.

Figures

Figures reproduced from arXiv: 2604.07809 by Chaochen Gao, Feng Zhang, Junlong Jia, Songlin Hu, Tinghao Yu, Xing Wu, Ziyang Chen.

Figure 1: Overview of PolicyLong. Top: iterative on-policy training across stages, where the current model …
Figure 2: Empirical evidence for the off-policy gap. (a) The entropy distribution of positions initially identified …
Figure 3: Data difficulty progression. We measure the loss reduction from the base model to the progressively …
Figure 4: Needle-in-a-Haystack evaluation results for PolicyLong. The heatmap shows retrieval accuracy across …
Figure 5: RULER detailed analysis. (a) Subtask breakdown at 128K: PolicyLong achieves the largest gains on …
Original abstract

Extending LLM context windows is hindered by scarce high-quality long-context data. Recent methods synthesize data with genuine long-range dependencies via information-theoretic verification, selecting contexts that reduce a base model's predictive entropy. However, their single-pass offline construction with a fixed model creates a fundamental off-policy gap: the static screening landscape misaligns with the model's evolving capabilities, causing the training distribution to drift. We propose PolicyLong, shifting data construction towards a dynamic on-policy paradigm. By iteratively re-executing data screening (entropy computation, retrieval, and verification) using the current model, PolicyLong ensures the training distribution tracks evolving capabilities, yielding an emergent self-curriculum. Crucially, both positive and hard negative contexts derive from the current model's entropy landscape, co-evolving what the model learns to exploit and resist. Experiments on RULER, HELMET, and LongBench-v2 (Qwen2.5-3B) show PolicyLong consistently outperforms EntropyLong and NExtLong, with gains growing at longer contexts (e.g., +2.54 at 128K on RULER), confirming the value of on-policy data evolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes PolicyLong, an on-policy approach to long-context data synthesis for LLMs. It iteratively re-executes entropy-based screening, retrieval, and verification using the current model (instead of a fixed offline model) to generate positive and hard-negative contexts that co-evolve with the model's capabilities, producing an emergent self-curriculum. Experiments with Qwen2.5-3B on RULER, HELMET, and LongBench-v2 report consistent outperformance over EntropyLong and NExtLong, with gains increasing at longer contexts (e.g., +2.54 at 128K on RULER).

Significance. If the causal mechanism is confirmed, the shift from offline to on-policy data construction could meaningfully improve synthetic long-context training by reducing distribution drift, offering a scalable path for context extension beyond current information-theoretic methods. The reported trend of growing gains at 128K contexts is a promising empirical signal. The work explicitly builds on prior entropy-verification techniques while adding dynamic adaptation, and the multi-benchmark evaluation provides a solid starting point for assessing practical impact.

major comments (3)
  1. [§4 (Experiments)] The central claim that on-policy iteration produces a beneficial self-curriculum (rather than gains from cumulative data volume or hyperparameter effects) is not supported by ablations on iteration count, data diversity metrics (entropy variance or n-gram overlap across rounds), or training stability indicators (loss curves, gradient norms). The +2.54 gain at 128K on RULER could therefore be explained by offline alternatives trained on equivalent total data.
  2. [§3.2 (On-Policy Iteration Procedure)] The description of how positive/hard-negative pairs derived from the evolving entropy landscape remain informative and non-redundant across iterations lacks any analysis of potential drift, mode collapse, or overfitting to transient model weaknesses. No checks on iteration dynamics are reported, leaving the stability assumption unverified.
  3. [Table 1 / Figure 3 (Benchmark Results)] Statistical significance, variance across multiple random seeds, or controls for total training tokens are not reported for the cross-method comparisons. This weakens the claim of consistent outperformance and growing gains at longer contexts.
minor comments (2)
  1. [Abstract / §1] The abstract and §1 could briefly define 'on-policy' in this data-construction setting (e.g., by explicit contrast to the fixed-model offline baseline) to improve accessibility for readers outside RL.
  2. [§3] Notation for the iterative update (e.g., model at iteration t, entropy landscape E_t) should be introduced once in §3 and used consistently thereafter.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on strengthening the empirical support for the on-policy self-curriculum in PolicyLong. We address each major point below and will revise the manuscript accordingly to incorporate additional ablations, stability analyses, and statistical controls.

Point-by-point responses
  1. Referee: [§4 (Experiments)] The central claim that on-policy iteration produces a beneficial self-curriculum (rather than gains from cumulative data volume or hyperparameter effects) is not supported by ablations on iteration count, data diversity metrics (entropy variance or n-gram overlap across rounds), or training stability indicators (loss curves, gradient norms). The +2.54 gain at 128K on RULER could therefore be explained by offline alternatives trained on equivalent total data.

    Authors: We agree that isolating the self-curriculum effect requires explicit controls for data volume. In the revised manuscript, we will add ablations that fix total training tokens while varying iteration count (subsampling data as needed for fairness). We will also report data diversity metrics including entropy variance across rounds and n-gram overlap between iteration-specific contexts, plus training stability indicators such as loss curves and gradient norms (see the diagnostic sketch after these responses). These will show that gains persist beyond equivalent-volume offline baselines, supporting that the on-policy adaptation contributes independently. revision: yes

  2. Referee: [§3.2 (On-Policy Iteration Procedure)] The description of how positive/hard-negative pairs derived from the evolving entropy landscape remain informative and non-redundant across iterations lacks any analysis of potential drift, mode collapse, or overfitting to transient model weaknesses. No checks on iteration dynamics are reported, leaving the stability assumption unverified.

    Authors: We acknowledge the absence of explicit dynamics analysis. We will expand §3.2 with new checks: entropy distribution shifts over iterations to detect drift, pairwise overlap metrics on selected positive/hard-negative contexts to quantify redundancy, and diversity tracking on hard-negatives to monitor mode collapse. We will also demonstrate that selected pairs do not overfit transient weaknesses by evaluating continued gains on held-out long-context benchmarks, thereby verifying the stability of the on-policy loop. revision: yes

  3. Referee: [Table 1 / Figure 3 (Benchmark Results)] Statistical significance, variance across multiple random seeds, or controls for total training tokens are not reported for the cross-method comparisons. This weakens the claim of consistent outperformance and growing gains at longer contexts.

    Authors: We agree these elements are needed for rigorous comparison. In the revision, we will re-run the primary experiments across at least three random seeds and report means with standard deviations in Table 1 and Figure 3, along with statistical significance tests confirming outperformance. We will also explicitly control and document that all methods use identical total training tokens (via matched sample counts), and verify that the trend of increasing gains at longer contexts holds under these conditions. revision: yes
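
For concreteness, here is a minimal sketch of the round-over-round diagnostics promised above (n-gram overlap between screening rounds and per-round entropy variance); the metric choices and function names are assumptions for illustration, not the authors' planned implementation.

from statistics import pvariance
from typing import Iterable, List, Sequence

def ngram_set(tokens: Sequence[str], n: int = 4) -> set:
    """All n-grams in one selected context."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def round_overlap(round_a: Iterable[Sequence[str]],
                  round_b: Iterable[Sequence[str]],
                  n: int = 4) -> float:
    """Jaccard overlap of n-grams between the contexts selected in two
    screening rounds; values near 1 would indicate redundant re-selection."""
    a = set().union(*(ngram_set(t, n) for t in round_a))
    b = set().union(*(ngram_set(t, n) for t in round_b))
    return len(a & b) / max(len(a | b), 1)

def entropy_variance_per_round(entropies: List[List[float]]) -> List[float]:
    """Variance of screening entropies in each round; a collapse toward zero
    could signal mode collapse in what the loop keeps selecting."""
    return [pvariance(e) if len(e) > 1 else 0.0 for e in entropies]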

Circularity Check

0 steps flagged

No circularity: procedural on-policy method with external empirical validation

full rationale

The paper advances a data-construction algorithm (iterative entropy screening, retrieval, and verification using the current model) and reports its superiority via direct comparisons on fixed external benchmarks (RULER, HELMET, LongBench-v2). No mathematical derivation, fitted parameter, or uniqueness theorem is presented whose output is definitionally equivalent to its input. The self-curriculum emerges from the stated procedure rather than from any closed loop that presupposes the measured gains. Self-citations, if present, are not load-bearing for any core claim.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on domain assumptions about entropy as a reliable signal for useful long-range dependencies and the stability of iterative on-policy updates; no explicit free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Predictive entropy reduction identifies contexts with genuine long-range dependencies useful for training
    Invoked as the basis for data screening and verification in the information-theoretic approach.
  • ad hoc to paper Iterative on-policy re-screening produces a beneficial self-curriculum without instability
    Assumed to enable the training distribution to track evolving capabilities.
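
To make the first axiom concrete, here is a minimal sketch of an entropy-reduction screen, assuming a Shannon-entropy criterion over next-token distributions and an arbitrary threshold (min_drop); the paper's exact verification rule may differ.

import numpy as np

def shannon_entropy(probs: np.ndarray) -> float:
    """H(p) = -sum p log p in nats, for a next-token distribution."""
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

def passes_entropy_screen(p_without_ctx: np.ndarray,
                          p_with_ctx: np.ndarray,
                          min_drop: float = 0.1) -> bool:
    """Accept a candidate context only if conditioning on it reduces the
    model's predictive entropy at the target position by at least min_drop."""
    return shannon_entropy(p_without_ctx) - shannon_entropy(p_with_ctx) >= min_drop

# Toy example: the retrieved context sharpens the next-token distribution,
# so it passes the screen (entropy drops from ~1.39 to ~0.59 nats).
p_base = np.array([0.25, 0.25, 0.25, 0.25])
p_ctx = np.array([0.85, 0.05, 0.05, 0.05])
assert passes_entropy_screen(p_base, p_ctx)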

pith-pipeline@v0.9.0 · 5508 in / 1489 out tokens · 67634 ms · 2026-05-10T17:19:10.137647+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

17 extracted references · 15 canonical work pages · 5 internal anchors

  1. [1]

    Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks, 2025

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204.

  2. [2]

    What is wrong with perplexity for long-context language modeling?

    Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, and Yisen Wang. What is wrong with perplexity for long-context language modeling? arXiv preprint arXiv:2410.23771.

  3. [3]

    Quest: Query-centric data synthesis approach for long-context scaling of large language model

    Chaochen Gao, Xing Wu, Qi Fu, and Songlin Hu. QUEST: Query-centric data synthesis approach for long-context scaling of large language model. arXiv preprint arXiv:2405.19846.

  4. [4]

    Longmagpie: A self-synthesis method for generating large-scale long-context instructions, 2025a

    Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, and Songlin Hu. Longmagpie: A self-synthesis method for generating large-scale long-context instructions. arXiv preprint arXiv:2505.17134, 2025a.

    Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, and Songlin Hu. NExtLong: Toward effective long-context training without long documents. arXiv preprint arXiv:2501.12...

  5. [5]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802.

  6. [6]

    dots.llm1 technical report

    Bi Huo, Bin Tu, Cheng Qin, Da Zheng, Debing Zhang, Dongjie Zhang, En Li, Fu Guo, Jian Yao, Jie Lou, et al. dots.llm1 technical report. arXiv preprint arXiv:2506.05767.

  7. [7]

    EntropyLong: Effective long-context training via predictive uncertainty

    Junlong Jia, Ziyang Chen, Xing Wu, Chaochen Gao, Zijia Lin, Debing Zhang, Songlin Hu, and Binghui Guo. EntropyLong: Effective long-context training via predictive uncertainty. arXiv preprint arXiv:2510.02330.

  8. [8]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 6769–6781.

  9. [9]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.

  10. [10]

    Lost in the Middle: How Language Models Use Long Contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.

  11. [11]

    RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering

    Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 5835–5847.

  12. [12]

    Pope: Learning to reason on hard problems via privileged on-policy exploration, 2026

    Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. Pope: Learning to reason on hard problems via privileged on-policy exploration. arXiv preprint arXiv:2601.18779, 2026.

  13. [13]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897.

  14. [14]

    jina-embeddings-v3: Multilingual embeddings with task LoRA, 2024

    Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, et al. jina-embeddings-v3: Multilingual embeddings with task LoRA. arXiv preprint arXiv:2409.10173, 2024.

  15. [15]

    Effective long-context scaling of foundation models

    Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039.

  16. [16]

    Helmet: How to evaluate long-context language models effectively and thoroughly, 2025

    Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, and Danqi Chen. HELMET: How to evaluate long-context language models effectively and thoroughly. arXiv preprint arXiv:2410.02694.

  17. [17]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734.