Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning

Chenliang Li; Fei Huang; Kaiming Liu; Ming Yan; Peng Li; Weizhou Shen; Xuanyu Lei; Yang Liu; Ya-Qin Zhang; Yuning Wu

arxiv: 2506.05760 · v2 · submitted 2025-06-06 · 💻 cs.CL

Pith reviewed 2026-05-19 10:49 UTC · model grok-4.3

keywords: long-form writing · reinforcement learning · curriculum learning · large language models · pairwise comparison · adaptive scheduling · generalization

The pith

Adaptive curriculum reinforcement learning improves long-form writing over supervised fine-tuning on 7B-scale models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Writing-RL as a way to push long-form writing capabilities in large language models past the limits of supervised fine-tuning. SFT hits data saturation and performance ceilings, while typical reinforcement learning needs clear verifiable answers that open-ended writing tasks lack. The approach uses margin-aware data selection to focus on promising examples, pairwise comparison rewards to create learning signals without ground truth, and dynamic reference scheduling to match task difficulty to the model's current level. Experiments indicate that this training yields better long-form writing results than strong SFT baselines and that the resulting models transfer unexpectedly well to long-input reasoning problems.

Core claim

Writing-RL is an Adaptive Curriculum Reinforcement Learning framework consisting of margin-aware data selection that prioritizes samples with high learning potential, a pairwise comparison reward mechanism that supplies discriminative signals without verifiable rewards, and dynamic reference scheduling that adaptively adjusts task difficulty based on evolving model performance. When applied to 7B-scale writer models, this framework improves long-form writing performance over strong SFT baselines. Models trained with long-output RL also generalize well to long-input reasoning tasks.
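To make the pairwise comparison reward concrete, here is a minimal sketch of how a judge verdict could be turned into a scalar reward. The `Judge` signature, the both-orders query (a common way to reduce position bias), and the 0 / 0.5 / 1 reward scale are all assumptions for illustration; the source does not specify them. `length_judge` is a toy stand-in for an LLM judge.

```python
from typing import Callable

# Hypothetical judge signature: returns "A", "B", or "tie" for (prompt, text_a, text_b).
Judge = Callable[[str, str, str], str]


def pairwise_reward(prompt: str, candidate: str, reference: str, judge: Judge) -> float:
    """Score a candidate against a reference by asking the judge in both orders.

    Querying both orders is an assumption made here to dampen position bias.
    """
    first = judge(prompt, candidate, reference)   # candidate shown as "A"
    second = judge(prompt, reference, candidate)  # candidate shown as "B"
    wins = (first == "A") + (second == "B")
    losses = (first == "B") + (second == "A")
    # Map the net outcome to [0, 1]: 1.0 wins both orders, 0.0 loses both,
    # 0.5 for ties or mixed verdicts.
    return 0.5 + 0.25 * (wins - losses)


def length_judge(prompt: str, a: str, b: str) -> str:
    """Toy judge that prefers the longer text; a real setup would call an LLM."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"
```

Because the reward only depends on relative preference, no ground-truth answer is needed, which is the property the core claim relies on.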

What carries the argument

Adaptive Curriculum Reinforcement Learning framework with margin-aware data selection, pairwise comparison reward mechanism, and dynamic reference scheduling.
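As an illustration of margin-aware data selection, the sketch below ranks samples by the gap between the best available reference score and the current model's score. The `Sample` fields, the margin definition, and the `min_margin` and `top_k` parameters are hypothetical; the source only says that samples with high learning potential are prioritized.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    policy_score: float            # quality score of the current model's output
    reference_scores: list[float]  # quality scores of candidate reference outputs


def margin(sample: Sample) -> float:
    """Assumed margin: headroom between the best reference and the policy output."""
    return max(sample.reference_scores) - sample.policy_score


def select_by_margin(samples: list[Sample], min_margin: float, top_k: int) -> list[Sample]:
    """Keep samples whose margin clears the threshold, largest margin first."""
    eligible = [s for s in samples if margin(s) >= min_margin]
    eligible.sort(key=margin, reverse=True)
    return eligible[:top_k]
```

A sample the model already writes as well as any reference has little margin and is filtered out, so training effort concentrates where improvement is still possible.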

If this is right

Long-form writing performance exceeds strong SFT baselines on 7B-scale models.
Models trained on long-output reinforcement learning transfer to long-input reasoning tasks.
The framework enables reinforcement learning for open-ended tasks that lack ground-truth rewards.
Dynamic reference scheduling helps match difficulty to current model capability during training.
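The idea of matching difficulty to current capability can be sketched as a per-sample ladder of references: once the model beats the current reference, the sample moves to a stronger one. The class name, the weakest-to-strongest ordering, and the `win_threshold` value are assumptions for illustration, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ReferenceSchedule:
    references: list[str]  # assumed to be ordered from weakest to strongest
    level: int = 0

    @property
    def current(self) -> str:
        """Reference the model is currently asked to beat."""
        return self.references[self.level]

    def update(self, reward: float, win_threshold: float = 0.5) -> bool:
        """Advance to the next reference when the reward indicates a win.

        Returns True if the schedule advanced, False otherwise.
        """
        if reward > win_threshold and self.level < len(self.references) - 1:
            self.level += 1
            return True
        return False
```

Keeping one schedule per training sample lets each sample progress at its own pace, so easy prompts reach strong references early while hard prompts stay on weaker ones.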

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar pairwise comparison methods might support reinforcement learning in other subjective or creative generation domains.
Training focus on long outputs could indirectly strengthen coherence and handling of extended contexts in multiple directions.
The observed transfer suggests output-length training may serve as one route to broader long-context improvements.

Load-bearing premise

Pairwise comparison rewards can generate sufficiently clear and useful learning signals for reinforcement learning when no verifiable ground-truth answers exist in open-ended writing.

What would settle it

If 7B-scale models trained with Writing-RL show no measurable gains in long-form writing quality compared to matched SFT baselines when evaluated on the same benchmarks, the central effectiveness claim would be contradicted.

Figures

Figures reproduced from arXiv: 2506.05760 by Chenliang Li, Fei Huang, Kaiming Liu, Ming Yan, Peng Li, Weizhou Shen, Xuanyu Lei, Yang Liu, Ya-Qin Zhang, Yuning Wu.

**Figure 1.** Overall framework of Writing-RL. 1) Margin-aware Data Selection: prioritizes samples with high learning potential; 2) Pairwise Comparison Reward: provides more discriminative reward signals; 3) Dynamic Reference Scheduling: adaptively incentivizes the model to surpass progressively stronger references.

**Figure 2.** Sample-wise asynchronous learning schedule.

**Figure 3.** Length distribution of our long-output RL training dataset and long-input evaluation dataset.

Original abstract

Recent advances in Large Language Models (LLMs) have enabled strong performance in long-form writing, but current training paradigms remain limited: Supervised Fine-Tuning (SFT) remains constrained by data saturation and performance ceilings, while Reinforcement Learning with Verifiable Reward (RLVR), though successful in verifiable domains like math and code, cannot be directly migrated to open-ended long-form writing due to a lack of ground-truths. To further advance long-form writing, we present Writing-RL: an Adaptive Curriculum Reinforcement Learning framework to advance long-form writing capabilities beyond SFT. The framework consists of three key components: Margin-aware Data Selection strategy that prioritizes samples with high learning potential, Pairwise Comparison Reward mechanism that provides discriminative learning signals in the absence of verifiable rewards, and Dynamic Reference Scheduling approach, which plays a critical role by adaptively adjusting task difficulty based on evolving model performance. Experiments on 7B-scale writer models show that Writing-RL effectively improves long-form writing performance over strong SFT baselines. Furthermore, we observe that models trained with long-output RL generalize surprisingly well to long-input reasoning tasks, potentially offering a promising perspective for rethinking long-context training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Writing-RL combines margin-aware selection, pairwise rewards, and dynamic scheduling into an adaptive RL loop for long-form writing. It reports gains over SFT on 7B models plus some generalization to long-input reasoning tasks, but the pairwise reward signal needs tighter validation.

This paper's core move is to adapt RL to open-ended writing by replacing verifiable rewards with a pairwise comparison setup, then layering on margin-based data picking and dynamic difficulty adjustment. The result is a curriculum that reportedly lifts 7B writer models past strong SFT baselines and even transfers to long-input reasoning tasks in a way the authors call surprising.

Referee Report

2 major / 2 minor

Summary. The paper introduces Writing-RL, an adaptive curriculum reinforcement learning framework for advancing long-form writing capabilities in LLMs beyond the limits of supervised fine-tuning (SFT). It consists of three components: a Margin-aware Data Selection strategy to prioritize samples with high learning potential, a Pairwise Comparison Reward mechanism to supply discriminative signals without verifiable ground truth, and Dynamic Reference Scheduling to adaptively adjust task difficulty based on model performance. Experiments on 7B-scale writer models report improvements over strong SFT baselines, along with unexpected generalization from long-output RL training to long-input reasoning tasks.

Significance. If the empirical results hold after addressing validation gaps, the work offers a practical route for applying RL to open-ended, non-verifiable domains such as creative writing. The reported generalization effect from long-output training to long-input reasoning tasks is a potentially high-impact observation that could prompt re-examination of long-context training strategies. The adaptive curriculum design directly targets data saturation issues in SFT and merits further exploration if the reward signal can be shown to be reliable.

major comments (2)

[Pairwise Comparison Reward (framework description)] The Pairwise Comparison Reward mechanism is load-bearing for the headline claim of improvement over SFT, yet the manuscript provides no validation that its signals are stable or non-hallucinated for subjective long-form writing. No inter-judge agreement rates, human correlation studies, or noise-injection ablations are reported, leaving open the possibility that gains arise from reward hacking amplified by the adaptive curriculum rather than genuine writing improvement.
[Experiments and results] The experimental results section claims effective improvement on 7B-scale models over strong SFT baselines and surprising generalization to long-input reasoning, but supplies no exact metric values, statistical significance tests, number of evaluation runs, or component-wise ablation tables. Without these, the magnitude and robustness of the reported gains cannot be assessed.
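The inter-judge agreement rates requested in the first major comment could be reported with a chance-corrected statistic such as Cohen's kappa. The sketch below computes it for two judges' verdicts over the same writing pairs; the label values and the choice of kappa are assumptions, since the report does not name a specific agreement measure.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two judges labelling the same items (e.g. "A", "B", "tie")."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both judges gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each judge's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges always use the same single label
    return (observed - expected) / (1.0 - expected)
```

A kappa near 0 would mean the judges agree no more often than chance, which would support the referee's concern about an unstable reward signal.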

minor comments (2)

[Abstract] The abstract refers to 'strong SFT baselines' without naming the specific models, datasets, or performance numbers used for comparison.
[Introduction and terminology] Notation for 'long-output RL' versus 'long-input reasoning' should be defined once and used consistently to avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating where revisions will strengthen the manuscript while defending the core contributions on substantive grounds.

Referee: [Pairwise Comparison Reward (framework description)] The Pairwise Comparison Reward mechanism is load-bearing for the headline claim of improvement over SFT, yet the manuscript provides no validation that its signals are stable or non-hallucinated for subjective long-form writing. No inter-judge agreement rates, human correlation studies, or noise-injection ablations are reported, leaving open the possibility that gains arise from reward hacking amplified by the adaptive curriculum rather than genuine writing improvement.

Authors: We agree that explicit validation of the Pairwise Comparison Reward is essential given its central role. The original manuscript described the use of a strong LLM judge for pairwise comparisons to generate discriminative signals without ground truth, but did not report stability metrics. In revision we will add: (1) inter-judge agreement rates across multiple independent LLM judges on the same writing pairs, (2) human correlation analysis on a representative subset of samples, and (3) noise-injection ablations that perturb the reward signal to test robustness. These additions will directly address concerns about hallucination or hacking and will be presented in a new subsection on reward reliability. revision: yes
Referee: [Experiments and results] The experimental results section claims effective improvement on 7B-scale models over strong SFT baselines and surprising generalization to long-input reasoning, but supplies no exact metric values, statistical significance tests, number of evaluation runs, or component-wise ablation tables. Without these, the magnitude and robustness of the reported gains cannot be assessed.

Authors: We concur that the experimental reporting requires greater precision. The revised manuscript will expand the results section to include: exact numerical values for all reported metrics with standard deviations, results from three independent evaluation runs, statistical significance tests (paired t-tests) between Writing-RL and SFT baselines, and full component-wise ablation tables isolating the contribution of Margin-aware Data Selection, Pairwise Comparison Reward, and Dynamic Reference Scheduling. These details were summarized for space in the initial submission but will now be presented comprehensively. revision: yes
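The paired t-test mentioned in the rebuttal compares two systems on the same evaluation items. A minimal sketch using only the standard library follows; the function name is hypothetical and any scores passed to it here would be illustrative placeholders, not results from the paper. It returns the t statistic and degrees of freedom, leaving the p-value lookup to a statistics package.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a: list[float], scores_b: list[float]) -> tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired scores on the same items."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the per-item differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

Pairing matters because writing prompts vary widely in difficulty; testing per-item differences removes that shared variance.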

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

The paper presents Writing-RL as an empirical framework with three new components (Margin-aware Data Selection, Pairwise Comparison Reward, Dynamic Reference Scheduling) applied to standard RL training on 7B models. Experimental gains over SFT baselines and generalization observations are reported as outcomes of training runs rather than derived by algebraic reduction or self-definition. No equations are shown that equate a prediction to a fitted input by construction, and the central premise does not rest on a load-bearing self-citation chain or imported uniqueness theorem. The Pairwise Comparison Reward is introduced to address the absence of verifiable rewards and is evaluated via downstream performance, not assumed true by definitional fiat. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard RL assumptions plus domain-specific choices for reward and scheduling; no new physical entities or heavy free parameters are introduced in the abstract.

free parameters (1)

margin threshold in data selection
Used to prioritize samples with high learning potential; value chosen to define high-potential examples.

axioms (1)

Domain assumption: Pairwise comparisons between model outputs can supply reliable discriminative signals for open-ended writing tasks without ground-truth references.
Invoked to enable RLVR-style training in the absence of verifiable rewards.

pith-pipeline@v0.9.0 · 5761 in / 1182 out tokens · 41268 ms · 2026-05-19T10:49:15.835325+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Pairwise Comparison Reward mechanism that provides discriminative learning signals in the absence of verifiable rewards

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and orbit embedding · unclear
Paper passage: Dynamic Reference Scheduling approach, which plays a critical role by adaptively adjusting task difficulty based on evolving model performance

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
cs.LG 2026-05 unverdicted novelty 6.0

DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

arXiv preprint arXiv:2504.03380 , year=

Online difficulty filtering for reasoning oriented reinforcement learning. arXiv preprint arXiv:2504.03380. Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024a. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. Co...

work page arXiv 2009
[2]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. CoRR, abs/2501.12948. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur H...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

GPT-4 Technical Report

DeepScaleR: Surpassing O1-Preview with a 1.5b model by scaling rl. Notion Blog. Pratyush Maini, Skyler Seto, Richard He Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly. 2024. Rephrasing the web: A recipe for compute and data-efficient language modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V o...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

arXiv preprint arXiv:2410.23933 , year=

Association for Computational Linguistics. Shanghaoran Quan, Tianyi Tang, Bowen Yu, An Yang, Dayiheng Liu, Bofei Gao, Jianhong Tu, Yichang Zhang, Jingren Zhou, and Junyang Lin. 2024. Language models can self-lengthen to generate long texts. CoRR, abs/2410.23933. Haoran Que, Feiyu Duan, Liqun He, Yutao Mou, Wangchunshu Zhou, Jiaheng Liu, Wenge Rong, Zeku...

work page arXiv 2024
[5]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

DAPO: an open-source LLM reinforcement learning system at scale. CoRR, abs/2503.14476. Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, and 1 others

work page internal anchor Pith review Pith/arXiv arXiv
[6]

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118. Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, an...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

In our experiment, we use the proximal policy optimization (PPO) (Schulman et al., 2017) algorithm with generalized advantage estimation (GAE) as the advantage estimator

to train our models. In our experiment, we use the proximal policy optimization (PPO) (Schulman et al., 2017) algorithm with generalized advantage estimation (GAE) as the advantage estimator. The training process is conducted using a batch size of 32 for training, with a maximum prompt length of 4096 tokens and response length capped at 10,000 tokens ...

work page 2017
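For readers unfamiliar with the advantage estimator named in this extract, here is a minimal sketch of generalized advantage estimation over a single trajectory. The `gamma` and `lam` defaults are assumptions; the extract does not state the values used in the paper.

```python
def gae_advantages(
    rewards: list[float],
    values: list[float],
    gamma: float = 1.0,  # discount factor (assumed value)
    lam: float = 0.95,   # GAE smoothing parameter (assumed value)
) -> list[float]:
    """Compute GAE advantages for one trajectory.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final step (0.0 if terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With a single reward at the end of a long response, `lam` controls how strongly that terminal signal is propagated back to earlier tokens.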
[8]

We utilize a rollout strategy based on the vLLM engine with a tensor model parallel size of 2

with a warm-up ratio of 0.4 to train the actor model, while the critic adopts a higher learning rate (1e-5) with a warm-up ratio of 0.05. We utilize a rollout strategy based on the vLLM engine with a tensor model parallel size of 2. The KL divergence penalty is set to a modest coefficient of 0.001. We train each model for about 400 steps and evaluate the ...

work page
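The warm-up ratios in this extract can be read as the fraction of training steps spent ramping the learning rate up to its peak. The sketch below assumes a linear ramp followed by a constant rate; both the shape of the schedule and the function name are assumptions, and the actor's peak learning rate is elided in the extract, so only the critic's stated values are used in the usage note.

```python
def warmup_lr(step: int, total_steps: int, peak_lr: float, warmup_ratio: float) -> float:
    """Learning rate at a zero-based step under linear warm-up, then constant.

    The linear ramp and the absence of decay are illustrative assumptions.
    """
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr
```

For the critic settings quoted above (learning rate 1e-5, warm-up ratio 0.05, about 400 steps), `warmup_lr(step, 400, 1e-5, 0.05)` would ramp over the first 20 steps under these assumptions.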
[9]

Relevance: From content highly relevant and fully applicable to the user’s request to completely irrelevant or inapplicable

work page
[10]

Accuracy: From content completely accurate with no factual errors or misleading information to content with numerous errors and highly misleading

work page
[11]

Coherence: From clear structure with smooth logical connections to disorganized structure with no coherence

work page
[12]

Clarity: From clear language, rich in detail, and easy to understand to confusing expression with minimal details

work page
[13]

Breadth and Depth: From both broad and deep content with a lot of information to seriously lacking breadth and depth with minimal information

work page
[14]

Analysis

Reading Experience: From excellent reading experience, engaging and easy to understand content to very poor reading experience, boring and hard to understand content. Please evaluate the quality of the following response to a user’s request according to the above requirements. <User Request> $INST$ </User Request> <Response> $RESPONSE$ </Response> ...

work page 2023
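The judge prompt extract uses `$INST$` and `$RESPONSE$` as placeholders. A minimal sketch of filling them is shown below; the helper name and the abbreviated `TEMPLATE_STUB` are hypothetical, and the full prompt text is truncated in the extract, so it is not reproduced here. A single regex pass is used so that placeholder-like text inside the user request is not substituted a second time.

```python
import re

# Abbreviated, hypothetical stub; the full judge prompt is truncated in the source.
TEMPLATE_STUB = "<User Request> $INST$ </User Request> <Response> $RESPONSE$ </Response>"


def fill_judge_prompt(template: str, instruction: str, response: str) -> str:
    """Replace the $INST$ and $RESPONSE$ placeholders in one pass."""
    values = {"$INST$": instruction, "$RESPONSE$": response}
    return re.sub(r"\$(?:INST|RESPONSE)\$", lambda m: values[m.group(0)], template)
```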
