arxiv: 2604.03004 · v1 · submitted 2026-04-03 · 💻 cs.CL · cs.AI

R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning

Wanlong Liu, Bo Zhang, Chenliang Li, Shaopeng Lai, Yuning Wu, Xuanyu Lei, Ming Yan

Pith reviewed 2026-05-13 19:27 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords reflection · revision · open-ended writing · chain-of-thought · process reward · reinforcement learning · large language models · creative writing benchmarks


The pith

Explicit reflection and revision patterns unlock deep reasoning for open-ended writing tasks where standard chain-of-thought falls short.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mainstream reasoning models achieve only limited gains on writing tasks, in contrast to their strong performance on mathematics, because they rarely produce genuine reflection and revision steps. The paper introduces R2-Write, an automated synthesis method that generates high-quality thinking trajectories through repeated writer-judge interaction, then applies a process reward during reinforcement learning to suppress redundant loops and improve both quality and efficiency. Experiments on creative writing and deep-research benchmarks show clear improvements, establishing that structured self-correction is the missing element for applying long-chain reasoning to open-ended domains.

Core claim

We introduce R2-Write, an automated framework that synthesizes high-quality thinking trajectories enriched with explicit reflection and revision patterns through iterative writer-judge interaction. A process reward mechanism supervises reflection quality during reinforcement learning to prevent redundant patterns. Extensive experiments across creative writing and deep-research benchmarks demonstrate significant improvements, validating that explicitly incorporating reflection and revision patterns unlocks deep reasoning capabilities for open-ended writing tasks.

What carries the argument

Iterative writer-judge interaction that produces reflection- and revision-enriched trajectories, supervised by a process reward that filters low-quality reflections for both performance and token efficiency.
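To make the writer-judge interaction concrete, here is a minimal sketch of one way such a loop could be wired up. It is not the paper's implementation: the names `synthesize_trajectory`, `draft_fn`, `revise_fn`, and `judge_fn`, the stopping rule, and the default values are all hypothetical, and the toy stand-ins at the bottom replace what would be LLM calls in practice.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """Reflection/revision notes gathered during synthesis, plus the last draft."""
    steps: List[str] = field(default_factory=list)
    final_draft: str = ""


def synthesize_trajectory(
    query: str,
    draft_fn: Callable[[str], str],                         # hypothetical writer: query -> first draft
    revise_fn: Callable[[str, str, str], Tuple[str, str]],  # hypothetical writer: -> (reflection, new draft)
    judge_fn: Callable[[str, str], Tuple[float, str]],      # hypothetical judge: -> (score, feedback)
    max_rounds: int = 3,        # assumption: not a value from the paper
    target_score: float = 0.9,  # assumption: not a value from the paper
) -> Trajectory:
    draft = draft_fn(query)
    traj = Trajectory(final_draft=draft)
    for _ in range(max_rounds):
        score, feedback = judge_fn(query, draft)
        if score >= target_score:
            break  # the judge is satisfied, so stop revising
        reflection, draft = revise_fn(query, draft, feedback)
        traj.steps.append(reflection)  # keep the reflection as part of the thinking trace
        traj.final_draft = draft
    return traj


# Toy stand-ins so the sketch runs without any model.
def toy_draft(query: str) -> str:
    return "draft v0"


def toy_judge(query: str, draft: str) -> Tuple[float, str]:
    version = int(draft.rsplit("v", 1)[1])
    return min(1.0, 0.5 + 0.25 * version), f"tighten section {version + 1}"


def toy_revise(query: str, draft: str, feedback: str) -> Tuple[str, str]:
    version = int(draft.rsplit("v", 1)[1]) + 1
    return f"Wait, I notice that I should {feedback}.", f"draft v{version}"


traj = synthesize_trajectory("Write a short essay.", toy_draft, toy_revise, toy_judge)
```

With the toy functions, the loop revises twice and stops once the judge score reaches the target. Keeping the judge's feedback out of the stored trace and recording only the writer's own reflection mirrors the idea of trajectories that read as self-correction, though how the paper actually formats them is not shown here.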

If this is right

Models trained with R2-Write trajectories outperform standard reasoning models on creative writing and deep-research benchmarks.
The process reward improves both final performance and token efficiency by discouraging redundant reflection steps.
Explicit revision patterns can be synthesized automatically without human-written examples.
The same synthesis approach transfers across multiple open-ended writing domains.
Deep reasoning techniques require domain-specific pattern engineering beyond simple length scaling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may generalize to other non-verifiable domains such as long-form planning or open-ended code generation where self-correction matters more than final-answer verification.
Process-level rewards could become a standard tool for training agents on tasks lacking clear correctness signals.
Future work could test whether the same trajectories improve multi-turn dialogue or iterative editing workflows.
If reflection quality can be verified cheaply, the approach might reduce reliance on expensive outcome-based rewards in open domains.

Load-bearing premise

The writer-judge loop reliably generates trajectories with genuine deep reflection and revision instead of superficial or repetitive patterns that the reward then reinforces.

What would settle it

Inspection of the generated trajectories shows mostly shallow or repetitive self-talk, or training on R2-Write data yields no measurable gain over standard chain-of-thought on the same writing benchmarks.
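One rough way to screen trajectories for repetitive self-talk is to measure how often a reflection step nearly duplicates an earlier one. The sketch below uses word-set Jaccard overlap as a cheap proxy; the function names and the 0.8 threshold are assumptions made for illustration, and this is not the metric the paper uses. A lexical check like this cannot detect shallow but differently worded reflections, so it would complement manual inspection rather than replace it.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased word set; punctuation is ignored."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two steps, in [0, 1]."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def redundant_fraction(steps: List[str], threshold: float = 0.8) -> float:
    """Fraction of steps (after the first) that nearly repeat an earlier step.

    The threshold is a hypothetical default, not a value from the paper.
    """
    if len(steps) < 2:
        return 0.0
    flagged = 0
    for i in range(1, len(steps)):
        if any(jaccard(steps[i], steps[j]) >= threshold for j in range(i)):
            flagged += 1
    return flagged / (len(steps) - 1)


example_steps = [
    "Wait, the intro is too long.",
    "Wait, the intro is too long!",
    "The ending lacks a concrete example.",
]
score = redundant_fraction(example_steps)
```

In the example, the second step repeats the first and the third is new, so half of the later steps are flagged.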

Figures

Figures reproduced from arXiv: 2604.03004 by Bo Zhang, Chenliang Li, Ming Yan, Shaopeng Lai, Wanlong Liu, Xuanyu Lei, Yuning Wu.

**Figure 1.** Thinking pattern analysis. The first row shows the pattern distributions of three reasoning models on WritingBench and MATH500. The second row reports, for each model and task, the proportion of patterns that are judged to be helpful for obtaining the correct (or high-scoring) answer. All pattern annotations are obtained using Claude-4.5-Sonnet (Anthropic, 2025).

**Figure 2.** Overview of the R2-Write pipeline, which consists of three main parts: query data selection, data creation, and RL.

**Figure 3.** Token length distribution of thinking trajectories across different methods on WritingBench.

**Figure 4.** Data distribution of our constructed training set, which includes both SFT and RL data. (a) shows the domain distribution for creative writing tasks, and (b) shows the category distribution for report generating tasks.

**Figure 5.** Effectiveness of reflection pattern usage. We categorize cases where reflection patterns are triggered into three outcomes: Win (R2-Write outperforms baseline), Tie (comparable performance), and Lose (baseline outperforms R2-Write). DG represents Deepresearch Gym, and WB means WritingBench. The vast majority of reflection instances lead to performance improvements, demonstrating that the model effective…

**Figure 6.** Evolution of total reward, process reward, and answer reward during RL training (steps 0-400). All three reward components show consistent improvement, indicating that the model learns to optimize both the reasoning process and final answer quality simultaneously.

**Figure 7.** Comparison of response length dynamics during RL training. Our model (right) shows decreasing response length, indicating more efficient reasoning, while the baseline Qwen3-8B + RL (left) exhibits increasing verbosity.

**Figure 8.** Case 1: Complete Reflection and Revision Process in WritingBench.

**Figure 9.** Case 2: Complete Reflection and Revision Process in WritingBench.

**Figure 10.** Prompt for generating query-specific evaluation rubrics for language model responses.

**Figure 11.** Prompt for generating general-quality evaluation criteria.

**Figure 12.** Prompt for constructing self-reflection and revision thought process during Data Construction.

**Figure 13.** Prompt for response revision based on self-reflection during Data Construction.

**Figure 14.** The generation prompt for Math tasks in our experiments. Prompt text: "System: You are a helpful assistant. Please provide a detailed response to the following writing task. User: {Question}"

**Figure 15.** The Generation Prompt on Writing tasks (WritingBench and HelloBench) in our experiments.

**Figure 16.** Prompt for Answer Reward in RL training. Prompt text (truncated in the source): "Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should cons…"

**Figure 17.** Prompt for Process Reward in RL training (Step 1).

**Figure 18.** Prompt for Process Reward in RL training (Step 2).

**Figure 19.** Prompt for analyzing patterns of thinking trajectories in language model outputs. For writing tasks, we replace the Ground Truth in the prompt with Evaluation Rubrics.

**Figure 20.** Prompt for classifying revision types in writing task reflection. The model analyzes the reasoning content to identify the primary issue: Requirement Alignment (RA), Factual & Logical Correction (FLC), or Quality Enhancement (QE).

Original abstract

While deep reasoning with long chain-of-thought has dramatically improved large language models in verifiable domains like mathematics, its effectiveness for open-ended tasks such as writing remains unexplored. In this paper, we conduct a systematic investigation revealing that existing mainstream reasoning models achieve limited gains on open-ended writing tasks. Our further analysis shows that these models lack deep reflection and revision patterns in open-ended writing, resulting in substantially smaller improvements compared to mathematical reasoning tasks. To address this limitation, we introduce R2-Write: an automated framework that synthesizes high-quality thinking trajectories enriched with explicit reflection and revision patterns through iterative writer-judge interaction. To prevent redundant reflections, we design a process reward mechanism that supervises reflection quality during reinforcement learning, improving both performance and token efficiency. Extensive experiments across multiple creative writing and deep-research benchmarks demonstrate significant improvements, validating that explicitly incorporating reflection and revision patterns unlocks deep reasoning capabilities for open-ended writing tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

R2-Write gives a practical writer-judge loop plus process reward to inject reflection into writing trajectories, but the gains look modest and the depth claim rests on thin validation.


The main thing to know is that this paper identifies a clear gap: standard reasoning models add little on open-ended writing because they skip real reflection and revision. It then offers R2-Write as a way to synthesize trajectories that contain those patterns through repeated writer-judge turns, with a process reward to cut redundancy during RL training. The experiments report gains on creative writing and deep-research benchmarks, which is a reasonable first step beyond math-focused CoT work. The framework itself is straightforward to describe and the efficiency angle from the reward is sensible.

What the paper does well is lay out the contrast with verifiable domains and show that the missing behavior is measurable in the generated traces. The citation choices stay focused on the relevant reasoning literature without padding.

The soft spots are around attribution and measurement. The iterative loop uses models from the same family, so it can easily converge on surface cues that the reward then reinforces; without detailed ablations on reflection depth, token-level controls, or compute-matched baselines, the reported improvements could come from extra data volume or training steps rather than genuine revision quality. The abstract and setup do not yet show how reflection quality is scored beyond non-redundancy, which leaves the central claim vulnerable.

This work is aimed at groups extending reasoning methods to non-verifiable tasks like creative or research writing. A reader already running synthetic data pipelines or RL on LLMs would get concrete ideas to try. It deserves peer review because the problem is real and the method is novel enough to test, even if the current evidence needs tightening on controls and depth metrics.

Referee Report

3 major / 2 minor

Summary. The paper claims that mainstream reasoning models show limited gains on open-ended writing tasks because they lack deep reflection and revision patterns, unlike in mathematics. It introduces R2-Write, which synthesizes high-quality trajectories via iterative writer-judge LLM interactions, augmented by a process reward that supervises reflection quality during RL to avoid redundancy. Experiments on creative writing and deep-research benchmarks are said to demonstrate significant improvements, supporting the conclusion that explicitly adding reflection and revision unlocks deep reasoning for open-ended writing.

Significance. If the results hold after addressing validation gaps, the work would meaningfully extend chain-of-thought techniques beyond verifiable domains into creative and research writing. The process-reward design for efficiency is a practical contribution that could inform agentic writing systems. The synthesis loop offers a reproducible way to generate training data for reflection, which is a strength if the trajectories are shown to be non-superficial.

major comments (3)

[Method (iterative synthesis loop)] The weakest assumption—that the writer-judge loop reliably yields genuine deep reflection rather than repetitive or keyword-driven patterns—is load-bearing for the central claim but receives insufficient validation. Because both roles use models from the same family, the process reward (which penalizes redundancy) may reinforce surface cues without external grounding that the steps improve reasoning depth.
[Experiments] Experiments section: benchmark gains are reported without ablations isolating the contribution of reflection patterns from extra inference-time compute or data volume. It remains possible that improvements arise from the iterative process itself rather than the specific reflection/revision content, undermining the causal link to the proposed mechanism.
[Process reward mechanism] Process reward definition (likely §4): the mechanism is described as supervising reflection quality, but without explicit formulation or metrics distinguishing depth from redundancy, it is unclear how the reward prevents amplification of shallow patterns. Provide the reward function or training details to allow assessment of this safeguard.

minor comments (2)

[Abstract] Abstract: add one or two concrete metrics (e.g., relative improvement on a named benchmark) to make the performance claim more informative.
[Method] Notation for writer and judge roles should be introduced consistently in the method section to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas for strengthening the validation of our core claims, and we address each point below with plans for revisions.


Referee: The weakest assumption—that the writer-judge loop reliably yields genuine deep reflection rather than repetitive or keyword-driven patterns—is load-bearing for the central claim but receives insufficient validation. Because both roles use models from the same family, the process reward (which penalizes redundancy) may reinforce surface cues without external grounding that the steps improve reasoning depth.

Authors: We agree that stronger validation of reflection depth is needed. In the revision, we will add a dedicated analysis section with: (1) qualitative trajectory examples contrasting R2-Write reflections against baseline iterative loops, (2) human evaluation scores on reflection depth and usefulness (with inter-annotator agreement), and (3) quantitative metrics such as reasoning step diversity and semantic novelty. We will also explicitly discuss the same-family model limitation and its implications for generalizability.
Revision: yes
Referee: Experiments section: benchmark gains are reported without ablations isolating the contribution of reflection patterns from extra inference-time compute or data volume. It remains possible that improvements arise from the iterative process itself rather than the specific reflection/revision content, undermining the causal link to the proposed mechanism.

Authors: We acknowledge the need for targeted ablations. The revised manuscript will include new experiments that: (a) compare R2-Write against an iterative baseline with matched token budgets and data volume but without the reflection-specific process reward, and (b) report performance under fixed compute constraints. These will isolate the contribution of the reflection/revision patterns from mere iteration.
Revision: yes
Referee: Process reward definition (likely §4): the mechanism is described as supervising reflection quality, but without explicit formulation or metrics distinguishing depth from redundancy, it is unclear how the reward prevents amplification of shallow patterns. Provide the reward function or training details to allow assessment of this safeguard.

Authors: We apologize for the insufficient detail in the original submission. The revision will include the full mathematical formulation of the process reward (including the redundancy penalty term based on semantic similarity and the depth supervision signal derived from reasoning step count and logical progression), along with all training hyperparameters, reward scaling factors, and implementation specifics in §4 and the appendix.
Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; framework is empirically grounded


The paper introduces R2-Write as an iterative writer-judge synthesis loop plus process reward for reflection quality, then reports benchmark gains on creative writing and research tasks. No equations, fitted parameters, or self-referential definitions appear in the abstract or description. The claimed improvement is presented as an experimental outcome rather than a quantity that reduces by construction to the method's own inputs or to a self-citation chain. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only information limits visibility into parameters and assumptions; the central claim rests on the unverified premise that reflection patterns synthesized via judge interaction are both high-quality and causally responsible for gains.

axioms (1)

Domain assumption: Existing mainstream reasoning models lack deep reflection and revision patterns on open-ended writing tasks.
Stated as the result of the paper's analysis in the abstract.

pith-pipeline@v0.9.0 · 5471 in / 1109 out tokens · 37822 ms · 2026-05-13T19:27:34.158264+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: R2-Write: an automated framework that synthesizes high-quality thinking trajectories enriched with explicit reflection and revision patterns through iterative writer-judge interaction... process reward mechanism that supervises reflection quality during reinforcement learning

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: We decompose the total reward into a process-level and an answer-level component: R_all = α R_a + (1-α) R_p ... R_p evaluates how the thinking trajectory uses 'reflection' and 'revision'
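The reward decomposition in the quoted passage, R_all = α R_a + (1-α) R_p, is a convex combination of the answer reward and the process reward. A small sketch of that arithmetic follows. The function name `total_reward` and the range check on α are illustrative assumptions, and the quoted passage does not state the α value used in training.

```python
def total_reward(answer_reward: float, process_reward: float, alpha: float) -> float:
    """Compute R_all = alpha * R_a + (1 - alpha) * R_p.

    alpha weights the answer-level reward R_a; (1 - alpha) weights the
    process-level reward R_p. Restricting alpha to [0, 1] is an assumption
    made here so that the result stays between the two component rewards.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * answer_reward + (1.0 - alpha) * process_reward


# Hypothetical numbers: a strong answer with a weak reasoning process.
mixed = total_reward(answer_reward=1.0, process_reward=0.0, alpha=0.7)
```

At α = 1 the process reward is ignored entirely, and at α = 0 only the use of reflection and revision in the thinking trajectory is rewarded, so α controls how strongly training is pushed toward well-formed reflection relative to final answer quality.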

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
