POLARIS: Guiding Small Models to Write Long Stories

Jenna Russell; John Frederick Wieting; Mohit Iyyer; Rishanth Rajendhran

arxiv: 2606.04095 · v1 · pith:MZHZGJKPnew · submitted 2026-06-02 · 💻 cs.CL · cs.AI

POLARIS: Guiding Small Models to Write Long Stories

Rishanth Rajendhran , Jenna Russell , Mohit Iyyer , John Frederick Wieting This is my paper

Pith reviewed 2026-06-28 10:14 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords long-form story writingGRPOLLM-as-a-judgelength generalizationsmall language modelscreative writingpolicy optimizationhuman reference injection

0 comments

The pith

POLARIS trains a 9B model with LLM-judge rewards and human anchors to match larger models on long stories while generalizing beyond training lengths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a specific GRPO training recipe can make small open-weight models competitive at long-form creative writing. It combines an online reward signal from a frontier LLM equipped with a structured story quality rubric and the injection of human-written reference stories as high-reward anchors inside each training group. Applied to Qwen3.5-9B on roughly 1.4K prompt-story pairs using four A100 GPUs, the resulting POLARIS-9B model follows requested lengths more reliably than larger open models and maintains quality on prompts asking for stories up to three times the training maximum, a regime where most models degrade. A blinded human study finds it preferred to the base 9B model and comparable to a 27B model. The work positions length generalization as a practical stress test that separates otherwise similar creative-writing systems.

Core claim

POLARIS-9B, produced by applying the POLARIS recipe (frontier LLM judge with Story Quality rubric plus human-reference injection inside GRPO groups) to Qwen3.5-9B, is competitive with much larger open-weight models on five benchmarks while adhering more closely to length instructions; it preserves quality on out-of-distribution prompts requesting stories up to three times the 4k-word training length where other models degrade in quality, length adherence, or both.

What carries the argument

The POLARIS recipe: a frontier LLM judge supplying online rewards via a structured Story Quality rubric combined with human-reference injection (HRI) that places a teacher-forced human story as a high-reward anchor inside each GRPO group.

If this is right

Small models can be steered to respect explicit length targets without sacrificing story quality even when targets exceed training lengths by a factor of three.
Length adherence and quality retention on longer prompts can be used as a distinguishing benchmark for creative-writing models that are otherwise close in standard evaluations.
The combination of LLM-as-judge rewards and human-reference anchors inside GRPO enables effective training with only 1.4K examples and modest hardware.
Blinded human raters rate POLARIS-9B above its base 9B model and on par with a 27B model across story quality dimensions.
The method works across both in-distribution and out-of-distribution prompts and rubrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the judge rubric generalizes, the same training pattern could be applied to other open-ended generation tasks such as long-form code or dialogue.
The length-generalization stress test may expose weaknesses in models that appear strong on shorter, in-distribution creative writing benchmarks.
Human-reference injection may reduce reward hacking by providing an explicit high-quality anchor that the policy must compete against inside each group.

Load-bearing premise

The frontier LLM judge equipped with the structured Story Quality rubric supplies a reliable reward signal that aligns with human preferences for story quality.

What would settle it

A large-scale blinded human preference study in which POLARIS-9B stories are rated lower than Qwen3.5-27B stories or show clear quality drop-off on 12k-word prompts would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2606.04095 by Jenna Russell, John Frederick Wieting, Mohit Iyyer, Rishanth Rajendhran.

**Figure 2.** Figure 2: summarizes the full pipeline. 3.1 Structured LLM-as-Judge Reward Motivation. Prior RL approaches to creative writing rely on a reward model trained on pairwise human preference data (Wu et al., 2026; Liao et al., 2026; Li et al., 2026). Trained reward models collapse writing quality into a single scalar, require a separate training pipeline, and can go stale as training progresses and policy rollouts d… view at source ↗

**Figure 3.** Figure 3: Per-dimension Story Quality eval scores (Flash judge) over training steps, plotted as a percent of each tracked dimension’s maximum under the full core rubric (6 positive, 10 negative). Top row: positive dimensions. Bottom row: negative dimensions, plotted as penalty magnitude so downward is better. Benchmark n stories ICC(3,1) Mean Kendall τ Within-story SD % variance from judge Story Quality 20 0.973 0.8… view at source ↗

**Figure 4.** Figure 4: Held-out validation Story Quality (left axis, solid) and composite reward (right axis, dashed) over training steps for Qwen3.5 9B+GRPO and POLARIS– 9B on a 100-prompt validation split (in-distribution length, 1–4k; Gemini 3 Flash judge). Both curves continue to improve through one epoch, with POLARIS– 9B consistently ahead of the plain GRPO run without HRI. higher than GPT-5.4-mini on its rubric (83.0 vs … view at source ↗

**Figure 5.** Figure 5: Distribution of length-adjusted Story Quality scores per model, by length-bucket grouping. For signed Story Quality, we first normalize the raw rubric score from its floor and ceiling to [0, 100], then multiply by min(LR, 1/LR). writing quality. The study covers 60 prompt– generation pairs randomly sampled from two outof-distribution benchmarks: EQ-Bench Creative (24) and LongBench-Write (36). We do not s… view at source ↗

**Figure 6.** Figure 6: Pairwise win rate (P(row beats column)) over the per-dimension judge calls (Gemini 3 Flash, dualposition). Diagonal cells are blank; colors run from red (row loses) through yellow (50/50) to green (row wins). The Elo ratings in [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗

**Figure 7.** Figure 7: Hierarchical clustering (k=7 Ward linkage) of the 18 evaluated models by their per-dimension Story Quality profile. Cluster labels are hand-annotated post-hoc to describe the prose patterns characteristic of each tier; the clustering itself is derived only from the per-dim mean vectors. G Model-Specific Diagnostics We close with targeted diagnostics for modelspecific failure modes and ablation-side detail… view at source ↗

**Figure 8.** Figure 8: Story Quality per-dimension profile across the evaluated model set (GPT-5.4 judge, 180-prompt evaluation set), with an additional overall length-adherence column. Each cell shows a normalized percentage. For positive dimensions we report 100 · s/18; for negative dimensions we report 100 · (1 − p/20) so higher remains better; for length adherence we report 100 · min(LR, 1/LR), where LR is the aggregate gene… view at source ↗

read the original abstract

Small open-weight models struggle at long-form creative writing: their generated stories either fall far short of the requested length, or their quality significantly degrades as length increases, especially when compared to frontier models. We present POLARIS (Policy Optimization with LLM-as-a-judge rewards and Anchored-Reference Injection for Storywriting), a lower-compute GRPO recipe with two key ingredients: a frontier LLM judge with a structured Story Quality rubric as the online reward, and human-reference injection (HRI), where a teacher-forced human-written story serves as a high-reward anchor within each GRPO group. By applying our training recipe to Qwen3.5-9B, using a dataset of approximately 1.4K prompt-story pairs derived from 100 short-story anthologies and 4 A100 GPUs, we obtain POLARIS-9B. Across five benchmarks spanning in-distribution and out-of-distribution prompts and rubrics, POLARIS-9B is competitive with much larger open-weight models while following length instructions more closely. A blinded human evaluation confirms that POLARIS-9B is preferred to the base Qwen3.5-9B and on par with Qwen3.5-27B. Despite training only on stories up to 4k words, POLARIS-9B preserves quality on prompts requesting stories up to 3 times the training length, a regime where most open-weight models degrade substantially in quality, length adherence, or both. More broadly, our results suggest that length generalization is a meaningful stress test for creative-writing models and a useful lens for distinguishing otherwise close models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

POLARIS shows a 9B model can match 27B on length-controlled stories via GRPO with LLM judge rewards plus human reference anchors, but the judge-human correlation on training outputs is not directly tested.

read the letter

The core result is that applying GRPO to Qwen3.5-9B with a frontier LLM judge using a structured Story Quality rubric, plus human-reference injection as anchors in each group, produces a model that follows length instructions more closely than larger open-weight baselines and holds quality on prompts up to three times the 4k-word training length. The blinded human study finds it preferred over the base 9B and on par with the 27B version across the reported benchmarks.

The method is new in its specific combination of the rubric-based judge reward and the anchoring trick inside GRPO groups. It is practical: the training uses roughly 1.4k prompt-story pairs from anthologies and runs on 4 A100s. The length generalization test is a useful way to separate models that otherwise look similar.

The main soft spot is the reward signal. The paper relies on the LLM judge for all online rewards during training, yet it does not report whether those numeric scores correlate with human ratings on the same generations produced mid-training. The human preference results are only on the final outputs, so any mismatch between judge and human preferences could still be exploited. The modest dataset size also leaves open questions about sensitivity to the exact rubric or the choice of frontier judge.

This work is aimed at groups working on efficient controllable generation or small-model creative writing. It has enough concrete method details and mixed automatic-plus-human results to deserve a serious referee, even if the judge validation needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces POLARIS, a GRPO-based training method for small open-weight models (applied to Qwen3.5-9B) to improve long-form creative story writing. It combines an online reward from a frontier LLM judge using a structured Story Quality rubric with human-reference injection (HRI), where human-written stories serve as high-reward anchors in each GRPO group. Trained on ~1.4K prompt-story pairs (stories up to 4k words) using 4 A100 GPUs, POLARIS-9B is claimed to match or exceed larger models in quality while better adhering to length instructions, and to generalize quality to prompts requesting up to 12k words (3x training length) where baselines degrade. Results are supported by five benchmarks (in- and out-of-distribution) and a blinded human preference evaluation showing preference over the base 9B model and parity with a 27B model.

Significance. If the central claims hold, the work offers a low-compute recipe for enhancing creative writing in smaller models, with length generalization serving as a useful stress test. The explicit use of HRI as an anchor and the focus on out-of-distribution length provide a concrete, falsifiable contribution that could be adopted or extended in subsequent work on LLM alignment for open-ended generation tasks.

major comments (2)

[Training procedure and human evaluation sections] The manuscript reports a post-training blinded human preference study (POLARIS-9B preferred to base Qwen3.5-9B, on par with 27B) but provides no direct validation—such as correlation coefficients, agreement rates, or side-by-side ratings—between the frontier LLM judge's numeric scores (the sole online reward during GRPO) and human judgments on the identical generations produced during or after training. This is load-bearing for the central claim because any systematic mismatch (e.g., judge bias toward length or verbosity) could allow GRPO to exploit the signal while the final human eval still passes; the skeptic concern is therefore not resolved by the existing human study.
[Experimental results and benchmark descriptions] The claims of competitiveness on five benchmarks and superior length adherence / generalization rest on reported positive outcomes, yet the manuscript supplies insufficient detail on exact baselines, statistical tests (e.g., significance of length-adherence differences), data-exclusion rules, and per-prompt length/quality metrics to allow independent assessment of whether the reported gains are robust or driven by particular prompt subsets.

minor comments (2)

[Dataset construction] Clarify the precise composition of the 1.4K prompt-story dataset (e.g., how prompts were derived from the 100 anthologies and any filtering criteria) to support reproducibility.
[Reward model description] Ensure the structured Story Quality rubric used by the LLM judge is reproduced verbatim or in sufficient detail, including scoring scale and criteria, so that the reward signal can be inspected.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Training procedure and human evaluation sections] The manuscript reports a post-training blinded human preference study (POLARIS-9B preferred to base Qwen3.5-9B, on par with 27B) but provides no direct validation—such as correlation coefficients, agreement rates, or side-by-side ratings—between the frontier LLM judge's numeric scores (the sole online reward during GRPO) and human judgments on the identical generations produced during or after training. This is load-bearing for the central claim because any systematic mismatch (e.g., judge bias toward length or verbosity) could allow GRPO to exploit the signal while the final human eval still passes; the skeptic concern is therefore not resolved by the existing human study.

Authors: We agree that a direct comparison between the LLM judge scores and human ratings on identical generations would provide stronger evidence against potential reward misalignment. The existing blinded human study validates the final model but does not include the requested correlation analysis. In revision we will add a subsection reporting agreement rates, Pearson/Spearman correlations, and side-by-side rating statistics computed on a held-out sample of post-training generations evaluated by both the judge and human raters. revision: yes
Referee: [Experimental results and benchmark descriptions] The claims of competitiveness on five benchmarks and superior length adherence / generalization rest on reported positive outcomes, yet the manuscript supplies insufficient detail on exact baselines, statistical tests (e.g., significance of length-adherence differences), data-exclusion rules, and per-prompt length/quality metrics to allow independent assessment of whether the reported gains are robust or driven by particular prompt subsets.

Authors: We accept that greater experimental transparency is required. The revised manuscript will expand the relevant sections to specify exact baseline configurations and versions, report statistical significance tests (paired t-tests or Wilcoxon signed-rank) for length-adherence and quality differences, state all data-exclusion criteria, and include per-prompt or variance metrics (with standard deviations) for key length and quality scores. Evaluation code and prompts will also be released. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training recipe with external rewards and human validation

full rationale

The paper describes an empirical GRPO training procedure on ~1.4K prompt-story pairs using an external frontier LLM judge (with structured rubric) as the online reward signal plus human-reference injection anchors. Results are evaluated on in- and out-of-distribution benchmarks plus a blinded human preference study. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The central claims rest on external benchmarks and human judgments rather than reducing to the training inputs by construction. This is the expected non-finding for a standard empirical ML methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no equations, fitted values, or explicit assumptions are visible in the provided text.

pith-pipeline@v0.9.1-grok · 5836 in / 1226 out tokens · 43678 ms · 2026-06-28T10:14:40.559505+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

56 extracted references · 18 canonical work pages

[1]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Learning to Reason under Off-Policy Guidance , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
[2]

2025 , eprint=

RePO: Replay-Enhanced Policy Optimization , author=. 2025 , eprint=

2025
[3]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Chen, Zixiang and Deng, Yihe and Yuan, Huizhuo and Ji, Kaixuan and Gu, Quanquan , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

2024
[4]

Proceedings of the 31st International Conference on Neural Information Processing Systems , pages =

Anthony, Thomas and Tian, Zheng and Barber, David , title =. Proceedings of the 31st International Conference on Neural Information Processing Systems , pages =. 2017 , isbn =

2017
[5]

2025 , eprint=

G ^2 RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance , author=. 2025 , eprint=

2025
[6]

2026 , url=

Xuechen Zhang and Zijian Huang and Yingcong Li and Chenshun Ni and Jiasi Chen and Samet Oymak , booktitle=. 2026 , url=

2026
[7]

Nature , volume =

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , author =. Nature , volume =. 2025 , doi =

2025
[8]

2026 , url=

Qiying Yu and Zheng Zhang and Ruofei Zhu and Yufeng Yuan and Xiaochen Zuo and YuYue and Weinan Dai and Tiantian Fan and Gaohong Liu and Juncai Liu and LingJun Liu and Xin Liu and Haibin Lin and Zhiqi Lin and Bole Ma and Guangming Sheng and Yuxuan Tong and Chi Zhang and Mofan Zhang and Ru Zhang and Wang Zhang and Hang Zhu and Jinhua Zhu and Jiaze Chen and ...

2026
[9]

2024 , journal =

HybridFlow: A Flexible and Efficient RLHF Framework , author =. 2024 , journal =

2024
[10]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=
[11]

RLHF deciphered: A critical analysis of reinforcement learning from human feedback for llms.ACM Comput

Chaudhari, Shreyas and Aggarwal, Pranjal and Murahari, Vishvak and Rajpurohit, Tanmay and Kalyan, Ashwin and Narasimhan, Karthik and Deshpande, Ameet and Castro da Silva, Bruno , title =. ACM Comput. Surv. , month = sep, articleno =. 2025 , issue_date =. doi:10.1145/3743127 , abstract =

work page doi:10.1145/3743127 2025
[12]

LongWriter: Unleashing 10,000+ Word Generation from Long Context

Yushi Bai and Jiajie Zhang and Xin Lv and Linzhi Zheng and Siqi Zhu and Lei Hou and Yuxiao Dong and Jie Tang and Juanzi Li , booktitle=. LongWriter: Unleashing 10,000+ Word Generation from Long Context. 2025 , url=

2025
[13]

2025 , eprint=

WritingBench: A Comprehensive Benchmark for Generative Writing , author=. 2025 , eprint=

2025
[14]

Igniting Creative Writing in Small Language Models: LLM -as-a-Judge versus Multi-Agent Refined Rewards

Wei, Xiaolong and Lu, Bo and Zhang, Xingyu and Zhao, Zhejun and Shen, Dongdong and Xia, Long and Yin, Dawei. Igniting Creative Writing in Small Language Models: LLM -as-a-Judge versus Multi-Agent Refined Rewards. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.868

work page doi:10.18653/v1/2025.emnlp-main.868 2025
[15]

SWAG : Storytelling With Action Guidance

Pei, Jonathan and Patel, Zeeshan and El-Refai, Karim and Li, Tianle. SWAG : Storytelling With Action Guidance. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.824

work page doi:10.18653/v1/2024.findings-emnlp.824 2024
[16]

A Survey on LLM s for Story Generation

Teleki, Maria and Bengali, Vedangi and Dong, Xiangjue and Janjur, Sai Tejas and Liu, Haoran and Liu, Tian and Wang, Cong and Liu, Ting and Zhang, Yin and Shipman, Frank and Caverlee, James. A Survey on LLM s for Story Generation. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.750

work page doi:10.18653/v1/2025.findings-emnlp.750 2025
[17]

The Fourteenth International Conference on Learning Representations , year=

LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning , author=. The Fourteenth International Conference on Learning Representations , year=
[18]

Proceedings of the AAAI Conference on Artificial Intelligence , author=

RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2026 , month=. doi:10.1609/aaai.v40i38.40467 , abstractNote=

work page doi:10.1609/aaai.v40i38.40467 2026
[19]

2026 , eprint=

Rewarding Creativity: A Human-Aligned Generative Reward Model for Reinforcement Learning in Storytelling , author=. 2026 , eprint=

2026
[20]

2026 , eprint=

Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization , author=. 2026 , eprint=

2026
[21]

Writing-

Xuanyu Lei and Chenliang Li and Yuning Wu and Kaiming Liu and Weizhou Shen and Peng Li and Ming Yan and Fei Huang and Ya-Qin Zhang and Yang Liu , year=. Writing-
[22]

2025 , eprint=

Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards , author=. 2025 , eprint=

2025
[23]

2026 , eprint=

DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing , author=. 2026 , eprint=

2026
[24]

2026 , eprint=

R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning , author=. 2026 , eprint=

2026
[25]

and Patel, Reet and Bosselut, Antoine and Plas, Lonneke Van Der and Beaty, Roger E

Ismayilzada, Mete and Laverghetta Jr., Antonio and Luchini, Simone A. and Patel, Reet and Bosselut, Antoine and Plas, Lonneke Van Der and Beaty, Roger E. Creative Preference Optimization. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.509

work page doi:10.18653/v1/2025.findings-emnlp.509 2025
[26]

2025 , eprint=

Diverse Preference Optimization , author=. 2025 , eprint=

2025
[27]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Yuan, Weizhe and Pang, Richard Yuanzhe and Cho, Kyunghyun and Li, Xian and Sukhbaatar, Sainbayar and Xu, Jing and Weston, Jason , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

2024
[28]

Second Conference on Language Modeling , year=

Learning to Reason for Long-Form Story Generation , author=. Second Conference on Language Modeling , year=
[29]

2024 , eprint=

EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models , author=. 2024 , eprint=

2024
[30]

2024 , eprint=

HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models , author=. 2024 , eprint=

2024
[31]

L it B ench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing

Fein, Daniel and Russo, Sebastian and Xiang, Violet and Jolly, Kabir and Rafailov, Rafael and Haber, Nick. L it B ench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing. Proceedings of the 19th Conference of the E uropean Chapter of the A ssociation for C omputational L inguistics (Volume 1: Long Papers). 2026. doi:10.18653/v1/2026.eacl...

work page doi:10.18653/v1/2026.eacl-long.362 2026
[32]

N arra B ench: A Comprehensive Framework for Narrative Benchmarking

Hamilton, Sil and Wilkens, Matthew and Piper, Andrew. N arra B ench: A Comprehensive Framework for Narrative Benchmarking. Proceedings of the 19th Conference of the E uropean Chapter of the A ssociation for C omputational L inguistics (Volume 1: Long Papers). 2026. doi:10.18653/v1/2026.eacl-long.176

work page doi:10.18653/v1/2026.eacl-long.176 2026
[33]

2024 , eprint=

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods , author=. 2024 , eprint=

2024
[34]

From generation to judgment: Opportunities and challenges of LLM-as-a-judge

Li, Dawei and Jiang, Bohan and Huang, Liangjie and Beigi, Alimohammad and Zhao, Chengshuai and Tan, Zhen and Bhattacharjee, Amrita and Jiang, Yuxuan and Chen, Canyu and Wu, Tianhao and Shu, Kai and Cheng, Lu and Liu, Huan. From Generation to Judgment: Opportunities and Challenges of LLM -as-a-judge. Proceedings of the 2025 Conference on Empirical Methods ...

work page doi:10.18653/v1/2025.emnlp-main.138 2025
[35]

The Twelfth International Conference on Learning Representations , year=

Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models , author=. The Twelfth International Conference on Learning Representations , year=
[36]

A Survey on

Jiawei Gu and Xuhui Jiang and Zhichao Shi and Hexiang Tan and Xuehao Zhai and Chengjin Xu and Wei Li and Yinghan Shen and Shengjie Ma and Honghao Liu and Saizhuo Wang and Kun Zhang and Zhouchi Lin and Bowen Zhang and Lionel Ni and Wen Gao and Yuanzhuo Wang and Jian Guo , keywords =. A survey on LLM-as-a-judge , journal =. 2026 , issn =. doi:https://doi.or...

work page doi:10.1016/j.xinn.2025.101253 2026
[37]

2024 , address =

Kim, Seungone and Suk, Juyoung and Longpre, Shayne and Lin, Bill Yuchen and Shin, Jamin and Welleck, Sean and Neubig, Graham and Lee, Moontae and Lee, Kyungjae and Seo, Minjoon. Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing...

work page doi:10.18653/v1/2024.emnlp-main.248 2024
[38]

2025 , eprint=

Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments , author=. 2025 , eprint=

2025
[39]

2026 , eprint=

Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks , author=. 2026 , eprint=

2026
[40]

Second Conference on Language Modeling , year=

A Critical Look At Tokenwise Reward-Guided Text Generation , author=. Second Conference on Language Modeling , year=
[41]

doi: 10.18653/v1/2024.findings-naacl.111

Li, Wendi and Wei, Wei and Xu, Kaihe and Xie, Wenfeng and Chen, Dangyang and Cheng, Yu. Reinforcement Learning with Token-level Feedback for Controllable Text Generation. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.111

work page doi:10.18653/v1/2024.findings-naacl.111 2024
[42]

2023 , eprint=

Let's Verify Step by Step , author=. 2023 , eprint=

2023
[43]

2025 , eprint=

Process Reinforcement through Implicit Rewards , author=. 2025 , eprint=

2025
[44]

Zhao, Yanli and Gu, Andrew and Varma, Rohan and Luo, Liang and Huang, Chien-Chin and Xu, Min and Wright, Less and Shojanazeri, Hamid and Ott, Myle and Shleifer, Sam and Desmaison, Alban and Balioglu, Can and Damania, Pritam and Nguyen, Bernard and Chauhan, Geeta and Hao, Yuchen and Mathews, Ajit and Li, Shen , title =. Proc. VLDB Endow. , month = aug, pag...

work page doi:10.14778/3611540.3611569 2023
[45]

2026 , howpublished =

2026
[46]

2026 , month = feb, howpublished =

2026
[47]

2025 , month = dec, howpublished =

2025
[48]

2026 , month = apr, day =

2026
[49]

2026 , eprint =

Ministral 3 , author =. 2026 , eprint =

2026
[50]

What characterises creativity in narrative writing, and how do we assess it? Research findings from a systematic literature search , journal =

Richard D'Souza , keywords =. What characterises creativity in narrative writing, and how do we assess it? Research findings from a systematic literature search , journal =. 2021 , issn =. doi:https://doi.org/10.1016/j.tsc.2021.100949 , url =

work page doi:10.1016/j.tsc.2021.100949 2021
[51]

Adler and Theodore E.A

Jonathan M. Adler and Theodore E.A. Waters and Justin Poh and Sophia Seitz , keywords =. The nature of narrative coherence: An empirical approach , journal =. 2018 , issn =. doi:https://doi.org/10.1016/j.jrp.2018.01.001 , url =

work page doi:10.1016/j.jrp.2018.01.001 2018
[52]

Boyd and Kate G

Ryan L. Boyd and Kate G. Blackburn and James W. Pennebaker , title =. Science Advances , volume =. 2020 , doi =. https://www.science.org/doi/pdf/10.1126/sciadv.aba2196 , abstract =

work page doi:10.1126/sciadv.aba2196 2020
[53]

Large language models are not fair evaluators

Wang, Peiyi and Li, Lei and Chen, Liang and Cai, Zefan and Zhu, Dawei and Lin, Binghuai and Cao, Yunbo and Kong, Lingpeng and Liu, Qi and Liu, Tianyu and Sui, Zhifang. Large Language Models are not Fair Evaluators. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.ac...

work page doi:10.18653/v1/2024.acl-long.511 2024
[54]

Extracting Training Data from Large Language Models , booktitle =

Nicholas Carlini and Florian Tram. Extracting Training Data from Large Language Models , booktitle =. 2021 , isbn =

2021
[55]

and Lee, Katherine and Cooper, A

Hayes, Jamie and Swanberg, Marika and Chaudhari, Harsh and Yona, Itay and Shumailov, Ilia and Nasr, Milad and Choquette-Choo, Christopher A. and Lee, Katherine and Cooper, A. Feder. Measuring memorization in language models via probabilistic extraction. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Comput...

work page doi:10.18653/v1/2025.naacl-long.469 2025
[56]

2023 , eprint=

Gemini: A Family of Highly Capable Multimodal Models , author=. 2023 , eprint=

2023

[1] [1]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Learning to Reason under Off-Policy Guidance , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

[2] [2]

2025 , eprint=

RePO: Replay-Enhanced Policy Optimization , author=. 2025 , eprint=

2025

[3] [3]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Chen, Zixiang and Deng, Yihe and Yuan, Huizhuo and Ji, Kaixuan and Gu, Quanquan , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

2024

[4] [4]

Proceedings of the 31st International Conference on Neural Information Processing Systems , pages =

Anthony, Thomas and Tian, Zheng and Barber, David , title =. Proceedings of the 31st International Conference on Neural Information Processing Systems , pages =. 2017 , isbn =

2017

[5] [5]

2025 , eprint=

G ^2 RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance , author=. 2025 , eprint=

2025

[6] [6]

2026 , url=

Xuechen Zhang and Zijian Huang and Yingcong Li and Chenshun Ni and Jiasi Chen and Samet Oymak , booktitle=. 2026 , url=

2026

[7] [7]

Nature , volume =

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , author =. Nature , volume =. 2025 , doi =

2025

[8] [8]

2026 , url=

Qiying Yu and Zheng Zhang and Ruofei Zhu and Yufeng Yuan and Xiaochen Zuo and YuYue and Weinan Dai and Tiantian Fan and Gaohong Liu and Juncai Liu and LingJun Liu and Xin Liu and Haibin Lin and Zhiqi Lin and Bole Ma and Guangming Sheng and Yuxuan Tong and Chi Zhang and Mofan Zhang and Ru Zhang and Wang Zhang and Hang Zhu and Jinhua Zhu and Jiaze Chen and ...

2026

[9] [9]

2024 , journal =

HybridFlow: A Flexible and Efficient RLHF Framework , author =. 2024 , journal =

2024

[10] [10]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

[11] [11]

RLHF deciphered: A critical analysis of reinforcement learning from human feedback for llms.ACM Comput

Chaudhari, Shreyas and Aggarwal, Pranjal and Murahari, Vishvak and Rajpurohit, Tanmay and Kalyan, Ashwin and Narasimhan, Karthik and Deshpande, Ameet and Castro da Silva, Bruno , title =. ACM Comput. Surv. , month = sep, articleno =. 2025 , issue_date =. doi:10.1145/3743127 , abstract =

work page doi:10.1145/3743127 2025

[12] [12]

LongWriter: Unleashing 10,000+ Word Generation from Long Context

Yushi Bai and Jiajie Zhang and Xin Lv and Linzhi Zheng and Siqi Zhu and Lei Hou and Yuxiao Dong and Jie Tang and Juanzi Li , booktitle=. LongWriter: Unleashing 10,000+ Word Generation from Long Context. 2025 , url=

2025

[13] [13]

2025 , eprint=

WritingBench: A Comprehensive Benchmark for Generative Writing , author=. 2025 , eprint=

2025

[14] [14]

Igniting Creative Writing in Small Language Models: LLM -as-a-Judge versus Multi-Agent Refined Rewards

Wei, Xiaolong and Lu, Bo and Zhang, Xingyu and Zhao, Zhejun and Shen, Dongdong and Xia, Long and Yin, Dawei. Igniting Creative Writing in Small Language Models: LLM -as-a-Judge versus Multi-Agent Refined Rewards. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.868

work page doi:10.18653/v1/2025.emnlp-main.868 2025

[15] [15]

SWAG : Storytelling With Action Guidance

Pei, Jonathan and Patel, Zeeshan and El-Refai, Karim and Li, Tianle. SWAG : Storytelling With Action Guidance. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.824

work page doi:10.18653/v1/2024.findings-emnlp.824 2024

[16] [16]

A Survey on LLM s for Story Generation

Teleki, Maria and Bengali, Vedangi and Dong, Xiangjue and Janjur, Sai Tejas and Liu, Haoran and Liu, Tian and Wang, Cong and Liu, Ting and Zhang, Yin and Shipman, Frank and Caverlee, James. A Survey on LLM s for Story Generation. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.750

work page doi:10.18653/v1/2025.findings-emnlp.750 2025

[17] [17]

The Fourteenth International Conference on Learning Representations , year=

LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning , author=. The Fourteenth International Conference on Learning Representations , year=

[18] [18]

Proceedings of the AAAI Conference on Artificial Intelligence , author=

RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2026 , month=. doi:10.1609/aaai.v40i38.40467 , abstractNote=

work page doi:10.1609/aaai.v40i38.40467 2026

[19] [19]

2026 , eprint=

Rewarding Creativity: A Human-Aligned Generative Reward Model for Reinforcement Learning in Storytelling , author=. 2026 , eprint=

2026

[20] [20]

2026 , eprint=

Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization , author=. 2026 , eprint=

2026

[21] [21]

Writing-

Xuanyu Lei and Chenliang Li and Yuning Wu and Kaiming Liu and Weizhou Shen and Peng Li and Ming Yan and Fei Huang and Ya-Qin Zhang and Yang Liu , year=. Writing-

[22] [22]

2025 , eprint=

Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards , author=. 2025 , eprint=

2025

[23] [23]

2026 , eprint=

DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing , author=. 2026 , eprint=

2026

[24] [24]

2026 , eprint=

R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning , author=. 2026 , eprint=

2026

[25] [25]

and Patel, Reet and Bosselut, Antoine and Plas, Lonneke Van Der and Beaty, Roger E

Ismayilzada, Mete and Laverghetta Jr., Antonio and Luchini, Simone A. and Patel, Reet and Bosselut, Antoine and Plas, Lonneke Van Der and Beaty, Roger E. Creative Preference Optimization. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.509

work page doi:10.18653/v1/2025.findings-emnlp.509 2025

[26] [26]

2025 , eprint=

Diverse Preference Optimization , author=. 2025 , eprint=

2025

[27] [27]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Yuan, Weizhe and Pang, Richard Yuanzhe and Cho, Kyunghyun and Li, Xian and Sukhbaatar, Sainbayar and Xu, Jing and Weston, Jason , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

2024

[28] [28]

Second Conference on Language Modeling , year=

Learning to Reason for Long-Form Story Generation , author=. Second Conference on Language Modeling , year=

[29] [29]

2024 , eprint=

EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models , author=. 2024 , eprint=

2024

[30] [30]

2024 , eprint=

HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models , author=. 2024 , eprint=

2024

[31] [31]

L it B ench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing

Fein, Daniel and Russo, Sebastian and Xiang, Violet and Jolly, Kabir and Rafailov, Rafael and Haber, Nick. L it B ench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing. Proceedings of the 19th Conference of the E uropean Chapter of the A ssociation for C omputational L inguistics (Volume 1: Long Papers). 2026. doi:10.18653/v1/2026.eacl...

work page doi:10.18653/v1/2026.eacl-long.362 2026

[32] [32]

N arra B ench: A Comprehensive Framework for Narrative Benchmarking

Hamilton, Sil and Wilkens, Matthew and Piper, Andrew. N arra B ench: A Comprehensive Framework for Narrative Benchmarking. Proceedings of the 19th Conference of the E uropean Chapter of the A ssociation for C omputational L inguistics (Volume 1: Long Papers). 2026. doi:10.18653/v1/2026.eacl-long.176

work page doi:10.18653/v1/2026.eacl-long.176 2026

[33] [33]

2024 , eprint=

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods , author=. 2024 , eprint=

2024

[34] [34]

From generation to judgment: Opportunities and challenges of LLM-as-a-judge

Li, Dawei and Jiang, Bohan and Huang, Liangjie and Beigi, Alimohammad and Zhao, Chengshuai and Tan, Zhen and Bhattacharjee, Amrita and Jiang, Yuxuan and Chen, Canyu and Wu, Tianhao and Shu, Kai and Cheng, Lu and Liu, Huan. From Generation to Judgment: Opportunities and Challenges of LLM -as-a-judge. Proceedings of the 2025 Conference on Empirical Methods ...

work page doi:10.18653/v1/2025.emnlp-main.138 2025

[35] [35]

The Twelfth International Conference on Learning Representations , year=

Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models , author=. The Twelfth International Conference on Learning Representations , year=

[36] [36]

A Survey on

Jiawei Gu and Xuhui Jiang and Zhichao Shi and Hexiang Tan and Xuehao Zhai and Chengjin Xu and Wei Li and Yinghan Shen and Shengjie Ma and Honghao Liu and Saizhuo Wang and Kun Zhang and Zhouchi Lin and Bowen Zhang and Lionel Ni and Wen Gao and Yuanzhuo Wang and Jian Guo , keywords =. A survey on LLM-as-a-judge , journal =. 2026 , issn =. doi:https://doi.or...

work page doi:10.1016/j.xinn.2025.101253 2026

[37] [37]

2024 , address =

Kim, Seungone and Suk, Juyoung and Longpre, Shayne and Lin, Bill Yuchen and Shin, Jamin and Welleck, Sean and Neubig, Graham and Lee, Moontae and Lee, Kyungjae and Seo, Minjoon. Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing...

work page doi:10.18653/v1/2024.emnlp-main.248 2024

[38] [38]

2025 , eprint=

Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments , author=. 2025 , eprint=

2025

[39] [39]

2026 , eprint=

Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks , author=. 2026 , eprint=

2026

[40] [40]

Second Conference on Language Modeling , year=

A Critical Look At Tokenwise Reward-Guided Text Generation , author=. Second Conference on Language Modeling , year=

[41] [41]

doi: 10.18653/v1/2024.findings-naacl.111

Li, Wendi and Wei, Wei and Xu, Kaihe and Xie, Wenfeng and Chen, Dangyang and Cheng, Yu. Reinforcement Learning with Token-level Feedback for Controllable Text Generation. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.111

work page doi:10.18653/v1/2024.findings-naacl.111 2024

[42] [42]

2023 , eprint=

Let's Verify Step by Step , author=. 2023 , eprint=

2023

[43] [43]

2025 , eprint=

Process Reinforcement through Implicit Rewards , author=. 2025 , eprint=

2025

[44] [44]

Zhao, Yanli and Gu, Andrew and Varma, Rohan and Luo, Liang and Huang, Chien-Chin and Xu, Min and Wright, Less and Shojanazeri, Hamid and Ott, Myle and Shleifer, Sam and Desmaison, Alban and Balioglu, Can and Damania, Pritam and Nguyen, Bernard and Chauhan, Geeta and Hao, Yuchen and Mathews, Ajit and Li, Shen , title =. Proc. VLDB Endow. , month = aug, pag...

work page doi:10.14778/3611540.3611569 2023

[45] [45]

2026 , howpublished =

2026

[46] [46]

2026 , month = feb, howpublished =

2026

[47] [47]

2025 , month = dec, howpublished =

2025

[48] [48]

2026 , month = apr, day =

2026

[49] [49]

2026 , eprint =

Ministral 3 , author =. 2026 , eprint =

2026

[50] [50]

What characterises creativity in narrative writing, and how do we assess it? Research findings from a systematic literature search , journal =

Richard D'Souza , keywords =. What characterises creativity in narrative writing, and how do we assess it? Research findings from a systematic literature search , journal =. 2021 , issn =. doi:https://doi.org/10.1016/j.tsc.2021.100949 , url =

work page doi:10.1016/j.tsc.2021.100949 2021

[51] [51]

Adler and Theodore E.A

Jonathan M. Adler and Theodore E.A. Waters and Justin Poh and Sophia Seitz , keywords =. The nature of narrative coherence: An empirical approach , journal =. 2018 , issn =. doi:https://doi.org/10.1016/j.jrp.2018.01.001 , url =

work page doi:10.1016/j.jrp.2018.01.001 2018

[52] [52]

Boyd and Kate G

Ryan L. Boyd and Kate G. Blackburn and James W. Pennebaker , title =. Science Advances , volume =. 2020 , doi =. https://www.science.org/doi/pdf/10.1126/sciadv.aba2196 , abstract =

work page doi:10.1126/sciadv.aba2196 2020

[53] [53]

Large language models are not fair evaluators

Wang, Peiyi and Li, Lei and Chen, Liang and Cai, Zefan and Zhu, Dawei and Lin, Binghuai and Cao, Yunbo and Kong, Lingpeng and Liu, Qi and Liu, Tianyu and Sui, Zhifang. Large Language Models are not Fair Evaluators. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.ac...

work page doi:10.18653/v1/2024.acl-long.511 2024

[54] [54]

Extracting Training Data from Large Language Models , booktitle =

Nicholas Carlini and Florian Tram. Extracting Training Data from Large Language Models , booktitle =. 2021 , isbn =

2021

[55] [55]

and Lee, Katherine and Cooper, A

Hayes, Jamie and Swanberg, Marika and Chaudhari, Harsh and Yona, Itay and Shumailov, Ilia and Nasr, Milad and Choquette-Choo, Christopher A. and Lee, Katherine and Cooper, A. Feder. Measuring memorization in language models via probabilistic extraction. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Comput...

work page doi:10.18653/v1/2025.naacl-long.469 2025

[56] [56]

2023 , eprint=

Gemini: A Family of Highly Capable Multimodal Models , author=. 2023 , eprint=

2023