WritingBench: A Comprehensive Benchmark for Generative Writing

Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue Wu, Qin Jin, Fei Huang · 2025 · arXiv 2503.05244

15 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

dataset: 2 · baseline: 1

citation-polarity summary

use dataset: 2 · baseline: 1

representative citing papers

Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

RGSD distills rubric-conditioned teacher distributions into base policies token-by-token, matching GRPO rubric satisfaction on Qwen models with one rollout and zero verifier calls.

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

cs.LG · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-training while staying competitive on reasoning and coding.

LitVISTA: A Benchmark for Narrative Orchestration in Literary Text

cs.CL · 2026-01-10 · unverdicted · novelty 7.0

LitVISTA benchmark shows frontier LLMs fail to jointly capture narrative function and structure in literary texts, with errors dominated by anchor identification.

LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation

cs.CL · 2025-10-17 · unverdicted · novelty 7.0

A mutual evaluation system for LLMs that uses game-theoretic aggregation of peer reviews and validates alignment with human voting on subjective outputs.

A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL

cs.LG · 2026-06-01 · unverdicted · novelty 6.0

Later-domain RL training harms earlier domains via second-order damage concentrated in a low-dimensional shared conflict subspace; brief domain refresh contracts this component to enable selective recovery.

Reward Hacking in Rubric-Based Reinforcement Learning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.

SEIF: Self-Evolving Reinforcement Learning for Instruction Following

cs.CL · 2026-05-08 · conditional · novelty 6.0

SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.

LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning

cs.CL · 2025-06-23 · unverdicted · novelty 6.0

LongWriter-Zero applies RL from a base model with specialized rewards for length, quality, and structure to outperform SFT baselines and larger models on long-writing benchmarks.

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

cs.CL · 2025-06-13 · conditional · novelty 6.0

DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

From Personas to Plot: Character-Grounded Multi-Agent Story Generation for Long-Form Narratives

cs.CL · 2026-07-01 · unverdicted · novelty 5.0

MAGNET multi-agent generation with persona grounding and ATLAS graph verification yields 34-50% fewer hallucinations and annotations than single-model or IBSEN baselines at 100-page scale.

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

cs.CL · 2026-05-08 · conditional · novelty 5.0 · 2 refs

EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.

Mind DeepResearch Technical Report

cs.AI · 2026-04-16 · unverdicted · novelty 5.0

MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

Qwen3 Technical Report

cs.CL · 2025-05-14 · unverdicted · novelty 5.0

Pith review generated a malformed one-line summary.

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

cs.CL · 2026-04-21

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

cs.DC · 2026-02-10

citing papers explorer


Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers
cs.LG · 2026-06-10 · unverdicted · none · ref 27

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
cs.LG · 2026-05-13 · unverdicted · none · ref 12 · 2 links

LitVISTA: A Benchmark for Narrative Orchestration in Literary Text
cs.CL · 2026-01-10 · unverdicted · none · ref 3

LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation
cs.CL · 2025-10-17 · unverdicted · none · ref 3

A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL
cs.LG · 2026-06-01 · unverdicted · none · ref 37

Reward Hacking in Rubric-Based Reinforcement Learning
cs.AI · 2026-05-12 · unverdicted · none · ref 34

SEIF: Self-Evolving Reinforcement Learning for Instruction Following
cs.CL · 2026-05-08 · conditional · none · ref 39

LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
cs.CL · 2025-06-23 · unverdicted · none · ref 41

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
cs.CL · 2025-06-13 · conditional · none · ref 31

From Personas to Plot: Character-Grounded Multi-Agent Story Generation for Long-Form Narratives
cs.CL · 2026-07-01 · unverdicted · none · ref 50

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
cs.CL · 2026-05-08 · conditional · none · ref 79 · 2 links

Mind DeepResearch Technical Report
cs.AI · 2026-04-16 · unverdicted · none · ref 44

Qwen3 Technical Report
cs.CL · 2025-05-14 · unverdicted · none · ref 37

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
cs.CL · 2026-04-21 · unreviewed · ref 5

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding
cs.DC · 2026-02-10 · unreviewed · ref 49
