hub

Writingbench: A comprehensive benchmark for generative writing

Yuning Wu et al · 2025 · arXiv 2503.05244

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 2 baseline 1

citation-polarity summary

use dataset 2 baseline 1

representative citing papers

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

cs.LG · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-training while staying competitive on reasoning and coding.

LitVISTA: A Benchmark for Narrative Orchestration in Literary Text

cs.CL · 2026-01-10 · unverdicted · novelty 7.0

LitVISTA benchmark shows frontier LLMs fail to jointly capture narrative function and structure in literary texts, with errors dominated by anchor identification.

LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation

cs.CL · 2025-10-17 · unverdicted · novelty 7.0

A mutual evaluation system for LLMs that uses game-theoretic aggregation of peer reviews and validates alignment with human voting on subjective outputs.

Reward Hacking in Rubric-Based Reinforcement Learning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.

SEIF: Self-Evolving Reinforcement Learning for Instruction Following

cs.CL · 2026-05-08 · conditional · novelty 6.0

SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

HarDBench demonstrates that current LLMs are highly susceptible to draft-based jailbreak attacks for harmful content in co-authoring scenarios, and a safety-utility balanced alignment via preference optimization significantly reduces such outputs without harming benign performance.

LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning

cs.CL · 2025-06-23 · unverdicted · novelty 6.0

LongWriter-Zero applies RL from a base model with specialized rewards for length, quality, and structure to outperform SFT baselines and larger models on long-writing benchmarks.

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

cs.CL · 2025-06-13 · conditional · novelty 6.0

DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

cs.CL · 2026-05-08 · conditional · novelty 5.0 · 2 refs

EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.

Mind DeepResearch Technical Report

cs.AI · 2026-04-16 · unverdicted · novelty 5.0

MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

Qwen3 Technical Report

cs.CL · 2025-05-14 · unverdicted · novelty 5.0

Pith review generated a malformed one-line summary.

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

cs.DC · 2026-02-10

citing papers explorer

Showing 12 of 12 citing papers.

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization cs.LG · 2026-05-13 · unverdicted · none · ref 12 · 2 links
RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-training while staying competitive on reasoning and coding.
LitVISTA: A Benchmark for Narrative Orchestration in Literary Text cs.CL · 2026-01-10 · unverdicted · none · ref 3
LitVISTA benchmark shows frontier LLMs fail to jointly capture narrative function and structure in literary texts, with errors dominated by anchor identification.
LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation cs.CL · 2025-10-17 · unverdicted · none · ref 3
A mutual evaluation system for LLMs that uses game-theoretic aggregation of peer reviews and validates alignment with human voting on subjective outputs.
Reward Hacking in Rubric-Based Reinforcement Learning cs.AI · 2026-05-12 · unverdicted · none · ref 34
Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.
SEIF: Self-Evolving Reinforcement Learning for Instruction Following cs.CL · 2026-05-08 · conditional · none · ref 39
SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing cs.CL · 2026-04-21 · unverdicted · none · ref 5
HarDBench demonstrates that current LLMs are highly susceptible to draft-based jailbreak attacks for harmful content in co-authoring scenarios, and a safety-utility balanced alignment via preference optimization significantly reduces such outputs without harming benign performance.
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning cs.CL · 2025-06-23 · unverdicted · none · ref 41
LongWriter-Zero applies RL from a base model with specialized rewards for length, quality, and structure to outperform SFT baselines and larger models on long-writing benchmarks.
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents cs.CL · 2025-06-13 · conditional · none · ref 31
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs cs.CL · 2026-05-08 · conditional · none · ref 79 · 2 links
EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.
Mind DeepResearch Technical Report cs.AI · 2026-04-16 · unverdicted · none · ref 44
MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
Qwen3 Technical Report cs.CL · 2025-05-14 · unverdicted · none · ref 37
Pith review generated a malformed one-line summary.
SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding cs.DC · 2026-02-10 · unreviewed · ref 49

Writingbench: A comprehensive benchmark for generative writing

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer