WritingBench: A Comprehensive Benchmark for Generative Writing
15 Pith papers cite this work.
citing papers explorer
- Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers
  RGSD distills rubric-conditioned teacher distributions into base policies token-by-token, matching GRPO rubric satisfaction on Qwen models with one rollout and zero verifier calls.
- Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
  RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-training while staying competitive on reasoning and coding.
- LitVISTA: A Benchmark for Narrative Orchestration in Literary Text
  The LitVISTA benchmark shows that frontier LLMs fail to jointly capture narrative function and structure in literary texts, with errors dominated by anchor identification.
- LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation
  A mutual evaluation system for LLMs that uses game-theoretic aggregation of peer reviews and validates alignment with human voting on subjective outputs.
- A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL
  Later-domain RL training harms earlier domains via second-order damage concentrated in a low-dimensional shared conflict subspace; brief domain refresh contracts this component to enable selective recovery.
- Reward Hacking in Rubric-Based Reinforcement Learning
  Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.
- SEIF: Self-Evolving Reinforcement Learning for Instruction Following
  SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
- LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
  LongWriter-Zero applies RL from a base model with specialized rewards for length, quality, and structure to outperform SFT baselines and larger models on long-writing benchmarks.
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
  DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
- From Personas to Plot: Character-Grounded Multi-Agent Story Generation for Long-Form Narratives
  MAGNET multi-agent generation with persona grounding and ATLAS graph verification yields 34-50% fewer hallucinations and annotations than single-model or IBSEN baselines at 100-page scale.
- Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
  EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.
- Mind DeepResearch Technical Report
  MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
- Qwen3 Technical Report
- HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
- SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding