Mixed citations

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al · 2022

Mixed citation behavior. Most common role is background (60%).

8 Pith papers citing it

Background 60% of classified citations

browse 8 citing papers

citation-role summary

background 3 method 2

citation-polarity summary

background 3 use method 2

representative citing papers

K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

cs.CL · 2026-05-10 · conditional · novelty 7.0

K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.

Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

cs.CV · 2026-05-05 · unverdicted · novelty 7.0

Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution

cs.SE · 2026-02-27 · unverdicted · novelty 7.0

IFCodeEvolve synthesizes coding data via actor-schema co-evolution with MCTS, boosting a 32B model's performance to match proprietary SOTA on instruction following.

MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents

cs.CR · 2026-05-10 · unverdicted · novelty 6.0 · 3 refs

MemPrivacy uses edge-side privacy span detection and semantic placeholders to enable cloud memory management for LLM agents while limiting utility loss to 1.6% and outperforming masking baselines.

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

cs.CV · 2025-10-02 · conditional · novelty 6.0

Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.

DanceGRPO: Unleashing GRPO on Visual Generation

cs.CV · 2025-05-12 · unverdicted · novelty 6.0

DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

cs.AI · 2025-04-07 · unverdicted · novelty 6.0

VAPO achieves 60.4 on AIME 2024 with Qwen 32B, outperforming prior methods by over 10 points through targeted fixes for value bias, sequence length variation, and sparse rewards.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation cs.CV · 2025-10-02 · conditional · none · ref 45
Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.
DanceGRPO: Unleashing GRPO on Visual Generation cs.CV · 2025-05-12 · unverdicted · none · ref 48
DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks cs.AI · 2025-04-07 · unverdicted · none · ref 17
VAPO achieves 60.4 on AIME 2024 with Qwen 32B, outperforming prior methods by over 10 points through targeted fixes for value bias, sequence length variation, and sparse rewards.

Training language models to follow instructions with human feedback

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer