Step-by-step preference optimization (SPO) [82] introduces step-level evaluation and adjustment, ensuring that preference signals are accurately propagated at each denoising stage.

models the denoising procedure as a multi-step Markov decision process (MDP), demonstrating that directly updating the policy based on human preferences within this MDP is equiv

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
