RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
Onereward: Unified mask-guided image generation via multi-task human preference learning
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 7roles
background 3polarities
background 3representative citing papers
DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.
Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.
DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.
UniWorld-V2 applies policy optimization via DiffusionNFT and MLLM logit feedback with group filtering to reach state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench while remaining model-agnostic.
Dynamic-TreeRPO replaces independent trajectory sampling with a tree-structured search using dynamic noise intensities and integrates SFT into RL via a weighted Progress Reward Model to achieve better semantic consistency and efficiency in text-to-image generation.
Edit-GRPO decouples editing and preservation objectives via region-specific signals in a policy optimization framework to improve locality in image editing tasks.
citing papers explorer
-
RewardHarness: Self-Evolving Agentic Post-Training
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
-
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.
-
Leveraging Verifier-Based Reinforcement Learning in Image Editing
Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.
-
DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing
DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.
-
Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
UniWorld-V2 applies policy optimization via DiffusionNFT and MLLM logit feedback with group filtering to reach state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench while remaining model-agnostic.
-
Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling
Dynamic-TreeRPO replaces independent trajectory sampling with a tree-structured search using dynamic noise intensities and integrates SFT into RL via a weighted Progress Reward Model to achieve better semantic consistency and efficiency in text-to-image generation.
-
Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing
Edit-GRPO decouples editing and preservation objectives via region-specific signals in a policy optimization framework to improve locality in image editing tasks.