Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Pith reviewed 2026-05-21 17:55 UTC · model grok-4.3
The pith
A post-training framework for instruction-based image editing uses diffusion negative-aware finetuning and MLLM logit feedback to reach state-of-the-art benchmark scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim has three parts. Diffusion Negative-aware Finetuning (DiffusionNFT) provides a likelihood-free policy optimization method for diffusion-based editing that stays consistent with the flow matching forward process. An MLLM supplies reliable implicit feedback through its output logits. Low-variance group filtering reduces noise in the MLLM scores. Together these yield a model-agnostic post-training recipe that lifts editing performance beyond supervised fine-tuning alone.
What carries the argument
Edit-R1 post-training framework built on Diffusion Negative-aware Finetuning (DiffusionNFT) for policy optimization and an MLLM used as a unified training-free reward model via output logits with low-variance group filtering.
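To make the reward mechanism concrete, here is a minimal Python sketch of a logit-based score: the MLLM's logits over a set of score tokens are turned into a probability distribution, and the reward is the expected score under it. The 1 to 5 scale, the function names, and the restriction of the softmax to score tokens are assumptions for illustration; the paper's actual prompt and scale may differ.

```python
import math

def expected_score(score_logits):
    """Expected score under a softmax over score-token logits.

    score_logits maps an integer score (hypothetical 1..5 scale) to the
    raw logit the MLLM assigns to that score's token.
    """
    top = max(score_logits.values())  # subtract the max for numerical stability
    weights = {s: math.exp(l - top) for s, l in score_logits.items()}
    total = sum(weights.values())
    return sum(s * w for s, w in weights.items()) / total

def normalized_reward(score_logits):
    """Rescale the expected score to [0, 1] using the scale's end points."""
    lo, hi = min(score_logits), max(score_logits)
    return (expected_score(score_logits) - lo) / (hi - lo)

if __name__ == "__main__":
    # Made-up logits, not taken from any model or from the paper.
    example = {1: -5.0, 2: -5.0, 3: 0.0, 4: 2.0, 5: 4.0}
    print(expected_score(example), normalized_reward(example))
```

Compared with taking only the most likely score token, the expectation changes smoothly as probability mass shifts between adjacent scores, which is one way to obtain a continuous signal from a discrete rating prompt.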
If this is right
- The same framework produces substantial gains when applied to different base models including Qwen-Image-Edit and FLUX-Kontext.
- Training can use higher-order samplers because the optimization remains consistent with the flow matching forward process.
- A single MLLM serves as reward model for many different editing instructions without task-specific training.
- The approach reduces overfitting to annotated patterns and improves generalization outside the training distribution.
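The point about samplers rests on the forward process being cheap to apply to any finished sample. As a minimal sketch, assuming the common linear (rectified-flow style) convention x_t = (1 - t) x_0 + t eps with velocity target eps - x_0, a training pair can be built from a clean sample x_0 without replaying the trajectory that produced it. The convention and the names `forward_sample` and `rng` are assumptions; this is not the DiffusionNFT loss itself.

```python
import random

def forward_sample(x0, t, rng):
    """Draw (x_t, velocity target, noise) for a clean sample x0 at time t in [0, 1].

    Assumes the linear interpolation x_t = (1 - t) * x0 + t * eps and the
    velocity target v = eps - x0 (one common flow matching convention).
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]
    v = [e - a for a, e in zip(x0, eps)]
    return xt, v, eps

if __name__ == "__main__":
    rng = random.Random(0)
    xt, v, eps = forward_sample([0.5, -1.0, 2.0], t=0.3, rng=rng)
    print(xt, v)
```

Because only x_0 enters this step, the solver used to generate x_0 does not appear in it, which is one reading of why higher-order samplers remain usable.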
Where Pith is reading between the lines
- Similar post-training could lower the amount of human-annotated editing pairs needed to reach high performance.
- The logit-based feedback and group filtering technique might transfer to other diffusion or flow-based generative tasks.
- Low-variance filtering of noisy LLM signals could stabilize reinforcement learning loops in additional multimodal settings.
Load-bearing premise
The multimodal large language model supplies reliable, unbiased fine-grained feedback on editing quality through its output logits across varied instructions.
What would settle it
Applying the full Edit-R1 procedure to a base model such as FLUX-Kontext and measuring no improvement or a drop relative to standard supervised fine-tuning on the ImgEdit benchmark would falsify the claimed gains.
Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available to support further research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Edit-R1, a post-training framework for instruction-based image editing. It adopts Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method aligned with the flow matching forward process, and uses a Multimodal Large Language Model (MLLM) as a training-free reward model via its output logits, augmented by a low-variance group filtering mechanism to reduce scoring noise. The resulting UniWorld-V2 model is reported to achieve state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench, with the framework shown to be model-agnostic and to deliver gains on base models including Qwen-Image-Edit and FLUX-Kontext. Code and models are released publicly.
Significance. If the performance claims and underlying assumptions hold after validation, the work would represent a meaningful contribution to post-training of diffusion-based image editors by enabling exploration beyond supervised fine-tuning distributions through policy optimization. The model-agnostic design and public code release are strengths that support broader applicability and reproducibility in the computer vision community.
major comments (3)
- §4 (Experiments): The SOTA claims rest on benchmark scores of 4.49 and 7.83 without reported error bars, multiple random seeds, or statistical significance tests against baselines; this makes it impossible to determine whether the gains from DiffusionNFT and MLLM-driven optimization are robust or could be explained by variance in evaluation.
- §3.2 (MLLM Reward Model): The central assumption that MLLM output logits provide fine-grained, unbiased feedback correlating with editing success across diverse instructions lacks supporting validation such as correlation with human judgments or ablation on logit calibration; without this, the reward signal's reliability for policy optimization remains unverified and could bias the training trajectory.
- §3.3 (Low-variance Group Filtering): The filtering mechanism is claimed to reduce noise while preserving the optimization trajectory, yet no analysis is provided on whether it systematically excludes higher-variance (potentially harder or more diverse) edits, which would alter the effective data distribution and risk inflating benchmark scores without true generalization improvement.
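The first comment asks for error bars and a significance test. A minimal sketch of what such reporting could look like, using only the Python standard library: per-seed scores are summarized as mean and sample standard deviation, and a paired t statistic is computed on per-seed differences against a baseline. The function names and the pairing by seed are assumptions, and the numbers in the demo are made up rather than taken from the paper.

```python
import math
import statistics

def mean_and_std(scores):
    """Sample mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(model_scores, baseline_scores):
    """t statistic for paired differences (model minus baseline), one pair per seed."""
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    mean_d, std_d = mean_and_std(diffs)
    if std_d == 0.0:
        # All differences identical: the statistic is unbounded (or zero if no gain).
        return math.inf if mean_d > 0 else (-math.inf if mean_d < 0 else 0.0)
    return mean_d / (std_d / math.sqrt(len(diffs)))

if __name__ == "__main__":
    # Placeholder values for three seeds; NOT results from the paper.
    model = [2.0, 3.0, 5.0]
    baseline = [1.0, 1.0, 2.0]
    print(mean_and_std(model), paired_t_statistic(model, baseline))
```

With only three seeds the t statistic has two degrees of freedom, so even a large value should be read cautiously.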
minor comments (2)
- [Abstract] The abstract states 'substantial performance gains' on base models but provides no quantitative deltas; adding these numbers would improve precision.
- [§3.1] Notation for the DiffusionNFT objective could benefit from an explicit equation reference when first introduced to aid readers unfamiliar with flow-matching consistency.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of experimental rigor and validation that we will address in the revision. Below we respond point by point to the major comments.
-
Referee: §4 (Experiments): The SOTA claims rest on benchmark scores of 4.49 and 7.83 without reported error bars, multiple random seeds, or statistical significance tests against baselines; this makes it impossible to determine whether the gains from DiffusionNFT and MLLM-driven optimization are robust or could be explained by variance in evaluation.
Authors: We agree that reporting variability and statistical significance would strengthen the claims. The reported scores reflect the best single-run results obtained during development, but we have since performed additional training runs with three different random seeds for the key configurations. In the revised manuscript we will report mean and standard deviation for the main benchmarks and include paired statistical tests against the strongest baselines. revision: yes
-
Referee: §3.2 (MLLM Reward Model): The central assumption that MLLM output logits provide fine-grained, unbiased feedback correlating with editing success across diverse instructions lacks supporting validation such as correlation with human judgments or ablation on logit calibration; without this, the reward signal's reliability for policy optimization remains unverified and could bias the training trajectory.
Authors: The use of MLLM logits is motivated by their ability to provide instruction-aware, continuous signals without additional training. While the original submission did not contain an explicit human correlation study, we will add a targeted validation: we sample a held-out set of 200 edits, collect human preference ratings, and report Spearman correlation between MLLM logit scores and human judgments. We will also include an ablation comparing raw logits versus calibrated or layer-specific variants. revision: yes
-
Referee: §3.3 (Low-variance Group Filtering): The filtering mechanism is claimed to reduce noise while preserving the optimization trajectory, yet no analysis is provided on whether it systematically excludes higher-variance (potentially harder or more diverse) edits, which would alter the effective data distribution and risk inflating benchmark scores without true generalization improvement.
Authors: The group filtering selects batches with low intra-group score variance to stabilize the policy gradient estimate. To examine possible distributional shift, we will add an analysis in the revision that compares instruction complexity, image diversity metrics, and edit difficulty proxies before and after filtering. Any observed bias will be quantified and discussed, together with an ablation that relaxes the variance threshold. revision: yes
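The human-correlation check promised in the second response can be sketched in a few lines. The block below is a dependency-free Spearman rank correlation with average ranks for ties; the function names are illustrative, and the inputs would be the MLLM logit scores and the human ratings for the same held-out edits. It is a sketch of the statistic, not the authors' evaluation code.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Rank correlation is a natural fit here because human ratings and logit-derived scores live on different scales; only their ordering of the edits needs to agree.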
Circularity Check
No circularity: empirical SOTA claims rest on external benchmarks
The paper proposes an empirical post-training framework (DiffusionNFT policy optimization with MLLM logit rewards and low-variance filtering) and reports direct performance numbers on independent external benchmarks (ImgEdit 4.49, GEdit-Bench 7.83). No equation or derivation reduces these scores to a fitted parameter, self-referential quantity, or self-citation chain by construction. The central claims are model-agnostic gains demonstrated via standard training and evaluation, with no load-bearing step that collapses to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- group size and variance threshold for filtering
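To show where these two free parameters enter, here is a minimal sketch of group-level statistics on reward scores. Each group holds the rewards of several edits sampled for one instruction; the group size is the list length and `std_threshold` is the variance cut-off. The function names are hypothetical, and the sketch only partitions groups by spread: which side is kept for training is left to the caller, since this page does not pin down the paper's exact rule.

```python
import statistics

def group_stats(rewards):
    """Mean and population standard deviation of one group's rewards."""
    return statistics.mean(rewards), statistics.pstdev(rewards)

def partition_groups(groups, std_threshold):
    """Split reward groups into (low_std, high_std) lists by the threshold."""
    low, high = [], []
    for rewards in groups:
        _, std = group_stats(rewards)
        (low if std < std_threshold else high).append(rewards)
    return low, high

def normalized_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean, std = group_stats(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

if __name__ == "__main__":
    # Made-up reward groups of size 4; not data from the paper.
    groups = [[0.90, 0.91, 0.90, 0.91], [0.2, 0.8, 0.5, 0.9]]
    low, high = partition_groups(groups, std_threshold=0.05)
    print(low, high, normalized_advantages(groups[0]))
```

The first demo group illustrates why the threshold matters: dividing by a very small standard deviation turns score differences of 0.01 into advantages of roughly plus or minus one, so scoring noise in tightly clustered groups is amplified.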
axioms (1)
- domain assumption: MLLM output logits provide fine-grained, training-free feedback that correlates with editing quality across diverse instructions
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
We utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process... leveraging its output logits to provide fine-grained feedback... low-variance group filtering mechanism
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
-
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
-
Inline Critic Steers Image Editing
Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.
-
RewardHarness: Self-Evolving Agentic Post-Training
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
-
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
-
DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model
DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in ...
-
Setting the Stage: Text-Driven Scene-Consistent Image Generation
A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.
-
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.
-
Semantic Generative Tuning for Unified Multimodal Models
Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.
-
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
-
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
-
SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.
-
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
Scone unifies subject understanding and generation in a two-stage trained model to improve both composition and distinction in multi-subject image generation, outperforming prior open-source models on new benchmarks.
-
Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing
Edit-GRPO decouples editing and preservation objectives via region-specific signals in a policy optimization framework to improve locality in image editing tasks.
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
-
Towards General Preference Alignment: Diffusion Models at Nash Equilibrium
Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.
-
SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing
SmartPhotoCrafter performs automatic photographic image editing by coupling an Image Critic module that identifies deficiencies with a Photographic Artist module that generates edits, trained via multi-stage pretraini...
-
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...
-
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
-
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
Reference graph
Works this paper leans on
-
[1]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301,
-
[2]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683,
-
[3]
Honestllm: Toward an honest and helpful large language model
Chujie Gao, Siyuan Wu, Yue Huang, Dongping Chen, Qihui Zhang, Zhengyan Fu, Yao Wan, Lichao Sun, and Xiangliang Zhang. Honestllm: Toward an honest and helpful large language model. arXiv preprint arXiv:2406.00380,
-
[4]
Yuan Gong, Xionghui Wang, Jie Wu, Shiyin Wang, Yitong Wang, and Xinglong Wu. Onereward: Unified mask-guided image generation via multi-task human preference learning. arXiv preprint arXiv:2508.21066,
-
[5]
Prompt-to-Prompt Image Editing with Cross Attention Control
Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626,
-
[6]
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025a. Yi Huang, Jiancheng Huang, Yifan Liu, Mingfu Yan, J...
-
[7]
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
URL https://arxiv.org/abs/2506.15742. Yi-Chen Li, Tian Xu, Yang Yu, Xuqin Zhang, Xiong-Hui Chen, Zhongxiang Ling, Ningjing Chao, Lei Yuan, and Zhi-Hua Zhou. Generalist reward models: Found inside large language models. arXiv preprint arXiv:2506.23235,
-
[8]
Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748,
-
[9]
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147,
-
[10]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747,
-
[11]
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025a. Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A pract...
-
[12]
Zheyuan Liu, Munan Ning, Qihui Zhang, Shuo Yang, Zhongrui Wang, Yiwei Yang, Xianzhe Xu, Yibing Song, Weihua Chen, Fan Wang, et al. Cot-lized diffusion: Let’s reinforce t2i generation step-by-step. arXiv preprint arXiv:2507.04451, 2025c. Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver...
-
[13]
Editscore: Unlocking online rl for image editing via high-fidelity reward modeling
Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, et al. Editscore: Unlocking online rl for image editing via high-fidelity reward modeling. arXiv preprint arXiv:2509.23909,
-
[14]
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073,
-
[15]
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265,
-
[16]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,
-
[17]
Score-Based Generative Modeling through Stochastic Differential Equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456,
-
[18]
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751, 2025a. Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multi...
-
[19]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025a. Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced ...
-
[20]
Rewarddance: Reward scaling in visual generation.arXiv preprint arXiv:2509.08826, 2025c
Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, et al. Rewarddance: Reward scaling in visual generation. arXiv preprint arXiv:2509.08826, 2025c. Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, and Wenhu Chen. Editreward: A human-aligned reward model for instruction-guided image editing. arXiv pr...
-
[21]
Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059,
-
[22]
Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, and Zhi-Ming Ma. Advantage weighted matching: Aligning rl with pretraining in diffusion models. arXiv preprint arXiv:2509.25050, 2025a. Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual gener...
-
[23]
Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, et al. Asft: Anchoring safety during llm fine-tuning within narrow safety basin. arXiv preprint arXiv:2506.08473,
-
[24]
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721,
-
[25]
Yang Ye, Tianyu He, Shuo Yang, and Jiang Bian. Reinforcement learning with inverse rewards for world model post-training. arXiv preprint arXiv:2509.23958, 2025a. Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025b. Qifan Y...
-
[26]
Shitian Zhao, Qilong Wu, Xinyue Li, Bo Zhang, Ming Li, Qi Qin, Dongyang Liu, Kaipeng Zhang, Hongsheng Li, Yu Qiao, et al. Lex-art: Rethinking text generation via scalable high-quality data synthesis. arXiv preprint arXiv:2503.21749,
-
[27]
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process. arXiv preprint arXiv:2509.16117,
-
[28]
Our proposed method, “Score Logit”, which utilizes the expected value of score logits, achieves a pairwise accuracy of 74.74%. This result significantly surpasses all other baseline methods, including binary classification-based rewards and those using discrete scores. This demonstrates that our continuous reward signal is more effective at capturing the ...