Flow-OPD: On-Policy Distillation for Flow Matching Models

Feng Zhao; Kaituo Feng; Lin Chen; Shaosheng Cao; Shuang Chen; Wenxuan Huang; Yiming Zhao; Yunlong Lin; Yu Zeng; Zehui Chen

REVIEW · 3 major objections · 1 minor · cited by 5

Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.

T0 review · grok-4.3

Flow-OPD applies on-policy distillation to consolidate separate domain experts into one flow matching text-to-image model.

2026-06-30 22:59 UTC pith:KMSLJJOB

Load-bearing objection: Flow-OPD transfers on-policy distillation to flow matching via single-reward teachers plus cold-start and MAR, but the reported gains rest on an unverified premise about teacher ceilings with no supporting metrics or ablations.

arxiv 2605.08063 v5 pith:KMSLJJOB submitted 2026-05-08 cs.CV cs.AI

Zhen Fang, Wenxuan Huang, Yu Zeng, Yiming Zhao, Shuang Chen, Kaituo Feng, Yunlong Lin, Lin Chen, Zehui Chen, Shaosheng Cao, Feng Zhao

classification cs.CV cs.AI

keywords: flow matching, on-policy distillation, text-to-image generation, GRPO fine-tuning, multi-task alignment, manifold regularization, reward sparsity

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multi-task alignment in flow matching models is blocked by reward sparsity from scalar rewards and gradient interference from competing objectives, which together produce a seesaw effect across metrics. It proposes a two-stage process that first trains isolated domain-specialized teachers with single-reward GRPO, then distills their expertise into one student through on-policy sampling, task-routing labels, and dense trajectory supervision. A manifold anchor regularizer keeps generations on a high-quality manifold to avoid aesthetic loss. The resulting model improves composite performance while maintaining fidelity and human preference scores and shows an emergent ability to exceed any single teacher.

Core claim

Flow-OPD is a post-training framework that first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, then uses a Flow-based Cold-Start to initialize a robust policy and consolidates the teachers into a single student through on-policy sampling, task-routing labeling, and dense trajectory-level supervision, augmented by Manifold Anchor Regularization that supplies task-agnostic full-data supervision to anchor output to a high-quality manifold.
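The consolidation step in the core claim can be pictured with a toy sketch. This is not the paper's implementation: the abstract gives no loss formula, so the squared-error distance between velocity fields, the `mar_weight` value, the Euler sampler, and every name below (`make_linear_field`, `rollout`, `opd_loss`) are illustrative assumptions. The sketch only shows the shape of the loop: the student samples its own trajectory, a task label picks the teacher, every visited state is supervised, and a task-agnostic anchor adds a regularization term.

```python
import numpy as np

def make_linear_field(target):
    """Toy velocity field pushing x toward `target` (hypothetical stand-in for a model)."""
    target = np.asarray(target, dtype=float)
    return lambda x, t: target - x

def rollout(velocity, x0, num_steps):
    """Euler-integrate dx/dt = velocity(x, t) over [0, 1], recording each visited state."""
    dt = 1.0 / num_steps
    x = np.asarray(x0, dtype=float)
    states = []
    for i in range(num_steps):
        t = i * dt
        states.append((x.copy(), t))
        x = x + dt * velocity(x, t)
    return states

def opd_loss(student, teachers, anchor, x0, task, num_steps=8, mar_weight=0.1):
    """Assumed loss: per-step squared error to the routed teacher plus an anchor term."""
    states = rollout(student, x0, num_steps)   # on-policy: states come from the student
    teacher = teachers[task]                   # task-routing label selects the teacher
    distill, mar = 0.0, 0.0
    for x, t in states:                        # dense: every trajectory step is supervised
        v_student = student(x, t)
        distill += np.mean((v_student - teacher(x, t)) ** 2)
        mar += np.mean((v_student - anchor(x, t)) ** 2)  # task-agnostic anchor
    return (distill + mar_weight * mar) / num_steps

# Toy setup: two "domain teachers", one anchor, and a student equal to the first teacher.
teachers = {"task_a": make_linear_field([1.0, 0.0]), "task_b": make_linear_field([0.0, 1.0])}
anchor = make_linear_field([0.5, 0.5])
student = make_linear_field([1.0, 0.0])

loss_a = opd_loss(student, teachers, anchor, [0.0, 0.0], "task_a")
loss_b = opd_loss(student, teachers, anchor, [0.0, 0.0], "task_b")
print(loss_a, loss_b)
```

In this toy, the student already matches the `task_a` teacher, so only the anchor term contributes to `loss_a`, while `loss_b` is dominated by the distillation term. The caption of Figure 4 mentions a KL loss, so the real objective may well differ from the squared error used here.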

What carries the argument

The two-stage alignment strategy that isolates single-reward teacher training and then performs on-policy distillation with task-routing labeling and dense trajectory supervision, plus Manifold Anchor Regularization for manifold anchoring.

Load-bearing premise

Single-reward GRPO training lets each domain teacher reach its performance ceiling without later interference when their outputs are combined.

What would settle it

A controlled experiment that trains the same set of domain teachers, applies the distillation steps, and measures whether all target metrics rise together without measurable drop in image fidelity or human preference scores.

If this is right

The student model exceeds the performance of any individual teacher on the combined task set.
Reward sparsity and gradient interference are reduced enough to eliminate the seesaw effect across heterogeneous objectives.
Image fidelity and human-preference alignment remain intact after the consolidation stage.
The same two-stage pattern scales to building generalist text-to-image models from multiple specialized experts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to other generative architectures that currently rely on joint multi-objective fine-tuning.
If the teacher-surpassing effect holds, future work could deliberately create more diverse domain teachers to widen the final performance gap.
Task-routing labeling could be replaced by learned routers without changing the core distillation loop.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Flow-OPD transfers on-policy distillation to flow matching via single-reward teachers plus cold-start and MAR, but the reported gains rest on an unverified premise about teacher ceilings with no supporting metrics or ablations.

The paper's main contribution is a two-stage recipe for multi-objective alignment in flow matching: train separate GRPO teachers on single rewards, then distill via on-policy sampling, task routing, and dense supervision, plus a Flow-based Cold-Start and Manifold Anchor Regularization to keep the student on a good manifold.

What works is the concrete orchestration and the size of the claimed lifts—GenEval from 63 to 92 and OCR from 59 to 94, plus roughly 10 points over vanilla GRPO while holding fidelity and human preference. The teacher-surpassing effect is a nice observation if it survives scrutiny. The code release promise is also useful.

The soft spots are exactly where the stress-test note points. The whole argument hinges on each single-reward teacher reaching its isolated ceiling without gradient interference or sparsity, yet the abstract gives no teacher-only scores, no single-vs-joint comparison, and no check that extra single-reward steps would not improve the teachers further. Without those, you cannot tell how much the distillation step actually adds versus the cold-start or MAR. No error bars, no ablation table, and no discussion of how task-routing thresholds or sampling temperatures were chosen make the numbers hard to trust at face value.

This is for researchers doing post-training alignment on diffusion or flow models who need scalable multi-reward recipes. A reader looking for practical ideas might extract the two-stage pattern and the regularization trick. It deserves peer review because the application is new and the potential payoff is real, even though the current evidence is preliminary and the key assumption needs direct verification in the full paper.

Referee Report

3 major / 1 minor

Summary. The paper proposes Flow-OPD, a two-stage post-training framework for Flow Matching text-to-image models. Stage 1 trains domain-specialized teacher models via single-reward GRPO fine-tuning; Stage 2 performs on-policy distillation into a student using a Flow-based Cold-Start initialization, task-routing labeling, dense trajectory-level supervision, and a new Manifold Anchor Regularization (MAR) term that anchors generations to a task-agnostic teacher manifold. On Stable Diffusion 3.5 Medium the method is reported to raise GenEval from 63 to 92 and OCR accuracy from 59 to 94, yielding an overall ~10-point gain over vanilla GRPO while preserving fidelity and human preference scores and exhibiting a teacher-surpassing effect.
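For a quick sense of scale, the two headline lifts quoted in the summary can be restated as absolute and relative gains. The helper name `gain` is an illustrative choice, and the sketch uses only the four scores reported above. It says nothing about the separate ~10-point comparison against vanilla GRPO, whose underlying scores are not given here.

```python
def gain(before, after):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative

# Scores quoted in the summary (Stable Diffusion 3.5 Medium, before and after Flow-OPD).
reported = {"GenEval": (63, 92), "OCR": (59, 94)}

for name, (before, after) in reported.items():
    absolute, relative = gain(before, after)
    print(f"{name}: +{absolute} points ({relative:.1f}% relative)")
# GenEval: +29 points (46.0% relative)
# OCR: +35 points (59.3% relative)
```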

Significance. If the central empirical claims hold after proper verification, the work would constitute a meaningful contribution to multi-objective alignment of flow-based generative models by offering an explicit mechanism to avoid reward sparsity and gradient interference. The planned public release of code and weights is a clear strength that would support reproducibility.

major comments (3)

[Abstract] The load-bearing premise that single-reward GRPO fine-tuning lets each domain-specialized teacher reach its isolated performance ceiling (thereby avoiding gradient interference and reward sparsity) receives no supporting evidence. No teacher-only metrics, no single-reward vs. joint multi-reward ablation, and no check that additional single-reward steps would not further improve the teachers are reported; without these the contribution of the subsequent distillation stage cannot be isolated from the Cold-Start or MAR components.
[Abstract] The reported metric gains (GenEval 63→92, OCR 59→94, ~10-point overall improvement) are presented without error bars, standard deviations across seeds, or any description of how post-hoc choices (task-routing labeling thresholds, sampling temperature) were selected or whether they were tuned on the same held-out metrics used for the final comparison to GRPO.
[Abstract] The evaluation protocol does not state whether the GRPO baseline scores were obtained with identical sampling budgets, reward-weight schedules, or hyper-parameters as the Flow-OPD teachers, making it impossible to determine whether the observed gains are attributable to the two-stage strategy or to differences in training configuration.
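The error bars requested in the second comment amount to reporting a mean and a spread over repeated runs. A minimal sketch, assuming per-seed scores are available as a plain list; the function names are illustrative and the input values are synthetic placeholders, not results from the paper.

```python
import statistics

def summarize(scores):
    """Mean and sample standard deviation (n-1 denominator) of per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)

def format_score(scores):
    """Render scores as 'mean ± std (n=...)' for a results table."""
    mean, std = summarize(scores)
    return f"{mean:.1f} ± {std:.1f} (n={len(scores)})"

# Synthetic placeholder values for illustration only; NOT measurements from the paper.
placeholder_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
print(format_score(placeholder_scores))  # 3.0 ± 1.6 (n=5)
```

Spread over inference seeds of a single trained model captures sampling noise only. Spread over independently trained models would be needed to capture training variance as well.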

minor comments (1)

[Abstract] The abstract states that 'the codes and weights will be released' but provides no link or repository identifier in the current manuscript; this should be added for completeness.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

Referee: [Abstract] The load-bearing premise that single-reward GRPO fine-tuning lets each domain-specialized teacher reach its isolated performance ceiling (thereby avoiding gradient interference and reward sparsity) receives no supporting evidence. No teacher-only metrics, no single-reward vs. joint multi-reward ablation, and no check that additional single-reward steps would not further improve the teachers are reported; without these the contribution of the subsequent distillation stage cannot be isolated from the Cold-Start or MAR components.

Authors: We agree that providing teacher-only metrics would help isolate the contributions. In the revised manuscript, we will include a new table or section reporting the performance of each domain-specialized teacher on its target metric, demonstrating that they achieve high scores in isolation. While a full single-reward vs. joint ablation is computationally intensive, we will add a discussion referencing prior literature on gradient interference in multi-objective RL and note that the observed teacher-surpassing effect in the student model supports the value of the distillation stage. We will also confirm that teachers were trained to convergence. Revision: partial.

Referee: [Abstract] The reported metric gains (GenEval 63→92, OCR 59→94, ~10-point overall improvement) are presented without error bars, standard deviations across seeds, or any description of how post-hoc choices (task-routing labeling thresholds, sampling temperature) were selected or whether they were tuned on the same held-out metrics used for the final comparison to GRPO.

Authors: The primary results are reported from our main experimental configuration. We will revise the experimental section to describe the hyperparameter selection process, noting that task-routing thresholds and sampling temperatures were determined using a separate validation set not overlapping with the reported test metrics. Regarding error bars, due to the significant computational resources required for full training runs, we conducted single-seed training but will report standard deviations from multiple inference runs (e.g., 5 seeds) for the final metrics in the revision. Revision: yes.

Referee: [Abstract] The evaluation protocol does not state whether the GRPO baseline scores were obtained with identical sampling budgets, reward-weight schedules, or hyper-parameters as the Flow-OPD teachers, making it impossible to determine whether the observed gains are attributable to the two-stage strategy or to differences in training configuration.

Authors: We will explicitly clarify in the revised experimental protocol that the vanilla GRPO baseline was trained using identical sampling budgets, reward-weight schedules, and hyper-parameters as those used for the individual Flow-OPD teachers, with the key difference being the joint optimization across all rewards in the baseline. This ensures the comparison isolates the effect of the two-stage on-policy distillation approach. Revision: yes.
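The parity claim in the last response is straightforward to check mechanically once both training configurations are written down. A minimal sketch, assuming each configuration is a flat dictionary; the key names and values below are hypothetical and do not come from the paper or its code.

```python
_MISSING = "<missing>"  # marker for a setting present in only one config

def config_diff(baseline, teacher, ignore=()):
    """Return {key: (baseline_value, teacher_value)} for every setting that differs."""
    keys = (set(baseline) | set(teacher)) - set(ignore)
    diff = {}
    for key in sorted(keys):
        a = baseline.get(key, _MISSING)
        b = teacher.get(key, _MISSING)
        if a != b:
            diff[key] = (a, b)
    return diff

# Hypothetical settings; names and values are illustrative only.
baseline_cfg = {"sampling_budget": 16, "learning_rate": 1e-5, "rewards": ("reward_a", "reward_b")}
teacher_cfg = {"sampling_budget": 16, "learning_rate": 1e-5, "rewards": ("reward_a",)}

# The reward set is the intended difference (joint vs. single reward), so it is excluded.
print(config_diff(baseline_cfg, teacher_cfg, ignore=("rewards",)))  # {}
print(config_diff(baseline_cfg, teacher_cfg))
```

An empty result with the reward set excluded is what the response asserts. Any other key showing up in the diff would be a confound in the comparison to GRPO.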

Circularity Check

0 steps flagged

No circularity; empirical results on held-out metrics are independent of method definition

The paper describes an empirical two-stage post-training procedure (single-reward GRPO teachers followed by on-policy distillation plus MAR) and reports numerical gains on GenEval, OCR, and other benchmarks relative to vanilla GRPO. No equations, fitted parameters, or self-citations are presented whose outputs are then relabeled as predictions; the performance numbers are obtained from separate evaluation runs on metrics not used to define or tune the procedure itself. The derivation chain therefore consists of standard RL/distillation steps whose validity rests on external experimental comparison rather than reduction to the inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the empirical effectiveness of the introduced components; no machine-checked proofs or parameter-free derivations are present.

free parameters (2)

GRPO reward weights and sampling parameters for teacher training: chosen per task to reach isolated performance ceilings.
Task-routing labeling thresholds: determine which teacher supervises each trajectory.

axioms (1)

Domain assumption: single-reward GRPO fine-tuning allows each expert to reach its performance ceiling without interference from other objectives. Invoked to justify the first stage of the two-stage strategy.

invented entities (1)

Manifold Anchor Regularization (MAR): no independent evidence.
Purpose: provides task-agnostic full-data supervision to anchor outputs to a high-quality manifold. Introduced to counteract aesthetic degradation from pure RL alignment.

pith-pipeline@v0.9.1-grok · 5884 in / 1394 out tokens · 26302 ms · 2026-06-30T22:59:49.656576+00:00 · methodology

Original abstract

Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models. The codes and weights will be released in: https://github.com/CostaliyA/Flow-OPD .

Figures

Figures reproduced from arXiv: 2605.08063 by Feng Zhao, Kaituo Feng, Lin Chen, Shaosheng Cao, Shuang Chen, Wenxuan Huang, Yiming Zhao, Yunlong Lin, Yu Zeng, Zehui Chen, Zhen Fang.

**Figure 1.** Performance Comparison in Multi-task Training. During training, Flow-OPD exhibits a steady increase in mean rewards across GenEval [21] and OCR [22] benchmarks, reaching a peak of 93. In contrast, vanilla GRPO converges prematurely around 78. Our approach significantly outperforms GRPO in both image synthesis and text rendering while maintaining superior generation quality and human preference alignment. T…

**Figure 2.** Cross-task evaluation of single-reward GRPO. Optimizing with a solitary reward signal …

**Figure 3.** Qualitative comparison between Flow-OPD and various baselines across diverse tasks. Our …

**Figure 4.** Cold-start ablation results. Panel labels: GRPO-Geneval, GRPO-DeQA, w/o KL loss, w/ KL loss (ours).

**Figure 5.** Qualitative ablation results of Manifold Anchor Regularization.

**Figure 6.** The structured evaluation prompt for Qwenvl Score. (Adjacent extracted text: "We use Qwen3-30B-A3B-Instruct-2507.")

**Figure 7.** More quantitative comparisons on the Pickscore evaluation set.

**Figure 8.** More quantitative comparisons on the GenEval evaluation set.

**Figure 9.** More quantitative comparisons on the OCR evaluation set.

**Figure 10.** More quantitative comparisons with DiffusionNFT [49].

**Figure 11.** More quantitative comparisons with DiffusionNFT [49].

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation
cs.CV · 2026-05 · unverdicted · novelty 6.0
A multi-teacher distillation framework that packs 50 effect LoRAs and fast sampling into a single adapter while aiming to avoid concept interference.

DanceOPD: On-Policy Generative Field Distillation
cs.CV · 2026-06 · unverdicted · novelty 5.0
DanceOPD routes samples across capability velocity fields in flow-matching models and trains via on-policy student-induced states to compose T2I, local editing, and global editing without mutual interference.

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation
cs.LG · 2026-06 · unverdicted · novelty 5.0
FiRe-OPD introduces a two-stage filter-then-soft-reweight procedure for trajectory- and token-level supervision in on-policy distillation, claiming gains over prior token-level methods.

Qwen-Image-Flash: Beyond Objective Design
cs.CV · 2026-06 · unverdicted · novelty 4.0
Empirical analysis of data, guidance, and task mixture in few-step distillation of Qwen-Image-2.0 produces the Qwen-Image-Flash model with improved performance in unified generation and editing tasks.

Qwen-Image-2.0-RL Technical Report
cs.CV · 2026-06 · unverdicted · novelty 2.0
Applies RLHF with composite VLM-based reward models and on-policy distillation to a diffusion model, reporting benchmark gains of +2.61 on Qwen-Image-Bench and Elo improvements of +78/+93.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 5 Pith papers · 14 internal anchors

[1] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, pages arXiv–2506, 2025.

[2] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024.

[3] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

[4] Zhen Fang, Zhuoyang Liu, Jiaming Liu, Hao Chen, Yu Zeng, Shiting Huang, Zehui Chen, Lin Chen, Shanghang Zhang, and Feng Zhao. Dualvla: Building a generalizable embodied agent via partial decoupling of reasoning and action. arXiv preprint arXiv:2511.22134, 2025.

[5] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2026.

[6] Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, et al. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models. arXiv preprint arXiv:2601.22060, 2026.

[7] Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning. arXiv preprint arXiv:2506.04207, 2025.

[8] Shuang Chen, Yue Guo, Yimeng Ye, Shijue Huang, Wenbo Hu, Haoxi Li, Manyuan Zhang, Jiayu Chen, Song Guo, and Nanyun Peng. Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping. arXiv preprint arXiv:2510.08457, 2025.

[10] Shuang Chen, Kaituo Feng, Hangting Chen, Wenxuan Huang, Dasen Dai, Quanxin Shou, Yunlong Lin, Xiangyu Yue, Shenghua Gao, and Tianyu Pang. Opensearch-vl: An open recipe for frontier multimodal search agents, 2026.

[11] Ruiyan Han, Zhen Fang, XinYu Sun, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, et al. Unicorn: Towards self-improving unified multimodal models through self-generated supervision. arXiv preprint arXiv:2601.03193, 2026.

[12] Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, et al. Unify-agent: A unified multimodal agent for world-grounded image synthesis. arXiv preprint arXiv:2603.29620, 2026.

[13] Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, and Xiangyu Yue. Gen-searcher: Reinforcing agentic search for image generation. arXiv preprint arXiv:2603.28767, 2026.

[14] Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, et al. Interleaving reasoning for better text-to-image generation. arXiv preprint arXiv:2509.06945, 2025.

[15] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[17] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, and Ping Luo. Dancegrpo: Unleashing grpo on visual generation, 2025.

[18] Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Yiming Cheng, Miles Yang, Zhao Zhong, and Liefeng Bo. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802, 2025.

[19] Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.

[20] Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780, 2026.

[21] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023.

[22] Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters. Advances in Neural Information Processing Systems, 36:9353–9387, 2023.

[23] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning, 2024.

[24] Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models, 2023.

[25] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: learning and evaluating human preferences for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 15903–15935, 2023.

[26] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization, 2023.

[27] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025.

[28] Shihao Yuan, Yahui Liu, Yang Yue, Jingyuan Zhang, Wangmeng Zuo, Qi Wang, Fuzheng Zhang, and Guorui Zhou. Ar-grpo: Training autoregressive image generation models via reinforcement learning. arXiv preprint arXiv:2508.06924, 2025.

[29] Guohui Zhang, Hu Yu, Xiaoxiao Ma, JingHao Zhang, Yaning Pan, Mingde Yao, Jie Xiao, Linjiang Huang, and Feng Zhao. Group critical-token policy optimization for autoregressive image generation. arXiv preprint arXiv:2509.22485, 2025.

[30] Xiaoxiao Ma, Haibo Qiu, Guohui Zhang, Zhixiong Zeng, Siqi Yang, Lin Ma, and Feng Zhao. Stage: Stable and generalizable grpo for autoregressive image generation. arXiv preprint arXiv:2509.25027, 2025.

[31] Guohui Zhang, Hu Yu, Xiaoxiao Ma, Yaning Pan, Hang Xu, and Feng Zhao. Maskfocus: Focusing policy optimization on critical steps for masked image generation. arXiv preprint arXiv:2512.18766, 2025.

[32]

MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

Xiaoxiao Ma, Jiachen Lei, Tianfei Ren, Jie Huang, Siming Fu, Aiming Hao, Jiahong Wu, Xiangxiang Chu, and Feng Zhao. Mar-grpo: Stabilized grpo for ar-diffusion hybrid image generation.arXiv preprint arXiv:2604.06966, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[33]

Gdpo: Group reward-decoupled normalization policy optimization for multi-reward rl optimization, 2026

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, and Pavlo Molchanov. Gdpo: Group reward-decoupled normalization policy optimization for multi-reward rl optimization, 2026

work page 2026
[34]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

work page 2024
[35]

Minillm: Knowledge distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. InThe twelfth international conference on learning representations, 2024

[36]

Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, and Se-Young Yun. DistiLLM-2: A contrastive approach boosts the distillation of LLMs. arXiv preprint arXiv:2503.07067, 2025.

[37]

Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125, 2026.

[38]

Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079, 2026.

[39]

Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D Lyng, Sanjit Singh Batra, and Robert E Tillman. Fast and effective on-policy distillation from reasoning prefixes. arXiv preprint arXiv:2602.15260, 2026.

[40]

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang. Paced: Distillation and self-distillation at the frontier of student competence. arXiv e-prints, pages arXiv–2603, 2026.

[41]

Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. https://thinkingmachines.ai/blog/on-policy-distillation

[42]

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:36652–36663, 2023.

[43]

Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. Teaching large language models to regress accurate image quality scores using score distribution. arXiv preprint arXiv:2501.11561, 2025.

[44]

Christoph Schuhmann. LAION aesthetics, Aug 2022.

[45]

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.

[46]

Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236, 2025.

[47]

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.

[48]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[49]

Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. DiffusionNFT: Online diffusion reinforcement with forward process. arXiv preprint arXiv:2509.16117, 2025.

A More Details

Following the data and reward configurations of Flow-GRPO, we conducted multi-task hybrid training for G...

Aesthetic Quality
• 1-2 (Low): Blurry, poor lighting, or chaotic composition.
• 3 (Fair): In focus, adequate lighting, but lacks creativity.
• 4-5 (High): Sharp, vibrant colors, masterful composition and impact.

Instruction Following
• 1-2 (Low): Ignores or contradicts the instruction; misses key elements.
• 3 (Fair): Partially follows, but distorts some important elements.
• 4-5 (High): Faithful representation of all elements in the prompt.

Overall Score (Priority: Alignment > Aesthetics)
The overall score must primarily reflect Instruction Following. A fair image that perfectly follows the prompt scores higher than a beautiful image that misses it.

[EXECUTION RULES]
• Strictness: Be rigorous; required details must be explicitly supported.
• Reasoning: You MUST analyze keyword-by-keyword in the <Tho...

