DanceOPD: On-Policy Generative Field Distillation

Bo Dong; Leigang Qu; Lingdong Kong; Lixue Gong; Meng Chu; Tat-Seng Chua; Wei Liu; Wei Zhou; Xiongwei Zhu; Yongyuan Liang

arxiv: 2606.27377 · v1 · pith:DRT274QKnew · submitted 2026-06-25 · 💻 cs.CV · cs.CL · cs.LG


Wei Zhou, Xiongwei Zhu, Zelin Xu, Bo Dong, Lixue Gong, Yongyuan Liang, Meng Chu, Leigang Qu, Lingdong Kong, Wei Liu, Tat-Seng Chua


Pith reviewed 2026-06-26 04:53 UTC · model grok-4.3

classification 💻 cs.CV cs.CL cs.LG

keywords: flow matching, generative distillation, text-to-image, image editing, classifier-free guidance, on-policy learning, velocity fields, capability composition


The pith

Routing each sample to one expert velocity field and training a flow-matching student on its own low-noise states lets one model compose text-to-image, editing, and guidance without conflicts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern image generators need to combine text-to-image synthesis with local and global editing in one network, yet these tasks interfere and degrade one another. DanceOPD treats each capability as a velocity field over a shared flow-matching state space and routes every training sample to exactly one field. It then queries that field at a single low-noise state produced by the student itself and applies a plain velocity MSE loss. Experiments across T2I, editing, realism absorption, and CFG absorption report stronger target performance while the original generation quality stays intact.

Core claim

DanceOPD is an on-policy generative field distillation framework for flow-matching models. Each capability source is defined as a velocity field over the shared flow state space. The method routes each sample to one capability field, queries one low-noise student-induced state, and trains with a velocity MSE objective so the student learns to compose the expert fields from its own rollout states. The same formulation also absorbs operator-defined fields such as classifier-free guidance.
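The objective described above can be written compactly. The following is a sketch only: the symbols are assumed for illustration, since the abstract fixes no notation, and the stop-gradient on the queried state is an assumption rather than something the source confirms.

```latex
% Assumed notation (not from the paper):
%   v_\theta          student velocity field
%   v^{(k)}           k-th capability (expert) velocity field
%   r(c)              router that sends sample/condition c to exactly one index k
%   \Phi_\theta^{1 \to \tau}  student flow map from noise time 1 to a low-noise time \tau
%   \operatorname{sg} stop-gradient (assumed; the state is treated as a fixed query point)
\[
\mathcal{L}(\theta)
= \mathbb{E}_{c,\,\epsilon}
\Big\| v_\theta(\hat{x}_{\tau}, \tau, c) - v^{(r(c))}(\hat{x}_{\tau}, \tau, c) \Big\|_2^2,
\qquad
\hat{x}_{\tau} = \operatorname{sg}\!\big[\Phi_\theta^{1 \to \tau}(\epsilon, c)\big].
\]
```

Read this way, the only difference from ordinary velocity regression is where the fields are compared: at a state the student itself reached, not at a state drawn from a fixed data-noise path.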

What carries the argument

On-policy routing of each sample to a single capability velocity field followed by MSE training on the student's own low-noise induced states.
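As a minimal sketch of that mechanism, the NumPy code below computes the routed velocity MSE for one batch. Everything here is a hypothetical illustration: the function names, the toy point-mass fields, the time convention (t = 1 is pure noise, small t is low noise), the Euler rollout, and the values of `t_low` and `n_steps` are assumptions, not details taken from the paper.

```python
import numpy as np

def make_point_field(mu):
    """Toy velocity field for the path x_t = (1 - t) * mu + t * noise (assumed convention)."""
    def field(x, t):
        return (x - mu) / max(t, 1e-3)
    return field

def rollout_to_low_noise(student_v, x_noise, t_low, n_steps):
    """Euler-integrate the student's own field from t = 1 (noise) down to t_low."""
    ts = np.linspace(1.0, t_low, n_steps + 1)
    x = x_noise.copy()
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * student_v(x, t_cur)
    return x

def routed_velocity_mse(student_v, expert_fields, route, x_noise, t_low=0.2, n_steps=8):
    """Query exactly one routed expert at one student-induced state and return the MSE."""
    x_hat = rollout_to_low_noise(student_v, x_noise, t_low, n_steps)  # on-policy state
    target = expert_fields[route](x_hat, t_low)                      # single routed field
    pred = student_v(x_hat, t_low)
    return float(np.mean((pred - target) ** 2))

# Hypothetical setup: the student currently matches the "t2i" field but not "edit".
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 3))
student = make_point_field(0.0)
experts = {"t2i": make_point_field(0.0), "edit": make_point_field(1.0)}

loss_t2i = routed_velocity_mse(student, experts, "t2i", noise)
loss_edit = routed_velocity_mse(student, experts, "edit", noise)
```

In a real training loop the loss would be differentiated with respect to the student's parameters; this sketch only shows which state is queried and which single field supplies the target.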

If this is right

The distilled model strengthens target capabilities while preserving anchor generation quality.
The same procedure absorbs realism fields and classifier-free guidance without architectural changes.
Multi-capability composition improves across T2I, local editing, and global editing benchmarks.
A simple velocity MSE objective suffices; no additional regularization terms are introduced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The single-model composition could replace ensembles of specialized generators in production pipelines.
The on-policy querying pattern might extend to sequential or video generation tasks that also suffer capability conflicts.
Adding new capability fields after initial training may require only additional routing without retraining the entire student.
The method's reliance on flow-matching state space suggests it could transfer to other continuous generative paradigms that use velocity or score fields.

Load-bearing premise

Routing every sample to exactly one capability field and querying only one low-noise student state is sufficient for the student to compose the expert fields without creating new interference or requiring extra regularization.

What would settle it

A controlled experiment in which the distilled single model exhibits either lower T2I quality than the anchor model or increased interference between local and global editing compared with separately trained experts.

Original abstract

Modern image generation demands a single model that unifies diverse capabilities, including text-to-image (T2I), local editing, and global editing. However, these capabilities are rarely naturally aligned and often conflict. For instance, editing tends to degrade T2I performance, while global and local editing interfere with each other. Consequently, effectively composing these capabilities has become a central challenge for image generation model training. To tackle this, we introduce DanceOPD, an on-policy generative field distillation framework for flow-matching models that routes each sample to one capability field, queries one low-noise student-induced state, and trains with a simple velocity MSE objective. With each capability source defined as a velocity field over the shared flow state space, the student learns from fields queried on its own rollout states to compose expert capabilities. This formulation also absorbs operator-defined fields such as classifier-free guidance. Comprehensive experiments on T2I, editing, realism-field absorption, and CFG absorption show that our approach improves multi-capability composition, strengthening target capabilities while preserving anchor generation quality. We believe this work establishes a practical route for generative field distillation in flow-matching models.
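To illustrate what an operator-defined field could look like, the sketch below builds a classifier-free-guidance velocity field from a conditional and an unconditional field using the standard CFG combination. The function name `cfg_field`, the toy fields, and the guidance scale are assumptions for illustration; the abstract does not say how the paper parameterizes the guided field.

```python
import numpy as np

def cfg_field(v_cond, v_uncond, scale):
    """Return a guided field: unconditional velocity plus a scaled conditional offset."""
    def field(x, t):
        u = v_uncond(x, t)
        return u + scale * (v_cond(x, t) - u)
    return field

# Hypothetical toy fields standing in for a model's conditional and unconditional outputs.
v_cond = lambda x, t: x - 1.0
v_uncond = lambda x, t: x

x = np.array([0.0, 2.0])
guided = cfg_field(v_cond, v_uncond, scale=3.0)(x, 0.5)
```

Because the guided field has the same signature as any other velocity field, it can serve as a distillation target like an expert field. A student that matches it would need one forward pass per step instead of two.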

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DanceOPD's on-policy routing of samples to single capability velocity fields plus student-induced states gives a workable way to distill conflicting tasks into one flow-matching model.


The main thing to know is that this paper routes each training sample to exactly one expert velocity field, pulls a low-noise state from the student's own rollout, and trains with plain velocity MSE so the student can compose T2I, local editing, global editing, and even CFG without the usual quality drop on the anchor task.

What is new is the explicit on-policy querying step. Earlier distillation work often uses fixed expert trajectories; here the fields are evaluated at states the student actually produces. That choice is presented as the mechanism that lets the student internalize the right blends. The paper also treats CFG as just another absorbable operator field, which keeps the framework uniform. The experiments run on the usual T2I and editing benchmarks and report that target capabilities strengthen while base generation quality holds.
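The contrast between the two kinds of query state can be made concrete with a toy example. In the sketch below, the off-policy state is a fixed interpolation between a data sample and noise, while the on-policy state comes from integrating the student's own field. The function names, the linear path, the time convention (t = 1 is noise), and the deliberately biased student are all assumptions for illustration.

```python
import numpy as np

def off_policy_state(x_data, noise, t):
    """State on a fixed data-noise path; it does not depend on the student."""
    return (1.0 - t) * x_data + t * noise

def on_policy_state(student_v, noise, t_low, n_steps=8):
    """State reached by Euler-integrating the student's field from t = 1 to t_low."""
    ts = np.linspace(1.0, t_low, n_steps + 1)
    x = noise.copy()
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * student_v(x, t_cur)
    return x

# Hypothetical biased student: it transports noise toward 0.5, while the data sits at 0.
biased_student = lambda x, t: (x - 0.5) / max(t, 1e-3)

rng = np.random.default_rng(0)
noise = rng.standard_normal(5)
x_off = off_policy_state(np.zeros(5), noise, t=0.2)
x_on = on_policy_state(biased_student, noise, t_low=0.2)
gap = x_on - x_off  # where the student actually lands vs. where the fixed path puts it
```

When the student is imperfect, the two states differ. An expert queried only on the fixed path never sees the states the student visits at sampling time, which is the distribution shift that on-policy querying targets.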

The approach is straightforward and the objective is standard, so there is little risk of hidden fitting. The central assumption, that single-field routing plus one student state is enough, looks plausible when the capabilities are reasonably distinct, and the abstract claims the results back it up. A minor soft spot is that, in the provided description, the routing rule itself is not compared against a multi-field or joint-training baseline; if editing and generation overlap heavily, that choice could matter more than the paper tests.

This is for people who need one flow model to do both generation and editing without separate heads or heavy regularization. Readers working on multi-capability diffusion or flow models will find the framing and the empirical pattern useful to try. The claims are concrete enough and the method is simple enough that it deserves a serious referee rather than a desk reject.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces DanceOPD, an on-policy generative field distillation framework for flow-matching models. It routes each sample to exactly one capability-specific velocity field (T2I, local editing, global editing), queries a single low-noise state induced by the current student, and optimizes a velocity MSE objective so the student can compose the expert fields. The formulation is also shown to absorb operator-defined fields such as classifier-free guidance. Comprehensive experiments on T2I, editing, realism-field absorption, and CFG absorption are reported to demonstrate improved multi-capability composition while preserving anchor generation quality.

Significance. If the empirical results hold under the stated on-policy querying regime, the work supplies a lightweight mechanism for distilling and composing velocity fields in flow-matching models without extra regularization terms. The explicit use of student-induced low-noise states is a concrete strength that directly targets distribution shift in distillation; the ability to absorb CFG as an operator field is also noteworthy and falsifiable.

minor comments (2)

[Abstract / §1] The abstract and introduction would benefit from a short table or bullet list that explicitly contrasts the proposed routing/querying scheme against prior distillation baselines (e.g., standard off-policy MSE or multi-field averaging).
[Method] Notation for the capability fields and the student-induced state should be introduced with a single equation block early in the method section to avoid repeated prose definitions.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their accurate summary of DanceOPD, for highlighting the on-policy student-induced state querying and CFG absorption as concrete strengths, and for recommending minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; the derivation is a self-contained empirical method.


The paper presents DanceOPD as an on-policy distillation framework that routes samples to single capability velocity fields, queries student-induced low-noise states, and optimizes a standard velocity MSE objective. The central claim of improved multi-capability composition is framed as an empirical outcome from experiments on T2I, editing, realism, and CFG absorption. No load-bearing derivation step reduces by construction to its inputs, self-definition, or self-citation chains; the routing and querying are explicitly stated mechanisms rather than fitted parameters renamed as predictions, and the objective does not embed circularity. The approach is self-contained against external benchmarks with no uniqueness theorems or ansatzes imported via self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only input supplies insufficient detail to enumerate free parameters, axioms, or invented entities with precision; the framework introduces 'capability field' as a velocity field construct whose independence from the student is not evidenced here.

invented entities (1)

capability field (no independent evidence)
purpose: velocity field representing one expert capability (T2I, local edit, global edit) over the shared flow state space
Defined in the abstract as the source from which the student learns; no independent evidence supplied.

pith-pipeline@v0.9.1-grok · 5763 in / 1245 out tokens · 31602 ms · 2026-06-26T04:53:46.543280+00:00 · methodology


Reference graph

Works this paper leans on

111 extracted references · 31 linked inside Pith

[1]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations, volume 2024, pages 21246–21263, 2024

2024
[2]

Variational information distillation for knowledge transfer

Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9163–9171, 2019

2019
[3]

Git re-basin: Merging models modulo permutation symmetries

Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836, 2022

arXiv 2022
[4]

Stochastic interpolants: A unifying framework for flows and diffusions.Journal of Machine Learning Research, 26(209):1–80, 2025

Michael Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions.Journal of Machine Learning Research, 26(209):1–80, 2025

2025
[5]

HumanEdit: A high-quality human-rewarded dataset for instruction-based image editing.arXiv preprint arXiv:2412.04280, 2024

Jinbin Bai, Wei Chow, Ling Yang, Xiangtai Li, Juncheng Li, Hanwang Zhang, and Shuicheng Yan. HumanEdit: A high-quality human-rewarded dataset for instruction-based image editing.arXiv preprint arXiv:2412.04280, 2024

arXiv 2024
[6]

MultiDiffusion: Fusing diffusion paths for controlled image generation

Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. MultiDiffusion: Fusing diffusion paths for controlled image generation. InInternational Conference on Machine Learning, pages 1737–1752. PMLR, 2023

2023
[7]

Training diffusion models with reinforcement learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. InInternational Conference on Learning Representations, volume 2024, 2024

2024
[8]

InstructPix2Pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to follow image editing instructions. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023

2023
[9]

Z-Image: An efficient image generation foundation model with single-stream diffusion transformer

Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-Image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699, 2025

Pith/arXiv arXiv 2025
[10]

Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.ACM Transactionson Graphics, 42(4):1–10, 2023

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.ACM Transactionson Graphics, 42(4):1–10, 2023

2023
[11]

TINO-Edit: Timestep and noise optimization for robust diffusion-based image editing

Sherry X Chen, Yaron Vaxman, Elad Ben Baruch, David Asulin, Aviad Moreshet, Kuo-Chin Lien, Misha Sra, and Pradeep Sen. TINO-Edit: Timestep and noise optimization for robust diffusion-based image editing. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6337–6346, 2024

2024
[12]

An empirical study of GPT-4o image generation capabilities.arXiv preprint arXiv:2504.05979, 2025

Sixiang Chen, Jinbin Bai, Zhuoran Zhao, Tian Ye, Qingyu Shi, Donghao Zhou, Wenhao Chai, Xin Lin, Jianzong Wu, Chao Tang, et al. An empirical study of GPT-4o image generation capabilities.arXiv preprint arXiv:2504.05979, 2025

arXiv 2025
[13]

GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks

Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. InInternational Conference on Machine Learning, pages 794–803. PMLR, 2018

2018
[14]

Just pick a sign: Optimizing deep multitask models with gradient sign dropout.Advances in Neural Information Processing Systems, 33:2039–2050, 2020

Zhao Chen, Jiquan Ngiam, Yanping Huang, Thang Luong, Henrik Kretzschmar, Yuning Chai, and Dragomir Anguelov. Just pick a sign: Optimizing deep multitask models with gradient sign dropout.Advances in Neural Information Processing Systems, 33:2039–2050, 2020

2039
[15]

PhysBench: Bench- marking and enhancing vision-language models for physical world understanding

Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Campagnolo Guizilini, and Yue Wang. PhysBench: Bench- marking and enhancing vision-language models for physical world understanding. InInternational Conference on Learning Representations, 2025

2025
[16]

EditMGT: Unleashing potentials of masked generative transformers in image editing

Wei Chow, Linfeng Li, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, and Songhua Liu. EditMGT: Unleashing potentials of masked generative transformers in image editing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 38038–38048, 2026

2026
[17]

Masked generative transformer is what you need for image editing.arXiv preprint arXiv:2605.10859, 2026

Wei Chow, Linfeng Li, Xian Sun, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, and Songhua Liu. Masked generative transformer is what you need for image editing.arXiv preprint arXiv:2605.10859, 2026. 34

Pith/arXiv arXiv 2026
[18]

Flow matching in latent space

Quan Dao, Hao Phung, Binh Nguyen, and Anh Tran. Flow matching in latent space. arXiv preprint arXiv:2307.08698, 2023

arXiv 2023
[19]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. InAdvancesin Neural Information Processing Systems, volume 34, pages 8780–8794, 2021

2021
[20]

Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC

Yilun Du, Conor Durkan, Robin Strudel, Joshua B Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl- Dickstein, Arnaud Doucet, and Will Sussman Grathwohl. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC. InInternational Conference on Machine Learning, pages 8489–8510. PMLR, 2023

2023
[21]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning, pages 12606–12633. PMLR, 2024

2024
[22]

Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. Advancesin Neural Information Processing Systems, 36:79858–79885, 2023

2023
[23]

Rubric-based on-policy distillation.arXiv preprint arXiv:2605.07396, 2026

Junfeng Fang, Zhepei Hong, Mao Zheng, Mingyang Song, Gengsheng Li, Houcheng Jiang, Dan Zhang, Haiyun Guo, Xiang Wang, and Tat-Seng Chua. Rubric-based on-policy distillation.arXiv preprint arXiv:2605.07396, 2026

Pith/arXiv arXiv 2026
[24]

Flow-OPD: On-policy distillation for flow matching models.arXiv preprint arXiv:2605.08063, 2026

Zhen Fang, Wenxuan Huang, Yu Zeng, Yiming Zhao, Shuang Chen, Kaituo Feng, Yunlong Lin, Lin Chen, Zehui Chen, Shaosheng Cao, et al. Flow-OPD: On-policy distillation for flow matching models.arXiv preprint arXiv:2605.08063, 2026

Pith/arXiv arXiv 2026
[25]

Training-free structured diffusion guidance for compositional text-to-image synthesis

Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022

arXiv 2022
[26]

Dream- Sim: Learning new dimensions of human visual similarity using synthetic data.arXiv preprint arXiv:2306.09344, 2023

StephanieFu, NetanelTamir, ShobhitaSundaram, LucyChai, RichardZhang, TaliDekel, andPhillipIsola. Dream- Sim: Learning new dimensions of human visual similarity using synthetic data.arXiv preprint arXiv:2306.09344, 2023

Pith/arXiv arXiv 2023
[27]

Efficient knowledge distillation from an ensemble of teachers

Takashi Fukuda, Masayuki Suzuki, Gakuto Kurata, Samuel Thomas, Jia Cui, and Bhuvana Ramabhadran. Efficient knowledge distillation from an ensemble of teachers. InInterspeech, pages 3697–3701, 2017

2017
[28]

An image is worth one word: Personalizing text-to-image generation using textual inversion.arXiv preprint arXiv:2208.01618, 2022

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion.arXiv preprint arXiv:2208.01618, 2022

Pith/arXiv arXiv 2022
[29]

GenEval: An object-focused framework for evaluating text-to-image alignment.Advancesin Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment.Advancesin Neural Information Processing Systems, 36:52132–52152, 2023

2023
[30]

MiniLLM: Knowledge distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In International Conference on Learning Representations, volume 2024, pages 32694–32717, 2024

2024
[31]

Efficient diffusion training via min-SNR weighting strategy

Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via min-SNR weighting strategy. InIEEE/CVF International Conference on Computer Vision, pages 7441–7451, 2023

2023
[32]

A comprehensive overhaul of feature distillation

Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, and Jin Young Choi. A comprehensive overhaul of feature distillation. InIEEE/CVF International Conference on Computer Vision, pages 1921–1930, 2019

1921
[33]

Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

Pith/arXiv arXiv 2022
[34]

CLIPScore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. InConference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021

2021
[35]

Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015. 35

Pith/arXiv arXiv 2015
[36]

Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

Pith/arXiv arXiv 2022
[37]

Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 33:6840–6851, 2020

2020
[38]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022

2022
[39]

Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering

Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. InIEEE/CVF International Conference on Computer Vision, pages 20406–20417, 2023

2023
[40]

Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2I-CompBench++: An enhanced and comprehensive benchmark for compositional text-to-image generation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3563–3579, 2025

2025
[41]

Composer: Creative and controllable image synthesis with composable conditions.arXiv preprint arXiv:2302.09778, 2023

Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions.arXiv preprint arXiv:2302.09778, 2023

arXiv 2023
[42]

Editing models with task arithmetic.arXiv preprint arXiv:2212.04089, 2022

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic.arXiv preprint arXiv:2212.04089, 2022

Pith/arXiv arXiv 2022
[43]

Adaptive mixtures of local experts

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991

1991
[44]

Rotograd: Gradient homogenization in multitask learning.arXiv preprint arXiv:2103.02631, 2021

Adrián Javaloy and Isabel Valera. Rotograd: Gradient homogenization in multitask learning.arXiv preprint arXiv:2103.02631, 2021

arXiv 2021
[45]

Asymmetric on-policy distillation: Bridging exploitation and imitation at the token level.arXiv preprint arXiv:2605.06387, 2026

Nan Jia, Haojin Yang, Xing Ma, Jiesong Lian, Shuailiang Zhang, Weipeng Zhang, Ke Zeng, Xunliang Cai, and Zequn Sun. Asymmetric on-policy distillation: Bridging exploitation and imitation at the token level.arXiv preprint arXiv:2605.06387, 2026

Pith/arXiv arXiv 2026
[46]

D-OPSD: On-policy self-distillation for continuously tuning step-distilled diffusion models

Dengyang Jiang, Xin Jin, Dongyang Liu, Zanyi Wang, Mingzhe Zheng, Ruoyi Du, Xiangpeng Yang, Qilong Wu, Zhen Li, Peng Gao, et al. D-OPSD: On-policy self-distillation for continuously tuning step-distilled diffusion models. arXiv preprint arXiv:2605.05204, 2026

Pith/arXiv arXiv 2026
[47]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advancesin Neural Information Processing Systems, 35:26565–26577, 2022

2022
[48]

Consistency trajectory models: Learning probability flow ODE trajectory of diffusion

Dongjun Kim, Chieh-Hsin Lai, WeiHsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. InInternational Conference on Learning Representations, 2024

2024
[49]

Pick-A-Pic: An open dataset of user preferences for text-to-image generation

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-A-Pic: An open dataset of user preferences for text-to-image generation. InAdvances in Neural Information Processing Systems, volume 36, pages 36652–36663, 2023

2023
[50]

AI for auto-research: Roadmap & user guide.arXiv preprint arXiv:2605.18661, 2026

Lingdong Kong, Xian Sun, Wei Chow, Linfeng Li, Kevin Qinghong Lin, Xuan Billy Zhang, Song Wang, Rong Li, Qing Wu, Wei Gao, et al. AI for auto-research: Roadmap & user guide.arXiv preprint arXiv:2605.18661, 2026

Pith/arXiv arXiv 2026
[51]

VieScore: Towards explainable metrics for conditional image synthesis evaluation

Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. VieScore: Towards explainable metrics for conditional image synthesis evaluation. InAnnual Meeting of the Association for Computational Linguistics, pages 12268–12290, 2024

2024
[52]

Multi-concept customization of text-to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023

1931
[53]

DiffusionOPD: A unified perspective of on-policy distillation in diffusion models.arXiv preprint arXiv:2605.15055, 2026

Quanhao Li, Junqiu Yu, Kaixun Jiang, Yujie Wei, Zhen Xing, Pandeng Li, Ruihang Chu, Shiwei Zhang, Yu Liu, and Zuxuan Wu. DiffusionOPD: A unified perspective of on-policy distillation in diffusion models.arXiv preprint arXiv:2605.15055, 2026

Pith/arXiv arXiv 2026
[54]

Schedule your edit: A simple yet effective diffusion noise schedule for image editing

Haonan Lin, Yan Chen, Jiahao Wang, Wenbin An, Mengmeng Wang, Feng Tian, Yong Liu, Guang Dai, Jingdong Wang, and Qianying Wang. Schedule your edit: A simple yet effective diffusion noise schedule for image editing. Advancesin Neural Information Processing Systems, 37:115712–115756, 2024. 36

2024
[55]

Flow matching for generative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022
[56]

Conflict-averse gradient descent for multi-task learning

Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. Advancesin Neural Information Processing Systems, 34:18878–18890, 2021

2021
[57]

Flow-GRPO: Training flow matching models via online RL.Advances in Neural Information Processing Systems, 38:40783–40818, 2026

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL.Advances in Neural Information Processing Systems, 38:40783–40818, 2026

2026
[58]

Towards impartial multi-task learning

Liyang Liu, Yi Li, Zhanghui Kuang, Jing-Hao Xue, Yimin Chen, Wenming Yang, Qingmin Liao, and Wayne Zhang. Towards impartial multi-task learning. InInternational Conference on Learning Representations, 2021

2021
[59]

Compositional visual generation with composable diffusion models

Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. InEuropean Conference on Computer Vision, pages 423–439. Springer, 2022

2022
[60]

Step1x-Edit: A practical framework for general image editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025

Pith/arXiv arXiv 2025
[61]

Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

Pith/arXiv arXiv 2022
[62]

DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps.Advances in Neural Information Processing Systems, 35:5775–5787, 2022

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps.Advances in Neural Information Processing Systems, 35:5775–5787, 2022

2022
[63]

DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Research, 22(4):730–751, 2025

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Research, 22(4):730–751, 2025

2025
[64]

Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, and Vladimir Braverman. Demystifying OPD: Length inflation and stabilization strategies for large language models. arXiv preprint arXiv:2604.08527, 2026
[65]

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023
[66]

Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1930–1939, 2018
[67]

Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa Dehghani, and James Henderson. Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. In Annual Meeting of the Association for Computational Linguistics, pages 565–576, 2021
[68]

Michael S Matena and Colin Raffel. Merging models with Fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703–17716, 2022
[69]

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021
[70]

Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, and Ethan Fetaya. Multi-task learning as a bargaining game. arXiv preprint arXiv:2202.01017, 2022
[71]

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021
[72]

William Peebles and Saining Xie. Scalable diffusion models with transformers. In IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023
[73]

Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. AdapterFusion: Non-destructive task composition for transfer learning. In Conference of the European Chapter of the Association for Computational Linguistics, pages 487–503, 2021
[74]

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In International Conference on Learning Representations, 2024
[75]

Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022
[76]

Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation. arXiv preprint arXiv:2310.03739, 2023
[77]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022
[78]

Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014
[79]

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011
[80]

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023
