pith. sign in

arxiv: 2606.27377 · v1 · pith:DRT274QKnew · submitted 2026-06-25 · 💻 cs.CV · cs.CL· cs.LG

DanceOPD: On-Policy Generative Field Distillation

Pith reviewed 2026-06-26 04:53 UTC · model grok-4.3

classification 💻 cs.CV cs.CLcs.LG
keywords flow matchinggenerative distillationtext-to-imageimage editingclassifier-free guidanceon-policy learningvelocity fieldscapability composition
0
0 comments X

The pith

Routing each sample to one expert velocity field and training a flow-matching student on its own low-noise states lets one model compose text-to-image, editing, and guidance without conflicts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern image generators need to combine text-to-image synthesis with local and global editing in one network, yet these tasks interfere and degrade one another. DanceOPD treats each capability as a velocity field over a shared flow-matching state space and routes every training sample to exactly one field. It then queries that field at a single low-noise state produced by the student itself and applies a plain velocity MSE loss. Experiments across T2I, editing, realism absorption, and CFG absorption report stronger target performance while the original generation quality stays intact.

Core claim

DanceOPD is an on-policy generative field distillation framework for flow-matching models. Each capability source is defined as a velocity field over the shared flow state space. The method routes each sample to one capability field, queries one low-noise student-induced state, and trains with a velocity MSE objective so the student learns to compose the expert fields from its own rollout states. The same formulation also absorbs operator-defined fields such as classifier-free guidance.

What carries the argument

On-policy routing of each sample to a single capability velocity field followed by MSE training on the student's own low-noise induced states.

If this is right

  • The distilled model strengthens target capabilities while preserving anchor generation quality.
  • The same procedure absorbs realism fields and classifier-free guidance without architectural changes.
  • Multi-capability composition improves across T2I, local editing, and global editing benchmarks.
  • A simple velocity MSE objective suffices; no additional regularization terms are introduced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-model composition could replace ensembles of specialized generators in production pipelines.
  • The on-policy querying pattern might extend to sequential or video generation tasks that also suffer capability conflicts.
  • Adding new capability fields after initial training may require only additional routing without retraining the entire student.
  • The method's reliance on flow-matching state space suggests it could transfer to other continuous generative paradigms that use velocity or score fields.

Load-bearing premise

Routing every sample to exactly one capability field and querying only one low-noise student state is sufficient for the student to compose the expert fields without creating new interference or requiring extra regularization.

What would settle it

A controlled experiment in which the distilled single model exhibits either lower T2I quality than the anchor model or increased interference between local and global editing compared with separately trained experts.

read the original abstract

Modern image generation demands a single model that unifies diverse capabilities, including text-to-image (T2I), local editing, and global editing. However, these capabilities are rarely naturally aligned and often conflict. For instance, editing tends to degrade T2I performance, while global and local editing interfere with each other. Consequently, effectively composing these capabilities has become a central challenge for image generation model training. To tackle this, we introduce DanceOPD, an on-policy generative field distillation framework for flow-matching models that routes each sample to one capability field, queries one low-noise student-induced state, and trains with a simple velocity MSE objective. With each capability source defined as a velocity field over the shared flow state space, the student learns from fields queried on its own rollout states to compose expert capabilities. This formulation also absorbs operator-defined fields such as classifier-free guidance. Comprehensive experiments on T2I, editing, realism-field absorption, and CFG absorption show that our approach improves multi-capability composition, strengthening target capabilities while preserving anchor generation quality. We believe this work establishes a practical route for generative field distillation in flow-matching models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces DanceOPD, an on-policy generative field distillation framework for flow-matching models. It routes each sample to exactly one capability-specific velocity field (T2I, local editing, global editing), queries a single low-noise state induced by the current student, and optimizes a velocity MSE objective so the student can compose the expert fields. The formulation is also shown to absorb operator-defined fields such as classifier-free guidance. Comprehensive experiments on T2I, editing, realism-field absorption, and CFG absorption are reported to demonstrate improved multi-capability composition while preserving anchor generation quality.

Significance. If the empirical results hold under the stated on-policy querying regime, the work supplies a lightweight mechanism for distilling and composing velocity fields in flow-matching models without extra regularization terms. The explicit use of student-induced low-noise states is a concrete strength that directly targets distribution shift in distillation; the ability to absorb CFG as an operator field is also noteworthy and falsifiable.

minor comments (2)
  1. [Abstract / §1] The abstract and introduction would benefit from a short table or bullet list that explicitly contrasts the proposed routing/querying scheme against prior distillation baselines (e.g., standard off-policy MSE or multi-field averaging).
  2. [Method] Notation for the capability fields and the student-induced state should be introduced with a single equation block early in the method section to avoid repeated prose definitions.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their accurate summary of DanceOPD, for highlighting the on-policy student-induced state querying and CFG absorption as concrete strengths, and for recommending minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained empirical method

full rationale

The paper presents DanceOPD as an on-policy distillation framework that routes samples to single capability velocity fields, queries student-induced low-noise states, and optimizes a standard velocity MSE objective. The central claim of improved multi-capability composition is framed as an empirical outcome from experiments on T2I, editing, realism, and CFG absorption. No load-bearing derivation step reduces by construction to its inputs, self-definition, or self-citation chains; the routing and querying are explicitly stated mechanisms rather than fitted parameters renamed as predictions, and the objective does not embed circularity. The approach is self-contained against external benchmarks with no uniqueness theorems or ansatzes imported via self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only input supplies insufficient detail to enumerate free parameters, axioms, or invented entities with precision; the framework introduces 'capability field' as a velocity field construct whose independence from the student is not evidenced here.

invented entities (1)
  • capability field no independent evidence
    purpose: velocity field representing one expert capability (T2I, local edit, global edit) over the shared flow state space
    Defined in the abstract as the source from which the student learns; no independent evidence supplied.

pith-pipeline@v0.9.1-grok · 5763 in / 1245 out tokens · 31602 ms · 2026-06-26T04:53:46.543280+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

111 extracted references · 31 linked inside Pith

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations, volume 2024, pages 21246–21263, 2024

  2. [2]

    Variational information distillation for knowledge transfer

    Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9163–9171, 2019

  3. [3]

    Git re-basin: Merging models modulo permutation symmetries

    Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836, 2022

  4. [4]

    Stochastic interpolants: A unifying framework for flows and diffusions.Journal of Machine Learning Research, 26(209):1–80, 2025

    Michael Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions.Journal of Machine Learning Research, 26(209):1–80, 2025

  5. [5]

    HumanEdit: A high-quality human-rewarded dataset for instruction-based image editing.arXiv preprint arXiv:2412.04280, 2024

    Jinbin Bai, Wei Chow, Ling Yang, Xiangtai Li, Juncheng Li, Hanwang Zhang, and Shuicheng Yan. HumanEdit: A high-quality human-rewarded dataset for instruction-based image editing.arXiv preprint arXiv:2412.04280, 2024

  6. [6]

    MultiDiffusion: Fusing diffusion paths for controlled image generation

    Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. MultiDiffusion: Fusing diffusion paths for controlled image generation. InInternational Conference on Machine Learning, pages 1737–1752. PMLR, 2023

  7. [7]

    Training diffusion models with reinforcement learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. InInternational Conference on Learning Representations, volume 2024, 2024

  8. [8]

    InstructPix2Pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to follow image editing instructions. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023

  9. [9]

    Z-Image: An efficient image generation foundation model with single-stream diffusion transformer

    Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-Image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699, 2025

  10. [10]

    Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.ACM Transactionson Graphics, 42(4):1–10, 2023

    Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.ACM Transactionson Graphics, 42(4):1–10, 2023

  11. [11]

    TINO-Edit: Timestep and noise optimization for robust diffusion-based image editing

    Sherry X Chen, Yaron Vaxman, Elad Ben Baruch, David Asulin, Aviad Moreshet, Kuo-Chin Lien, Misha Sra, and Pradeep Sen. TINO-Edit: Timestep and noise optimization for robust diffusion-based image editing. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6337–6346, 2024

  12. [12]

    An empirical study of GPT-4o image generation capabilities.arXiv preprint arXiv:2504.05979, 2025

    Sixiang Chen, Jinbin Bai, Zhuoran Zhao, Tian Ye, Qingyu Shi, Donghao Zhou, Wenhao Chai, Xin Lin, Jianzong Wu, Chao Tang, et al. An empirical study of GPT-4o image generation capabilities.arXiv preprint arXiv:2504.05979, 2025

  13. [13]

    GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks

    Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. InInternational Conference on Machine Learning, pages 794–803. PMLR, 2018

  14. [14]

    Just pick a sign: Optimizing deep multitask models with gradient sign dropout.Advances in Neural Information Processing Systems, 33:2039–2050, 2020

    Zhao Chen, Jiquan Ngiam, Yanping Huang, Thang Luong, Henrik Kretzschmar, Yuning Chai, and Dragomir Anguelov. Just pick a sign: Optimizing deep multitask models with gradient sign dropout.Advances in Neural Information Processing Systems, 33:2039–2050, 2020

  15. [15]

    PhysBench: Bench- marking and enhancing vision-language models for physical world understanding

    Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Campagnolo Guizilini, and Yue Wang. PhysBench: Bench- marking and enhancing vision-language models for physical world understanding. InInternational Conference on Learning Representations, 2025

  16. [16]

    EditMGT: Unleashing potentials of masked generative transformers in image editing

    Wei Chow, Linfeng Li, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, and Songhua Liu. EditMGT: Unleashing potentials of masked generative transformers in image editing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 38038–38048, 2026

  17. [17]

    Masked generative transformer is what you need for image editing.arXiv preprint arXiv:2605.10859, 2026

    Wei Chow, Linfeng Li, Xian Sun, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, and Songhua Liu. Masked generative transformer is what you need for image editing.arXiv preprint arXiv:2605.10859, 2026. 34

  18. [18]

    Flow matching in latent space

    Quan Dao, Hao Phung, Binh Nguyen, and Anh Tran. Flow matching in latent space. arXiv preprint arXiv:2307.08698, 2023

  19. [19]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. InAdvancesin Neural Information Processing Systems, volume 34, pages 8780–8794, 2021

  20. [20]

    Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC

    Yilun Du, Conor Durkan, Robin Strudel, Joshua B Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl- Dickstein, Arnaud Doucet, and Will Sussman Grathwohl. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC. InInternational Conference on Machine Learning, pages 8489–8510. PMLR, 2023

  21. [21]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning, pages 12606–12633. PMLR, 2024

  22. [22]

    Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models

    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. Advancesin Neural Information Processing Systems, 36:79858–79885, 2023

  23. [23]

    Rubric-based on-policy distillation.arXiv preprint arXiv:2605.07396, 2026

    Junfeng Fang, Zhepei Hong, Mao Zheng, Mingyang Song, Gengsheng Li, Houcheng Jiang, Dan Zhang, Haiyun Guo, Xiang Wang, and Tat-Seng Chua. Rubric-based on-policy distillation.arXiv preprint arXiv:2605.07396, 2026

  24. [24]

    Flow-OPD: On-policy distillation for flow matching models.arXiv preprint arXiv:2605.08063, 2026

    Zhen Fang, Wenxuan Huang, Yu Zeng, Yiming Zhao, Shuang Chen, Kaituo Feng, Yunlong Lin, Lin Chen, Zehui Chen, Shaosheng Cao, et al. Flow-OPD: On-policy distillation for flow matching models.arXiv preprint arXiv:2605.08063, 2026

  25. [25]

    Training-free structured diffusion guidance for compositional text-to-image synthesis

    Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022

  26. [26]

    Dream- Sim: Learning new dimensions of human visual similarity using synthetic data.arXiv preprint arXiv:2306.09344, 2023

    StephanieFu, NetanelTamir, ShobhitaSundaram, LucyChai, RichardZhang, TaliDekel, andPhillipIsola. Dream- Sim: Learning new dimensions of human visual similarity using synthetic data.arXiv preprint arXiv:2306.09344, 2023

  27. [27]

    Efficient knowledge distillation from an ensemble of teachers

    Takashi Fukuda, Masayuki Suzuki, Gakuto Kurata, Samuel Thomas, Jia Cui, and Bhuvana Ramabhadran. Efficient knowledge distillation from an ensemble of teachers. InInterspeech, pages 3697–3701, 2017

  28. [28]

    An image is worth one word: Personalizing text-to-image generation using textual inversion.arXiv preprint arXiv:2208.01618, 2022

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion.arXiv preprint arXiv:2208.01618, 2022

  29. [29]

    GenEval: An object-focused framework for evaluating text-to-image alignment.Advancesin Neural Information Processing Systems, 36:52132–52152, 2023

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment.Advancesin Neural Information Processing Systems, 36:52132–52152, 2023

  30. [30]

    MiniLLM: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In International Conference on Learning Representations, volume 2024, pages 32694–32717, 2024

  31. [31]

    Efficient diffusion training via min-SNR weighting strategy

    Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via min-SNR weighting strategy. InIEEE/CVF International Conference on Computer Vision, pages 7441–7451, 2023

  32. [32]

    A comprehensive overhaul of feature distillation

    Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, and Jin Young Choi. A comprehensive overhaul of feature distillation. InIEEE/CVF International Conference on Computer Vision, pages 1921–1930, 2019

  33. [33]

    Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

  34. [34]

    CLIPScore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. InConference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021

  35. [35]

    Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015. 35

  36. [36]

    Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  37. [37]

    Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  38. [38]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022

  39. [39]

    Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering

    Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. InIEEE/CVF International Conference on Computer Vision, pages 20406–20417, 2023

  40. [40]

    Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2I-CompBench++: An enhanced and comprehensive benchmark for compositional text-to-image generation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3563–3579, 2025

  41. [41]

    Composer: Creative and controllable image synthesis with composable conditions.arXiv preprint arXiv:2302.09778, 2023

    Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions.arXiv preprint arXiv:2302.09778, 2023

  42. [42]

    Editing models with task arithmetic.arXiv preprint arXiv:2212.04089, 2022

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic.arXiv preprint arXiv:2212.04089, 2022

  43. [43]

    Adaptive mixtures of local experts

    Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991

  44. [44]

    Rotograd: Gradient homogenization in multitask learning.arXiv preprint arXiv:2103.02631, 2021

    Adrián Javaloy and Isabel Valera. Rotograd: Gradient homogenization in multitask learning.arXiv preprint arXiv:2103.02631, 2021

  45. [45]

    Asymmetric on-policy distillation: Bridging exploitation and imitation at the token level.arXiv preprint arXiv:2605.06387, 2026

    Nan Jia, Haojin Yang, Xing Ma, Jiesong Lian, Shuailiang Zhang, Weipeng Zhang, Ke Zeng, Xunliang Cai, and Zequn Sun. Asymmetric on-policy distillation: Bridging exploitation and imitation at the token level.arXiv preprint arXiv:2605.06387, 2026

  46. [46]

    D-OPSD: On-policy self-distillation for continuously tuning step-distilled diffusion models

    Dengyang Jiang, Xin Jin, Dongyang Liu, Zanyi Wang, Mingzhe Zheng, Ruoyi Du, Xiangpeng Yang, Qilong Wu, Zhen Li, Peng Gao, et al. D-OPSD: On-policy self-distillation for continuously tuning step-distilled diffusion models. arXiv preprint arXiv:2605.05204, 2026

  47. [47]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advancesin Neural Information Processing Systems, 35:26565–26577, 2022

  48. [48]

    Consistency trajectory models: Learning probability flow ODE trajectory of diffusion

    Dongjun Kim, Chieh-Hsin Lai, WeiHsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. InInternational Conference on Learning Representations, 2024

  49. [49]

    Pick-A-Pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-A-Pic: An open dataset of user preferences for text-to-image generation. InAdvances in Neural Information Processing Systems, volume 36, pages 36652–36663, 2023

  50. [50]

    AI for auto-research: Roadmap & user guide.arXiv preprint arXiv:2605.18661, 2026

    Lingdong Kong, Xian Sun, Wei Chow, Linfeng Li, Kevin Qinghong Lin, Xuan Billy Zhang, Song Wang, Rong Li, Qing Wu, Wei Gao, et al. AI for auto-research: Roadmap & user guide.arXiv preprint arXiv:2605.18661, 2026

  51. [51]

    VieScore: Towards explainable metrics for conditional image synthesis evaluation

    Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. VieScore: Towards explainable metrics for conditional image synthesis evaluation. InAnnual Meeting of the Association for Computational Linguistics, pages 12268–12290, 2024

  52. [52]

    Multi-concept customization of text-to-image diffusion

    Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023

  53. [53]

    DiffusionOPD: A unified perspective of on-policy distillation in diffusion models.arXiv preprint arXiv:2605.15055, 2026

    Quanhao Li, Junqiu Yu, Kaixun Jiang, Yujie Wei, Zhen Xing, Pandeng Li, Ruihang Chu, Shiwei Zhang, Yu Liu, and Zuxuan Wu. DiffusionOPD: A unified perspective of on-policy distillation in diffusion models.arXiv preprint arXiv:2605.15055, 2026

  54. [54]

    Schedule your edit: A simple yet effective diffusion noise schedule for image editing

    Haonan Lin, Yan Chen, Jiahao Wang, Wenbin An, Mengmeng Wang, Feng Tian, Yong Liu, Guang Dai, Jingdong Wang, and Qianying Wang. Schedule your edit: A simple yet effective diffusion noise schedule for image editing. Advancesin Neural Information Processing Systems, 37:115712–115756, 2024. 36

  55. [55]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  56. [56]

    Conflict-averse gradient descent for multi-task learning

    Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. Advancesin Neural Information Processing Systems, 34:18878–18890, 2021

  57. [57]

    Flow-GRPO: Training flow matching models via online RL.Advances in Neural Information Processing Systems, 38:40783–40818, 2026

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL.Advances in Neural Information Processing Systems, 38:40783–40818, 2026

  58. [58]

    Towards impartial multi-task learning

    Liyang Liu, Yi Li, Zhanghui Kuang, Jing-Hao Xue, Yimin Chen, Wenming Yang, Qingmin Liao, and Wayne Zhang. Towards impartial multi-task learning. InInternational Conference on Learning Representations, 2021

  59. [59]

    Compositional visual generation with composable diffusion models

    Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. InEuropean Conference on Computer Vision, pages 423–439. Springer, 2022

  60. [60]

    Step1x-Edit: A practical framework for general image editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025

  61. [61]

    Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  62. [62]

    DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps.Advances in Neural Information Processing Systems, 35:5775–5787, 2022

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps.Advances in Neural Information Processing Systems, 35:5775–5787, 2022

  63. [63]

    DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Research, 22(4):730–751, 2025

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Research, 22(4):730–751, 2025

  64. [64]

    Demystifying OPD: Length inflation and stabilization strategies for large language models.arXiv preprint arXiv:2604.08527, 2026

    Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, and Vladimir Braverman. Demystifying OPD: Length inflation and stabilization strategies for large language models.arXiv preprint arXiv:2604.08527, 2026

  65. [65]

    Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023

  66. [66]

    Modeling task relationships in multi- task learning with multi-gate mixture-of-experts

    Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multi- task learning with multi-gate mixture-of-experts. InACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1930–1939, 2018

  67. [67]

    Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks

    Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa Dehghani, and James Henderson. Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. InAnnual Meeting of the Association for Computational Linguistics, pages 565–576, 2021

  68. [68]

    Merging models with fisher-weighted averaging

    Michael S Matena and Colin Raffel. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703–17716, 2022

  69. [69]

    SDEdit: Guided image synthesis and editing with stochastic differential equations.arXiv preprint arXiv:2108.01073, 2021

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations.arXiv preprint arXiv:2108.01073, 2021

  70. [70]

    Multi-task learning as a bargaining game.arXiv preprint arXiv:2202.01017, 2022

    Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, and Ethan Fetaya. Multi-task learning as a bargaining game.arXiv preprint arXiv:2202.01017, 2022

  71. [71]

    Improveddenoisingdiffusionprobabilisticmodels

    AlexanderQuinnNicholandPrafullaDhariwal. Improveddenoisingdiffusionprobabilisticmodels. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021

  72. [72]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InIEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

  73. [73]

    Adapterfusion: Non-destructive task composition for transfer learning

    Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task composition for transfer learning. InConference of the European Chapter of the Association for Computational Linguistics, pages 487–503, 2021

  74. [74]

    SDXL: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. InInternational Conference on Learning Representations, 2024. 37

  75. [75]

    Dreamfusion: Text-to-3D using 2D diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022

  76. [76]

    Aligning text-to-image diffusion models with reward backpropagation.arXiv preprint arXiv:2310.03739, 2023

    Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation.arXiv preprint arXiv:2310.03739, 2023

  77. [77]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022

  78. [78]

    FitNets: Hints for thin deep nets.arXiv preprint arXiv:1412.6550, 2014

    Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets.arXiv preprint arXiv:1412.6550, 2014

  79. [79]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InInternational Conference on Artificial Intelligence and Statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

  80. [80]

    DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023

Showing first 80 references.