
arxiv: 2605.05204 · v2 · pith:OMZRZRQAnew · submitted 2026-05-06 · 💻 cs.CV

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

Pith reviewed 2026-05-20 23:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords: diffusion models, self-distillation, on-policy learning, few-step inference, image generation, fine-tuning, continuous adaptation

The pith

Step-distilled diffusion models can be continuously fine-tuned on new concepts without losing their few-step speed by using on-policy self-distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the problem that standard supervised fine-tuning erodes the few-step sampling ability of step-distilled diffusion models. It introduces D-OPSD to turn supervised fine-tuning into an on-policy self-distillation process. The model generates its own trajectories and then compares a text-only student version against a teacher version that also receives the target image. Alignment happens between the two predicted distributions on those self-generated paths. A reader would care because this would let fast image generators keep adapting to new styles or subjects without retraining from scratch or reverting to slow multi-step sampling.

Core claim

The paper claims that modern diffusion models with an LLM or VLM encoder inherit in-context capabilities that allow a teacher conditioned on both text prompt and target image to provide reliable supervision to a text-only student. Training then minimizes the difference between the two predicted distributions over the student's own roll-outs, so the model acquires new concepts and styles while its original few-step inference capacity stays intact.
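
One way to write the objective described above, as a hedged sketch rather than the paper's own formula: the conditions c_s (text only) and c_t (text plus target image) follow the Figure 2 caption, while the generic divergence D, the stop-gradient operator sg, and the notation p_θ are assumptions introduced here. The paper's Equation 7 is not reproduced on this page.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{(x_t,\, t)\,\sim\,\text{student roll-out under } c_s}
    \Big[\, D\big(\, \operatorname{sg}\!\left[\, p_\theta(\cdot \mid x_t, t, c_t) \,\right]
      \;\big\|\; p_\theta(\cdot \mid x_t, t, c_s) \,\big) \Big]
```

The expectation is taken over states the student itself visits, which is what makes the training on-policy; the teacher term only supplies a target and receives no gradient in this sketch.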

What carries the argument

On-policy self-distillation, in which the model serves as both teacher (conditioned on text plus target image) and student (text only) and the loss aligns their output distributions on trajectories sampled from the student itself.
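
The mechanism can be illustrated with a toy, one-dimensional sketch. Everything in it is an assumption made for illustration: the linear `velocity` model, the squared-difference loss between teacher and student velocities, the scalar conditions, and the choice to pass the teacher weights separately as a frozen copy. It does not reproduce the paper's network or its Equation 7.

```python
import random


def velocity(w, x, t, c):
    """Toy linear velocity model; a stand-in for the diffusion network."""
    return w[0] * x + w[1] * c + w[2] * t


def student_rollout(w, c_s, num_steps, rng):
    """Few-step Euler roll-out from noise (t=1) toward data (t=0), text-only condition."""
    x = rng.gauss(0.0, 1.0)
    dt = 1.0 / num_steps
    states = []
    for k in range(num_steps):
        t = 1.0 - k * dt
        states.append((x, t))  # record the state before stepping
        x = x - dt * velocity(w, x, t, c_s)
    return states


def opsd_step(w, teacher_w, c_s, c_t, num_steps, lr, rng):
    """One on-policy update: match the teacher's velocity on the student's own states."""
    states = student_rollout(w, c_s, num_steps, rng)
    grad = [0.0, 0.0, 0.0]
    loss = 0.0
    for x, t in states:
        v_teacher = velocity(teacher_w, x, t, c_t)  # target, treated as a constant
        v_student = velocity(w, x, t, c_s)
        diff = v_student - v_teacher
        loss += diff * diff / len(states)
        # Analytic gradient of the squared difference for the linear toy model.
        for i, feature in enumerate((x, c_s, t)):
            grad[i] += 2.0 * diff * feature / len(states)
    new_w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return new_w, loss


def run_demo(iterations=300, seed=0):
    """Train the text-only student against the image-conditioned teacher; return the losses."""
    rng = random.Random(seed)
    teacher_w = [0.5, 1.0, 0.1]  # hypothetical weights, frozen for the teacher
    w = list(teacher_w)          # the student starts from the same weights
    c_s, c_t = 1.0, 2.0          # hypothetical scalar "text" and "text + image" conditions
    losses = []
    for _ in range(iterations):
        w, loss = opsd_step(w, teacher_w, c_s, c_t, num_steps=4, lr=0.05, rng=rng)
        losses.append(loss)
    return losses


if __name__ == "__main__":
    history = run_demo()
    print("first loss:", history[0], "last loss:", history[-1])
```

The roll-out is re-sampled from the current student at every update, so the supervision always lands on states the few-step sampler actually visits. Standard supervised fine-tuning would instead train on noised versions of the target image, which the few-step sampler may never encounter.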

If this is right

  • The model acquires new concepts and styles through continuous supervised fine-tuning.
  • The original few-step inference performance remains unchanged after tuning.
  • Training draws on the model's inherited in-context capabilities from its encoder to generate the supervisory signal.
  • Practical ongoing adaptation of efficient image generators to specific domains becomes feasible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same self-distillation pattern could be tested on other conditional generative models that use encoder-based prompts.
  • Deployed systems might use repeated rounds of this process for gradual personalization without full retraining.
  • Checking the method on base models without strong encoders would test how far the in-context assumption reaches.

Load-bearing premise

Modern diffusion models inherit enough in-context capabilities from their LLM or VLM encoders that a teacher conditioned on both text and target image can reliably supervise a text-only student.

What would settle it

Apply D-OPSD to a step-distilled model and then check whether high-quality images on new concepts still appear in the original few steps; a clear rise in the number of steps required or a drop in quality on either new or original prompts would show the claim is false.
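
The check above can be phrased as a small comparison of quality scores measured at several step budgets before and after tuning. This is a sketch under stated assumptions: the helper names `steps_needed` and `few_step_capacity_kept`, the dictionary layout (step count mapped to a quality score), and the threshold are all hypothetical, and the numbers in the usage note are made up for illustration, not results from the paper.

```python
def steps_needed(quality_by_steps, threshold):
    """Smallest step budget whose quality reaches the threshold, or None if none does."""
    for steps in sorted(quality_by_steps):
        if quality_by_steps[steps] >= threshold:
            return steps
    return None


def few_step_capacity_kept(before, after, threshold):
    """True when the tuned model reaches the threshold in no more steps than the original."""
    need_before = steps_needed(before, threshold)
    need_after = steps_needed(after, threshold)
    if need_before is None:
        raise ValueError("the original model never reaches the threshold")
    return need_after is not None and need_after <= need_before
```

For example, with made-up scores `before = {4: 0.80, 8: 0.82, 50: 0.83}` and `after = {4: 0.40, 8: 0.60, 50: 0.80}` at a threshold of 0.75, the tuned model needs 50 steps instead of 4, so `few_step_capacity_kept` returns `False` and the claim would fail on that prompt set. The same comparison would be run separately on new-concept prompts and on original-domain prompts.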

Figures

Figures reproduced from arXiv: 2605.05204 by Dengyang Jiang, Dongyang Liu, Harry Yang, Mingzhe Zheng, Peng Gao, Qilong Wu, Ruoyi Du, Steven Hoi, Xiangpeng Yang, Xin Jin, Zanyi Wang, Zhen Li.

Figure 1. We empirically investigate the visual appearance of generated images when con…
Figure 2. Method overview. For each training pair, we first pass the prompt alone and the prompt together with the target image through the encoder to obtain c_s and c_t, respectively. We then sample a few-step trajectory using the student branch conditioned on c_s. After that, the teacher and student predict velocities on the same trajectory states, and the student is updated by Equation 7. After training, the teach…
Figure 3. Visual comparison between baseline methods and ours finetuned on Z-Image-Turbo under customized training settings. Vanilla SFT training sacrifices the original few-step capacity, and PSO suffers from the overfitting to training set, whereas our method enables the step-distilled model to continuously learn new concepts while maintaining the few-step capacity. … large drops in Quality-S and Aesthetic-S in…
Figure 4. Visual comparison between baseline methods and ours finetuned on Z-Image-Turbo under full-finetuning settings. SFT and PSO training sacrifices the original few-step capacity, whereas our method enables the step-distilled model to continuously learn to bias target domain while maintaining the few-step capacity as well as the learned knowledge in the original domain. … learned knowledge. We conduct training an…
Figure 5. Ablation on (a) the different training strategies, and (b) the different way to build teacher model. We report the curves across training steps of DINO feature similarity between the generated images and the targets, as well as the Quality Score of the generated images. Training conducted on Z-Image-Turbo with LoRA. Better to zoom in to check the difference. … use the student copy leads to training collapse.…
Figure 6. When the teacher model fails to generate images consistent with the concept ID under multimodal condition and therefore cannot provide an effective supervision signal, training will fail. Requirements for teacher capability. The success of D-OPSD is contingent upon the base model’s in-context abilities. In specific, as shown in…
Figure 7. We also empirically investigate the difference of generated images when con…
Figure 7. When the teacher model fails to generate images consistent with the concept ID under multimodal condition and therefore cannot provide an effective supervision signal, training will fail. Requirements for teacher capability. The success of D-OPSD is contingent upon the base model’s in-context abilities. In specific, as shown in…
Figure 8. Comparison of generated images of Z-Image-Turbo conditioned on multimodal feature using Qwen3-VL 4B and Qwen3-VL 4B with LLM part reweighted by Qwen3-4B LM. To address this issue, we replace the weights of the LLM component in Qwen3-VL-4B with those from the more compatible Qwen3-4B, while keeping the ViT and Connector weights unchanged. In this way, we preserve multimodal in-context capability while…
Figure 8. We also empirically investigate the difference of generated images when con…
Figure 9. Comparison of generated images of Z-Image-Turbo conditioned on multimodal feature using Qwen3-VL 4B and Qwen3-VL 4B with LLM part reweighted by Qwen3-4B LM. To address this issue, we replace the weights of the LLM component in Qwen3-VL-4B with those from the more compatible Qwen3-4B, while keeping the ViT and Connector weights unchanged. In this way, we preserve multimodal in-context capability while…
Original abstract

The landscape of high-performance image generation models is currently shifting from the inefficient multi-step ones to the efficient few-step counterparts (e.g., Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for direct continuous supervised fine-tuning. For example, applying the commonly used fine-tuning technique would compromise their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that the modern diffusion models, where the LLM/VLM serves as the encoder, can inherit its encoder's in-context capabilities. This enables us to formulate the training as an on-policy self-distillation process. Specifically, during training, we make the model act as both the teacher and the student with different contexts, where the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the divergence between the two predicted distributions over the student's own roll-outs. By optimizing on the model's own trajectory and under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc., without sacrificing the original few-step capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes D-OPSD, an on-policy self-distillation training paradigm for step-distilled diffusion models. The core idea is that modern diffusion models with LLM/VLM encoders inherit in-context capabilities, allowing the same model to act as both teacher (conditioned on multimodal input consisting of the text prompt plus target image) and student (conditioned on text features only). Training minimizes divergence between the two predicted distributions evaluated on the student's own roll-outs, with the goal of enabling supervised fine-tuning for new concepts and styles while preserving the original few-step sampling behavior.

Significance. If the central claim is substantiated, the result would be significant for the ongoing shift toward efficient few-step diffusion models. It offers a potential solution to the problem of continuous supervised fine-tuning without degrading inference speed, which is a practical barrier for models such as Z-Image-Turbo and FLUX.2-klein. The on-policy self-distillation framing could also inform related work on self-supervised adaptation in generative models more broadly.

major comments (2)
  1. [Abstract and §3 (Method description)] The central assumption that the LLM/VLM encoder supplies sufficient in-context capabilities for the multimodal teacher to produce a reliable, distribution-compatible supervisory signal for the text-only student is stated as a finding but is not accompanied by any derivation, preliminary ablation, or stability argument showing why this conditioning remains valid once parameters are updated on-policy. This assumption is load-bearing for the claim that few-step capacity is preserved.
  2. [§4 (Experiments)] No quantitative results, ablations, or comparisons are referenced that isolate the contribution of the on-policy self-distillation objective versus standard supervised fine-tuning; without such evidence it is impossible to assess whether the method actually avoids the distribution shift or capacity erosion described in the abstract.
minor comments (1)
  1. [Abstract] The models Z-Image-Turbo and FLUX.2-klein are mentioned without citations or references.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and describe the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §3 (Method description)] The central assumption that the LLM/VLM encoder supplies sufficient in-context capabilities for the multimodal teacher to produce a reliable, distribution-compatible supervisory signal for the text-only student is stated as a finding but is not accompanied by any derivation, preliminary ablation, or stability argument showing why this conditioning remains valid once parameters are updated on-policy. This assumption is load-bearing for the claim that few-step capacity is preserved.

    Authors: We agree that the validity of the in-context capability after on-policy updates is a central assumption. The manuscript presents this as an empirical observation that enables the teacher-student formulation, with the on-policy rollouts intended to keep the distributions aligned. While no formal derivation is given, the design minimizes shift by supervising on the student's own trajectories. To strengthen the presentation, we will add a new preliminary ablation in the revised §3 that measures the divergence between teacher and student predictions on held-out student trajectories both before and after a short training run, providing evidence for stability of the supervisory signal. revision: yes

  2. Referee: [§4 (Experiments)] No quantitative results, ablations, or comparisons are referenced that isolate the contribution of the on-policy self-distillation objective versus standard supervised fine-tuning; without such evidence it is impossible to assess whether the method actually avoids the distribution shift or capacity erosion described in the abstract.

    Authors: We acknowledge that the current experiments do not include an explicit head-to-head comparison against standard supervised fine-tuning. The reported results focus on demonstrating that D-OPSD enables acquisition of new concepts while retaining few-step sampling speed. To isolate the contribution of the on-policy objective, we will add quantitative comparisons in the revised §4, including a standard SFT baseline with metrics on both concept fidelity and inference-step preservation, allowing direct assessment of distribution-shift mitigation. revision: yes

Circularity Check

0 steps flagged

No circularity: proposed on-policy objective is independent of its claimed outcomes

full rationale

The paper introduces D-OPSD by first stating an empirical observation that diffusion models with LLM/VLM encoders inherit in-context capabilities, then defines a training process in which the same model acts as teacher (multimodal conditioning on text + target image) and student (text-only) while minimizing divergence on the student's own roll-outs. This formulation is presented as a novel paradigm to enable supervised fine-tuning without eroding few-step sampling; the benefit of preserving original capacity is an intended empirical result of the objective rather than a quantity that reduces to the inputs by definition or by self-citation. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided description, and the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the encoder's in-context learning transfers to the diffusion model in a way that makes multimodal conditioning a valid teacher signal; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Modern diffusion models where the LLM/VLM serves as the encoder can inherit its encoder's in-context capabilities.
    This finding is invoked to justify formulating training as on-policy self-distillation with different contexts.




Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Brief Overview: On-Policy Self-Distillation In Large Language Models

    cs.HC 2026-05 unverdicted novelty 2.0

    OPSD lets a single LLM distill its own reasoning by sampling trajectories from the student role while granting the teacher role privileged access to verified solutions, reducing memory needs versus separate-model dist...

  2. A Brief Overview: On-Policy Self-Distillation In Large Language Models

    cs.HC 2026-05 unverdicted novelty 2.0

    This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.

Reference graph

Works this paper leans on

121 extracted references · 121 canonical work pages · cited by 1 Pith paper · 48 internal anchors

  1. [1]

    In: The twelfth international conference on learning representations (2024)

    Agarwal, R., Vieillard, N., Zhou, Y., Stanczyk, P ., Garea, S.R., Geist, M., Bachem, O.: On-policy distillation of language models: Learning from self-generated mistakes. In: The twelfth international conference on learning representations (2024)

  2. [2]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  3. [3]

    Black Forest Labs: FLUX.https://github.com/black-forest-labs/flux (2023)

  4. [4]

    Black Forest Labs: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/ flux-2(2025)

  5. [5]

    Advances in neural information processing systems33, 1877–1901 (2020)

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P ., Neelakantan, A., Shyam, P ., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020)

  6. [6]

    HunyuanImage 3.0 Technical Report

    Cao, S., Chen, H., Chen, P ., Cheng, Y., Cui, Y., Deng, X., Dong, Y., Gong, K., Gu, T., Gu, X., et al.: Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951 (2025)

  7. [7]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P ., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

  8. [8]

    arXiv preprint arXiv:2603.06507 (2026)

    Chefer, H., Esser, P ., Lorenz, D., Podell, D., Raja, V ., Tong, V ., Torralba, A., Rombach, R.: Self-supervised flow matching for scalable multi-modal synthesis. arXiv preprint arXiv:2603.06507 (2026)

  9. [9]

    arXiv preprint arXiv:2510.14974 (2025)

    Chen, H., Zhang, K., Tan, H., Guibas, L., Wetzstein, G., Bi, S.: pi-flow: Policy-based few-step generation via imitation distillation. arXiv preprint arXiv:2510.14974 (2025)

  10. [10]

    In: International Conference on Learning Representations (2024) 18

    Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P ., Lu, H., Li, Z.: Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In: International Conference on Learning Representations (2024) 18

  11. [11]

    Science China Information Sciences67(12), 220101 (2024)

    Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., Tong, W., Hu, K., Luo, J., Ma, Z., et al.: How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences67(12), 220101 (2024)

  12. [12]

    arXiv preprint arXiv:2512.05150 (2025)

    Cheng, Z., Sun, P ., Li, J., Lin, T.: Twinflow: Realizing one-step generation on large models with self-adversarial flows. arXiv preprint arXiv:2512.05150 (2025)

  13. [13]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  14. [14]

    https://huggingface.co/Freepik (2024)

    Daniel Verdú, J.M.: Flux.1 lite: Distilling flux1.dev for efficient text-to-image genera- tion. https://huggingface.co/Freepik (2024)

  15. [15]

    DeepSeek-AI: Deepseek-v4: Towards highly efficient million-token context intelli- gence (2026)

  16. [16]

    Advances in neural information processing systems34, 8780–8794 (2021)

    Dhariwal, P ., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems34, 8780–8794 (2021)

  17. [17]

    In: Forty-first international conference on machine learning (2024)

    Esser, P ., Kulal, S., Blattmann, A., Entezari, R., Muller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)

  18. [18]

    arXiv preprint arXiv:2309.17425 (2023)

    Fang, A., Jose, A.M., Jain, A., Schmidt, L., Toshev, A., Shankar, V .: Data filtering networks. arXiv preprint arXiv:2309.17425 (2023)

  19. [19]

    arXiv preprint arXiv:2412.01199 (2024)

    Fang, G., Li, K., Ma, X., Wang, X.: Tinyfusion: Diffusion transformers learned shal- low. arXiv preprint arXiv:2412.01199 (2024)

  20. [20]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen- Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)

  21. [21]

    Advances in Neural Information Processing Systems36, 52132–52152 (2023)

    Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023)

  22. [22]

    Google DeepMind: Gemini 3.https://deepmind.google/models/gemini/ (2025)

  23. [23]

    Google DeepMind: Gemini 3 pro image model card.https: //storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Image-Model-Card.pdf(2025)

  24. [24]

    International Journal of Computer Vision129(6), 1789–1819 (2021)

    Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey. International Journal of Computer Vision129(6), 1789–1819 (2021)

  25. [25]

    ELT: Elastic Looped Transformers for Visual Generation

    Goyal, S., Agrawal, S., Anil, G.G., Jain, P ., Paul, S., Kusupati, A.: Elt: Elastic looped transformers for visual generation. arXiv preprint arXiv:2604.09168 (2026)

  26. [26]

    Co-Evolving Policy Distillation

    Gu, N., Yang, C., Si, Q., Qin, C., Yao, D., Fu, P ., Lin, Z., Wang, W., Duan, N., Wang, J.: Co-evolving policy distillation. arXiv preprint arXiv:2604.27083 (2026)

  27. [27]

    LTX-2: Efficient Joint Audio-Visual Foundation Model

    HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifschitz, D., Moshe, D., Porat, E., et al.: Ltx-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233 (2026)

  28. [28]

    Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

    He, Y., Kaur, S., Bhaskar, A., Yang, Y., Liu, J., Ri, N., Fowl, L., Panigrahi, A., Chen, D., Arora, S.: Self-distillation zero: Self-revision turns binary rewards into dense supervision. arXiv preprint arXiv:2604.12002 (2026) 19

  29. [29]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)

  30. [30]

    Advances in neural information processing systems30(2017)

    Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)

  31. [31]

    Distilling the Knowledge in a Neural Network

    Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

  32. [32]

    Advances in neural information processing systems33, 6840–6851 (2020)

    Ho, J., Jain, A., Abbeel, P .: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)

  33. [33]

    ICLR1(2), 3 (2022)

    Hu, E.J., Shen, Y., Wallis, P ., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)

  34. [34]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P ., Yu, G.: Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024)

  35. [35]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train- test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

  36. [36]

    Journal of quality technol- ogy18(4), 203–210 (1986)

    Hunter, J.S.: The exponentially weighted moving average. Journal of quality technol- ogy18(4), 203–210 (1986)

  37. [37]

    Stable On-Policy Distillation through Adaptive Target Reformulation

    Jang, I., Yeom, J., Yeo, J., Lim, H., Kim, T.: Stable on-policy distillation through adap- tive target reformulation. arXiv preprint arXiv:2601.07155 (2026)

  38. [38]

    Distribution matching distillation meets reinforcement learning.arXiv preprint arXiv:2511.13649, 2025

    Jiang, D., Liu, D., Wang, Z., Wu, Q., Li, L., Li, H., Jin, X., Liu, D., Li, Z., Zhang, B., et al.: Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649 (2025)

  39. [39]

    arXiv preprint arXiv:2505.02831 (2025)

    Jiang, D., Wang, M., Li, L., Zhang, L., Wang, H., Wei, W., Dai, G., Zhang, Y., Wang, J.: No other representation component is needed: Diffusion transformers can provide representation guidance by themselves. arXiv preprint arXiv:2505.02831 (2025)

  40. [40]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., Laine, S.: Analyzing and im- proving the training dynamics of diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24174–24184 (2024)

  41. [41]

    co/blog/kelseye/training-strategies-of-z-image-turbo(2025)

    Kelseye, Duan, Z.: Training strategies of z-image-turbo.https://huggingface. co/blog/kelseye/training-strategies-of-z-image-turbo(2025)

  42. [42]

    Advances in neural information processing systems36, 36652–36663 (2023)

    Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., Levy, O.: Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems36, 36652–36663 (2023)

  43. [43]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Kulikov, V ., Kleiner, M., Huberman-Spiegelglas, I., Michaeli, T.: Flowedit: Inversion- free text-based editing using pre-trained flow models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19721–19730 (2025)

  44. [44]

    The annals of mathemat- ical statistics22(1), 79–86 (1951)

    Kullback, S., Leibler, R.A.: On information and sufficiency. The annals of mathemat- ical statistics22(1), 79–86 (1951)

  45. [45]

    In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition

    Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customiza- tion of text-to-image diffusion. In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition. pp. 1931–1941 (2023)

  46. [46]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P ., et al.: Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)

  47. [47]

    RefTon: Reference person shot assist virtual Try-on

    Li, L., Gong, Y., Liu, S., Cheng, B., Ma, Y., Wu, L., Jiang, D., Wang, Z., Leng, D., Yin, Y.: Refvton: person-to-person try on with additional unpaired visual reference. arXiv preprint arXiv:2511.00956 (2025) 20

  48. [48]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for gener- ative modeling. arXiv preprint arXiv:2210.02747 (2022)

  49. [49]

    Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield.arXiv preprint arXiv:2511.22677, 2025

    Liu, D., Gao, P ., Liu, D., Du, R., Li, Z., Wu, Q., Jin, X., Cao, S., Zhang, S., Li, H., Hoi, S.: Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield. arXiv preprint arXiv:2511.22677 (2025)

  50. [50]

    Advances in neural infor- mation processing systems36, 34892–34916 (2023)

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural infor- mation processing systems36, 34892–34916 (2023)

  51. [51]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P ., Zhang, D., Ouyang, W.: Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470 (2025)

  52. [52]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

  53. [53]

    On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025

    Lu, K., Lab, T.M.: On-policy distillation. Thinking Machines Lab: Connectionism (2025). https://doi.org/10.64434/tml.20251026, https://thinkingmachines.ai/blog/on-policy-distillation

  54. [54]

    arXiv preprint arXiv:2507.18569 (2025)

    Lu, Y., Ren, Y., Xia, X., Lin, S., Wang, X., Xiao, X., Ma, A.J., Xie, X., Lai, J.H.: Adversar- ial distribution matching for diffusion distillation towards efficient image and video synthesis. arXiv preprint arXiv:2507.18569 (2025)

  55. [55]

    Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed

    Luhman, E., Luhman, T.: Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388 (2021)

  56. [56]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: Synthesiz- ing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)

  57. [57]

    Luo, W.: Diff-instruct++: Training one-step text-to-image generator model to align with human preferences. arXiv preprint arXiv:2410.18881 (2024)

  58. [58]

    Luo, W., Hu, T., Zhang, S., Sun, J., Li, Z., Zhang, Z.: Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models. Advances in Neural Information Processing Systems 36, 76525–76546 (2023)

  59. [59]

    Luo, Y., Hu, T., Luo, W., Tang, J.: Tdm-r1: Reinforcing few-step diffusion models with non-differentiable reward. arXiv preprint arXiv:2603.07700 (2026)

  60. [60]

    Learning few-step diffusion models by trajectory distribution matching

    Luo, Y., Hu, T., Sun, J., Cai, Y., Tang, J.: Learning few-step diffusion models by trajectory distribution matching. arXiv preprint arXiv:2503.06674 (2025)

  61. [61]

    Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024)

  62. [62]

    Miao, Z., Yang, Z., Lin, K., Wang, Z., Liu, Z., Wang, L., Qiu, Q.: Tuning timestep-distilled diffusion model using pairwise sample optimization. arXiv preprint arXiv:2410.03190 (2024)

  63. [63]

    OpenAI: Gpt-Image-1. https://openai.com/index/introducing-4o-image-generation/ (2025)

  64. [64]

    Ostris: Z-image-de-turbo. https://huggingface.co/ostris/Z-Image-De-Turbo (2025)

  65. [65]

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

  66. [66]

    Privileged Information Distillation for Language Models

    Penaloza, E., Vattikonda, D., Gontier, N., Lacoste, A., Charlin, L., Caccia, M.: Privileged information distillation for language models. arXiv preprint arXiv:2602.04942 (2026)

  67. [67]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

  68. [68]

    Qin, Q., Zhuo, L., Xin, Y., Du, R., Li, Z., Fu, B., Lu, Y., Li, X., Liu, D., Zhu, X., et al.: Lumina-image 2.0: A unified and efficient image generative framework. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20031–20042 (2025)

  69. [69]

    SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

    Qin, Y., Wang, L., Fei, H., Zimmermann, R., Bo, L., Lu, Q., Wang, C.: Soar: Self-correction for optimal alignment and refinement in diffusion models. arXiv preprint arXiv:2604.12617 (2026)

  70. [70]

    Qwen Team: Qwen3.5: Towards native multimodal agents. https://qwen.ai/blog?id=qwen3.5 (2026)

  71. [71]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

  72. [72]

    Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)

  73. [73]

    Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)

  74. [74]

    Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21(140), 1–67 (2020)

  75. [75]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1(2), 3 (2022)

  76. [76]

    Ren, Y., Xia, X., Lu, Y., Zhang, J., Wu, J., Xie, P., Wang, X., Xiao, X.: Hyper-sd: Trajectory segmented consistency model for efficient image synthesis. arXiv preprint arXiv:2404.13686 (2024)

  77. [77]

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

  78. [78]

    r/StableDiffusion, the Reddit Community: Z-image lora training. https://www.reddit.com/r/StableDiffusion/comments/1pj0469/zimage_lora_training/ (2025)

  79. [79]

    Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

  80. [80]

    CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

    Sang, H., Xu, Y., Zhou, Z., He, R., Wang, Z., Sun, J.: On-policy self-distillation for reasoning compression. arXiv preprint arXiv:2603.05433 (2026)

Showing first 80 references.