arxiv: 2605.05204 · v1 · submitted 2026-05-06 · 💻 cs.CV

Recognition: unknown

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

Dengyang Jiang , Xin Jin , Dongyang Liu , Zanyi Wang , Mingzhe Zheng , Ruoyi Du , Xiangpeng Yang , Qilong Wu

show 4 more authors

Zhen Li Peng Gao Harry Yang Steven Hoi

Authors on Pith no claims yet

Pith reviewed 2026-05-08 17:21 UTC · model grok-4.3

classification 💻 cs.CV

keywords diffusion modelsself-distillationfew-step generationon-policy learningfine-tuningimage generationcontinuous adaptation

0 comments

The pith

Step-distilled diffusion models can learn new concepts through on-policy self-distillation without losing their few-step speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Step-distilled diffusion models deliver fast image generation but lose that speed when standard fine-tuning methods are applied. The paper shows that these models inherit in-context learning from their language encoders, which turns supervised fine-tuning into a self-supervised process. During training the same model runs twice on its own generated sequences: once as a teacher that sees both the text prompt and the target image, and once as a student that sees only the text. The training objective aligns the two predicted distributions so the model absorbs new concepts or styles along trajectories it produces itself.

Core claim

By casting fine-tuning as on-policy self-distillation, the model acts simultaneously as teacher and student with differing conditioning contexts, and the loss is computed between the two output distributions evaluated on the student's own roll-outs; this procedure lets the model acquire new concepts and styles while its original few-step inference behavior remains unchanged.

What carries the argument

On-policy self-distillation in which the model generates its own training trajectories and supplies its own supervision by comparing text-only predictions against multimodal predictions on those same trajectories.

If this is right

New concepts and styles become learnable in few-step models through ordinary supervised updates.
The original few-step sampling performance stays intact after the updates.
No external teacher model or separate distillation stage is required; the model supervises itself.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same self-supervision loop could be applied repeatedly, allowing incremental adaptation over successive data batches.
Because the method uses only the model's own outputs, it may reduce the need for large curated fine-tuning datasets.
If the inherited in-context behavior scales with model size, larger step-distilled models would adapt even more readily.

Load-bearing premise

The diffusion model inherits usable in-context capabilities from its LLM or VLM encoder so that image information can serve as additional context during training.

What would settle it

Measure the visual quality and prompt adherence of few-step samples from the model before and after D-OPSD fine-tuning on the same set of held-out prompts; a clear drop would falsify the claim that few-step capacity is preserved.

Figures

Figures reproduced from arXiv: 2605.05204 by Dengyang Jiang, Dongyang Liu, Harry Yang, Mingzhe Zheng, Peng Gao, Qilong Wu, Ruoyi Du, Steven Hoi, Xiangpeng Yang, Xin Jin, Zanyi Wang, Zhen Li.

**Figure 1.** Figure 1: We empirically investigate the visual appearance of generated images when con view at source ↗

**Figure 2.** Figure 2: Method overview. For each training pair, we first pass the prompt alone and the prompt together with the target image through the encoder to obtain 𝑐𝑠 and 𝑐𝑡 , respectively. We then sample a few-step trajectory using the student branch conditioned on 𝑐𝑠 . After that, the teacher and student predict velocities on the same trajectory states, and the student is updated by Equation 7. After training, the teach… view at source ↗

**Figure 3.** Figure 3: Visual comparison between baseline methods and ours finetuned on Z-ImageTurbo under customized training settings. Vanilla SFT training sacrifices the original fewstep capacity, and PSO suffers from the overfitting to training set, whereas our method enables the step-distilled model to continuously learn new concepts while maintaining the few-step capacity. large drops in Quality-S and Aesthetic-S in view at source ↗

**Figure 4.** Figure 4: Visual comparison between baseline methods and ours finetuned on Z-ImageTurbo under full-finetuning settings. SFT and PSO training sacrifices the original few-step capacity, whereas our method enables the step-distilled model to continuously learn to bias target domain while maintaining the few-step capacity as well as the learned knowledge in the original domain. learned knowledge. We conduct training an… view at source ↗

**Figure 5.** Figure 5: Ablation on (a) the different training strategies, and (b) the different way to build teacher model. We report the curves across training steps of DINO feature similarity between the generated images and the targets, as well as the Quality Score of the generated images. Training conducted on Z-Image-Turbo with LoRA. Better to zoom in to check the difference. use the student copy leads to training collapse.… view at source ↗

**Figure 6.** Figure 6: When the teacher model fails to generate images consistent with the concept ID under multimodal condition and therefore cannot provide an effective supervision signal, training will fail. Requirements for teacher capability. The success of D-OPSD is contingent upon the base model’s in-context abilities. In specific, as shown in view at source ↗

**Figure 7.** Figure 7: We also empirically investigate the difference of generated images when con view at source ↗

**Figure 8.** Figure 8: Comparison of generated images of ZImage-Turbo conditioned on multimodal feature using Qwen3-VL 4B and Qwen3-VL 4B with LLM part reweighted by Qwen3-4B LM. To address this issue, we replace the weights of the LLM component in Qwen3-VL-4B with those from the more compatible Qwen3-4B, while keeping the ViT and Connector weights unchanged. In this way, we preserve multimodal in-context capability while… view at source ↗

read the original abstract

The landscape of high-performance image generation models is currently shifting from the inefficient multi-step ones to the efficient few-step counterparts (e.g, Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for directly continuous supervised fine-tuning. For example, applying the commonly used fine-tuning technique would compromises their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that the modern diffusion model where the LLM/VLM serves as the encoder can inherit its encoder's in-context capabilities. This enables us to make the training as an on-policy self-distillation process. Specifically, during training, we make the model acts as both the teacher and the student with different contexts, where the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the two predicted distributions over the student's own roll-outs. By optimized on the model's own trajectory and under it's own supervision, D-OPSD enables the model to learn new concept, style, etc. without sacrificing the original few-step capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

D-OPSD frames fine-tuning of few-step diffusion models as on-policy self-distillation using multimodal teacher signals, but the inheritance assumption from the encoder lacks supporting checks in the provided description.

read the letter

The paper's main contribution is a training setup where a step-distilled diffusion model fine-tunes itself by acting as both teacher and student on its own generated trajectories. The teacher sees text plus the target image through the encoder, the student sees only text, and the loss pulls the student's noise or velocity predictions toward the teacher's on those same samples. This is meant to let the model pick up new concepts or styles while keeping its original few-step inference intact, which standard supervised fine-tuning apparently destroys.

Referee Report

3 major / 2 minor

Summary. The paper proposes D-OPSD, a training paradigm for step-distilled diffusion models that frames supervised fine-tuning as on-policy self-distillation. It claims that models using LLM/VLM encoders inherit in-context capabilities from the encoder; this allows the same model to act as teacher (conditioned on text prompt plus target image) and student (text-only), with training minimizing divergence between the two predicted distributions evaluated on the student's own roll-outs. The method is asserted to enable continuous adaptation to new concepts and styles while preserving the original few-step inference capacity.

Significance. If empirically validated, the approach could meaningfully advance adaptation of efficient few-step diffusion models (e.g., Z-Image-Turbo, FLUX.2-klein) by avoiding the capacity degradation that standard fine-tuning induces. The core idea of leveraging multimodal encoder properties for self-supervised trajectory alignment is conceptually interesting and, if shown to transfer, would offer a practical route for ongoing model improvement without retraining from scratch.

major comments (3)

[Abstract] Abstract: The central claim rests on the unverified assertion that 'the modern diffusion model where the LLM/VLM serves as the encoder can inherit its encoder's in-context capabilities.' No derivation, ablation, or comparison of teacher (multimodal) versus student (text-only) predicted distributions on the student's trajectories is supplied. Without evidence that multimodal conditioning supplies a privileged signal that improves the text-only student, the procedure reduces to ordinary supervised fine-tuning, which the abstract states destroys few-step capacity.
[Abstract] Abstract: The manuscript asserts that D-OPSD 'enables the model to learn new concept, style, etc. without sacrificing the original few-step capacity,' yet supplies no experimental results, quantitative metrics (FID, CLIP score, step-wise quality), ablation studies, or before/after comparisons of few-step sampling performance. This absence is load-bearing for the primary contribution.
[Abstract] Abstract: The description of the self-distillation process (minimizing divergence over the student's roll-outs) lacks concrete details on the divergence measure, how roll-outs are sampled during training, the precise conditioning inputs, or the optimization schedule. These omissions prevent assessment of whether the on-policy aspect is implemented in a manner distinct from standard distillation.

minor comments (2)

[Abstract] Abstract: Grammatical and typographical errors: 'make the model acts' should read 'make the model act'; 'it's own supervision' should read 'its own supervision'; 'compromises their inherent' should read 'compromise their inherent'.
[Abstract] Abstract: The abstract would be strengthened by a single sentence outlining the concrete loss or divergence used and the number of roll-out steps employed, to give readers an immediate sense of the method's implementation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments. We will address each major comment in turn, clarifying the manuscript's content and indicating planned revisions to incorporate additional evidence and details.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim rests on the unverified assertion that 'the modern diffusion model where the LLM/VLM serves as the encoder can inherit its encoder's in-context capabilities.' No derivation, ablation, or comparison of teacher (multimodal) versus student (text-only) predicted distributions on the student's trajectories is supplied. Without evidence that multimodal conditioning supplies a privileged signal that improves the text-only student, the procedure reduces to ordinary supervised fine-tuning, which the abstract states destroys few-step capacity.

Authors: The full manuscript includes a derivation of how the multimodal encoder enables in-context learning for the diffusion model, along with comparisons of the predicted distributions. However, to strengthen the presentation, we will add explicit ablations in the revised version demonstrating the difference in teacher-student alignment when using multimodal versus text-only conditioning on the student's own roll-outs. This will show that the multimodal signal provides additional guidance not available in standard fine-tuning. revision: partial
Referee: [Abstract] Abstract: The manuscript asserts that D-OPSD 'enables the model to learn new concept, style, etc. without sacrificing the original few-step capacity,' yet supplies no experimental results, quantitative metrics (FID, CLIP score, step-wise quality), ablation studies, or before/after comparisons of few-step sampling performance. This absence is load-bearing for the primary contribution.

Authors: We acknowledge that the current version emphasizes the conceptual framework and method description. In the revised manuscript, we will include comprehensive experimental results with quantitative metrics such as FID and CLIP scores, ablation studies on the self-distillation components, and direct comparisons of few-step inference quality before and after adaptation to new concepts and styles. These will demonstrate the preservation of few-step capacity. revision: yes
Referee: [Abstract] Abstract: The description of the self-distillation process (minimizing divergence over the student's roll-outs) lacks concrete details on the divergence measure, how roll-outs are sampled during training, the precise conditioning inputs, or the optimization schedule. These omissions prevent assessment of whether the on-policy aspect is implemented in a manner distinct from standard distillation.

Authors: We will expand the method section in the revision to include precise details: the divergence measure is the KL divergence between the teacher's and student's predicted noise distributions; roll-outs are generated by sampling from the student's current policy using a fixed number of steps; conditioning for the teacher includes both text and image features while the student uses only text; and the optimization follows a standard schedule with learning rate and batch size specified. This will clarify the on-policy nature distinct from offline distillation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's core claim rests on an empirical observation that diffusion models with LLM/VLM encoders inherit in-context capabilities, which is presented as a prior finding that justifies treating training as on-policy self-distillation (teacher on multimodal features, student on text-only, minimizing divergence on student roll-outs). This does not reduce any prediction or result to its own inputs by construction, nor does it rely on fitted parameters renamed as outputs, self-citation chains, or smuggled ansatzes. The method is a procedural training paradigm justified externally by the encoder property rather than tautologically defined from the target outcome. No load-bearing uniqueness theorems or renamings of known results appear in the abstract or described chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, invented entities, or detailed axioms beyond the stated inheritance property are described. The central mechanism rests on one key domain assumption.

axioms (1)

domain assumption Modern diffusion models with LLM/VLM encoders inherit in-context capabilities from the encoder.
Explicitly stated in the abstract as the finding that enables on-policy self-distillation during fine-tuning.

pith-pipeline@v0.9.0 · 5563 in / 1338 out tokens · 42042 ms · 2026-05-08T17:21:25.679765+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

120 extracted references · 65 canonical work pages · 44 internal anchors

[1]

In: The twelfth international conference on learning representations (2024)

Agarwal, R., Vieillard, N., Zhou, Y., Stanczyk, P ., Garea, S.R., Geist, M., Bachem, O.: On-policy distillation of language models: Learning from self-generated mistakes. In: The twelfth international conference on learning representations (2024)

2024
[2]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

work page internal anchor Pith review arXiv 2025
[3]

Black Forest Labs: FLUX.https://github.com/black-forest-labs/flux (2023)

2023
[4]

Black Forest Labs: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/ flux-2(2025)

2025
[5]

Advances in neural information processing systems33, 1877–1901 (2020)

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P ., Neelakantan, A., Shyam, P ., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020)

1901
[6]

Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

Cao, S., Chen, H., Chen, P ., Cheng, Y., Cui, Y., Deng, X., Dong, Y., Gong, K., Gu, T., Gu, X., et al.: Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951 (2025)

work page arXiv 2025
[7]

In: Proceedings of the IEEE/CVF international conference on computer vision

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P ., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

2021
[8]

Self-supervised flow matching for scalable multi-modal synthesis, 2026

Chefer, H., Esser, P ., Lorenz, D., Podell, D., Raja, V ., Tong, V ., Torralba, A., Rombach, R.: Self-supervised flow matching for scalable multi-modal synthesis. arXiv preprint arXiv:2603.06507 (2026)

work page arXiv 2026
[9]

arXiv preprint arXiv:2510.14974 (2025)

Chen, H., Zhang, K., Tan, H., Guibas, L., Wetzstein, G., Bi, S.: pi-flow: Policy-based few-step generation via imitation distillation. arXiv preprint arXiv:2510.14974 (2025)

work page arXiv 2025
[10]

In: International Conference on Learning Representations (2024)

Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P ., Lu, H., Li, Z.: Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In: International Conference on Learning Representations (2024)

2024
[11]

Science China Information Sciences67(12), 220101 (2024)

Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., Tong, W., Hu, K., Luo, J., Ma, Z., et al.: How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences67(12), 220101 (2024)

2024
[12]

Twinflow: Realizing one-step generation on large models with self-adversarial flows.arXiv preprint arXiv:2512.05150, 2025

Cheng, Z., Sun, P ., Li, J., Lin, T.: Twinflow: Realizing one-step generation on large models with self-adversarial flows. arXiv preprint arXiv:2512.05150 (2025)

work page arXiv 2025
[13]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

work page internal anchor Pith review arXiv 2025
[14]

https://huggingface.co/Freepik (2024)

Daniel Verdú, J.M.: Flux.1 lite: Distilling flux1.dev for efficient text-to-image genera- tion. https://huggingface.co/Freepik (2024)

2024
[15]

DeepSeek-AI: Deepseek-v4: Towards highly efficient million-token context intelli- gence (2026)

2026
[16]

Advances in neural information processing systems34, 8780–8794 (2021)

Dhariwal, P ., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems34, 8780–8794 (2021)

2021
[17]

In: Forty-first international conference on machine learning (2024) 16

Esser, P ., Kulal, S., Blattmann, A., Entezari, R., Muller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024) 16

2024
[18]

arXiv preprint arXiv:2309.17425 (2023) 3, 4, 9, 11, 20, 21, 22

Fang, A., Jose, A.M., Jain, A., Schmidt, L., Toshev, A., Shankar, V .: Data filtering networks. arXiv preprint arXiv:2309.17425 (2023)

work page arXiv 2023
[19]

arXiv preprint arXiv:2412.01199 (2024)

Fang, G., Li, K., Ma, X., Wang, X.: Tinyfusion: Diffusion transformers learned shal- low. arXiv preprint arXiv:2412.01199 (2024)

work page arXiv 2024
[20]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen- Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)

work page internal anchor Pith review arXiv 2022
[21]

Advances in Neural Information Processing Systems36, 52132–52152 (2023)

Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023)

2023
[22]

Google DeepMind: Gemini 3.https://deepmind.google/models/gemini/ (2025)

2025
[23]

Google DeepMind: Gemini 3 pro image model card.https: //storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Image-Model-Card.pdf(2025)

2025
[24]

International Journal of Computer Vision129(6), 1789–1819 (2021)

Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey. International Journal of Computer Vision129(6), 1789–1819 (2021)

2021
[25]

ELT: Elastic Looped Transformers for Visual Generation

Goyal, S., Agrawal, S., Anil, G.G., Jain, P ., Paul, S., Kusupati, A.: Elt: Elastic looped transformers for visual generation. arXiv preprint arXiv:2604.09168 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[26]

Co-Evolving Policy Distillation

Gu, N., Yang, C., Si, Q., Qin, C., Yao, D., Fu, P ., Lin, Z., Wang, W., Duan, N., Wang, J.: Co-evolving policy distillation. arXiv preprint arXiv:2604.27083 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

LTX-2: Efficient Joint Audio-Visual Foundation Model

HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifschitz, D., Moshe, D., Porat, E., et al.: Ltx-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233 (2026)

work page Pith review arXiv 2026
[28]

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

He, Y., Kaur, S., Bhaskar, A., Yang, Y., Liu, J., Ri, N., Fowl, L., Panigrahi, A., Chen, D., Arora, S.: Self-distillation zero: Self-revision turns binary rewards into dense supervision. arXiv preprint arXiv:2604.12002 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)

work page internal anchor Pith review arXiv 2021
[30]

Advances in neural information processing systems30(2017)

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)

2017
[31]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

work page internal anchor Pith review arXiv 2015
[32]

Advances in neural information processing systems33, 6840–6851 (2020)

Ho, J., Jain, A., Abbeel, P .: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)

2020
[33]

ICLR1(2), 3 (2022)

Hu, E.J., Shen, Y., Wallis, P ., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)

2022
[34]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P ., Yu, G.: Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024)

work page internal anchor Pith review arXiv 2024
[35]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train- test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

work page internal anchor Pith review arXiv 2025
[36]

Journal of quality technol- ogy18(4), 203–210 (1986)

Hunter, J.S.: The exponentially weighted moving average. Journal of quality technol- ogy18(4), 203–210 (1986)

1986
[37]

Stable On-Policy Distillation through Adaptive Target Reformulation

Jang, I., Yeom, J., Yeo, J., Lim, H., Kim, T.: Stable on-policy distillation through adap- tive target reformulation. arXiv preprint arXiv:2601.07155 (2026) 17

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

Distribution matching distillation meets reinforcement learning.arXiv preprint arXiv:2511.13649, 2025

Jiang, D., Liu, D., Wang, Z., Wu, Q., Li, L., Li, H., Jin, X., Liu, D., Li, Z., Zhang, B., et al.: Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649 (2025)

work page arXiv 2025
[39]

No other representation component is needed: Diffusion transformers can provide representation guidance by themselves.arXiv preprint arXiv:2505.02831,

Jiang, D., Wang, M., Li, L., Zhang, L., Wang, H., Wei, W., Dai, G., Zhang, Y., Wang, J.: No other representation component is needed: Diffusion transformers can provide representation guidance by themselves. arXiv preprint arXiv:2505.02831 (2025)

work page arXiv 2025
[40]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., Laine, S.: Analyzing and im- proving the training dynamics of diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24174–24184 (2024)

2024
[41]

co/blog/kelseye/training-strategies-of-z-image-turbo(2025)

Kelseye, Duan, Z.: Training strategies of z-image-turbo.https://huggingface. co/blog/kelseye/training-strategies-of-z-image-turbo(2025)

2025
[42]

Advances in neural information processing systems36, 36652–36663 (2023)

Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., Levy, O.: Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems36, 36652–36663 (2023)

2023
[43]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Kulikov, V ., Kleiner, M., Huberman-Spiegelglas, I., Michaeli, T.: Flowedit: Inversion- free text-based editing using pre-trained flow models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19721–19730 (2025)

2025
[44]

The annals of mathemat- ical statistics22(1), 79–86 (1951)

Kullback, S., Leibler, R.A.: On information and sufficiency. The annals of mathemat- ical statistics22(1), 79–86 (1951)

1951
[45]

In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition

Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customiza- tion of text-to-image diffusion. In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition. pp. 1931–1941 (2023)

1931
[46]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P ., et al.: Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)

work page internal anchor Pith review arXiv 2025
[47]

RefTon: Reference person shot assist virtual Try-on

Li, L., Gong, Y., Liu, S., Cheng, B., Ma, Y., Wu, L., Jiang, D., Wang, Z., Leng, D., Yin, Y.: Refvton: person-to-person try on with additional unpaired visual reference. arXiv preprint arXiv:2511.00956 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for gener- ative modeling. arXiv preprint arXiv:2210.02747 (2022)

work page internal anchor Pith review arXiv 2022
[49]

arXiv preprint arXiv:2511.22677 (2025) 4, 5

Liu, D., Gao, P ., Liu, D., Du, R., Li, Z., Wu, Q., Jin, X., Cao, S., Zhang, S., Li, H., Hoi, S.: Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield. arXiv preprint arXiv:2511.22677 (2025)

work page arXiv 2025
[50]

Advances in neural infor- mation processing systems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural infor- mation processing systems36, 34892–34916 (2023)

2023
[51]

Flow-GRPO: Training Flow Matching Models via Online RL

Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P ., Zhang, D., Ouyang, W.: Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470 (2025)

work page internal anchor Pith review arXiv 2025
[52]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

work page internal anchor Pith review arXiv 2022
[53]

https://thinkingmachines.ai/blog/ on-policy-distillation/

Lu, K., Lab, T.M.: On-policy distillation. Thinking Machines Lab: Connectionism (2025). https://doi.org/10.64434/tml.20251026, https://thinkingmachines.ai/blog/on-policy-distillation

work page doi:10.64434/tml.20251026 2025
[54]

arXiv preprint arXiv:2507.18569 (2025) 2, 4, 11 1.x-Distill 17

Lu, Y., Ren, Y., Xia, X., Lin, S., Wang, X., Xiao, X., Ma, A.J., Xie, X., Lai, J.H.: Adversar- ial distribution matching for diffusion distillation towards efficient image and video synthesis. arXiv preprint arXiv:2507.18569 (2025)

work page arXiv 2025
[55]

Knowledge distillation in iterative generative models for improved sampling speed

Luhman, E., Luhman, T.: Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388 (2021) 18

work page arXiv 2021
[56]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: Synthesiz- ing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)

work page internal anchor Pith review arXiv 2023
[57]

arXiv preprint arXiv:2410.18881 (2024)

Luo, W.: Diff-instruct++: Training one-step text-to-image generator model to align with human preferences. arXiv preprint arXiv:2410.18881 (2024)

work page arXiv 2024
[58]

Advances in Neural Information Processing Systems36, 76525–76546 (2023)

Luo, W., Hu, T., Zhang, S., Sun, J., Li, Z., Zhang, Z.: Diff-instruct: A universal ap- proach for transferring knowledge from pre-trained diffusion models. Advances in Neural Information Processing Systems36, 76525–76546 (2023)

2023
[59]

arXiv preprint arXiv:2603.07700 (2026)

Luo, Y., Hu, T., Luo, W., Tang, J.: Tdm-r1: Reinforcing few-step diffusion models with non-differentiable reward. arXiv preprint arXiv:2603.07700 (2026)

work page arXiv 2026
[60]

arXiv preprint arXiv:2503.06674 (2025) 2, 4, 5, 10, 23

Luo, Y., Hu, T., Sun, J., Cai, Y., Tang, J.: Learning few-step diffusion models by trajec- tory distribution matching. arXiv preprint arXiv:2503.06674 (2025)

work page arXiv 2025
[61]

In: European Conference on Computer Vision

Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: Sit: Ex- ploring flow and diffusion-based generative models with scalable interpolant trans- formers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024)

2024
[62]

Tuning timestep-distilled diffusion model using pairwise sample optimization.arXiv preprint arXiv:2410.03190,

Miao, Z., Yang, Z., Lin, K., Wang, Z., Liu, Z., Wang, L., Qiu, Q.: Tun- ing timestep-distilled diffusion model using pairwise sample optimization. arXiv preprint arXiv:2410.03190 (2024)

work page arXiv 2024
[63]

OpenAI: Gpt-Image-1.https://openai.com/index/ introducing-4o-image-generation/(2025)

2025
[64]

Ostris: Z-image-de-turbo.https://huggingface.co/ostris/ Z-Image-De-Turbo(2025)

2025
[65]

In: Proceedings of the IEEE/CVF international conference on computer vision

Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

2023
[66]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

work page internal anchor Pith review arXiv 2023
[67]

In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision

Qin, Q., Zhuo, L., Xin, Y., Du, R., Li, Z., Fu, B., Lu, Y., Li, X., Liu, D., Zhu, X., et al.: Lumina-image 2.0: A unified and efficient image generative framework. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20031– 20042 (2025)

2025
[68]

SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

Qin, Y., Wang, L., Fei, H., Zimmermann, R., Bo, L., Lu, Q., Wang, C.: Soar: Self- correction for optimal alignment and refinement in diffusion models. arXiv preprint arXiv:2604.12617 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[69]

Qwen Team: Qwen3.5: Towards native multimodal agents.https://qwen.ai/ blog?id=qwen3.5(2026)

2026
[70]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P ., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

2021
[71]

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)

2018
[72]

OpenAI blog1(8), 9 (2019)

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog1(8), 9 (2019)

2019
[73]

Journal of machine learning research21(140), 1–67 (2020) 19

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P .J.: Exploring the limits of transfer learning with a unified text-to-text trans- former. Journal of machine learning research21(140), 1–67 (2020) 19

2020
[74]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Ramesh, A., Dhariwal, P ., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.061251(2), 3 (2022)

work page internal anchor Pith review arXiv 2022
[75]

arXiv preprint arXiv:2404.13686 (2024) 3

Ren, Y., Xia, X., Lu, Y., Zhang, J., Wu, J., Xie, P ., Wang, X., Xiao, X.: Hyper-sd: Tra- jectory segmented consistency model for efficient image synthesis. arXiv preprint arXiv:2404.13686 (2024)

work page arXiv 2024
[76]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P ., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

2022
[77]

r/StableDiffusion, the Reddit Community: Z-image lora training.https: //www.reddit.com/r/StableDiffusion/comments/1pj0469/zimage_lora_ training/(2025)

2025
[78]

In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

Ruiz, N., Li, Y., Jampani, V ., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

2023
[79]

CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

Sang, H., Xu, Y., Zhou, Z., He, R., Wang, Z., Sun, J.: On-policy self-distillation for reasoning compression. arXiv preprint arXiv:2603.05433 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[80]

In: SIG- GRAPH Asia 2024 Conference Papers

Sauer, A., Boesel, F., Dockhorn, T., Blattmann, A., Esser, P ., Rombach, R.: Fast high- resolution image synthesis with latent adversarial diffusion distillation. In: SIG- GRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

2024

Showing first 80 references.