Recognition: unknown
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
Pith reviewed 2026-05-08 17:21 UTC · model grok-4.3
The pith
Step-distilled diffusion models can learn new concepts through on-policy self-distillation without losing their few-step speed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By casting fine-tuning as on-policy self-distillation, the model acts simultaneously as teacher and student with differing conditioning contexts, and the loss is computed between the two output distributions evaluated on the student's own roll-outs; this procedure lets the model acquire new concepts and styles while its original few-step inference behavior remains unchanged.
What carries the argument
On-policy self-distillation in which the model generates its own training trajectories and supplies its own supervision by comparing text-only predictions against multimodal predictions on those same trajectories.
If this is right
- New concepts and styles become learnable in few-step models through ordinary supervised updates.
- The original few-step sampling performance stays intact after the updates.
- No external teacher model or separate distillation stage is required; the model supervises itself.
Where Pith is reading between the lines
- The same self-supervision loop could be applied repeatedly, allowing incremental adaptation over successive data batches.
- Because the method uses only the model's own outputs, it may reduce the need for large curated fine-tuning datasets.
- If the inherited in-context behavior scales with model size, larger step-distilled models would adapt even more readily.
Load-bearing premise
The diffusion model inherits usable in-context capabilities from its LLM or VLM encoder so that image information can serve as additional context during training.
What would settle it
Measure the visual quality and prompt adherence of few-step samples from the model before and after D-OPSD fine-tuning on the same set of held-out prompts; a clear drop would falsify the claim that few-step capacity is preserved.
Figures
read the original abstract
The landscape of high-performance image generation models is currently shifting from the inefficient multi-step ones to the efficient few-step counterparts (e.g, Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for directly continuous supervised fine-tuning. For example, applying the commonly used fine-tuning technique would compromises their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that the modern diffusion model where the LLM/VLM serves as the encoder can inherit its encoder's in-context capabilities. This enables us to make the training as an on-policy self-distillation process. Specifically, during training, we make the model acts as both the teacher and the student with different contexts, where the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the two predicted distributions over the student's own roll-outs. By optimized on the model's own trajectory and under it's own supervision, D-OPSD enables the model to learn new concept, style, etc. without sacrificing the original few-step capacity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes D-OPSD, a training paradigm for step-distilled diffusion models that frames supervised fine-tuning as on-policy self-distillation. It claims that models using LLM/VLM encoders inherit in-context capabilities from the encoder; this allows the same model to act as teacher (conditioned on text prompt plus target image) and student (text-only), with training minimizing divergence between the two predicted distributions evaluated on the student's own roll-outs. The method is asserted to enable continuous adaptation to new concepts and styles while preserving the original few-step inference capacity.
Significance. If empirically validated, the approach could meaningfully advance adaptation of efficient few-step diffusion models (e.g., Z-Image-Turbo, FLUX.2-klein) by avoiding the capacity degradation that standard fine-tuning induces. The core idea of leveraging multimodal encoder properties for self-supervised trajectory alignment is conceptually interesting and, if shown to transfer, would offer a practical route for ongoing model improvement without retraining from scratch.
major comments (3)
- [Abstract] Abstract: The central claim rests on the unverified assertion that 'the modern diffusion model where the LLM/VLM serves as the encoder can inherit its encoder's in-context capabilities.' No derivation, ablation, or comparison of teacher (multimodal) versus student (text-only) predicted distributions on the student's trajectories is supplied. Without evidence that multimodal conditioning supplies a privileged signal that improves the text-only student, the procedure reduces to ordinary supervised fine-tuning, which the abstract states destroys few-step capacity.
- [Abstract] Abstract: The manuscript asserts that D-OPSD 'enables the model to learn new concept, style, etc. without sacrificing the original few-step capacity,' yet supplies no experimental results, quantitative metrics (FID, CLIP score, step-wise quality), ablation studies, or before/after comparisons of few-step sampling performance. This absence is load-bearing for the primary contribution.
- [Abstract] Abstract: The description of the self-distillation process (minimizing divergence over the student's roll-outs) lacks concrete details on the divergence measure, how roll-outs are sampled during training, the precise conditioning inputs, or the optimization schedule. These omissions prevent assessment of whether the on-policy aspect is implemented in a manner distinct from standard distillation.
minor comments (2)
- [Abstract] Abstract: Grammatical and typographical errors: 'make the model acts' should read 'make the model act'; 'it's own supervision' should read 'its own supervision'; 'compromises their inherent' should read 'compromise their inherent'.
- [Abstract] Abstract: The abstract would be strengthened by a single sentence outlining the concrete loss or divergence used and the number of roll-out steps employed, to give readers an immediate sense of the method's implementation.
Simulated Author's Rebuttal
We thank the referee for their insightful comments. We will address each major comment in turn, clarifying the manuscript's content and indicating planned revisions to incorporate additional evidence and details.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim rests on the unverified assertion that 'the modern diffusion model where the LLM/VLM serves as the encoder can inherit its encoder's in-context capabilities.' No derivation, ablation, or comparison of teacher (multimodal) versus student (text-only) predicted distributions on the student's trajectories is supplied. Without evidence that multimodal conditioning supplies a privileged signal that improves the text-only student, the procedure reduces to ordinary supervised fine-tuning, which the abstract states destroys few-step capacity.
Authors: The full manuscript includes a derivation of how the multimodal encoder enables in-context learning for the diffusion model, along with comparisons of the predicted distributions. However, to strengthen the presentation, we will add explicit ablations in the revised version demonstrating the difference in teacher-student alignment when using multimodal versus text-only conditioning on the student's own roll-outs. This will show that the multimodal signal provides additional guidance not available in standard fine-tuning. revision: partial
-
Referee: [Abstract] Abstract: The manuscript asserts that D-OPSD 'enables the model to learn new concept, style, etc. without sacrificing the original few-step capacity,' yet supplies no experimental results, quantitative metrics (FID, CLIP score, step-wise quality), ablation studies, or before/after comparisons of few-step sampling performance. This absence is load-bearing for the primary contribution.
Authors: We acknowledge that the current version emphasizes the conceptual framework and method description. In the revised manuscript, we will include comprehensive experimental results with quantitative metrics such as FID and CLIP scores, ablation studies on the self-distillation components, and direct comparisons of few-step inference quality before and after adaptation to new concepts and styles. These will demonstrate the preservation of few-step capacity. revision: yes
-
Referee: [Abstract] Abstract: The description of the self-distillation process (minimizing divergence over the student's roll-outs) lacks concrete details on the divergence measure, how roll-outs are sampled during training, the precise conditioning inputs, or the optimization schedule. These omissions prevent assessment of whether the on-policy aspect is implemented in a manner distinct from standard distillation.
Authors: We will expand the method section in the revision to include precise details: the divergence measure is the KL divergence between the teacher's and student's predicted noise distributions; roll-outs are generated by sampling from the student's current policy using a fixed number of steps; conditioning for the teacher includes both text and image features while the student uses only text; and the optimization follows a standard schedule with learning rate and batch size specified. This will clarify the on-policy nature distinct from offline distillation. revision: yes
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper's core claim rests on an empirical observation that diffusion models with LLM/VLM encoders inherit in-context capabilities, which is presented as a prior finding that justifies treating training as on-policy self-distillation (teacher on multimodal features, student on text-only, minimizing divergence on student roll-outs). This does not reduce any prediction or result to its own inputs by construction, nor does it rely on fitted parameters renamed as outputs, self-citation chains, or smuggled ansatzes. The method is a procedural training paradigm justified externally by the encoder property rather than tautologically defined from the target outcome. No load-bearing uniqueness theorems or renamings of known results appear in the abstract or described chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Modern diffusion models with LLM/VLM encoders inherit in-context capabilities from the encoder.
Reference graph
Works this paper leans on
-
[1]
In: The twelfth international conference on learning representations (2024)
Agarwal, R., Vieillard, N., Zhou, Y., Stanczyk, P ., Garea, S.R., Geist, M., Bachem, O.: On-policy distillation of language models: Learning from self-generated mistakes. In: The twelfth international conference on learning representations (2024)
2024
-
[2]
Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)
work page internal anchor Pith review arXiv 2025
-
[3]
Black Forest Labs: FLUX.https://github.com/black-forest-labs/flux (2023)
2023
-
[4]
Black Forest Labs: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/ flux-2(2025)
2025
-
[5]
Advances in neural information processing systems33, 1877–1901 (2020)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P ., Neelakantan, A., Shyam, P ., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020)
1901
-
[6]
Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025
Cao, S., Chen, H., Chen, P ., Cheng, Y., Cui, Y., Deng, X., Dong, Y., Gong, K., Gu, T., Gu, X., et al.: Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951 (2025)
-
[7]
In: Proceedings of the IEEE/CVF international conference on computer vision
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P ., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)
2021
-
[8]
Self-supervised flow matching for scalable multi-modal synthesis, 2026
Chefer, H., Esser, P ., Lorenz, D., Podell, D., Raja, V ., Tong, V ., Torralba, A., Rombach, R.: Self-supervised flow matching for scalable multi-modal synthesis. arXiv preprint arXiv:2603.06507 (2026)
-
[9]
arXiv preprint arXiv:2510.14974 (2025)
Chen, H., Zhang, K., Tan, H., Guibas, L., Wetzstein, G., Bi, S.: pi-flow: Policy-based few-step generation via imitation distillation. arXiv preprint arXiv:2510.14974 (2025)
-
[10]
In: International Conference on Learning Representations (2024)
Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P ., Lu, H., Li, Z.: Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In: International Conference on Learning Representations (2024)
2024
-
[11]
Science China Information Sciences67(12), 220101 (2024)
Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., Tong, W., Hu, K., Luo, J., Ma, Z., et al.: How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences67(12), 220101 (2024)
2024
-
[12]
Cheng, Z., Sun, P ., Li, J., Lin, T.: Twinflow: Realizing one-step generation on large models with self-adversarial flows. arXiv preprint arXiv:2512.05150 (2025)
-
[13]
Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
work page internal anchor Pith review arXiv 2025
-
[14]
https://huggingface.co/Freepik (2024)
Daniel Verdú, J.M.: Flux.1 lite: Distilling flux1.dev for efficient text-to-image genera- tion. https://huggingface.co/Freepik (2024)
2024
-
[15]
DeepSeek-AI: Deepseek-v4: Towards highly efficient million-token context intelli- gence (2026)
2026
-
[16]
Advances in neural information processing systems34, 8780–8794 (2021)
Dhariwal, P ., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems34, 8780–8794 (2021)
2021
-
[17]
In: Forty-first international conference on machine learning (2024) 16
Esser, P ., Kulal, S., Blattmann, A., Entezari, R., Muller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024) 16
2024
-
[18]
arXiv preprint arXiv:2309.17425 (2023) 3, 4, 9, 11, 20, 21, 22
Fang, A., Jose, A.M., Jain, A., Schmidt, L., Toshev, A., Shankar, V .: Data filtering networks. arXiv preprint arXiv:2309.17425 (2023)
-
[19]
arXiv preprint arXiv:2412.01199 (2024)
Fang, G., Li, K., Ma, X., Wang, X.: Tinyfusion: Diffusion transformers learned shal- low. arXiv preprint arXiv:2412.01199 (2024)
-
[20]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen- Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)
work page internal anchor Pith review arXiv 2022
-
[21]
Advances in Neural Information Processing Systems36, 52132–52152 (2023)
Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023)
2023
-
[22]
Google DeepMind: Gemini 3.https://deepmind.google/models/gemini/ (2025)
2025
-
[23]
Google DeepMind: Gemini 3 pro image model card.https: //storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Image-Model-Card.pdf(2025)
2025
-
[24]
International Journal of Computer Vision129(6), 1789–1819 (2021)
Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey. International Journal of Computer Vision129(6), 1789–1819 (2021)
2021
-
[25]
ELT: Elastic Looped Transformers for Visual Generation
Goyal, S., Agrawal, S., Anil, G.G., Jain, P ., Paul, S., Kusupati, A.: Elt: Elastic looped transformers for visual generation. arXiv preprint arXiv:2604.09168 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[26]
Co-Evolving Policy Distillation
Gu, N., Yang, C., Si, Q., Qin, C., Yao, D., Fu, P ., Lin, Z., Wang, W., Duan, N., Wang, J.: Co-evolving policy distillation. arXiv preprint arXiv:2604.27083 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[27]
LTX-2: Efficient Joint Audio-Visual Foundation Model
HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifschitz, D., Moshe, D., Porat, E., et al.: Ltx-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233 (2026)
work page Pith review arXiv 2026
-
[28]
Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision
He, Y., Kaur, S., Bhaskar, A., Yang, Y., Liu, J., Ri, N., Fowl, L., Panigrahi, A., Chen, D., Arora, S.: Self-distillation zero: Self-revision turns binary rewards into dense supervision. arXiv preprint arXiv:2604.12002 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[29]
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)
work page internal anchor Pith review arXiv 2021
-
[30]
Advances in neural information processing systems30(2017)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)
2017
-
[31]
Distilling the Knowledge in a Neural Network
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
work page internal anchor Pith review arXiv 2015
-
[32]
Advances in neural information processing systems33, 6840–6851 (2020)
Ho, J., Jain, A., Abbeel, P .: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)
2020
-
[33]
ICLR1(2), 3 (2022)
Hu, E.J., Shen, Y., Wallis, P ., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)
2022
-
[34]
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P ., Yu, G.: Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024)
work page internal anchor Pith review arXiv 2024
-
[35]
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train- test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)
work page internal anchor Pith review arXiv 2025
-
[36]
Journal of quality technol- ogy18(4), 203–210 (1986)
Hunter, J.S.: The exponentially weighted moving average. Journal of quality technol- ogy18(4), 203–210 (1986)
1986
-
[37]
Stable On-Policy Distillation through Adaptive Target Reformulation
Jang, I., Yeom, J., Yeo, J., Lim, H., Kim, T.: Stable on-policy distillation through adap- tive target reformulation. arXiv preprint arXiv:2601.07155 (2026) 17
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[38]
Jiang, D., Liu, D., Wang, Z., Wu, Q., Li, L., Li, H., Jin, X., Liu, D., Li, Z., Zhang, B., et al.: Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649 (2025)
-
[39]
Jiang, D., Wang, M., Li, L., Zhang, L., Wang, H., Wei, W., Dai, G., Zhang, Y., Wang, J.: No other representation component is needed: Diffusion transformers can provide representation guidance by themselves. arXiv preprint arXiv:2505.02831 (2025)
-
[40]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., Laine, S.: Analyzing and im- proving the training dynamics of diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24174–24184 (2024)
2024
-
[41]
co/blog/kelseye/training-strategies-of-z-image-turbo(2025)
Kelseye, Duan, Z.: Training strategies of z-image-turbo.https://huggingface. co/blog/kelseye/training-strategies-of-z-image-turbo(2025)
2025
-
[42]
Advances in neural information processing systems36, 36652–36663 (2023)
Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., Levy, O.: Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems36, 36652–36663 (2023)
2023
-
[43]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Kulikov, V ., Kleiner, M., Huberman-Spiegelglas, I., Michaeli, T.: Flowedit: Inversion- free text-based editing using pre-trained flow models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19721–19730 (2025)
2025
-
[44]
The annals of mathemat- ical statistics22(1), 79–86 (1951)
Kullback, S., Leibler, R.A.: On information and sufficiency. The annals of mathemat- ical statistics22(1), 79–86 (1951)
1951
-
[45]
In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition
Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customiza- tion of text-to-image diffusion. In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition. pp. 1931–1941 (2023)
1931
-
[46]
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P ., et al.: Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)
work page internal anchor Pith review arXiv 2025
-
[47]
RefTon: Reference person shot assist virtual Try-on
Li, L., Gong, Y., Liu, S., Cheng, B., Ma, Y., Wu, L., Jiang, D., Wang, Z., Leng, D., Yin, Y.: Refvton: person-to-person try on with additional unpaired visual reference. arXiv preprint arXiv:2511.00956 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
Flow Matching for Generative Modeling
Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for gener- ative modeling. arXiv preprint arXiv:2210.02747 (2022)
work page internal anchor Pith review arXiv 2022
-
[49]
arXiv preprint arXiv:2511.22677 (2025) 4, 5
Liu, D., Gao, P ., Liu, D., Du, R., Li, Z., Wu, Q., Jin, X., Cao, S., Zhang, S., Li, H., Hoi, S.: Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield. arXiv preprint arXiv:2511.22677 (2025)
-
[50]
Advances in neural infor- mation processing systems36, 34892–34916 (2023)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural infor- mation processing systems36, 34892–34916 (2023)
2023
-
[51]
Flow-GRPO: Training Flow Matching Models via Online RL
Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P ., Zhang, D., Ouyang, W.: Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470 (2025)
work page internal anchor Pith review arXiv 2025
-
[52]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)
work page internal anchor Pith review arXiv 2022
-
[53]
https://thinkingmachines.ai/blog/ on-policy-distillation/
Lu, K., Lab, T.M.: On-policy distillation. Thinking Machines Lab: Connectionism (2025). https://doi.org/10.64434/tml.20251026, https://thinkingmachines.ai/blog/on-policy-distillation
-
[54]
arXiv preprint arXiv:2507.18569 (2025) 2, 4, 11 1.x-Distill 17
Lu, Y., Ren, Y., Xia, X., Lin, S., Wang, X., Xiao, X., Ma, A.J., Xie, X., Lai, J.H.: Adversar- ial distribution matching for diffusion distillation towards efficient image and video synthesis. arXiv preprint arXiv:2507.18569 (2025)
-
[55]
Knowledge distillation in iterative generative models for improved sampling speed
Luhman, E., Luhman, T.: Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388 (2021) 18
-
[56]
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: Synthesiz- ing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)
work page internal anchor Pith review arXiv 2023
-
[57]
arXiv preprint arXiv:2410.18881 (2024)
Luo, W.: Diff-instruct++: Training one-step text-to-image generator model to align with human preferences. arXiv preprint arXiv:2410.18881 (2024)
-
[58]
Advances in Neural Information Processing Systems36, 76525–76546 (2023)
Luo, W., Hu, T., Zhang, S., Sun, J., Li, Z., Zhang, Z.: Diff-instruct: A universal ap- proach for transferring knowledge from pre-trained diffusion models. Advances in Neural Information Processing Systems36, 76525–76546 (2023)
2023
-
[59]
arXiv preprint arXiv:2603.07700 (2026)
Luo, Y., Hu, T., Luo, W., Tang, J.: Tdm-r1: Reinforcing few-step diffusion models with non-differentiable reward. arXiv preprint arXiv:2603.07700 (2026)
-
[60]
arXiv preprint arXiv:2503.06674 (2025) 2, 4, 5, 10, 23
Luo, Y., Hu, T., Sun, J., Cai, Y., Tang, J.: Learning few-step diffusion models by trajec- tory distribution matching. arXiv preprint arXiv:2503.06674 (2025)
-
[61]
In: European Conference on Computer Vision
Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: Sit: Ex- ploring flow and diffusion-based generative models with scalable interpolant trans- formers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024)
2024
-
[62]
Miao, Z., Yang, Z., Lin, K., Wang, Z., Liu, Z., Wang, L., Qiu, Q.: Tun- ing timestep-distilled diffusion model using pairwise sample optimization. arXiv preprint arXiv:2410.03190 (2024)
-
[63]
OpenAI: Gpt-Image-1.https://openai.com/index/ introducing-4o-image-generation/(2025)
2025
-
[64]
Ostris: Z-image-de-turbo.https://huggingface.co/ostris/ Z-Image-De-Turbo(2025)
2025
-
[65]
In: Proceedings of the IEEE/CVF international conference on computer vision
Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)
2023
-
[66]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
work page internal anchor Pith review arXiv 2023
-
[67]
In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision
Qin, Q., Zhuo, L., Xin, Y., Du, R., Li, Z., Fu, B., Lu, Y., Li, X., Liu, D., Zhu, X., et al.: Lumina-image 2.0: A unified and efficient image generative framework. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20031– 20042 (2025)
2025
-
[68]
SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models
Qin, Y., Wang, L., Fei, H., Zimmermann, R., Bo, L., Lu, Q., Wang, C.: Soar: Self- correction for optimal alignment and refinement in diffusion models. arXiv preprint arXiv:2604.12617 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[69]
Qwen Team: Qwen3.5: Towards native multimodal agents.https://qwen.ai/ blog?id=qwen3.5(2026)
2026
-
[70]
In: International conference on machine learning
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P ., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
2021
-
[71]
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)
2018
-
[72]
OpenAI blog1(8), 9 (2019)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog1(8), 9 (2019)
2019
-
[73]
Journal of machine learning research21(140), 1–67 (2020) 19
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P .J.: Exploring the limits of transfer learning with a unified text-to-text trans- former. Journal of machine learning research21(140), 1–67 (2020) 19
2020
-
[74]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Ramesh, A., Dhariwal, P ., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.061251(2), 3 (2022)
work page internal anchor Pith review arXiv 2022
-
[75]
arXiv preprint arXiv:2404.13686 (2024) 3
Ren, Y., Xia, X., Lu, Y., Zhang, J., Wu, J., Xie, P ., Wang, X., Xiao, X.: Hyper-sd: Tra- jectory segmented consistency model for efficient image synthesis. arXiv preprint arXiv:2404.13686 (2024)
-
[76]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Rombach, R., Blattmann, A., Lorenz, D., Esser, P ., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
2022
-
[77]
r/StableDiffusion, the Reddit Community: Z-image lora training.https: //www.reddit.com/r/StableDiffusion/comments/1pj0469/zimage_lora_ training/(2025)
2025
-
[78]
In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
Ruiz, N., Li, Y., Jampani, V ., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
2023
-
[79]
CRISP: Compressed Reasoning via Iterative Self-Policy Distillation
Sang, H., Xu, Y., Zhou, Z., He, R., Wang, Z., Sun, J.: On-policy self-distillation for reasoning compression. arXiv preprint arXiv:2603.05433 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[80]
In: SIG- GRAPH Asia 2024 Conference Papers
Sauer, A., Boesel, F., Dockhorn, T., Blattmann, A., Esser, P ., Rombach, R.: Fast high- resolution image synthesis with latent adversarial diffusion distillation. In: SIG- GRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.