D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
Pith reviewed 2026-05-20 23:14 UTC · model grok-4.3
The pith
Step-distilled diffusion models can be continuously fine-tuned on new concepts without losing their few-step speed by using on-policy self-distillation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that modern diffusion models with an LLM or VLM encoder inherit in-context capabilities that allow a teacher conditioned on both text prompt and target image to provide reliable supervision to a text-only student. Training then minimizes the difference between the two predicted distributions over the student's own roll-outs, so the model acquires new concepts and styles while its original few-step inference capacity stays intact.
What carries the argument
On-policy self-distillation, in which the model serves as both teacher (conditioned on text plus target image) and student (text only) and the loss aligns their output distributions on trajectories sampled from the student itself.
If this is right
- The model acquires new concepts and styles through continuous supervised fine-tuning.
- The original few-step inference performance remains unchanged after tuning.
- Training draws on the model's inherited in-context capabilities from its encoder to generate the supervisory signal.
- Practical ongoing adaptation of efficient image generators to specific domains becomes feasible.
Where Pith is reading between the lines
- The same self-distillation pattern could be tested on other conditional generative models that use encoder-based prompts.
- Deployed systems might use repeated rounds of this process for gradual personalization without full retraining.
- Checking the method on base models without strong encoders would test how far the in-context assumption reaches.
Load-bearing premise
Modern diffusion models inherit enough in-context capabilities from their LLM or VLM encoders that a teacher conditioned on both text and target image can reliably supervise a text-only student.
What would settle it
Apply D-OPSD to a step-distilled model and then check whether high-quality images on new concepts still appear in the original few steps; a clear rise in the number of steps required or a drop in quality on either new or original prompts would show the claim is false.
Figures
read the original abstract
The landscape of high-performance image generation models is currently shifting from the inefficient multi-step ones to the efficient few-step counterparts (e.g, Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for direct continuous supervised fine-tuning. For example, applying the commonly used fine-tuning technique would compromise their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that the modern diffusion models, where the LLM/VLM serves as the encoder, can inherit its encoder's in-context capabilities. This enables us to formulate the training as an on-policy self-distillation process. Specifically, during training, we make the model act as both the teacher and the student with different contexts, where the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the two predicted distributions over the student's own roll-outs. By optimizing on the model's own trajectory and under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc., without sacrificing the original few-step capacity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes D-OPSD, an on-policy self-distillation training paradigm for step-distilled diffusion models. The core idea is that modern diffusion models with LLM/VLM encoders inherit in-context capabilities, allowing the same model to act as both teacher (conditioned on multimodal input consisting of the text prompt plus target image) and student (conditioned on text features only). Training minimizes divergence between the two predicted distributions evaluated on the student's own roll-outs, with the goal of enabling supervised fine-tuning for new concepts and styles while preserving the original few-step sampling behavior.
Significance. If the central claim is substantiated, the result would be significant for the ongoing shift toward efficient few-step diffusion models. It offers a potential solution to the problem of continuous supervised fine-tuning without degrading inference speed, which is a practical barrier for models such as Z-Image-Turbo and FLUX.2-klein. The on-policy self-distillation framing could also inform related work on self-supervised adaptation in generative models more broadly.
major comments (2)
- [Abstract and §3] Abstract and §3 (Method description): The central assumption that the LLM/VLM encoder supplies sufficient in-context capabilities for the multimodal teacher to produce a reliable, distribution-compatible supervisory signal for the text-only student is stated as a finding but is not accompanied by any derivation, preliminary ablation, or stability argument showing why this conditioning remains valid once parameters are updated on-policy. This assumption is load-bearing for the claim that few-step capacity is preserved.
- [§4] §4 (Experiments): No quantitative results, ablations, or comparisons are referenced that isolate the contribution of the on-policy self-distillation objective versus standard supervised fine-tuning; without such evidence it is impossible to assess whether the method actually avoids the distribution shift or capacity erosion described in the abstract.
minor comments (1)
- [Abstract] Abstract: The models Z-Image-Turbo and FLUX.2-klein are mentioned without citations or references.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and describe the revisions we will make to the manuscript.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (Method description): The central assumption that the LLM/VLM encoder supplies sufficient in-context capabilities for the multimodal teacher to produce a reliable, distribution-compatible supervisory signal for the text-only student is stated as a finding but is not accompanied by any derivation, preliminary ablation, or stability argument showing why this conditioning remains valid once parameters are updated on-policy. This assumption is load-bearing for the claim that few-step capacity is preserved.
Authors: We agree that the validity of the in-context capability after on-policy updates is a central assumption. The manuscript presents this as an empirical observation that enables the teacher-student formulation, with the on-policy rollouts intended to keep the distributions aligned. While no formal derivation is given, the design minimizes shift by supervising on the student's own trajectories. To strengthen the presentation, we will add a new preliminary ablation in the revised §3 that measures the divergence between teacher and student predictions on held-out student trajectories both before and after a short training run, providing evidence for stability of the supervisory signal. revision: yes
-
Referee: [§4] §4 (Experiments): No quantitative results, ablations, or comparisons are referenced that isolate the contribution of the on-policy self-distillation objective versus standard supervised fine-tuning; without such evidence it is impossible to assess whether the method actually avoids the distribution shift or capacity erosion described in the abstract.
Authors: We acknowledge that the current experiments do not include an explicit head-to-head comparison against standard supervised fine-tuning. The reported results focus on demonstrating that D-OPSD enables acquisition of new concepts while retaining few-step sampling speed. To isolate the contribution of the on-policy objective, we will add quantitative comparisons in the revised §4, including a standard SFT baseline with metrics on both concept fidelity and inference-step preservation, allowing direct assessment of distribution-shift mitigation. revision: yes
Circularity Check
No circularity: proposed on-policy objective is independent of its claimed outcomes
full rationale
The paper introduces D-OPSD by first stating an empirical observation that diffusion models with LLM/VLM encoders inherit in-context capabilities, then defines a training process in which the same model acts as teacher (multimodal conditioning on text + target image) and student (text-only) while minimizing divergence on the student's own roll-outs. This formulation is presented as a novel paradigm to enable supervised fine-tuning without eroding few-step sampling; the benefit of preserving original capacity is an intended empirical result of the objective rather than a quantity that reduces to the inputs by definition or by self-citation. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided description, and the derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Modern diffusion models where the LLM/VLM serves as the encoder can inherit its encoder's in-context capabilities.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Training minimizes the two predicted distributions over the student's own roll-outs... student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
A Brief Overview: On-Policy Self-Distillation In Large Language Models
OPSD lets a single LLM distill its own reasoning by sampling trajectories from the student role while granting the teacher role privileged access to verified solutions, reducing memory needs versus separate-model dist...
-
A Brief Overview: On-Policy Self-Distillation In Large Language Models
This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.
Reference graph
Works this paper leans on
-
[1]
In: The twelfth international conference on learning representations (2024)
Agarwal, R., Vieillard, N., Zhou, Y., Stanczyk, P ., Garea, S.R., Geist, M., Bachem, O.: On-policy distillation of language models: Learning from self-generated mistakes. In: The twelfth international conference on learning representations (2024)
work page 2024
-
[2]
Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Black Forest Labs: FLUX.https://github.com/black-forest-labs/flux (2023)
work page 2023
-
[4]
Black Forest Labs: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/ flux-2(2025)
work page 2025
-
[5]
Advances in neural information processing systems33, 1877–1901 (2020)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P ., Neelakantan, A., Shyam, P ., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020)
work page 1901
-
[6]
HunyuanImage 3.0 Technical Report
Cao, S., Chen, H., Chen, P ., Cheng, Y., Cui, Y., Deng, X., Dong, Y., Gong, K., Gu, T., Gu, X., et al.: Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
In: Proceedings of the IEEE/CVF international conference on computer vision
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P ., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)
work page 2021
-
[8]
arXiv preprint arXiv:2603.06507 (2026)
Chefer, H., Esser, P ., Lorenz, D., Podell, D., Raja, V ., Tong, V ., Torralba, A., Rombach, R.: Self-supervised flow matching for scalable multi-modal synthesis. arXiv preprint arXiv:2603.06507 (2026)
-
[9]
arXiv preprint arXiv:2510.14974 (2025)
Chen, H., Zhang, K., Tan, H., Guibas, L., Wetzstein, G., Bi, S.: pi-flow: Policy-based few-step generation via imitation distillation. arXiv preprint arXiv:2510.14974 (2025)
-
[10]
In: International Conference on Learning Representations (2024) 18
Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P ., Lu, H., Li, Z.: Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In: International Conference on Learning Representations (2024) 18
work page 2024
-
[11]
Science China Information Sciences67(12), 220101 (2024)
Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., Tong, W., Hu, K., Luo, J., Ma, Z., et al.: How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences67(12), 220101 (2024)
work page 2024
-
[12]
arXiv preprint arXiv:2512.05150 (2025)
Cheng, Z., Sun, P ., Li, J., Lin, T.: Twinflow: Realizing one-step generation on large models with self-adversarial flows. arXiv preprint arXiv:2512.05150 (2025)
-
[13]
Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
https://huggingface.co/Freepik (2024)
Daniel Verdú, J.M.: Flux.1 lite: Distilling flux1.dev for efficient text-to-image genera- tion. https://huggingface.co/Freepik (2024)
work page 2024
-
[15]
DeepSeek-AI: Deepseek-v4: Towards highly efficient million-token context intelli- gence (2026)
work page 2026
-
[16]
Advances in neural information processing systems34, 8780–8794 (2021)
Dhariwal, P ., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems34, 8780–8794 (2021)
work page 2021
-
[17]
In: Forty-first international conference on machine learning (2024)
Esser, P ., Kulal, S., Blattmann, A., Entezari, R., Muller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)
work page 2024
-
[18]
arXiv preprint arXiv:2309.17425 (2023)
Fang, A., Jose, A.M., Jain, A., Schmidt, L., Toshev, A., Shankar, V .: Data filtering networks. arXiv preprint arXiv:2309.17425 (2023)
-
[19]
arXiv preprint arXiv:2412.01199 (2024)
Fang, G., Li, K., Ma, X., Wang, X.: Tinyfusion: Diffusion transformers learned shal- low. arXiv preprint arXiv:2412.01199 (2024)
-
[20]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen- Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[21]
Advances in Neural Information Processing Systems36, 52132–52152 (2023)
Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023)
work page 2023
-
[22]
Google DeepMind: Gemini 3.https://deepmind.google/models/gemini/ (2025)
work page 2025
-
[23]
Google DeepMind: Gemini 3 pro image model card.https: //storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Image-Model-Card.pdf(2025)
work page 2025
-
[24]
International Journal of Computer Vision129(6), 1789–1819 (2021)
Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey. International Journal of Computer Vision129(6), 1789–1819 (2021)
work page 2021
-
[25]
ELT: Elastic Looped Transformers for Visual Generation
Goyal, S., Agrawal, S., Anil, G.G., Jain, P ., Paul, S., Kusupati, A.: Elt: Elastic looped transformers for visual generation. arXiv preprint arXiv:2604.09168 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[26]
Co-Evolving Policy Distillation
Gu, N., Yang, C., Si, Q., Qin, C., Yao, D., Fu, P ., Lin, Z., Wang, W., Duan, N., Wang, J.: Co-evolving policy distillation. arXiv preprint arXiv:2604.27083 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[27]
LTX-2: Efficient Joint Audio-Visual Foundation Model
HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifschitz, D., Moshe, D., Porat, E., et al.: Ltx-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[28]
Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision
He, Y., Kaur, S., Bhaskar, A., Yang, Y., Liu, J., Ri, N., Fowl, L., Panigrahi, A., Chen, D., Arora, S.: Self-distillation zero: Self-revision turns binary rewards into dense supervision. arXiv preprint arXiv:2604.12002 (2026) 19
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[29]
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[30]
Advances in neural information processing systems30(2017)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)
work page 2017
-
[31]
Distilling the Knowledge in a Neural Network
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[32]
Advances in neural information processing systems33, 6840–6851 (2020)
Ho, J., Jain, A., Abbeel, P .: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)
work page 2020
-
[33]
Hu, E.J., Shen, Y., Wallis, P ., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)
work page 2022
-
[34]
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P ., Yu, G.: Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[35]
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train- test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Journal of quality technol- ogy18(4), 203–210 (1986)
Hunter, J.S.: The exponentially weighted moving average. Journal of quality technol- ogy18(4), 203–210 (1986)
work page 1986
-
[37]
Stable On-Policy Distillation through Adaptive Target Reformulation
Jang, I., Yeom, J., Yeo, J., Lim, H., Kim, T.: Stable on-policy distillation through adap- tive target reformulation. arXiv preprint arXiv:2601.07155 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[38]
Jiang, D., Liu, D., Wang, Z., Wu, Q., Li, L., Li, H., Jin, X., Liu, D., Li, Z., Zhang, B., et al.: Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649 (2025)
-
[39]
arXiv preprint arXiv:2505.02831 (2025)
Jiang, D., Wang, M., Li, L., Zhang, L., Wang, H., Wei, W., Dai, G., Zhang, Y., Wang, J.: No other representation component is needed: Diffusion transformers can provide representation guidance by themselves. arXiv preprint arXiv:2505.02831 (2025)
-
[40]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., Laine, S.: Analyzing and im- proving the training dynamics of diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24174–24184 (2024)
work page 2024
-
[41]
co/blog/kelseye/training-strategies-of-z-image-turbo(2025)
Kelseye, Duan, Z.: Training strategies of z-image-turbo.https://huggingface. co/blog/kelseye/training-strategies-of-z-image-turbo(2025)
work page 2025
-
[42]
Advances in neural information processing systems36, 36652–36663 (2023)
Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., Levy, O.: Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems36, 36652–36663 (2023)
work page 2023
-
[43]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Kulikov, V ., Kleiner, M., Huberman-Spiegelglas, I., Michaeli, T.: Flowedit: Inversion- free text-based editing using pre-trained flow models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19721–19730 (2025)
work page 2025
-
[44]
The annals of mathemat- ical statistics22(1), 79–86 (1951)
Kullback, S., Leibler, R.A.: On information and sufficiency. The annals of mathemat- ical statistics22(1), 79–86 (1951)
work page 1951
-
[45]
In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition
Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customiza- tion of text-to-image diffusion. In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition. pp. 1931–1941 (2023)
work page 1931
-
[46]
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P ., et al.: Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[47]
RefTon: Reference person shot assist virtual Try-on
Li, L., Gong, Y., Liu, S., Cheng, B., Ma, Y., Wu, L., Jiang, D., Wang, Z., Leng, D., Yin, Y.: Refvton: person-to-person try on with additional unpaired visual reference. arXiv preprint arXiv:2511.00956 (2025) 20
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
Flow Matching for Generative Modeling
Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for gener- ative modeling. arXiv preprint arXiv:2210.02747 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[49]
Liu, D., Gao, P ., Liu, D., Du, R., Li, Z., Wu, Q., Jin, X., Cao, S., Zhang, S., Li, H., Hoi, S.: Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield. arXiv preprint arXiv:2511.22677 (2025)
-
[50]
Advances in neural infor- mation processing systems36, 34892–34916 (2023)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural infor- mation processing systems36, 34892–34916 (2023)
work page 2023
-
[51]
Flow-GRPO: Training Flow Matching Models via Online RL
Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P ., Zhang, D., Ouyang, W.: Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[53]
On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025
Lu, K., Lab, T.M.: On-policy distillation. Thinking Machines Lab: Connectionism (2025). https://doi.org/10.64434/tml.20251026, https://thinkingmachines.ai/blog/on-policy-distillation
-
[54]
arXiv preprint arXiv:2507.18569 (2025)
Lu, Y., Ren, Y., Xia, X., Lin, S., Wang, X., Xiao, X., Ma, A.J., Xie, X., Lai, J.H.: Adversar- ial distribution matching for diffusion distillation towards efficient image and video synthesis. arXiv preprint arXiv:2507.18569 (2025)
-
[55]
Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed
Luhman, E., Luhman, T.: Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[56]
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: Synthesiz- ing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[57]
arXiv preprint arXiv:2410.18881 (2024)
Luo, W.: Diff-instruct++: Training one-step text-to-image generator model to align with human preferences. arXiv preprint arXiv:2410.18881 (2024)
-
[58]
Advances in Neural Information Processing Systems36, 76525–76546 (2023)
Luo, W., Hu, T., Zhang, S., Sun, J., Li, Z., Zhang, Z.: Diff-instruct: A universal ap- proach for transferring knowledge from pre-trained diffusion models. Advances in Neural Information Processing Systems36, 76525–76546 (2023)
work page 2023
-
[59]
arXiv preprint arXiv:2603.07700 (2026)
Luo, Y., Hu, T., Luo, W., Tang, J.: Tdm-r1: Reinforcing few-step diffusion models with non-differentiable reward. arXiv preprint arXiv:2603.07700 (2026)
-
[60]
Learning few- step diffusion models by trajectory distribution matching
Luo, Y., Hu, T., Sun, J., Cai, Y., Tang, J.: Learning few-step diffusion models by trajec- tory distribution matching. arXiv preprint arXiv:2503.06674 (2025)
-
[61]
In: European Conference on Computer Vision
Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: Sit: Ex- ploring flow and diffusion-based generative models with scalable interpolant trans- formers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024)
work page 2024
-
[62]
arXiv preprint arXiv:2410.03190 (2024)
Miao, Z., Yang, Z., Lin, K., Wang, Z., Liu, Z., Wang, L., Qiu, Q.: Tun- ing timestep-distilled diffusion model using pairwise sample optimization. arXiv preprint arXiv:2410.03190 (2024)
-
[63]
OpenAI: Gpt-Image-1.https://openai.com/index/ introducing-4o-image-generation/(2025)
work page 2025
-
[64]
Ostris: Z-image-de-turbo.https://huggingface.co/ostris/ Z-Image-De-Turbo(2025)
work page 2025
-
[65]
In: Proceedings of the IEEE/CVF international conference on computer vision
Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)
work page 2023
-
[66]
Privileged Information Distillation for Language Models
Penaloza, E., Vattikonda, D., Gontier, N., Lacoste, A., Charlin, L., Caccia, M.: Privi- leged information distillation for language models. arXiv preprint arXiv:2602.04942 (2026) 21
work page internal anchor Pith review arXiv 2026
-
[67]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[68]
In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision
Qin, Q., Zhuo, L., Xin, Y., Du, R., Li, Z., Fu, B., Lu, Y., Li, X., Liu, D., Zhu, X., et al.: Lumina-image 2.0: A unified and efficient image generative framework. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20031– 20042 (2025)
work page 2025
-
[69]
SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models
Qin, Y., Wang, L., Fei, H., Zimmermann, R., Bo, L., Lu, Q., Wang, C.: Soar: Self- correction for optimal alignment and refinement in diffusion models. arXiv preprint arXiv:2604.12617 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[70]
Qwen Team: Qwen3.5: Towards native multimodal agents.https://qwen.ai/ blog?id=qwen3.5(2026)
work page 2026
-
[71]
In: International conference on machine learning
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P ., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
work page 2021
-
[72]
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)
work page 2018
-
[73]
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog1(8), 9 (2019)
work page 2019
-
[74]
Journal of machine learning research21(140), 1–67 (2020)
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P .J.: Exploring the limits of transfer learning with a unified text-to-text trans- former. Journal of machine learning research21(140), 1–67 (2020)
work page 2020
-
[75]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Ramesh, A., Dhariwal, P ., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.061251(2), 3 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[76]
arXiv preprint arXiv:2404.13686 (2024)
Ren, Y., Xia, X., Lu, Y., Zhang, J., Wu, J., Xie, P ., Wang, X., Xiao, X.: Hyper-sd: Tra- jectory segmented consistency model for efficient image synthesis. arXiv preprint arXiv:2404.13686 (2024)
-
[77]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Rombach, R., Blattmann, A., Lorenz, D., Esser, P ., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
work page 2022
-
[78]
r/StableDiffusion, the Reddit Community: Z-image lora training.https: //www.reddit.com/r/StableDiffusion/comments/1pj0469/zimage_lora_ training/(2025)
work page 2025
-
[79]
In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
Ruiz, N., Li, Y., Jampani, V ., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
work page 2023
-
[80]
CRISP: Compressed Reasoning via Iterative Self-Policy Distillation
Sang, H., Xu, Y., Zhou, Z., He, R., Wang, Z., Sun, J.: On-policy self-distillation for reasoning compression. arXiv preprint arXiv:2603.05433 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.