D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

Dengyang Jiang; Dongyang Liu; Harry Yang; Mingzhe Zheng; Peng Gao; Qilong Wu; Ruoyi Du; Steven Hoi; Xiangpeng Yang; Xin Jin

arxiv: 2605.05204 · v2 · pith:OMZRZRQAnew · submitted 2026-05-06 · 💻 cs.CV

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

Dengyang Jiang , Xin Jin , Dongyang Liu , Zanyi Wang , Mingzhe Zheng , Ruoyi Du , Xiangpeng Yang , Qilong Wu

show 4 more authors

Zhen Li Peng Gao Harry Yang Steven Hoi

This is my paper

Pith reviewed 2026-05-20 23:14 UTC · model grok-4.3

classification 💻 cs.CV

keywords diffusion modelsself-distillationon-policy learningfew-step inferenceimage generationfine-tuningcontinuous adaptation

0 comments

The pith

Step-distilled diffusion models can be continuously fine-tuned on new concepts without losing their few-step speed by using on-policy self-distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the problem that standard fine-tuning breaks the speed of few-step diffusion models. It introduces D-OPSD to turn supervised fine-tuning into an on-policy self-distillation process. The model generates its own trajectories and then compares a text-only student version against a teacher version that also receives the target image. Alignment happens between the two predicted distributions on those self-generated paths. A reader would care because this would let fast image generators keep adapting to new styles or subjects without retraining from scratch or regaining slow multi-step sampling.

Core claim

The paper claims that modern diffusion models with an LLM or VLM encoder inherit in-context capabilities that allow a teacher conditioned on both text prompt and target image to provide reliable supervision to a text-only student. Training then minimizes the difference between the two predicted distributions over the student's own roll-outs, so the model acquires new concepts and styles while its original few-step inference capacity stays intact.

What carries the argument

On-policy self-distillation, in which the model serves as both teacher (conditioned on text plus target image) and student (text only) and the loss aligns their output distributions on trajectories sampled from the student itself.

If this is right

The model acquires new concepts and styles through continuous supervised fine-tuning.
The original few-step inference performance remains unchanged after tuning.
Training draws on the model's inherited in-context capabilities from its encoder to generate the supervisory signal.
Practical ongoing adaptation of efficient image generators to specific domains becomes feasible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same self-distillation pattern could be tested on other conditional generative models that use encoder-based prompts.
Deployed systems might use repeated rounds of this process for gradual personalization without full retraining.
Checking the method on base models without strong encoders would test how far the in-context assumption reaches.

Load-bearing premise

Modern diffusion models inherit enough in-context capabilities from their LLM or VLM encoders that a teacher conditioned on both text and target image can reliably supervise a text-only student.

What would settle it

Apply D-OPSD to a step-distilled model and then check whether high-quality images on new concepts still appear in the original few steps; a clear rise in the number of steps required or a drop in quality on either new or original prompts would show the claim is false.

Figures

Figures reproduced from arXiv: 2605.05204 by Dengyang Jiang, Dongyang Liu, Harry Yang, Mingzhe Zheng, Peng Gao, Qilong Wu, Ruoyi Du, Steven Hoi, Xiangpeng Yang, Xin Jin, Zanyi Wang, Zhen Li.

**Figure 1.** Figure 1: We empirically investigate the visual appearance of generated images when con view at source ↗

**Figure 2.** Figure 2: Method overview. For each training pair, we first pass the prompt alone and the prompt together with the target image through the encoder to obtain 𝑐𝑠 and 𝑐𝑡 , respectively. We then sample a few-step trajectory using the student branch conditioned on 𝑐𝑠 . After that, the teacher and student predict velocities on the same trajectory states, and the student is updated by Equation 7. After training, the teach… view at source ↗

**Figure 3.** Figure 3: Visual comparison between baseline methods and ours finetuned on Z-ImageTurbo under customized training settings. Vanilla SFT training sacrifices the original fewstep capacity, and PSO suffers from the overfitting to training set, whereas our method enables the step-distilled model to continuously learn new concepts while maintaining the few-step capacity. large drops in Quality-S and Aesthetic-S in view at source ↗

**Figure 4.** Figure 4: Visual comparison between baseline methods and ours finetuned on Z-ImageTurbo under full-finetuning settings. SFT and PSO training sacrifices the original few-step capacity, whereas our method enables the step-distilled model to continuously learn to bias target domain while maintaining the few-step capacity as well as the learned knowledge in the original domain. learned knowledge. We conduct training an… view at source ↗

**Figure 5.** Figure 5: Ablation on (a) the different training strategies, and (b) the different way to build teacher model. We report the curves across training steps of DINO feature similarity between the generated images and the targets, as well as the Quality Score of the generated images. Training conducted on Z-Image-Turbo with LoRA. Better to zoom in to check the difference. use the student copy leads to training collapse.… view at source ↗

**Figure 6.** Figure 6: When the teacher model fails to generate images consistent with the concept ID under multimodal condition and therefore cannot provide an effective supervision signal, training will fail. Requirements for teacher capability. The success of D-OPSD is contingent upon the base model’s in-context abilities. In specific, as shown in view at source ↗

**Figure 7.** Figure 7: We also empirically investigate the difference of generated images when con view at source ↗

**Figure 7.** Figure 7: When the teacher model fails to generate images consistent with the concept ID under multimodal condition and therefore cannot provide an effective supervision signal, training will fail. Requirements for teacher capability. The success of D-OPSD is contingent upon the base model’s in-context abilities. In specific, as shown in [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: Comparison of generated images of ZImage-Turbo conditioned on multimodal feature using Qwen3-VL 4B and Qwen3-VL 4B with LLM part reweighted by Qwen3-4B LM. To address this issue, we replace the weights of the LLM component in Qwen3-VL-4B with those from the more compatible Qwen3-4B, while keeping the ViT and Connector weights unchanged. In this way, we preserve multimodal in-context capability while… view at source ↗

**Figure 8.** Figure 8: We also empirically investigate the difference of generated images when con [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: Comparison of generated images of ZImage-Turbo conditioned on multimodal feature using Qwen3-VL 4B and Qwen3-VL 4B with LLM part reweighted by Qwen3-4B LM. To address this issue, we replace the weights of the LLM component in Qwen3-VL-4B with those from the more compatible Qwen3-4B, while keeping the ViT and Connector weights unchanged. In this way, we preserve multimodal in-context capability while… view at source ↗

read the original abstract

The landscape of high-performance image generation models is currently shifting from the inefficient multi-step ones to the efficient few-step counterparts (e.g, Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for direct continuous supervised fine-tuning. For example, applying the commonly used fine-tuning technique would compromise their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that the modern diffusion models, where the LLM/VLM serves as the encoder, can inherit its encoder's in-context capabilities. This enables us to formulate the training as an on-policy self-distillation process. Specifically, during training, we make the model act as both the teacher and the student with different contexts, where the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the two predicted distributions over the student's own roll-outs. By optimizing on the model's own trajectory and under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc., without sacrificing the original few-step capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

D-OPSD gives a workable on-policy self-distillation route for adapting few-step diffusion models to new concepts without losing speed, but the approach stands or falls on whether the multimodal teacher signal stays reliable during updates.

read the letter

The main point for you is that this paper shows how to keep tuning step-distilled generators like FLUX.2-klein on new data by turning the model into its own teacher and student. The student sees only the text prompt while the teacher gets the prompt plus the target image, and training aligns their outputs on the student's own samples. That setup is the concrete novelty here, and it directly tackles the common failure mode where ordinary fine-tuning destroys the few-step sampling behavior. The authors are right to flag that modern diffusion models with LLM or VLM encoders can carry over some in-context flexibility, and using that to create an on-policy loop is a reasonable engineering move. If the experiments later show stable adaptation with preserved inference speed, the method could see use in deployment pipelines that need ongoing customization. The soft spot is the load-bearing assumption that the extra image context in the teacher produces a signal that remains both accurate and compatible with the student's trajectory once parameters start moving. The abstract states this inheritance as a finding but gives no derivation or early stability check, so the risk of gradual drift in the few-step regime is still open. I would want to see ablations that isolate whether the dual-context teacher actually outperforms simpler distillation baselines and whether the original sampling quality holds after several adaptation rounds. This work is aimed at people already shipping or iterating on efficient image generators rather than the broader diffusion community. A practitioner who needs to add styles or concepts to a fixed-step model without retraining from scratch would find the formulation useful to try. The paper deserves a serious referee because the problem is real and the proposed fix is specific enough to evaluate on its own terms.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes D-OPSD, an on-policy self-distillation training paradigm for step-distilled diffusion models. The core idea is that modern diffusion models with LLM/VLM encoders inherit in-context capabilities, allowing the same model to act as both teacher (conditioned on multimodal input consisting of the text prompt plus target image) and student (conditioned on text features only). Training minimizes divergence between the two predicted distributions evaluated on the student's own roll-outs, with the goal of enabling supervised fine-tuning for new concepts and styles while preserving the original few-step sampling behavior.

Significance. If the central claim is substantiated, the result would be significant for the ongoing shift toward efficient few-step diffusion models. It offers a potential solution to the problem of continuous supervised fine-tuning without degrading inference speed, which is a practical barrier for models such as Z-Image-Turbo and FLUX.2-klein. The on-policy self-distillation framing could also inform related work on self-supervised adaptation in generative models more broadly.

major comments (2)

[Abstract and §3] Abstract and §3 (Method description): The central assumption that the LLM/VLM encoder supplies sufficient in-context capabilities for the multimodal teacher to produce a reliable, distribution-compatible supervisory signal for the text-only student is stated as a finding but is not accompanied by any derivation, preliminary ablation, or stability argument showing why this conditioning remains valid once parameters are updated on-policy. This assumption is load-bearing for the claim that few-step capacity is preserved.
[§4] §4 (Experiments): No quantitative results, ablations, or comparisons are referenced that isolate the contribution of the on-policy self-distillation objective versus standard supervised fine-tuning; without such evidence it is impossible to assess whether the method actually avoids the distribution shift or capacity erosion described in the abstract.

minor comments (1)

[Abstract] Abstract: The models Z-Image-Turbo and FLUX.2-klein are mentioned without citations or references.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and describe the revisions we will make to the manuscript.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (Method description): The central assumption that the LLM/VLM encoder supplies sufficient in-context capabilities for the multimodal teacher to produce a reliable, distribution-compatible supervisory signal for the text-only student is stated as a finding but is not accompanied by any derivation, preliminary ablation, or stability argument showing why this conditioning remains valid once parameters are updated on-policy. This assumption is load-bearing for the claim that few-step capacity is preserved.

Authors: We agree that the validity of the in-context capability after on-policy updates is a central assumption. The manuscript presents this as an empirical observation that enables the teacher-student formulation, with the on-policy rollouts intended to keep the distributions aligned. While no formal derivation is given, the design minimizes shift by supervising on the student's own trajectories. To strengthen the presentation, we will add a new preliminary ablation in the revised §3 that measures the divergence between teacher and student predictions on held-out student trajectories both before and after a short training run, providing evidence for stability of the supervisory signal. revision: yes
Referee: [§4] §4 (Experiments): No quantitative results, ablations, or comparisons are referenced that isolate the contribution of the on-policy self-distillation objective versus standard supervised fine-tuning; without such evidence it is impossible to assess whether the method actually avoids the distribution shift or capacity erosion described in the abstract.

Authors: We acknowledge that the current experiments do not include an explicit head-to-head comparison against standard supervised fine-tuning. The reported results focus on demonstrating that D-OPSD enables acquisition of new concepts while retaining few-step sampling speed. To isolate the contribution of the on-policy objective, we will add quantitative comparisons in the revised §4, including a standard SFT baseline with metrics on both concept fidelity and inference-step preservation, allowing direct assessment of distribution-shift mitigation. revision: yes

Circularity Check

0 steps flagged

No circularity: proposed on-policy objective is independent of its claimed outcomes

full rationale

The paper introduces D-OPSD by first stating an empirical observation that diffusion models with LLM/VLM encoders inherit in-context capabilities, then defines a training process in which the same model acts as teacher (multimodal conditioning on text + target image) and student (text-only) while minimizing divergence on the student's own roll-outs. This formulation is presented as a novel paradigm to enable supervised fine-tuning without eroding few-step sampling; the benefit of preserving original capacity is an intended empirical result of the objective rather than a quantity that reduces to the inputs by definition or by self-citation. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided description, and the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the encoder's in-context learning transfers to the diffusion model in a way that makes multimodal conditioning a valid teacher signal; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption Modern diffusion models where the LLM/VLM serves as the encoder can inherit its encoder's in-context capabilities.
This finding is invoked to justify formulating training as on-policy self-distillation with different contexts.

pith-pipeline@v0.9.0 · 5796 in / 1283 out tokens · 33631 ms · 2026-05-20T23:14:39.707898+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Training minimizes the two predicted distributions over the student's own roll-outs... student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0

OPSD lets a single LLM distill its own reasoning by sampling trajectories from the student role while granting the teacher role privileged access to verified solutions, reducing memory needs versus separate-model dist...
A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0

This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.

Reference graph

Works this paper leans on

121 extracted references · 121 canonical work pages · cited by 1 Pith paper · 48 internal anchors

[1]

In: The twelfth international conference on learning representations (2024)

Agarwal, R., Vieillard, N., Zhou, Y., Stanczyk, P ., Garea, S.R., Geist, M., Bachem, O.: On-policy distillation of language models: Learning from self-generated mistakes. In: The twelfth international conference on learning representations (2024)

work page 2024
[2]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Black Forest Labs: FLUX.https://github.com/black-forest-labs/flux (2023)

work page 2023
[4]

Black Forest Labs: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/ flux-2(2025)

work page 2025
[5]

Advances in neural information processing systems33, 1877–1901 (2020)

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P ., Neelakantan, A., Shyam, P ., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020)

work page 1901
[6]

HunyuanImage 3.0 Technical Report

Cao, S., Chen, H., Chen, P ., Cheng, Y., Cui, Y., Deng, X., Dong, Y., Gong, K., Gu, T., Gu, X., et al.: Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

In: Proceedings of the IEEE/CVF international conference on computer vision

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P ., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

work page 2021
[8]

arXiv preprint arXiv:2603.06507 (2026)

Chefer, H., Esser, P ., Lorenz, D., Podell, D., Raja, V ., Tong, V ., Torralba, A., Rombach, R.: Self-supervised flow matching for scalable multi-modal synthesis. arXiv preprint arXiv:2603.06507 (2026)

work page arXiv 2026
[9]

arXiv preprint arXiv:2510.14974 (2025)

Chen, H., Zhang, K., Tan, H., Guibas, L., Wetzstein, G., Bi, S.: pi-flow: Policy-based few-step generation via imitation distillation. arXiv preprint arXiv:2510.14974 (2025)

work page arXiv 2025
[10]

In: International Conference on Learning Representations (2024) 18

Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P ., Lu, H., Li, Z.: Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In: International Conference on Learning Representations (2024) 18

work page 2024
[11]

Science China Information Sciences67(12), 220101 (2024)

Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., Tong, W., Hu, K., Luo, J., Ma, Z., et al.: How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences67(12), 220101 (2024)

work page 2024
[12]

arXiv preprint arXiv:2512.05150 (2025)

Cheng, Z., Sun, P ., Li, J., Lin, T.: Twinflow: Realizing one-step generation on large models with self-adversarial flows. arXiv preprint arXiv:2512.05150 (2025)

work page arXiv 2025
[13]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

https://huggingface.co/Freepik (2024)

Daniel Verdú, J.M.: Flux.1 lite: Distilling flux1.dev for efficient text-to-image genera- tion. https://huggingface.co/Freepik (2024)

work page 2024
[15]

DeepSeek-AI: Deepseek-v4: Towards highly efficient million-token context intelli- gence (2026)

work page 2026
[16]

Advances in neural information processing systems34, 8780–8794 (2021)

Dhariwal, P ., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems34, 8780–8794 (2021)

work page 2021
[17]

In: Forty-first international conference on machine learning (2024)

Esser, P ., Kulal, S., Blattmann, A., Entezari, R., Muller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)

work page 2024
[18]

arXiv preprint arXiv:2309.17425 (2023)

Fang, A., Jose, A.M., Jain, A., Schmidt, L., Toshev, A., Shankar, V .: Data filtering networks. arXiv preprint arXiv:2309.17425 (2023)

work page arXiv 2023
[19]

arXiv preprint arXiv:2412.01199 (2024)

Fang, G., Li, K., Ma, X., Wang, X.: Tinyfusion: Diffusion transformers learned shal- low. arXiv preprint arXiv:2412.01199 (2024)

work page arXiv 2024
[20]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen- Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

Advances in Neural Information Processing Systems36, 52132–52152 (2023)

Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023)

work page 2023
[22]

Google DeepMind: Gemini 3.https://deepmind.google/models/gemini/ (2025)

work page 2025
[23]

Google DeepMind: Gemini 3 pro image model card.https: //storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Image-Model-Card.pdf(2025)

work page 2025
[24]

International Journal of Computer Vision129(6), 1789–1819 (2021)

Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey. International Journal of Computer Vision129(6), 1789–1819 (2021)

work page 2021
[25]

ELT: Elastic Looped Transformers for Visual Generation

Goyal, S., Agrawal, S., Anil, G.G., Jain, P ., Paul, S., Kusupati, A.: Elt: Elastic looped transformers for visual generation. arXiv preprint arXiv:2604.09168 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[26]

Co-Evolving Policy Distillation

Gu, N., Yang, C., Si, Q., Qin, C., Yao, D., Fu, P ., Lin, Z., Wang, W., Duan, N., Wang, J.: Co-evolving policy distillation. arXiv preprint arXiv:2604.27083 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

LTX-2: Efficient Joint Audio-Visual Foundation Model

HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifschitz, D., Moshe, D., Porat, E., et al.: Ltx-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

He, Y., Kaur, S., Bhaskar, A., Yang, Y., Liu, J., Ri, N., Fowl, L., Panigrahi, A., Chen, D., Arora, S.: Self-distillation zero: Self-revision turns binary rewards into dense supervision. arXiv preprint arXiv:2604.12002 (2026) 19

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[30]

Advances in neural information processing systems30(2017)

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)

work page 2017
[31]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[32]

Advances in neural information processing systems33, 6840–6851 (2020)

Ho, J., Jain, A., Abbeel, P .: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)

work page 2020
[33]

ICLR1(2), 3 (2022)

Hu, E.J., Shen, Y., Wallis, P ., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)

work page 2022
[34]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P ., Yu, G.: Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train- test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Journal of quality technol- ogy18(4), 203–210 (1986)

Hunter, J.S.: The exponentially weighted moving average. Journal of quality technol- ogy18(4), 203–210 (1986)

work page 1986
[37]

Stable On-Policy Distillation through Adaptive Target Reformulation

Jang, I., Yeom, J., Yeo, J., Lim, H., Kim, T.: Stable on-policy distillation through adap- tive target reformulation. arXiv preprint arXiv:2601.07155 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

Distribution matching distillation meets reinforcement learning.arXiv preprint arXiv:2511.13649, 2025

Jiang, D., Liu, D., Wang, Z., Wu, Q., Li, L., Li, H., Jin, X., Liu, D., Li, Z., Zhang, B., et al.: Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649 (2025)

work page arXiv 2025
[39]

arXiv preprint arXiv:2505.02831 (2025)

Jiang, D., Wang, M., Li, L., Zhang, L., Wang, H., Wei, W., Dai, G., Zhang, Y., Wang, J.: No other representation component is needed: Diffusion transformers can provide representation guidance by themselves. arXiv preprint arXiv:2505.02831 (2025)

work page arXiv 2025
[40]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., Laine, S.: Analyzing and im- proving the training dynamics of diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24174–24184 (2024)

work page 2024
[41]

co/blog/kelseye/training-strategies-of-z-image-turbo(2025)

Kelseye, Duan, Z.: Training strategies of z-image-turbo.https://huggingface. co/blog/kelseye/training-strategies-of-z-image-turbo(2025)

work page 2025
[42]

Advances in neural information processing systems36, 36652–36663 (2023)

Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., Levy, O.: Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems36, 36652–36663 (2023)

work page 2023
[43]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Kulikov, V ., Kleiner, M., Huberman-Spiegelglas, I., Michaeli, T.: Flowedit: Inversion- free text-based editing using pre-trained flow models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19721–19730 (2025)

work page 2025
[44]

The annals of mathemat- ical statistics22(1), 79–86 (1951)

Kullback, S., Leibler, R.A.: On information and sufficiency. The annals of mathemat- ical statistics22(1), 79–86 (1951)

work page 1951
[45]

In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition

Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customiza- tion of text-to-image diffusion. In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition. pp. 1931–1941 (2023)

work page 1931
[46]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P ., et al.: Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

RefTon: Reference person shot assist virtual Try-on

Li, L., Gong, Y., Liu, S., Cheng, B., Ma, Y., Wu, L., Jiang, D., Wang, Z., Leng, D., Yin, Y.: Refvton: person-to-person try on with additional unpaired visual reference. arXiv preprint arXiv:2511.00956 (2025) 20

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for gener- ative modeling. arXiv preprint arXiv:2210.02747 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[49]

Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield.arXiv preprint arXiv:2511.22677, 2025

Liu, D., Gao, P ., Liu, D., Du, R., Li, Z., Wu, Q., Jin, X., Cao, S., Zhang, S., Li, H., Hoi, S.: Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield. arXiv preprint arXiv:2511.22677 (2025)

work page arXiv 2025
[50]

Advances in neural infor- mation processing systems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural infor- mation processing systems36, 34892–34916 (2023)

work page 2023
[51]

Flow-GRPO: Training Flow Matching Models via Online RL

Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P ., Zhang, D., Ouyang, W.: Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[53]

On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025

Lu, K., Lab, T.M.: On-policy distillation. Thinking Machines Lab: Connectionism (2025). https://doi.org/10.64434/tml.20251026, https://thinkingmachines.ai/blog/on-policy-distillation

work page doi:10.64434/tml.20251026 2025
[54]

arXiv preprint arXiv:2507.18569 (2025)

Lu, Y., Ren, Y., Xia, X., Lin, S., Wang, X., Xiao, X., Ma, A.J., Xie, X., Lai, J.H.: Adversar- ial distribution matching for diffusion distillation towards efficient image and video synthesis. arXiv preprint arXiv:2507.18569 (2025)

work page arXiv 2025
[55]

Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed

Luhman, E., Luhman, T.: Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[56]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: Synthesiz- ing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[57]

arXiv preprint arXiv:2410.18881 (2024)

Luo, W.: Diff-instruct++: Training one-step text-to-image generator model to align with human preferences. arXiv preprint arXiv:2410.18881 (2024)

work page arXiv 2024
[58]

Advances in Neural Information Processing Systems36, 76525–76546 (2023)

Luo, W., Hu, T., Zhang, S., Sun, J., Li, Z., Zhang, Z.: Diff-instruct: A universal ap- proach for transferring knowledge from pre-trained diffusion models. Advances in Neural Information Processing Systems36, 76525–76546 (2023)

work page 2023
[59]

arXiv preprint arXiv:2603.07700 (2026)

Luo, Y., Hu, T., Luo, W., Tang, J.: Tdm-r1: Reinforcing few-step diffusion models with non-differentiable reward. arXiv preprint arXiv:2603.07700 (2026)

work page arXiv 2026
[60]

Learning few- step diffusion models by trajectory distribution matching

Luo, Y., Hu, T., Sun, J., Cai, Y., Tang, J.: Learning few-step diffusion models by trajec- tory distribution matching. arXiv preprint arXiv:2503.06674 (2025)

work page arXiv 2025
[61]

In: European Conference on Computer Vision

Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: Sit: Ex- ploring flow and diffusion-based generative models with scalable interpolant trans- formers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024)

work page 2024
[62]

arXiv preprint arXiv:2410.03190 (2024)

Miao, Z., Yang, Z., Lin, K., Wang, Z., Liu, Z., Wang, L., Qiu, Q.: Tun- ing timestep-distilled diffusion model using pairwise sample optimization. arXiv preprint arXiv:2410.03190 (2024)

work page arXiv 2024
[63]

OpenAI: Gpt-Image-1.https://openai.com/index/ introducing-4o-image-generation/(2025)

work page 2025
[64]

Ostris: Z-image-de-turbo.https://huggingface.co/ostris/ Z-Image-De-Turbo(2025)

work page 2025
[65]

In: Proceedings of the IEEE/CVF international conference on computer vision

Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

work page 2023
[66]

Privileged Information Distillation for Language Models

Penaloza, E., Vattikonda, D., Gontier, N., Lacoste, A., Charlin, L., Caccia, M.: Privi- leged information distillation for language models. arXiv preprint arXiv:2602.04942 (2026) 21

work page internal anchor Pith review arXiv 2026
[67]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[68]

In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision

Qin, Q., Zhuo, L., Xin, Y., Du, R., Li, Z., Fu, B., Lu, Y., Li, X., Liu, D., Zhu, X., et al.: Lumina-image 2.0: A unified and efficient image generative framework. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20031– 20042 (2025)

work page 2025
[69]

SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

Qin, Y., Wang, L., Fei, H., Zimmermann, R., Bo, L., Lu, Q., Wang, C.: Soar: Self- correction for optimal alignment and refinement in diffusion models. arXiv preprint arXiv:2604.12617 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[70]

Qwen Team: Qwen3.5: Towards native multimodal agents.https://qwen.ai/ blog?id=qwen3.5(2026)

work page 2026
[71]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P ., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

work page 2021
[72]

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)

work page 2018
[73]

OpenAI blog1(8), 9 (2019)

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog1(8), 9 (2019)

work page 2019
[74]

Journal of machine learning research21(140), 1–67 (2020)

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P .J.: Exploring the limits of transfer learning with a unified text-to-text trans- former. Journal of machine learning research21(140), 1–67 (2020)

work page 2020
[75]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Ramesh, A., Dhariwal, P ., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.061251(2), 3 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[76]

arXiv preprint arXiv:2404.13686 (2024)

Ren, Y., Xia, X., Lu, Y., Zhang, J., Wu, J., Xie, P ., Wang, X., Xiao, X.: Hyper-sd: Tra- jectory segmented consistency model for efficient image synthesis. arXiv preprint arXiv:2404.13686 (2024)

work page arXiv 2024
[77]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P ., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

work page 2022
[78]

r/StableDiffusion, the Reddit Community: Z-image lora training.https: //www.reddit.com/r/StableDiffusion/comments/1pj0469/zimage_lora_ training/(2025)

work page 2025
[79]

In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

Ruiz, N., Li, Y., Jampani, V ., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

work page 2023
[80]

CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

Sang, H., Xu, Y., Zhou, Z., He, R., Wang, Z., Sun, J.: On-policy self-distillation for reasoning compression. arXiv preprint arXiv:2603.05433 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

Showing first 80 references.

[1] [1]

In: The twelfth international conference on learning representations (2024)

Agarwal, R., Vieillard, N., Zhou, Y., Stanczyk, P ., Garea, S.R., Geist, M., Bachem, O.: On-policy distillation of language models: Learning from self-generated mistakes. In: The twelfth international conference on learning representations (2024)

work page 2024

[2] [2]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Black Forest Labs: FLUX.https://github.com/black-forest-labs/flux (2023)

work page 2023

[4] [4]

Black Forest Labs: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/ flux-2(2025)

work page 2025

[5] [5]

Advances in neural information processing systems33, 1877–1901 (2020)

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P ., Neelakantan, A., Shyam, P ., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020)

work page 1901

[6] [6]

HunyuanImage 3.0 Technical Report

Cao, S., Chen, H., Chen, P ., Cheng, Y., Cui, Y., Deng, X., Dong, Y., Gong, K., Gu, T., Gu, X., et al.: Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

In: Proceedings of the IEEE/CVF international conference on computer vision

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P ., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

work page 2021

[8] [8]

arXiv preprint arXiv:2603.06507 (2026)

Chefer, H., Esser, P ., Lorenz, D., Podell, D., Raja, V ., Tong, V ., Torralba, A., Rombach, R.: Self-supervised flow matching for scalable multi-modal synthesis. arXiv preprint arXiv:2603.06507 (2026)

work page arXiv 2026

[9] [9]

arXiv preprint arXiv:2510.14974 (2025)

Chen, H., Zhang, K., Tan, H., Guibas, L., Wetzstein, G., Bi, S.: pi-flow: Policy-based few-step generation via imitation distillation. arXiv preprint arXiv:2510.14974 (2025)

work page arXiv 2025

[10] [10]

In: International Conference on Learning Representations (2024) 18

Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P ., Lu, H., Li, Z.: Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In: International Conference on Learning Representations (2024) 18

work page 2024

[11] [11]

Science China Information Sciences67(12), 220101 (2024)

Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., Tong, W., Hu, K., Luo, J., Ma, Z., et al.: How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences67(12), 220101 (2024)

work page 2024

[12] [12]

arXiv preprint arXiv:2512.05150 (2025)

Cheng, Z., Sun, P ., Li, J., Lin, T.: Twinflow: Realizing one-step generation on large models with self-adversarial flows. arXiv preprint arXiv:2512.05150 (2025)

work page arXiv 2025

[13] [13]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

https://huggingface.co/Freepik (2024)

Daniel Verdú, J.M.: Flux.1 lite: Distilling flux1.dev for efficient text-to-image genera- tion. https://huggingface.co/Freepik (2024)

work page 2024

[15] [15]

DeepSeek-AI: Deepseek-v4: Towards highly efficient million-token context intelli- gence (2026)

work page 2026

[16] [16]

Advances in neural information processing systems34, 8780–8794 (2021)

Dhariwal, P ., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems34, 8780–8794 (2021)

work page 2021

[17] [17]

In: Forty-first international conference on machine learning (2024)

Esser, P ., Kulal, S., Blattmann, A., Entezari, R., Muller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)

work page 2024

[18] [18]

arXiv preprint arXiv:2309.17425 (2023)

Fang, A., Jose, A.M., Jain, A., Schmidt, L., Toshev, A., Shankar, V .: Data filtering networks. arXiv preprint arXiv:2309.17425 (2023)

work page arXiv 2023

[19] [19]

arXiv preprint arXiv:2412.01199 (2024)

Fang, G., Li, K., Ma, X., Wang, X.: Tinyfusion: Diffusion transformers learned shal- low. arXiv preprint arXiv:2412.01199 (2024)

work page arXiv 2024

[20] [20]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen- Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[21] [21]

Advances in Neural Information Processing Systems36, 52132–52152 (2023)

Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems36, 52132–52152 (2023)

work page 2023

[22] [22]

Google DeepMind: Gemini 3.https://deepmind.google/models/gemini/ (2025)

work page 2025

[23] [23]

Google DeepMind: Gemini 3 pro image model card.https: //storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Image-Model-Card.pdf(2025)

work page 2025

[24] [24]

International Journal of Computer Vision129(6), 1789–1819 (2021)

Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey. International Journal of Computer Vision129(6), 1789–1819 (2021)

work page 2021

[25] [25]

ELT: Elastic Looped Transformers for Visual Generation

Goyal, S., Agrawal, S., Anil, G.G., Jain, P ., Paul, S., Kusupati, A.: Elt: Elastic looped transformers for visual generation. arXiv preprint arXiv:2604.09168 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[26] [26]

Co-Evolving Policy Distillation

Gu, N., Yang, C., Si, Q., Qin, C., Yao, D., Fu, P ., Lin, Z., Wang, W., Duan, N., Wang, J.: Co-evolving policy distillation. arXiv preprint arXiv:2604.27083 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[27] [27]

LTX-2: Efficient Joint Audio-Visual Foundation Model

HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifschitz, D., Moshe, D., Porat, E., et al.: Ltx-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[28] [28]

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

He, Y., Kaur, S., Bhaskar, A., Yang, Y., Liu, J., Ri, N., Fowl, L., Panigrahi, A., Chen, D., Arora, S.: Self-distillation zero: Self-revision turns binary rewards into dense supervision. arXiv preprint arXiv:2604.12002 (2026) 19

work page internal anchor Pith review Pith/arXiv arXiv 2026

[29] [29]

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[30] [30]

Advances in neural information processing systems30(2017)

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)

work page 2017

[31] [31]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[32] [32]

Advances in neural information processing systems33, 6840–6851 (2020)

Ho, J., Jain, A., Abbeel, P .: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)

work page 2020

[33] [33]

ICLR1(2), 3 (2022)

Hu, E.J., Shen, Y., Wallis, P ., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)

work page 2022

[34] [34]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P ., Yu, G.: Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train- test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

Journal of quality technol- ogy18(4), 203–210 (1986)

Hunter, J.S.: The exponentially weighted moving average. Journal of quality technol- ogy18(4), 203–210 (1986)

work page 1986

[37] [37]

Stable On-Policy Distillation through Adaptive Target Reformulation

Jang, I., Yeom, J., Yeo, J., Lim, H., Kim, T.: Stable on-policy distillation through adap- tive target reformulation. arXiv preprint arXiv:2601.07155 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[38] [38]

Distribution matching distillation meets reinforcement learning.arXiv preprint arXiv:2511.13649, 2025

Jiang, D., Liu, D., Wang, Z., Wu, Q., Li, L., Li, H., Jin, X., Liu, D., Li, Z., Zhang, B., et al.: Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649 (2025)

work page arXiv 2025

[39] [39]

arXiv preprint arXiv:2505.02831 (2025)

Jiang, D., Wang, M., Li, L., Zhang, L., Wang, H., Wei, W., Dai, G., Zhang, Y., Wang, J.: No other representation component is needed: Diffusion transformers can provide representation guidance by themselves. arXiv preprint arXiv:2505.02831 (2025)

work page arXiv 2025

[40] [40]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., Laine, S.: Analyzing and im- proving the training dynamics of diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24174–24184 (2024)

work page 2024

[41] [41]

co/blog/kelseye/training-strategies-of-z-image-turbo(2025)

Kelseye, Duan, Z.: Training strategies of z-image-turbo.https://huggingface. co/blog/kelseye/training-strategies-of-z-image-turbo(2025)

work page 2025

[42] [42]

Advances in neural information processing systems36, 36652–36663 (2023)

Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., Levy, O.: Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems36, 36652–36663 (2023)

work page 2023

[43] [43]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Kulikov, V ., Kleiner, M., Huberman-Spiegelglas, I., Michaeli, T.: Flowedit: Inversion- free text-based editing using pre-trained flow models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19721–19730 (2025)

work page 2025

[44] [44]

The annals of mathemat- ical statistics22(1), 79–86 (1951)

Kullback, S., Leibler, R.A.: On information and sufficiency. The annals of mathemat- ical statistics22(1), 79–86 (1951)

work page 1951

[45] [45]

In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition

Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customiza- tion of text-to-image diffusion. In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition. pp. 1931–1941 (2023)

work page 1931

[46] [46]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P ., et al.: Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [47]

RefTon: Reference person shot assist virtual Try-on

Li, L., Gong, Y., Liu, S., Cheng, B., Ma, Y., Wu, L., Jiang, D., Wang, Z., Leng, D., Yin, Y.: Refvton: person-to-person try on with additional unpaired visual reference. arXiv preprint arXiv:2511.00956 (2025) 20

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [48]

Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for gener- ative modeling. arXiv preprint arXiv:2210.02747 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[49] [49]

Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield.arXiv preprint arXiv:2511.22677, 2025

Liu, D., Gao, P ., Liu, D., Du, R., Li, Z., Wu, Q., Jin, X., Cao, S., Zhang, S., Li, H., Hoi, S.: Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield. arXiv preprint arXiv:2511.22677 (2025)

work page arXiv 2025

[50] [50]

Advances in neural infor- mation processing systems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural infor- mation processing systems36, 34892–34916 (2023)

work page 2023

[51] [51]

Flow-GRPO: Training Flow Matching Models via Online RL

Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P ., Zhang, D., Ouyang, W.: Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[52] [52]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[53] [53]

On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025

Lu, K., Lab, T.M.: On-policy distillation. Thinking Machines Lab: Connectionism (2025). https://doi.org/10.64434/tml.20251026, https://thinkingmachines.ai/blog/on-policy-distillation

work page doi:10.64434/tml.20251026 2025

[54] [54]

arXiv preprint arXiv:2507.18569 (2025)

Lu, Y., Ren, Y., Xia, X., Lin, S., Wang, X., Xiao, X., Ma, A.J., Xie, X., Lai, J.H.: Adversar- ial distribution matching for diffusion distillation towards efficient image and video synthesis. arXiv preprint arXiv:2507.18569 (2025)

work page arXiv 2025

[55] [55]

Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed

Luhman, E., Luhman, T.: Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[56] [56]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: Synthesiz- ing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[57] [57]

arXiv preprint arXiv:2410.18881 (2024)

Luo, W.: Diff-instruct++: Training one-step text-to-image generator model to align with human preferences. arXiv preprint arXiv:2410.18881 (2024)

work page arXiv 2024

[58] [58]

Advances in Neural Information Processing Systems36, 76525–76546 (2023)

Luo, W., Hu, T., Zhang, S., Sun, J., Li, Z., Zhang, Z.: Diff-instruct: A universal ap- proach for transferring knowledge from pre-trained diffusion models. Advances in Neural Information Processing Systems36, 76525–76546 (2023)

work page 2023

[59] [59]

arXiv preprint arXiv:2603.07700 (2026)

Luo, Y., Hu, T., Luo, W., Tang, J.: Tdm-r1: Reinforcing few-step diffusion models with non-differentiable reward. arXiv preprint arXiv:2603.07700 (2026)

work page arXiv 2026

[60] [60]

Learning few- step diffusion models by trajectory distribution matching

Luo, Y., Hu, T., Sun, J., Cai, Y., Tang, J.: Learning few-step diffusion models by trajec- tory distribution matching. arXiv preprint arXiv:2503.06674 (2025)

work page arXiv 2025

[61] [61]

In: European Conference on Computer Vision

Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: Sit: Ex- ploring flow and diffusion-based generative models with scalable interpolant trans- formers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024)

work page 2024

[62] [62]

arXiv preprint arXiv:2410.03190 (2024)

Miao, Z., Yang, Z., Lin, K., Wang, Z., Liu, Z., Wang, L., Qiu, Q.: Tun- ing timestep-distilled diffusion model using pairwise sample optimization. arXiv preprint arXiv:2410.03190 (2024)

work page arXiv 2024

[63] [63]

OpenAI: Gpt-Image-1.https://openai.com/index/ introducing-4o-image-generation/(2025)

work page 2025

[64] [64]

Ostris: Z-image-de-turbo.https://huggingface.co/ostris/ Z-Image-De-Turbo(2025)

work page 2025

[65] [65]

In: Proceedings of the IEEE/CVF international conference on computer vision

Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

work page 2023

[66] [66]

Privileged Information Distillation for Language Models

Penaloza, E., Vattikonda, D., Gontier, N., Lacoste, A., Charlin, L., Caccia, M.: Privi- leged information distillation for language models. arXiv preprint arXiv:2602.04942 (2026) 21

work page internal anchor Pith review arXiv 2026

[67] [67]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[68] [68]

In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision

Qin, Q., Zhuo, L., Xin, Y., Du, R., Li, Z., Fu, B., Lu, Y., Li, X., Liu, D., Zhu, X., et al.: Lumina-image 2.0: A unified and efficient image generative framework. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20031– 20042 (2025)

work page 2025

[69] [69]

SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

Qin, Y., Wang, L., Fei, H., Zimmermann, R., Bo, L., Lu, Q., Wang, C.: Soar: Self- correction for optimal alignment and refinement in diffusion models. arXiv preprint arXiv:2604.12617 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[70] [70]

Qwen Team: Qwen3.5: Towards native multimodal agents.https://qwen.ai/ blog?id=qwen3.5(2026)

work page 2026

[71] [71]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P ., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

work page 2021

[72] [72]

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)

work page 2018

[73] [73]

OpenAI blog1(8), 9 (2019)

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog1(8), 9 (2019)

work page 2019

[74] [74]

Journal of machine learning research21(140), 1–67 (2020)

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P .J.: Exploring the limits of transfer learning with a unified text-to-text trans- former. Journal of machine learning research21(140), 1–67 (2020)

work page 2020

[75] [75]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Ramesh, A., Dhariwal, P ., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.061251(2), 3 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[76] [76]

arXiv preprint arXiv:2404.13686 (2024)

Ren, Y., Xia, X., Lu, Y., Zhang, J., Wu, J., Xie, P ., Wang, X., Xiao, X.: Hyper-sd: Tra- jectory segmented consistency model for efficient image synthesis. arXiv preprint arXiv:2404.13686 (2024)

work page arXiv 2024

[77] [77]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P ., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

work page 2022

[78] [78]

r/StableDiffusion, the Reddit Community: Z-image lora training.https: //www.reddit.com/r/StableDiffusion/comments/1pj0469/zimage_lora_ training/(2025)

work page 2025

[79] [79]

In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

Ruiz, N., Li, Y., Jampani, V ., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

work page 2023

[80] [80]

CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

Sang, H., Xu, Y., Zhou, Z., He, R., Wang, Z., Sun, J.: On-policy self-distillation for reasoning compression. arXiv preprint arXiv:2603.05433 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026