Accelerating Rectified Flow Models via Trajectory-Aware Caching

Hongliang Lu; Kai Liu; Naiyang Guan; Renjing Pei; Xiao Liu; Yulun Zhang; Zhikai Chen; Zhixin Wang

arxiv: 2605.16789 · v1 · pith:VXPIXCCEnew · submitted 2026-05-16 · 💻 cs.CV

Xiao Liu, Kai Liu, Naiyang Guan, Hongliang Lu, Zhixin Wang, Zhikai Chen, Renjing Pei, Yulun Zhang

Pith reviewed 2026-05-19 21:27 UTC · model grok-4.3

keywords rectified flow · sampling acceleration · velocity caching · orthogonal decomposition · text-to-image · text-to-video · training-free

The pith

TACache accelerates rectified flow sampling by decomposing velocity changes into parallel and orthogonal parts to safely skip steps and reconstruct velocities from history.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TACache as a training-free way to speed up iterative sampling in rectified flow models for images and videos. It isolates magnitude and directional error sources through orthogonal decomposition of velocity acceleration, then sets skip intervals offline using cumulative variation thresholds while reconstructing skipped velocities online from past orthogonal directions. This skip-then-compensate approach reduces accumulated approximation errors that plague earlier caching techniques, delivering higher fidelity at greater speeds on standard benchmarks.

Core claim

TACache operates in two stages: offline computation of skip schedules from cumulative variation thresholds on magnitude and direction indicators obtained via orthogonal decomposition of discrete velocity acceleration, and online reconstruction of each skipped velocity by combining those thresholds with the sample's historical orthogonal direction, all without further model evaluations.
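The offline stage described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function `build_skip_schedule`, the per-timestep arrays `mag_ind` and `dir_ind`, the thresholds `tau_mag` and `tau_dir`, the optional `max_skip` cap, and the reset-on-exceed rule are all assumptions made for the example, since the source does not give the exact schedule construction.

```python
import numpy as np

def build_skip_schedule(mag_ind, dir_ind, tau_mag, tau_dir, max_skip=None):
    """Return a boolean array: True = evaluate the model, False = skip.

    mag_ind, dir_ind: hypothetical per-timestep calibration statistics
        standing in for the magnitude and direction indicators.
    tau_mag, tau_dir: cumulative variation thresholds (free parameters).
    max_skip: optional cap on the number of consecutive skipped steps.
    """
    mag_ind = np.asarray(mag_ind, dtype=float)
    dir_ind = np.asarray(dir_ind, dtype=float)
    n_steps = len(mag_ind)
    compute = np.zeros(n_steps, dtype=bool)
    compute[0] = True  # the first step always needs a real evaluation
    acc_mag = acc_dir = 0.0
    run = 0  # steps elapsed since the last real evaluation
    for n in range(1, n_steps):
        # Accumulate variation since the last evaluated step.
        acc_mag += abs(mag_ind[n])
        acc_dir += abs(dir_ind[n])
        run += 1
        too_long = max_skip is not None and run > max_skip
        if acc_mag > tau_mag or acc_dir > tau_dir or too_long:
            compute[n] = True  # budget exhausted: evaluate and reset
            acc_mag = acc_dir = 0.0
            run = 0
    return compute

if __name__ == "__main__":
    schedule = build_skip_schedule(
        mag_ind=[0.0, 0.1, 0.1, 0.1, 0.1, 0.1],
        dir_ind=[0.0] * 6,
        tau_mag=0.25,
        tau_dir=1.0,
    )
    print(schedule.tolist())
```

Because the schedule depends only on calibration statistics, it can be computed once and stored; the per-sample work is left to the online stage.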

What carries the argument

Orthogonal decomposition of discrete velocity acceleration along the RF trajectory into a parallel component and an orthogonal residual, which isolates magnitude and directional error sources to set bounded skip intervals and enable history-based velocity reconstruction.
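A small sketch may make the decomposition concrete. It assumes that the discrete acceleration is the first difference of consecutive velocities and that the parallel direction is the previous velocity; the indicator definitions (`mag_ind` as the projection coefficient, `dir_ind` as the relative norm of the orthogonal residual) and the function name are illustrative assumptions, not definitions taken from the paper.

```python
import numpy as np

def decompose_acceleration(v_prev, v_curr, eps=1e-12):
    """Split the discrete acceleration v_curr - v_prev into a component
    parallel to v_prev and an orthogonal residual (assumed convention)."""
    v_prev = np.asarray(v_prev, dtype=float)
    v_curr = np.asarray(v_curr, dtype=float)
    accel = v_curr - v_prev
    # np.vdot flattens its inputs, so latents of any shape are accepted.
    sq_norm = float(np.vdot(v_prev, v_prev)) + eps
    coeff = float(np.vdot(accel, v_prev)) / sq_norm
    parallel = coeff * v_prev        # changes only the magnitude of v_prev
    orthogonal = accel - parallel    # changes only the direction
    mag_ind = coeff                  # hypothetical magnitude indicator
    dir_ind = float(np.linalg.norm(orthogonal)) / np.sqrt(sq_norm)  # hypothetical direction indicator
    return parallel, orthogonal, mag_ind, dir_ind

if __name__ == "__main__":
    par, orth, k, d = decompose_acceleration([1.0, 0.0], [1.5, 0.5])
    print(par, orth, k, d)
```

Under this convention the two parts are orthogonal by construction, so a magnitude error and a directional error can be tracked with separate budgets.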

If this is right

Up to 4.14 times speedup on text-to-image generation while improving all reference-based fidelity metrics over prior cache methods.
Up to 2.11 times speedup on text-to-video generation with the same fidelity gains.
Consistent quality improvements across BAGEL, FLUX.1-dev, and Wan2.1-1.3B without any retraining.
The skip schedule is determined once per model and reused across samples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition could be applied to other iterative solvers that follow curved trajectories in latent space.
Per-sample adaptation of the historical orthogonal direction might further tighten skip bounds on out-of-distribution prompts.
Combining the offline thresholds with a small learned correction term could extend safe skip lengths without reintroducing error.

Load-bearing premise

Offline cumulative variation thresholds on magnitude and direction indicators from the orthogonal decomposition can reliably bound skip intervals across diverse samples, and combining these with a sample's historical orthogonal direction accurately reconstructs skipped velocities without model evaluations or accumulated error.
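To see what the premise asks of the online stage, here is a hedged sketch of one possible reconstruction rule. The formula `v_hat = (1 + k_n) * v_cached + d_n * ||v_cached|| * u_orth` is an assumption chosen for illustration; the source states only that offline statistics are fused with the sample's historical orthogonal direction. The names `reconstruct_velocity`, `k_n`, `d_n`, and `u_orth` are hypothetical.

```python
import numpy as np

def reconstruct_velocity(v_cached, u_orth, k_n, d_n):
    """Approximate a skipped velocity without calling the model.

    v_cached: last velocity actually computed by the model.
    u_orth:   the sample's most recent orthogonal direction (from history).
    k_n, d_n: offline statistics for this step (assumed: relative change
              in magnitude, and relative size of the orthogonal residual).
    """
    v_cached = np.asarray(v_cached, dtype=float)
    u = np.asarray(u_orth, dtype=float)
    sq_norm = float(np.vdot(v_cached, v_cached))
    if sq_norm > 0.0:
        # Remove any component along v_cached so u stays orthogonal.
        u = u - (float(np.vdot(u, v_cached)) / sq_norm) * v_cached
    u_norm = float(np.linalg.norm(u))
    if u_norm > 0.0:
        u = u / u_norm  # unit direction; a zero history leaves u at zero
    return (1.0 + k_n) * v_cached + d_n * np.sqrt(sq_norm) * u

if __name__ == "__main__":
    v_hat = reconstruct_velocity([1.0, 0.0], [0.3, 2.0], k_n=0.5, d_n=0.5)
    print(v_hat)
```

In this sketch the error has two sources: the offline statistics may not match the sample, and the historical direction may have rotated during the skip. Those are the failure modes the premise has to rule out.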

What would settle it

Generation runs on held-out prompts where the reconstructed velocities produce outputs with measurably higher error in reference-based fidelity metrics than the baseline caching method at the same number of function evaluations.

Figures

Figures reproduced from arXiv: 2605.16789 by Hongliang Lu, Kai Liu, Naiyang Guan, Renjing Pei, Xiao Liu, Yulun Zhang, Zhikai Chen, Zhixin Wang.

**Figure 2.** Evolution of the magnitude indicator (MI) and direction indicator (DI) over timesteps for …

**Figure 3.** Overview of TACache. TACache consists of an offline calibration stage and an online inference stage. The calibration stage applies POVD to extract per-timestep statistics (k̃_n, d̃_n), from which SSC constructs a fixed skip schedule; at inference, TASU reconstructs the velocity at skipped steps in four steps, fusing the offline statistics with each sample's own historical orthogonal direction.

**Figure 4.** Visual quality comparison on BAGEL. Red boxes highlight where our method surpasses …

**Figure 5.** Qualitative comparison on FLUX.1-dev. The red boxes highlight defects in other methods, …

**Figure 6.** Qualitative results on Wan2.1-1.3B. Speedup and VBench score are reported for each …

**Figure 7.** Pareto frontiers on BAGEL. TACache achieves a better speed-quality trade-off than …

**Figure 8.** Mean relative state drift along the 50-step sampling trajectory on 50 GenEval prompts under BAGEL. Both methods are evaluated against the corresponding full-step trajectory advanced from the same noise initialization. TACache reaches a final state drift of 13.9% at a 75.5% skip ratio, while MagCache reaches 14.5% at a 71.4% skip ratio. Shaded bands denote the standard error across prompts; vertical dotted …

**Figure 9.** Additional Pareto frontiers on BAGEL. TACache achieves a better speed-quality trade-off …

**Figure 10.** Visual results of BAGEL

**Figure 11.** Visual results of FLUX.1-dev

**Figure 12.** Visual results of Wan2.1-1.3B

**Figure 13.** Visual results of Wan2.1-1.3B

**Figure 14.** Visual results of Wan2.1-1.3B
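The drift measurement reported in the Figure 8 caption can be approximated with a short sketch. It assumes that "relative state drift" means the L2 distance between the cached and full-step states divided by the norm of the full-step state, averaged over prompts, with the standard error taken across prompts; the caption does not spell out the formula, so this definition and the function name `mean_relative_drift` are assumptions.

```python
import numpy as np

def mean_relative_drift(x_cached, x_full, eps=1e-12):
    """Per-step mean and standard error of the relative state drift.

    x_cached, x_full: arrays of shape (num_prompts, num_steps, ...),
    holding states from the accelerated and the full-step trajectories
    started from the same noise.
    """
    x_cached = np.asarray(x_cached, dtype=float)
    x_full = np.asarray(x_full, dtype=float)
    p, s = x_full.shape[:2]
    diff = (x_cached - x_full).reshape(p, s, -1)
    ref = x_full.reshape(p, s, -1)
    # Relative L2 error for every prompt and step.
    rel = np.linalg.norm(diff, axis=-1) / (np.linalg.norm(ref, axis=-1) + eps)
    mean = rel.mean(axis=0)
    if p > 1:
        sem = rel.std(axis=0, ddof=1) / np.sqrt(p)
    else:
        sem = np.zeros(s)  # standard error is undefined for one prompt
    return mean, sem

if __name__ == "__main__":
    full = np.ones((2, 3, 2))
    mean, sem = mean_relative_drift(1.1 * full, full)
    print(mean, sem)
```

Comparing two caching methods on this curve is only meaningful when both are measured against the same full-step reference, as the caption notes.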

Original abstract

Diffusion and rectified flow (RF) models generate high-fidelity images and videos, but their iterative velocity-field evaluations are computationally expensive. Existing caching methods accelerate sampling by skipping timesteps, yet their coarse approximations introduce accumulated errors over long skip intervals and degrade quality under aggressive acceleration. We propose TACache (Trajectory-Aware Cache), a training-free acceleration framework following a skip-then-compensate paradigm. TACache performs an orthogonal decomposition of discrete velocity acceleration along the RF trajectory into a parallel component and an orthogonal residual, isolating the magnitude and directional sources of per-step approximation error. The framework operates in two stages: offline, cumulative variation thresholds on the magnitude and direction indicators yield the skip schedule and bound how far each skip interval may extend; online, at each skipped step the offline statistics are combined with the sample's historical orthogonal direction to reconstruct the skipped velocity without additional model evaluations. Experiments on BAGEL, FLUX.1-dev, and Wan2.1-1.3B show that TACache achieves up to 4.14× speedup on text-to-image generation and 2.11× speedup on text-to-video generation, with consistent improvements over prior cache-based methods on all reference-based fidelity metrics. Code will be released soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TACache adds orthogonal decomposition to separate magnitude and direction errors in RF caching, but the lack of math details and untested generalization on thresholds leaves the speedups hard to trust fully.

The main point here is that TACache introduces a trajectory-aware caching method for rectified flow models by orthogonally decomposing the discrete velocity acceleration to separate magnitude and directional error sources, then combining offline thresholds with online historical directions for reconstruction. This approach is new in extending beyond basic timestep skipping. It adds an explicit decomposition step and a compensate mechanism that avoids extra model calls during skips. The offline stage sets skip schedules based on cumulative variations, while the online stage reconstructs velocities using past orthogonal components.

The work does well in its experimental claims. It reports speedups of up to 4.14 times for text-to-image on BAGEL and FLUX.1-dev, and 2.11 times for text-to-video on Wan2.1-1.3B. These come with improvements over prior cache methods across fidelity metrics, and the training-free design makes it easy to apply to current models without retraining.

However, the soft spots are noticeable. The abstract lacks any details on the decomposition math, error bounds, or ablations for the reconstruction process, which leaves the central mechanism hard to evaluate. The offline thresholds from a fixed calibration set might not reliably bound errors for all diverse samples and prompts, potentially allowing reconstruction drift if directional changes occur during skips. This matches the stress-test worry about generalization, and without strong evidence in the paper that it holds up broadly, the reported gains could depend on the specific test conditions.

This paper is aimed at people building or deploying generative AI systems where faster inference is needed for images and videos. A reader focused on practical accelerations in diffusion and flow models would get some value from the reported numbers and the new paradigm. I think it deserves a serious referee. The idea addresses a genuine bottleneck, and review could clarify the technical gaps and test the robustness claims.

Referee Report

3 major / 2 minor

Summary. The paper proposes TACache, a training-free acceleration framework for rectified flow (RF) models in text-to-image and text-to-video generation. It introduces an orthogonal decomposition of discrete velocity acceleration into parallel (magnitude) and orthogonal (directional) components along the RF trajectory. Offline, cumulative variation thresholds on these indicators determine safe skip intervals and schedules; online, skipped velocities are reconstructed by combining the offline statistics with the sample's historical orthogonal direction, avoiding additional model evaluations. Experiments on BAGEL, FLUX.1-dev, and Wan2.1-1.3B report speedups up to 4.14× for T2I and 2.11× for T2V, with consistent gains over prior cache-based methods on reference-based fidelity metrics.

Significance. If the key assumptions on threshold generalizability and reconstruction fidelity hold, TACache would provide a practical, training-free way to accelerate RF sampling while mitigating the quality degradation common in aggressive caching. The multi-model evaluation (including both image and video) and planned code release are positive elements that would support reproducibility and adoption in the field.

major comments (3)

[§3] §3 (Method): The manuscript provides no derivation or error bound for the reconstruction step that substitutes the historical orthogonal direction for skipped velocities. Without a quantitative analysis showing that the deviation remains bounded over multiple skips (or an ablation measuring accumulated drift on held-out trajectories), the central claim that TACache avoids the accumulated errors of prior coarse approximations lacks verifiable support.
[§4.1] §4.1 (Offline stage) and Experiments: The cumulative variation thresholds are computed once on a fixed calibration distribution, yet no analysis or ablation demonstrates that these thresholds remain safe for diverse test prompts, noise seeds, or out-of-distribution samples. This directly bears on the weakest assumption that offline bounds generalize and prevent reconstruction drift.
[Results] Results section / Tables: While speedups and metric improvements are reported, there is no ablation isolating the orthogonal decomposition from the skip schedule alone, nor any comparison of reconstruction error with and without the historical direction term. This makes it difficult to attribute the reported fidelity gains specifically to the proposed decomposition.

minor comments (2)

[Abstract] Abstract: The phrase 'Code will be released soon' is vague; specifying a repository link or expected release timeline would improve clarity.
[§3] Notation: The definitions of the magnitude and direction indicators after the orthogonal decomposition should be stated explicitly with equations rather than described only in prose.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for major revision. We address each major comment point-by-point below. Where the comments identify gaps in analysis or ablations, we will revise the manuscript to incorporate the requested elements, which we believe will strengthen the empirical support for TACache's claims.

Referee: [§3] §3 (Method): The manuscript provides no derivation or error bound for the reconstruction step that substitutes the historical orthogonal direction for skipped velocities. Without a quantitative analysis showing that the deviation remains bounded over multiple skips (or an ablation measuring accumulated drift on held-out trajectories), the central claim that TACache avoids the accumulated errors of prior coarse approximations lacks verifiable support.

Authors: We thank the referee for this observation. Section 3 presents the orthogonal decomposition of discrete velocity acceleration into parallel (magnitude) and orthogonal (directional) components along the RF trajectory, with the reconstruction step combining offline cumulative variation statistics and the sample-specific historical orthogonal direction. While the original manuscript does not include a formal derivation or theoretical error bound, the design is motivated by isolating directional changes to reduce approximation error over skips. To provide verifiable support, we will add a new empirical subsection analyzing accumulated drift on held-out trajectories, measuring deviation growth over multiple consecutive skips.

revision: yes

Referee: [§4.1] §4.1 (Offline stage) and Experiments: The cumulative variation thresholds are computed once on a fixed calibration distribution, yet no analysis or ablation demonstrates that these thresholds remain safe for diverse test prompts, noise seeds, or out-of-distribution samples. This directly bears on the weakest assumption that offline bounds generalize and prevent reconstruction drift.

Authors: We acknowledge that the thresholds are derived from a single calibration distribution. Our main experiments already evaluate TACache on diverse prompts drawn from standard benchmarks across BAGEL, FLUX.1-dev, and Wan2.1-1.3B for both image and video tasks, with consistent fidelity gains. To directly address generalizability, we will add ablations in the revised manuscript that test the fixed thresholds on varied noise seeds and out-of-distribution prompts, reporting any resulting reconstruction drift and confirming that the offline bounds remain safe.

revision: yes

Referee: [Results] Results section / Tables: While speedups and metric improvements are reported, there is no ablation isolating the orthogonal decomposition from the skip schedule alone, nor any comparison of reconstruction error with and without the historical direction term. This makes it difficult to attribute the reported fidelity gains specifically to the proposed decomposition.

Authors: We appreciate this request for clearer attribution. The reported results compare TACache against prior cache-based methods that use different approximation strategies. To isolate the contribution of the orthogonal decomposition and historical direction term, we will add an ablation study in the revised manuscript. This will include a variant that applies the skip schedule without the reconstruction term, and will report reconstruction error metrics both with and without the historical orthogonal direction component.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The TACache framework builds directly on standard rectified flow velocity fields by introducing an orthogonal decomposition of discrete velocity acceleration into parallel and orthogonal components. Offline cumulative variation thresholds are computed once on a calibration distribution to determine skip schedules, while online reconstruction substitutes historical orthogonal directions for skipped steps. These steps are presented as independent engineering choices whose error bounds are claimed to be empirically verifiable on held-out samples rather than derived by construction from the reported speedups or fidelity metrics. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method; the central claims rest on the external validity of the decomposition and thresholds rather than reducing to the inputs by definition.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the validity of the orthogonal decomposition for isolating error sources and on the offline statistics being sufficiently representative for online compensation across samples.

free parameters (1)

cumulative variation thresholds
Offline thresholds on magnitude and direction indicators that determine the skip schedule and maximum skip interval length.

axioms (1)

domain assumption: Discrete velocity acceleration along the RF trajectory can be orthogonally decomposed into a parallel component and an orthogonal residual that isolate magnitude and directional sources of per-step approximation error.
This decomposition underpins both the offline skip schedule and the online velocity reconstruction without additional model calls.
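One way to write the assumption down, offered as a sketch rather than as the paper's own notation: take the discrete acceleration to be the first difference of consecutive velocities and project it onto the previous velocity. The symbols $a_n$, $a_n^{\parallel}$, and $a_n^{\perp}$, and the choice of $v_{n-1}$ as the reference direction, are assumptions, since the source gives no equations.

```latex
\begin{aligned}
a_n &= v_n - v_{n-1}, \\
a_n^{\parallel} &= \frac{\langle a_n, v_{n-1}\rangle}{\lVert v_{n-1}\rVert^{2}}\, v_{n-1},
\qquad
a_n^{\perp} = a_n - a_n^{\parallel}, \\
\lVert a_n \rVert^{2} &= \lVert a_n^{\parallel} \rVert^{2} + \lVert a_n^{\perp} \rVert^{2}.
\end{aligned}
```

The last line holds because $\langle a_n^{\parallel}, a_n^{\perp}\rangle = 0$. It shows why the two contributions can be bounded separately: the parallel term changes only the length of the velocity, and the orthogonal term changes only its direction.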

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 7 internal anchors

[1] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space. arXiv, 2025.

[2] Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Dahua Lin, and Jiaqi Wang. DiCache: Let diffusion model determine its own cache. In ICLR, 2026.

[3] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.

[4] Xiaomeng Fu, Jia Li, Yiming Hu, Yong Wang, Xi Wang, Hayden Kwok-Hay So, and Xiangxiang Chu. Acceleration-aware sampling for few-step rectified flow models, 2025. URL https://openreview.net/forum?id=hCIdPgTfUw

[5] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment. NeurIPS, 2023.

[6] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022.

[7] Lei Ke, Haohang Xu, Xuefei Ning, Yu Li, Jiajun Li, Haoling Li, Yuxuan Lin, Dongsheng Jiang, Yujiu Yang, and Linfeng Zhang. ProReflow: Progressive reflow with decomposed velocity. In CVPR, 2025.

[8] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

[9] Sangyun Lee, Zinan Lin, and Giulia Fanti. Improving the training of rectified flows. In NeurIPS, 2024.

[10] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.

[11] Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, and Fang Wan. Timestep embedding tells: It’s time to cache for video diffusion model. In CVPR, 2025.

[12] Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, and Linfeng Zhang. From reusing to forecasting: Accelerating diffusion models with TaylorSeers. In ICCV, 2025.

[13] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In NeurIPS, 2022.

[14] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023.

[15] Zehong Ma, Longhui Wei, Feng Wang, Shiliang Zhang, and Qi Tian. MagCache: Fast video generation with magnitude-aware cache. In NeurIPS, 2025.

[16] Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Chaoran Feng, Bin Zhu, and Li Yuan. WISE: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265, 2025.

[17] Weiming Ren, Huan Yang, Ge Zhang, Cong Wei, Xinrun Du, Wenhao Huang, and Wenhu Chen. ConsistI2V: Enhancing visual consistency for image-to-video generation. TMLR, 2024.

[18] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In ICLR, 2022.

[19] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In ICML, 2023.

[20] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T... Wan: Open and Advanced Large-Scale Video Generative Models. arXiv, 2025.

[21] Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring CLIP for assessing the look and feel of images. In AAAI, 2023.

[22] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao. InternVid: A large-scale video-text dataset for multimodal understanding and generation. In ICLR, 2024.

[23] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.

[24] Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, Qiong Yan, Xiongkuo Min, Guangtao Zhai, and Weisi Lin. Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090, 2023.

[25] Hanshu Yan, Xingchao Liu, Jiachun Pan, Jun Hao Liew, Qiang Liu, and Jiashi Feng. PeRFlow: Piecewise rectified flow as universal plug-and-play accelerator. In NeurIPS, 2024.

[26] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

[27] Tianchen Zhao, Tongcheng Fang, Haofeng Huang, Rui Wan, Widyadewi Soedarmadji, Enshu Liu, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. ViDiT-Q: Efficient and accurate quantization of diffusion transformers for image and video generation. In ICLR, 2025.

[28] Xin Zhou, Dingkang Liang, Kaijin Chen, Tianrui Feng, Xiwu Chen, Hongkai Lin, Yikang Ding, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Less is enough: Training-free video diffusion acceleration via runtime-adaptive caching. arXiv preprint arXiv:2507.02860, 2025.

A Implementation Details

T2I images are generated at 1024×1024 resolution, while T2V clips a...

[1] [1]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. FLUX.1 Kontext: Flow matching for in-context image ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

DiCache: Let diffusion model determine its own cache

Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Dahua Lin, and Jiaqi Wang. DiCache: Let diffusion model determine its own cache. InICLR, 2026

work page 2026

[3] [3]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Acceleration-aware sampling for few-step rectified flow models, 2025

Xiaomeng Fu, Jia Li, Yiming Hu, Yong Wang, Xi Wang, Hayden Kwok-Hay So, and Xiangxiang Chu. Acceleration-aware sampling for few-step rectified flow models, 2025. URL https: //openreview.net/forum?id=hCIdPgTfUw

work page 2025

[5] [5]

GenEval: An object-focused framework for evaluating text-to-image alignment.NeurIPS, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment.NeurIPS, 2023

work page 2023

[6] [6]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. InNeurIPS, 2022

work page 2022

[7] [7]

ProReflow: Progressive reflow with decomposed velocity

Lei Ke, Haohang Xu, Xuefei Ning, Yu Li, Jiajun Li, Haoling Li, Yuxuan Lin, Dongsheng Jiang, Yujiu Yang, and Linfeng Zhang. ProReflow: Progressive reflow with decomposed velocity. In CVPR, 2025

work page 2025

[8] [8]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Improving the training of rectified flows

Sangyun Lee, Zinan Lin, and Giulia Fanti. Improving the training of rectified flows. InNeurIPS, 2024

work page 2024

[10] [10]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InICLR, 2023

work page 2023

[11] [11]

Timestep embedding tells: It’s time to cache for video diffusion model

Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, and Fang Wan. Timestep embedding tells: It’s time to cache for video diffusion model. InCVPR, 2025

work page 2025

[12] [12]

From reusing to forecasting: Accelerating diffusion models with TaylorSeers

Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, and Linfeng Zhang. From reusing to forecasting: Accelerating diffusion models with TaylorSeers. InICCV, 2025

work page 2025

[13] [13]

DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. InNeurIPS, 2022

work page 2022

[14] [14]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

MagCache: Fast video generation with magnitude-aware cache

Zehong Ma, Longhui Wei, Feng Wang, Shiliang Zhang, and Qi Tian. MagCache: Fast video generation with magnitude-aware cache. InNeurIPS, 2025

work page 2025

[16] Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Chaoran Feng, Bin Zhu, and Li Yuan. WISE: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265, 2025.

[17] Weiming Ren, Huan Yang, Ge Zhang, Cong Wei, Xinrun Du, Wenhao Huang, and Wenhu Chen. ConsistI2V: Enhancing visual consistency for image-to-video generation. TMLR, 2024.

[18] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In ICLR, 2022.

[19] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In ICML, 2023.

[20] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T... Wan: Open and Advanced Large-Scale Video Generative Models. arXiv preprint, 2025.

[21] Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring CLIP for assessing the look and feel of images. In AAAI, 2023.

[22] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao. InternVid: A large-scale video-text dataset for multimodal understanding and generation. In ICLR, 2024.

[23] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.

[24] Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, Qiong Yan, Xiongkuo Min, Guangtao Zhai, and Weisi Lin. Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090, 2023.

[25] Hanshu Yan, Xingchao Liu, Jiachun Pan, Jun Hao Liew, Qiang Liu, and Jiashi Feng. PeRFlow: Piecewise rectified flow as universal plug-and-play accelerator. In NeurIPS, 2024.

[26] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

[27] Tianchen Zhao, Tongcheng Fang, Haofeng Huang, Rui Wan, Widyadewi Soedarmadji, Enshu Liu, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. ViDiT-Q: Efficient and accurate quantization of diffusion transformers for image and video generation. In ICLR, 2025.

[28] Xin Zhou, Dingkang Liang, Kaijin Chen, Tianrui Feng, Xiwu Chen, Hongkai Lin, Yikang Ding, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Less is enough: Training-free video diffusion acceleration via runtime-adaptive caching. arXiv preprint arXiv:2507.02860, 2025.

A Implementation Details

T2I images are generated at 1024×1024 resolution, while T2V clips a...