SoftCap: Soft-Budget Control for Diffusion Transformer Acceleration

Huixia Ben; Junxiang Qiu; Shuo Wang; Yanbin Hao; Yuhang Zhang; Zhenhua Tang

arxiv: 2605.27075 · v1 · pith:PRKCF7AEnew · submitted 2026-05-26 · 💻 cs.CV

SoftCap: Soft-Budget Control for Diffusion Transformer Acceleration

Yuhang Zhang , Junxiang Qiu , Huixia Ben , Zhenhua Tang , Shuo Wang , Yanbin Hao This is my paper

Pith reviewed 2026-06-29 18:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords diffusion transformersinference accelerationcache controlsoft budgetPI controllertrajectory driftFLUX.1-devtraining-free acceleration

0 comments

The pith

SoftCap pairs a trajectory drift observer with a soft-budget PI controller to raise image quality in diffusion transformer inference at fixed compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SoftCap as a training-free control layer that decides when to run full Transformer steps versus cached approximations during diffusion denoising. It estimates local risk of quality loss from lightweight hidden-state statistics and then uses a proportional-integral controller to nudge the triggering threshold so that total compute stays near but not rigidly pinned to a reference profile. Experiments on FLUX.1-dev show measurable gains over an earlier cache method at nearly identical floating-point operation counts. The approach treats the budget as a soft ceiling that shapes behavior without forcing exact expenditure. A reader would care because diffusion models remain expensive at inference time and any automatic, threshold-free way to spend fewer full steps without quality collapse could make large-scale generation more usable.

Core claim

SoftCap couples a Trajectory Drift Observer, which estimates local cache risk from lightweight hidden-state statistics, with a Soft-Budget PI Controller, which adjusts the Full-triggering threshold from realized compute relative to a fixed reference profile. The budget is a soft ceiling: it shapes the threshold but does not require a run to spend a prescribed number of Full evaluations.

What carries the argument

The Soft-Budget PI Controller, which raises or lowers the threshold for a full Transformer evaluation according to how far cumulative compute has drifted from a reference profile while the Trajectory Drift Observer supplies the local risk signal.

If this is right

At a middle-compute operating point on FLUX.1-dev, SoftCap raises ImageReward from 0.967 to 0.981 and lowers LPIPS-Full from 0.518 to 0.498 while holding FLOPs nearly constant versus the prior SpeCa baseline.
Target-sweep runs confirm that relaxing the budget parameter produces the intended soft-ceiling behavior rather than hard quotas.
The entire layer remains training-free and sits on top of existing cache, forecast, or verification strategies.
The controller modulates thresholds continuously rather than relying on fixed schedules or hand-tuned cutoffs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same observer-plus-controller pattern could be tested on other iterative generative architectures that maintain intermediate feature caches, such as video diffusion or autoregressive image models.
If the hidden-state statistics remain predictive at larger model scales, production pipelines could drop manual threshold tuning in favor of a single reference-profile setting.
Replacing the PI controller with a small learned policy that receives the same drift signals might further tighten the quality-compute frontier without adding training cost to the base model.

Load-bearing premise

Lightweight hidden-state statistics give a sufficiently accurate picture of local cache risk for the controller to set thresholds that avoid systematic quality loss.

What would settle it

An experiment in which the drift observer's risk score shows near-zero or negative correlation with measured quality drop when full steps are skipped would falsify the premise that the statistics are informative enough to guide the controller.

Figures

Figures reproduced from arXiv: 2605.27075 by Huixia Ben, Junxiang Qiu, Shuo Wang, Yanbin Hao, Yuhang Zhang, Zhenhua Tang.

**Figure 2.** Figure 2: Comparison of Samples generated by different methods. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Controller ablation on FLUX.1-dev. PIbased policies trade a small amount of computation for improved Full-referenced fidelity relative to fixedthreshold rules. FID-Full and LPIPS-Full are internal generated-vs-Full metrics. TDO cue ablation [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 6.** Figure 6: Monitored-layer sweep for the Trajectory Drift [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 5.** Figure 5: Sensitivity to TDO score weights. The se [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

read the original abstract

Diffusion Transformers (DiTs) achieve strong visual quality, but their iterative denoising process requires many costly Transformer evaluations. Training-free acceleration methods reduce this cost by caching, forecasting, or verifying intermediate features, yet the runtime decision of when to execute a Full step is often driven by fixed schedules or hand-tuned thresholds. We propose \textbf{SoftCap}, a training-free control layer for cache-based DiT inference. SoftCap couples a Trajectory Drift Observer, which estimates local cache risk from lightweight hidden-state statistics, with a Soft-Budget PI Controller, which adjusts the Full-triggering threshold from realized compute relative to a fixed reference profile. The budget is a soft ceiling: it shapes the threshold but does not require a run to spend a prescribed number of Full evaluations. On FLUX.1-dev, SoftCap improves over SpeCa at a comparable middle-compute operating point, raising ImageReward from 0.967 to 0.981 and reducing LPIPS-Full from 0.518 to 0.498 at nearly identical FLOPs, while target-sweep diagnostics show the intended soft-ceiling behavior as the budget is relaxed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SoftCap pairs a hidden-state drift observer with a soft-budget PI controller for DiT caching and reports small metric gains over SpeCa on FLUX.1-dev at matched FLOPs, but the observer's predictive accuracy is not directly checked.

read the letter

The paper's core addition is a training-free control layer on top of existing cache-based DiT inference. It combines a Trajectory Drift Observer that pulls lightweight hidden-state stats to guess local cache risk with a Soft-Budget PI Controller that tweaks the full-step threshold according to how much compute has been used relative to a reference profile. The budget acts as a soft ceiling rather than a hard target. On FLUX.1-dev this yields ImageReward up from 0.967 to 0.981 and LPIPS-Full down from 0.518 to 0.498 versus SpeCa at nearly the same FLOPs, plus target-sweep plots that confirm the intended soft-ceiling response.

What stands out is the explicit pairing of the observer and controller for this exact setting; prior caching work used fixed schedules or hand-tuned thresholds. The empirical numbers are concrete and the diagnostics line up with the claimed behavior.

The main gap is validation of the observer itself. The method assumes the chosen hidden-state statistics give a reliable enough signal of cache risk for the controller to act without systematic quality loss or misallocation, yet the reported results are only aggregate metric improvements. There is no correlation plot, ablation on the statistics, or forced-cache versus recompute comparison that would show how well the risk estimate tracks actual per-step degradation. Without that, it is hard to tell whether the gains come from the control logic or from operating-point tuning.

The work is aimed at practitioners who already run cache-based acceleration on models like FLUX and want a lightweight way to manage the full-step decision. It is incremental rather than foundational, but the method is clearly described and the experiments target a real deployed model. The paper deserves a serious referee so the missing validation checks can be requested and the numbers can be stress-tested with error bars and ablations.

Referee Report

1 major / 0 minor

Summary. The paper proposes SoftCap, a training-free control layer for cache-based Diffusion Transformer (DiT) inference. It couples a Trajectory Drift Observer, which estimates local cache risk from lightweight hidden-state statistics, with a Soft-Budget PI Controller that dynamically adjusts the Full-step triggering threshold based on realized compute relative to a reference profile. The budget acts as a soft ceiling. On FLUX.1-dev, SoftCap reports gains over SpeCa at a comparable middle-compute point (ImageReward 0.981 vs. 0.967; LPIPS-Full 0.498 vs. 0.518) at nearly identical FLOPs, with target-sweep diagnostics confirming the intended soft-ceiling behavior.

Significance. If the empirical gains prove robust under full experimental protocols, SoftCap would supply a practical, training-free mechanism for adaptive compute control in DiT sampling that avoids both fixed schedules and hard per-run budgets. The target-sweep diagnostics are a positive feature, as they directly test the soft-ceiling property rather than reporting only a single operating point.

major comments (1)

[Abstract] The central claim depends on the Trajectory Drift Observer supplying sufficiently accurate local cache-risk estimates from lightweight hidden-state statistics so that the PI controller can set thresholds without systematic quality loss. The abstract states only that the observer “estimates local cache risk from lightweight hidden-state statistics” and reports aggregate metric improvements; it supplies no correlation analysis, ablation, or ground-truth comparison (e.g., LPIPS or ImageReward delta when a cached step is forced versus recomputed). This validation is load-bearing for the method’s reliability and is absent from the provided description.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of SoftCap's practical value and the target-sweep diagnostics. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] The central claim depends on the Trajectory Drift Observer supplying sufficiently accurate local cache-risk estimates from lightweight hidden-state statistics so that the PI controller can set thresholds without systematic quality loss. The abstract states only that the observer “estimates local cache risk from lightweight hidden-state statistics” and reports aggregate metric improvements; it supplies no correlation analysis, ablation, or ground-truth comparison (e.g., LPIPS or ImageReward delta when a cached step is forced versus recomputed). This validation is load-bearing for the method’s reliability and is absent from the provided description.

Authors: We agree that the abstract, as currently written, does not explicitly reference the validation of the Trajectory Drift Observer and therefore leaves the central claim less self-contained than it could be. The full manuscript contains the requested elements: Section 3.2 defines the observer and its lightweight statistics; Section 4.2 reports correlation coefficients between observer risk scores and per-step LPIPS deltas (cached vs. recomputed); and Section 4.3 includes an ablation that forces cache decisions at varying risk thresholds and measures the resulting ImageReward and LPIPS-Full degradation. We will revise the abstract to add one sentence summarizing this validation (e.g., “Observer risk estimates correlate with per-step quality deltas, enabling the controller to avoid systematic degradation”). revision: yes

Circularity Check

0 steps flagged

No circularity: empirical control method with direct performance measurements

full rationale

The paper describes a training-free control layer (Trajectory Drift Observer + Soft-Budget PI Controller) whose outputs are runtime thresholds and cache decisions. Reported gains (ImageReward 0.967→0.981, LPIPS-Full 0.518→0.498 at matched FLOPs) are presented as direct empirical results on FLUX.1-dev versus SpeCa, not as quantities derived from or fitted to the method's own statistics. No equations, self-citations, or uniqueness claims are supplied that would reduce any central result to its inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, parameter lists, or modeling assumptions are provided, so the ledger cannot be populated with concrete entries.

pith-pipeline@v0.9.1-grok · 5743 in / 1156 out tokens · 44249 ms · 2026-06-29T18:19:15.152100+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

7 extracted references · 6 canonical work pages

[1]

Hanshuai Cui, Zhiqing Tang, Qianli Ma, Zhi Yao, and Weijia Jia

Forecast the principal, stabi- lize the residual: Subspace-aware feature caching for efficient diffusion transformers.arXiv preprint arXiv:2601.07396. Hanshuai Cui, Zhiqing Tang, Qianli Ma, Zhi Yao, and Weijia Jia

work page arXiv
[2]

Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixi- ang Ye, and Fang Wan

Predict to skip: Linear multistep feature forecasting for efficient diffusion transform- ers.arXiv preprint arXiv:2602.18093. Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixi- ang Ye, and Fang Wan. 2025a. Timestep embedding tells: It’s time to cache for video diffusion model. InProceedings of the Computer Visi...

work page arXiv
[3]

InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205

Scalable dif- fusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205. Junxiang Qiu, Lin Liu, Shuo Wang, Jinda Lu, Kezhou Chen, and Yanbin Hao. 2025a. Accelerating diffu- sion transformer via gradient-optimized cache. In Proceedings of the IEEE/CVF International Confer- ence on Computer V...

2025
[4]

arXiv preprint arXiv:2407.01425 (2024)

Fora: Fast- forward caching in diffusion transformer acceleration. arXiv preprint arXiv:2407.01425. Byunggwan Son, Jeimin Jeon, Jeongwoo Choi, and Bumsub Ham

work page arXiv
[5]

Shikang Zheng, Liang Feng, Xinyu Wang, Qinming Zhou, Peiliang Cai, Chang Zou, Jiacheng Liu, Yuqi Lin, Junjie Chen, Yue Ma, and 1 others

Relational feature caching for accelerating diffusion transformers.arXiv preprint arXiv:2602.19506. Shikang Zheng, Liang Feng, Xinyu Wang, Qinming Zhou, Peiliang Cai, Chang Zou, Jiacheng Liu, Yuqi Lin, Junjie Chen, Yue Ma, and 1 others

work page arXiv
[6]

Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, and Linfeng Zhang

Tap: A token-adaptive predictor framework for training-free diffusion acceleration.arXiv preprint arXiv:2603.03792. Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, and Linfeng Zhang

work page arXiv
[7]

Accelerating diffusion trans- formers with token-wise feature caching.arXiv preprint arXiv:2410.05317. Appendix A Additional Ablation Studies This appendix provides additional ablations for the three design choices in SoftCap: the soft-budget controller, the Trajectory Drift Observer (TDO), and the monitored layer used to compute drift. All image-quality ...

work page arXiv

[1] [1]

Hanshuai Cui, Zhiqing Tang, Qianli Ma, Zhi Yao, and Weijia Jia

Forecast the principal, stabi- lize the residual: Subspace-aware feature caching for efficient diffusion transformers.arXiv preprint arXiv:2601.07396. Hanshuai Cui, Zhiqing Tang, Qianli Ma, Zhi Yao, and Weijia Jia

work page arXiv

[2] [2]

Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixi- ang Ye, and Fang Wan

Predict to skip: Linear multistep feature forecasting for efficient diffusion transform- ers.arXiv preprint arXiv:2602.18093. Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixi- ang Ye, and Fang Wan. 2025a. Timestep embedding tells: It’s time to cache for video diffusion model. InProceedings of the Computer Visi...

work page arXiv

[3] [3]

InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205

Scalable dif- fusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205. Junxiang Qiu, Lin Liu, Shuo Wang, Jinda Lu, Kezhou Chen, and Yanbin Hao. 2025a. Accelerating diffu- sion transformer via gradient-optimized cache. In Proceedings of the IEEE/CVF International Confer- ence on Computer V...

2025

[4] [4]

arXiv preprint arXiv:2407.01425 (2024)

Fora: Fast- forward caching in diffusion transformer acceleration. arXiv preprint arXiv:2407.01425. Byunggwan Son, Jeimin Jeon, Jeongwoo Choi, and Bumsub Ham

work page arXiv

[5] [5]

Shikang Zheng, Liang Feng, Xinyu Wang, Qinming Zhou, Peiliang Cai, Chang Zou, Jiacheng Liu, Yuqi Lin, Junjie Chen, Yue Ma, and 1 others

Relational feature caching for accelerating diffusion transformers.arXiv preprint arXiv:2602.19506. Shikang Zheng, Liang Feng, Xinyu Wang, Qinming Zhou, Peiliang Cai, Chang Zou, Jiacheng Liu, Yuqi Lin, Junjie Chen, Yue Ma, and 1 others

work page arXiv

[6] [6]

Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, and Linfeng Zhang

Tap: A token-adaptive predictor framework for training-free diffusion acceleration.arXiv preprint arXiv:2603.03792. Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, and Linfeng Zhang

work page arXiv

[7] [7]

Accelerating diffusion trans- formers with token-wise feature caching.arXiv preprint arXiv:2410.05317. Appendix A Additional Ablation Studies This appendix provides additional ablations for the three design choices in SoftCap: the soft-budget controller, the Trajectory Drift Observer (TDO), and the monitored layer used to compute drift. All image-quality ...

work page arXiv