arxiv: 2605.15684 · v1 · pith:TEJITKKDnew · submitted 2026-05-15 · 💻 cs.CV

ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

Pith reviewed 2026-05-20 19:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords Diffusion Transformer · Elastic architecture · Sparse attention · Mobile image generation · High-resolution generation · Dynamic trade-off · SSBA · T-DVAE

The pith

A single ElasticDiT model reconfigures its compression ratio and block depth on the fly to outperform specialized baselines across fidelity and latency on mobile devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that Diffusion Transformers can be made practical for high-resolution image generation on mobile devices by building one model that changes its spatial compression ratio and block depth at runtime. It adds Shift Sparse Block Attention (SSBA) to cut attention computation at high sparsity levels, and a compact distilled VAE (T-DVAE) that keeps reconstruction quality close to larger models at roughly one-eighth the compute. Experiments show the same trained weights can be switched between modes to hit different speed-quality points without retraining, which removes the need to ship multiple fixed models for different hardware budgets. The flex-lite setting reaches an HPS score of 32.87, above the Flux baseline, at 84.16 percent average sparsity.

Core claim

ElasticDiT achieves dynamic fidelity-latency trade-offs within a single set of parameters by jointly varying spatial compression ratios and DiT block depths, while Shift Sparse Block Attention (SSBA) maintains competitive image quality at 84.16 percent average sparsity and the Tiny DWT-Distilled VAE (T-DVAE) delivers SD3-level reconstruction at one-eighth the cost of standard VAEs.

What carries the argument

Elastic architecture that jointly adjusts spatial compression ratios and DiT block depths at inference time, supported by Shift Sparse Block Attention for sparsity and a Tiny DWT-Distilled VAE for efficient encoding.
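
To make the mechanism concrete, here is a minimal sketch of what runtime selection over compression ratio and block depth could look like. The `ElasticConfig` class, the cost proxy (blocks times tokens), and the helper names are illustrative assumptions, not the paper's API or its measured latencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElasticConfig:
    """Hypothetical runtime setting for one shared set of weights."""
    compression_ratio: int  # assumed pixel-to-token downsampling factor per side
    depth: int              # assumed number of DiT blocks executed


def num_tokens(height: int, width: int, cfg: ElasticConfig) -> int:
    """Token count after spatial compression."""
    return (height // cfg.compression_ratio) * (width // cfg.compression_ratio)


def relative_cost(height: int, width: int, cfg: ElasticConfig) -> int:
    """Toy cost proxy: executed blocks times tokens (an assumption, not a latency model)."""
    return cfg.depth * num_tokens(height, width, cfg)


def pick_config(height, width, configs, budget):
    """Choose the most expensive configuration that still fits the budget."""
    feasible = [c for c in configs if relative_cost(height, width, c) <= budget]
    if not feasible:
        raise ValueError("no configuration fits the budget")
    return max(feasible, key=lambda c: relative_cost(height, width, c))
```

Under this toy proxy, halving the token grid per side and halving the depth cuts the cost by a factor of eight, which is why adjusting both jointly spans a wide range of operating points.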

If this is right

  • One trained ElasticDiT checkpoint can be reconfigured on the fly to serve many different mobile hardware budgets.
  • The flex-lite variant surpasses the Flux model on HPS while operating at 84.16 percent average sparsity through SSBA.
  • T-DVAE supplies SD3-level reconstruction quality using only one-eighth the compute of a standard VAE.
  • Flow-GRPO raises GenEval alignment from 66.93 to 73.62 without changing the core architecture.
  • Deployment no longer requires maintaining separate task-specific models for each latency target.
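
The "average sparsity" figure above is the fraction of query-key pairs that the attention mask skips. This page does not describe the exact SSBA pattern, so the sketch below uses a simple 1D block mask with a cyclic shift, in the spirit of shifted-window attention, purely as an assumed stand-in to show how such a sparsity number is computed.

```python
def block_mask(n_tokens: int, block: int, shift: int = 0):
    """mask[i][j] is True when query i may attend to key j.

    Tokens are grouped into contiguous blocks after a cyclic shift.
    This is an illustrative pattern, not the paper's SSBA definition.
    """
    def block_id(i: int) -> int:
        return ((i + shift) % n_tokens) // block

    return [[block_id(i) == block_id(j) for j in range(n_tokens)]
            for i in range(n_tokens)]


def sparsity(mask) -> float:
    """Fraction of query-key pairs that are masked out."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return 1.0 - kept / total
```

With 16 tokens and blocks of 4, each query keeps 4 of 16 keys, so the sparsity is 0.75 with or without the shift; the shift only changes which tokens share a block, letting information cross block boundaries over successive layers.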

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same elastic reconfiguration idea could be applied to other diffusion backbones to reduce the number of models needed across edge devices.
  • Real-time mobile apps could switch between high-quality and low-latency modes based on battery level or user preference without reloading weights.
  • Future hardware with variable tensor cores might exploit the sparsity patterns in SSBA to gain additional speedups beyond what software alone achieves.
  • The approach opens a path for on-device fine-tuning loops where the model adapts its depth to the current task without cloud round-trips.

Load-bearing premise

Quality improvements from sparse attention and the distilled VAE stay consistent no matter which compression ratio or block depth is chosen at runtime, without any extra retraining or tuning for each setting.

What would settle it

Run the flex-lite configuration at multiple different compression ratios and depths on a mobile device and measure whether HPS stays above 32.87 and visual quality remains competitive with Flux; a drop below that threshold at any valid runtime setting would falsify the claim.
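
The check described above can be sketched as a small harness. The function names and the `(compression_ratio, depth)` key format are hypothetical; the only number taken from the source is the 32.87 threshold, and real HPS measurements would have to be supplied by the evaluator.

```python
HPS_THRESHOLD = 32.87  # flex-lite HPS reported in the abstract


def failing_configs(hps_by_config, threshold=HPS_THRESHOLD):
    """Return the runtime settings whose measured HPS falls below the threshold.

    hps_by_config maps a hypothetical (compression_ratio, depth) tuple
    to an HPS score measured on device.
    """
    return sorted(cfg for cfg, hps in hps_by_config.items() if hps < threshold)


def claim_survives(hps_by_config, threshold=HPS_THRESHOLD):
    """True when no measured setting drops below the threshold."""
    return not failing_configs(hps_by_config, threshold)
```

A single entry returned by `failing_configs` is enough to count against the claim as stated, so the sweep should cover every valid runtime setting rather than a favourable subset.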

Figures

Figures reproduced from arXiv: 2605.15684 by Binglei Bao, Chuntao Liu, Haizhen Xie, Hao Wu, Heyuan Gao, Huaao Tang, Jie Hu, Kunpeng Du, Lei Yu, Sen Lu, Xinghao Chen, Yang Zhao, Zhicai Huang, Zhijun Tu.

Figure 1: Generation results from our ElasticDiT-flex-lite models.
Figure 2: Comparison with other models on GenEval and HPSV2.1. Our models achieve the best …
Figure 3: Spatio-depth elastic architecture overview.
Figure 4: Unified Weight Co-Optimization. To achieve full weight-sharing between the DiT-flex-max and DiT-flex-lite while ensuring high-quality generation capability, we design a rigorous Unified Weight Co-Optimization (UWCO) strategy …
Figure 5: The sparse-block attention module that enables linear-complexity long-range interaction.
Figure 6: T-DVAE pipeline. Left: encoder compatible with ElasticDiT, enhanced by multi-level Haar wavelets. Right: two-stage distillation (latent alignment and reconstruction training). In order to reduce the training cost and inference latency of text-to-image generation, we design an efficient lightweight VAE, which reduces the computational load by using multi-level wavelet transform like WF-VAE Chen et al. (2024…
Figure 7: T2I qualitative comparison.
5 Conclusion. This paper presents ElasticDiT, a hardware-aware framework that significantly improves the quality-efficiency trade-off for mobile image generation. The proposed Spatio-Depth Elastic Architecture provides the flexibility to reconfigure model depth and resolution on-the-fly. This ensures optimal performance across varying hardware limitations with a unified set of pa…
Original abstract

The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models like Stable Diffusion-3 and FLUX.1. However, deploying these models on resource-constrained mobile devices entails prohibitive computational and memory overhead. While efficiency-driven approaches like Linear-DiT and static pruning alleviate bottlenecks, they often incur quality degradation. Unlike cloud environments, mobile constraints require a single-model paradigm that dynamically balances fidelity and latency. We introduce ElasticDiT, which achieves this dynamic trade-off by adjusting spatial compression ratios and DiT block depths. By integrating Shift Sparse Block Attention (SSBA) and a Tiny DWT-Distilled VAE (T-DVAE), ElasticDiT reduces inference latency and memory footprint while maintaining image quality. Experiments confirm that ElasticDiT effectively covers a wide range of fidelity-latency trade-offs within a single set of parameters. By jointly adjusting compression and depth, a single ElasticDiT model can be reconfigured on-the-fly to outperform task-specific baselines. Specifically, our flex lite variant achieves an HPS of 32.87, surpassing the Flux model, while maintaining competitive quality at 84.16 percent average sparsity through SSBA. Furthermore, the plug-and-play T-DVAE provides SD3-level reconstruction with only 1/8x the computational cost of standard VAEs, and Flow-GRPO boosts semantic alignment (GenEval: 66.93 to 73.62). These results demonstrate that ElasticDiT offers a versatile, hardware-adaptive solution that eliminates the need for multiple specialized models, providing a promising path for future high-resolution image generation on mobile devices.
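
As a quick reading aid for the GenEval numbers in the abstract, the snippet below works out the absolute and relative gain attributed to Flow-GRPO. The helper name `gain` is ours; the two scores are the ones quoted above.

```python
def gain(before: float, after: float):
    """Return (absolute gain, relative gain) between two scores."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


abs_gain, rel_gain = gain(66.93, 73.62)
print(f"{abs_gain:.2f} points, {rel_gain:.1%} relative")
# prints: 6.69 points, 10.0% relative
```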

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper proposes ElasticDiT, a Diffusion Transformer architecture for high-resolution image generation on mobile devices. It enables dynamic trade-offs by jointly adjusting spatial compression ratios and DiT block depths within a single trained model. The approach integrates Shift Sparse Block Attention (SSBA) to achieve high sparsity and a Tiny DWT-Distilled VAE (T-DVAE) for efficient encoding, with additional use of Flow-GRPO. Reported results include a flex lite variant achieving HPS of 32.87 (surpassing Flux) at 84.16% average sparsity, SD3-level reconstruction at 1/8x VAE cost, and GenEval improvement from 66.93 to 73.62.

Significance. If the central claims are substantiated, the work would offer a practical advance for deploying high-fidelity generative models under mobile constraints. A single-parameter-set model supporting on-the-fly reconfiguration across fidelity-latency points could reduce the engineering overhead of maintaining multiple specialized models. The quantitative gains in human preference and semantic alignment metrics, combined with the sparsity and efficiency techniques, indicate potential impact in resource-constrained deployment scenarios.

major comments (1)
  1. Abstract: The load-bearing claim that 'a single ElasticDiT model can be reconfigured on-the-fly to outperform task-specific baselines' while preserving quality (e.g., HPS 32.87 at 84.16% sparsity) across arbitrary choices of spatial compression ratio and DiT block depth is not supported by any description of the training objective, regularization, or schedule that would enforce invariance to these runtime choices. If the elastic paths were optimized only for a subset of configurations, the reported superiority would not generalize.
minor comments (3)
  1. Abstract: The quantitative results (HPS 32.87, GenEval 73.62) are presented without reference to specific tables, figures, or sections containing the full experimental setup, controls, or number of runs.
  2. Abstract: The 'flex lite variant' is mentioned without clarifying its exact relation to the elastic parameters (compression ratio and block depth) or how it differs from other configurations.
  3. The manuscript would benefit from explicit discussion of whether SSBA and T-DVAE require any per-configuration fine-tuning or if they are trained once to support all elastic settings.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential practical impact of ElasticDiT. We address the single major comment below and will incorporate clarifications to strengthen the manuscript.

Point-by-point responses
  1. Referee: Abstract: The load-bearing claim that 'a single ElasticDiT model can be reconfigured on-the-fly to outperform task-specific baselines' while preserving quality (e.g., HPS 32.87 at 84.16% sparsity) across arbitrary choices of spatial compression ratio and DiT block depth is not supported by any description of the training objective, regularization, or schedule that would enforce invariance to these runtime choices. If the elastic paths were optimized only for a subset of configurations, the reported superiority would not generalize.

    Authors: We appreciate this observation and agree that the abstract claim requires explicit grounding in the training procedure. The full manuscript (Section 3.2) describes a multi-configuration training strategy in which spatial compression ratios and DiT block depths are randomly sampled per batch during optimization; the diffusion loss is computed on the sampled path, and a path-consistency regularization term is added to penalize output variance across different elastic settings. The training schedule progressively widens the sampled configuration space over epochs. This design is intended to promote invariance rather than specialization to a narrow subset. We will revise the abstract to briefly reference this training approach and expand the methods section with additional equations and pseudocode for the objective and sampling schedule to make the support for the claim fully transparent.

    revision: yes
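
The training recipe described in this simulated rebuttal can be sketched as follows. Everything here is an assumption made for illustration: the linear widening rule, the variance-based consistency term, and all function names are ours, and the page gives no equations to check them against.

```python
import random


def allowed_configs(all_configs, epoch, total_epochs):
    """Progressively widen the sampled space (assumed linear schedule).

    all_configs is assumed to be ordered from the first configuration
    trained to the last one unlocked.
    """
    frac = min(1.0, (epoch + 1) / total_epochs)
    k = max(1, round(frac * len(all_configs)))
    return all_configs[:k]


def sample_config(all_configs, epoch, total_epochs, rng=random):
    """Draw one elastic setting for the current batch."""
    return rng.choice(allowed_configs(all_configs, epoch, total_epochs))


def path_consistency(outputs):
    """Mean per-element variance across outputs from different elastic paths."""
    n, dim = len(outputs), len(outputs[0])
    total = 0.0
    for d in range(dim):
        vals = [out[d] for out in outputs]
        mean = sum(vals) / n
        total += sum((v - mean) ** 2 for v in vals) / n
    return total / dim


def total_loss(diffusion_loss, outputs, weight):
    """Diffusion loss on the sampled path plus the weighted consistency penalty."""
    return diffusion_loss + weight * path_consistency(outputs)
```

The consistency term is zero when every elastic path produces the same output, which is the invariance the referee asks about; how strongly it should be weighted is left open here.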

Circularity Check

0 steps flagged

No circularity: empirical engineering results with no self-referential derivation

full rationale

The paper describes an empirical architecture (ElasticDiT) that supports runtime reconfiguration of compression ratios and block depths, with quality metrics (HPS 32.87, 84.16% sparsity) presented as measured experimental outcomes rather than predictions derived from fitted parameters or self-citations. No equations, uniqueness theorems, or ansatzes are invoked that reduce by construction to the inputs; the central claim rests on reported performance across configurations, which is externally falsifiable via replication on the stated benchmarks. This is a standard non-circular engineering contribution.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claims rest on the effectiveness of newly introduced components whose performance is shown only through the reported experiments; no external benchmarks or formal proofs are mentioned.

free parameters (2)
  • spatial compression ratio
    Chosen at runtime to trade fidelity for latency; specific values are not listed but are central to the elastic mechanism.
  • DiT block depth
    Adjusted dynamically; the paper treats different depths as configurable without retraining.
axioms (1)
  • domain assumption: Diffusion process and transformer attention mechanisms behave predictably under the proposed sparsity and compression changes.
    Invoked implicitly when claiming that quality is maintained across configurations.
invented entities (2)
  • Shift Sparse Block Attention (SSBA) · no independent evidence
    purpose: Reduce attention computation while preserving quality at high sparsity levels.
    New attention variant introduced to achieve 84.16% average sparsity.
  • Tiny DWT-Distilled VAE (T-DVAE) · no independent evidence
    purpose: Provide SD3-level reconstruction at 1/8 the compute of standard VAEs.
    New distilled VAE component presented as plug-and-play.
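
The Figure 6 caption says the T-DVAE encoder is enhanced by multi-level Haar wavelets. As background, here is a minimal 1D sketch of a multi-level orthonormal Haar transform. It illustrates the generic transform only; the function names are ours, and how the paper wires wavelets into the encoder is not specified on this page.

```python
import math


def haar_step(signal):
    """One Haar level: pairwise scaled sums (approximation) and differences (detail)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_multilevel(signal, levels):
    """Apply haar_step repeatedly to the approximation band.

    Returns the final approximation and the detail bands, finest first.
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) % 2:
            raise ValueError("band length must be even at every level")
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details
```

Each level halves the length of the approximation band, which is the property that makes wavelet front ends attractive for cheap spatial downsampling.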

pith-pipeline@v0.9.0 · 5888 in / 1541 out tokens · 33566 ms · 2026-05-20T19:56:06.702466+00:00 · methodology
