pith · machine review for the scientific record

arxiv: 2410.10629 · v3 · submitted 2024-10-14 · 💻 cs.CV

Recognition: 1 theorem link

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

Authors on Pith no claims yet

Pith reviewed 2026-05-15 00:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image synthesis · diffusion transformers · linear attention · high-resolution image generation · deep compression autoencoder · efficient sampling · flow matching

The pith

Sana-0.6B generates high-resolution images competitively with 12B-parameter models while running over 100 times faster on consumer GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sana, a text-to-image framework that scales to 4096 by 4096 resolution by combining a 32-times compression autoencoder with a diffusion transformer that uses linear attention in place of standard attention. This design shrinks the number of latent tokens enough for a 0.6 billion parameter model to match the output quality and text alignment of much larger systems such as Flux-12B. Additional changes replace the text encoder with a small decoder-only language model and introduce Flow-DPM-Solver sampling to cut the number of steps required. The result is a system that produces 1024 by 1024 images in under one second on a 16 GB laptop GPU. A reader would care because the approach lowers the hardware barrier for high-quality image synthesis from specialized servers to everyday devices.
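
To make the token-count argument concrete, here is a minimal sketch of the arithmetic, assuming square images, a square latent grid, and the same DiT patch size for both compression factors (the patch-size convention is an assumption, not something taken from the paper):

```python
def latent_token_count(resolution: int, compression: int, patch_size: int = 1) -> int:
    """Number of latent tokens a diffusion transformer sees for a square image.

    The autoencoder downsamples each spatial side by `compression`, and the
    transformer groups latent pixels into patch_size x patch_size patches.
    """
    latent_side = resolution // compression
    return (latent_side // patch_size) ** 2

# 1024x1024 image, patch size 1 assumed for both settings
tokens_8x = latent_token_count(1024, 8)    # 16384 tokens
tokens_32x = latent_token_count(1024, 32)  # 1024 tokens

# Self-attention cost grows with the square of the token count, so a 16x
# token reduction is roughly a 256x reduction in attention cost here.
print(tokens_8x, tokens_32x, (tokens_8x / tokens_32x) ** 2)
```

The exact counts shift with the patch-size convention used for each compression factor, but the quadratic payoff from fewer tokens is what the rest of the design leans on.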

Core claim

Sana shows that a diffusion transformer using linear attention throughout, paired with a deep-compression autoencoder that reduces images by 32 times rather than the conventional 8 times, a decoder-only text encoder, and Flow-DPM-Solver sampling, yields a 0.6B model whose image quality and prompt adherence compete with 12B-scale diffusion models, while delivering measured throughput more than 100 times higher and fitting on consumer laptop hardware.
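
The throughput part of this claim only means something once hardware, batch size, resolution, precision, and step count are fixed. A minimal sketch of how such a measurement could be made, assuming a hypothetical `pipe(prompt, height, width, num_inference_steps)` callable for each model:

```python
import time
import torch

def images_per_second(pipe, prompt: str, resolution: int = 1024,
                      steps: int = 20, n_runs: int = 10, warmup: int = 2) -> float:
    """Measured throughput of one text-to-image pipeline at a fixed setting.

    `pipe` is a hypothetical callable; any real comparison should report the
    GPU model, batch size, precision, and sampler alongside this number.
    """
    for _ in range(warmup):  # warm up kernels and caches before timing
        pipe(prompt, height=resolution, width=resolution, num_inference_steps=steps)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe(prompt, height=resolution, width=resolution, num_inference_steps=steps)
    torch.cuda.synchronize()
    return n_runs / (time.perf_counter() - start)

# speedup = images_per_second(sana_pipe, p) / images_per_second(flux_pipe, p)
```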

What carries the argument

Linear DiT, the diffusion transformer in which every attention layer is replaced by linear attention, working together with the 32-times deep-compression autoencoder to keep token counts low enough for efficient high-resolution synthesis.
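
A minimal sketch of the kind of kernel-based linear attention involved, in PyTorch; the ReLU feature map is one common choice and an assumption here, not a statement about the paper's exact variant. The point is that forming the key-value summary first costs O(N·d²) instead of the O(N²·d) of softmax attention:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """Kernel-based linear attention on (batch, heads, tokens, dim) tensors.

    out_i = phi(q_i) @ (sum_j phi(k_j) v_j^T) / (phi(q_i) @ sum_j phi(k_j)),
    with phi = ReLU assumed for illustration.
    """
    q, k = F.relu(q), F.relu(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)               # d x d_v summary of K and V
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

# 1024 tokens (a 32x-compressed 1024x1024 image) never materialise a 1024x1024 attention map
out = linear_attention(*(torch.randn(1, 8, 1024, 64) for _ in range(3)))
```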

If this is right

  • High-resolution text-to-image generation becomes practical on laptops and other devices with 16 GB of GPU memory.
  • Training and inference costs for competitive image models drop by roughly two orders of magnitude in compute and time.
  • Fewer sampling steps become viable for production use while preserving quality through the Flow-DPM-Solver schedule.
  • Content pipelines can shift from cloud-only generation to on-device or low-latency local generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same linear-attention and deep-compression pattern could be tested on video or 3D synthesis tasks where token counts grow even faster.
  • Modest further increases in model size while retaining the linear architecture might close any remaining quality gap without reintroducing the original speed penalty.
  • The decoder-only text encoder with in-context examples hints that prompt design alone could improve alignment for complex instructions without changing the image model.
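
As an illustration of the last point, a hedged sketch of how a fixed instruction plus one in-context example could be wrapped around the user prompt before it reaches a decoder-only text encoder; the instruction text, the example, and the `encode_text` call are hypothetical, not the paper's actual template:

```python
INSTRUCTION = (
    "Given a user prompt, produce a detailed visual description "
    "suitable for image generation."
)
EXAMPLE = ("a cat",
           "a fluffy tabby cat on a sunlit windowsill, detailed fur, soft warm light")

def build_prompt(user_prompt: str) -> str:
    """Prepend a fixed instruction and one in-context example to the raw prompt."""
    return (
        f"{INSTRUCTION}\n"
        f"Prompt: {EXAMPLE[0]}\nDescription: {EXAMPLE[1]}\n"
        f"Prompt: {user_prompt}\nDescription:"
    )

# text_embeddings = encode_text(build_prompt("a castle at dusk"))  # hypothetical encoder call
```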

Load-bearing premise

The 32-times autoencoder compression keeps enough perceptual detail and text-relevant information that linear attention can still produce clean, prompt-aligned images without new artifacts at high resolutions.
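
One concrete way to probe this premise is a round-trip reconstruction check on held-out high-resolution images. A minimal sketch, assuming a hypothetical autoencoder object with `.encode()` and `.decode()` methods; PSNR is only one of the metrics (LPIPS, SSIM, CLIP text alignment) such a check would report:

```python
import torch

@torch.no_grad()
def reconstruction_psnr(ae, images: torch.Tensor) -> float:
    """Mean PSNR of encode-decode round trips; images in [0, 1], shape (B, 3, H, W).

    A 32x autoencoder whose round-trip PSNR stays close to an 8x baseline's
    would support the premise; a large drop would undercut it.
    """
    recon = ae.decode(ae.encode(images)).clamp(0, 1)
    mse = ((recon - images) ** 2).mean(dim=(1, 2, 3))   # per-image mean squared error
    psnr = -10.0 * torch.log10(mse + 1e-12)             # peak signal value is 1.0
    return psnr.mean().item()
```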

What would settle it

Side-by-side human preference tests and FID or CLIP-score measurements at 1024 by 1024 and 2048 by 2048 resolutions comparing Sana outputs directly against Flux-12B outputs for the same prompts, checking whether systematic blur, texture loss, or alignment errors appear only in the Sana images.
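
A hedged sketch of how such a head-to-head could be scored with off-the-shelf metrics; torchmetrics' FID and CLIPScore are assumed here as one possible implementation, with uint8 image tensors generated by each model for the same prompt list:

```python
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

def score_models(ref_images, generations: dict, prompts: list[str]) -> dict:
    """FID against a shared reference set plus CLIP score for each model.

    `generations` maps a model name (e.g. "sana", "flux") to a uint8 tensor of
    shape (N, 3, H, W); `ref_images` is a real reference set in the same format.
    """
    results = {}
    for name, imgs in generations.items():
        fid = FrechetInceptionDistance(feature=2048)
        fid.update(ref_images, real=True)
        fid.update(imgs, real=False)
        clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
        results[name] = {"fid": fid.compute().item(),
                         "clip_score": clip(imgs, prompts).item()}
    return results
```

Human preference data would still be needed for the blur and texture questions that FID and CLIP score tend to miss.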

read the original abstract

We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096$\times$4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8$\times$, we trained an AE that can compress images 32$\times$, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024$\times$1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Sana, a text-to-image framework for efficient synthesis up to 4096×4096 resolution. It relies on a 32× deep-compression autoencoder to reduce latent tokens from ~16k to ~256 at 1024², a linear-attention DiT backbone, a decoder-only LLM text encoder with in-context learning, and Flow-DPM-Solver sampling. The central claim is that the 0.6B-parameter Sana model matches the quality of much larger models such as Flux-12B while being ~20× smaller and >100× faster in measured throughput, and runnable on a 16 GB laptop GPU in <1 s for 1024×1024 images.

Significance. If the quality claims hold, the result would be significant for high-resolution image synthesis: it would demonstrate that aggressive latent compression plus linear attention can close the gap to giant diffusion models at far lower parameter count and latency, directly enabling consumer-grade deployment and lowering the barrier to high-res content creation.

major comments (2)
  1. [§3.1] §3.1 and Figure 3: the 32× deep-compression autoencoder is asserted to preserve perceptual quality and text alignment at target resolutions, yet only qualitative examples are shown; no quantitative reconstruction metrics (LPIPS, PSNR, SSIM, or CLIP-text alignment) are reported at 1024² or 4096² against an 8× baseline. Because the entire efficiency argument rests on the AE not discarding details that linear attention cannot recover, this absence leaves the core quality-through-compression claim unverified.
  2. [Experiments] Experimental section: the headline claim that Sana-0.6B is competitive with Flux-12B supplies no supporting tables of FID, CLIP-score, or human-preference metrics on standard benchmarks, nor ablations isolating the contribution of the 32× AE versus linear attention. Without these numbers the speed/size advantage cannot be assessed against an actual quality floor.
minor comments (2)
  1. [Abstract] Abstract: the statement '100+ times faster in measured throughput' should specify the exact hardware, batch size, and resolution at which the comparison was performed.
  2. [§2.2] Notation: the paper uses 'linear attention' without a brief equation or reference to the specific formulation (e.g., Performer-style or ReLU-based) in the main text; a short definition would aid readers.
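
For reference on that last point, the generic kernel-based form that both Performer-style and ReLU-based variants instantiate can be written as follows; whether the paper uses exactly this formulation, and with which feature map, is an assumption rather than something stated here:

```latex
\mathrm{LinAttn}(Q, K, V)_i \;=\;
  \frac{\phi(q_i)^{\top} \left( \sum_{j=1}^{N} \phi(k_j)\, v_j^{\top} \right)}
       {\phi(q_i)^{\top} \sum_{j=1}^{N} \phi(k_j)}
```

Because the two sums over j are shared by every query, the cost is linear in the token count N rather than quadratic, which is what makes the replacement attractive at high latent resolutions.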

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that quantitative evaluations are necessary to support the core claims. We have revised the manuscript to include the requested metrics, tables, and ablations. Below we respond point by point.

read point-by-point responses
  1. Referee: [§3.1] §3.1 and Figure 3: the 32× deep-compression autoencoder is asserted to preserve perceptual quality and text alignment at target resolutions, yet only qualitative examples are shown; no quantitative reconstruction metrics (LPIPS, PSNR, SSIM, or CLIP-text alignment) are reported at 1024² or 4096² against an 8× baseline. Because the entire efficiency argument rests on the AE not discarding details that linear attention cannot recover, this absence leaves the core quality-through-compression claim unverified.

    Authors: We agree that quantitative reconstruction metrics are required to verify the 32× AE. In the revised manuscript we add a table in §3.1 reporting LPIPS, PSNR, SSIM and CLIP-text alignment for the 32× AE versus an 8× baseline on 1024×1024 images. For 4096×4096 we report the same metrics on a representative subset (due to memory limits for full-resolution reconstruction) and include a note on the practical constraints. These numbers confirm that perceptual quality and text alignment are largely preserved, directly supporting the efficiency argument. revision: yes

  2. Referee: [Experiments] Experimental section: the headline claim that Sana-0.6B is competitive with Flux-12B supplies no supporting tables of FID, CLIP-score, or human-preference metrics on standard benchmarks, nor ablations isolating the contribution of the 32× AE versus linear attention. Without these numbers the speed/size advantage cannot be assessed against an actual quality floor.

    Authors: We acknowledge the absence of quantitative benchmark tables and ablations in the original submission. The revised Experiments section now contains FID and CLIP-score tables on MS-COCO and other standard benchmarks, plus results from a human preference study comparing Sana-0.6B against Flux-12B and other models. We also add ablations that separately measure the impact of the 32× AE versus linear attention on both quality and throughput. These additions allow readers to evaluate the quality-speed trade-off directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity; efficiency claims rest on empirical measurements

full rationale

The paper presents Sana as an empirical framework whose speed and quality advantages are demonstrated via direct training runs, inference throughput measurements, and deployment tests on hardware. Core components (32x AE, linear attention replacement, decoder-only text encoder, Flow-DPM-Solver) are introduced as architectural choices whose performance is validated by experiment rather than derived from prior fitted quantities or self-citations. No equations reduce reported gains to inputs by construction, and no load-bearing uniqueness theorems or ansatzes are smuggled via self-reference. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the trained behavior of the deep-compression autoencoder and the approximation quality of linear attention at high latent resolutions; no new physical entities are postulated.

free parameters (1)
  • compression_ratio
    32x spatial compression factor chosen and trained to reduce token count while preserving image fidelity.
axioms (1)
  • domain assumption: Linear attention provides sufficient modeling capacity for high-resolution image synthesis without quality degradation relative to full attention.
    Invoked when replacing all vanilla attention layers in the DiT backbone.

pith-pipeline@v0.9.0 · 5599 in / 1205 out tokens · 39769 ms · 2026-05-15T00:52:14.386983+00:00 · methodology


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Attention Sinks in Diffusion Transformers: A Causal Analysis

    cs.CV 2026-05 unverdicted novelty 7.0

    Suppressing attention sinks in diffusion transformers does not degrade text-image alignment or most preference metrics, revealing a dissociation between generation trajectory changes and semantic output quality.

  2. Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.

  3. Training-Free Refinement of Flow Matching with Divergence-based Sampling

    cs.CV 2026-04 unverdicted novelty 7.0

    Flow Divergence Sampler refines flow matching by computing velocity field divergence to correct ambiguous intermediate states during inference, improving fidelity in text-to-image and inverse problem tasks.

  4. PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space

    cs.LG 2026-04 unverdicted novelty 7.0

    PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.

  5. Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.

  6. Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

    cs.CV 2026-05 conditional novelty 6.0

    VLMs exhibit size, center, and saliency biases in scene understanding, relying less on people than humans do, with size bias as a key driver of divergence.

  7. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  8. The two clocks and the innovation window: When and how generative models learn rules

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

  9. Attention Sinks in Diffusion Transformers: A Causal Analysis

    cs.CV 2026-05 unverdicted novelty 6.0

    Suppressing attention sinks in diffusion transformers does not degrade CLIP-T alignment at moderate levels but induces sink-specific perceptual shifts six times larger than equal-budget random masking.

  10. What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

  11. MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

    cs.CV 2026-05 unverdicted novelty 6.0

    MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

  12. ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

    cs.CV 2026-05 unverdicted novelty 6.0

    ViTok-v2 is a 5B-parameter native-resolution image autoencoder using NaFlex and DINOv3 loss that matches or exceeds prior tokenizers at 256p and outperforms them at 512p and above while advancing the Pareto frontier i...

  13. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  14. MeshLAM: Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    MeshLAM reconstructs high-fidelity animatable textured mesh head avatars from a single image via a feed-forward dual shape-texture architecture with iterative GRU decoding and reprojection-based guidance.

  15. Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

    cs.CV 2026-04 unverdicted novelty 6.0

    By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.

  16. Self-Adversarial One Step Generation via Condition Shifting

    cs.CV 2026-04 unverdicted novelty 6.0

    APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.

  17. FluidFlow: a flow-matching generative model for fluid dynamics surrogates on unstructured meshes

    cs.LG 2026-03 unverdicted novelty 6.0

    FluidFlow uses conditional flow-matching with U-Net and DiT architectures to predict pressure and friction coefficients on airfoils and 3D aircraft meshes, outperforming MLP baselines with better generalization.

  18. Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion

    cs.LG 2026-04 unverdicted novelty 5.0

    Diffusion Templates is a unified plugin framework that allows injecting various controllable capabilities into diffusion models through a standardized interface.

  19. Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples spa...

  20. Not all tokens contribute equally to diffusion learning

    cs.CV 2026-04 unverdicted novelty 5.0

    DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.

  21. LTX-2: Efficient Joint Audio-Visual Foundation Model

    cs.CV 2026-01 conditional novelty 5.0

    LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.

  22. Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

    cs.CV 2026-05 unverdicted novelty 4.0

    Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 21 Pith papers · 8 internal anchors
