ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices
Pith reviewed 2026-05-20 19:56 UTC · model grok-4.3
The pith
A single ElasticDiT model reconfigures its compression ratio and block depth on the fly to outperform specialized baselines across fidelity and latency on mobile devices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ElasticDiT achieves dynamic fidelity-latency trade-offs within a single set of parameters by jointly varying spatial compression ratios and DiT block depths, while Shift Sparse Block Attention (SSBA) maintains competitive image quality at 84.16 percent average sparsity and the Tiny DWT-Distilled VAE (T-DVAE) delivers SD3-level reconstruction at one-eighth the cost of standard VAEs.
What carries the argument
Elastic architecture that jointly adjusts spatial compression ratios and DiT block depths at inference time, supported by Shift Sparse Block Attention for sparsity and a Tiny DWT-Distilled VAE for efficient encoding.
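To make the two elastic knobs concrete, the sketch below estimates how the latent token count and a rough self-attention cost proxy change with spatial compression ratio and block depth. It is an illustration only: the function names, the default patch size of 1, and the cost proxy (depth times tokens squared) are assumptions, not details taken from the paper.

```python
def latent_tokens(height, width, compression_ratio, patch_size=1):
    """Number of latent tokens for an image, assuming the VAE downsamples
    each spatial side by `compression_ratio` and the DiT patchifies the
    latent with `patch_size` (both assumptions for illustration)."""
    side = compression_ratio * patch_size
    if height % side or width % side:
        raise ValueError("image size must be divisible by ratio * patch size")
    return (height // side) * (width // side)


def attention_cost_proxy(height, width, compression_ratio, depth):
    """Rough dense self-attention cost: one N x N interaction per block.
    Ignores MLPs, sparsity, and hardware effects."""
    n = latent_tokens(height, width, compression_ratio)
    return depth * n * n
```

Under these assumptions, a 1024x1024 image has 4096 tokens at ratio 16 and 1024 tokens at ratio 32, so moving from (ratio 16, 24 blocks) to (ratio 32, 12 blocks) cuts the proxy by 32x: 16x from the token count and 2x from the depth. The specific ratios and depths here are hypothetical examples, not the paper's configurations.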
If this is right
- One trained ElasticDiT checkpoint can be reconfigured on the fly to serve many different mobile hardware budgets.
- The flex-lite variant surpasses the Flux model on HPS while operating at 84.16 percent average sparsity through SSBA.
- T-DVAE supplies SD3-level reconstruction quality using only one-eighth the compute of a standard VAE.
- Flow-GRPO raises GenEval alignment from 66.93 to 73.62 without changing the core architecture.
- Deployment no longer requires maintaining separate task-specific models for each latency target.
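The headline numbers in the list above can be unpacked with simple arithmetic. The sketch below assumes that "sparsity" means the fraction of attention blocks that are skipped and that attention cost is proportional to the blocks actually computed; both are interpretive assumptions, and the function names are illustrative.

```python
SPARSITY = 0.8416        # average SSBA sparsity reported in the abstract
GENEVAL_BEFORE = 66.93   # GenEval before Flow-GRPO
GENEVAL_AFTER = 73.62    # GenEval after Flow-GRPO


def dense_fraction(sparsity):
    """Fraction of attention blocks still computed."""
    return 1.0 - sparsity


def ideal_attention_speedup(sparsity):
    """Upper bound on attention-only speedup, ignoring all overhead."""
    return 1.0 / dense_fraction(sparsity)


def absolute_gain(before, after):
    return after - before


def relative_gain(before, after):
    return (after - before) / before
```

Under these assumptions, about 15.84 percent of attention blocks remain, which bounds the attention-only speedup at roughly 6.3x; end-to-end gains on a real device would be smaller. The GenEval change is 6.69 points, or about 10 percent relative to the starting score.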
Where Pith is reading between the lines
- The same elastic reconfiguration idea could be applied to other diffusion backbones to reduce the number of models needed across edge devices.
- Real-time mobile apps could switch between high-quality and low-latency modes based on battery level or user preference without reloading weights.
- Future hardware with variable tensor cores might exploit the sparsity patterns in SSBA to gain additional speedups beyond what software alone achieves.
- The approach opens a path for on-device fine-tuning loops where the model adapts its depth to the current task without cloud round-trips.
Load-bearing premise
Quality improvements from sparse attention and the distilled VAE stay consistent no matter which compression ratio or block depth is chosen at runtime, without any extra retraining or tuning for each setting.
What would settle it
Run the flex-lite configuration at multiple different compression ratios and depths on a mobile device and measure whether HPS stays above 32.87 and visual quality remains competitive with Flux; a drop below that threshold at any valid runtime setting would falsify the claim.
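The test described above amounts to a threshold check over a sweep of runtime settings. A minimal sketch follows, assuming measurements are collected as a mapping from a hypothetical (compression ratio, depth) pair to a measured HPS value; the function names and the data layout are illustrative and not part of the paper.

```python
HPS_THRESHOLD = 32.87  # flex-lite HPS reported in the abstract


def failing_configs(measured_hps, threshold=HPS_THRESHOLD):
    """Return, in sorted order, the (ratio, depth) settings whose measured
    HPS falls below the threshold."""
    return sorted(cfg for cfg, hps in measured_hps.items() if hps < threshold)


def claim_survives(measured_hps, threshold=HPS_THRESHOLD):
    """True only if every measured setting meets the threshold."""
    if not measured_hps:
        raise ValueError("no measurements supplied")
    return not failing_configs(measured_hps, threshold)
```

A single entry in `failing_configs` is enough to count against the claim as stated here, which is why the check reports every failing setting rather than an average over the sweep.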
Original abstract
The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models like Stable Diffusion-3 and FLUX.1. However, deploying these models on resource-constrained mobile devices entails prohibitive computational and memory overhead. While efficiency-driven approaches like Linear-DiT and static pruning alleviate bottlenecks, they often incur quality degradation. Unlike cloud environments, mobile constraints require a single-model paradigm that dynamically balances fidelity and latency. We introduce ElasticDiT, which achieves this dynamic trade-off by adjusting spatial compression ratios and DiT block depths. By integrating Shift Sparse Block Attention (SSBA) and a Tiny DWT-Distilled VAE (T-DVAE), ElasticDiT reduces inference latency and memory footprint while maintaining image quality. Experiments confirm that ElasticDiT effectively covers a wide range of fidelity-latency trade-offs within a single set of parameters. By jointly adjusting compression and depth, a single ElasticDiT model can be reconfigured on-the-fly to outperform task-specific baselines. Specifically, our flex lite variant achieves an HPS of 32.87, surpassing the Flux model, while maintaining competitive quality at 84.16 percent average sparsity through SSBA. Furthermore, the plug-and-play T-DVAE provides SD3-level reconstruction with only 1/8x the computational cost of standard VAEs, and Flow-GRPO boosts semantic alignment (GenEval: 66.93 to 73.62). These results demonstrate that ElasticDiT offers a versatile, hardware-adaptive solution that eliminates the need for multiple specialized models, providing a promising path for future high-resolution image generation on mobile devices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ElasticDiT, a Diffusion Transformer architecture for high-resolution image generation on mobile devices. It enables dynamic fidelity-latency trade-offs by jointly adjusting spatial compression ratios and DiT block depths within a single trained model. The approach integrates Shift Sparse Block Attention (SSBA) to achieve high sparsity and a Tiny DWT-Distilled VAE (T-DVAE) for efficient encoding, and additionally applies Flow-GRPO to improve semantic alignment. Reported results include a flex-lite variant achieving an HPS of 32.87 (surpassing Flux) at 84.16% average sparsity, SD3-level reconstruction at one-eighth the computational cost of standard VAEs, and a GenEval improvement from 66.93 to 73.62.
Significance. If the central claims are substantiated, the work would offer a practical advance for deploying high-fidelity generative models under mobile constraints. A single-parameter-set model supporting on-the-fly reconfiguration across fidelity-latency points could reduce the engineering overhead of maintaining multiple specialized models. The quantitative gains in human preference and semantic alignment metrics, combined with the sparsity and efficiency techniques, indicate potential impact in resource-constrained deployment scenarios.
major comments (1)
- Abstract: The load-bearing claim that 'a single ElasticDiT model can be reconfigured on-the-fly to outperform task-specific baselines' while preserving quality (e.g., HPS 32.87 at 84.16% sparsity) across arbitrary choices of spatial compression ratio and DiT block depth is not supported by any description of the training objective, regularization, or schedule that would enforce invariance to these runtime choices. If the elastic paths were optimized only for a subset of configurations, the reported superiority would not generalize.
minor comments (3)
- Abstract: The quantitative results (HPS 32.87, GenEval 73.62) are presented without reference to specific tables, figures, or sections containing the full experimental setup, controls, or number of runs.
- Abstract: The 'flex lite variant' is mentioned without clarifying its exact relation to the elastic parameters (compression ratio and block depth) or how it differs from other configurations.
- The manuscript would benefit from explicit discussion of whether SSBA and T-DVAE require any per-configuration fine-tuning or if they are trained once to support all elastic settings.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential practical impact of ElasticDiT. We address the single major comment below and will incorporate clarifications to strengthen the manuscript.
- Referee: Abstract: The load-bearing claim that 'a single ElasticDiT model can be reconfigured on-the-fly to outperform task-specific baselines' while preserving quality (e.g., HPS 32.87 at 84.16% sparsity) across arbitrary choices of spatial compression ratio and DiT block depth is not supported by any description of the training objective, regularization, or schedule that would enforce invariance to these runtime choices. If the elastic paths were optimized only for a subset of configurations, the reported superiority would not generalize.
  Authors: We appreciate this observation and agree that the abstract claim requires explicit grounding in the training procedure. The full manuscript (Section 3.2) describes a multi-configuration training strategy in which spatial compression ratios and DiT block depths are randomly sampled per batch during optimization; the diffusion loss is computed on the sampled path, and a path-consistency regularization term is added to penalize output variance across different elastic settings. The training schedule progressively widens the sampled configuration space over epochs. This design is intended to promote invariance rather than specialization to a narrow subset. We will revise the abstract to briefly reference this training approach and expand the methods section with additional equations and pseudocode for the objective and sampling schedule to make the support for the claim fully transparent.
  revision: yes
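The training strategy described in the rebuttal (per-batch sampling of elastic settings, a progressively widening configuration space, and a penalty on output variance across paths) could look roughly like the sketch below. Everything concrete in it is an assumption made for illustration: the candidate ratios and depths, the linear widening rule, and the mean-squared-deviation form of the penalty are not taken from the manuscript.

```python
import random

# Hypothetical candidate settings, ordered from "always available" to
# "unlocked last"; the rebuttal does not list the actual values.
RATIOS = [32, 16, 8]
DEPTHS = [28, 20, 12]


def active_choices(choices, epoch, total_epochs):
    """Progressive widening: expose a growing prefix of `choices`,
    reaching the full list by the final epoch (integer ceiling)."""
    k = ((epoch + 1) * len(choices) + total_epochs - 1) // total_epochs
    return choices[: max(1, min(k, len(choices)))]


def sample_config(epoch, total_epochs, rng):
    """Draw one (compression ratio, depth) pair for the current batch."""
    ratio = rng.choice(active_choices(RATIOS, epoch, total_epochs))
    depth = rng.choice(active_choices(DEPTHS, epoch, total_epochs))
    return ratio, depth


def path_consistency_penalty(outputs):
    """Mean squared deviation of each path's output from the across-path
    mean; `outputs` is a list of equal-length lists of floats."""
    n_paths, n_elems = len(outputs), len(outputs[0])
    total = 0.0
    for i in range(n_elems):
        mean_i = sum(out[i] for out in outputs) / n_paths
        total += sum((out[i] - mean_i) ** 2 for out in outputs)
    return total / (n_paths * n_elems)
```

In a real training loop the penalty would be computed on model outputs (tensors) for the same input under several sampled paths and added to the diffusion loss with some weight; plain lists are used here only to keep the sketch dependency-free.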
Circularity Check
No circularity: empirical engineering results with no self-referential derivation
The paper describes an empirical architecture (ElasticDiT) that supports runtime reconfiguration of compression ratios and block depths, with quality metrics (HPS 32.87, 84.16% sparsity) presented as measured experimental outcomes rather than predictions derived from fitted parameters or self-citations. No equations, uniqueness theorems, or ansatzes are invoked that reduce by construction to the inputs; the central claim rests on reported performance across configurations, which is externally falsifiable via replication on the stated benchmarks. This is a standard non-circular engineering contribution.
Axiom & Free-Parameter Ledger
free parameters (2)
- spatial compression ratio
- DiT block depth
axioms (1)
- domain assumption: The diffusion process and transformer attention mechanisms behave predictably under the proposed sparsity and compression changes.
invented entities (2)
- Shift Sparse Block Attention (SSBA): no independent evidence
- Tiny DWT-Distilled VAE (T-DVAE): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: By jointly adjusting compression and depth, a single ElasticDiT model can be reconfigured on-the-fly... Shift Sparse Block Attention (SSBA)... Tiny DWT-Distilled VAE (T-DVAE)
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: Spatio-Depth Elastic Architecture... Sparse-Depth Pruning... Unified Weight Co-Optimization
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.