ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

Binglei Bao; Chuntao Liu; Haizhen Xie; Hao Wu; Heyuan Gao; Huaao Tang; Jie Hu; Kunpeng Du; Lei Yu; Sen Lu

arxiv: 2605.15684 · v1 · pith:TEJITKKDnew · submitted 2026-05-15 · 💻 cs.CV

Kunpeng Du, Haizhen Xie, Sen Lu, Lei Yu, Binglei Bao, Huaao Tang, Chuntao Liu, Hao Wu, Yang Zhao, Zhicai Huang, Heyuan Gao, Zhijun Tu, Jie Hu, Xinghao Chen


Pith reviewed 2026-05-20 19:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords Diffusion Transformer · Elastic architecture · Sparse attention · Mobile image generation · High-resolution generation · Dynamic trade-off · SSBA · T-DVAE


The pith

A single ElasticDiT model reconfigures its compression ratio and block depth on the fly to outperform specialized baselines across fidelity and latency on mobile devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that Diffusion Transformers can be made practical for high-resolution image generation on phones and tablets by building one model that changes its internal compression and depth at runtime. It adds Shift Sparse Block Attention to reduce computation at high sparsity levels and a compact distilled VAE that keeps reconstruction quality close to larger models while using far less compute. Experiments show the same trained weights can be switched between modes to hit different speed-quality points without retraining each time. This removes the need to ship multiple fixed models for different hardware budgets. The flex-lite setting reaches an HPS score of 32.87, above the Flux baseline, at 84.16 percent average sparsity.

Core claim

ElasticDiT achieves dynamic fidelity-latency trade-offs within a single set of parameters by jointly varying spatial compression ratios and DiT block depths, while Shift Sparse Block Attention (SSBA) maintains competitive image quality at 84.16 percent average sparsity and the Tiny DWT-Distilled VAE (T-DVAE) delivers SD3-level reconstruction at one-eighth the cost of standard VAEs.

What carries the argument

Elastic architecture that jointly adjusts spatial compression ratios and DiT block depths at inference time, supported by Shift Sparse Block Attention for sparsity and a Tiny DWT-Distilled VAE for efficient encoding.
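To make the two runtime knobs concrete, the following is a rough cost model, not taken from the paper. It assumes that the compression ratio is the number of pixels per token side, that each block is dominated by dense attention (quadratic in token count) plus projection and MLP terms (linear in token count), and that the MLP uses a 4x expansion. The names `ElasticConfig`, `token_count`, and `relative_cost`, as well as every numeric setting, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElasticConfig:
    """Hypothetical runtime setting: spatial compression and active depth."""
    compression: int  # pixels per token side (assumed: VAE factor x patch size)
    depth: int        # number of DiT blocks executed (assumed)


def token_count(height: int, width: int, cfg: ElasticConfig) -> int:
    """Tokens the transformer sees for an image of the given pixel size."""
    if height % cfg.compression or width % cfg.compression:
        raise ValueError("image size must be divisible by the compression ratio")
    return (height // cfg.compression) * (width // cfg.compression)


def relative_cost(height: int, width: int, cfg: ElasticConfig, hidden: int = 1024) -> float:
    """Rough multiply-accumulate count per denoising step under the stated assumptions."""
    n = token_count(height, width, cfg)
    attention = 2 * n * n * hidden     # QK^T and AV products, dense attention
    linear = 12 * n * hidden * hidden  # QKV/output projections plus a 4x MLP
    return float(cfg.depth * (attention + linear))


if __name__ == "__main__":
    # Made-up operating points, not the paper's flex-max / flex-lite settings.
    full = ElasticConfig(compression=16, depth=28)
    lite = ElasticConfig(compression=32, depth=14)
    for name, cfg in (("full", full), ("lite", lite)):
        tokens = token_count(1024, 1024, cfg)
        print(name, tokens, f"{relative_cost(1024, 1024, cfg):.3e}")
```

Under these assumptions, doubling the compression ratio cuts the token count by 4x and the dense-attention term by 16x, while halving the depth scales the whole step linearly, which is why adjusting both together spans a wide latency range.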

If this is right

One trained ElasticDiT checkpoint can be reconfigured on the fly to serve many different mobile hardware budgets.
The flex-lite variant surpasses the Flux model on HPS while operating at 84.16 percent average sparsity through SSBA.
T-DVAE supplies SD3-level reconstruction quality using only one-eighth the compute of a standard VAE.
Flow-GRPO raises GenEval alignment from 66.93 to 73.62 without changing the core architecture.
Deployment no longer requires maintaining separate task-specific models for each latency target.
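The headline figures in the list above can be put on a common scale with a few lines of arithmetic. This sketch only restates the reported numbers; the "ideal speedup" assumes attention cost is proportional to the fraction of entries kept, which real kernels on mobile hardware may not achieve. The helper names are illustrative.

```python
def dense_fraction(sparsity_percent: float) -> float:
    """Fraction of attention entries still computed at a given sparsity."""
    return 1.0 - sparsity_percent / 100.0


def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


if __name__ == "__main__":
    kept = dense_fraction(84.16)
    print(f"attention entries kept: {kept:.4f}")
    print(f"ideal attention speedup (upper bound): {1.0 / kept:.2f}x")
    gain = relative_gain(66.93, 73.62)
    print(f"GenEval gain: {73.62 - 66.93:.2f} points ({gain:.1%} relative)")
    print(f"T-DVAE cost relative to a standard VAE: {1 / 8:.3f}")
```

At 84.16 percent sparsity roughly 16 percent of attention entries remain, so the attention term can shrink by at most about 6.3x; the GenEval change is 6.69 points, or about 10 percent relative.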

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same elastic reconfiguration idea could be applied to other diffusion backbones to reduce the number of models needed across edge devices.
Real-time mobile apps could switch between high-quality and low-latency modes based on battery level or user preference without reloading weights.
Future hardware with variable tensor cores might exploit the sparsity patterns in SSBA to gain additional speedups beyond what software alone achieves.
The approach opens a path for on-device fine-tuning loops where the model adapts its depth to the current task without cloud round-trips.
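The battery-aware switching idea above could look something like the sketch below. It is purely illustrative: the mode table, latency estimates, and the `pick_mode` policy are invented for this example and do not come from the paper. Because the modes share one set of weights, switching would only change which path is executed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    name: str
    compression: int
    depth: int
    est_latency_ms: float  # would be measured on-device; placeholder values here


# Placeholder table, ordered from highest quality to lowest latency.
MODES = (
    Mode("max", compression=16, depth=28, est_latency_ms=4000.0),
    Mode("balanced", compression=16, depth=20, est_latency_ms=2500.0),
    Mode("lite", compression=32, depth=14, est_latency_ms=800.0),
)


def pick_mode(battery_pct: float, latency_budget_ms: float,
              low_battery_pct: float = 20.0) -> Mode:
    """Return the highest-quality mode that fits the latency budget.

    On low battery the cheapest mode is forced regardless of the budget.
    """
    if battery_pct < low_battery_pct:
        return MODES[-1]
    for mode in MODES:
        if mode.est_latency_ms <= latency_budget_ms:
            return mode
    return MODES[-1]  # nothing fits: fall back to the cheapest mode


if __name__ == "__main__":
    print(pick_mode(battery_pct=80, latency_budget_ms=3000).name)
    print(pick_mode(battery_pct=10, latency_budget_ms=5000).name)
```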

Load-bearing premise

Quality improvements from sparse attention and the distilled VAE stay consistent no matter which compression ratio or block depth is chosen at runtime, without any extra retraining or tuning for each setting.

What would settle it

Run the flex-lite configuration at multiple different compression ratios and depths on a mobile device and measure whether HPS stays above 32.87 and visual quality remains competitive with Flux; a drop below that threshold at any valid runtime setting would falsify the claim.
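The test described above amounts to a sweep over runtime settings with a fixed threshold. A minimal harness might look like this; the `(compression, depth)` encoding, the `score_hps` callback, and the example scores are all hypothetical stand-ins, since a real run would generate images on a device and score them with an HPS model.

```python
from typing import Callable, Iterable, List, Tuple

Config = Tuple[int, int]  # (compression ratio, block depth), hypothetical encoding

HPS_THRESHOLD = 32.87  # flex-lite HPS reported in the abstract


def failing_configs(
    configs: Iterable[Config],
    score_hps: Callable[[Config], float],
    threshold: float = HPS_THRESHOLD,
) -> List[Tuple[Config, float]]:
    """Return every configuration whose measured HPS falls below the threshold."""
    failures = []
    for cfg in configs:
        hps = score_hps(cfg)
        if hps < threshold:
            failures.append((cfg, hps))
    return failures


if __name__ == "__main__":
    # Made-up scores purely to exercise the harness.
    fake_scores = {(16, 28): 33.4, (16, 14): 32.9, (32, 14): 32.1}
    for cfg, hps in failing_configs(sorted(fake_scores), fake_scores.__getitem__):
        print(f"config {cfg} falls below the threshold: HPS {hps}")
```

An empty result over every valid setting would be consistent with the claim; any entry in the list is a counterexample worth inspecting visually.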

Figures

Figures reproduced from arXiv: 2605.15684 by Binglei Bao, Chuntao Liu, Haizhen Xie, Hao Wu, Heyuan Gao, Huaao Tang, Jie Hu, Kunpeng Du, Lei Yu, Sen Lu, Xinghao Chen, Yang Zhao, Zhicai Huang, Zhijun Tu.

**Figure 2.** Comparison with other models on GenEval and HPSV2.1. Our models achieve the best…

**Figure 3.** Spatio-depth elastic architecture overview.

**Figure 4.** Unified Weight Co-Optimization. To achieve full weight-sharing between the DiT-flex-max and DiT-flex-lite while ensuring high-quality generation capability, we design a rigorous Unified Weight Co-Optimization (UWCO) strategy…

**Figure 5.** The sparse-block attention module that enables linear-complexity long-range interaction.
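The page does not describe how SSBA selects or shifts its blocks, so the sketch below is not the paper's mechanism. It shows a generic shifted block-local mask (in the spirit of shifted-window attention) over a 1D token sequence, only to illustrate how a fixed block size yields cost linear in sequence length and how a shift lets neighbouring blocks exchange information across layers. The function names and sizes are assumptions.

```python
from typing import List


def shifted_block_mask(n_tokens: int, block: int, shift: int = 0) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Tokens are cyclically shifted by `shift`, then grouped into contiguous
    blocks of size `block`; attention is allowed only inside a block.
    """
    if n_tokens % block:
        raise ValueError("n_tokens must be divisible by block")
    group = [((i + shift) % n_tokens) // block for i in range(n_tokens)]
    return [[group[i] == group[j] for j in range(n_tokens)] for i in range(n_tokens)]


def sparsity(mask: List[List[bool]]) -> float:
    """Fraction of query-key pairs that are skipped."""
    total = len(mask) * len(mask[0])
    kept = sum(sum(row) for row in mask)
    return 1.0 - kept / total


if __name__ == "__main__":
    plain = shifted_block_mask(64, 8, shift=0)
    shifted = shifted_block_mask(64, 8, shift=4)
    print(f"sparsity per layer: {sparsity(plain):.4f}")
    # Tokens 7 and 8 are in different blocks without the shift
    # and in the same block with it.
    print(plain[7][8], shifted[7][8])
```

Each token attends to exactly `block` keys, so the number of computed pairs grows as `n_tokens * block` rather than `n_tokens ** 2`; alternating shifted and unshifted layers is one common way to recover longer-range interaction.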

**Figure 6.** T-DVAE pipeline. Left: encoder compatible with ElasticDiT, enhanced by multi-level Haar wavelets. Right: two-stage distillation (latent alignment and reconstruction training).

Excerpt from the accompanying text: In order to reduce the training cost and inference latency of text-to-image generation, we design an efficient lightweight VAE, which reduces the computational load by using multi-level wavelet transform like WF-VAE Chen et al. (2024…
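For readers unfamiliar with the multi-level Haar wavelets mentioned in the Figure 6 caption, here is a small pure-Python sketch of the standard orthonormal 2D Haar transform. It illustrates the general technique only; how T-DVAE wires the sub-bands into its encoder is not described on this page. Sub-band naming conventions (LH versus HL) differ between libraries, so the labels below are one possible choice.

```python
from typing import Dict, List

Grid = List[List[float]]


def haar2d(image: Grid) -> Dict[str, Grid]:
    """One level of an orthonormal 2D Haar transform on an even-sized grid.

    Each 2x2 patch [[a, b], [c, d]] produces one coefficient per sub-band,
    so every sub-band has half the height and half the width of the input.
    """
    h, w = len(image), len(image[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    bands: Dict[str, Grid] = {k: [] for k in ("LL", "LH", "HL", "HH")}
    for y in range(0, h, 2):
        rows: Dict[str, List[float]] = {k: [] for k in bands}
        for x in range(0, w, 2):
            a, b = image[y][x], image[y][x + 1]
            c, d = image[y + 1][x], image[y + 1][x + 1]
            rows["LL"].append((a + b + c + d) / 2)  # scaled local average
            rows["LH"].append((a - b + c - d) / 2)  # left-minus-right differences
            rows["HL"].append((a + b - c - d) / 2)  # top-minus-bottom differences
            rows["HH"].append((a - b - c + d) / 2)  # diagonal differences
        for k in bands:
            bands[k].append(rows[k])
    return bands


def haar_multilevel(image: Grid, levels: int) -> List[Dict[str, Grid]]:
    """Apply haar2d repeatedly to the LL band, returning the bands of each level."""
    out = []
    current = image
    for _ in range(levels):
        bands = haar2d(current)
        out.append(bands)
        current = bands["LL"]
    return out


if __name__ == "__main__":
    flat = [[1.0] * 4 for _ in range(4)]
    for level, bands in enumerate(haar_multilevel(flat, 2), start=1):
        print(level, bands["LL"], bands["HH"])
```

Each level halves both spatial dimensions while quadrupling the channel count, which is a fixed, parameter-free form of spatial compression; that is the property a lightweight encoder can exploit.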

**Figure 7.** T2I qualitative comparison.

Excerpt from Section 5 (Conclusion): This paper presents ElasticDiT, a hardware-aware framework that significantly improves the quality-efficiency trade-off for mobile image generation. The proposed Spatio-Depth Elastic Architecture provides the flexibility to reconfigure model depth and resolution on-the-fly. This ensures optimal performance across varying hardware limitations with a unified set of pa…

Original abstract

The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models like Stable Diffusion-3 and FLUX.1. However, deploying these models on resource-constrained mobile devices entails prohibitive computational and memory overhead. While efficiency-driven approaches like Linear-DiT and static pruning alleviate bottlenecks, they often incur quality degradation. Unlike cloud environments, mobile constraints require a single-model paradigm that dynamically balances fidelity and latency. We introduce ElasticDiT, which achieves this dynamic trade-off by adjusting spatial compression ratios and DiT block depths. By integrating Shift Sparse Block Attention (SSBA) and a Tiny DWT-Distilled VAE (T-DVAE), ElasticDiT reduces inference latency and memory footprint while maintaining image quality. Experiments confirm that ElasticDiT effectively covers a wide range of fidelity-latency trade-offs within a single set of parameters. By jointly adjusting compression and depth, a single ElasticDiT model can be reconfigured on-the-fly to outperform task-specific baselines. Specifically, our flex lite variant achieves an HPS of 32.87, surpassing the Flux model, while maintaining competitive quality at 84.16 percent average sparsity through SSBA. Furthermore, the plug-and-play T-DVAE provides SD3-level reconstruction with only 1/8x the computational cost of standard VAEs, and Flow-GRPO boosts semantic alignment (GenEval: 66.93 to 73.62). These results demonstrate that ElasticDiT offers a versatile, hardware-adaptive solution that eliminates the need for multiple specialized models, providing a promising path for future high-resolution image generation on mobile devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

ElasticDiT shows a single DiT that can adjust compression and depth at runtime for mobile use with SSBA and T-DVAE, but the abstract gives no training details to back the claim that one weight set stays stable across configs.


The main thing here is that ElasticDiT tries to deliver one set of weights that can be reconfigured on the fly by changing spatial compression ratio and DiT block depth, using their Shift Sparse Block Attention and a Tiny DWT-Distilled VAE to keep quality while cutting latency and memory on mobile hardware. This targets the real gap between cloud-scale DiTs like Flux or SD3 and what phones can run without shipping multiple specialized models.

The reported flex lite point hitting HPS 32.87 and the GenEval jump from 66.93 to 73.62 are concrete numbers that show they are aiming at competitive output rather than just speed hacks. The T-DVAE claim of SD3-level reconstruction at 1/8 the cost is also a straightforward practical win if it holds. What stands out as new is the specific pairing of runtime-adjustable depth and compression with the SSBA mechanism; earlier pruning or linear-attention work tends to be fixed after training. The paper does a reasonable job laying out why a single-model elastic approach matters for deployment and energy use.

The soft spot is exactly the one the stress-test flags. The central claim requires that the same trained weights deliver the quoted quality numbers no matter which compression ratio or block depth is chosen at inference time, yet the abstract supplies no description of the training objective, any capacity-regularization term, or schedule that would enforce that invariance. If the model was primarily optimized around a few operating points, the on-the-fly superiority could be narrower than presented. There is also no mention of run counts, variance, or controls, which leaves the quantitative improvements only partially supported.

This is aimed at people working on efficient diffusion models and on-device generative AI. A reader who needs ideas for mobile DiT variants or hardware-adaptive generation would find the architecture choices and operating points worth examining.

I would send it for peer review; the problem is timely and the engineering direction is clear enough that referees could usefully press on the training details and elasticity ablations.

Referee Report

1 major / 3 minor

Summary. The paper proposes ElasticDiT, a Diffusion Transformer architecture for high-resolution image generation on mobile devices. It enables dynamic trade-offs by jointly adjusting spatial compression ratios and DiT block depths within a single trained model. The approach integrates Shift Sparse Block Attention (SSBA) to achieve high sparsity and a Tiny DWT-Distilled VAE (T-DVAE) for efficient encoding, with additional use of Flow-GRPO. Reported results include a flex lite variant achieving HPS of 32.87 (surpassing Flux) at 84.16% average sparsity, SD3-level reconstruction at 1/8x VAE cost, and GenEval improvement from 66.93 to 73.62.

Significance. If the central claims are substantiated, the work would offer a practical advance for deploying high-fidelity generative models under mobile constraints. A single-parameter-set model supporting on-the-fly reconfiguration across fidelity-latency points could reduce the engineering overhead of maintaining multiple specialized models. The quantitative gains in human preference and semantic alignment metrics, combined with the sparsity and efficiency techniques, indicate potential impact in resource-constrained deployment scenarios.

major comments (1)

Abstract: The load-bearing claim that 'a single ElasticDiT model can be reconfigured on-the-fly to outperform task-specific baselines' while preserving quality (e.g., HPS 32.87 at 84.16% sparsity) across arbitrary choices of spatial compression ratio and DiT block depth is not supported by any description of the training objective, regularization, or schedule that would enforce invariance to these runtime choices. If the elastic paths were optimized only for a subset of configurations, the reported superiority would not generalize.

minor comments (3)

Abstract: The quantitative results (HPS 32.87, GenEval 73.62) are presented without reference to specific tables, figures, or sections containing the full experimental setup, controls, or number of runs.
Abstract: The 'flex lite variant' is mentioned without clarifying its exact relation to the elastic parameters (compression ratio and block depth) or how it differs from other configurations.
The manuscript would benefit from explicit discussion of whether SSBA and T-DVAE require any per-configuration fine-tuning or if they are trained once to support all elastic settings.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential practical impact of ElasticDiT. We address the single major comment below and will incorporate clarifications to strengthen the manuscript.


Referee: Abstract: The load-bearing claim that 'a single ElasticDiT model can be reconfigured on-the-fly to outperform task-specific baselines' while preserving quality (e.g., HPS 32.87 at 84.16% sparsity) across arbitrary choices of spatial compression ratio and DiT block depth is not supported by any description of the training objective, regularization, or schedule that would enforce invariance to these runtime choices. If the elastic paths were optimized only for a subset of configurations, the reported superiority would not generalize.

Authors: We appreciate this observation and agree that the abstract claim requires explicit grounding in the training procedure. The full manuscript (Section 3.2) describes a multi-configuration training strategy in which spatial compression ratios and DiT block depths are randomly sampled per batch during optimization; the diffusion loss is computed on the sampled path, and a path-consistency regularization term is added to penalize output variance across different elastic settings. The training schedule progressively widens the sampled configuration space over epochs. This design is intended to promote invariance rather than specialization to a narrow subset. We will revise the abstract to briefly reference this training approach and expand the methods section with additional equations and pseudocode for the objective and sampling schedule to make the support for the claim fully transparent.

Revision: yes
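The training recipe named in this simulated rebuttal (per-batch configuration sampling, a variance penalty across elastic settings, and a progressively widening schedule) can be sketched in a few lines. This is an illustration of those three ideas under invented assumptions, not the authors' objective: the configuration grid, the widening rule, the penalty weight, and all function names are hypothetical, and only the depth axis is widened here for brevity.

```python
import math
import random
from statistics import pvariance
from typing import List, Sequence, Tuple

# Hypothetical configuration space; depths are ordered from the full model outward.
COMPRESSIONS = (16, 32)
DEPTHS = (28, 24, 20, 14)


def allowed_depths(epoch: int, total_epochs: int) -> Sequence[int]:
    """Progressively widen the set of depths that may be sampled."""
    progress = min(1.0, (epoch + 1) / total_epochs)
    count = max(1, math.ceil(progress * len(DEPTHS)))
    return DEPTHS[:count]


def sample_config(epoch: int, total_epochs: int, rng: random.Random) -> Tuple[int, int]:
    """Draw one (compression, depth) pair for the current batch."""
    depth = rng.choice(list(allowed_depths(epoch, total_epochs)))
    return rng.choice(COMPRESSIONS), depth


def total_loss(diffusion_loss: float, path_outputs: List[List[float]],
               weight: float = 0.1) -> float:
    """Diffusion loss on the sampled path plus a variance penalty across paths.

    `path_outputs` holds one flattened prediction per elastic setting for the
    same input; the penalty is the mean per-element variance across settings.
    """
    per_element = [pvariance(values) for values in zip(*path_outputs)]
    consistency = sum(per_element) / len(per_element)
    return diffusion_loss + weight * consistency


if __name__ == "__main__":
    rng = random.Random(0)
    for epoch in range(4):
        print(epoch, allowed_depths(epoch, 4), sample_config(epoch, 4, rng))
```

The penalty is zero when every elastic path produces the same prediction, which is the invariance the referee asks the authors to demonstrate.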

Circularity Check

0 steps flagged

No circularity: empirical engineering results with no self-referential derivation


The paper describes an empirical architecture (ElasticDiT) that supports runtime reconfiguration of compression ratios and block depths, with quality metrics (HPS 32.87, 84.16% sparsity) presented as measured experimental outcomes rather than predictions derived from fitted parameters or self-citations. No equations, uniqueness theorems, or ansatzes are invoked that reduce by construction to the inputs; the central claim rests on reported performance across configurations, which is externally falsifiable via replication on the stated benchmarks. This is a standard non-circular engineering contribution.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claims rest on the effectiveness of newly introduced components whose performance is shown only through the reported experiments; no external benchmarks or formal proofs are mentioned.

free parameters (2)

spatial compression ratio
Chosen at runtime to trade fidelity for latency; specific values are not listed but are central to the elastic mechanism.
DiT block depth
Adjusted dynamically; the paper treats different depths as configurable without retraining.

axioms (1)

domain assumption: Diffusion process and transformer attention mechanisms behave predictably under the proposed sparsity and compression changes.
Invoked implicitly when claiming that quality is maintained across configurations.

invented entities (2)

Shift Sparse Block Attention (SSBA) · no independent evidence
purpose: Reduce attention computation while preserving quality at high sparsity levels.
New attention variant introduced to achieve 84.16% average sparsity.
Tiny DWT-Distilled VAE (T-DVAE) · no independent evidence
purpose: Provide SD3-level reconstruction at 1/8 the compute of standard VAEs.
New distilled VAE component presented as plug-and-play.

pith-pipeline@v0.9.0 · 5888 in / 1541 out tokens · 33566 ms · 2026-05-20T19:56:06.702466+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

By jointly adjusting compression and depth, a single ElasticDiT model can be reconfigured on-the-fly... Shift Sparse Block Attention (SSBA)... Tiny DWT-Distilled VAE (T-DVAE)
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

Spatio-Depth Elastic Architecture... Sparse-Depth Pruning... Unified Weight Co-Optimization

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 16 internal anchors

[1]

Progressive Distillation for Fast Sampling of Diffusion Models

Progressive distillation for fast sampling of diffusion models , author=. arXiv preprint arXiv:2202.00512 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

International Conference on Learning Representations , year=

Denoising diffusion implicit models , author=. International Conference on Learning Representations , year=

work page
[3]

Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed

Knowledge distillation in iterative generative models for improved sampling speed , author=. arXiv preprint arXiv:2101.02388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month=

Li, Xiuyu and Liu, Yijiang and Lian, Long and Yang, Huanrui and Dong, Zhen and Kang, Daniel and Zhang, Shanghang and Keutzer, Kurt , title=. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month=. 2023 , pages=

work page 2023
[5]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

work page
[6]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

work page
[7]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

work page 2016
[8]

2022 , journal=

Scalable Diffusion Models with Transformers , author=. 2022 , journal=

work page 2022
[9]

arXiv preprint arXiv:2306.05178 , year=

SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions , author=. arXiv preprint arXiv:2306.05178 , year=

work page arXiv
[10]

The Twelfth International Conference on Learning Representations , year=

ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models , author=. The Twelfth International Conference on Learning Representations , year=

work page
[11]

2023 , note=

Gemma: Open Models Based on Gemini Technology and Research , author=. 2023 , note=

work page 2023
[12]

arXiv preprint arXiv:2406.16747 , year=

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers , author=. arXiv preprint arXiv:2406.16747 , year=

work page arXiv
[13]

ArXiv , year=

EasyQuant: Post-training Quantization via Scale Optimization , author=. ArXiv , year=

work page
[14]

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

ZeroQ: A Novel Zero Shot Quantization Framework , author=. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page 2020
[15]

ArXiv , year=

SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation , author=. ArXiv , year=

work page
[16]

2021 , url=

Yuhang Li and Ruihao Gong and Xu Tan and Yang Yang and Peng Hu and Qi Zhang and Fengwei Yu and Wei Wang and Shi Gu , booktitle=. 2021 , url=

work page 2021
[17]

arXiv preprint arXiv:2001.08248 , year=

How much position information do convolutional neural networks encode? , author=. arXiv preprint arXiv:2001.08248 , year=

work page arXiv 2001
[18]

Advances in neural information processing systems , volume=

SegFormer: Simple and efficient design for semantic segmentation with transformers , author=. Advances in neural information processing systems , volume=

work page
[19]

Advances in Neural Information Processing Systems , volume=

The impact of positional encoding on length generalization in transformers , author=. Advances in Neural Information Processing Systems , volume=

work page
[20]

arXiv preprint arXiv:2203.16634 , year=

Transformer language models without positional encodings still learn positional information , author=. arXiv preprint arXiv:2203.16634 , year=

work page arXiv
[21]

International conference on machine learning , pages=

Transformers are rnns: Fast autoregressive transformers with linear attention , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020
[22]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[23]

Linformer: Self-Attention with Linear Complexity

Linformer: Self-attention with linear complexity , author=. arXiv preprint arXiv:2006.04768 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2006
[24]

Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

Efficient attention: Attention with linear complexities , author=. Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

work page
[25]

European Conference on Computer Vision , pages=

Hydra attention: Efficient attention with many heads , author=. European Conference on Computer Vision , pages=. 2022 , organization=

work page 2022
[26]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[27]

NeurIPS 2022 Workshop on Score-Based Methods , year=

All are worth words: a vit backbone for score-based diffusion models , author=. NeurIPS 2022 Workshop on Score-Based Methods , year=

work page 2022
[28]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Scalable diffusion models with transformers , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[29]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Condition-Aware Neural Network for Controlled Image Generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[30]

Forty-first International Conference on Machine Learning , year=

Scaling rectified flow transformers for high-resolution image synthesis , author=. Forty-first International Conference on Machine Learning , year=

work page
[31]

arXiv:2309.15807

Emu: Enhancing image generation models using photogenic needles in a haystack , author=. arXiv preprint arXiv:2309.15807 , year=

work page arXiv
[32]

International Conference on Learning Representations , year=

PixArt- : Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis , author=. International Conference on Learning Representations , year=

work page
[33]

Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023a

Pixart- : Weak-to-strong training of diffusion transformer for 4k text-to-image generation , author=. arXiv preprint arXiv:2403.04692 , year=

work page arXiv
[34]

International conference on machine learning , pages=

Efficientnetv2: Smaller models and faster training , author=. International conference on machine learning , pages=. 2021 , organization=

work page 2021
[35]

International conference on machine learning , pages=

Language modeling with gated convolutional networks , author=. International conference on machine learning , pages=. 2017 , organization=

work page 2017
[36]

Lumina-next: Making lumina-t2x stronger and faster with next-dit

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT , author=. arXiv preprint arXiv:2406.18583 , year=

work page arXiv
[37]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Sdxl: Improving latent diffusion models for high-resolution image synthesis , author=. arXiv preprint arXiv:2307.01952 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[38]

Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

Playground v2. 5: Three insights towards enhancing aesthetic quality in text-to-image generation , author=. arXiv preprint arXiv:2402.17245 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Playground v3: Improving text-to-image alignment with deep-fusion large language models

Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models , author=. arXiv preprint arXiv:2409.10695 , year=

work page arXiv
[40]

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding , author=. arXiv preprint arXiv:2405.08748 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

OpenAI. Dalle-3. 2023

work page 2023
[42]

Black Forest Labs. FLUX. 2024

work page 2024
[43]

Cheng Lu , title =. 2023

work page 2023
[44]

Advances in Neural Information Processing Systems , volume=

Imagereward: Learning and evaluating human preferences for text-to-image generation , author=. Advances in Neural Information Processing Systems , volume=

work page
[45]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma 2: Improving open language models at a practical size , author=. arXiv preprint arXiv:2408.00118 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

Gemma: Open Models Based on Gemini Research and Technology

Gemma: Open models based on gemini research and technology , author=. arXiv preprint arXiv:2403.08295 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

work page
[48]

Language Models are Few-Shot Learners

Language models are few-shot learners , author=. arXiv preprint arXiv:2005.14165 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2005
[49]

Advances in neural information processing systems , volume=

Photorealistic text-to-image diffusion models with deep language understanding , author=. Advances in neural information processing systems , volume=

work page
[50]

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers , author=. arXiv preprint arXiv:2211.01324 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[51]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Scaling up gans for text-to-image synthesis , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[52]

Advances in Neural Information Processing Systems , volume=

Snapfusion: Text-to-image diffusion model on mobile devices within two seconds , author=. Advances in Neural Information Processing Systems , volume=

work page
[53]

arXiv preprint arXiv:2311.16567 , year=

Mobilediffusion: Subsecond text-to-image generation on mobile devices , author=. arXiv preprint arXiv:2311.16567 , year=

work page arXiv
[54]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Ella: Equip diffusion models with llm for enhanced semantic alignment , author=. arXiv preprint arXiv:2403.05135 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[55]

Advances in Neural Information Processing Systems , volume=

Geneval: An object-focused framework for evaluating text-to-image alignment , author=. Advances in Neural Information Processing Systems , volume=

work page
[56]

Exploring the role of large language models in prompt encoding for diffusion models

Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models , author=. arXiv preprint arXiv:2406.11831 , year=

work page arXiv
[57]

DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models , author=. arXiv preprint arXiv:2211.01095 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[58]

Advances in Neural Information Processing Systems , volume=

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps , author=. Advances in Neural Information Processing Systems , volume=

work page
[59]

Flow Matching for Generative Modeling

Flow matching for generative modeling , author=. arXiv preprint arXiv:2210.02747 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[60]

Advances in neural information processing systems , volume=

Elucidating the design space of diffusion-based generative models , author=. Advances in neural information processing systems , volume=

work page
[61]

Advances in neural information processing systems , volume=

Denoising diffusion probabilistic models , author=. Advances in neural information processing systems , volume=

work page
[62]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[63] VILA: On pre-training for visual language models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[64] Triton: An intermediate language and compiler for tiled neural network computations. Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages.
[65] Diffusion models without attention. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[66] DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention. arXiv preprint arXiv:2405.18428.
[68] Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models. 2023.
[69] DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis. arXiv preprint arXiv:2405.14224.
[70] U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers. arXiv preprint arXiv:2405.02730.
[71] Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models. arXiv preprint arXiv:2410.10733, 2024.
[72] Swin Transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[73] WaveVAE: Wavelet-Enhanced Variational Autoencoder for High-Fidelity Image Compression. IEEE Transactions on Image Processing.
[74] UltraPixel: Advancing ultra-high-resolution image synthesis to new peaks. arXiv preprint arXiv:2407.02158.
[75] Xu, Yuzhang and Li, Jialu and Guo, Qiulin and Zhou, Yuxiang and Zhang, Ziyu and Zhou, Ji and Chen, Shuai.
[76] Li, Xudong and Wang, Shuai and Zhang, Ziqi and Liu, Xiaoli and Wu, Tianyi and Wu, Ying and Li, Xing and Li, Jie.
[77] Li, Yutong and Wang, Yanan and Liu, Zizheng and Zhu, Hongjun and Chen, Bin and Chen, Zhiqiang.
[78] Flow-GRPO: Training Flow Matching Models via Online RL. arXiv preprint arXiv:2505.05470.
[79] Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. arXiv preprint arXiv:2306.09341.
[80] MoBA: Mixture of Block Attention for Long-Context LLMs. Enzhe Lu and Zhejun Jiang and Jingyuan Liu and Yulun Du and Tao Jiang and Chao Hong and Shaowei Liu and Weiran He and Enming Yuan and Yuzhi Wang and Zhiqi Huang and Huan Yuan and Suting Xu and Xinran Xu and Guokun Lai and Yanru Chen and Huabin Zheng and Junjie Yan and Jianlin Su and Yuxin Wu and Yutao Zhang and Zhilin Yang and Xinyu Zhou and Mingxing Zhan...
[81] SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training. 2024.
