pith. machine review for the scientific record.

arxiv: 2310.00426 · v3 · submitted 2023-09-30 · 💻 cs.CV


PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Chongjian Ge, Enze Xie, Huchuan Lu, James Kwok, Jincheng Yu, Junsong Chen, Lewei Yao, Ping Luo, Yue Wu, Zhenguo Li, Zhongdao Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 20:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image synthesis · diffusion transformer · training efficiency · dense captions · PIXART-α · vision-language model · image generation · high-resolution synthesis

The pith

PIXART-α trains a high-quality text-to-image diffusion transformer in 10.8 percent of the training time of Stable Diffusion v1.5.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that a diffusion transformer for text-to-image synthesis can reach competitive quality with models like Imagen and SDXL while using dramatically lower training resources. It achieves this through three specific designs that separate the learning of pixel relationships, text alignment, and visual appeal, replace class conditioning with cross-attention for text, and supply dense captions generated by a vision-language model. A sympathetic reader would care because the resulting model supports 1024-pixel outputs and claims to cut training cost to roughly one percent of larger alternatives like RAPHAEL. If the designs work as described, they would make it feasible for more groups to build capable generative systems from scratch without million-dollar compute budgets.

Core claim

PIXART-α is a Transformer-based diffusion model for text-to-image synthesis whose generation quality matches state-of-the-art systems. The model is obtained by decomposing training into three successive stages that optimize pixel dependency, text-image alignment, and aesthetic quality in turn; by embedding cross-attention modules inside a Diffusion Transformer to handle text conditions efficiently; and by automatically labeling training pairs with dense pseudo-captions from a large vision-language model. These choices produce a system that trains in 675 A100 GPU days, 10.8 percent of the time reported for Stable Diffusion v1.5, while supporting resolution up to 1024 pixels and image quality competitive with state-of-the-art generators such as Imagen, SDXL, and Midjourney.
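
To make the headline efficiency numbers concrete, here is a quick arithmetic check of the figures quoted above and in the abstract. Only the GPU-day and dollar figures come from the paper; the implied RAPHAEL cost is back-calculated from the stated 1 percent ratio rather than reported directly.

    # Sanity check of the cost figures in the abstract. Only the 675 / 6,250
    # GPU-day and $26,000 / $320,000 numbers are reported; the RAPHAEL figure
    # is inferred from the "merely 1%" claim and is therefore an assumption.
    pixart_gpu_days = 675        # A100 GPU days, PIXART-alpha
    sd15_gpu_days = 6_250        # A100 GPU days, Stable Diffusion v1.5
    pixart_cost_usd = 26_000
    sd15_cost_usd = 320_000

    print(f"fraction of SD v1.5 training time: {pixart_gpu_days / sd15_gpu_days:.1%}")  # 10.8%
    print(f"savings vs SD v1.5: ${sd15_cost_usd - pixart_cost_usd:,}")                  # $294,000
    print(f"implied RAPHAEL training cost: ${pixart_cost_usd / 0.01:,.0f}")             # ~$2,600,000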

What carries the argument

Three-stage training decomposition together with a Diffusion Transformer that uses cross-attention for text conditioning and dense VLM-generated captions for alignment.
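
To ground the conditioning mechanism named here, the sketch below shows one way cross-attention to text-encoder tokens can be inserted into a DiT-style block in place of a class-label branch. It is a minimal illustration of the pattern, not the authors' architecture; the dimensions, module names, and the omission of timestep conditioning are all simplifications.

    import torch
    import torch.nn as nn

    class CrossAttnDiTBlock(nn.Module):
        """Illustrative DiT-style block: self-attention over image latents,
        cross-attention to text tokens, then an MLP. Timestep conditioning
        (e.g., adaptive layer norm) is omitted for brevity."""

        def __init__(self, dim: int = 1152, heads: int = 16, text_dim: int = 4096):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.text_proj = nn.Linear(text_dim, dim)  # project text embeddings to model width
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm3 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
            # x: (batch, image_tokens, dim); text: (batch, text_tokens, text_dim)
            h = self.norm1(x)
            x = x + self.self_attn(h, h, h, need_weights=False)[0]
            t = self.text_proj(text)
            h = self.norm2(x)
            x = x + self.cross_attn(h, t, t, need_weights=False)[0]  # inject text condition
            return x + self.mlp(self.norm3(x))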

If this is right

  • High-resolution text-to-image synthesis up to 1024 pixels becomes practical at far lower compute budgets.
  • Training cost drops to 1 percent of larger state-of-the-art models such as RAPHAEL while preserving semantic control and artistic quality.
  • Carbon emissions associated with training fall by roughly 90 percent relative to Stable Diffusion v1.5.
  • Startups and research groups can iterate on new generative models without requiring thousands of GPU days.
  • The resulting models excel at image quality, artistry, and fine-grained semantic control according to the reported experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same staged-optimization pattern could be tested on video or 3D generation tasks to check whether similar efficiency gains appear.
  • If dense captions prove decisive, then future work might focus on improving the captioning model itself rather than scaling raw image data.
  • The cost reduction could enable more frequent public releases of updated models, shortening the iteration cycle in the field.
  • Open questions remain about whether the efficiency advantage persists when the same decomposition is applied to other transformer variants or non-diffusion backbones.

Load-bearing premise

The three-stage training split and the use of dense pseudo-captions are the main reasons for both the quality and the large reduction in training cost, rather than differences in total data volume or other implementation choices.

What would settle it

A side-by-side training run of the identical architecture and data scale, once with the three-stage schedule and dense captions and once without them, that shows whether the reported quality and speed gains disappear.
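
As a sketch of what such a settling experiment could look like, the grid below crosses the two proposed factors while holding architecture, data volume, and total compute fixed; the factor names are shorthand for this page's description, not the authors' protocol.

    # Hypothetical 2x2 ablation grid isolating the two load-bearing choices.
    from itertools import product

    factors = {
        "staged_training": [True, False],  # three-stage schedule vs. single joint run
        "dense_captions":  [True, False],  # VLM pseudo-captions vs. raw alt-text
    }
    runs = [dict(zip(factors, combo)) for combo in product(*factors.values())]
    for i, cfg in enumerate(runs):
        print(f"run {i}: {cfg}")
    # The efficiency claim survives only if the (True, True) cell reaches the
    # reported quality at roughly 675 GPU days and the other cells do not.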

original abstract

The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-$\alpha$'s training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-$\alpha$ only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly \$300,000 (\$26,000 vs. \$320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-$\alpha$ excels in image quality, artistry, and semantic control. We hope PIXART-$\alpha$ will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PixArt-α, a Transformer-based diffusion model for text-to-image synthesis. It claims competitive photorealistic quality with SOTA models (SDXL, Imagen, RAPHAEL, Midjourney) at up to 1024px resolution, enabled by three designs: (1) three-stage training decomposition optimizing pixel dependency, text-image alignment, and aesthetics separately; (2) efficient DiT with cross-attention for text conditioning instead of class labels; (3) VLM-generated dense pseudo-captions for high-informative data. This yields training in 675 A100 GPU days (10.8% of SD v1.5's 6,250 days), saving ~$300k and 90% CO2, with extensive visual/quantitative experiments.

Significance. If the efficiency and quality claims hold under controlled conditions, the work could substantially lower barriers for training high-quality T2I models, offering cost and environmental benefits plus practical insights on staged training and caption density for the AIGC community. The empirical comparisons to multiple baselines and support for high-res synthesis are notable strengths.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): The central efficiency claim (675 vs. 6,250 A100 GPU days, 10.8% of SD v1.5) attributes the speedup to the three core designs, but provides no controls or reporting for total data volume processed, number of optimization steps, model parameter count, or auxiliary VLM captioning compute. Without matched ablations or full hyperparameter tables isolating these factors, differences in data scale or implementation details could drive the reported savings rather than the proposed decomposition.
  2. [§3] §3 (Training Strategy Decomposition): The three-stage approach is presented as separately optimizing distinct objectives, yet the manuscript lacks quantitative ablations demonstrating the incremental gains of each stage (or the full decomposition) over a single-stage baseline trained with equivalent total compute and data.
minor comments (2)
  1. [Abstract] Abstract: Limited detail on exact quantitative metrics (e.g., specific FID, CLIP, or human preference scores) and evaluation protocols for comparisons to SDXL, Imagen, and RAPHAEL.
  2. [Figures 1-2 and §4] Figures 1-2 and §4: Visual results support the quality claims, but additional details on conditioning, resolution, and baseline implementation (e.g., whether all models used identical data regimes) would improve clarity and reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of our efficiency claims and training strategy that merit clarification. We address each major comment below and indicate planned revisions to improve the manuscript.

point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The central efficiency claim (675 vs. 6,250 A100 GPU days, 10.8% of SD v1.5) attributes the speedup to the three core designs, but provides no controls or reporting for total data volume processed, number of optimization steps, model parameter count, or auxiliary VLM captioning compute. Without matched ablations or full hyperparameter tables isolating these factors, differences in data scale or implementation details could drive the reported savings rather than the proposed decomposition.

    Authors: We appreciate this observation. The manuscript reports model parameter counts (around 600M for PixArt-α) and total training steps, but we agree that a consolidated hyperparameter table and explicit data volume per stage would strengthen transparency. In the revision we will add such a table, including estimated VLM captioning cost (which is a one-time preprocessing step amortized over training). While direct matched ablations isolating every variable were not feasible within our compute budget, the staged approach demonstrably accelerates convergence on alignment and aesthetics objectives compared to joint training, as evidenced by our internal monitoring of loss curves and downstream metrics. Comparisons to SD v1.5 use the publicly stated training cost for that model. revision: partial

  2. Referee: [§3] §3 (Training Strategy Decomposition): The three-stage approach is presented as separately optimizing distinct objectives, yet the manuscript lacks quantitative ablations demonstrating the incremental gains of each stage (or the full decomposition) over a single-stage baseline trained with equivalent total compute and data.

    Authors: The referee correctly notes the absence of a full single-stage baseline trained for the same total compute. Such an experiment would require substantial additional resources and was not performed. Instead, we show progressive improvements across stages via FID, CLIP score, and human preference metrics in §4, supporting that each stage contributes distinct gains (pixel-level fidelity in stage 1, semantic alignment in stage 2, aesthetic quality in stage 3). We will expand §3 with a clearer rationale for the decomposition and include any available partial ablations (e.g., stage-wise metric deltas). We maintain that the decomposition enables more efficient use of data and objectives, but acknowledge a direct head-to-head comparison would be ideal. revision: partial
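
For readers trying to picture the schedule the rebuttal defends, here is a hedged outline of how three successive stages could be sequenced, with each stage initialized from the previous checkpoint; the datasets, step counts, and loss function are placeholders, since none of them are specified on this page.

    # Illustrative three-stage schedule matching the decomposition described
    # above. Stage names follow the paper's description; everything else
    # (data sources, step counts, objective) is a placeholder.
    STAGES = [
        {"name": "pixel_dependency",   "data": "class_conditioned_images"},
        {"name": "text_alignment",     "data": "dense_vlm_captions"},
        {"name": "aesthetic_finetune", "data": "curated_high_quality"},
    ]

    def train(model, loaders, diffusion_loss, optimizer, steps_per_stage):
        for stage in STAGES:
            for step, (latents, condition) in enumerate(loaders[stage["data"]]):
                if step >= steps_per_stage[stage["name"]]:
                    break
                loss = diffusion_loss(model, latents, condition)  # denoising objective
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # the checkpoint from each stage seeds the next, so pixel fidelity
            # and alignment learned earlier are inherited downstream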

Circularity Check

0 steps flagged

No circularity in derivation or efficiency claims

full rationale

The paper reports empirical training costs (675 A100 GPU days) and quality metrics from direct model training runs, compared against external baselines such as Stable Diffusion v1.5. No equations, fitted parameters presented as predictions, or self-citation chains appear in the provided text. The three core designs are methodological choices whose effects are measured experimentally rather than derived by construction from the inputs themselves. The efficiency attribution rests on reported wall-clock measurements, not tautological redefinitions or unverified self-references.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on standard diffusion model assumptions and empirical scaling behavior rather than new theoretical derivations.

free parameters (1)
  • stage-specific training hyperparameters
    Learning rates, batch sizes, and iteration counts per training stage are chosen to optimize each phase separately (an illustrative example follows this ledger).
axioms (2)
  • domain assumption Diffusion transformers can achieve high image quality when text conditions are properly injected via cross-attention.
    Inherited from prior DiT and diffusion literature.
  • domain assumption Dense pseudo-captions generated by a VLM improve text-image alignment learning.
    Core justification for the high-informative data component.
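
To illustrate what the stage-specific hyperparameters entry above covers, a per-stage configuration might look like the sketch below; every value is a placeholder, not a setting reported by the authors.

    # Hypothetical per-stage settings; values are illustrative only.
    stage_hparams = {
        "pixel_dependency":   {"lr": 2e-4, "batch_size": 256, "resolution": 256},
        "text_alignment":     {"lr": 1e-4, "batch_size": 128, "resolution": 512},
        "aesthetic_finetune": {"lr": 5e-5, "batch_size": 64,  "resolution": 1024},
    }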

pith-pipeline@v0.9.0 · 5716 in / 1192 out tokens · 40916 ms · 2026-05-12T20:33:40.515106+00:00 · methodology

discussion (0)


Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ImageAttributionBench: How Far Are We from Generalizable Attribution?

    cs.CV 2026-05 unverdicted novelty 7.0

    ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.

  2. What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 7.0

    A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.

  3. Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.

  4. SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters

    cs.CV 2026-04 unverdicted novelty 7.0

    Small VLMs show higher sycophancy (22.3% for 450M model) than larger ones (6.0% for 7B) when scoring image-text alignment on 173k fantasy portraits, quantified via a new Bluffing Coefficient metric.

  5. DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference

    cs.AR 2026-04 unverdicted novelty 7.0

    DRIFT uses resilience analysis, targeted DVFS, and adaptive rollback ABFT to deliver 36% average energy savings or 1.7x speedup in diffusion model inference while preserving generation quality.

  6. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    cs.CV 2024-03 unverdicted novelty 7.0

    ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

  7. Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

    cs.CV 2026-05 unverdicted novelty 6.0

    V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...

  8. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  9. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.

  10. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.

  11. The two clocks and the innovation window: When and how generative models learn rules

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

  12. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  13. SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.

  14. The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents

    cs.CV 2026-04 unverdicted novelty 6.0

    A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.

  15. EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    EmbodiedHead introduces a Rectified-Flow Diffusion Transformer with differentiable renderer and single-stream listening-speaking conditioning to achieve real-time high-fidelity conversational avatars.

  16. Generative Refinement Networks for Visual Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

  17. BiasIG: Benchmarking Multi-dimensional Social Biases in Text-to-Image Models

    cs.CY 2026-04 conditional novelty 6.0

    BiasIG is a multi-dimensional benchmark for social biases in T2I models that shows debiasing interventions frequently cause confounding discrimination effects.

  18. Evolutionary Token-Level Prompt Optimization for Diffusion Models

    cs.AI 2026-04 unverdicted novelty 6.0

    A genetic algorithm evolves CLIP token vectors to optimize aesthetic quality and prompt alignment in diffusion models, outperforming Promptist and random search by up to 23.93% on a combined fitness score.

  19. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  20. LTX-Video: Realtime Video Latent Diffusion

    cs.CV 2024-12 conditional novelty 6.0

    LTX-Video integrates Video-VAE and transformer for 1:192 latent compression and real-time video diffusion by moving patchifying to the VAE and letting the decoder finish denoising in pixel space.

  21. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  22. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    cs.CV 2024-08 unverdicted novelty 6.0

    CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

  23. VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    cs.CV 2023-10 unverdicted novelty 6.0

    Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.

  24. On the Limits of Latent Reuse in Diffusion Models

    stat.ML 2026-05 unverdicted novelty 5.0

    Reusing source latent spaces in diffusion models under distribution shift produces target score error set by principal-angle misalignment and diffusion-time-amplified ambient noise.

  25. CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

    physics.ins-det 2026-05 unverdicted novelty 5.0

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

  26. Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion

    cs.LG 2026-04 unverdicted novelty 5.0

    Diffusion Templates is a unified plugin framework that allows injecting various controllable capabilities into diffusion models through a standardized interface.

  27. Who Defines Fairness? Target-Based Prompting for Demographic Representation in Generative Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Target-based prompting lets users define fairness distributions for skin tones in generative AI, shifting outputs closer to chosen targets across 36 tested prompts for occupations and contexts.

  28. UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    cs.CV 2025-06 unverdicted novelty 5.0

    UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

  29. Open-Sora: Democratizing Efficient Video Production for All

    cs.CV 2024-12 unverdicted novelty 5.0

    Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...

  30. AHS: Adaptive Head Synthesis via Synthetic Data Augmentations

    cs.CV 2026-04 unverdicted novelty 4.0

    Adaptive Head Synthesis (AHS) employs head-reenacted synthetic data augmentation to enable robust head swapping on full upper-body images without paired training data.

  31. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

Reference graph

Works this paper leans on

154 extracted references · 154 canonical work pages · cited by 30 Pith papers · 1 internal anchor

  1. [2]

    ediffi: Text-to-image diffusion models with an ensemble of expert denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. In arXiv, 2022

  2. [3]

    All are worth words: A vit backbone for diffusion models

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In CVPR, 2023

  3. [4]

    A study on the evaluation of generative models

    Eyal Betzalel, Coby Penso, Aviv Navon, and Ethan Fetaya. A study on the evaluation of generative models. In arXiv, 2022

  4. [5]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020

  5. [6]

    Deepfloyd, 2023

    DeepFloyd. Deepfloyd, 2023. URL https://www.deepfloyd.ai/

  6. [7]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248-255. IEEE, 2009

  7. [8]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34: 8780-8794, 2021

  8. [9]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020 a

  9. [10]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In arXiv, 2020 b

  10. [11]

    Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts

    Zhida Feng, Zhenyu Zhang, Xintong Yu, Yewei Fang, Lanxin Li, Xuyi Chen, Yuxiang Lu, Jiaxiang Liu, Weichong Yin, Shikun Feng, et al. Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. In CVPR, 2023

  11. [12]

    Metabev: Solving sensor failures for 3d detection and map segmentation

    Chongjian Ge, Junsong Chen, Enze Xie, Zhongdao Wang, Lanqing Hong, Huchuan Lu, Zhenguo Li, and Ping Luo. Metabev: Solving sensor failures for 3d detection and map segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8721-8731, 2023

  12. [13]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15180-15190, 2023

  13. [14]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014

  14. [15]

    Transformer in transformer

    Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. NeurIPS, 2021

  15. [16]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022

  16. [17]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017

  17. [19]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020

  18. [20]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2021

  19. [21]

    T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation

    Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. In ICCV, 2023

  20. [22]

    Scaling up gans for text-to-image synthesis

    Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In CVPR, 2023

  21. [23]

    Diffusionclip: Text-guided diffusion models for robust image manipulation

    Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2426-2435, June 2022

  22. [24]

    Auto-encoding variational bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In arXiv, 2013

  23. [25]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023

  24. [26]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. In arXiv, 2023

  25. [27]

    Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022 a

  26. [28]

    Panoptic segformer: Delving deeper into panoptic segmentation with transformers

    Zhiqi Li, Wenhai Wang, Enze Xie, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, Ping Luo, and Tong Lu. Panoptic segformer: Delving deeper into panoptic segmentation with transformers. In CVPR, 2022 b

  27. [29]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014

  28. [30]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In arXiv, 2023

  29. [31]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021

  30. [32]

    Video swin transformer

    Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. In CVPR, 2022

  31. [33]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In arXiv, 2017

  32. [34]

    Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35: 5775-5787, 2022

  33. [35]

    Gpu selling, 2023

    Microsoft. Gpu selling, 2023. URL https://www.leadergpu.com/

  34. [36]

    Midjourney, 2023

    Midjourney. Midjourney, 2023. URL https://www.midjourney.com

  35. [37]

    T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

    Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In arXiv, 2023

  36. [38]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162-8171. PMLR, 2021

  37. [39]

    Nltk, 2023

    NLTK. Nltk, 2023. URL https://www.nltk.org/

  38. [40]

    Getting immediate speedups with a100 and tf32, 2023

    NVIDIA. Getting immediate speedups with a100 and tf32, 2023. URL https://developer.nvidia.com/blog/getting-immediate-speedups-with-a100-tf32

  39. [41]

    Dalle-2, 2023

    OpenAI. Dalle-2, 2023. URL https://openai.com/dall-e-2

  40. [42]

    Journeydb: A benchmark for generative image understanding

    Junting Pan, Keqiang Sun, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Journeydb: A benchmark for generative image understanding. In arXiv, 2023

  41. [43]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023

  42. [44]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  43. [45]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In arXiv, 2023

  44. [46]

    Dreamfusion: Text-to-3d using 2d diffusion

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv, 2022

  45. [47]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. OpenAI blog, 2018

  46. [48]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019

  47. [49]

    Variational inference with normalizing flows

    Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In ICML, 2015

  48. [50]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022

  49. [51]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015

  50. [52]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In arXiv, 2022

  51. [53]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022

  52. [54]

    Laion-400m: Open dataset of clip-filtered 400 million image-text pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. In arXiv, 2021

  53. [55]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015

  54. [56]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019

  55. [57]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021

  56. [58]

    Segmenter: Transformer for semantic segmentation

    Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In ICCV, 2021

  57. [59]

    Transtrack: Multiple object tracking with transformer

    Peize Sun, Jinkun Cao, Yi Jiang, Rufeng Zhang, Enze Xie, Zehuan Yuan, Changhu Wang, and Ping Luo. Transtrack: Multiple object tracking with transformer. In arXiv, 2020

  58. [60]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021

  59. [61]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017

  60. [62]

    Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021

  61. [63]

    Pvt v2: Improved baselines with pyramid vision transformer

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 2022

  62. [65]

    Segformer: Simple and efficient design for semantic segmentation with transformers

    Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34: 12077-12090, 2021

  63. [66]

    Difffit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning

    Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, and Zhenguo Li. Difffit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning. In ICCV, 2023

  64. [67]

    Holistically-nested edge detection

    Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In ICCV, 2015

  65. [69]

    Raphael: Text-to-image generation via large mixture of diffusion paths

    Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. Raphael: Text-to-image generation via large mixture of diffusion paths. In arXiv, 2023 b

  66. [70]

    Tokens-to-token vit: Training vision transformers from scratch on imagenet

    Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In ICCV, 2021

  67. [71]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023

  68. [72]

    Point transformer

    Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In ICCV, 2021

  69. [73]

    Fast training of diffusion models with masked transformers

    Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. In arXiv, 2023

  70. [74]

    Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers

    Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021

  71. [75]

    Deepvit: Towards deeper vision transformer

    Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. Deepvit: Towards deeper vision transformer. In arXiv, 2021

  72. [76]

    Understanding the robustness in vision transformers

    Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Animashree Anandkumar, Jiashi Feng, and Jose M Alvarez. Understanding the robustness in vision transformers. In International Conference on Machine Learning, pp. 27378-27394. PMLR, 2022

  73. [77]

    Getting Immediate Speedups with A100 and TF32

    NVIDIA. Getting Immediate Speedups with A100 and TF32. 2023

  74. [78]

    GPU selling

    Microsoft. GPU selling. 2023

  75. [79]

    Dalle-2

    OpenAI. Dalle-2. 2023

  76. [80]

    Midjourney

    Midjourney. Midjourney. 2023

  77. [81]

    DeepFloyd

    DeepFloyd. DeepFloyd. 2023

  78. [82]

    Diffusionclip: Text-guided diffusion models for robust image manipulation

    Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  79. [83]

    Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

    Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation. arXiv preprint arXiv:2212.11565

  80. [84]

    Dreamfusion: Text-to-3d using 2d diffusion

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv, 2022

Showing first 80 references.