{"work":{"id":"77157568-e4be-4041-bb20-388177fc59d0","openalex_id":null,"doi":null,"arxiv_id":"2310.00426","raw_key":null,"title":"PixArt-$\\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis","authors":null,"authors_text":"Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu","year":2023,"venue":"cs.CV","abstract":"The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-$\\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-$\\alpha$'s training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-$\\alpha$ only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly \\$300,000 (\\$26,000 vs. \\$320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. \nExtensive experiments demonstrate that PIXART-$\\alpha$ excels in image quality, artistry, and semantic control. We hope PIXART-$\\alpha$ will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.","external_url":"https://arxiv.org/abs/2310.00426","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T05:40:24.050601+00:00","pith_arxiv_id":"2310.00426","created_at":"2026-05-10T07:16:55.105888+00:00","updated_at":"2026-05-25T05:40:24.050601+00:00","title_quality_ok":true,"display_title":"PixArt-$\\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis","render_title":"PixArt-$\\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis"},"hub":{"state":{"work_id":"77157568-e4be-4041-bb20-388177fc59d0","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":69,"external_cited_by_count":null,"distinct_field_count":12,"first_pith_cited_at":"2023-10-30T13:12:40+00:00","last_pith_cited_at":"2026-05-22T08:50:10+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-27T21:07:55.446918+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":7},{"context_role":"baseline","n":5},{"context_role":"method","n":3},{"context_role":"extension","n":1}],"polarity_counts":[{"context_polarity":"background","n":8},{"context_polarity":"baseline","n":5},{"context_polarity":"use_method","n":2},{"context_polarity":"extend","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}