Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, Suhail Doshi · 2024 · cs.CV · arXiv 2402.17245

Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.

36 Pith papers citing it

Baseline 60% of classified citations

abstract

In this work, we share three insights for achieving state-of-the-art aesthetic quality in text-to-image generative models. We focus on three critical aspects for model improvement: enhancing color and contrast, improving generation across multiple aspect ratios, and improving human-centric fine details. First, we delve into the significance of the noise schedule in training a diffusion model, demonstrating its profound impact on realism and visual fidelity. Second, we address the challenge of accommodating various aspect ratios in image generation, emphasizing the importance of preparing a balanced bucketed dataset. Lastly, we investigate the crucial role of aligning model outputs with human preferences, ensuring that generated images resonate with human perceptual expectations. Through extensive analysis and experiments, Playground v2.5 demonstrates state-of-the-art performance in terms of aesthetic quality under various conditions and aspect ratios, outperforming both widely-used open-source models like SDXL and Playground v2, and closed-source commercial systems such as DALLE 3 and Midjourney v5.2. Our model is open-source, and we hope the development of Playground v2.5 provides valuable guidelines for researchers aiming to elevate the aesthetic quality of diffusion-based image generation models.

citation-role summary

baseline: 5 · background: 3 · dataset: 1 · method: 1

citation-polarity summary

baseline: 5 · background: 2 · unclear: 1 · use dataset: 1 · use method: 1

representative citing papers

Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting

cs.CV · 2026-03-30 · unverdicted · novelty 7.0

Drift-AR achieves a 3.8-5.5x speedup in AR-diffusion image models through entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

cs.CV · 2025-03-10 · unverdicted · novelty 7.0

Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.

T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts

cs.CV · 2024-12-05 · unverdicted · novelty 7.0

T2I-FactualBench is a new three-tier benchmark for factuality of knowledge-intensive concepts in T2I models, using multi-round VQA evaluation to show SOTA models need improvement.

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

cs.CV · 2024-10-17 · unverdicted · novelty 7.0

Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

cs.CV · 2024-06-10 · conditional · novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

HierEdit enables efficient 4K image editing via low-resolution proxy localization followed by hierarchical local-window diffusion that reuses unaltered regions as conditioning.

Latent Action Control for Reasoning-Guided Unified Image Generation

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

Latent Action Control learns unobserved action trajectories via variational alignment and GRPO to inject reasoning into flow-based image generation, yielding gains on compositional benchmarks.

DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

DiRotQ uses PCA-based rotation-aware activation quantization combined with GPTQ to achieve better FID and PSNR in 4-bit diffusion transformers than prior methods like SVDQuant.

ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency points for high-resolution image generation on mobile devices.

Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.

IncreFA: Breaking the Static Wall of Generative Model Attribution

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

IncreFA uses hierarchical constraints with learnable orthogonal priors and a latent memory bank to enable continual adaptation for attributing images to new generative models, reporting SOTA accuracy and 98.93% unseen detection on a 28-model benchmark.

TwoHamsters: Benchmarking Multi-Concept Compositional Unsafety in Text-to-Image Models

cs.CR · 2026-04-17 · unverdicted · novelty 6.0

The TwoHamsters benchmark shows that T2I models like FLUX generate unsafe multi-concept images at a 99.52% rate, while defenses like LLaVA-Guard achieve only 41.06% recall.

Self-Adversarial One Step Generation via Condition Shifting

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.

Nucleus-Image: Sparse MoE for Image Generation

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.

BiasIG: Benchmarking Multi-dimensional Social Biases in Text-to-Image Models

cs.CY · 2026-04-13 · conditional · novelty 6.0

BiasIG is a multi-dimensional benchmark for social biases in T2I models that shows debiasing interventions frequently cause confounding discrimination effects.

Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model

cs.CV · 2025-11-30 · unverdicted · novelty 6.0

Lotus-2 is a two-stage deterministic adaptation of diffusion priors that achieves state-of-the-art monocular depth estimation with only 59K training samples.

PixelDiT: Pixel Diffusion Transformers for Image Generation

cs.CV · 2025-11-25 · conditional · novelty 6.0

PixelDiT generates images in pixel space with a dual-level transformer and reaches 1.61 FID on ImageNet 256, outperforming prior pixel-space models.

VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

cs.CV · 2025-10-21 · unverdicted · novelty 6.0

VFM-VAE uses a frozen VFM directly as the LDM tokenizer via a custom decoder, reaching gFID 2.22 in 80 epochs and 1.62 after 640 epochs.

Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

cs.CV · 2025-05-08 · unverdicted · novelty 6.0

Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.

Autoregressive Video Generation without Vector Quantization

cs.CV · 2024-12-18 · unverdicted · novelty 6.0

NOVA reformulates video generation as non-quantized autoregressive frame-by-frame temporal prediction combined with set-by-set spatial prediction, outperforming prior AR video models and some diffusion models in efficiency and quality.

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

cs.CV · 2024-10-14 · unverdicted · novelty 6.0

Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.

Emu3: Next-Token Prediction is All You Need

cs.CV · 2024-09-27 · unverdicted · novelty 6.0

Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

cs.CV · 2024-05-14 · conditional · novelty 6.0

Hunyuan-DiT is a new multi-resolution diffusion transformer that achieves state-of-the-art Chinese text-to-image generation through custom architecture, data pipelines, and multimodal caption refinement.

MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.

citing papers explorer

Showing 36 of 36 citing papers.

Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting cs.CV · 2026-03-30 · unverdicted · none · ref 12 · internal anchor
Drift-AR achieves a 3.8-5.5x speedup in AR-diffusion image models through entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation cs.CV · 2025-03-10 · unverdicted · none · ref 23 · internal anchor
Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.
T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts cs.CV · 2024-12-05 · unverdicted · none · ref 26 · internal anchor
T2I-FactualBench is a new three-tier benchmark for factuality of knowledge-intensive concepts in T2I models, using multi-round VQA evaluation to show SOTA models need improvement.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation cs.CV · 2024-10-17 · unverdicted · none · ref 44 · internal anchor
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation cs.CV · 2024-06-10 · conditional · none · ref 17 · internal anchor
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing cs.CV · 2026-05-17 · unverdicted · none · ref 25 · internal anchor
HierEdit enables efficient 4K image editing via low-resolution proxy localization followed by hierarchical local-window diffusion that reuses unaltered regions as conditioning.
Latent Action Control for Reasoning-Guided Unified Image Generation cs.CV · 2026-05-16 · unverdicted · none · ref 17 · internal anchor
Latent Action Control learns unobserved action trajectories via variational alignment and GRPO to inject reasoning into flow-based image generation, yielding gains on compositional benchmarks.
DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers cs.CV · 2026-05-16 · unverdicted · none · ref 36 · internal anchor
DiRotQ uses PCA-based rotation-aware activation quantization combined with GPTQ to achieve better FID and PSNR in 4-bit diffusion transformers than prior methods like SVDQuant.
ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices cs.CV · 2026-05-15 · unverdicted · none · ref 38 · internal anchor
ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency points for high-resolution image generation on mobile devices.
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation cs.CV · 2026-04-20 · unverdicted · none · ref 77 · internal anchor
By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.
IncreFA: Breaking the Static Wall of Generative Model Attribution cs.CV · 2026-04-20 · unverdicted · none · ref 27 · internal anchor
IncreFA uses hierarchical constraints with learnable orthogonal priors and a latent memory bank to enable continual adaptation for attributing images to new generative models, reporting SOTA accuracy and 98.93% unseen detection on a 28-model benchmark.
TwoHamsters: Benchmarking Multi-Concept Compositional Unsafety in Text-to-Image Models cs.CR · 2026-04-17 · unverdicted · none · ref 2 · internal anchor
The TwoHamsters benchmark shows that T2I models like FLUX generate unsafe multi-concept images at a 99.52% rate, while defenses like LLaVA-Guard achieve only 41.06% recall.
Self-Adversarial One Step Generation via Condition Shifting cs.CV · 2026-04-14 · unverdicted · none · ref 11 · internal anchor
APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.
Nucleus-Image: Sparse MoE for Image Generation cs.CV · 2026-04-14 · unverdicted · none · ref 54 · internal anchor
A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
BiasIG: Benchmarking Multi-dimensional Social Biases in Text-to-Image Models cs.CY · 2026-04-13 · conditional · none · ref 46 · internal anchor
BiasIG is a multi-dimensional benchmark for social biases in T2I models that shows debiasing interventions frequently cause confounding discrimination effects.
Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model cs.CV · 2025-11-30 · unverdicted · none · ref 55 · internal anchor
Lotus-2 is a two-stage deterministic adaptation of diffusion priors that achieves state-of-the-art monocular depth estimation with only 59K training samples.
PixelDiT: Pixel Diffusion Transformers for Image Generation cs.CV · 2025-11-25 · conditional · none · ref 48 · internal anchor
PixelDiT generates images in pixel space with a dual-level transformer and reaches 1.61 FID on ImageNet 256, outperforming prior pixel-space models.
VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models cs.CV · 2025-10-21 · unverdicted · none · ref 12 · internal anchor
VFM-VAE uses a frozen VFM directly as the LDM tokenizer via a custom decoder, reaching gFID 2.22 in 80 epochs and 1.62 after 640 epochs.
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation cs.CV · 2025-05-08 · unverdicted · none · ref 38 · internal anchor
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.
Autoregressive Video Generation without Vector Quantization cs.CV · 2024-12-18 · unverdicted · none · ref 14 · internal anchor
NOVA reformulates video generation as non-quantized autoregressive frame-by-frame temporal prediction combined with set-by-set spatial prediction, outperforming prior AR video models and some diffusion models in efficiency and quality.
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers cs.CV · 2024-10-14 · unverdicted · none · ref 10 · internal anchor
Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.
Emu3: Next-Token Prediction is All You Need cs.CV · 2024-09-27 · unverdicted · none · ref 46 · internal anchor
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding cs.CV · 2024-05-14 · conditional · none · ref 16 · internal anchor
Hunyuan-DiT is a new multi-resolution diffusion transformer that achieves state-of-the-art Chinese text-to-image generation through custom architecture, data pipelines, and multimodal caption refinement.
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset cs.CV · 2026-05-20 · unverdicted · none · ref 54 · internal anchor
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
LongCat-Image Technical Report cs.CV · 2025-12-08 · unverdicted · none · ref 22 · internal anchor
LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer cs.CV · 2025-11-27 · unverdicted · none · ref 38 · internal anchor
Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and streamlined training.
Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing cs.CV · 2025-09-02 · unverdicted · none · ref 11 · internal anchor
Rebalancing designer-painter roles by assigning design to the understanding module via the new DIM dataset yields SOTA image editing performance with a 4.6B model.
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning cs.CV · 2025-08-28 · unverdicted · none · ref 10 · internal anchor
Pref-GRPO stabilizes T2I RL training by using pairwise win rates from preference models as rewards instead of normalized pointwise scores, while UniGenBench enables finer-grained model evaluation across themes and criteria.
Qwen-Image Technical Report cs.CV · 2025-08-04 · unverdicted · none · ref 16 · internal anchor
Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive encoding for editing consistency.
OmniGen2: Towards Instruction-Aligned Multimodal Generation cs.CV · 2025-06-23 · unverdicted · none · ref 36 · internal anchor
OmniGen2 introduces a unified generative model with two distinct decoding pathways and a decoupled image tokenizer that achieves competitive results on text-to-image and editing benchmarks plus state-of-the-art consistency among open-source models on the new OmniContext benchmark.
Emerging Properties in Unified Multimodal Pretraining cs.CV · 2025-05-20 · unverdicted · none · ref 38 · internal anchor
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset cs.CV · 2025-05-14 · conditional · none · ref 15 · internal anchor
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
Adaptive Forensic Feature Refinement via Intrinsic Importance Perception cs.CV · 2026-04-18 · unverdicted · none · ref 27 · internal anchor
I2P adaptively selects the most discriminative layers from visual foundation models for synthetic image detection and constrains task updates to low-sensitivity parameter subspaces to improve specificity without harming generalization.
Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning cs.CV · 2026-04-10 · unverdicted · none · ref 26 · internal anchor
A training-free method with time-dependent attention gating and trajectory pruning enhances object-background balance in diffusion-based image synthesis.
Show-o2: Improved Native Unified Multimodal Models cs.CV · 2025-06-18 · unverdicted · none · ref 57 · internal anchor
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling cs.AI · 2025-01-29 · conditional · none · ref 22 · internal anchor
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
