hub Mixed citations

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu · 2023 · cs.CV · arXiv 2310.00426

Mixed citation behavior. Most common role is background (50%).

87 Pith papers citing it

Background 50% of classified citations

open full Pith review browse 87 citing papers arXiv PDF

abstract

The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-$\alpha$'s training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-$\alpha$ only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly \$300,000 (\$26,000 vs. \$320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-$\alpha$ excels in image quality, artistry, and semantic control. We hope PIXART-$\alpha$ will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 baseline 5 method 3 extension 1

citation-polarity summary

background 8 baseline 5 use method 2 extend 1

representative citing papers

Diffusion Model Attribution via Spectral Coupling of Denoiser Responses

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

SDS extracts stable spectral signatures from diffusion model denoisers via frequency-controlled perturbations, achieving 99.9% attribution accuracy across eight models and 96.2% under prompt shift.

Focusing on What Matters: Saliency-Harnessing Accurate Routing for Diffusion MoE

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

SharpMoE is a plug-and-play post-training method that uses clean latent features and a trajectory routing loss to enable accurate saliency-based routing in diffusion MoE models for improved visual generation.

VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.

GeoDiff-SAR II: 3D-Driven Foundation Diffusion Models for SAR Generation via Decoupled Control

eess.IV · 2026-05-20 · unverdicted · novelty 7.0

GeoDiff-SAR II proposes a 3D-driven decoupled diffusion framework using GECM and ControlNet on a FLUX backbone for controllable SAR image generation across large viewpoint gaps.

CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

CoReDiT reduces self-attention FLOPs in DiTs by up to 55% via linear-time spatial coherence pruning and neighbor-based reconstruction, delivering 1.33x-1.72x speedups with maintained quality.

ImageAttributionBench: How Far Are We from Generalizable Attribution?

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.

Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.

What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.

Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.

SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

Small VLMs show higher sycophancy (22.3% for 450M model) than larger ones (6.0% for 7B) when scoring image-text alignment on 173k fantasy portraits, quantified via a new Bluffing Coefficient metric.

DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference

cs.AR · 2026-04-10 · unverdicted · novelty 7.0

DRIFT uses resilience analysis, targeted DVFS, and adaptive rollback ABFT to deliver 36% average energy savings or 1.7x speedup in diffusion model inference while preserving generation quality.

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

cs.CV · 2026-03-01 · unverdicted · novelty 7.0

SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

ASTRA: Let Arbitrary Subjects Transform in Video Editing

cs.CV · 2025-10-01 · unverdicted · novelty 7.0

ASTRA is a plug-and-play training-free method for precise multi-subject video editing that uses prompt-guided multimodal alignment and prior-based mask retargeting to avoid attention dilution and boundary issues.

GenHSI: Controllable Generation of Human-Scene Interaction Videos

cs.CV · 2025-06-24 · unverdicted · novelty 7.0

GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D inpainting plus optimization, and then feeding them to pre-trained video diffusion.

In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

cs.CV · 2025-04-29 · unverdicted · novelty 7.0

ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.

An Empirical Study of Validating Synthetic Data for Text-Based Person Retrieval

cs.CV · 2025-03-28 · unverdicted · novelty 7.0

Empirical study of a fully synthetic data generation pipeline for text-based person retrieval that tests its use as a replacement or augmentation for real data across scenarios.

VACE: All-in-One Video Creation and Editing

cs.CV · 2025-03-10 · unverdicted · novelty 7.0

VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

cs.CV · 2024-10-17 · unverdicted · novelty 7.0

Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

cs.CV · 2024-03-08 · unverdicted · novelty 7.0

ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

UniTac: A Unified Multimodal Model for Cross-Sensor Tactile Understanding and Generation

cs.RO · 2026-06-30 · unverdicted · novelty 6.0

UniTac is the first unified multimodal model for cross-sensor tactile understanding and generation, using dual-level representations, two new understanding tasks, and a two-stage training paradigm with sensor-prior sampling to achieve SOTA understanding and realistic cross-sensor generation.

TerraDiT-$\Omega$: Unified Spatial Control for Satellite Image Synthesis with Any Geospatial Primitive

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

TerraDiT-Ω generates satellite imagery from native geospatial primitives via Geometry-Aware Local Attention and outperforms dense and sparse control baselines while boosting downstream GeoAI tasks.

OmniGen-AR: AutoRegressive Any-to-Image Generation

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

OmniGen-AR is a unified autoregressive framework for any-to-image generation that tokenizes text and visual conditions together and uses disentangled causal attention to support tasks like text-to-image, depth-to-image, image editing, and text-to-video while reporting 0.63 on GenEval and 80.02 on VB

ReCache: Learning Budget-Aware Caching Schedules for Diffusion Models via REINFORCE

cs.CV · 2026-06-04 · unverdicted · novelty 6.0

ReCache learns recomputation schedules via policy gradients to maximize quality under a target compute budget for any caching mechanism in diffusion models.

Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders

cs.CV · 2026-05-30 · unverdicted · novelty 6.0

C-GSPN scales 2D spatial propagation to foundation vision encoders via a fast CUDA kernel, compressed blocks, and two-stage distillation, matching ViT performance with 15% fewer parameters and 4x block speedup at 2K resolution.

citing papers explorer

Showing 50 of 83 citing papers after filters.

Diffusion Model Attribution via Spectral Coupling of Denoiser Responses cs.CV · 2026-06-26 · unverdicted · none · ref 3 · internal anchor
SDS extracts stable spectral signatures from diffusion model denoisers via frequency-controlled perturbations, achieving 99.9% attribution accuracy across eight models and 96.2% under prompt shift.
Focusing on What Matters: Saliency-Harnessing Accurate Routing for Diffusion MoE cs.CV · 2026-06-25 · unverdicted · none · ref 3 · internal anchor
SharpMoE is a plug-and-play post-training method that uses clean latent features and a trajectory routing loss to enable accurate saliency-based routing in diffusion MoE models for improved visual generation.
VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation cs.CV · 2026-05-22 · unverdicted · none · ref 4 · internal anchor
VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
GeoDiff-SAR II: 3D-Driven Foundation Diffusion Models for SAR Generation via Decoupled Control eess.IV · 2026-05-20 · unverdicted · none · ref 39 · internal anchor
GeoDiff-SAR II proposes a 3D-driven decoupled diffusion framework using GECM and ControlNet on a FLUX backbone for controllable SAR image generation across large viewpoint gaps.
CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers cs.CV · 2026-05-13 · unverdicted · none · ref 6 · internal anchor
CoReDiT reduces self-attention FLOPs in DiTs by up to 55% via linear-time spatial coherence pruning and neighbor-based reconstruction, delivering 1.33x-1.72x speedups with maintained quality.
ImageAttributionBench: How Far Are We from Generalizable Attribution? cs.CV · 2026-05-13 · unverdicted · none · ref 11 · internal anchor
ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm cs.CV · 2026-05-12 · unverdicted · none · ref 55 · 2 links · internal anchor
Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.
What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers cs.CV · 2026-05-11 · unverdicted · none · ref 5 · internal anchor
A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.
Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models cs.CV · 2026-05-07 · unverdicted · none · ref 6 · internal anchor
ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.
SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters cs.CV · 2026-04-27 · unverdicted · none · ref 1 · internal anchor
Small VLMs show higher sycophancy (22.3% for 450M model) than larger ones (6.0% for 7B) when scoring image-text alignment on 173k fantasy portraits, quantified via a new Bluffing Coefficient metric.
DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference cs.AR · 2026-04-10 · unverdicted · none · ref 30 · internal anchor
DRIFT uses resilience analysis, targeted DVFS, and adaptive rollback ABFT to deliver 36% average energy savings or 1.7x speedup in diffusion model inference while preserving generation quality.
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards cs.CV · 2026-03-01 · unverdicted · none · ref 9 · internal anchor
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
ASTRA: Let Arbitrary Subjects Transform in Video Editing cs.CV · 2025-10-01 · unverdicted · none · ref 3 · internal anchor
ASTRA is a plug-and-play training-free method for precise multi-subject video editing that uses prompt-guided multimodal alignment and prior-based mask retargeting to avoid attention dilution and boundary issues.
GenHSI: Controllable Generation of Human-Scene Interaction Videos cs.CV · 2025-06-24 · unverdicted · none · ref 12 · internal anchor
GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D inpainting plus optimization, and then feeding them to pre-trained video diffusion.
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer cs.CV · 2025-04-29 · unverdicted · none · ref 21 · internal anchor
ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.
An Empirical Study of Validating Synthetic Data for Text-Based Person Retrieval cs.CV · 2025-03-28 · unverdicted · none · ref 7 · internal anchor
Empirical study of a fully synthetic data generation pipeline for text-based person retrieval that tests its use as a replacement or augmentation for real data across scenarios.
VACE: All-in-One Video Creation and Editing cs.CV · 2025-03-10 · unverdicted · none · ref 7 · internal anchor
VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation cs.CV · 2024-10-17 · unverdicted · none · ref 9 · internal anchor
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment cs.CV · 2024-03-08 · unverdicted · none · ref 12 · internal anchor
ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.
UniTac: A Unified Multimodal Model for Cross-Sensor Tactile Understanding and Generation cs.RO · 2026-06-30 · unverdicted · none · ref 4 · internal anchor
UniTac is the first unified multimodal model for cross-sensor tactile understanding and generation, using dual-level representations, two new understanding tasks, and a two-stage training paradigm with sensor-prior sampling to achieve SOTA understanding and realistic cross-sensor generation.
TerraDiT-$\Omega$: Unified Spatial Control for Satellite Image Synthesis with Any Geospatial Primitive cs.CV · 2026-06-30 · unverdicted · none · ref 7 · internal anchor
TerraDiT-Ω generates satellite imagery from native geospatial primitives via Geometry-Aware Local Attention and outperforms dense and sparse control baselines while boosting downstream GeoAI tasks.
OmniGen-AR: AutoRegressive Any-to-Image Generation cs.CV · 2026-06-08 · unverdicted · none · ref 9 · internal anchor
OmniGen-AR is a unified autoregressive framework for any-to-image generation that tokenizes text and visual conditions together and uses disentangled causal attention to support tasks like text-to-image, depth-to-image, image editing, and text-to-video while reporting 0.63 on GenEval and 80.02 on VB
ReCache: Learning Budget-Aware Caching Schedules for Diffusion Models via REINFORCE cs.CV · 2026-06-04 · unverdicted · none · ref 9 · internal anchor
ReCache learns recomputation schedules via policy gradients to maximize quality under a target compute budget for any caching mechanism in diffusion models.
Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders cs.CV · 2026-05-30 · unverdicted · none · ref 122 · internal anchor
C-GSPN scales 2D spatial propagation to foundation vision encoders via a fast CUDA kernel, compressed blocks, and two-stage distillation, matching ViT performance with 15% fewer parameters and 4x block speedup at 2K resolution.
Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization cs.CV · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
GCPO performs per-token credit assignment in discrete policy optimization by setting token advantages proportional to the difference in model predictions under positive versus negative prompts, outperforming GRPO and DAPO on text-to-image and chain-of-thought tasks.
Channel-wise Vector Quantization cs.CV · 2026-05-25 · unverdicted · none · ref 7 · internal anchor
CVQ replaces patch-wise vector quantization with channel-wise quantization of feature maps, enabling a next-channel autoregressive model that reports 100% codebook utilization and text-to-image scores of DPG 86.7 and GenEval 0.79.
SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models cs.CV · 2026-05-22 · unverdicted · none · ref 12 · internal anchor
SCOPE adds per-pixel action conditioning to pretrained video diffusion models and releases the CrossFPS multi-game dataset to support cross-game FPS world model simulation with zero-shot transfer.
Rethinking Cross-Layer Information Routing in Diffusion Transformers cs.CV · 2026-05-20 · unverdicted · none · ref 6 · 2 links · internal anchor
DAR replaces residual addition in DiTs with learnable, timestep-adaptive aggregation of sublayer outputs, yielding 2.11 FID improvement on SiT-XL/2 and 8.75x faster convergence on ImageNet 256x256.
DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers cs.CV · 2026-05-16 · unverdicted · none · ref 9 · internal anchor
DiRotQ uses PCA-based rotation-aware activation quantization combined with GPTQ to achieve better FID and PSNR in 4-bit diffusion transformers than prior methods like SVDQuant.
Controlla: Learning Controllability via Graph-Constrained Latent Geometry cs.CV · 2026-05-15 · unverdicted · none · ref 6 · internal anchor
Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport to enforce consistent attribute trajectories while preserving reference identity.
RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations cs.CV · 2026-05-15 · unverdicted · none · ref 7 · internal anchor
RaPD enables resolution-agnostic image generation by diffusing in a semantics-enriched continuous Neural Image Field latent space using semantic guidance and a coordinate-queried attention renderer.
DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer cs.CV · 2026-05-15 · unverdicted · none · ref 10 · internal anchor
DreamSR uses a dual-branch MM-ControlNet with patch-level and global prompts plus a receptive-field enhancement training strategy in a diffusion transformer to reduce over-generation and improve local texture details in ultra-high-resolution super-resolution.
L2P: Unlocking Latent Potential for Pixel Generation cs.CV · 2026-05-12 · unverdicted · none · ref 4 · internal anchor
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition cs.CV · 2026-05-11 · unverdicted · none · ref 4 · 2 links · internal anchor
Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.
The two clocks and the innovation window: When and how generative models learn rules cs.LG · 2026-05-11 · unverdicted · none · ref 37 · internal anchor
Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
When Do Diffusion Models learn to Generate Multiple Objects? cs.CV · 2026-04-30 · unverdicted · none · ref 4 · internal anchor
Using the mosaic controlled dataset framework, experiments show scene complexity dominates over concept imbalance in diffusion model failures for multi-object generation, with counting especially hard in low-data regimes and compositional generalization collapsing under held-out combinations.
Leveraging Verifier-Based Reinforcement Learning in Image Editing cs.CV · 2026-04-30 · unverdicted · none · ref 11 · 2 links · internal anchor
Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.
SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness cs.CV · 2026-04-29 · unverdicted · none · ref 5 · internal anchor
SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents cs.CV · 2026-04-28 · unverdicted · none · ref 14 · internal anchor
A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.
EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents cs.CV · 2026-04-19 · unverdicted · none · ref 4 · internal anchor
EmbodiedHead introduces a Rectified-Flow Diffusion Transformer with differentiable renderer and single-stream listening-speaking conditioning to achieve real-time high-fidelity conversational avatars.
Generative Refinement Networks for Visual Synthesis cs.CV · 2026-04-14 · unverdicted · none · ref 9 · internal anchor
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
Evolutionary Token-Level Prompt Optimization for Diffusion Models cs.AI · 2026-04-10 · unverdicted · none · ref 14 · internal anchor
A genetic algorithm evolves CLIP token vectors to optimize aesthetic quality and prompt alignment in diffusion models, outperforming Promptist and random search by up to 23.93% on a combined fitness score.
From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation cs.LG · 2026-03-12 · unverdicted · none · ref 2 · internal anchor
EG-GRPO improves autoregressive text-to-image models by reallocating RL updates according to token entropy, excluding low-entropy tokens from reward signals while adding entropy bonuses to high-entropy ones, yielding state-of-the-art results on standard benchmarks.
TIQA: Human-Aligned Perceptual Text Quality Assessment in Generated Images cs.CV · 2026-03-07 · unverdicted · none · ref 15 · internal anchor
TIQA introduces datasets and a model that predict human perceptual quality of rendered text in AI images, achieving PLCC 0.942 on crops and improving selected image text quality by 0.36 MOS.
DriveLaW:Unifying Planning and Video Generation in a Latent Driving World cs.CV · 2025-12-29 · unverdicted · none · ref 11 · internal anchor
DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.
EmoCtrl: Controllable Emotional Image Content Generation cs.CV · 2025-12-27 · unverdicted · none · ref 5 · internal anchor
EmoCtrl generates images faithful to content prompts while expressing target emotions via textual/visual enhancement modules and emotion-driven preference optimization.
Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model cs.CV · 2025-11-30 · unverdicted · none · ref 56 · internal anchor
Lotus-2 is a two-stage deterministic adaptation of diffusion priors that achieves state-of-the-art monocular depth estimation with only 59K training samples.
Algebraic Language Models for Inverse Design of Metamaterials via Diffusion Transformers cs.CE · 2025-07-21 · unverdicted · none · ref 60 · internal anchor
DiffuMeta uses diffusion transformers and algebraic language representations to generate diverse 3D shell metamaterials with targeted stress-strain responses under large deformations including buckling and contact.
MAGI-1: Autoregressive Video Generation at Scale cs.CV · 2025-05-19 · unverdicted · none · ref 6 · internal anchor
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation cs.CV · 2025-05-08 · unverdicted · none · ref 9 · internal anchor
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer