PixelPonder: Dynamic patch adaptation for enhanced multi-conditional text-to-image generation

Yanjie Pan, Qingdong He, Zhengkai Jiang, Pengcheng Xu, Chaoyi Wang, Jinlong Peng, Haoxuan Wang, Yun Cao, Zhenye Gan, Mingmin Chi, Bo Peng, Yabiao Wang · 2025 · cs.CV · arXiv 2503.06684

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Recent advances in diffusion-based text-to-image generation have demonstrated promising results through visual condition control. However, existing ControlNet-like methods struggle with compositional visual conditioning, that is, simultaneously preserving semantic fidelity across multiple heterogeneous control signals while maintaining high visual quality. They employ separate control branches that often introduce conflicting guidance during the denoising process, leading to structural distortions and artifacts in generated images. To address this issue, we present PixelPonder, a novel unified control framework that allows effective control of multiple visual conditions under a single control structure. Specifically, we design a patch-level adaptive condition selection mechanism that dynamically prioritizes spatially relevant control signals at the sub-region level, enabling precise local guidance without global interference. Additionally, a time-aware control injection scheme modulates condition influence according to denoising timesteps, progressively transitioning from structural preservation to texture refinement and making full use of control information from different categories to promote more harmonious image generation. Extensive experiments demonstrate that PixelPonder surpasses previous methods across different benchmark datasets, improving spatial alignment accuracy while maintaining high textual semantic consistency.
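
The two ideas named in the abstract, per-patch selection among several conditions and timestep-dependent weighting, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `fuse_conditions`, the softmax-based selection, the linear timestep bias, and the `alpha` parameter are all assumptions made here for illustration, and the abstract does not specify how either mechanism is actually computed.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_conditions(cond_feats, scores, is_structural, t, num_steps, alpha=2.0):
    """Hypothetical per-patch, time-aware fusion of K control signals.

    cond_feats:    (K, P, D) features of K conditions over P patches.
    scores:        (K, P) relevance score of each condition for each patch.
    is_structural: (K,) bool, True for structure-like conditions (assumed
                   labels, e.g. depth or edges), False for texture-like ones.
    t:             current timestep; num_steps - 1 is the noisiest step,
                   0 is the final denoising step.
    alpha:         strength of the timestep bias (illustrative parameter).
    """
    is_structural = np.asarray(is_structural, dtype=bool)
    # progress is 1.0 at the start of denoising and 0.0 at the end.
    progress = t / max(num_steps - 1, 1)
    # Favor structural conditions early and texture conditions late.
    sign = np.where(is_structural, 1.0, -1.0)
    bias = alpha * sign * (2.0 * progress - 1.0)          # shape (K,)
    # Each patch gets its own distribution over the K conditions.
    weights = softmax(scores + bias[:, None], axis=0)     # shape (K, P)
    # Weighted sum over conditions yields one fused feature per patch.
    fused = (weights[:, :, None] * cond_feats).sum(axis=0)  # shape (P, D)
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(2, 4, 3))       # 2 conditions, 4 patches, 3 dims
    scores = rng.normal(size=(2, 4))
    kinds = np.array([True, False])          # condition 0 treated as structural
    for step in (9, 5, 0):
        _, w = fuse_conditions(feats, scores, kinds, t=step, num_steps=10)
        print(step, np.round(w[0], 3))       # weight of condition 0 per patch
```

With equal scores, the structural condition dominates every patch at the first denoising step and the texture condition dominates at the last, which mirrors the "structural preservation to texture refinement" transition described above. A soft weighting is used here rather than a hard per-patch choice purely to keep the sketch simple.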

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

TrioPose: Native Triple-Stream Diffusion Transformers for Pose-Guided Text-to-Image Generation

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

TrioPose proposes a Triple-Stream Pose-Aware DiT with relational bias masks and spatial loss weighting to achieve state-of-the-art pose-guided text-to-image results on multi-person benchmarks such as Human-Art.

AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation

cs.CV · 2026-05-10 · unverdicted · novelty 4.0

AtteConDA adds attention-based conflict suppression to multi-condition diffusion models so that generated driving-scene images retain richer structural cues from the original annotations.

citing papers explorer

Showing 1 of 1 citing paper after filters.

AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation

cs.CV · 2026-05-10 · unverdicted · none · ref 58 · internal anchor
