OneHOI unifies HOI generation and editing in one conditional diffusion transformer using role-aware tokens, structured attention, and joint training on mixed datasets to reach SOTA on both tasks.
hub
Qwen-image technical report
14 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 14
roles
method 2
polarities
use method 2
representative citing papers
VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a restoration-oriented sampling strategy.
ChArtist generates pictorial charts via a Diffusion Transformer using skeleton-based spatial control and reference-image subject control, supported by a new 30,000-triplet dataset and data accuracy metric.
DisCa replaces heuristic feature caching with a lightweight learnable neural predictor compatible with distillation, achieving 11.8× acceleration on video diffusion transformers with preserved generation quality.
Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.
MICo-150K is a new 150K-image dataset with 7 tasks, a De&Re real-image subset, MICo-Bench, and a Weighted-Ref-VIEScore metric that improves AI models for generating consistent composites from arbitrary numbers of reference images.
PanoWorld autoregressively generates consistent multi-room 360-degree panoramas for whole-house VR using a floorplan-derived 3D shell as a geometric proxy and a dynamic 3DGS cache for spatial memory.
BVE framework enables text-guided 3D editing beyond voxel limits by combining self-constructed data, lightweight semantic injection, and annotation-free masking to preserve local invariance.
ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
HorizonWeaver enables photorealistic, instruction-driven multi-level editing of complex driving scenes with improved generalization via a new paired dataset, language-guided masks, and joint training losses.
GAPL learns a compact set of canonical forgery prototypes and applies two-stage LoRA training to build a low-variance feature space that improves generalization across GAN and diffusion generators.
SkyReels-Text enables simultaneous fine-grained editing of multiple text regions in posters using arbitrary glyph patches for font control without labels or test-time fine-tuning.
Adaptive Head Synthesis (AHS) employs head-reenacted synthetic data augmentation to enable robust head swapping on full upper-body images without paired training data.