VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a restoration-oriented sampling strategy.
Qwen-image technical report
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 8roles
method 1polarities
use method 1representative citing papers
ChArtist generates pictorial charts via a Diffusion Transformer using skeleton-based spatial control and reference-image subject control, supported by a new 30,000-triplet dataset and data accuracy metric.
DisCa replaces heuristic feature caching with a lightweight learnable neural predictor compatible with distillation, achieving 11.8× acceleration on video diffusion transformers with preserved generation quality.
Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.
MICo-150K is a new 150K-image dataset with 7 tasks, a De&Re real-image subset, MICo-Bench, and Weighted-Ref-VIEScore metric that improves AI models for generating consistent composites from arbitrary numbers of reference images.
HorizonWeaver enables photorealistic, instruction-driven multi-level editing of complex driving scenes with improved generalization via a new paired dataset, language-guided masks, and joint training losses.
GAPL learns a compact set of canonical forgery prototypes and applies two-stage LoRA training to build a low-variance feature space that improves generalization across GAN and diffusion generators.
Scone unifies subject understanding and generation in a two-stage trained model to improve both composition and distinction in multi-subject image generation, outperforming prior open-source models on new benchmarks.
citing papers explorer
-
VOSR: A Vision-Only Generative Model for Image Super-Resolution
VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a restoration-oriented sampling strategy.
-
ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control
ChArtist generates pictorial charts via a Diffusion Transformer using skeleton-based spatial control and reference-image subject control, supported by a new 30,000-triplet dataset and data accuracy metric.
-
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
DisCa replaces heuristic feature caching with a lightweight learnable neural predictor compatible with distillation, achieving 11.8× acceleration on video diffusion transformers with preserved generation quality.
-
Do-Undo Bench: Reversibility for Action Understanding in Image Generation
Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.
-
MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition
MICo-150K is a new 150K-image dataset with 7 tasks, a De&Re real-image subset, MICo-Bench, and Weighted-Ref-VIEScore metric that improves AI models for generating consistent composites from arbitrary numbers of reference images.
-
HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes
HorizonWeaver enables photorealistic, instruction-driven multi-level editing of complex driving scenes with improved generalization via a new paired dataset, language-guided masks, and joint training losses.
-
Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes
GAPL learns a compact set of canonical forgery prototypes and applies two-stage LoRA training to build a low-variance feature space that improves generalization across GAN and diffusion generators.
-
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
Scone unifies subject understanding and generation in a two-stage trained model to improve both composition and distinction in multi-subject image generation, outperforming prior open-source models on new benchmarks.