{"total":14,"items":[{"citing_arxiv_id":"2605.17916","ref_index":41,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis","primary_cat":"cs.CV","submitted_at":"2026-05-18T06:25:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PanoWorld autoregressively generates consistent multi-room 360-degree panoramas for whole-house VR using a floorplan-derived 3D shell as geometric proxy and a dynamic 3DGS cache for spatial memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15857","ref_index":55,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"AHS: Adaptive Head Synthesis via Synthetic Data Augmentations","primary_cat":"cs.CV","submitted_at":"2026-04-17T09:05:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Adaptive Head Synthesis (AHS) employs head-reenacted synthetic data augmentation to enable robust head swapping on full upper-body images without paired training data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14062","ref_index":38,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"OneHOI: Unifying Human-Object Interaction Generation and Editing","primary_cat":"cs.CV","submitted_at":"2026-04-15T16:37:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OneHOI unifies HOI generation and editing in one conditional diffusion transformer using role-aware tokens, structured attention, and joint training on mixed datasets to reach SOTA on both 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13688","ref_index":75,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data","primary_cat":"cs.CV","submitted_at":"2026-04-15T10:10:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BVE framework enables text-guided 3D editing beyond voxel limits by combining self-constructed data, lightweight semantic injection, and annotation-free masking to preserve local invariance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07958","ref_index":38,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks","primary_cat":"cs.CV","submitted_at":"2026-04-09T08:22:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"such supervision during pretraining. By augmenting the edited prompts with comprehensive visual descriptions of the target scene, we establish more robust and informative supervisory signals for learning complex editing behaviors. Paired Image Synthesis. Based on the constructed prompts, we synthesize paired images using text-to-image and image editing models. We adopt Qwen-Image [38] and Qwen-Image-Edit to ensure high visual fidelity and consistency between source and edited images. 
This process results in a collection of paired samples of the form {(x_i^{src}, x_i^{edit}, p_i^{src}, p_i^{edit})}. Data Filtering. To further improve data quality, we utilize a combination of automated and human-in-the-loop filtering. Gemini 3.1 Pro is used to filter samples based on"},{"citing_arxiv_id":"2604.04887","ref_index":60,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes","primary_cat":"cs.CV","submitted_at":"2026-04-06T17:36:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HorizonWeaver enables photorealistic, instruction-driven multi-level editing of complex driving scenes with improved generalization via a new paired dataset, language-guided masks, and joint training losses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03225","ref_index":43,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"VOSR: A Vision-Only Generative Model for Image Super-Resolution","primary_cat":"cs.CV","submitted_at":"2026-04-03T17:50:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a restoration-oriented sampling strategy.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"For each size, we report both a multi-step model and its distilled one-step counterpart, denoted by the suffixes \"-ms\" and \"-os\" (e.g., VOSR-0.5B-ms and VOSR-0.5B-os). 
VOSR-0.5B uses the SD2.1 VAE and, following AdcSR [4], a retrained lightweight decoder for SR to reduce peak inference memory with minimal impact on decoding quality. VOSR-1.4B instead adopts a 16-channel latent VAE [43] to reduce the irreversible information loss of the standard 4-channel compression. For semantic encoding, VOSR-0.5B uses DINOv2-Base and VOSR-1.4B uses DINOv2-Large [25]. At inference, the multi-step model uses 25 sampling steps. Detailed training hyperparameters and architecture configurations are provided in the Appendix. 4.2. Experimental Results"},{"citing_arxiv_id":"2603.14209","ref_index":47,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control","primary_cat":"cs.CV","submitted_at":"2026-03-15T03:55:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ChArtist generates pictorial charts via a Diffusion Transformer using skeleton-based spatial control and reference-image subject control, supported by a new 30,000-triplet dataset and data accuracy metric.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.05449","ref_index":67,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching","primary_cat":"cs.CV","submitted_at":"2026-02-05T08:45:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DisCa replaces heuristic feature caching with a lightweight learnable neural predictor compatible with distillation, achieving 11.8× acceleration on video diffusion transformers with preserved generation 
quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13609","ref_index":33,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Do-Undo Bench: Reversibility for Action Understanding in Image Generation","primary_cat":"cs.CV","submitted_at":"2025-12-15T18:03:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.12982","ref_index":69,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes","primary_cat":"cs.CV","submitted_at":"2025-12-15T04:58:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GAPL learns a compact set of canonical forgery prototypes and applies two-stage LoRA training to build a low-variance feature space that improves generalization across GAN and diffusion generators.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.12675","ref_index":36,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling","primary_cat":"cs.CV","submitted_at":"2025-12-14T12:58:19+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Scone unifies subject understanding and generation in a two-stage trained model to improve both composition and distinction in 
multi-subject image generation, outperforming prior open-source models on new benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.07348","ref_index":85,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition","primary_cat":"cs.CV","submitted_at":"2025-12-08T09:40:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MICo-150K is a new 150K-image dataset with 7 tasks, a De&Re real-image subset, MICo-Bench, and Weighted-Ref-VIEScore metric that improves AI models for generating consistent composites from arbitrary numbers of reference images.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.13285","ref_index":37,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"SkyReels-Text: Fine-Grained Font-Controllable Text Editing for Poster Design","primary_cat":"cs.CV","submitted_at":"2025-11-17T12:02:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SkyReels-Text enables simultaneous fine-grained editing of multiple text regions in posters using arbitrary glyph patches for font control without labels or test-time fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}