{"total":12,"items":[{"citing_arxiv_id":"2606.05071","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space","primary_cat":"cs.CV","submitted_at":"2026-06-03T16:30:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InstantRetouch performs efficient high-fidelity language-guided retouching via bilateral grid prediction of affine transforms combined with variational score distillation from diffusion models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17294","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing","primary_cat":"cs.CV","submitted_at":"2026-05-17T07:14:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HierEdit enables efficient 4K image editing via low-resolution proxy localization followed by hierarchical local-window diffusion that reuses unaltered regions as conditioning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10127","ref_index":31,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition","primary_cat":"cs.CV","submitted_at":"2026-05-11T07:40:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art 
methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07457","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement","primary_cat":"cs.CV","submitted_at":"2026-05-08T09:05:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", Wang, J., Wu, C., Xiao, S., Jiang, X., Lian, D., et al.: Editscore: Unlocking online rl for image editing via high-fidelity reward modeling. arXiv preprint arXiv:2509.23909 (2025) [34] Lupascu, M., Stupariu, M.S.: Optimal transport for rectified flow image editing: Unifying inversion-based and direct methods. arXiv preprint arXiv:2508.02363 (2025) 12 [35] Mao, C., Zhang, J., Pan, Y., Jiang, Z., Han, Z., Liu, Y., Zhou, J.: Ace++: Instruction-based image creation and editing via context-aware content filling. arXiv preprint arXiv:2501.02487 (2025) [36] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. 
arXiv preprint arXiv:1301.3781 (2013) [37] Nam, H."},{"citing_arxiv_id":"2604.19406","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HP-Edit: A Human-Preference Post-Training Framework for Image Editing","primary_cat":"cs.CV","submitted_at":"2026-04-21T12:29:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Sample by step, optimize by chunk: Chunk-level grpo for text-to-image generation. arXiv preprint arXiv:2510.21583, 2025. 3 [31] Jian Ma, Xujie Zhu, Zihao Pan, Qirong Peng, Xu Guo, Chen Chen, and Haonan Lu. X2edit: Revisiting arbitrary-instruction image editing through self-constructed data and task-aware representation learning. ICCV, 2025. 2, 6, 7, 8, 1 [32] Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. Ace++: Instruction-based image creation and editing via context-aware content filling. arXiv preprint arXiv:2501.02487, 2025. 2 [33] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. 
In International conference on machine learning, pages 8162-8171."},{"citing_arxiv_id":"2604.17195","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior","primary_cat":"cs.CV","submitted_at":"2026-04-19T01:51:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[22] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2019. 6 [23] Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, et al. Step-video-t2v technical report: The practice, challenges, and future of video foundation model. arXiv preprint arXiv:2502.10248, 2025. 3 [24] Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. Ace++: Instruction-based image creation and editing via context-aware content filling. arXiv preprint arXiv:2501.02487, 2025. 3 [25] Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, and Yuyin Zhou. 
Story-adapter: A training-free iterative framework for long story visualization."},{"citing_arxiv_id":"2603.02210","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images","primary_cat":"cs.CV","submitted_at":"2026-03-02T18:59:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HiFi-Inpaint delivers state-of-the-art detail-preserving human-product images by adding Shared Enhancement Attention and Detail-Aware Loss to reference-based inpainting on a new 40K dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.18871","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniGen2: Towards Instruction-Aligned Multimodal Generation","primary_cat":"cs.CV","submitted_at":"2025-06-23T17:38:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OmniGen2 introduces a unified generative model with two distinct decoding pathways and a decoupled image tokenizer that achieves competitive results on text-to-image and editing benchmarks plus state-of-the-art consistency among open-source models on the new OmniContext benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.20690","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer","primary_cat":"cs.CV","submitted_at":"2025-04-29T12:14:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ICEdit achieves state-of-the-art instructional 
image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.17761","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Step1X-Edit: A Practical Framework for General Image Editing","primary_cat":"cs.CV","submitted_at":"2025-04-24T17:25:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models on the new GEdit-Bench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Flash [15], and SeedEdit/Doubao [50], have pushed the frontier of instruction-based image editing. These systems leverage large-scale vision-language modeling capabilities to perform high-fidelity edits across diverse scenarios. However, their closed nature limits reproducibility and transparency. In parallel, open-source efforts like OmniGen [61] and ACE++ [34] aim to replicate similar capabilities but still fall short in terms of overall generalization, edit accuracy, and the quality of generated images. In this work, we aim to narrow the performance gap between open-source and closed-source editing systems, while also pushing the boundary of practical and user-grounded editing evaluation. 
Although researchers have open-sourced editing datasets like AnyEdit [64] and OmniEdit [59], we argue that"},{"citing_arxiv_id":"2503.20314","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Wan: Open and Advanced Large-Scale Video Generative Models","primary_cat":"cs.CV","submitted_at":"2025-03-26T08:25:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Wan releases open 1.3B and 14B video diffusion models claiming superior performance over open-source and commercial baselines across multiple tasks with consumer-grade efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.07598","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VACE: All-in-One Video Creation and Editing","primary_cat":"cs.CV","submitted_at":"2025-03-10T17:57:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}