{"total":20,"items":[{"citing_arxiv_id":"2605.27235","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale","primary_cat":"cs.CV","submitted_at":"2026-05-26T16:16:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Presents MRT, a 20B-parameter masked region diffusion model unifying text-to-layers, image-to-layers, and layers-to-layers tasks with an overflow-aware canvas layer for complete editable outputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24114","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"COSY: Compositional 3DGS Synthesis for Disentangled Human Head Editing","primary_cat":"cs.CV","submitted_at":"2026-05-22T18:22:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"COSY uses independent per-component 3DGS generators plus context tokens to achieve disentangled semantic editing of human heads without masks or classifiers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10730","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Qwen-Image-2.0 Technical Report","primary_cat":"cs.CV","submitted_at":"2026-05-11T15:34:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior 
versions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27505","ref_index":18,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Leveraging Verifier-Based Reinforcement Learning in Image Editing","primary_cat":"cs.CV","submitted_at":"2026-04-30T06:54:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"delivers substantial gains to SOTA editors like FLUX.1-kontext and Qwen-Image-Edit, demonstrating the real-world effectiveness of our verifier-based RL framework. 2 Related Works 2.1 Reward model for generative models Driven by advances in Large Language Models (LLMs), many Reward Models (RMs) are now constructed directly upon them as shown in Tab. 1 [18, 39, 43, 61]. In terms of modeling architecture, two dominant approaches have emerged, including regression-based [34, 55] and generative-based [19, 26, 61]. 
The regression-based methods add a regression head for scoring, while the generative methods leverage the model's own generative abilities for assessment and are generally considered more effective at harnessing the base model's"},{"citing_arxiv_id":"2604.25427","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Systematic Post-Train Framework for Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-28T09:34:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"• Autoregressive distillation enables efficient deployment: AD transfers the capability of the post-trained generator into a causal architecture, improving inference efficiency while preserving key generation abilities. 2 Related Work 2.1 Prompt Enhancement for Visual Generation PE for image generation Prompt enhancement (PE) has become essential for improving text-to-image (T2I) generation quality and alignment [6, 14]. While early approaches relied on manual refinement, recent methods leverage LMs for automated prompt optimization. Promptist [15] combines supervised fine-tuning with RL to optimize prompts for aesthetic appeal while preserving user intent. 
NeuroPrompts [16] introduces constrained text decoding for automatic prompt enhancement with user-controllable styles."},{"citing_arxiv_id":"2604.19902","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings","primary_cat":"cs.CV","submitted_at":"2026-04-21T18:25:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"In our experiments, we find this strategy is more efficient, achieving performance parity with native unified architectures-such as Transfusion [48] and BAGEL [5]-while requiring only∼30% of the computational budget typically associated with training these models from scratch. Finally, keeping the same dropout settings and following the established alignment pipelines [29], we refine the model via Supervised Fine-Tuning (SFT) on a specialized high-quality dataset, followed by Reinforcement Learning with Human Feedback (RLHF). This final stage aligns the model closer to its theoretical upper bound, yielding a network that significantly outperforms the baselines. 
Discussion: Necessity of VAEs.Recent advances, such as RAE [47], have demonstrated that discriminative"},{"citing_arxiv_id":"2604.14148","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Seedance 2.0: Advancing Video Generation for World Complexity","primary_cat":"cs.CV","submitted_at":"2026-04-15T17:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Seedance 2.0 is an updated multi-modal model for generating 4-15 second audio-video content at 480p/720p with support for up to 3 video, 9 image, and 3 audio references.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.00918","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards","primary_cat":"cs.CV","submitted_at":"2026-03-01T04:39:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.02493","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PixelGen: Improving Pixel Diffusion with Perceptual Supervision","primary_cat":"cs.CV","submitted_at":"2026-02-02T18:59:42+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without 
VAEs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13507","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model","primary_cat":"cs.CV","submitted_at":"2025-12-15T16:36:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Seedance 1.5 pro is a joint audio-visual generation model achieving high synchronization via dual-branch diffusion transformer and post-training optimizations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.07584","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LongCat-Image Technical Report","primary_cat":"cs.CV","submitted_at":"2025-12-08T14:26:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.19365","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation","primary_cat":"cs.CV","submitted_at":"2025-11-24T17:59:06+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while closing the gap to latent diffusion 
methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.20427","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Seedream 4.0: Toward Next-generation Multimodal Image Generation","primary_cat":"cs.CV","submitted_at":"2025-09-24T17:59:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Seedream 4.0 unifies text-to-image synthesis, image editing, and multi-image composition in an efficient diffusion transformer pretrained on billions of pairs and accelerated to 1.8 seconds for 2K output.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"artificialanalysis. https://artificialanalysis.ai/text-to-image/arena?tab=Leaderboard, 2025. [2] dreamina. dreamina. https://dreamina.capcut.com/, 2025. [3] Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025. [4] Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native chinese-english bilingual image generation foundation model.arXiv preprint arXiv:2503.07703, 2025. [5] Google. gemini2.5. https://deepmind.google/models/gemini/image/, 2025. 
[6] Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang,"},{"citing_arxiv_id":"2509.01986","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing","primary_cat":"cs.CV","submitted_at":"2025-09-02T06:06:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Rebalancing designer-painter roles by assigning design to the understanding module via the new DIM dataset yields SOTA image editing performance with a 4.6B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.02324","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Qwen-Image Technical Report","primary_cat":"cs.CV","submitted_at":"2025-08-04T11:49:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive encoding for editing consistency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346, 2025. Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231-1237, 2013. Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, and Jie Jiang. 
X-omni: Reinforcement learning makes discrete autoregressive image generative models great again. CoRR, abs/2507.22058, 2025. Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for"},{"citing_arxiv_id":"2506.09113","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Seedance 1.0: Exploring the Boundaries of Video Generation Models","primary_cat":"cs.CV","submitted_at":"2025-06-10T17:56:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"gives precise and high-quality rephrased results in video caption format, consistent with DiT training. 3 Data The performance of video generation models is inextricably linked to the scale, diversity, and quality of the training data. While our broader training corpus incorporates both video and image datasets, with image data preparation following methodologies similar to Seedream [8], this section specifically details our rigorous approach to curating video data. We develop a systematic data processing workflow, illustrated in figure 3, to transform vast, heterogeneous raw video collections into a refined, high-quality, diverse, and safe dataset for training robust video generation models. 
This workflow is deployed as a robust, automated system optimized"},{"citing_arxiv_id":"2505.07818","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DanceGRPO: Unleashing GRPO on Visual Generation","primary_cat":"cs.CV","submitted_at":"2025-05-12T17:59:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Recent advances in generative models-particularly diffusion models [1-4] and rectified flows [5-7]-have transformed visual content creation by improving output quality and versatility in image and video generation. While pretraining establishes foundational data distributions, integrating human feedback during training proves critical for aligning outputs with human preferences and aesthetic criteria [8]. Existing methods face notable limitations: ReFL [9-11] relies on differentiable reward models, which introduce VRAM inefficiency in video generation and require several extensive engineering efforts, while DPO variants (Diffusion-DPO [12, 13], Flow-DPO [14], OnlineVPO [15]) achieve only marginal visual quality improvements. 
Reinforcement learning"},{"citing_arxiv_id":"2505.05472","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation","primary_cat":"cs.CV","submitted_at":"2025-05-08T17:58:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.05470","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Flow-GRPO: Training Flow Matching Models via Online RL","primary_cat":"cs.CV","submitted_at":"2025-05-08T17:58:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the ability to place accurate and coherent text inside the generated images is crucial for T2I models. In our settings, we define an text rendering task, where each prompt follows the template\"A sign that says \"text\". Specifically, the placeholder\"text\"is the exact string that should appear in the image. We use GPT4o to produce 20K training prompts and 1K test prompts. 
Following [58], we measure text fidelity with the reward r = max(1 − Ne/Nref, 0), where Ne is the minimum edit distance between the rendered text and the target text and Nref is the number of characters inside the quotation marks in the prompt. This reward also serves as our metric of text accuracy. Human Preference Alignment [19]. This task aims to align T2I models with human preferences."},{"citing_arxiv_id":"2504.11346","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Seedream 3.0 Technical Report","primary_cat":"cs.CV","submitted_at":"2025-04-15T16:19:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Seedream 3.0 improves bilingual image generation through doubled defect-aware data, mixed-resolution training, cross-modality RoPE, representation alignment, aesthetic SFT, VLM reward modeling, and importance-aware timestep sampling for 4-8x faster inference at up to 2K resolution.","context_count":1,"top_context_role":"extension","top_context_polarity":"extend","context_text":"The retrieval-enhanced framework dynamically optimizes the dataset through the following methods: (1) injecting expert knowledge via targeted concept retrieval; (2) performing distribution calibration through similarity-weighted sampling; (3) utilizing retrieved neighboring pairs for cross-modal enhancement. 2.2 Model Pre-training 2.2.1 Model Architectures Our core architecture design inherits from Seedream 2.0 [4], which adopts an MMDiT [3] to process the image and text tokens and capture the relationship between the two modalities. We have increased the total parameters in our base model, and introduced several improvements in Seedream 3.0, leading to enhanced scalability, generalizability, and visual-language alignment. Mixed-resolution Training. Transformers [23] natively supports variable lengths of tokens as input, which also"}],"limit":50,"offset":0}