{"total":40,"items":[{"citing_arxiv_id":"2605.21272","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset","primary_cat":"cs.CV","submitted_at":"2026-05-20T15:04:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems, 36:27092-27112, 2023. [23] Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025. [24] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86-92, 2021. [25] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. 
Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132-"},{"citing_arxiv_id":"2605.20807","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction","primary_cat":"cs.CV","submitted_at":"2026-05-20T06:58:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A two-stage method predicts an intermediate Canny map for structure then renders the image conditioned on appearance and structure, paired with a 100k text-aware dataset, to improve detail preservation in subject-driven generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17294","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing","primary_cat":"cs.CV","submitted_at":"2026-05-17T07:14:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HierEdit enables efficient 4K image editing via low-resolution proxy localization followed by hierarchical local-window diffusion that reuses unaltered regions as conditioning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16080","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation","primary_cat":"cs.CV","submitted_at":"2026-05-15T15:43:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReAlign distills LLM-generated reasoning texts into a lightweight AIGI forgery detector via 
contrastive image-text alignment to improve generalization on complex forgeries.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14876","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-14T14:22:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13565","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Qwen-Image-VAE-2.0 Technical Report","primary_cat":"cs.CV","submitted_at":"2026-05-13T14:04:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12967","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ImageAttributionBench: How Far Are We from Generalizable Attribution?","primary_cat":"cs.CV","submitted_at":"2026-05-13T04:01:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to 
generalize to semantically disjoint domains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"hybrid architecture with an AR generator and a DiT decoder. For the commercial models, we focus on representative models with state-of-the-art performance and broad industry adoption. These can be divided into two categories: pure text-to-image generation models, including DALL·E 3 [4], Midjourney [49] (V5.2 and V6.0), Kling-image-V1 [40], Ideogram- generate-V-1-TURBO [34], and Doubao-seedream-3.0-t2i [22]; and models integrated into widely deployed large multimodal frameworks, such as GPT-4o [1], GPT-Image-1 [52], GPT-Image-1.5 [53], Gemini-2.0-Flash [23], Gemini-2.5-Flash-Image [24], Gemini-3-Pro-Image [25], Doubao-seedream- 5.0-Lite [8], and Grok3-image [26]. Image GenerationFinally, we use captions generated in Section 3.2 to guide these generative"},{"citing_arxiv_id":"2605.12500","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Nevertheless, these results highlight the effectiveness of our native multimodal framework in sustaining trustworthy and capable agent behavior across diverse real-world scenarios. 
17 Model # Params Single Object Two Object Counting Colors Position Attribute BindingOverall↑ Closed-source Models GPT-Image-1 [100] - 0.99 0.92 0.85 0.92 0.75 0.61 0.84 Seedream 4.0 [111] - 0.99 0.92 0.72 0.91 0.76 0.74 0.84 Seedream 3.0 [38] - 0.99 0.96 0.91 0.93 0.47 0.80 0.84 Open-source Models SenseNova-U1 8BA3B 1.00 0.96 0.89 0.91 0.92 0.77 0.91 SenseNova-U1 8B 1.00 0.96 0.92 0.92 0.91 0.76 0.91 Tuna [84] 7B 1.00 0.97 0.81 0.91 0.88 0.83 0.90 OneCAT [66] 9BA3B 1.00 0.96 0.84 0.94 0.84 0.80 0.90 NEO-unify [112] 8B 1.00 0.96 0.90 0.91 0.91 0.77 0.90 Mogao [74] 7B 1.00 0.97 0.83 0.93 0."},{"citing_arxiv_id":"2605.12271","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm","primary_cat":"cs.CV","submitted_at":"2026-05-12T15:35:34+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"All image models are evaluated with the same Qwen3-VL-32B judge, dual-dimension scoring ( min(Q,A)×10 ), and mean-of-4-samples aggregation. HunyuanVideo generates full videos; the reported score applies the same VLM judge directly to the generated video. Best per-category in bold. 
Model Vis.Text Inl.Color Inl.VisRef Counting Style Pose Sketch Overall GPT Image 2 [63] 78.392.475.891.8 60.3 20.0 34.0 64.7 Seedream 5.0 Lite [64] 79.068.7 74.7 88.8 48.7 16.8 32.4 58.4 Nano Banana 2 [65] 59.2 69.778.067.1 44.7 19.1 22.3 51.4 V2V-Zero (ours) 34.8 76.9 42.8 24.0 20.3 13.3 16.6 32.7 HunyuanVideo-1.5 (video) [4] 17.7 32.5 25.7 19.2 17.3 12.4 16.3 20.2 Qwen-Image-Edit-2511 [1] 15.7 16.9 34.2 23.2 17.1 13.4 17.2 19.7 BAGEL-7B-MoT [48] 43."},{"citing_arxiv_id":"2605.12013","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"L2P: Unlocking Latent Potential for Pixel Generation","primary_cat":"cs.CV","submitted_at":"2026-05-12T12:01:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10730","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Qwen-Image-2.0 Technical Report","primary_cat":"cs.CV","submitted_at":"2026-05-11T15:34:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10723","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent 
State","primary_cat":"cs.CV","submitted_at":"2026-05-11T15:31:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AllocMV uses a global planner to build a structured persistent state then solves a Multiple-Choice Knapsack Problem to allocate High-Gen, Mid-Gen, and Reuse compute branches, achieving an optimal Cost-Quality Ratio under budget and rhythmic constraints.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08962","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production","primary_cat":"cs.DC","submitted_at":"2026-05-09T13:59:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ShardingImages Texts Audios Reduce buﬀer (b) Balance for LLM CP Encoder rank 1 Encoder rank 2 Sharding & GPU 1 GPU 2 LLM SP rank 1 LLM SP rank 2 All-to-All Figure 12Balancing encoder-LLM resharding for triple- modality. Encoder and LLM ranks are colocated on 2 GPUs. across LLM CP ranks, while intra-sample sharding with fixed CP degree causes redundant communica- tion for short samples [15, 48]. We thus only shard long samples (e.g.,ImageandTextin the figure) and integrate hybrid data parallelism [15] to process sharded samples with CP and unsharded ones with DP. Symmetric Dispatching.After sharding, we dis- patch the sharded embeddings to LLM ranks under the principle ofsymmetry. 
As shown in Figure 12(a), we use a symmetric all-to-all operation for Ulysses"},{"citing_arxiv_id":"2605.04128","ref_index":32,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation","primary_cat":"cs.GR","submitted_at":"2026-05-05T15:49:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Table 7Quantitative evaluation results on LongText-Bench [33]. Model LongText-Bench-EN↑LongText-Bench-ZH↑ Janus-Pro [21] 0.019 0.006 BLIP3-o [17] 0.021 0.018 HiDream-I1-Full [10] 0.543 0.024 Kolors 2.0 [72] 0.258 0.329 FLUX.1 [Dev] [6] 0.607 0.005 OmniGen2 [86] 0.561 0.059 BAGEL [28] 0.373 0.310 GPT Image 1 [High] [61] 0.956 0.619 X-Omni [33] 0.900 0.814 Seedream 3.0 [32] 0.896 0.878 Z-Image-Turbo [76] 0.917 0.926 Z-Image [76] 0.935 0.936 Qwen-Image [85] 0.943 0.946 JoyAI-Image 0.963 0.963 CVTG-2k.On the CVTG-2K benchmark shown in Table 8, our model demonstrates strong text rendering capability, particularly in terms of word accuracy and structural consistency. 
As shown in Table 8, JoyAI-Image achieves the highest Word Accuracy of 0."},{"citing_arxiv_id":"2605.02641","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE","primary_cat":"cs.CV","submitted_at":"2026-05-04T14:26:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"SD3-Medium [70] 2B 0.99 0.94 0.72 0.89 0.33 0.60 0.74 HiDream-11-Full [71] 17B 1.00 0.98 0.79 0.91 0.60 0.72 0.83 Janus-Pro-7B [72] 7B 0.99 0.89 0.59 0.90 0.79 0.66 0.80 Mogao-7B [20] 7B 1.00 0.97 0.83 0.93 0.84 0.80 0.89 Mamoda2 [24] 8B + 3B + 2B 1.00 0.97 0.63 0.89 0.90 0.82 0.87 Qwen-Image [73] 7B + 20B 0.99 0.92 0.89 0.88 0.76 0.77 0.87 Seedream 3.0 [74] - 0.99 0.96 0.91 0.93 0.47 0.80 0.84 Video Models Wan2.1-14B [4] 14B 0.88 0.55 0.51 0.71 0.16 0.25 0.51 HunyuanVideo [3] 13B 0.95 0.77 0.34 0.77 0.43 0.44 0.61 HunyuanVideo† [3] 13B 0.96 0.88 0.43 0.84 0.66 0.47 0.71 VInO [17] 13B 0.95 0.72 0.33 0.80 0.21 0.49 0.59 VInO† [17] 13B 0.97 0.88 0.52 0.88 0.65 0.62 0.75 Mamoda2.5 25B-A3B 0.99 0.95 0.81 0."},{"citing_arxiv_id":"2604.27505","ref_index":16,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Leveraging Verifier-Based Reinforcement Learning in Image Editing","primary_cat":"cs.CV","submitted_at":"2026-04-30T06:54:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as 
FLUX.1-kontext.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"based generative models, text-to-image (T2I) generation [2,6,10-13,15,16,18,25,28,32,40,42,44,46,60,70], image editing [5, 8, 35, 43, 50, 51, 56, 63], and video generation [3, 4, 7, 9, 17, 24, 29, 38, 47-49, 52, 54, 68] have advanced dramatically. In T2I generation, Reinforcement Learning from Human Feedback (RLHF) has become a core post-training step [16, 18, 60], driven by powerful reward models (RMs) [39, 57, 64] and optimization algorithms [53, 64, 66]. By contrast, the application of RLHF to image editing has remained limited, with research still centered on pretraining and supervised fine-tuning (SFT) [5, 14, 56]. A primary obstacle is the lack of a sufficiently robust reward model in editing."},{"citing_arxiv_id":"2604.26341","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness","primary_cat":"cs.CV","submitted_at":"2026-04-29T06:46:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"92 36.03 65.74 17.45 16.63 33.27 30.3241.9125.69 32.64 34.03 SD-3.5-L [9] 42.85 31.48 5.90 26.74 73.03 11.15 23.55 35.91 31.03 33.05 24.83 29.64 30.76 FLUX.1-dev [22] 40.42 31.11 12.28 27.94 63.39 13.17 19.40 31.99 29.16 30.72 31.98 30.62 30.18 Qwen-Image [49] 54.59 49.96 19.89 41.48 63.83 10.04 20.17 31.34 25.84 33.24 20.22 26.43 33.09 Seedream-3.0 [10] 53.75 61.62 13.70 43.02 84.84 18.56 17.02 40.14 26.24 30.89 26.13 27.75 36.97 Unified Generative Model UniWorld-V1 [26] 23.72 
24.59 15.78 21.36 59.62 17.09 13.59 30.10 31.74 18.22 30.85 26.94 26.13 BAGEL [7] 43.34 46.65 13.47 34.49 72.10 22.53 19.12 37.92 30.77 36.86 29.01 32.21 34.87 Gemini-2.0-Flash [12] 54.77 52.93 10.92 39.54 81.85 17.50 14.07 37."},{"citing_arxiv_id":"2604.25427","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Systematic Post-Train Framework for Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-28T09:34:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"video generation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22139-22149, 2024. [13] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025. [14] Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xi- aochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025. [15] WeiJie Li, Jin Wang, and Xuejie Zhang. Promptist: Automated prompt optimization for text-to- image synthesis. 
InCCF international conference on natural language processing and Chinese"},{"citing_arxiv_id":"2604.20796","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model","primary_cat":"cs.CV","submitted_at":"2026-04-22T17:20:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19858","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Wan-Image: Pushing the Boundaries of Generative Visual Intelligence","primary_cat":"cs.CV","submitted_at":"2026-04-21T17:58:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Wan-Image is a unified multi-modal system that integrates LLMs and diffusion transformers to deliver professional-grade image generation features including complex typography, multi-subject consistency, and precise editing, outperforming several prior models in human tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19730","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FASTER: Value-Guided Sampling for Fast RL","primary_cat":"cs.LG","submitted_at":"2026-04-21T17:52:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in 
diffusion RL policies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"LIBERO, we start from the pretrained pi05_libero checkpoint from OpenPI [30], which is trained on libero_goal, libero_object, libero_spatial, and libero_10 but not libero_90; we then select all held-out libero_90 tasks on which the original policy achieves 40-60% success and run RL on those tasks without any offline data. Baselines.We compare our method to prior state-of-the-art methods for online and batch-online [35] reinforcement learning. EXPO [5].EXPO is an online RL method that jointly learns an expressive diffusion policy alongside a lightweight Gaussian edit policy that edits the actions sampled from the base policy toward a higher value distribution. IDQL [6].IDQL trains an expressive diffusion policy via imitation learning and uses implicit policy extraction by performing best-of-Nsampling, selecting the action that maximizes theQ-value."},{"citing_arxiv_id":"2604.18168","ref_index":67,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation","primary_cat":"cs.CV","submitted_at":"2026-04-20T12:28:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"1-dev [45] 12B 50 0.98 0.81 0.74 0.79 0.22 0.45 0.66 SANA-1.5 [31] 4.8B 20 0.99 0.93 0.86 0.84 0.59 0.65 0.81 Cosmos-Predict2 [64] 0.6B 35 1.00 0.97 0.74 0.86 0.59 0.70 0.81 PixArt-𝛼[42] 0.6B 20 0.98 0.50 0.44 0.80 0.08 0.07 0.48 Lumina-Image 2.0 [65] 2.6B 50 - 0.87 0.67 - - 0.62 0.73 HiDream-I1-Full [66] 3B 50 1.00 0.98 0.79 0.91 0.60 0.72 0.83 
Seedream 3.0 [67] / / 0.99 0.96 0.91 0.93 0.47 0.80 0.84 GPT Image 1 [High] [68] / / 0.99 0.92 0.85 0.92 0.75 0.61 0.84 BLIP3o-NEXT [51] 3B 30 0.99 0.95 0.88 0.90 0.92 0.79 0.91 Unified Models MetaQuery-L [69] 3B 30 - - - - - - 0.78 BLIP3-o-8B [59] 8B 30 - - - - - - 0.83 OpenUni-B-512 [70] 1.6B 30 0.99 0.91 0.74 0.90 0.77 0.73 0.84 Tar-7B [71] 9.6B 50 - 0.92 0.83 0."},{"citing_arxiv_id":"2604.14148","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Seedance 2.0: Advancing Video Generation for World Complexity","primary_cat":"cs.CV","submitted_at":"2026-04-15T17:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Seedance 2.0 is an updated multi-modal model for generating 4-15 second audio-video content at 480p/720p with support for up to 3 video, 9 image, and 3 audio references.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13030","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Generative Refinement Networks for Visual Synthesis","primary_cat":"cs.CV","submitted_at":"2026-04-14T17:59:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"74, we are confident that the performance gap can be bridged by scaling up the size of GRN. We present qualitative results in Fig. 6. Please refer to Fig. 15 and Fig. 16 in the Appendix for 8 Table 4:Evaluation of Text-to-Image generation on GenEval [21]. Model #Param #Data Single Obj. Two Obj. Count Colors Pos. Color Attri. 
Overall↑ Proprietary Models GPT Image 1 [40] N/A N/A 0.99 0.92 0.85 0.92 0.75 0.61 0.84 Seedream 3.0 [20] N/A N/A 0.99 0.96 0.91 0.93 0.47 0.80 0.84 Diffusion Models PixArt-α[9] 0.6B N/A 0.98 0.50 0.44 0.80 0.08 0.07 0.48 SD3 Medium [18] 2B N/A 0.98 0.74 0.63 0.67 0.34 0.36 0.62 JanusFlow [38] 1.3B N/A 0.97 0.59 0.45 0.83 0.53 0.42 0.63 FLUX.1-Dev [31] 12B N/A 0.98 0.81 0.74 0.79 0.22 0.45 0.66 SD3.5-Large [18] 8B N/A 0.98 0.89 0.73 0.83 0.34 0.47 0.71"},{"citing_arxiv_id":"2604.12322","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Self-Adversarial One Step Generation via Condition Shifting","primary_cat":"cs.CV","submitted_at":"2026-04-14T05:54:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"First, it trains the model on fake trajectories via: LTF =E xt,z, t \u0002 ∥Fθ(xfake t , t)−(z−x fake)∥2\u0003 .(6) 3 Preprint Then minimizes the velocity discrepancy between the real score +t and the fake score −t via a rectification loss, steering generation toward higher fidelity without an external discriminator: LTF-rect =E xt,z, t \u0002 ∥Fθ(xt, t)−sg(F θ(xt,−t) + ∆v)∥ 2\u0003 ,(7) where ∆v accounts for the gap between real and fake velocity targets. The two branches are separated by thesignof the time input t vs. −t; APEX achieves the same structure via a simpler separation in condition spacecvs.c fake, as developed in Section 3 . 
GAN Dynamics and Score Difference Gradients.GAN generator updates take the form of a score difference signal (sθ(x)−s data(x)) modulated by a sample dependent weight from the"},{"citing_arxiv_id":"2604.12163","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Nucleus-Image: Sparse MoE for Image Generation","primary_cat":"cs.CV","submitted_at":"2026-04-14T00:43:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"98 0.74 0.63 0.67 0.34 0.36 0.62 FLUX.1 Dev [40] 0.98 0.81 0.74 0.79 0.22 0.45 0.66 SD3.5 Large [27] 0.98 0.89 0.73 0.83 0.34 0.47 0.71 JanusFlow [47] 0.97 0.59 0.45 0.83 0.53 0.42 0.63 Janus-Pro-1B [48] - 0.87 0.67 - - 0.62 0.73 Janus-Pro-7B [48] 0.99 0.89 0.59 0.90 0.79 0.66 0.80 HiDream-I1-Full [49]1.000.98 0.79 0.91 0.60 0.72 0.83 Seedream 3.0 [50] 0.99 0.96 0.91 0.93 0.47 0.80 0.84 Qwen-Image [13] 0.99 0.92 0.89 0.88 0.76 0.77 0.87 GPT Image 1 High [51] 0.99 0.92 0.85 0.92 0.75 0.61 0.84 Nucleus-Image0.99 0.95 0.78 0.920.850.710.87 As shown in Table 9, Nucleus-Image achieves an overall score of 0.865 (reported as 0.87), matching Qwen-Image and surpassing all other reported models including GPT Image 1 High (0."},{"citing_arxiv_id":"2603.00607","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"IdGlow: Dynamic Identity Modulation for Multi-Subject Generation","primary_cat":"cs.CV","submitted_at":"2026-02-28T11:56:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IdGlow is a progressive two-stage diffusion framework that uses task-adaptive timestep scheduling, 
temporal gating, VLM prompt synthesis, and group-level DPO to balance identity preservation and scene coherence in multi-subject image generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.02493","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PixelGen: Improving Pixel Diffusion with Perceptual Supervision","primary_cat":"cs.CV","submitted_at":"2026-02-02T18:59:42+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.00122","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents","primary_cat":"cs.CV","submitted_at":"2026-01-27T16:51:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VDE Bench is a new human-annotated dataset and OCR-based evaluation framework for measuring image editing model performance on bilingual dense visual documents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13507","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model","primary_cat":"cs.CV","submitted_at":"2025-12-15T16:36:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Seedance 1.5 pro is a joint audio-visual generation model achieving high 
synchronization via dual-branch diffusion transformer and post-training optimizations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.07584","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LongCat-Image Technical Report","primary_cat":"cs.CV","submitted_at":"2025-12-08T14:26:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.22699","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer","primary_cat":"cs.CV","submitted_at":"2025-11-27T18:52:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and streamlined training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"• Text-Image Correlation:We use CN-CLIP [86] to compute the alignment score between an image and its associated alt caption. Pairs with low correlation scores are discarded to ensure the relevance of textual supervision. • Multi-Level Captioning:For all images selected for pre-training, we generate a structured set of captions, including concise tags, short phrases, and detailed long-form descriptions. 
Notably, diverging from prior works [21, 64, 76] that use separate modules for Optical Character Recognition (OCR) and watermark detection, our approach leverages the powerful inherent capabilities of our VLM. We explicitly prompt the VLM to describe any visible text or watermarks within the image, seamlessly integrating this information into the final caption. This unified strategy not only"},{"citing_arxiv_id":"2511.19365","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation","primary_cat":"cs.CV","submitted_at":"2025-11-24T17:59:06+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while closing the gap to latent diffusion methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.26583","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Emu3.5: Native Multimodal Models are World Learners","primary_cat":"cs.CV","submitted_at":"2025-10-30T15:11:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"66 86.67 66.83 54.8372.93 60.99 SD 3 [27] 67.46 66.0978.32 77.75 83.33 79.83 82.07 78.82 71.07 74.0761.46 59.56 61.07 64.07 68.84 70.34 
50.96 57.84 66.67 76.67 59.83 20.8363.23 67.34 MidJourney v7 [62]68.74 65.6977.41 76.00 77.58 81.83 82.07 76.82 72.57 69.3264.66 60.53 67.20 62.70 81.22 71.59 60.72 64.59 83.33 80.00 24.83 20.8368.83 63.61 Seedream 3.0 [30]86.02 84.3187.0784.93 90.5090.0089.8585.94 80.8678.8679.16 80.60 79.76 81.82 77.23 78.85 75.64 78.64100.0093.3397.1787.7883.21 83.58 GPT Image 1 [64]89.1588.2990.75 89.66 91.3387.08 84.57 84.5796.32 97.3288.55 88.35 87.07 89.44 87.2283.9685.59 83.2190.0093.3389.83 86.8389.7393.46 Qwen-Image [106]86.14 86.8386.18 87.22 90.5091.5088.2290.7879.81 79.3879."},{"citing_arxiv_id":"2509.20427","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Seedream 4.0: Toward Next-generation Multimodal Image Generation","primary_cat":"cs.CV","submitted_at":"2025-09-24T17:59:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Seedream 4.0 unifies text-to-image synthesis, image editing, and multi-image composition in an efficient diffusion transformer pretrained on billions of pairs and accelerated to 1.8 seconds for 2K output.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"19 3 1 Introduction Diffusion models have ushered in a new era in generative AI, enabling the synthesis of images with remarkable fidelity and diversity. Building on recent advances in diffusion transformers (DiTs), state-of-the-art open-source and commercial systems have emerged, such as Stable Diffusion [18], FLUX series [7, 8], Seedream models [3, 4, 21], GPT-4o image generation [15] and Gemini 2.5 flash [5]. However, as the demand for higher image quality, greater controllability, and strong multimodal capabilities (e.g., text-to-image (T2I) synthesis and image editing) increases, current models often have a critical scalability bottleneck. 
In this paper, we introduce Seedream 4.0, a powerful multimodal generative model engineered for scalability"},{"citing_arxiv_id":"2508.20751","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2025-08-28T13:11:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pref-GRPO stabilizes T2I RL training by using pairwise win rates from preference models as rewards instead of normalized pointwise scores, while UniGenBench enables finer-grained model evaluation across themes and criteria.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.02324","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Qwen-Image Technical Report","primary_cat":"cs.CV","submitted_at":"2025-08-04T11:49:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive encoding for editing consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.09113","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Seedance 1.0: Exploring the Boundaries of Video Generation Models","primary_cat":"cs.CV","submitted_at":"2025-06-10T17:56:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and 
multi-shot consistency compared to prior models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"alignment and structural stability. We employ a Vision-Language Model as the architecture of this reward model. Motion reward model helps to mitigate video artifacts while enhancing motion amplitude and vividness. Given that video aesthetics primarily derive from keyframes, we design the aesthetic reward model from image-space input inspired by Seedream [6, 8], with the data source modified to use keyframes from videos. 11 4.4.3 Base Model Feedback Learning Reward feedback learning [17, 18, 28, 33] have been widely used in current diffusion models. In Seedance 1.0, we simulate the video inference pipeline during training, directly predict x0 (generated clean video) when the Reward Model (RM) adequately assesses video quality."},{"citing_arxiv_id":"2505.15659","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FLARE: Robot Learning with Implicit World Modeling","primary_cat":"cs.RO","submitted_at":"2025-05-21T15:33:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FLARE integrates predictive latent world modeling into diffusion transformer policies for robots, delivering up to 26% gains on multitask manipulation benchmarks and enabling co-training with action-free human videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.05472","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation","primary_cat":"cs.CV","submitted_at":"2025-05-08T17:58:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mogao presents a causal unified model with deep fusion, dual 
encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}