{"total":75,"items":[{"citing_arxiv_id":"2607.00402","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Illusion of High Utility in Safety Alignment of Text-to-Image Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-07-01T04:00:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Safety-aligned T2I diffusion models exhibit semantic collapse in text embeddings causing TIFA drops; SAGE regularization restores structured utility while retaining safety.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.32039","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GEAR: Guided End-to-End AutoRegression for Image Synthesis","primary_cat":"cs.CV","submitted_at":"2026-06-30T17:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.32020","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cross-Space Distillation: Teaching One-Step Students with Modern Diffusion Teachers","primary_cat":"cs.CV","submitted_at":"2026-06-30T17:51:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces a Bridge latent interface that maps mismatched student latents into teacher space, enabling distillation from modern diffusion teachers to compact one-step students and raising SD 1.5 HPSv3 from 5.4 to 9.4 while 
keeping one-step speed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31711","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Arena-T2I Hard: Benchmarking and Improving Faithfulness with Dependency-Aware Checklist","primary_cat":"cs.AI","submitted_at":"2026-06-30T14:17:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Arena-T2I Hard benchmark with ~30 decomposed constraints per prompt and a dependency-aware checklist reward yields better faithfulness-aesthetics trade-off than single-reward or weighted-sum baselines on SD3.5-Medium and FLUX.1-dev.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30262","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Intermediate Text Representation Guided Text-to-Image Generation for Enhancing One-and-Only Alignment","primary_cat":"cs.CV","submitted_at":"2026-06-29T13:09:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IR-guided diffusion injects intermediate text representations into early denoising steps to improve alignment for one-and-only objects, reporting up to 19.1pp VQAScore gains on OAO-AttackBench and other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29814","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis","primary_cat":"cs.CV","submitted_at":"2026-06-29T05:48:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A masked discrete diffusion model adds 
token editing at inference and grouped cross-entropy training to reach 0.90 GenEval, 86.9 DPG, and 10.76 HPSv3 scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29013","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mural: Transferring LLM knowledge to image generation via Mixture-of-Transformers","primary_cat":"cs.CV","submitted_at":"2026-06-27T17:18:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mural transfers knowledge from a frozen LLM to text-to-image synthesis via MoT shared attention, achieving 0.85 GenEval, 86.75 DPG-Bench, and 0.66 WISE while exhibiting emergent behaviors without multimodal or reasoning supervision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28406","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Can AI Draw Science? 
A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models","primary_cat":"cs.LG","submitted_at":"2026-06-24T14:21:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SciDraw-Bench provides 32 structured tasks and a four-dimensional protocol to evaluate text-to-image models on scientific figure generation, with a domain-specific system outperforming general baselines in a pilot.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00351","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniVerse: A Unified Modulation Framework for Segmentation-Free,Disentangled Multi-Concept Personalization","primary_cat":"cs.CV","submitted_at":"2026-05-29T20:45:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"UniVerse proposes a unified modulation framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers, claiming superior localization and fidelity over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31604","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Representation Forcing for Bottleneck-Free Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-05-29T17:59:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Representation Forcing enables end-to-end pixel-space unified multimodal models by making visual representation prediction a native autoregressive generation target that guides subsequent pixel diffusion in the same 
backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30317","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:55:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VPG is a training-free inference-time guidance technique that improves autoregressive image and video generation by contrasting model outputs under generated versus corrupted prefixes to strengthen next-step support for the prefix.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25343","ref_index":274,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Toward Native Multimodal Modeling: A Roadmap","primary_cat":"cs.CV","submitted_at":"2026-05-25T01:57:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Document & OCR DocVQA [88] ANLS Question answering on document images. ChartQA [90] Acc. Visual and logical reasoning over charts and plots. InfoVQA [89] ANLS Multi-hop reasoning over infographic layouts. OCRBench [272] Acc. Comprehensive OCR perception across 29 sub-tasks. Generation GenEval [273] Comp. Score T2I: attribute binding, counting, relations. DPG-Bench [274] Alignment Dense, long-prompt following with structured grading. 
T2I-CompBench [275] Multi Attribute binding, relations, complex composition. FID [276] Distrib. Fréchet Inception Distance to real-image distribution. CLIPScore [277] Alignment CLIP-embedded Reference-free text-image alignment. Audio Speech Recognition LibriSpeech [98] WER Read English speech, clean and other splits."},{"citing_arxiv_id":"2605.25328","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DIVA: Harnessing the Representation Divergence in Unified Multimodal Models for Mutual Reinforcement","primary_cat":"cs.CV","submitted_at":"2026-05-25T01:17:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DIVA factorizes visual representations in unified multimodal models into shared and unique components via complementary information flows and mutual information estimation to convert representation divergence into mutual reinforcement between understanding and generation branches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23902","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-22T17:59:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding with claimed better fidelity than cascaded baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21272","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MONET: A Massive, Open, Non-redundant and Enriched 
Text-to-image dataset","primary_cat":"cs.CV","submitted_at":"2026-05-20T15:04:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"MONET is built from existing open-source datasets selected via source-governance criteria, chosen to maximize diversity in content, visual style, and resolution while supporting reproducibility. As summarized in Table 1, the bulk of the pool ( >2.8B images, from LAION, COYO and CC12M) comes with noisyalt-textcaptions, while 14.6M images are pre-captioned with VLMs such as BLIP2 [54] (Common-Catalog) and 14k with GPT-4o [38] (Diffusion-Aesthetic-4K). Finally, 9.6M images have no captions (Megalith-10M). 
We deliberately exclude several popular alternatives relying on Common Crawl, such as DataComp-1B [22], since they heavily overlap with LAION and COYO, as well as the non-English part of LAION-5B [82], since multilingual coverage is more reliably obtained via translation than from noisyalt-text."},{"citing_arxiv_id":"2605.20600","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation","primary_cat":"cs.CV","submitted_at":"2026-05-20T01:30:33+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HeadKV compresses KV cache for autoregressive image generation via head-aware budget allocation, early head-type identification from consistent patterns, and stratified token eviction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19839","ref_index":4,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Preference Labels Fall Short: Aligning Diffusion Models from Real Data","primary_cat":"cs.CV","submitted_at":"2026-05-19T13:35:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Real images contrasted with generated samples can supply effective preference signals for aligning diffusion models at performance levels comparable to standard preference-pair methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18678","ref_index":41,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Lance: Unified Multimodal Modeling by Multi-Task Synergy","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:18:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lance 
presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.","context_count":2,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"During inference, the CFG scale for text conditions in generation tasks is set to4. Unless otherwise specified, the image input resolution is set to768× 768, while videos are sampled at480p resolution with a frame rate of12fps. 5.2 Main Results 5.2.1 Image Generation Quantitative Results.We evaluate the image generation capability of Lance on GenEval [34] and DPG- Bench [42]. As shown in Table 5, Lance achieves top-tier performance among unified models on GenEval, matching the best overall score (0.90) while showing strong compositional ability on counting, colors, and spatial position. On DPG-Bench, Lance obtains competitive overall performance and performs particularly well on relation modeling, indicating its ability to preserve fine-grained semantic consistency under complex"},{"citing_arxiv_id":"2605.18324","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Improved Baselines with Representation Autoencoders","primary_cat":"cs.CV","submitted_at":"2026-05-18T12:42:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18115","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable 
Tokens","primary_cat":"cs.CV","submitted_at":"2026-05-18T09:24:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"WinTok is a hybrid visual tokenizer that supplements pixel tokens with learnable semantic tokens distilled asymmetrically from foundation models to improve reconstruction, understanding, and generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17766","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LatentUMM: Dual Latent Alignment for Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-05-18T02:35:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16842","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-16T06:59:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes HT-GRPO with sketch-then-paint staged updates, prompt-conditioned importance ratios, and hierarchical credit assignment for dMLLMs, reporting gains on GenEval and DPG plus quality metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15684","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ElasticDiT: Efficient Diffusion Transformers 
via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices","primary_cat":"cs.CV","submitted_at":"2026-05-15T07:13:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency points for high-resolution image generation on mobile devices.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14333","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation","primary_cat":"cs.CV","submitted_at":"2026-05-14T03:57:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12964","ref_index":26,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Asymmetric Flow Models","primary_cat":"cs.CV","submitted_at":"2026-05-13T03:58:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AsymFlow uses rank-asymmetric velocity prediction to reach 1.57 FID on ImageNet 256x256 and enables finetuning of latent flow models into superior pixel-space text-to-image generators.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Table 2:ImageNet 256×256 pixel diffusion comparison. 
FLOP estimation follows the convention in [71]. * denotes JiT evaluation protocol, which may have up to 0.08 better FID than ADM according to our tests. Method Pred (±) Params GFLOPs FID↓ Hierarchical CNNs (skip connections / U-Net-like) ADM-G [14]ϵ554M 2240 4.59 Hierarchical transformers (skip connections / U-ViT-like) RIN [26]ϵ320M 668 3.42 SiD, UViT/2 [22]ϵ2B 1110 2.44 VDM++, UViT/2 [31]ϵ2B 1110 2.12 SiD2, UViT/2 [23]ϵ- 274 1.73 EPG-G/16 [34]x 0 1.4B 642 1.58 SiD2, UViT/1 [23]ϵ- 13061.38 Hierarchical transformers (decoder head / DDT-like) PixNerd-XL/16 [64]ϵ−x0 700M 268 2.15 DiP-XL/16 [10]ϵ−x 0 631M - 1.79 DeCo-XL/16 [45]ϵ−x0 682M 245 1.62 PixelDiT-XL/16 [71]ϵ−x0 797M 3111."},{"citing_arxiv_id":"2605.12500","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"object-level composition, prompt following, long-text rendering, knowledge-informed generation, structured professional visual content creation, and more tightly coupled understanding-generation behaviors. Given our 32×32 downsampling ratios, we generate 2K images and downsample them to 1K for evaluation under comparable computational budgets. General Generation.For general text-to-image generation, we adopt GenEval [ 43], DPG-Bench [ 53], OneIG- Bench [12], and TIIF-Bench [138]. 
These benchmarks examine object-level compositional generation, dense prompt following, and fine-grained overall capability from complementary perspectives. Across them, SenseNova-U1 remains highly competitive, showing that the native unified modeling paradigm does not sacrifice fundamental generation quality."},{"citing_arxiv_id":"2605.12013","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"L2P: Unlocking Latent Potential for Pixel Generation","primary_cat":"cs.CV","submitted_at":"2026-05-12T12:01:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11061","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:59:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B to over 200B parameters.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"https: //blog.google/innovation-and-ai/technology/developers-tools/gemma-4/ (April 2026) [17] Ho, J., Jain, A., Abbeel, P .: Denoising diffusion probabilistic models. 
In: NeurIPS (2020) [18] Hoogeboom, E., Mensink, T., Heek, J., Lamerigts, K., Gao, R., Salimans, T.: Simpler diffusion: 1.5 fid on imagenet512 with pixel-space diffusion. In: CVPR (2025) [19] Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P ., Yu, G.: Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024) [20] Kingma, D.P ., Welling, M.: Auto-encoding variational bayes. In: ICLR (2014) [21] Ku, M., Jiang, D., Wei, C., Yue, X., Chen, W.: Viescore: Towards explainable metrics for conditional image synthesis evaluation."},{"citing_arxiv_id":"2605.10045","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T06:14:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"our hardware budget, and additionally produce qualitative results at 2048×2048 . We compare against representative positional extrapolation methods, including PE, PI [5], NTK-aware scaling, YaRN [26], RiFlex [51], and DyPE [18]. To reduce memory usage and push evaluation to the highest feasible resolution, we perform inference with FastV AR [9]. We evaluate generation quality using GenEval [8], DPG-Bench [15], and HPSv2.1 [39]. Implementation Details.For the Stage-Aware RoPE Remapping, we use a total of K= 13 generation scale steps and set kl = 6 and kh = 9 for stage-aware interpolation between PI and YaRN, and the size of the High band is set to m= 3 . 
For the Attention Calibration, the reference entropy is measured at the training resolution and reused during extrapolated inference."},{"citing_arxiv_id":"2605.08354","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria","primary_cat":"cs.AI","submitted_at":"2026-05-08T18:05:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"RewardBench2 [16], which provides fine-grained diagnostic splits across multimodal reward scenar- ios; HPDv3 (test set) [28], a large-scale text-to-image preference corpus comprising 14,400 pairwise human judgments; and EditReward-Bench [43], specifically curated to probe instruction adherence in image editing. For generative quality assessment, we adopt GenEval [ 11], DPG-Bench[ 15], TIIF(test-mini-short)[40], and UniGenBench++[37] for text-to-image synthesis, complemented by GEdit-Bench[24] and ImgEdit[49] for editing tasks. 
Baselines and Implementation.For human preference evaluation, we compare against a suite of state-of-the-art trained reward models, including HPSv3 [28], PickScore [19], ImageReward [47], UnifiedReward[39] and UnifiedReward-Thinking [38], and EditReward [43], alongside representative"},{"citing_arxiv_id":"2605.08078","ref_index":16,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Normalizing Trajectory Models","primary_cat":"cs.CV","submitted_at":"2026-05-08T17:57:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08029","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation","primary_cat":"cs.CV","submitted_at":"2026-05-08T17:14:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07253","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion 
Sampling","primary_cat":"cs.CV","submitted_at":"2026-05-08T05:22:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840-6851, 2020. [12] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022. [13] Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024. [14] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. 
T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural"},{"citing_arxiv_id":"2605.06376","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Continuous-Time Distribution Matching for Few-Step Diffusion Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-07T14:56:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06170","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models","primary_cat":"cs.CV","submitted_at":"2026-05-07T12:53:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05781","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Steering Visual Generation in Unified Multimodal Models with Understanding Supervision","primary_cat":"cs.CV","submitted_at":"2026-05-07T07:20:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal 
models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"re-caption images using different caption models to form text-image-text triplets. For image editing, training is conducted on CrispEdit-2M [8], a diverse set of high quality editing pairs. We additionally caption the target images to provide supervision text, resulting in text-image-image-text quadruplets. Evaluation ProtocolWe evaluate text-to-image generation performance using GenEval2 [20], DPG- Bench [18] and UniGenBench++ [55]. We primarily focus on DPG-Bench for its diverse prompts to evaluate semantic related instruction following. Additionally, as DPG-Bench exhibit rapid per- formance saturation [ 45], we also evaluate on UniGenBench++, a more recent and fine-grained evaluation set. We also report results on GenEval2 [20], an improved version of GenEval [14] that"},{"citing_arxiv_id":"2605.05206","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Taming Outlier Tokens in Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-06T17:59:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05204","ref_index":34,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-06T17:59:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by 
having the model act as both teacher (with multimodal context) and student (with text-only context) on its own roll-outs.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"from the target images, VLM's judgment of subject or style consistency (VLM-J), CLIP Score (CLIP-S) [29] for testing whether the model can generalize with the learned new knowledge, the Quality Score (Quality-S) and Aesthetic Score (Aesthetic-S) from the re- ward model for testing whether the model maintain its few-step sampling capacity, as well as Geneval [21] and DPG [34] score to test whether the model retain its previous knowl- edge. Detailed explanation of how the evaluation set is constructed and how each metric is obtained are in Appendix D. Methods for comparison.We compare with several representative baseline methods: (a). directly training with vanilla flow-matching loss [48] (Vanilla SFT). (b). training on the"},{"citing_arxiv_id":"2605.04128","ref_index":38,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation","primary_cat":"cs.GR","submitted_at":"2026-05-05T15:49:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"typography, and (3) stylized high-fidelity visual synthesis in editorial contexts. 
Compared to existing T2I models, JoyAI-Image consistently produces more accurate and complete text rendering, better preserves multilingual consistency, and maintains stronger visual coherence and layout 18 Table 8Quantitative evaluation results on representative general T2I benchmarks. For OneIG [15] and DPG [38], we report overall scores. Model OneIG CVTG-2K DPG EN ZH NED CLIPScore Word Acc. Overall Seedream 3.0 [32] 0.530 0.528 0.8537 0.7821 0.5924 88.27 GPT Image 1 [High] [61] 0.533 0.4740.94780.7982 0.8569 85.15 Z-Image [76]0.5460.535 0.9367 0.7969 0.8671 88.14 Qwen-Image [85] 0.5390.5480.91160.80170.828888.32 JoyAI-Image 0.542 0.521 0.9369 0.7990 0.8739 88."},{"citing_arxiv_id":"2605.02772","ref_index":5,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Linearizing Vision Transformer with Test-Time Training","primary_cat":"cs.CV","submitted_at":"2026-05-04T16:16:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Converts pretrained Vision Transformers to linear-complexity TTT models via architectural and representational alignment, demonstrated by linearizing Stable Diffusion 3.5 with 1-hour fine-tuning to match quality at 1.32-1.47x faster inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02641","ref_index":82,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE","primary_cat":"cs.CV","submitted_at":"2026-05-04T14:26:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than 
baselines.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"and its strong benchmark results (Section 5) further validate the effectiveness of this extrapolation. 6.1.3 Upcycling Ablations The preceding ablation experiments validated individual MoE design choices under controlled settings. We now evaluate the effect of the random neuron sampling upcycling strategy (Section 2.3) on downstream instruction-following benchmarks (GenEval [66] and DPGBench [82]), using the actual Mamoda2.5 25B-A3B model (i.e., the E128A8 configuration) for the experiments. The DiT backbone weights of Mamoda2.5 are constructed from the pre-trained Wan2.2 5B [4] dense model. Note that the two models differ fundamentally in architecture: Wan2.2 adopts umT5 as the text encoder with cross-attention-based condition injection, whereas"},{"citing_arxiv_id":"2604.28185","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling","primary_cat":"cs.CV","submitted_at":"2026-04-30T17:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26341","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric 
Awareness","primary_cat":"cs.CV","submitted_at":"2026-04-29T06:46:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"50 SD3 [9] 0.74 0.99 0.94 0.72 0.89 0.33 0.60 81.3 58.9 73.3 32.0 40.8 61.7 84.08 FLUX.1-dev [22] 0.66 0.98 0.79 0.73 0.77 0.22 0.45 74.1 57.2 69.2 28.6 38.7 61.9 83.79 Unified Generative Model Show-o [53] 0.98 0.80 0.66 0.84 0.31 0.50 0.68 - - - - - - 67.27 UniWorld-V1 [26] 0.80 0.99 0.930.810.89 0.74 0.71 61.8 33.5 47.4 27.5 40.5 55.3 81.38 BAGEL [15] 0.82 0.99 0.940.810.88 0.64 0.63 81.0 56.2 70.8 35.4 41.9 64.7 - Janus-Pro [6] 0.80 0.99 0.89 0.59 0.90 0.790.66 - - - - - - 84.17 OmniGen2 [50] 0.801 0.950.64 0.89 0.550.7678.8 56.4 72.3 38.8 41.4 64.4 83.57 SpatialFusion 0.84 1 0.95 0.78 0.92 0.76 0.71 81.5 59.0 74.0 40.7 43.6 66.4 84.28 approximately 23% in allocentric tasks and 31% in intrinsic relation-"},{"citing_arxiv_id":"2604.25636","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-04-28T13:36:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than editing-based methods.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"1Tsinghua University 2Tencent HY 
‡Corresponding Authors https://github.com/LeapLabTHU/RvR Fig.1: Refinement via Regeneration (RvR) largely improves text-to-image generation.Compared with the base unified multimodal model (UMM) BAGEL [12] and existing refinement-via-editing (RvE) methods, RvR achieves consistently better performance across Geneval [17], DPGBench [24], and UniGenBench++ [52]. Abstract.Unified multimodal models (UMMs) integrate visual under- standing and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs af- ter their initial generation, potentially extending the performance up- per bound. Current UMM-based refinement methods primarily follow a"},{"citing_arxiv_id":"2604.25299","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents","primary_cat":"cs.CV","submitted_at":"2026-04-28T07:09:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24953","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ViPO: Visual Preference Optimization at Scale","primary_cat":"cs.CV","submitted_at":"2026-04-27T19:49:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality 
data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"|−(1−p)(1+αp)| of Poly-DPO adapts to different data characteristics through theα parameter, where p=σ(z) represents the model's confidence in preferring the chosen response. The visualization reveals three distinct optimization regimes that directly correspond to our experimental findings. Whenα >0 (blue and purple curves), the gradient is amplified in the regionp∈[0.5,0.8] , maintaining substantial parameter updates even for moderately confident predictions. This enhancement proves crucial for noisy datasets like Pick-a-Pic V2, where only 20.79% of samples show consistent preferences across evaluation dimensions-the sustained gradient (approximately 2-3× stronger than standard DPO atp≈0.6 when α= 8) prevents premature convergence on spurious patterns and encourages continued exploration to identify"},{"citing_arxiv_id":"2604.24763","ref_index":19,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation","primary_cat":"cs.CV","submitted_at":"2026-04-27T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21921","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Context Unrolling in Omni Models","primary_cat":"cs.CV","submitted_at":"2026-04-23T17:58:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Omni is a multimodal model whose native training on diverse data types enables 
context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"As a unified model,O mni can naturally perform image generation and editing tasks, depending on the modality combination of input contexts. We thus report the performances on various benchmarks, comparing our proposed method with the expertise models. Table 6 presents the main results on both text-to-image and image editing tasks, including GenEval2 [19], DPG [16], LongText-EN [12], Inhouse evaluation, and GEdit [29]. Although prior approaches usually derive expertise models that focus on different tasks respectively,O mni benefits from the MoE architecture and task unification, achieving comparable performances with only 3B activations. 3.3 Video Generation Beyond image generation,O mni can also synthesize videos with various combinations of multimodal instruc-"},{"citing_arxiv_id":"2604.20796","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model","primary_cat":"cs.CV","submitted_at":"2026-04-22T17:20:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18258","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Long-Text-to-Image Generation via Compositional Prompt 
Decomposition","primary_cat":"cs.CV","submitted_at":"2026-04-20T13:31:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models while generalizing better to prompts over 500 tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}