pith. machine review for the scientific record. sign in

arxiv: 2508.02324 · v1 · submitted 2025-08-04 · 💻 cs.CV

Recognition: 3 theorem links

Qwen-Image Technical Report

Authors on Pith no claims yet

Pith reviewed 2026-05-10 14:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords imageeditingqwen-imagerenderingachievescomplextextcapabilities
0
0 comments X

The pith

Qwen-Image uses progressive curriculum training and dual-encoding to advance complex text rendering and consistent image editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors present Qwen-Image as a model built to render detailed text accurately inside generated images and to edit existing images without unwanted changes. They collect and prepare large amounts of text-rich data, then train the model in stages that start with simple text and move up to full paragraphs. For editing they add a reconstruction task and process each input image in two ways—one path captures semantic meaning while the other preserves visual structure—so the edits stay faithful to both the prompt and the original picture. These choices produce stronger results on benchmarks for both generation and editing, including better handling of Chinese text. A reader would care because spelling words correctly in pictures and making precise edits remain hard problems for most image models.

Core claim

Through a data pipeline for text-rich examples, progressive curriculum training that scales from simple to paragraph-level text, and a dual-encoding approach that separately extracts semantic and reconstructive features from the input image, Qwen-Image improves text rendering in generation and maintains higher consistency during image editing.

What carries the argument

Progressive curriculum training combined with a dual-encoding mechanism that processes the original image once for semantic content and once for reconstructive detail.

Load-bearing premise

The performance gains in text rendering and editing consistency arise chiefly from the progressive curriculum and dual-encoding steps rather than from model scale or base model improvements alone.

What would settle it

An ablation study that trains the model without the progressive stages or without the dual-encoding and then measures the drop on text rendering and editing benchmarks.

read the original abstract

We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents Qwen-Image, an image generation and editing foundation model in the Qwen series. It describes a data collection/filtering/synthesis pipeline combined with progressive curriculum training (non-text-to-text to paragraph-level) to improve complex text rendering, particularly for logographic languages such as Chinese. For editing consistency it introduces a multi-task setup (T2I, TI2I, and I2I reconstruction) plus a dual-encoding mechanism that routes the input image separately through Qwen2.5-VL (semantic) and a VAE encoder (reconstructive). The abstract asserts that these components yield state-of-the-art performance on multiple benchmarks for both generation and editing.

Significance. If the performance claims and causal attributions hold, the work would supply a concrete, reproducible recipe for curriculum-based text rendering and dual-path latent alignment in diffusion-style models, with particular value for multilingual and text-heavy image tasks. The explicit separation of semantic and reconstructive pathways is a clear design choice that could be adopted more broadly.

major comments (3)
  1. [Abstract] Abstract and methods description: the central claim that progressive curriculum training and dual-encoding drive SOTA text-rendering and editing gains is unsupported because the manuscript supplies no quantitative benchmark scores, comparison tables, or error metrics. Without these data the attribution cannot be evaluated against simpler scaling baselines.
  2. [Training strategy] Training strategy section: no ablation experiments are reported that isolate the progressive curriculum (non-text-to-text to paragraph-level) or the dual-encoding (Qwen2.5-VL + VAE) from the effects of data volume or the base Qwen2.5-VL model. This omission leaves open the possibility that observed improvements are due to scale rather than the described mechanisms.
  3. [Multi-task training] Multi-task training description: the claim that adding I2I reconstruction aligns latent representations between Qwen2.5-VL and MMDiT is stated without any supporting alignment metrics, reconstruction error curves, or consistency scores on editing benchmarks.
minor comments (1)
  1. The abstract and methods text would benefit from explicit forward references to any tables or figures that contain the benchmark numbers once they are added.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and have revised the manuscript accordingly to strengthen the presentation of results and supporting evidence.

read point-by-point responses
  1. Referee: [Abstract] Abstract and methods description: the central claim that progressive curriculum training and dual-encoding drive SOTA text-rendering and editing gains is unsupported because the manuscript supplies no quantitative benchmark scores, comparison tables, or error metrics. Without these data the attribution cannot be evaluated against simpler scaling baselines.

    Authors: We agree that the abstract would benefit from more explicit quantitative grounding. The experiments section of the manuscript reports results on standard benchmarks including GenEval, DPG-Bench for generation and MagicBrush, Emu Edit for editing tasks. To directly address the concern, we have revised the abstract to reference these evaluations and inserted a new summary comparison table early in the experiments section that lists key metrics against baselines such as SD3 and Flux. This makes the performance claims and their attribution to the proposed methods immediately verifiable. revision: yes

  2. Referee: [Training strategy] Training strategy section: no ablation experiments are reported that isolate the progressive curriculum (non-text-to-text to paragraph-level) or the dual-encoding (Qwen2.5-VL + VAE) from the effects of data volume or the base Qwen2.5-VL model. This omission leaves open the possibility that observed improvements are due to scale rather than the described mechanisms.

    Authors: We acknowledge that dedicated ablations would provide stronger causal evidence. The manuscript already includes comparisons against the base Qwen2.5-VL model and qualitative discussion of curriculum stages, but full-scale ablations were omitted due to computational cost. In the revision we have added a new subsection presenting smaller-scale ablation studies on the curriculum progression (non-text-to-text through paragraph-level) and dual-encoding variants, together with additional analysis arguing that the gains exceed what would be expected from data volume or base-model scaling alone. revision: partial

  3. Referee: [Multi-task training] Multi-task training description: the claim that adding I2I reconstruction aligns latent representations between Qwen2.5-VL and MMDiT is stated without any supporting alignment metrics, reconstruction error curves, or consistency scores on editing benchmarks.

    Authors: We thank the referee for highlighting this gap. The multi-task setup is described in the methods, but explicit supporting measurements were not reported. We have added reconstruction error curves, latent alignment metrics (cosine similarity between Qwen2.5-VL and MMDiT representations), and editing consistency scores on the relevant benchmarks to the revised multi-task training subsection, directly substantiating the alignment claim. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training pipeline with no derivations or self-referential reductions

full rationale

The paper describes concrete engineering choices—large-scale data collection/filtering/synthesis, progressive curriculum (non-text-to-text to paragraph-level), multi-task training (T2I + TI2I + I2I reconstruction), and dual-encoding (separate Qwen2.5-VL semantic and VAE paths)—then reports benchmark outcomes. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear. Claims rest on external data and observed results rather than tautological reduction to inputs. This is a standard empirical technical report with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, mathematical axioms, or newly postulated entities; the work relies on standard components (Qwen2.5-VL, MMDiT, VAE) and conventional ML training practices whose details are not enumerated.

pith-pipeline@v0.9.0 · 5695 in / 1134 out tokens · 35391 ms · 2026-05-10T14:24:43.916786+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

    cs.CV 2026-04 unverdicted novelty 8.0

    The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

  2. OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

    cs.CV 2026-04 unverdicted novelty 8.0

    OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

  3. From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

  4. MiVE: Multiscale Vision-language features for reference-guided video Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer, achieving top human preference in reference-guided video editing.

  5. LiWi: Layering in the Wild

    cs.CV 2026-05 unverdicted novelty 7.0

    LiWi uses an agent-driven data synthesis pipeline to build the LiWi-100k dataset and a model with shadow-guided and degradation-restoration objectives that achieves SoTA performance on RGB L1 and Alpha IoU for natural...

  6. Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

  7. Asymmetric Flow Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...

  8. Inline Critic Steers Image Editing

    cs.CV 2026-05 conditional novelty 7.0

    Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.

  9. Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

  10. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.

  11. Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing

    cs.CR 2026-05 unverdicted novelty 7.0

    Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.

  12. What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 7.0

    A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.

  13. BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    BRIDGE uses separate main and subject paths plus a discrete gate on positional embeddings to improve local edits with coarse masks, raising local SigLIP2-T from 0.39 to 0.50 on its benchmark.

  14. EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

    cs.CV 2026-05 unverdicted novelty 7.0

    EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.

  15. Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.

  16. Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

    cs.CV 2026-05 unverdicted novelty 7.0

    Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.

  17. Secure Seed-Based Multi-bit Watermarking for Diffusion Models from First Principles

    cs.CR 2026-05 unverdicted novelty 7.0

    A theoretical framework decouples diffusion model generation from watermark decisions, enabling SSB to reach any security-robustness-fidelity regime without model-specific empirical tests.

  18. CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    CrossCult-KIBench provides 9,800 test cases for cross-cultural knowledge insertion in MLLMs and shows that existing methods cannot reliably adapt to one culture while preserving behavior in others.

  19. CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    CrossCult-KIBench is a new benchmark for evaluating cross-cultural knowledge insertion in MLLMs, paired with the MCKI baseline method, showing current approaches fail to balance adaptation and preservation.

  20. MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.

  21. DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    DirectEdit achieves step-level accurate inversion for flow-based image editing by directly aligning forward paths, using attention feature injection and mask-guided noise blending to balance fidelity and editability w...

  22. SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking

    cs.CV 2026-05 unverdicted novelty 7.0

    SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.

  23. StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition

    cs.GR 2026-04 unverdicted novelty 7.0

    StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.

  24. Exploring Spatial Intelligence from a Generative Perspective

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

  25. ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.

  26. HP-Edit: A Human-Preference Post-Training Framework for Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.

  27. Long-Text-to-Image Generation via Compositional Prompt Decomposition

    cs.CV 2026-04 unverdicted novelty 7.0

    PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...

  28. View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity

    cs.CV 2026-04 unverdicted novelty 7.0

    A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.

  29. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  30. UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.

  31. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

    cs.CV 2026-04 unverdicted novelty 7.0

    LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

  32. Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID

    cs.CV 2026-04 unverdicted novelty 7.0

    STFER uses LVLM-generated identity-consistent semantic text to drive visual token filtering and expert routing for improved any-time person re-identification under clothing changes and modality shifts.

  33. LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    LottieGPT tokenizes Lottie animations into compact sequences and fine-tunes Qwen-VL to autoregressively generate coherent vector animations from natural language or visual prompts, outperforming prior SVG models.

  34. HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement

    cs.CV 2026-04 unverdicted novelty 7.0

    A diffusion-based pipeline creates a 27M-annotation dataset of object placements that outperforms human annotations and baselines on image editing tasks, then distills it into a fast model.

  35. RewardFlow: Generate Images by Optimizing What You Reward

    cs.CV 2026-04 unverdicted novelty 7.0

    RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.

  36. FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    FlowGuard detects unsafe content during diffusion image generation via linear latent decoding and curriculum learning, outperforming prior methods by over 30% F1 while reducing GPU memory by 97% and projection time to...

  37. SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation

    cs.CV 2026-04 conditional novelty 7.0

    SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.

  38. RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

    cs.CV 2026-04 unverdicted novelty 7.0

    RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.

  39. 1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation

    cs.CV 2026-04 conditional novelty 7.0

    1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.

  40. Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro

    cs.CV 2026-04 unverdicted novelty 7.0

    Banana100 dataset shows that none of 21 popular NR-IQA metrics consistently rate images degraded by 100 iterative edits lower than clean originals.

  41. Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off

    cs.CV 2026-03 unverdicted novelty 7.0

    Dress-ED is the first large-scale benchmark unifying virtual try-on, try-off, and text-guided garment editing with 146k verified samples plus a multimodal diffusion baseline.

  42. Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.

  43. Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

    cs.CV 2026-05 unverdicted novelty 6.0

    V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...

  44. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.

  45. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  46. AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling

    cs.SD 2026-05 unverdicted novelty 6.0

    AuDirector is a self-reflective closed-loop multi-agent framework that generates immersive audio narratives with improved structural coherence, emotional expressiveness, and acoustic fidelity via identity-aware voice ...

  47. GeoR-Bench: Evaluating Geoscience Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.

  48. FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry

    cs.LG 2026-05 unverdicted novelty 6.0

    Linear mappings in feature space can reconstruct a wide range of image manipulations including semantic edits, suggesting that feature representations are approximately linearly organized.

  49. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  50. HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    cs.CV 2026-05 unverdicted novelty 6.0

    A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

  51. Pixal3D: Pixel-Aligned 3D Generation from Images

    cs.CV 2026-05 unverdicted novelty 6.0

    Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.

  52. How Mobile World Model Guides GUI Agents?

    cs.AI 2026-05 unverdicted novelty 6.0

    Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out...

  53. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.

  54. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.

  55. Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

    cs.AI 2026-05 unverdicted novelty 6.0

    Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...

  56. SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.

  57. STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.

  58. BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

    cs.CV 2026-05 unverdicted novelty 6.0

    BRIDGE improves coarse-mask local image editing in DiT models by routing background and subject paths separately and using a discrete geometric gate on positional embeddings to reduce mask-shape bias.

  59. ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.

  60. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 130 Pith papers · 21 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575,

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  3. [3]

    Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073,

  4. [4]

    Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025

    Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, et al. Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer. arXiv preprint arXiv:2505.22705,

  5. [5]

    Oneig-bench: Omni-dimensional nuanced evaluation for image generation

    Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. arXiv preprint arXiv:2506.07977,

  6. [6]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. InEuropean Conference on Computer Vision, pp. 386–402. Springer, 2024a. Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. B...

  7. [7]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683,

  8. [8]

    Downs, L., Francis, A., Koenig, N., Kinman, B., Hickman, R., Reymann, K., McHugh, T

    URL https://arxiv.org/abs/2204.11918. Nikai Du, Zhennan Chen, Zhizhou Chen, Shan Gao, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes. arXiv preprint arXiv:2503.23461,

  9. [9]

    Seedream 3.0 Technical Report

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346,

  10. [10]

    arXiv preprint arXiv:2507.22058 (2025)

    Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, and Jie Jiang. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again. CoRR, abs/2507.22058,

  11. [11]

    Seedream 2.0: A native chinese-english bilingual image generation foundation model.arXiv preprint arXiv:2503.07703, 2025

    Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native chinese-english bilingual image generation foundation model. arXiv preprint arXiv:2503.07703,

  12. [12]

    Depthfm: Fast monocular depth estimation with flow matching

    Ming Gui, Johannes Schusterbauer, Ulrich Prestel, Pingchuan Ma, Dmytro Kotovenko, Olga Grebenkova, Stefan Andreas Baumann, Vincent Tao Hu, and Björn Ommer. Depthfm: Fast monocular depth estimation with flow matching. arXiv preprint arXiv:2403.13788,

  13. [13]

    Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. arXiv preprint arXiv:2404.15506, 2024a. Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip ...

  14. [14]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    42 Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603,

  15. [15]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742,

  16. [16]

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2. 5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024a. Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Day...

  17. [17]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147,

  18. [18]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025a. Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Pr...

  19. [19]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025b. Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. a...

  20. [20]

    1145/219717.219748

    ISSN 0001-0782. doi: 10.1145/219717.219748. URL https://doi.org/10.1145/219717.219748. Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. Ray: A distributed framework for emerging {AI} applications. In 13th USENIX symposium on operating systems des...

  21. [21]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    URL https://openai.com/index/introducing-4o-image-generation/ . Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952,

  22. [22]

    Lumina- image 2.0: A unified and efficient image generative frame- work.arXiv preprint arXiv:2503.21758, 2025

    Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina-image 2.0: A unified and efficient image generative framework. arXiv preprint arXiv:2503.21758,

  23. [23]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  24. [24]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053,

  25. [25]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786,

  26. [26]

    Dai, Andrea F

    Igor Vasiljevic, Nick Kolkin, Shanyi Zhang, Ruotian Luo, Haochen Wang, Falcon Z Dai, Andrea F Daniele, Mohammadreza Mostajabi, Steven Basart, Matthew R Walter, et al. Diode: A dense indoor and outdoor depth dataset. arXiv preprint arXiv:1908.00463,

  27. [27]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314,

  28. [28]

    arXiv preprint arXiv:2312.02201 , year=

    Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation. arXiv preprint arXiv:2312.02201,

  29. [29]

    Seededit 3.0: Fast and high-quality generative image editing.arXiv preprint arXiv:2506.05083, 2025

    44 Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, and Jian- chao Yang. Seededit 3.0: Fast and high-quality generative image editing.arXiv preprint arXiv:2506.05083,

  30. [30]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024a. Zhengyi Wang, Yikai Wang, Yifei Chen, Chendong Xiang, Shuo Chen, Dajiang Yu, Chongxuan Li, Hang Su, and Jun Zhu. Crm: Single image to 3d te...

  31. [31]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 12966–12977, 2025a. Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shit...

  32. [32]

    Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer

    Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Chengyue Wu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427, 2025a. Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qingho...

  33. [33]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025b. An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, and Chang Zhou. Chinese clip: Contrastive vision-language pretraining in chinese. arXiv:2211.01335,

  34. [34]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

  35. [35]

    Depth Anything V2

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10371–10381, 2024a. Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao...

  36. [36]

    In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer.arXiv preprint arXiv:2504.20690, 2025

    Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer. arXiv preprint arXiv:2504.20690,

  37. [37]

    arXiv preprint arXiv:2502.17157 (2025)

    Canyu Zhao, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, and Chun- hua Shen. Diception: A generalist diffusion model for visual perceptual tasks. arXiv preprint arXiv:2502.17157,

  38. [38]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277,

  39. [39]

    arXiv preprint arXiv:2410.12669 (2024)

    Dewei Zhou, Ji Xie, Zongxin Yang, and Yi Yang. 3dis: Depth-driven decoupled instance synthesis for text-to-image generation. arXiv preprint arXiv:2410.12669,