UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation

Jiayun Wang; Weijie Gan; Wei Wei; Yu Wang; Zhenting Wang

arxiv: 2605.21611 · v1 · pith:BNVQK7IDnew · submitted 2026-05-20 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-22 09:21 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords spatially grounded image generation · vision-language embedding · OCR pretraining · diffusion conditioning · unified conditioning · contextual image generation · mask-annotated images · text-rendered masks

The pith

Rendering text onto spatial masks lets one OCR-based encoder replace separate text and image prompts for better controlled image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a single unified visual input, created by rendering textual instructions directly onto a spatial mask, can bind semantics to locations more effectively than separate vision and language encoders. An OCR-pretrained backbone extracts a fused embedding that conditions a diffusion model after a two-stage alignment process. This produces higher-fidelity images on a benchmark of 477K mask-annotated examples while removing the text encoder at inference time. The method cuts computational load and supports precise instructions about what should appear where. Readers would care because it simplifies architecture and lowers the cost of spatially controllable synthesis.

Core claim

UniVL reframes conditioning for image generation by rendering textual instructions onto spatial masks to form a single visual input. An OCR-pretrained encoder produces the fVIL embedding that fuses semantic intent with spatial locations in one token sequence. A two-stage pipeline first aligns this embedding with the VAE space and then uses it to condition a pretrained diffusion backbone, eliminating any standalone text encoder such as T5.
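
To make the idea of a "unified visual input" concrete, the following is a toy, standard-library-only sketch of drawing a label inside a mask box so that the text's position carries the spatial instruction. Everything in it is an assumption made for illustration: the 3x5 glyph table, the cell codes (0 = untouched context, 1 = masked region, 2 = glyph pixel), and the function name `render_condition` are hypothetical, and this page does not describe the paper's actual renderer, font, or resolution.

```python
# Hypothetical 3x5 bitmap glyphs, defined only for this illustration.
GLYPHS = {
    "C": ["###", "#..", "#..", "#..", "###"],
    "A": [".#.", "#.#", "###", "#.#", "#.#"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}
GLYPH_W, GLYPH_H, GAP = 3, 5, 1


def render_condition(height, width, box, label):
    """Return a 2D grid: 0 = context, 1 = masked region, 2 = glyph pixel.

    box is (top, left, bottom, right) with exclusive bottom/right edges.
    Only the letters present in GLYPHS are supported (KeyError otherwise).
    """
    top, left, bottom, right = box
    grid = [[0] * width for _ in range(height)]

    # Fill the mask region.
    for r in range(top, bottom):
        for c in range(left, right):
            grid[r][c] = 1

    # Refuse labels that cannot fit inside the mask.
    text_w = len(label) * (GLYPH_W + GAP) - GAP
    if text_w > right - left or GLYPH_H > bottom - top:
        raise ValueError("label does not fit inside the mask box")

    # Center the label inside the box and stamp each glyph.
    r0 = top + (bottom - top - GLYPH_H) // 2
    c0 = left + (right - left - text_w) // 2
    for i, ch in enumerate(label):
        for dr, row in enumerate(GLYPHS[ch]):
            for dc, px in enumerate(row):
                if px == "#":
                    grid[r0 + dr][c0 + i * (GLYPH_W + GAP) + dc] = 2
    return grid


if __name__ == "__main__":
    for row in render_condition(9, 16, (1, 2, 8, 15), "CAT"):
        print("".join(".:#"[cell] for cell in row))
```

The sketch only shows why one image can carry both "what" and "where"; it also makes visible that a box smaller than the rendered label cannot hold the text at all, which is the kind of limit any renderer of this sort has to handle somehow.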

What carries the argument

The UniVL encoder, adapted from an optical-character-recognition-pretrained backbone, that reads the unified text-rendered mask condition and outputs a fused fVIL embedding for diffusion conditioning.

If this is right

Image quality rises, with FID dropping from 14 to 11 and PSNR rising from 16 to 20 on the UniVL-ImgGen benchmark.
Inference cost falls because the text encoder is removed, cutting TFLOPs by up to 52 percent and runtime by up to 44 percent.
The model follows user instructions specifying both content and location within the generated image.
Ablation studies confirm that the OCR backbone adaptation and two-stage alignment each contribute to the observed gains.
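
As a rough guide to what a 4 dB PSNR gain means, the sketch below converts PSNR values back to mean squared error using the standard definition PSNR = 10·log10(MAX² / MSE). It assumes 8-bit images (MAX = 255) and treats the reported 16 and 20 as if each came from a single pooled MSE; the page does not say how the paper aggregates PSNR across images, so the result is an order-of-magnitude reading rather than the authors' own number.

```python
import math


def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_from_psnr(psnr_db, max_value=255.0):
    """Invert the PSNR definition to recover the mean squared error."""
    return max_value ** 2 / 10 ** (psnr_db / 10.0)


mse_before = mse_from_psnr(16.0)   # baseline PSNR quoted above
mse_after = mse_from_psnr(20.0)    # UniVL PSNR quoted above
ratio = mse_before / mse_after     # equals 10 ** 0.4, independent of max_value

print(f"MSE ratio (before / after): {ratio:.2f}")
```

Under those assumptions the error energy falls by a factor of about 2.5, and that factor does not depend on the choice of MAX.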

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rendering trick could be applied to other control signals such as depth maps or edge sketches to create richer single-stream conditioning.
Interactive design tools might adopt this pattern to lower latency when users edit spatial instructions in real time.
Further adaptation of the backbone on diverse scripts and layouts could extend reliable performance to multilingual or stylized text inputs.
The approach suggests that vision backbones pretrained on reading tasks may serve as drop-in replacements for language encoders in many multimodal generation settings.

Load-bearing premise

The OCR-pretrained backbone can reliably extract and bind semantic intent from text rendered onto the spatial mask without significant loss of information or spatial precision.

What would settle it

Evaluating generated images on a new test set of masks with unusual fonts, dense text, or partial occlusions and checking whether semantic accuracy and spatial match remain superior to text-prompted baselines would directly test the central claim.
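
One way such a test could score "spatial match" is intersection-over-union between the instructed mask box and the box where the requested object is actually found in the output. The sketch below is only that scoring step. It assumes some external detector has already produced the found boxes, and the 0.5 threshold is a common convention rather than anything taken from the paper.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1) with x0 < x1 and y0 < y1."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_match_rate(pairs, threshold=0.5):
    """Fraction of (instructed_box, detected_box) pairs with IoU >= threshold.

    A detected_box of None means the detector did not find the object.
    """
    if not pairs:
        return 0.0
    hits = sum(
        1 for want, got in pairs
        if got is not None and box_iou(want, got) >= threshold
    )
    return hits / len(pairs)
```

Running the same scoring on a text-prompted baseline and on the rendered-text model, over the same masks, would give the head-to-head number the paragraph asks for.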

Figures

Figures reproduced from arXiv: 2605.21611 by Jiayun Wang, Weijie Gan, Wei Wei, Yu Wang, Zhenting Wang.

**Figure 2.** UNIVL training pipeline. The UNIVL encoder (left, pretrained on OCR tasks) consists of a frozen VAE encoder, a trainable CLIP backbone and a trainable linear adapter; the VAE features bypass CLIP via a skip connection and are mask-aware-fused with the CLIP features (Eq. 1). Stage 1 (Feature alignment): the UNIVL encoder takes the contextual condition CI and is trained so its output fVL reconstructs the VAE…

**Figure 3.** Qualitative comparison to OminiControl baseline [ …

**Figure 4.** (4a) Efficiency gains of UNIVL over the baseline [26] with text encoder across image resolutions: TFLOPs and runtime savings grow at lower resolutions, from 8% at 1024×1024 to 52% TFLOPs / 44% runtime at 256×256. (4b)–(4c) Multi-region performance: UNIVL (red) maintains a substantial lead over OminiControl w/o text (blue) at every mask count, both in image quality (FID, lower is better) and in region-level…
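
The trend in panel (4a) is what one would expect if the removed text encoder costs roughly the same at every image resolution while the rest of the pipeline scales with image size. The sketch below works through that simple model using the two endpoints in the caption. It is a consistency check under stated assumptions, not the paper's accounting: it reads the 8% figure as a TFLOPs saving (the caption does not make this explicit) and assumes all savings come from dropping a fixed-cost encoder.

```python
def implied_remaining_to_encoder_ratio(saving):
    """Return D / T given the fraction of total compute saved.

    Model: total = T + D, where T is the (fixed) text-encoder cost and D is
    everything else. Removing T saves T / (T + D), so D / T = (1 - s) / s.
    """
    if not 0.0 < saving < 1.0:
        raise ValueError("saving must be strictly between 0 and 1")
    return (1.0 - saving) / saving


ratio_256 = implied_remaining_to_encoder_ratio(0.52)   # endpoint at 256x256
ratio_1024 = implied_remaining_to_encoder_ratio(0.08)  # endpoint at 1024x1024
growth = ratio_1024 / ratio_256
pixel_growth = (1024 * 1024) / (256 * 256)

print(f"D/T at 256x256:   {ratio_256:.2f}")
print(f"D/T at 1024x1024: {ratio_1024:.2f}")
print(f"implied growth of D: {growth:.2f}x (pixel count grows {pixel_growth:.0f}x)")
```

Under this model the remaining compute is a little less than the encoder's cost at 256×256 and 11.5 times it at 1024×1024, so it grows about 12.5-fold while the pixel count grows 16-fold. The practical reading is that the headline 52% applies to small images and the benefit shrinks as resolution rises.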

**Figure 5.** Zero-shot test on irregular mask shape. Row 1: UNIVL generates an image whose foreground content fills the irregular mask shape, matching the text label rendered inside the non-rectangular region. Row 4: UNIVL respects the round-shape cookie mask and outperforms the OminiControl baseline, which defaults to a bounding-rectangle fill and ignores the irregular boundary. Although UNIVL is trained only on recta…

**Figure 6.** Overlapping masks. Row 1: UNIVL works when two masks do not overlap too much: each labeled region is generated as instructed. Row 2 (failure case): when the “food” and “plate” boxes totally overlay, UNIVL cannot disambiguate which label belongs to which region and produces a single fused object. Row 3: compared to existing methods, UNIVL follows the “pickup truck” instruction at the desired location, while …

**Figure 7.** Zero-shot UNIVL on COCO val2017. UNIVL generalizes well to a held-out natural-image distribution despite never having seen COCO during training, producing region-faithful generations across diverse object categories. E.4 Multi-Step and Single-Step Multibox Edits: UNIVL handles two complementary multi-region workflows in a single architecture: (i) sequential multi-step edits, where the user applies one mask …

**Figure 8.** Multi-region edit modes. The source image is shown at the bottom-left.

**Figure 9.** Failure cases of UNIVL. (a) Rare vocabulary: for uncommon class phrases such as “stone fountain,” the rendered text is read correctly but the diffusion model has limited training signal for that category, producing generic stone-textured fills rather than the intended object. We expect this to improve with broader training-data coverage. (b) Very small mask region: when the mask area is tiny, the rendered …

**Figure 10.** Class-name vocabulary in UNIVL-ImgGen. (10a) Word cloud where font size is proportional to frequency: the most prominent labels are everyday objects and people (person, car, people, flowers, bench, trees), with a long tail of attribute-modified phrases (a wooden tray, a black bicycle, a red sports car). (10b) Frequency chart of the top-30 most frequent phrases. The vocabulary is dominated by natural-image…

Original abstract

We introduce spatially grounded contextual image generation, a controllable image generation task that reframes the conditioning paradigm. Instead of supplying a reference image and a global text prompt through two separate encoders, one for vision and one for language, UniVL is trained to bind semantics to spatial locations directly from a single unified visual input, where the textual instruction is rendered onto the spatial mask. This removes the need for a standalone text encoder at inference time. The resulting model supports contextual image generation by following user-specified instructions about what should appear where, while substantially reducing computation. To address this task, we propose a framework in which the UniVL encoder, adapted from an optical-character-recognition-pretrained backbone, reads the unified condition optically and produces a UniVL embedding, fVIL, that fuses visual and semantic intent with spatial locations in a single token sequence. A two-stage pipeline first aligns UniVL with the VAE embedding space and then conditions a pretrained diffusion backbone entirely on UniVL embeddings, eliminating the standalone text encoder, such as T5. Although this reframing uses a deliberately minimal text interface, it yields strong empirical gains. On UniVL-ImgGen, a benchmark of 477K mask-annotated images that we construct for training and evaluation, UniVL improves image quality over text-prompted baselines, reducing FID from 14 to 11 and increasing PSNR from 16 to 20. It also eliminates the text encoder entirely, reducing inference TFLOPs by up to 52% and runtime by up to 44%. Additional ablation studies validate the contributions of the proposed components, paving the way for efficient, spatially grounded image generation with a unified conditioning paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

UniVL renders text on masks to drop the separate text encoder and reports efficiency plus quality gains on a self-built benchmark, but the OCR binding step looks like the untested load-bearing piece.

The paper's main idea is to render the textual instruction directly onto the spatial mask and run the composite through an OCR-adapted encoder, producing a single fVIL embedding that conditions a diffusion model. This removes the standalone text encoder at inference, and the authors report lower FID, higher PSNR, and up to 52% fewer TFLOPs on the 477K-image UniVL-ImgGen benchmark they assembled for the task. The two-stage pipeline (alignment to VAE space, then diffusion conditioning) is a practical way to reuse existing backbones without retraining everything from scratch. Building the mask-annotated dataset is also a concrete contribution that gives others a place to test similar unified conditioning setups. Efficiency numbers like the runtime cut of up to 44% matter for real deployment, and the abstract is clear about what changes in the pipeline.

The soft spot is whether the OCR backbone actually fuses semantics with precise spatial locations. OCR pretraining focuses on reading characters, not on interpreting instructions like object placement or color while preserving mask alignment. If that step drops information, the reported gains could trace back to benchmark construction or the diffusion backbone rather than the unified paradigm itself. The abstract mentions ablations but gives no error bars or detailed breakdowns, so it's hard to tell how sensitive the results are to the rendering or the specific OCR model. I'd want to see head-to-head runs where the same masks get both the rendered-text input and a conventional text encoder to isolate the effect.

This work is aimed at people building efficient conditional generators who want simpler inference pipelines and spatial control without extra encoders. A reader focused on diffusion conditioning or vision-language unification would find the benchmark and the embedding approach useful to examine. It deserves a serious referee because the efficiency claims are specific and the task reframing is distinct enough to check against stronger baselines and independent verification of the OCR assumption.

Referee Report

3 major / 2 minor

Summary. The paper introduces spatially grounded contextual image generation, a task that renders textual instructions directly onto spatial masks to create a single unified visual input. It proposes UniVL, an encoder adapted from an OCR-pretrained backbone that produces a fused fVIL embedding capturing both semantics and spatial locations. A two-stage pipeline first aligns this embedding with VAE space and then conditions a pretrained diffusion model on it, eliminating the standalone text encoder (e.g., T5). On the authors' newly constructed UniVL-ImgGen benchmark of 477K mask-annotated images, the method reports improved image quality (FID reduced from 14 to 11, PSNR increased from 16 to 20) and efficiency gains (up to 52% fewer inference TFLOPs and up to 44% lower runtime) over text-prompted baselines.
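
For readers less familiar with the quality metric quoted here, the standard definition of the Fréchet Inception Distance is reproduced below. The symbols are the usual ones, not notation taken from the paper: μ_r, Σ_r are the mean and covariance of feature embeddings of the reference images and μ_g, Σ_g those of the generated images. This page does not state which feature extractor or reference set the authors used.

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^{2}
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Because both terms depend on the reference set, an FID of 11 on one benchmark is not directly comparable to an FID measured on another, which is part of why the benchmark's construction matters for interpreting the 14 to 11 improvement.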

Significance. If the central claims hold after verification, this work could meaningfully advance efficient diffusion-based generation by unifying vision-language conditioning into a single visual stream, removing the text encoder at inference and delivering substantial compute savings. The efficiency numbers are a notable strength if reproducible across backbones. However, the significance is limited by reliance on a self-constructed benchmark and the untested assumption that an OCR-pretrained model can reliably extract and spatially bind instructional semantics without information loss.

major comments (3)

[Abstract and §4 (Benchmark)] The performance claims rest on the newly constructed UniVL-ImgGen benchmark of 477K images, yet the manuscript provides no details on data sourcing, mask generation, annotation protocol, or controls for distribution shift relative to standard datasets. This is load-bearing because the reported FID/PSNR gains and efficiency improvements could be artifacts of benchmark construction rather than evidence for the unified paradigm.
[§3.1 (UniVL encoder)] The core assumption that an OCR-pretrained backbone can extract instructional semantics (e.g., “place a red cat here”) from text rendered on spatial masks and bind them to precise locations without loss of semantic content or spatial precision is unverified. Standard OCR pretraining optimizes for character detection, not semantic interpretation or sub-pixel alignment; without an ablation replacing the OCR backbone with a general vision encoder, it is unclear whether the two-stage alignment and diffusion conditioning actually succeed due to the unified fVIL representation.
[§5 (Experiments)] The abstract states that “additional ablation studies validate the contributions,” yet no quantitative ablation tables, error bars, or statistical tests on the FID/PSNR improvements or TFLOP reductions are described. The two-stage pipeline’s alignment loss and conditioning mechanism are load-bearing for the efficiency claims; missing these details prevents assessment of whether the 52% TFLOP reduction is robust or specific to the chosen diffusion backbone.
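
As an illustration of the kind of uncertainty estimate the last comment asks for, the sketch below computes a paired percentile-bootstrap confidence interval for the mean per-image PSNR difference between a method and a baseline. It is a generic statistical recipe, not the authors' protocol; the function name `paired_bootstrap_ci` and the scores in the usage lines are hypothetical placeholders, not results from the paper.

```python
import random
import statistics


def paired_bootstrap_ci(baseline, method, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of per-image (method - baseline).

    Returns (mean_difference, lower_bound, upper_bound).
    """
    if len(baseline) != len(method) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [m - b for b, m in zip(baseline, method)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)

    # Resample image indices with replacement and record each resample's mean.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    means.sort()

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper


# Hypothetical per-image PSNR scores, for illustration only.
baseline_psnr = [15.0, 16.5, 17.0, 15.5, 16.0, 16.2, 15.8, 16.9]
method_psnr = [19.0, 20.5, 20.0, 19.5, 21.0, 19.8, 20.2, 20.4]
mean_diff, lower, upper = paired_bootstrap_ci(baseline_psnr, method_psnr)
print(f"mean gain {mean_diff:.2f} dB, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Pairing by image matters here because both systems are evaluated on the same masks; an interval that excludes zero would support the claim that the gain is not a sampling artifact. FID is a set-level statistic, so it would need resampling of whole image sets rather than per-image differences.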

minor comments (2)

[Abstract] The acronym fVIL is introduced in the abstract without an explicit expansion or dimensionality specification; define it clearly on first use.
[Figures] Figure captions and method diagrams should explicitly label the rendered text-on-mask input and the two-stage alignment blocks for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments in detail below, providing clarifications and indicating where revisions will be made to the manuscript.

Referee: [Abstract and §4 (Benchmark)] The performance claims rest on the newly constructed UniVL-ImgGen benchmark of 477K images, yet the manuscript provides no details on data sourcing, mask generation, annotation protocol, or controls for distribution shift relative to standard datasets. This is load-bearing because the reported FID/PSNR gains and efficiency improvements could be artifacts of benchmark construction rather than evidence for the unified paradigm.

Authors: We agree that the manuscript would benefit from more comprehensive details on the UniVL-ImgGen benchmark construction. Although §4 introduces the benchmark, we acknowledge that additional information on data sourcing, the process for generating spatial masks, the annotation protocol, and measures to mitigate distribution shift is necessary for full reproducibility and to address concerns about potential artifacts. In the revised manuscript, we will expand §4 with a dedicated subsection providing these details, including the sources of the images, how masks were created and annotated, and comparisons to standard datasets like COCO or LAION to control for shifts.

revision: yes

Referee: [§3.1 (UniVL encoder)] The core assumption that an OCR-pretrained backbone can extract instructional semantics (e.g., “place a red cat here”) from text rendered on spatial masks and bind them to precise locations without loss of semantic content or spatial precision is unverified. Standard OCR pretraining optimizes for character detection, not semantic interpretation or sub-pixel alignment; without an ablation replacing the OCR backbone with a general vision encoder, it is unclear whether the two-stage alignment and diffusion conditioning actually succeed due to the unified fVIL representation.

Authors: We selected an OCR-pretrained backbone precisely because it is optimized for reading and localizing text within images, which directly supports our approach of rendering textual instructions onto spatial masks. This allows the model to process the unified visual input optically. However, we recognize the value of verifying this choice through ablation. We will add an ablation study in the revised §5, comparing the OCR backbone to a general-purpose vision encoder (e.g., a standard ViT) to demonstrate the specific benefits of OCR pretraining for semantic extraction and spatial binding in this task. This will help confirm that the fVIL embedding's effectiveness stems from the unified representation.

revision: yes

Referee: [§5 (Experiments)] The abstract states that “additional ablation studies validate the contributions,” yet no quantitative ablation tables, error bars, or statistical tests on the FID/PSNR improvements or TFLOP reductions are described. The two-stage pipeline’s alignment loss and conditioning mechanism are load-bearing for the efficiency claims; missing these details prevents assessment of whether the 52% TFLOP reduction is robust or specific to the chosen diffusion backbone.

Authors: We apologize for the omission of detailed quantitative results for the ablations in the current version. The manuscript mentions additional ablation studies, but we agree that including full tables with error bars, statistical significance tests, and breakdowns of the alignment loss and conditioning mechanism is essential. In the revised manuscript, we will include comprehensive ablation tables in §5, reporting quantitative results with standard deviations across multiple runs, and provide more details on how the two-stage pipeline contributes to the observed efficiency gains. This will allow better assessment of the robustness of the 52% TFLOP reduction.

revision: yes

Circularity Check

0 steps flagged

No circularity: the reported gains are empirical measurements on the authors' benchmark, not quantities forced by the model equations

The paper reframes conditioning by rendering text onto spatial masks and feeding the composite to an OCR-adapted encoder to produce fVIL embeddings, then aligns and conditions a diffusion model. All reported gains (FID 14→11, PSNR 16→20, TFLOP reductions) are presented as measured outcomes on the 477K-image UniVL-ImgGen benchmark the authors construct, not as quantities algebraically forced by the model definition or by self-citation chains. No equations equate a fitted parameter to a claimed prediction, no uniqueness theorem is imported from prior self-work, and the OCR backbone is treated as an external pretrained component rather than defined circularly. The reported results therefore do not reduce to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the ability of an OCR-pretrained model to read rendered text on masks and on the success of a two-stage alignment process whose details are not provided in the abstract.

axioms (2)

domain assumption: An OCR-pretrained backbone can accurately read and semantically interpret text rendered onto spatial masks in the context of image generation conditioning.
The framework adapts an optical-character-recognition-pretrained backbone to read the unified condition optically.
domain assumption: Aligning the UniVL embedding to the VAE latent space followed by diffusion conditioning preserves sufficient spatial and semantic information.
A two-stage pipeline first aligns UniVL with the VAE embedding space and then conditions a pretrained diffusion backbone entirely on UniVL embeddings.

invented entities (1)

UniVL embedding (fVIL) · no independent evidence
purpose: Single token sequence that fuses visual and semantic intent with spatial locations.
The UniVL encoder produces fVIL that fuses visual and semantic intent with spatial locations in a single token sequence.

pith-pipeline@v0.9.0 · 5847 in / 1595 out tokens · 40122 ms · 2026-05-22T09:21:25.305684+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

A two-stage pipeline first aligns UniVL with the VAE embedding space and then conditions a pretrained diffusion backbone entirely on UniVL embeddings, eliminating the standalone text encoder

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 15 internal anchors

[1]

Ming-omni: A unified multimodal model for perception and generation, 2025

Inclusion AI. Ming-omni: A unified multimodal model for perception and generation, 2025. URL https://arxiv.org/abs/2506.09344

work page arXiv 2025
[2]

Nougat: Neural Optical Understanding for Academic Documents

Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents.arXiv preprint arXiv:2308.13418, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions.arXiv preprint arXiv:2211.09800, 2023

work page arXiv 2023
[4]

Textdiffuser: Diffusion models as text painters

Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters. InNeurIPS, 2023

work page 2023
[5]

Anydoor: Zero- shot object-level image customization

Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero- shot object-level image customization. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[6]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Diffusion Models Beat GANs on Image Synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.arXiv preprint arXiv:2105.05233, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Demystifying flux architecture.arXiv preprint arXiv:2507.09595, 2025

Or Greenberg. Demystifying flux architecture.arXiv preprint arXiv:2507.09595, 2025

work page arXiv 2025
[9]

Prompt-to-Prompt Image Editing with Cross Attention Control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt- to-prompt image editing with cross-attention control.arXiv preprint arXiv:2208.01626, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[10]

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.arXiv preprint arXiv:2006.11239, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006
[11]

Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion

Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, and Qiang Xu. Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. InEuropean Conference on Computer Vision (ECCV), 2024

work page 2024
[12]

A style-based generator architecture for generative adversar- ial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversar- ial networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019

work page 2019
[13]

Musiq: Multi-scale image quality transformer

Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021

work page 2021
[14]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.arXiv preprint arXiv:2301.12597, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. pages 22511–22521, 2023. 12

work page 2023
[16]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Repaint: Inpainting using denoising diffusion probabilistic models

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. pages 11461–11471, 2022

work page 2022
[18]

Glyphdraw: Seamlessly rendering text with intricate spatial structures in text-to-image generation.arXiv preprint arXiv:2303.17870, 2023

Jian Ma, Mingjun Zhao, Chen Chen, Ruichen Wang, Di Niu, Haonan Lu, and Xiaodong Lin. Glyphdraw: Seamlessly rendering text with intricate spatial structures in text-to-image generation.arXiv preprint arXiv:2303.17870, 2023

work page arXiv 2023
[19]

Semantic image synthesis with spatially-adaptive normalization

Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2337–2346, 2019

work page 2019
[20]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

work page 2023
[21]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision.arXiv preprint arXiv:2103.00020, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[22]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text- conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[24]

URLhttps://arxiv.org/abs/2203.17189

work page arXiv
[25]

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High- resolution image synthesis with latent diffusion models.arXiv preprint arXiv:2112.10752, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

work page 2022
[27]

Ominicontrol: Min- imal and universal control for diffusion transformer

Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Min- imal and universal control for diffusion transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

work page 2025
[28]

Omni-video: Democra- tizing unified video understanding and generation.arXiv preprint arXiv:2507.06119, 2025

Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Mengping Yang, and Hao Li. Omni-video: Democra- tizing unified video understanding and generation.arXiv preprint arXiv:2507.06119, 2025

work page arXiv 2025
[29]

Anytext: Multilingual visual text generation and editing

Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text generation and editing.arXiv preprint arXiv:2311.03054, 2023. 13

work page arXiv 2023
[30]

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General ocr theory: Towards ocr-2.0 via a unified end-to-end model.arXiv preprint arXiv:2409.01704, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

DeepSeek-OCR: Contexts Optical Compression

Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr: Contexts optical compression.arXiv preprint arXiv:2510.18234, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Deepseek-ocr 2: Visual causal flow.arXiv preprint arXiv:2601.20552, 2026

Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr 2: Visual causal flow.arXiv preprint arXiv:2601.20552, 2026

work page arXiv 2026
[33]

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024.

[34]

Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. Pages 13294–13304, 2025.

[35]

Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[36]

Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. ImgEdit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025.

[37]

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.

[38]

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.

[39]

Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, and Kai Chen. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. In European Conference on Computer Vision (ECCV), 2024.

Appendix

We provide details omitted from the main text in the appendix:

• Section A (Implementation Details): full training hyperparameters, ...
Feature alignment ($\mathcal{L}_{\text{align}}$, $\lambda = 1.0$): per-token spatial fidelity between the UniVL embedding and the target VAE latent, $\lVert f_{\text{VL}} - z_0 \rVert_2^2$, where $z_0 = E(X)$.

CLIP image loss ($\mathcal{L}_{\text{clip-img}}$, $\lambda = 1.0$): enforces that UniVL features match CLIP vision patch features within the mask, $\lVert f_{\text{VL}} \odot m - f_{\text{CLIP}} \odot m \rVert_2$.

CLIP text loss ($\mathcal{L}_{\text{clip-txt}}$, $\lambda = 0.8$): cosine similarity between the projected pooled masked condition and the CLIP text embedding of the class name, $1 - \cos(g(\bar{c}_m), \mathrm{CLIP}_{\text{text}}(\ell))$, providing category-level semantic grounding. The L2 loss provides instance-level alignment (reconstruct this specific object), while the CLIP text loss provides category-level groundin...
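The three loss terms above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the mean reduction over tokens in the alignment term, the masked-mean pooling used for the pooled condition, the linear form of the projection g (here a matrix `proj`), and the weighted sum in `combined_loss` are all hypothetical choices. Only the three formulas and the weights 1.0, 1.0, and 0.8 come from the text. Features are assumed to already share a dimension, with shape (tokens, dim) and a binary mask of shape (tokens,).

```python
import numpy as np


def align_loss(f_vl, z0):
    """Squared L2 distance per token between embedding and VAE latent.

    Averaging over tokens is an assumption; the text gives only the norm.
    """
    diff = f_vl - z0
    return float(np.mean(np.sum(diff ** 2, axis=-1)))


def clip_image_loss(f_vl, f_clip, mask):
    """L2 norm of the feature difference restricted to masked tokens."""
    m = mask[:, None]  # broadcast the (tokens,) mask over the feature dim
    return float(np.linalg.norm(f_vl * m - f_clip * m))


def clip_text_loss(cond, mask, proj, text_emb, eps=1e-8):
    """One minus cosine similarity between g(pooled cond) and a text embedding.

    Pooling is a masked mean and g is a linear map; both are assumptions.
    """
    m = mask[:, None]
    pooled = (cond * m).sum(axis=0) / max(float(mask.sum()), eps)
    g = pooled @ proj
    denom = np.linalg.norm(g) * np.linalg.norm(text_emb) + eps
    return 1.0 - float(g @ text_emb / denom)


def combined_loss(f_vl, z0, f_clip, cond, mask, proj, text_emb,
                  weights=(1.0, 1.0, 0.8)):
    """Weighted sum using the lambda values from the text (sum is assumed)."""
    w_align, w_img, w_txt = weights
    return (w_align * align_loss(f_vl, z0)
            + w_img * clip_image_loss(f_vl, f_clip, mask)
            + w_txt * clip_text_loss(cond, mask, proj, text_emb))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens, dim = 16, 8
    f_vl = rng.normal(size=(tokens, dim))
    z0 = rng.normal(size=(tokens, dim))
    f_clip = rng.normal(size=(tokens, dim))
    mask = (rng.random(tokens) > 0.5).astype(float)
    proj = np.eye(dim)
    text_emb = rng.normal(size=dim)
    print(combined_loss(f_vl, z0, f_clip, f_vl, mask, proj, text_emb))
```

The split mirrors the distinction drawn in the text: the two L2 terms compare features position by position, while the text term compares a single pooled vector against a class-name embedding.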
