arxiv: 2605.20316 · v1 · pith:ADTN25QFnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI

FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision–Language Generation

Pith reviewed 2026-05-21 07:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords bidirectional vision-language · flow matching · LoRA adapters · text-to-image · parameter-efficient training · rectified flow · multimodal generation

The pith

FullFlow upgrades a pretrained text-to-image flow model to bidirectional vision-language generation by training only LoRA adapters and lightweight text heads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern text-to-image flow models already encode strong visual priors but can only generate images from text. FullFlow adds a discrete text insertion process and separate timesteps for images and text so that one backbone supports text-to-image, image-to-text, joint sampling, and partial-text prediction. The method trains roughly five percent of the parameters via LoRA, keeps the original continuous flow for images, and reports large gains in quality and efficiency over prior bidirectional approaches that require more retraining.
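As a rough illustration of the two processes named above (not the authors' implementation), the sketch below pairs a rectified-flow training target for a continuous vector with an insertion-style target for a token sequence. The function names, the straight-line interpolation convention (t = 0 is noise, t = 1 is data), and the keep-mask corruption are all assumptions made for this example.

```python
def rectified_flow_pair(x0, x1, t):
    """Continuous side: straight-line interpolation between noise x0 and data x1.

    Returns the interpolated point x_t and the constant velocity target x1 - x0.
    Convention assumed here: t = 0 is pure noise, t = 1 is clean data.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def insertion_targets(clean_tokens, keep_mask):
    """Discrete side: delete tokens where keep_mask is False (hypothetical corruption).

    Returns the partial sequence and, for every gap (before the first kept
    token, between kept tokens, after the last one), the tokens that would
    have to be inserted there to recover the clean sequence.
    """
    partial = []
    gaps = [[]]  # gaps[0] sits before the first kept token
    for token, keep in zip(clean_tokens, keep_mask):
        if keep:
            partial.append(token)
            gaps.append([])  # open a new gap after this kept token
        else:
            gaps[-1].append(token)
    return partial, gaps


x_t, v = rectified_flow_pair([0.0, 0.0], [2.0, 4.0], t=0.5)
partial, gaps = insertion_targets(
    ["a", "black", "cat", "on", "grass"],
    [True, False, True, False, False],
)
print(x_t, v)         # [1.0, 2.0] [2.0, 4.0]
print(partial, gaps)  # ['a', 'cat'] [[], ['black'], ['on', 'grass']]
```

In both cases the supervised quantity is local: a velocity at one point on the path, or the missing tokens at one gap in the sequence.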

Core claim

FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text-to-image, image-to-text, joint sampling, and partial-text prediction with a single backbone.
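One way to picture "trajectory selection in a two-dimensional generative space" is as a choice of path through (t, τ) pairs. The sketch below is only an illustration under stated assumptions: both times run from 0 (noise) to 1 (clean), the helper name `trajectory` is hypothetical, and the joint schedule reads the Figure 6 caption's warp as τ = t^(2^p), which is an interpretation of a garbled exponent rather than a confirmed formula.

```python
def trajectory(mode, steps=4, p=0.0):
    """Return a list of (t, tau) pairs for one sampling mode.

    t is image time and tau is text time, both assumed to run from
    0 (noise) to 1 (clean). A modality that is given as the condition
    is simply held at 1 for the whole path.
    """
    grid = [i / steps for i in range(steps + 1)]
    if mode == "text_to_image":
        # Text is the clean condition; only the image is denoised.
        return [(t, 1.0) for t in grid]
    if mode == "image_to_text":
        # Image is the clean condition; only the text is denoised.
        return [(1.0, tau) for tau in grid]
    if mode == "joint":
        # Assumed warp: tau = t ** (2 ** p). p = 0 moves both together,
        # p > 0 lets the image lead, p < 0 lets the text lead.
        return [(t, t ** (2.0 ** p)) for t in grid]
    raise ValueError(f"unknown mode: {mode}")


for mode, p in [("text_to_image", 0.0), ("image_to_text", 0.0), ("joint", 1.0)]:
    print(mode, trajectory(mode, steps=4, p=p))
```

Under these assumptions the three canonical modes differ only in which path is walked, which is consistent with the claim that a single backbone serves all of them.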

What carries the argument

LoRA adapters plus lightweight text heads on a pretrained rectified-flow backbone, using separate timesteps for the image and text modalities.
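For readers unfamiliar with LoRA, a minimal pure-Python sketch of a low-rank update on one frozen linear layer follows. The layer sizes, rank, and helper names are hypothetical and are not taken from SD3 or from the paper; the roughly five percent figure reported there covers the whole backbone plus text heads and is not reproduced here.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x), with W frozen and only A, B trainable.

    Shapes: W is d_out x d_in, A is r x d_in, B is d_out x r.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_in, d_out, rank):
    """Return (frozen, trainable) parameter counts for one adapted layer."""
    return d_in * d_out, rank * (d_in + d_out)


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen pretrained weight
A = [[1.0, 1.0]]               # rank-1 down-projection
B_zero = [[0.0], [0.0]]        # common LoRA init: B = 0, so output is unchanged
B_trained = [[1.0], [2.0]]     # hypothetical values after some training

print(lora_forward(W, A, B_zero, [1.0, 1.0]))     # [3.0, 7.0]
print(lora_forward(W, A, B_trained, [1.0, 1.0]))  # [5.0, 11.0]

frozen, trainable = lora_param_counts(1024, 1024, 16)
print(trainable / frozen)                         # 0.03125
```

With B initialised to zero the adapted layer reproduces the pretrained output exactly, which is one reason adapters of this kind can start from an intact visual prior.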

If this is right

  • Text-to-image FID drops from 62.7 to 31.6 while image-to-text CIDEr rises from 2.0 to 99.4 under identical trainable-parameter count and LoRA rank.
  • Peak VRAM falls from roughly 84 GB to 38 GB and throughput increases by a factor of eight on two RTX A5000 GPUs.
  • Training finishes in under 24 hours while updating only about five percent of backbone parameters.
  • The same recipe transfers directly to FLUX.1-dev and enables downstream VQA via partial-text generation.
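The headline numbers above can be restated as relative changes. The short calculation below uses only the values quoted in the list; the VRAM figures are approximate in the source, so the corresponding percentage is approximate too. The helper name is hypothetical.

```python
def relative_reduction(before, after):
    """Fraction by which a quantity fell, relative to its starting value."""
    return (before - after) / before


fid_drop = relative_reduction(62.7, 31.6)    # text-to-image FID, lower is better
vram_drop = relative_reduction(84.0, 38.0)   # peak VRAM in GB (approximate inputs)
cider_ratio = 99.4 / 2.0                     # image-to-text CIDEr, higher is better

print(f"FID reduction:  {fid_drop:.1%}")     # 49.6%
print(f"VRAM reduction: {vram_drop:.1%}")    # 54.8%
print(f"CIDEr ratio:    {cider_ratio:.1f}x") # 49.7x
```

A baseline CIDEr of 2.0 means the comparison model produced almost no usable captions, so the ratio says more about the baseline failing than about the size of a typical improvement.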

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Strong unimodal priors may reduce the need for full joint pretraining when building bidirectional models.
  • Similar lightweight adapter strategies could extend other single-direction generative models to new modalities.
  • The two-dimensional timestep space may support additional interactive tasks such as guided editing or progressive completion.

Load-bearing premise

The rich visual priors already present in a pretrained text-to-image backbone remain usable for bidirectional tasks when only LoRA adapters and lightweight text heads are added.

What would settle it

A controlled experiment that fully retrains the text pathway on the same data volume and measures whether text-to-image FID rises above 31.6 or peak memory exceeds 38 GB.

Figures

Figures reproduced from arXiv: 2605.20316 by Alessio Tonioni, Enis Simsar, Eric Tillmann Bill, Thomas Hofmann.

Figure 1: Overview of the multimodal generation capabilities after finetuning Stable Diffusion 3.
Figure 2: Dual-timestep space with image time t (x-axis) and text time τ (y-axis). All arrows, including the black ones, correspond to valid trajectories in the joint generative space. Colored trajectories highlight the canonical modes: text→image, image→text, and joint generation.
Figure 4: Evolution of the adaptive text-loss weight λ_txt. Thin: raw ratio estimate; thick: EMA used in training. A1: Trainable parameters and text-conditioning pathways. SD3 uses three frozen text encoders with distinct tokenizers, whereas our discrete text process is defined in a single T5 token space. We therefore diffuse T5 tokens, decode the partially corrupted sequence, and re-tokenize it for the remaining en…
Figure 5: CIDEr for alternating-clean (π_ac), mixed-corruption (π_ind), and five independent runs that switch from π_ind to π_ac at 75k, 100k, 125k, 150k, and 175k steps. Mean over a 5k held-out split; error bars show 95% CIs. A4: Which corruption schedule best supports both correspondence and deployment? We compare two corruption schedules: alternating-clean (π_ac), which keeps one modality clean and matches the endpoi…
Figure 6: Joint generation over τ = t^{2^p} trajectories. Color shows mean CLIP over 1k samples. The dual-timestep formulation also enables image and text to be sampled jointly rather than conditioning on one modality. We sweep trajectories by fixing image time to a linear schedule and setting text time to τ = t^{2^p}, with p controlling which modality denoises earlier
Figure 7: Continuous rectified flow (left) and discrete Edit Flows (right) instantiate the same path-based principle: learn local transport from a base distribution to data. In continuous space the supervised target is a velocity; in discrete space it is an insertion jump. The red arrow/token marks the quantity the model is trained to predict. B Training Details. B.1 Dataset Preparation and Pre-Encoding. We train on a…
Figure 8: Distribution of caption lengths in the training set, measured in T5 tokens after the LAION-COCO-Aesthetic filtering described in Section B. For readability the histogram drops the lowest and highest 1% of caption lengths and uses 60 bins; this distribution motivates our 77-token sequence cap. … contain 18 or more tokens; captions with fewer than 4 tokens are discarded. We sample 300 captions from each length…
Figure 9: Log-ratio between the expected MSE gradient norm (image branch, Theorem E.1) and the lower/upper bounds on the cross-entropy gradient norm (text branch, Theorem E.2), as a function of the agreement levels σ_img and σ_txt for latent dimension N = 4096. Negative regions, which dominate early training when text predictions are poor and the image prior is already accurate, indicate that the text-branch gradient …
Figure 10: MM-DiT block adapted from Stable Diffusion 3 for joint image–text generation. We add text-time conditioning τ alongside the original image-time conditioning t, with both timesteps embedded separately and injected through the existing AdaLN-style modulation interface. Yellow blocks denote LoRA-augmented linear layers; green blocks denote frozen auxiliary operations. F Architecture. Stable Diffusion 3. Our S…
Figure 11: Multimodal DiT architecture adapted from FLUX.1 [dev]. Noisy text tokens and image patches are processed jointly together with their modality-specific timesteps (τ, t) to predict cross-modal flows. Blue: frozen pretrained components; yellow: LoRA-augmented linear layers; red: newly added trainable modules (text head and timestep processors); green: auxiliary operations. … provide compatible inputs to each c…
Figure 12: Effect of image-to-text classifier-free guidance scale γ on caption quality and length. Mild guidance (γ ≈ 1.2) improves CIDEr and brings caption length closer to the reference; stronger guidance over-amplifies high-probability insertions in the reverse CTMC, producing overly verbose captions and degrading quality. We evaluate the effect of image-to-text CFG on caption generation in
Figure 13: Qualitative effect of the image-supervision target on text-to-image generation across training checkpoints (50k–200k steps). RF: standard rectified flow; Teacher-CT: clean-text teacher matching; Teacher-SN: same-noise teacher matching. RF accumulates distortion, Teacher-CT introduces transient artifacts and is unstable, while Teacher-SN preserves the visual prior throughout training and is therefore our d…
Figure 14: Extended corruption-schedule ablation from Section 5.2, showing four additional captioning metrics beyond CIDEr. Yellow: pure alternating-clean (π_ac); blue: pure mixed-corruption (π_ind); red: five independent runs that switch from π_ind to π_ac at 75k, 100k, 125k, 150k, and 175k steps. Solid lines: train; dashed lines: held-out validation. Shaded bands: 95% CI over 5k samples. Across all four metrics, the m…
Figure 15: Image-to-text captions from the SD3 FullFlow model on unseen LAION-Aesthetic images (part 1/2). Each tile shows the input image and the greedy caption sampled by the model.
Figure 16: Image-to-text captions from the SD3 FullFlow model on unseen LAION-Aesthetic images (part 2/2). Continuation of
Figure 17: Image-to-text captions from the FLUX.1 [dev] FullFlow model on unseen LAION-Aesthetic images, demonstrating that the same recipe transfers to a different rectified-flow backbone.
Figure 18: Text-to-image samples from the SD3 base model (top row) and our FullFlow SD3 finetune (bottom row) under identical prompts, seeds, and schedulers. Differences between rows isolate the effect of the multimodal uplift on the visual prior. Figure labels: A black cat on grass; A brown dog in the snow; A horse running in a field; A beautiful landscape with mountains; A futuristic city skyline at sunset; noise; clean.
Figure 19: Text-to-image samples from the FLUX.1 [dev] base model (top row) and our FullFlow FLUX.1 finetune (bottom row) under identical prompts, seeds, and schedulers, mirroring the SD3 comparison in
Figure 20: VQAv2 predictions from the SD3 FullFlow VQA finetune on unseen validation images. Each tile shows the input image, the question, and the model’s predicted answer.
Figure 21: Caption-length distribution under joint image–text sampling as a function of the trajectory parameter p in τ = t^{2^p}. Boxes show the interquartile range (Q1–Q3) with the median; whiskers extend to the min/max sample. Larger p (image-first) yields longer captions, consistent with the qualitative examples below. The image shows a wooden cutting board with a bowl of coffee, surrounded by various food ingred…
Figure 22: Joint image–text samples from FullFlow (seed 1/5) for p ∈ {−5, −2.5, 0, 2.5, 5}, shown left to right. Negative p denoises text first, p = 0 denoises both modalities together, and positive p denoises the image first.
Figure 23: Joint image–text samples from FullFlow (seed 2/5) for the same trajectory sweep as
Figure 24: Joint image–text samples from FullFlow (seed 3/5) for the same trajectory sweep as
Figure 25: Joint image–text samples from FullFlow (seed 4/5) for the same trajectory sweep as
Figure 26: Joint image–text samples from FullFlow (seed 5/5) for the same trajectory sweep as
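
The Figure 4 caption distinguishes a raw ratio estimate from the EMA used in training. A minimal sketch of that smoothing step follows, assuming the raw estimate is a ratio of image-branch to text-branch gradient norms (the caption does not say which quantities are divided) and using hypothetical decay and initial values.

```python
def raw_ratio(image_grad_norm, text_grad_norm, eps=1e-8):
    """Assumed raw estimate: image-branch over text-branch gradient norm."""
    return image_grad_norm / (text_grad_norm + eps)


def ema_weights(raw_ratios, decay=0.99, init=1.0):
    """Smooth a stream of raw ratio estimates with an exponential moving average.

    Returns the smoothed weight after each step; in this sketch the smoothed
    value would play the role of the text-loss weight.
    """
    weight = init
    history = []
    for ratio in raw_ratios:
        weight = decay * weight + (1.0 - decay) * ratio
        history.append(weight)
    return history


# A noisy raw signal centred near 2.0 is pulled in slowly by the EMA.
noisy = [2.4, 1.6, 2.2, 1.8, 2.0]
print(ema_weights(noisy, decay=0.9, init=1.0))
```

A high decay makes the weight track the long-run ratio while ignoring step-to-step noise, which matches the thin versus thick curves the caption describes.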
Original abstract

Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision–language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce FullFlow, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision–language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text→image, image→text, joint sampling, and partial-text prediction with a single backbone. On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text→image FID from 62.7 to 31.6 and image→text CIDEr from 2.0 to 99.4 over a LoRA equivalent following the previous SOTA formulation (Dual Diffusion) at matched wall-clock training time, while reducing peak VRAM from ~84 GB to ~38 GB and raising throughput by ~8× on two RTX A5000 GPUs in under 24 hours, training only ~5% of the backbone parameters. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation. These results show that strong bidirectional vision–language capability can be unlocked from pretrained text-to-image flow models without full multimodal pretraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces FullFlow, a parameter-efficient method to convert a pretrained rectified-flow text-to-image model (e.g., Stable Diffusion 3) into a bidirectional vision-language generator. It adds LoRA adapters and lightweight text heads while keeping images in continuous flow and introducing a discrete insertion process for text. Separate image and text timesteps enable text-to-image, image-to-text, joint sampling, and partial-text prediction with a single backbone. On SD3 with matched trainable parameters and LoRA rank, it reports improving text-to-image FID from 62.7 to 31.6 and image-to-text CIDEr from 2.0 to 99.4 versus a LoRA-adapted Dual Diffusion baseline at matched training time, alongside VRAM reduction from ~84 GB to ~38 GB and ~8x throughput gains, while training only ~5% of backbone parameters. The recipe is also shown to transfer to FLUX.1-dev and support downstream VQA.

Significance. If the reported gains are shown to stem from the architectural choices (separate timesteps and discrete text insertion) rather than baseline mismatches, the work would be significant for demonstrating that rich visual priors in existing flow-based T2I models can be extended to bidirectional capability without large-scale joint pretraining or full text-pathway retraining. The efficiency improvements in VRAM and throughput would further support practical adoption of unified vision-language models.

major comments (1)
  1. [Abstract and results] The central empirical claim rests on outperforming a matched LoRA-rank adaptation of Dual Diffusion, yet the baseline text-to-image FID of 62.7 on SD3 is markedly worse than typical literature values for this backbone (often below 30 even for lightly tuned models). This discrepancy suggests the baseline may not have received equivalent hyperparameter search or optimization effort, or may not have been correctly adapted to rectified flow and LoRA constraints. If so, the gains could reflect differences in training efficacy rather than FullFlow's separate timesteps and text insertion.
minor comments (2)
  1. [Method] The manuscript should clarify the precise definition and implementation of the discrete text insertion process and how it interacts with the continuous image flow during joint sampling.
  2. [Experiments] Additional ablations on the contribution of separate timesteps versus the text heads alone would strengthen the attribution of the bidirectional performance gains.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the concern regarding baseline performance below and clarify the experimental controls used to ensure fair comparison.

Point-by-point responses
  1. Referee: [Abstract and results] The central empirical claim rests on outperforming a matched LoRA-rank adaptation of Dual Diffusion, yet the baseline text-to-image FID of 62.7 on SD3 is markedly worse than typical literature values for this backbone (often below 30 even for lightly tuned models). This discrepancy suggests the baseline may not have received equivalent hyperparameter search or optimization effort, or may not have been correctly adapted to rectified flow and LoRA constraints. If so, the gains could reflect differences in training efficacy rather than FullFlow's separate timesteps and text insertion.

    Authors: We appreciate this observation and the opportunity to clarify our controls. The 62.7 FID value is obtained from our re-implementation of the Dual Diffusion formulation (adapted to LoRA on the SD3 rectified-flow backbone) trained for the same wall-clock time and with the same total trainable parameter count and LoRA rank as FullFlow. Our goal was a head-to-head comparison under identical resource constraints rather than an absolute state-of-the-art T2I benchmark. While we acknowledge that more extensive hyperparameter sweeps or longer training can yield lower FID numbers in the broader literature, such additional tuning would violate the matched-training-time protocol we adopted to isolate the effect of separate timesteps and discrete text insertion. We will expand the experimental section with further details on the baseline adaptation procedure, including the precise LoRA placement and optimization settings used for both methods, to make the equivalence explicit.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical recipe with independent baseline comparisons

full rationale

The paper advances an empirical method (LoRA adapters plus discrete text insertion on a pretrained rectified-flow backbone) and validates it via direct performance measurements against a matched-parameter LoRA baseline derived from prior Dual Diffusion work. No equations, predictions, or uniqueness claims are presented that reduce by construction to quantities defined inside the method itself. All reported gains (FID, CIDEr, VRAM, throughput) are external measurements on held-out benchmarks, not re-expressions of fitted parameters or self-citations. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5864 in / 1192 out tokens · 34607 ms · 2026-05-21T07:37:37.514352+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    We introduce FullFlow, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision–language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text→image FID from 62.7 to 31.6 and image→text CIDEr from 2.0 to 99.4

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 10 internal anchors

  1. [1]

    Inclusion AI, Tiwei Bie, Haoxing Chen, Tieyuan Chen, Zhenglin Cheng, Long Cui, Kai Gan, Zhicheng Huang, Zhenzhong Lan, Haoquan Li, et al. Llada2. 0-uni: Unifying multimodal under- standing and generation with diffusion large language model.arXiv preprint arXiv:2604.20796, 2026

  2. [2]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

  3. [3]

    Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023

  4. [4]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  5. [5]

    One transformer fits all distributions in multi-modal diffusion at scale

    Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi-modal diffusion at scale. InInternational Conference on Machine Learning, pages 1692–1717. PMLR, 2023

  6. [6]

    Optimal control meets flow matching: A principled route to multi-subject fidelity, 2025

    Eric Tillmann Bill, Enis Simsar, and Thomas Hofmann. Optimal control meets flow matching: A principled route to multi-subject fidelity, 2025

  7. [7]

    Gener- ative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design, 2024

    Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Gener- ative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design, 2024

  8. [8]

    Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models, 2023

    Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models, 2023

  9. [9]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  10. [10]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yiming Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Ji...

  11. [11]

    Fung, and Steven Hoi

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N. Fung, and Steven Hoi. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning.Advances in Neural Information Processing Systems, 36:49250– 49267, December 2023. URL https://proceedings.neurips.cc/paper_files/paper/ 2023/hash/9a6...

  12. [12]

    Scaling rectified flow transform- ers for high-resolution image synthesis, 2024

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis, 2024

  13. [13]

    Discrete flow matching, 2024

    Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky TQ Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching, 2024

  14. [14]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 10

  15. [15]

    Vizwiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018

  16. [16]

    Edit flows: Flow matching with edit operations, 2025

    Marton Havasi, Brian Karrer, Itai Gat, and Ricky TQ Chen. Edit flows: Flow matching with edit operations, 2025

  17. [17]

    Flowtok: Flowing seamlessly across text and image tokens, 2025

    Ju He, Qihang Yu, Qihao Liu, and Liang-Chieh Chen. Flowtok: Flowing seamlessly across text and image tokens, 2025

  18. [18]

    Prompt-to-prompt image editing with cross-attention control, 2023

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross-attention control, 2023

  19. [19]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium, 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium, 2017

  20. [20]

    Denoising diffusion probabilistic models, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020

  21. [21]

    Lora: Low-rank adaptation of large language models., 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models., 2022

  22. [22]

    T2i-compbench: A compre- hensive benchmark for open-world compositional text-to-image generation, 2023

    Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A compre- hensive benchmark for open-world compositional text-to-image generation, 2023

  23. [23]

    Rethinking fid: Towards a better evaluation metric for image generation, 2024

    Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking fid: Towards a better evaluation metric for image generation, 2024

  24. [24]

    Diffusion instruction tuning

    Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, and Philip Alexander Teare. Diffusion instruction tuning. InInternational Conference on Machine Learning, pages 28097– 28137. PMLR, 2025

  25. [25]

    Flux.https://github.com/black-forest-labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

  26. [26]

    Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023

    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023

  27. [27]

    Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models, 2024

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models, 2024

  28. [28]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022

  29. [29]

    Lavida: A large diffusion language model for multimodal understanding, 2025

    Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover. Lavida: A large diffusion language model for multimodal understanding, 2025

  30. [30]

    Omniflow: Any-to-any generation with multi-modal rectified flows, 2025

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Omniflow: Any-to-any generation with multi-modal rectified flows, 2025

  31. [31]

    Dual diffusion for unified image generation and understanding, 2025

    Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, and Peng Wang. Dual diffusion for unified image generation and understanding, 2025

  32. [32]

    Flow matching for generative modeling, 2023

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling, 2023

  33. [33]

    World model on million-length video and language with blockwise ringattention, 2025

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with blockwise ringattention, 2025

  34. [34]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 11

  35. [35]

    Flow straight and fast: Learning to generate and transfer data with rectified flow, 2023

    Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2023

  36. [36]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  37. [37]

    Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution, June 2024. URL http://arxiv.org/abs/2310.16834. arXiv:2310.16834 [stat]

  38. [38]

    Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7739–7751, 2025

  39. [39]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019

  40. [40]

    Conform: Contrast is all you need for high-fidelity text-to-image diffusion models, 2024

    Tuna Han Salih Meral, Enis Simsar, Federico Tombari, and Pinar Yanardag. Conform: Contrast is all you need for high-fidelity text-to-image diffusion models, 2024

  41. [41]

    John Nguyen, Marton Havasi, Tariq Berrada, Luke Zettlemoyer, and Ricky T. Q. Chen. OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows, October 2025. URL http://arxiv.org/abs/2510.03506. arXiv:2510.03506 [cs]

  42. [42]

    On aliased resizing and surprising subtleties in gan evaluation

    Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In CVPR, 2022

  43. [43]

    Scalable diffusion models with transformers, 2023

    William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023

  44. [44]

    Learning transferable visual models from natural language supervision, 2021

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision, 2021

  45. [45]

    High-resolution image synthesis with latent diffusion models, June 2022

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, June 2022

  46. [46]

    Simple and effective masked diffusion language models

    Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and Effective Masked Diffusion Language Models, November 2024. URL http://arxiv.org/abs/2406.07524. arXiv:2406.07524 [cs]

  47. [47]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models,...

  48. [48]

    Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model, 2025

    Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, et al. Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model, 2025

  49. [49]

    Unified multimodal discrete diffusion, 2025

    Alexander Swerdlow, Mihir Prabhudesai, Siddharth Gandhi, Deepak Pathak, and Katerina Fragkiadaki. Unified multimodal discrete diffusion, 2025

  50. [50]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-Modal Early-Fusion Foundation Models, March 2025. URL http://arxiv.org/abs/2405.09818. arXiv:2405.09818 [cs]

  51. [51]

    Cider: Consensus-based image description evaluation, 2015

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation, 2015

  52. [52]

    Fudoki: Discrete flow-based unified understanding and generation via kinetic-optimal velocities, 2025

    Jin Wang, Yao Lai, Aoxue Li, Shifeng Zhang, Jiacheng Sun, Ning Kang, Chengyue Wu, Zhenguo Li, and Ping Luo. Fudoki: Discrete flow-based unified understanding and generation via kinetic-optimal velocities, 2025

  53. [53]

    Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo. Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation, October 2024. URL http://arxiv.org/abs/2410.13848. arXiv:2410.13848 [cs]

  54. [54]

    Show-o: One single transformer to unify multimodal understanding and generation, 2025

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation, 2025. URL https://openreview.net/forum?id=o6Ynz6OIQ6

  55. [55]

    Versatile diffusion: Text, images and variations all in one diffusion model

    Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile Diffusion: Text, Images and Variations All in One Diffusion Model, January 2024. URL http://arxiv.org/abs/2211.08332. arXiv:2211.08332 [cs]

  56. [56]

    MMaDA: Multimodal large diffusion language models, 2026

    Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. MMaDA: Multimodal large diffusion language models, 2026. URL https://openreview.net/forum?id=wczmXLuLGd

  57. [57]

    Llada-v: Large language diffusion models with visual instruction tuning, 2025

    Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. Llada-v: Large language diffusion models with visual instruction tuning, 2025

  58. [58]

    Scaling autoregressive multi-modal models: Pretraining and instruction tuning

    Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, Candace Ross, Adam Polyak, Russell Howes, Vasu Sharma, Puxin Xu, Hovhannes Tamoyan, Oron Ashual, Uriel Singer, Shang-Wen Li, Susan Zhang, Richard James, Gargi Ghosh, Yaniv Taigman, Maryam Fazel-Zarandi, Asli Celikyilmaz...

  59. [59]

    Dimple: Discrete diffusion multimodal large language model with parallel decoding

    Runpeng Yu, Xinyin Ma, and Xinchao Wang. Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding, May 2025. URL http://arxiv.org/abs/2505.16990. arXiv:2505.16990 [cs]

  60. [60]

    Bertscore: Evaluating text generation with bert, 2020

    Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert, 2020. URL https://openreview.net/forum?id=SkeHuCVFDr

  61. [61]

    Transfusion: Predict the next token and diffuse images with one multi-modal model, 2025

    Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model, 2025. URL https://openreview.net/forum?id=SI2hI0frk6

  62. [62]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

Appendix A Flow Matching: Similarity Between Continuous and Discrete

Despite their apparent differences, continuous rectified flow and discrete Edit Flows in...

T5 branch (diffusion alphabet): feed $\tilde{y}_\tau$ directly to the frozen T5 encoder to obtain token embeddings for the shared transformer.

CLIP branches (auxiliary conditioning): decode $\tilde{y}_\tau$ to a string $\tilde{s}_\tau = \mathrm{decode}_{\mathrm{T5}}(\tilde{y}_\tau)$ (dropping special tokens), then re-tokenize $\tilde{s}_\tau$ with each CLIP tokenizer and feed the resulting IDs to the corresponding frozen CLIP encoders (including pooled embeddings). This yields a simple, deterministic, and cheap mapping between encoder stacks while keeping all text e...
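
The decode-then-re-tokenize mapping described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `ToyTokenizer` and `bridge_t5_to_clip` are hypothetical names, the whitespace tokenizers and their vocabularies are invented stand-ins for the real T5 and CLIP tokenizers, and the frozen encoder forward passes are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for a T5 or CLIP tokenizer."""
    vocab: dict
    special: set = field(default_factory=set)
    unk: str = "<unk>"

    def encode(self, text):
        # Map each whitespace-separated word to its ID, falling back to <unk>.
        return [self.vocab.get(w, self.vocab[self.unk]) for w in text.split()]

    def decode(self, ids, skip_special_tokens=True):
        inverse = {i: w for w, i in self.vocab.items()}
        words = [inverse[i] for i in ids]
        if skip_special_tokens:
            words = [w for w in words if w not in self.special]
        return " ".join(words)


def bridge_t5_to_clip(t5_ids, t5_tok, clip_toks):
    """Decode partial T5 IDs to a string, then re-tokenize it for each CLIP branch."""
    # s_tau = decode_T5(y_tau), dropping special tokens such as padding or masks.
    s_tau = t5_tok.decode(t5_ids, skip_special_tokens=True)
    # Each CLIP tokenizer sees the same plain string; the mapping is deterministic.
    return s_tau, [tok.encode(s_tau) for tok in clip_toks]


# Invented vocabularies, for illustration only.
t5_tok = ToyTokenizer(
    vocab={"<pad>": 0, "<mask>": 1, "<unk>": 2, "a": 3, "red": 4, "car": 5},
    special={"<pad>", "<mask>"},
)
clip_a = ToyTokenizer(vocab={"<unk>": 0, "a": 10, "red": 11, "car": 12})
clip_b = ToyTokenizer(vocab={"<unk>": 0, "a": 20, "red": 21, "car": 22})

y_tau = [3, 1, 5, 0, 0]  # partial text: "a <mask> car <pad> <pad>"
s_tau, clip_ids = bridge_t5_to_clip(y_tau, t5_tok, [clip_a, clip_b])
print(s_tau, clip_ids)
```

In a real pipeline the two toy CLIP tokenizers would be replaced by the tokenizers paired with the frozen CLIP encoders, and the returned IDs would be passed to those encoders to obtain both token-level and pooled embeddings.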