FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision–Language Generation

Alessio Tonioni; Enis Simsar; Eric Tillmann Bill; Thomas Hofmann

arxiv: 2605.20316 · v1 · pith:ADTN25QFnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI


Eric Tillmann Bill, Enis Simsar, Alessio Tonioni, Thomas Hofmann

Pith reviewed 2026-05-21 07:37 UTC · model grok-4.3


keywords: bidirectional vision-language · flow matching · LoRA adapters · text-to-image · parameter-efficient training · rectified flow · multimodal generation


The pith

FullFlow upgrades a pretrained text-to-image flow model to bidirectional vision-language generation by training only LoRA adapters and lightweight text heads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern text-to-image flow models already encode strong visual priors but can only generate images from text. FullFlow adds a discrete text insertion process and separate timesteps for images and text so that one backbone supports text-to-image, image-to-text, joint sampling, and partial-text prediction. The method trains roughly five percent of the parameters via LoRA, keeps the original continuous flow for images, and reports large gains in quality and efficiency over prior bidirectional approaches that require more retraining.

Core claim

FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text-to-image, image-to-text, joint sampling, and partial-text prediction with a single backbone.
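
One way to picture "trajectory selection" is as a path through the unit square of (image time, text time). The sketch below is illustrative only. It assumes the convention that time 0 is fully noised and time 1 is clean, and the function name `canonical_trajectory` and its mode strings are hypothetical, not taken from the paper.

```python
from typing import List, Tuple


def canonical_trajectory(mode: str, n_steps: int = 4) -> List[Tuple[float, float]]:
    """Return (t, tau) pairs, where t is image time and tau is text time.

    Assumed convention (not confirmed by the source): 0.0 = noise, 1.0 = clean.
    """
    grid = [i / n_steps for i in range(n_steps + 1)]
    if mode == "text_to_image":
        # Text is held clean while the image is denoised.
        return [(s, 1.0) for s in grid]
    if mode == "image_to_text":
        # Image is held clean while the text is denoised.
        return [(1.0, s) for s in grid]
    if mode == "joint":
        # Both modalities are denoised at the same rate.
        return [(s, s) for s in grid]
    raise ValueError(f"unknown mode: {mode}")


path = canonical_trajectory("text_to_image", n_steps=2)
```

Under this reading, each task differs only in which edge or diagonal of the square the sampler walks along; the backbone is the same in every case.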

What carries the argument

LoRA adapters plus lightweight text heads on a pretrained rectified-flow backbone, using separate timesteps for the image and text modalities.
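
For readers unfamiliar with LoRA, the following is a minimal pure-Python sketch of the standard low-rank update on one frozen linear layer. It is not the paper's implementation: the helper names, the toy dimensions, and the `alpha` scaling value are all assumptions chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: List[float], alpha: float) -> List[float]:
    """Compute y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) is frozen; A (r x d_in) and B (d_out x r) are trainable.
    """
    r = len(a)
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [y + (alpha / r) * u for y, u in zip(base, update)]


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA pair."""
    return rank * (d_in + d_out)


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (toy identity)
a = [[0.5, 0.5]]               # rank-1 down-projection
b_zero = [[0.0], [0.0]]        # zero-initialised up-projection
y_init = lora_forward(w, a, b_zero, [2.0, 3.0], alpha=1.0)  # equals W x
```

With `B` initialised to zero the adapted layer reproduces the frozen layer exactly, which is the usual reason LoRA fine-tuning starts from the pretrained behaviour.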

If this is right

Relative to a LoRA-adapted Dual Diffusion baseline with identical trainable-parameter count and LoRA rank, text-to-image FID drops from 62.7 to 31.6 while image-to-text CIDEr rises from 2.0 to 99.4.
Peak VRAM falls from roughly 84 GB to roughly 38 GB, and throughput rises by roughly a factor of eight, on two RTX A5000 GPUs.
Training finishes in under 24 hours while updating only about five percent of backbone parameters.
The same recipe transfers directly to FLUX.1-dev and enables downstream VQA via partial-text generation.
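
As a quick sanity check on the headline numbers, the relative changes can be worked out directly. The inputs below are the figures quoted above (the VRAM values are approximate in the source); the helper name `relative_change` is only for illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Signed fractional change from `before` to `after`."""
    return (after - before) / before


fid_change = relative_change(62.7, 31.6)    # lower FID is better
vram_change = relative_change(84.0, 38.0)   # approximate peak VRAM in GB
cider_ratio = 99.4 / 2.0                    # higher CIDEr is better

print(f"FID change:  {fid_change * 100:.1f}%")
print(f"VRAM change: {vram_change * 100:.1f}%")
print(f"CIDEr ratio: {cider_ratio:.1f}x")
```

So FID is roughly halved, peak memory falls by a bit more than half, and CIDEr rises by a factor of about fifty, which is the scale of gap the editorial analysis treats as needing scrutiny of the baseline.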

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Strong unimodal priors may reduce the need for full joint pretraining when building bidirectional models.
Similar lightweight adapter strategies could extend other single-direction generative models to new modalities.
The two-dimensional timestep space may support additional interactive tasks such as guided editing or progressive completion.

Load-bearing premise

The rich visual priors already present in a pretrained text-to-image backbone remain usable for bidirectional tasks when only LoRA adapters and lightweight text heads are added.

What would settle it

A controlled experiment that fully retrains the text pathway on the same data volume and measures whether text-to-image FID rises above 31.6 or peak memory exceeds 38 GB.

Figures

Figures reproduced from arXiv: 2605.20316 by Alessio Tonioni, Enis Simsar, Eric Tillmann Bill, Thomas Hofmann.

**Figure 2.** Dual-timestep space with image time t (x-axis) and text time τ (y-axis). All arrows, including the black ones, correspond to valid trajectories in the joint generative space. Colored trajectories highlight the canonical modes: text→image, image→text, and joint generation.

**Figure 4.** Evolution of the adaptive text-loss weight λtxt. Thin: raw ratio estimate; thick: EMA used in training. A1: Trainable parameters and text-conditioning pathways. SD3 uses three frozen text encoders with distinct tokenizers, whereas our discrete text process is defined in a single T5 token space. We therefore diffuse T5 tokens, decode the partially corrupted sequence, and re-tokenize it for the remaining en…
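
The caption distinguishes a raw ratio estimate from the exponential moving average (EMA) that is actually used as the loss weight. A generic EMA can be sketched as below. The decay value, the initialisation, and whatever quantity the raw ratio measures are assumptions here; the caption does not specify them.

```python
from typing import List


def ema_update(prev: float, raw: float, decay: float) -> float:
    """One EMA step: keep `decay` of the old state, blend in the new raw value."""
    return decay * prev + (1.0 - decay) * raw


def smooth(raw_values: List[float], decay: float = 0.99) -> List[float]:
    """Return the EMA trace of a sequence of raw estimates."""
    state = raw_values[0]  # assumption: initialise with the first raw estimate
    trace = []
    for raw in raw_values:
        state = ema_update(state, raw, decay)
        trace.append(state)
    return trace


trace = smooth([2.0, 4.0], decay=0.5)
```

A decay close to 1 gives the slowly varying thick curve; a decay close to 0 would track the noisy thin curve almost exactly.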

**Figure 5.** CIDEr for alternating-clean (πac), mixed-corruption (πind), and five independent runs that switch from πind to πac at 75k, 100k, 125k, 150k, and 175k steps. Mean over a 5k held-out split; error bars show 95% CIs. A4: Which corruption schedule best supports both correspondence and deployment? We compare two corruption schedules: alternating-clean (πac), which keeps one modality clean and matches the endpoi…

**Figure 6.** Joint generation over τ = t^(2^p) trajectories. Color shows mean CLIP over 1k samples. The dual-timestep formulation also enables image and text to be sampled jointly rather than conditioning on one modality. We sweep trajectories by fixing image time to a linear schedule and setting text time to τ = t^(2^p), with p controlling which modality denoises earlier.
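
The joint-sampling sweep can be sketched in a few lines. This reads the caption's schedule as τ = t^(2^p), which is an interpretation of the extracted formula, and it assumes times run from 0 (noise) to 1 (clean). The function names are hypothetical.

```python
from typing import List, Tuple


def text_time(t: float, p: float) -> float:
    """Map image time t in [0, 1] to text time tau = t ** (2 ** p)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return t ** (2.0 ** p)


def joint_schedule(p: float, n_steps: int = 4) -> List[Tuple[float, float]]:
    """Linear image-time grid paired with the warped text time."""
    return [(i / n_steps, text_time(i / n_steps, p)) for i in range(n_steps + 1)]


diagonal = joint_schedule(0.0, n_steps=2)   # p = 0 gives tau == t
image_first = text_time(0.5, 1.0)           # tau lags behind t
text_first = text_time(0.25, -1.0)          # tau runs ahead of t
```

Under the stated convention, positive p keeps τ below t in the interior of the interval, so the text lags and the image denoises first; negative p does the opposite. That matches the qualitative description given for the p sweep in the later joint-sample figures.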

**Figure 7.** Continuous rectified flow (left) and discrete Edit Flows (right) instantiate the same path-based principle: learn local transport from a base distribution to data. In continuous space the supervised target is a velocity; in discrete space it is an insertion jump. The red arrow/token marks the quantity the model is trained to predict.
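
The two kinds of supervised target named in this caption can be illustrated with toy helpers. This is a sketch under stated assumptions, not the paper's training code: the continuous part uses the common rectified-flow convention in which `x0` is noise, `x1` is data, and the path is a straight line; the discrete part is only bookkeeping that lists which tokens are missing from each gap of a partially deleted sequence. All function names are hypothetical.

```python
from typing import List, Sequence


def interpolate(x0: Sequence[float], x1: Sequence[float], t: float) -> List[float]:
    """Straight-line path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0: Sequence[float], x1: Sequence[float]) -> List[float]:
    """Constant velocity along the straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def insertion_targets(tokens: Sequence[str], keep: Sequence[bool]) -> List[List[str]]:
    """For each gap around the kept tokens, list the tokens that are missing there."""
    gaps: List[List[str]] = [[]]
    for tok, kept in zip(tokens, keep):
        if kept:
            gaps.append([])       # a kept token closes the current gap
        else:
            gaps[-1].append(tok)  # a dropped token belongs to the current gap
    return gaps


x_mid = interpolate([0.0, 2.0], [4.0, 6.0], 0.5)
v = velocity_target([0.0, 2.0], [4.0, 6.0])
missing = insertion_targets(
    ["a", "black", "cat", "on", "grass"],
    [True, False, True, False, False],
)
```

In the continuous case the model regresses a vector; in the discrete case it would instead score which token to insert at which position, which is why the two branches use different losses.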

**Figure 8.** Distribution of caption lengths in the training set, measured in T5 tokens after the LAION-COCO-Aesthetic filtering described in Section B. For readability the histogram drops the lowest and highest 1% of caption lengths and uses 60 bins; this distribution motivates our 77-token sequence cap. … contain 18 or more tokens; captions with fewer than 4 tokens are discarded. We sample 300 captions from each length…

**Figure 9.** Log-ratio between the expected MSE gradient norm (image branch, Theorem E.1) and the lower/upper bounds on the cross-entropy gradient norm (text branch, Theorem E.2), as a function of the agreement levels σimg and σtxt for latent dimension N = 4096. Negative regions, which dominate early training when text predictions are poor and the image prior is already accurate, indicate that the text-branch gradient …

**Figure 10.** MM-DiT block adapted from Stable Diffusion 3 for joint image–text generation. We add text-time conditioning τ alongside the original image-time conditioning t, with both timesteps embedded separately and injected through the existing AdaLN-style modulation interface. Yellow blocks denote LoRA-augmented linear layers; green blocks denote frozen auxiliary operations.

**Figure 11.** Multimodal DiT architecture adapted from FLUX.1 [dev]. Noisy text tokens and image patches are processed jointly together with their modality-specific timesteps (τ, t) to predict cross-modal flows. Blue: frozen pretrained components; yellow: LoRA-augmented linear layers; red: newly added trainable modules (text head and timestep processors); green: auxiliary operations.

**Figure 12.** Effect of image-to-text classifier-free guidance scale γ on caption quality and length. Mild guidance (γ ≈ 1.2) improves CIDEr and brings caption length closer to the reference; stronger guidance over-amplifies high-probability insertions in the reverse CTMC, producing overly verbose captions and degrading quality.
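
For orientation, the usual classifier-free guidance rule extrapolates from an unconditional prediction toward a conditional one by a scale γ. The sketch below applies that generic rule to a list of scores. How the paper applies guidance to insertion rates in the reverse CTMC is not stated in this caption, so treat the function and its inputs as assumptions.

```python
from typing import List, Sequence


def cfg_combine(cond: Sequence[float], uncond: Sequence[float], gamma: float) -> List[float]:
    """Generic CFG: uncond + gamma * (cond - uncond)."""
    return [u + gamma * (c - u) for c, u in zip(cond, uncond)]


cond_scores = [2.0, 0.0]     # hypothetical image-conditioned scores
uncond_scores = [1.0, 0.0]   # hypothetical unconditional scores

plain = cfg_combine(cond_scores, uncond_scores, gamma=1.0)    # recovers cond
guided = cfg_combine(cond_scores, uncond_scores, gamma=1.5)   # pushes past cond
```

With γ = 1 the conditional prediction is returned unchanged; larger γ amplifies whatever the condition already favours, which is consistent with the caption's observation that strong guidance over-amplifies high-probability insertions.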

**Figure 13.** Qualitative effect of the image-supervision target on text-to-image generation across training checkpoints (50k–200k steps). RF: standard rectified flow; Teacher-CT: clean-text teacher matching; Teacher-SN: same-noise teacher matching. RF accumulates distortion, Teacher-CT introduces transient artifacts and is unstable, while Teacher-SN preserves the visual prior throughout training and is therefore our d…

**Figure 14.** Extended corruption-schedule ablation from Section 5.2, showing four additional captioning metrics beyond CIDEr. Yellow: pure alternating-clean (πac); blue: pure mixed-corruption (πind); red: five independent runs that switch from πind to πac at 75k, 100k, 125k, 150k, and 175k steps. Solid lines: train; dashed lines: held-out validation. Shaded bands: 95% CI over 5k samples. Across all four metrics, the m…

**Figure 15.** Image-to-text captions from the SD3 FullFlow model on unseen LAION-Aesthetic images (part 1/2). Each tile shows the input image and the greedy caption sampled by the model.

**Figure 16.** Image-to-text captions from the SD3 FullFlow model on unseen LAION-Aesthetic images (part 2/2). Continuation of …

**Figure 17.** Image-to-text captions from the FLUX.1 [dev] FullFlow model on unseen LAION-Aesthetic images, demonstrating that the same recipe transfers to a different rectified-flow backbone.

**Figure 18.** Text-to-image samples from the SD3 base model (top row) and our FullFlow SD3 finetune (bottom row) under identical prompts, seeds, and schedulers. Differences between rows isolate the effect of the multimodal uplift on the visual prior. Panel prompts: "A black cat on grass"; "A brown dog in the snow"; "A horse running in a field"; "A beautiful landscape with mountains"; "A futuristic city skyline at sunset".

**Figure 19.** Text-to-image samples from the FLUX.1 [dev] base model (top row) and our FullFlow FLUX.1 finetune (bottom row) under identical prompts, seeds, and schedulers, mirroring the SD3 comparison in …

**Figure 20.** VQAv2 predictions from the SD3 FullFlow VQA finetune on unseen validation images. Each tile shows the input image, the question, and the model’s predicted answer.

**Figure 21.** Caption-length distribution under joint image–text sampling as a function of the trajectory parameter p in τ = t^(2^p). Boxes show the interquartile range (Q1–Q3) with the median; whiskers extend to the min/max sample. Larger p (image-first) yields longer captions, consistent with the qualitative examples below.

**Figure 22.** Joint image–text samples from FullFlow (seed 1/5) for p ∈ {−5, −2.5, 0, 2.5, 5}, shown left to right. Negative p denoises text first, p = 0 denoises both modalities together, and positive p denoises the image first.

**Figure 23.** Joint image–text samples from FullFlow (seed 2/5) for the same trajectory sweep as …

**Figure 24.** Joint image–text samples from FullFlow (seed 3/5) for the same trajectory sweep as …

**Figure 25.** Joint image–text samples from FullFlow (seed 4/5) for the same trajectory sweep as …

**Figure 26.** Joint image–text samples from FullFlow (seed 5/5) for the same trajectory sweep as …

Original abstract

Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision–language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce FullFlow, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision–language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text→image, image→text, joint sampling, and partial-text prediction with a single backbone. On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text→image FID from 62.7 to 31.6 and image→text CIDEr from 2.0 to 99.4 over a LoRA equivalent following the previous SOTA formulation (Dual Diffusion) at matched wall-clock training time, while reducing peak VRAM from ~84 GB to ~38 GB and raising throughput by ~8× on two RTX A5000 GPUs in under 24 hours, training only ~5% of the backbone parameters. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation. These results show that strong bidirectional vision–language capability can be unlocked from pretrained text-to-image flow models without full multimodal pretraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FullFlow adds separate timesteps and discrete text insertion to turn pretrained flow T2I models bidirectional with LoRA, but the big reported gains rest on a suspiciously weak baseline.


The main point is that this paper shows how to bolt bidirectional vision-language generation onto an existing rectified-flow text-to-image model by training only LoRA adapters plus lightweight text heads, keeping the strong image prior intact instead of retraining everything jointly. Separate image and text timesteps plus a discrete insertion step for text let the same backbone handle text-to-image, image-to-text, joint sampling, and partial-text tasks. That recipe looks new relative to prior unified models that do full retraining or heavy text-pathway updates. The efficiency numbers (peak VRAM dropping from roughly 84 GB to roughly 38 GB, and about 8x throughput on two A5000s while training about 5% of parameters) are practical and worth noting for anyone trying to reuse SD3 or FLUX backbones on limited hardware. The transfer to FLUX and the VQA downstream use case also show the approach is not tied to one model family.

The central results claim clear wins over a matched LoRA version of Dual Diffusion, with FID improving from 62.7 to 31.6 and CIDEr from 2.0 to 99.4 at matched training time. Those deltas are large enough to matter if they hold up. The soft spot is the baseline itself. An FID of 62.7 on SD3 under LoRA adaptation is markedly worse than what most people see even with light tuning on that backbone, which suggests the Dual Diffusion comparator may not have received equivalent hyperparameter effort or correct adaptation to the flow setting. If the baseline is under-optimized, part of the reported improvement could come from better training practice rather than the separate-timestep and insertion design alone. Without seeing the full methods, ablations, and exact baseline code, it is hard to separate those effects.

This work is aimed at people building efficient multimodal systems who already have strong pretrained flow models and want to avoid large-scale joint pretraining. Readers focused on parameter-efficient adaptation or unified VL generation will find the concrete recipe and efficiency measurements useful. The paper deserves a serious referee because the core idea is distinct and the efficiency claims are testable, even though the current comparisons need closer scrutiny on the baseline side. I would send it to review with instructions to check whether the Dual Diffusion LoRA was given the same optimization budget and whether the gains survive stronger controls.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces FullFlow, a parameter-efficient method to convert a pretrained rectified-flow text-to-image model (e.g., Stable Diffusion 3) into a bidirectional vision-language generator. It adds LoRA adapters and lightweight text heads while keeping images in continuous flow and introducing a discrete insertion process for text. Separate image and text timesteps enable text-to-image, image-to-text, joint sampling, and partial-text prediction with a single backbone. On SD3 with matched trainable parameters and LoRA rank, it reports improving text-to-image FID from 62.7 to 31.6 and image-to-text CIDEr from 2.0 to 99.4 versus a LoRA-adapted Dual Diffusion baseline at matched training time, alongside VRAM reduction from ~84 GB to ~38 GB and ~8x throughput gains, while training only ~5% of backbone parameters. The recipe is also shown to transfer to FLUX.1-dev and support downstream VQA.

Significance. If the reported gains are shown to stem from the architectural choices (separate timesteps and discrete text insertion) rather than baseline mismatches, the work would be significant for demonstrating that rich visual priors in existing flow-based T2I models can be extended to bidirectional capability without large-scale joint pretraining or full text-pathway retraining. The efficiency improvements in VRAM and throughput would further support practical adoption of unified vision-language models.

major comments (1)

[Abstract and results] Abstract and results section: The central empirical claim rests on outperforming a matched LoRA-rank adaptation of Dual Diffusion, yet the baseline text-to-image FID of 62.7 on SD3 is markedly worse than typical literature values for this backbone (often below 30 even for lightly tuned models). This discrepancy indicates the baseline may not have received equivalent hyperparameter search, optimization effort, or correct adaptation to rectified flow and LoRA constraints, which risks attributing gains to FullFlow's separate timesteps and text insertion rather than differences in training efficacy.

minor comments (2)

[Method] The manuscript should clarify the precise definition and implementation of the discrete text insertion process and how it interacts with the continuous image flow during joint sampling.
[Experiments] Additional ablations on the contribution of separate timesteps versus the text heads alone would strengthen the attribution of the bidirectional performance gains.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the concern regarding baseline performance below and clarify the experimental controls used to ensure fair comparison.


Referee: [Abstract and results] Abstract and results section: The central empirical claim rests on outperforming a matched LoRA-rank adaptation of Dual Diffusion, yet the baseline text-to-image FID of 62.7 on SD3 is markedly worse than typical literature values for this backbone (often below 30 even for lightly tuned models). This discrepancy indicates the baseline may not have received equivalent hyperparameter search, optimization effort, or correct adaptation to rectified flow and LoRA constraints, which risks attributing gains to FullFlow's separate timesteps and text insertion rather than differences in training efficacy.

Authors: We appreciate this observation and the opportunity to clarify our controls. The 62.7 FID value is obtained from our re-implementation of the Dual Diffusion formulation (adapted to LoRA on the SD3 rectified-flow backbone) trained for the same wall-clock time and with the same total trainable parameter count and LoRA rank as FullFlow. Our goal was a head-to-head comparison under identical resource constraints rather than an absolute state-of-the-art T2I benchmark. While we acknowledge that more extensive hyperparameter sweeps or longer training can yield lower FID numbers in the broader literature, such additional tuning would violate the matched-training-time protocol we adopted to isolate the effect of separate timesteps and discrete text insertion. We will expand the experimental section with further details on the baseline adaptation procedure, including the precise LoRA placement and optimization settings used for both methods, to make the equivalence explicit.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical recipe with independent baseline comparisons


The paper advances an empirical method (LoRA adapters plus discrete text insertion on a pretrained rectified-flow backbone) and validates it via direct performance measurements against a matched-parameter LoRA baseline derived from prior Dual Diffusion work. No equations, predictions, or uniqueness claims are presented that reduce by construction to quantities defined inside the method itself. All reported gains (FID, CIDEr, VRAM, throughput) are external measurements on held-out benchmarks, not re-expressions of fitted parameters or self-citations. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We introduce FullFlow, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision–language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text→image FID from 62.7 to 31.6 and image→text CIDEr from 2.0 to 99.4"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 10 internal anchors

[1]

Inclusion AI, Tiwei Bie, Haoxing Chen, Tieyuan Chen, Zhenglin Cheng, Long Cui, Kai Gan, Zhicheng Huang, Zhenzhong Lan, Haoquan Li, et al. Llada2. 0-uni: Unifying multimodal under- standing and generation with diffusion large language model.arXiv preprint arXiv:2604.20796, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

work page 2022
[3]

Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023

work page 2023
[4]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

One transformer fits all distributions in multi-modal diffusion at scale

Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi-modal diffusion at scale. InInternational Conference on Machine Learning, pages 1692–1717. PMLR, 2023

work page 2023
[6]

Optimal control meets flow matching: A principled route to multi-subject fidelity, 2025

Eric Tillmann Bill, Enis Simsar, and Thomas Hofmann. Optimal control meets flow matching: A principled route to multi-subject fidelity, 2025

work page 2025
[7]

Gener- ative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design, 2024

Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Gener- ative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design, 2024

work page 2024
[8]

Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models, 2023

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models, 2023

work page 2023
[9]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yiming Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Ji...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Fung, and Steven Hoi

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N. Fung, and Steven Hoi. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning.Advances in Neural Information Processing Systems, 36:49250– 49267, December 2023. URL https://proceedings.neurips.cc/paper_files/paper/ 2023/hash/9a6...

work page 2023
[12]

Scaling rectified flow transform- ers for high-resolution image synthesis, 2024

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis, 2024

work page 2024
[13]

Discrete flow matching, 2024

Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky TQ Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching, 2024

work page 2024
[14]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 10

work page 2017
[15]

Vizwiz grand challenge: Answering visual questions from blind people

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018

work page 2018
[16]

Edit flows: Flow matching with edit operations, 2025

Marton Havasi, Brian Karrer, Itai Gat, and Ricky TQ Chen. Edit flows: Flow matching with edit operations, 2025

work page 2025
[17]

Flowtok: Flowing seamlessly across text and image tokens, 2025

Ju He, Qihang Yu, Qihao Liu, and Liang-Chieh Chen. Flowtok: Flowing seamlessly across text and image tokens, 2025

work page 2025
[18]

Prompt-to-prompt image editing with cross-attention control, 2023

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross-attention control, 2023

work page 2023
[19]

Gans trained by a two time-scale update rule converge to a local nash equilibrium, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium, 2017

work page 2017
[20]

Denoising diffusion probabilistic models, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020

work page 2020
[21]

Lora: Low-rank adaptation of large language models., 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models., 2022

work page 2022
[22]

T2i-compbench: A compre- hensive benchmark for open-world compositional text-to-image generation, 2023

Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A compre- hensive benchmark for open-world compositional text-to-image generation, 2023

work page 2023
[23]

Rethinking fid: Towards a better evaluation metric for image generation, 2024

Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking fid: Towards a better evaluation metric for image generation, 2024

work page 2024
[24]

Diffusion instruction tuning

Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, and Philip Alexander Teare. Diffusion instruction tuning. InInternational Conference on Machine Learning, pages 28097– 28137. PMLR, 2025

work page 2025
[25]

Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

[26]

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. OBELICS: An open web-scale filtered dataset of interleaved image-text documents, 2023.

[27]

Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models, 2024.

[28]

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022.

[29]

Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover. LaViDa: A large diffusion language model for multimodal understanding, 2025.

[30]

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. OmniFlow: Any-to-any generation with multi-modal rectified flows, 2025.

[31]

Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, and Peng Wang. Dual diffusion for unified image generation and understanding, 2025.

[32]

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling, 2023.

[33]

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with blockwise RingAttention, 2025.

[34]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

[35]

Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2023.

[36]

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[37]

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution, June 2024. URL http://arxiv.org/abs/2310.16834. arXiv:2310.16834 [stat].

[38]

Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. JanusFlow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7739–7751, 2025.

[39]

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019.

[40]

Tuna Han Salih Meral, Enis Simsar, Federico Tombari, and Pinar Yanardag. Conform: Contrast is all you need for high-fidelity text-to-image diffusion models, 2024.

[41]

John Nguyen, Marton Havasi, Tariq Berrada, Luke Zettlemoyer, and Ricky T. Q. Chen. OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows, October 2025. URL http://arxiv.org/abs/2510.03506. arXiv:2510.03506 [cs].

[42]

Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in GAN evaluation. In CVPR, 2022.

[43]

William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023.

[44]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision, 2021.

[45]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, June 2022.

[46]

Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and Effective Masked Diffusion Language Models, November 2024. URL http://arxiv.org/abs/2406.07524. arXiv:2406.07524 [cs].

[47]

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models,...

[48]

Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, et al. Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model, 2025.

[49]

Alexander Swerdlow, Mihir Prabhudesai, Siddharth Gandhi, Deepak Pathak, and Katerina Fragkiadaki. Unified multimodal discrete diffusion, 2025.

[50]

Chameleon Team. Chameleon: Mixed-Modal Early-Fusion Foundation Models, March 2025. URL http://arxiv.org/abs/2405.09818. arXiv:2405.09818 [cs].

[51]

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation, 2015.

[52]

Jin Wang, Yao Lai, Aoxue Li, Shifeng Zhang, Jiacheng Sun, Ning Kang, Chengyue Wu, Zhenguo Li, and Ping Luo. Fudoki: Discrete flow-based unified understanding and generation via kinetic-optimal velocities, 2025.

[53]

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo. Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation, October 2024. URL http://arxiv.org/abs/2410.13848. arXiv:2410.13848 [cs].

[54]

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation, 2025. URL https://openreview.net/forum?id=o6Ynz6OIQ6

[55]

Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile Diffusion: Text, Images and Variations All in One Diffusion Model, January 2024. URL http://arxiv.org/abs/2211.08332. arXiv:2211.08332 [cs].

[56]

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. MMaDA: Multimodal large diffusion language models, 2026. URL https://openreview.net/forum?id=wczmXLuLGd

[57]

Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. LLaDA-V: Large language diffusion models with visual instruction tuning, 2025.

[58]

Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, Candace Ross, Adam Polyak, Russell Howes, Vasu Sharma, Puxin Xu, Hovhannes Tamoyan, Oron Ashual, Uriel Singer, Shang-Wen Li, Susan Zhang, Richard James, Gargi Ghosh, Yaniv Taigman, Maryam Fazel-Zarandi, Asli Celikyilmaz...

[59]

Runpeng Yu, Xinyin Ma, and Xinchao Wang. Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding, May 2025. URL http://arxiv.org/abs/2505.16990. arXiv:2505.16990 [cs].

[60]

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT, 2020. URL https://openreview.net/forum?id=SkeHuCVFDr

[61]

Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model, 2025. URL https://openreview.net/forum?id=SI2hI0frk6

[62]

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Appendix A Flow Matching: Similarity Between Continuous and Discrete

Despite their apparent differences, continuous rectified flow and discrete Edit Flows in...

T5 branch (diffusion alphabet): feed $\tilde{y}_\tau$ directly to the frozen T5 encoder to obtain token embeddings for the shared transformer.

CLIP branches (auxiliary conditioning): decode $\tilde{y}_\tau$ to a string $\tilde{s}_\tau = \mathrm{decode}_{\mathrm{T5}}(\tilde{y}_\tau)$ (dropping special tokens), then re-tokenize $\tilde{s}_\tau$ with each CLIP tokenizer and feed the resulting IDs to the corresponding frozen CLIP encoders (including pooled embeddings). This yields a simple, deterministic, and cheap mapping between encoder stacks while keeping all text e...
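The decode-then-retokenize bridge between the T5 and CLIP branches can be illustrated with a small sketch. This is not the authors' implementation: the vocabularies, the special-token set, the whitespace-based CLIP tokenizer, the padding choice, and the helper names `decode_t5`, `tokenize_clip`, and `bridge` are all hypothetical stand-ins chosen so the example runs without any model downloads.

```python
# Toy sketch of the T5 -> string -> CLIP re-tokenization bridge.
# Every vocabulary and helper below is a hypothetical stand-in,
# not the real T5 or CLIP tokenizer.

T5_VOCAB = {0: "<pad>", 1: "</s>", 2: "a", 3: "red", 4: "cat", 5: "<extra_id_0>"}
T5_SPECIAL = {"<pad>", "</s>", "<extra_id_0>"}

CLIP_VOCAB = {"<|startoftext|>": 0, "<|endoftext|>": 1, "a": 2, "red": 3, "cat": 4}
CLIP_UNK = 5  # assumed id for words missing from the toy CLIP vocabulary


def decode_t5(token_ids):
    """Map T5 token IDs to a plain string, dropping special tokens."""
    words = [T5_VOCAB[i] for i in token_ids]
    return " ".join(w for w in words if w not in T5_SPECIAL)


def tokenize_clip(text, max_len=8):
    """Re-tokenize a string into fixed-length CLIP-style IDs.

    Padding with the end-of-text id is an assumption of this sketch.
    """
    bos, eos = CLIP_VOCAB["<|startoftext|>"], CLIP_VOCAB["<|endoftext|>"]
    ids = [bos] + [CLIP_VOCAB.get(w, CLIP_UNK) for w in text.split()]
    ids = ids[: max_len - 1] + [eos]  # truncate, then close the sequence
    return ids + [eos] * (max_len - len(ids))


def bridge(noisy_t5_ids):
    """Deterministically map one noisy T5 state to inputs for both branches."""
    text = decode_t5(noisy_t5_ids)
    return {
        "t5_ids": list(noisy_t5_ids),     # T5 branch: IDs are fed as-is
        "text": text,                     # intermediate decoded string
        "clip_ids": tokenize_clip(text),  # CLIP branch: re-tokenized string
    }


out = bridge([2, 3, 5, 4, 1, 0])
print(out["text"], out["clip_ids"])
```

Because the mapping goes through a plain string, it needs no learned alignment between the two vocabularies, which is consistent with the description of the mapping as deterministic and cheap.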
