FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision–Language Generation
Pith reviewed 2026-05-21 07:37 UTC · model grok-4.3
The pith
FullFlow upgrades a pretrained text-to-image flow model to bidirectional vision-language generation by training only LoRA adapters and lightweight text heads.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text-to-image, image-to-text, joint sampling, and partial-text prediction with a single backbone.
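One way to picture "trajectory selection" is as a choice of path through the unit square of (image timestep, text timestep) pairs. The sketch below is illustrative only. It assumes a convention in which 0 means fully noised (or empty text) and 1 means clean data; the function name `timestep_path` and the task labels are hypothetical and do not come from the paper.

```python
from typing import List, Tuple


def timestep_path(task: str, steps: int) -> List[Tuple[float, float]]:
    """Return (t_img, t_txt) pairs visited during one sampling run.

    Assumed convention (not from the paper): 0.0 = pure noise / empty text,
    1.0 = clean data. A modality held at 1.0 acts as fixed conditioning.
    """
    if steps < 1:
        raise ValueError("steps must be >= 1")
    grid = [i / steps for i in range(steps + 1)]  # 0.0, ..., 1.0
    if task == "text_to_image":
        return [(t, 1.0) for t in grid]  # text stays clean, image is generated
    if task == "image_to_text":
        return [(1.0, t) for t in grid]  # image stays clean, text is generated
    if task == "joint":
        return [(t, t) for t in grid]    # both modalities advance together
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    for name in ("text_to_image", "image_to_text", "joint"):
        path = timestep_path(name, steps=4)
        print(name, path[0], "->", path[-1])
```

Under this reading, each task differs only in which edge or diagonal of the square is traversed. Partial-text prediction would presumably start the text coordinate from a partially filled sequence; it is left out here because the summary does not specify how that path is parameterised.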
What carries the argument
LoRA adapters plus lightweight text heads on a pretrained rectified-flow backbone, using separate timesteps for the image and text modalities.
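For readers unfamiliar with LoRA, the following is a minimal pure-Python sketch of the standard low-rank update, y = Wx + (alpha / r) · B(Ax), where the pretrained weight W stays frozen and only A and B are trained. It is not FullFlow's implementation: the helper names, the toy matrices, and the layer sizes in the usage note are assumptions chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix,
                 x: List[float], alpha: float) -> List[float]:
    """Compute y = W x + (alpha / r) * B (A x).

    w: frozen pretrained weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    r = len(a)  # LoRA rank equals the number of rows of A
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [y + (alpha / r) * u for y, u in zip(base, update)]


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Number of trainable parameters LoRA adds to one linear layer."""
    return r * (d_in + d_out)


if __name__ == "__main__":
    w = [[1.0, 2.0], [3.0, 4.0]]
    a = [[1.0, 1.0]]        # rank 1
    b_zero = [[0.0], [0.0]]  # B = 0 leaves the pretrained output unchanged
    print(lora_forward(w, a, b_zero, [1.0, 1.0], alpha=2.0))
    print(lora_param_count(4096, 4096, 16))  # hypothetical layer size
```

With B set to zero, the adapted layer reproduces the pretrained output exactly, which is why a LoRA-adapted model can start from the backbone's existing behaviour. For a hypothetical 4096 × 4096 layer at rank 16, the adapter adds 131,072 parameters against 16,777,216 frozen ones.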
If this is right
- Relative to a LoRA-adapted Dual Diffusion baseline on SD3, text-to-image FID improves from 62.7 to 31.6 and image-to-text CIDEr rises from 2.0 to 99.4 under identical trainable-parameter count, matched LoRA rank, and matched wall-clock training time.
- Peak VRAM falls from roughly 84 GB to roughly 38 GB and throughput increases roughly eightfold on two RTX A5000 GPUs.
- Training finishes in under 24 hours while updating only about five percent of backbone parameters.
- The same recipe transfers directly to FLUX.1-dev and enables downstream VQA via partial-text generation.
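As a quick sanity check on the size of these gains, the reported figures can be turned into relative changes. The snippet uses only the numbers quoted above; the VRAM values are approximate in the source, so the resulting percentage is approximate as well. The helper name `relative_change` is illustrative.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`."""
    return (after - before) / before


# Figures quoted in the abstract (baseline -> FullFlow).
fid_change = relative_change(62.7, 31.6)    # FID: lower is better
vram_change = relative_change(84.0, 38.0)   # peak VRAM in GB, approximate
cider_ratio = 99.4 / 2.0                    # CIDEr: higher is better

print(f"FID change:       {fid_change:+.1%}")
print(f"Peak VRAM change: {vram_change:+.1%}")
print(f"CIDEr ratio:      {cider_ratio:.1f}x")
```

The FID roughly halves and peak memory drops by a little more than half, while the CIDEr ratio is large mainly because the baseline score of 2.0 is close to zero.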
Where Pith is reading between the lines
- Strong unimodal priors may reduce the need for full joint pretraining when building bidirectional models.
- Similar lightweight adapter strategies could extend other single-direction generative models to new modalities.
- The two-dimensional timestep space may support additional interactive tasks such as guided editing or progressive completion.
Load-bearing premise
The rich visual priors already present in a pretrained text-to-image backbone remain usable for bidirectional tasks when only LoRA adapters and lightweight text heads are added.
What would settle it
A controlled experiment that fully retrains the text pathway on the same data volume and measures whether text-to-image FID rises above 31.6 or peak memory exceeds 38 GB.
Original abstract
Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision--language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce \emph{FullFlow}, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision--language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text$\rightarrow$image, image$\rightarrow$text, joint sampling, and partial-text prediction with a single backbone. On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text$\rightarrow$image FID from $62.7$ to $31.6$ and image$\rightarrow$text CIDEr from $2.0$ to $99.4$ over a LoRA equivalent following the previous SOTA formulation (Dual Diffusion) at matched wall-clock training time, while reducing peak VRAM from ${\sim}84$\,GB to ${\sim}38$\,GB and raising throughput by ${\sim}8\times$ on two RTX A5000 GPUs in under 24 hours, training only ${\sim}5\%$ of the backbone parameters. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation. These results show that strong bidirectional vision--language capability can be unlocked from pretrained text-to-image flow models without full multimodal pretraining.
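To make the idea of separate image and text timesteps concrete, here is a toy sketch of how a training pair might be corrupted independently in each modality. It rests on loud assumptions that the abstract does not confirm: the image side uses rectified-flow style linear interpolation with t = 1 meaning clean data, and the text side is corrupted by deleting tokens, so that the reverse process consists of insertions. All function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def corrupt_image(x_clean: Sequence[float], noise: Sequence[float],
                  t_img: float) -> List[float]:
    """Linear interpolation between noise (t_img=0) and data (t_img=1)."""
    return [(1.0 - t_img) * n + t_img * x for x, n in zip(x_clean, noise)]


def corrupt_text(tokens: Sequence[str], t_txt: float,
                 rng: random.Random) -> List[str]:
    """Keep each token independently with probability t_txt, preserving order.

    Assumption: the model would then learn to insert the missing tokens.
    """
    return [tok for tok in tokens if rng.random() < t_txt]


def corrupt_pair(x_clean: Sequence[float], tokens: Sequence[str],
                 t_img: float, t_txt: float,
                 seed: int = 0) -> Tuple[List[float], List[str]]:
    """Corrupt an (image latent, caption) pair with independent timesteps."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in x_clean]
    return corrupt_image(x_clean, noise, t_img), corrupt_text(tokens, t_txt, rng)


if __name__ == "__main__":
    latent = [0.5, -1.0, 2.0]
    caption = ["a", "red", "cat", "on", "a", "sofa"]
    # Clean text with a noised image: the text-to-image corner.
    print(corrupt_pair(latent, caption, t_img=0.0, t_txt=1.0))
    # Clean image with empty text: the image-to-text corner.
    print(corrupt_pair(latent, caption, t_img=1.0, t_txt=0.0))
```

Because the two timesteps are drawn independently, a single training example can land anywhere in the two-dimensional space, including the corners that correspond to the conditional tasks.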
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FullFlow, a parameter-efficient method to convert a pretrained rectified-flow text-to-image model (e.g., Stable Diffusion 3) into a bidirectional vision-language generator. It adds LoRA adapters and lightweight text heads while keeping images in continuous flow and introducing a discrete insertion process for text. Separate image and text timesteps enable text-to-image, image-to-text, joint sampling, and partial-text prediction with a single backbone. On SD3 with matched trainable parameters and LoRA rank, it reports improving text-to-image FID from 62.7 to 31.6 and image-to-text CIDEr from 2.0 to 99.4 versus a LoRA-adapted Dual Diffusion baseline at matched training time, alongside VRAM reduction from ~84 GB to ~38 GB and ~8x throughput gains, while training only ~5% of backbone parameters. The recipe is also shown to transfer to FLUX.1-dev and support downstream VQA.
Significance. If the reported gains are shown to stem from the architectural choices (separate timesteps and discrete text insertion) rather than baseline mismatches, the work would be significant for demonstrating that rich visual priors in existing flow-based T2I models can be extended to bidirectional capability without large-scale joint pretraining or full text-pathway retraining. The efficiency improvements in VRAM and throughput would further support practical adoption of unified vision-language models.
major comments (1)
- [Abstract and results] The central empirical claim rests on outperforming a matched LoRA-rank adaptation of Dual Diffusion, yet the baseline text-to-image FID of 62.7 on SD3 is markedly worse than typical literature values for this backbone (often below 30 even for lightly tuned models). This discrepancy indicates the baseline may not have received equivalent hyperparameter search, optimization effort, or correct adaptation to rectified flow and LoRA constraints, which risks attributing gains to FullFlow's separate timesteps and text insertion rather than differences in training efficacy.
minor comments (2)
- [Method] The manuscript should clarify the precise definition and implementation of the discrete text insertion process and how it interacts with the continuous image flow during joint sampling.
- [Experiments] Additional ablations on the contribution of separate timesteps versus the text heads alone would strengthen the attribution of the bidirectional performance gains.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the concern regarding baseline performance below and clarify the experimental controls used to ensure fair comparison.
Point-by-point responses
- Referee: [Abstract and results] The central empirical claim rests on outperforming a matched LoRA-rank adaptation of Dual Diffusion, yet the baseline text-to-image FID of 62.7 on SD3 is markedly worse than typical literature values for this backbone (often below 30 even for lightly tuned models). This discrepancy indicates the baseline may not have received equivalent hyperparameter search, optimization effort, or correct adaptation to rectified flow and LoRA constraints, which risks attributing gains to FullFlow's separate timesteps and text insertion rather than differences in training efficacy.
Authors: We appreciate this observation and the opportunity to clarify our controls. The 62.7 FID value is obtained from our re-implementation of the Dual Diffusion formulation (adapted to LoRA on the SD3 rectified-flow backbone) trained for the same wall-clock time and with the same total trainable parameter count and LoRA rank as FullFlow. Our goal was a head-to-head comparison under identical resource constraints rather than an absolute state-of-the-art T2I benchmark. While we acknowledge that more extensive hyperparameter sweeps or longer training can yield lower FID numbers in the broader literature, such additional tuning would violate the matched-training-time protocol we adopted to isolate the effect of separate timesteps and discrete text insertion. We will expand the experimental section with further details on the baseline adaptation procedure, including the precise LoRA placement and optimization settings used for both methods, to make the equivalence explicit.
Revision: partial
Circularity Check
No circularity: empirical recipe with independent baseline comparisons
The paper advances an empirical method (LoRA adapters plus discrete text insertion on a pretrained rectified-flow backbone) and validates it via direct performance measurements against a matched-parameter LoRA baseline derived from prior Dual Diffusion work. No equations, predictions, or uniqueness claims are presented that reduce by construction to quantities defined inside the method itself. All reported gains (FID, CIDEr, VRAM, throughput) are external measurements on held-out benchmarks, not re-expressions of fitted parameters or self-citations. The derivation chain is therefore self-contained and non-circular.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We introduce FullFlow, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision–language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text→image FID from 62.7 to 31.6 and image→text CIDEr from 2.0 to 99.4
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Appendix A. Flow Matching: Similarity Between Continuous and Discrete
Despite their apparent differences, continuous rectified flow and discrete Edit Flows in...
- T5 branch (diffusion alphabet): feed $\tilde{y}_\tau$ directly to the frozen T5 encoder to obtain token embeddings for the shared transformer.
- CLIP branches (auxiliary conditioning): decode $\tilde{y}_\tau$ to a string $\tilde{s}_\tau = \mathrm{decode}_{\mathrm{T5}}(\tilde{y}_\tau)$ (dropping special tokens), then re-tokenize $\tilde{s}_\tau$ with each CLIP tokenizer and feed the resulting IDs to the corresponding frozen CLIP encoders (including pooled embeddings). This yields a simple, deterministic, and cheap mapping between encoder stacks while keeping all text e...
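The decode-then-re-tokenize bridge described for the CLIP branches can be sketched with toy tokenizers. Everything in the block is a stand-in: `ToyTokenizer` is a hypothetical whitespace tokenizer, not the real T5 or CLIP tokenizer, and `bridge_t5_to_clip` is an illustrative name. The point is only to show the mapping from one vocabulary to others via a plain string.

```python
from typing import Dict, List


class ToyTokenizer:
    """Whitespace tokenizer standing in for a real T5 or CLIP tokenizer."""

    def __init__(self, words: List[str], specials: List[str]):
        self.specials = set(specials)
        self.vocab: Dict[str, int] = {
            w: i for i, w in enumerate(specials + words)
        }
        self.inverse = {i: w for w, i in self.vocab.items()}
        self.unk = len(self.vocab)  # id used for out-of-vocabulary words

    def encode(self, text: str) -> List[int]:
        return [self.vocab.get(w, self.unk) for w in text.split()]

    def decode(self, ids: List[int], skip_special: bool = True) -> str:
        words = [self.inverse.get(i, "<unk>") for i in ids]
        if skip_special:
            words = [w for w in words if w not in self.specials]
        return " ".join(words)


def bridge_t5_to_clip(t5_ids: List[int], t5_tok: ToyTokenizer,
                      clip_toks: Dict[str, ToyTokenizer]) -> Dict[str, List[int]]:
    """Decode T5-style ids to a string, then re-tokenize for each CLIP branch."""
    text = t5_tok.decode(t5_ids, skip_special=True)  # drops special tokens
    return {name: tok.encode(text) for name, tok in clip_toks.items()}


if __name__ == "__main__":
    t5 = ToyTokenizer(["a", "red", "cat"], ["<pad>", "</s>"])
    clip_a = ToyTokenizer(["cat", "red", "a"], ["<bos>"])
    clip_b = ToyTokenizer(["red", "a", "cat"], ["<bos>", "<eos>"])
    ids = t5.encode("a red cat") + [t5.vocab["</s>"], t5.vocab["<pad>"]]
    print(bridge_t5_to_clip(ids, t5, {"clip_a": clip_a, "clip_b": clip_b}))
```

Going through a string keeps the mapping deterministic and independent of how the vocabularies are numbered, at the cost of one decode and one encode per CLIP branch.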