Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 05:39 UTC · model grok-4.3
The pith
Embedding images directly into text instructions at semantic positions improves consistency in multi-image generation
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
INSET is a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, it leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. A scalable data engine synthesizes 15M high-quality interleaved samples from standard image and video datasets.
What carries the argument
Positioning visual features directly at semantic slots within the textual token sequence to leverage transformer contextual locality for precise object binding.
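The core mechanism can be sketched as a splice of projected image features into the text embedding sequence at their semantic slots. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the function name `interleave_at_slots`, the shapes, and the slot convention are all hypothetical.

```python
import numpy as np

def interleave_at_slots(text_emb, image_feats, slots):
    """Splice projected image features into a text embedding sequence.

    text_emb:    (T, d) text token embeddings
    image_feats: list of (n_i, d) projected visual feature blocks, one per image
    slots:       ascending insertion positions into the text sequence,
                 one per image (the "semantic slot" of each image)
    Returns the interleaved (T + sum n_i, d) sequence fed to the transformer.
    """
    assert len(image_feats) == len(slots)
    pieces, prev = [], 0
    for feats, pos in zip(image_feats, slots):
        pieces.append(text_emb[prev:pos])  # text tokens up to this slot
        pieces.append(feats)               # image tokens placed at the slot
        prev = pos
    pieces.append(text_emb[prev:])         # trailing text tokens
    return np.concatenate(pieces, axis=0)
```

Because the image tokens sit adjacent to the text that describes them, ordinary local attention can bind object descriptions to the right visual features without long-range hops, which is the locality argument the paper makes.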
If this is right
- Multi-image consistency and text alignment outperform state-of-the-art methods on InterleaveBench.
- Performance gaps widen as input complexity and number of images increase.
- The model extends to multimodal image editing by integrating visual content directly into instructions.
- A data engine can scalably produce 15M high-quality interleaved samples from standard datasets.
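The data-engine claim above can be illustrated with a minimal sketch of how interleaved samples might be assembled from VLM-captioned segments. The segment schema and the `<img:id>` placeholder convention are assumptions for illustration; the paper's actual data format is not described here.

```python
def build_interleaved_sample(segments):
    """Assemble one interleaved instruction from (kind, value) segments.

    segments: list of ("text", str) or ("image", str) pairs, e.g. produced
    by captioning crops of a source image (hypothetical schema).
    Returns the instruction string with <img:id> placeholders at the
    semantic slots, plus the ordered list of referenced image ids.
    """
    parts, image_ids = [], []
    for kind, value in segments:
        if kind == "image":
            parts.append(f"<img:{value}>")  # placeholder marking the slot
            image_ids.append(value)
        else:
            parts.append(value)
    return " ".join(parts), image_ids
```

At training time, each placeholder would be replaced by that image's projected feature tokens, so the text surrounding a slot directly describes the image inserted there.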
Where Pith is reading between the lines
- The locality approach may support longer interleaved sequences without extra alignment modules.
- Similar token placement could extend to other modalities such as audio segments or video frames.
- Data synthesis techniques might generate training sets for related tasks like video generation.
Load-bearing premise
Positioning visual features directly at their corresponding semantic slots in the transformer leverages contextual locality for precise object binding without creating new alignment or training instabilities.
What would settle it
A test showing that multi-image consistency or text alignment fails to improve, or that object binding errors increase, when visual features are placed at semantic slots in complex interleaved prompts rather than processed as separated image-text inputs.
original abstract
While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes INSET (Images in Sentences), a unified visual generation model that integrates images directly into textual instructions by embedding visual features at corresponding semantic positions within the transformer sequence. This approach aims to leverage contextual locality for improved multi-image consistency and text alignment. The authors also present a data synthesis pipeline that generates 15 million high-quality interleaved instruction samples using VLMs and LLMs from existing image and video datasets. They evaluate the model on InterleaveBench, claiming significant outperformance over state-of-the-art methods, particularly as instruction complexity increases, and note its extension to multimodal image editing.
Significance. If the empirical claims hold, this work could advance multimodal generation by offering a more integrated way to handle complex interleaved instructions, reducing reliance on long-range dependencies through direct visual token placement. The scalable synthetic data engine for creating 15M interleaved samples is a clear strength that could support future research in the area. The architectural choice to treat images as dense language tokens is conceptually interesting and could unify generation and editing tasks if the integration proves stable.
major comments (2)
- [§4 (Experiments)] §4 (Experiments/Evaluation): The central claim that INSET significantly outperforms SOTA methods on InterleaveBench with widening gaps at higher complexity is presented only qualitatively in the abstract and evaluation section. No specific quantitative metrics (e.g., exact scores, tables of results), baseline details, ablation studies, or error analysis are provided, which is load-bearing for assessing whether the gains are due to the proposed insertion method versus data scale or training.
- [§3 (Method)] §3 (Method, visual feature insertion): The core assumption that positioning visual features directly at semantic slots leverages contextual locality for precise object binding without creating alignment or training instabilities is not supported by any ablations on projection layers, dimension handling, attention entropy, or loss variance across modalities. This is critical because the largest claimed gains occur at high complexity, where any projection-induced misalignment would be expected to degrade multi-image consistency.
minor comments (2)
- [Abstract / §1] The acronym expansion for INSET is given in the abstract but could be restated more explicitly in the introduction for readers who skip the abstract.
- [§4 or §5] The extension to multimodal image editing is asserted but lacks any qualitative examples, failure cases, or quantitative comparison to dedicated editing baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and will incorporate the suggested additions into the revised version to strengthen the empirical support for our claims.
point-by-point responses
Referee: [§4 (Experiments)] §4 (Experiments/Evaluation): The central claim that INSET significantly outperforms SOTA methods on InterleaveBench with widening gaps at higher complexity is presented only qualitatively in the abstract and evaluation section. No specific quantitative metrics (e.g., exact scores, tables of results), baseline details, ablation studies, or error analysis are provided, which is load-bearing for assessing whether the gains are due to the proposed insertion method versus data scale or training.
Authors: We agree that the current presentation relies too heavily on qualitative descriptions and figures. In the revised manuscript, we will add a dedicated results table in Section 4 reporting exact quantitative metrics (e.g., multi-image consistency and text alignment scores) for INSET versus all baselines on InterleaveBench, with explicit breakdowns by instruction complexity to demonstrate the widening gaps. We will also specify baseline details, include an ablation comparing our 15M interleaved data synthesis pipeline against training on standard datasets, and add an error analysis subsection. These changes will directly address whether gains arise from the insertion method. revision: yes
Referee: [§3 (Method)] §3 (Method, visual feature insertion): The core assumption that positioning visual features directly at semantic slots leverages contextual locality for precise object binding without creating alignment or training instabilities is not supported by any ablations on projection layers, dimension handling, attention entropy, or loss variance across modalities. This is critical because the largest claimed gains occur at high complexity, where any projection-induced misalignment would be expected to degrade multi-image consistency.
Authors: The insertion strategy is intended to exploit the transformer's local attention patterns by placing visual tokens at semantically corresponding positions. To substantiate this, the revised Section 3 will include new ablations: comparisons of alternative projection layers and dimension alignment techniques, attention entropy visualizations demonstrating localized focus, and training loss variance metrics across text and image modalities. These will be reported for both low- and high-complexity instructions to confirm stability and rule out projection-induced degradation. revision: yes
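The attention-entropy diagnostic promised in this response can be sketched as follows. The function and thresholds are illustrative assumptions, not the paper's code: lower mean row entropy indicates more localized attention, which is what the rebuttal proposes to measure around the inserted visual tokens.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) of attention rows.

    attn: (heads, T, T) row-stochastic attention weights.
    Uniform rows give the maximum entropy log(T); near-one-hot rows
    give entropy near zero, i.e. highly localized attention.
    """
    p = np.clip(attn, eps, 1.0)            # avoid log(0)
    row_entropy = -(p * np.log(p)).sum(axis=-1)  # (heads, T)
    return row_entropy.mean()
```

Comparing this statistic on rows attending to image tokens, for interleaved versus separated inputs and at low versus high instruction complexity, would directly probe the locality premise.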
Circularity Check
No circularity: architectural proposal with external data synthesis and empirical evaluation
full rationale
The paper describes an architectural design choice—embedding visual features directly at semantic slots in the transformer sequence to leverage contextual locality for object binding—without any mathematical derivations, equations, or parameter-fitting steps that could reduce to self-referential inputs. The data engine relies on external VLMs and LLMs for synthesizing 15M samples from standard datasets, and performance claims are grounded in evaluation on InterleaveBench rather than internal consistency or self-citations. No load-bearing steps invoke prior author work as uniqueness theorems or smuggle ansatzes; the approach is self-contained as a modeling innovation validated externally.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: transformers leverage contextual locality for precise object binding when visual features are positioned at corresponding semantic slots.
invented entities (1)
- INSET architecture (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction (tagged unclear: relation between the paper passage and the cited Recognition theorem)
  Passage: "By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged unclear: relation between the paper passage and the cited Recognition theorem)
  Passage: "We develop a scalable data engine that constructs 15M high-quality interleaved samples"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.