Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models

Dongliang Chang; Xintong Liu; Yuanchen Fang; Yujun Tong; Zhanyu Ma; Zijin Yin

arxiv: 2605.15792 · v1 · pith:MSXZA4MQnew · submitted 2026-05-15 · 💻 cs.CV

Yujun Tong, Dongliang Chang, Zijin Yin, Xintong Liu, Yuanchen Fang, Zhanyu Ma

Pith reviewed 2026-05-20 18:41 UTC · model grok-4.3

classification 💻 cs.CV

keywords: multimodal models · visual generation · understanding synergy · self-generated visual thoughts · reversed information flow · edit prompts · generative fidelity · multimodal benchmarks

The pith

Feeding a multimodal model's self-generated visual thoughts back into itself improves understanding across benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large multimodal models can treat visual generation as an explicit intermediate reasoning step rather than an endpoint. By performing controlled generative acts such as detail enhancement, context expansion, or structural visualisation, the model creates its own visual thoughts and feeds them back to refine perception. This reversed flow is shown to raise performance on twelve understanding benchmarks without any retraining or external tools. The work further demonstrates that the quality of the generated images limits how much perceptual gain occurs and that different prompt families control how efficiently the benefit transfers. It also finds that models can produce plausible edits but do not yet select generations that reliably align with the task at hand.

Core claim

By making visual generation an explicit intermediate reasoning step, a model produces self-generated visual thoughts through acts like detail enhancement, context expansion, or structural visualisation; these thoughts are then fed back into the model to refine perception, yielding consistent gains on multimodal understanding benchmarks while generative fidelity bounds the size of those gains and distinct edit-prompt families govern transfer efficiency.

What carries the argument

Generation-to-Understanding (G2U) synergy, the mechanism that converts controlled generative acts into self-generated visual thoughts for direct feedback into the same model.
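
A minimal sketch of the feedback loop described above, assuming a unified model that exposes one image-editing call and one question-answering call. The `UnifiedModel` interface, the method names `edit` and `answer`, the prompt wording in `EDIT_PROMPTS`, and the choice to pass the original image alongside the generated one are all illustrative assumptions, not the paper's API or prompts.

```python
from typing import Any, List, Protocol

# Hypothetical prompt texts, one per generative act named in the paper.
EDIT_PROMPTS = {
    "detail_enhancement": "Sharpen and enlarge the region relevant to the question.",
    "context_expansion": "Extend the scene beyond the current image borders.",
    "structural_visualisation": "Redraw the scene as a clean structural diagram.",
}


class UnifiedModel(Protocol):
    """Assumed interface: one model that can both edit images and answer questions."""

    def edit(self, image: Any, prompt: str) -> Any: ...

    def answer(self, images: List[Any], question: str) -> str: ...


def g2u_answer(model: UnifiedModel, image: Any, question: str,
               act: str = "detail_enhancement") -> str:
    # Step 1: a controlled generative act produces a self-generated visual thought.
    visual_thought = model.edit(image, EDIT_PROMPTS[act])
    # Step 2: the same model answers with the visual thought fed back as extra input.
    return model.answer([image, visual_thought], question)


class EchoModel:
    """Toy stand-in so the sketch runs without any real model."""

    def edit(self, image: Any, prompt: str) -> Any:
        return f"edited({image})"

    def answer(self, images: List[Any], question: str) -> str:
        return f"{question} | saw {len(images)} images"
```

Calling `g2u_answer(EchoModel(), "img", "What is shown?")` exercises both steps with the toy model. No retraining is involved: the loop only changes what the model sees at inference time.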

If this is right

Generative fidelity directly limits the amount of perceptual improvement that can be obtained.
Different families of edit prompts determine the efficiency with which generation benefits transfer to understanding.
Models produce plausible visual edits yet lack stable task alignment when they must choose what to generate on their own.
Unified multimodal models can achieve bidirectional enhancement between generation and understanding without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models could run repeated internal cycles of generation and feedback to handle ambiguous or complex visual scenes.
The same reversed-flow pattern might apply to other output modalities such as audio or structured text for self-refinement.
Adding mechanisms to improve alignment between chosen generations and task goals could move models closer to genuine self-reflection.

Load-bearing premise

The observed gains in understanding come specifically from the controlled generative acts of detail enhancement, context expansion, or structural visualisation rather than from incidental effects of prompting or the evaluation setup.

What would settle it

Replace the self-generated visual thoughts with unrelated or randomly created images on the same twelve benchmarks and test whether the performance improvements disappear.
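
One way this comparison could be scored is sketched below. The condition labels (`no_feedback`, `self_generated`, `random_image`) and the toy correctness lists are hypothetical placeholders, not results from the paper; the point is only that each control is measured against the same baseline on the same items.

```python
from typing import Dict, List


def accuracy(correct: List[int]) -> float:
    """Fraction of items answered correctly (1 = correct, 0 = wrong)."""
    return sum(correct) / len(correct)


def gains_over_baseline(results: Dict[str, List[int]],
                        baseline: str = "no_feedback") -> Dict[str, float]:
    """Accuracy change of every non-baseline condition relative to the baseline."""
    base = accuracy(results[baseline])
    return {
        name: accuracy(correct) - base
        for name, correct in results.items()
        if name != baseline
    }


# Toy per-item outcomes on the same four questions (illustrative only).
toy_results = {
    "no_feedback": [1, 0, 0, 1],
    "self_generated": [1, 1, 0, 1],
    "random_image": [1, 0, 0, 1],
}
toy_gains = gains_over_baseline(toy_results)
```

If the gain under `random_image` matched the gain under `self_generated`, the improvement could not be attributed to the content of the visual thoughts.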

Figures

Figures reproduced from arXiv: 2605.15792 by Dongliang Chang, Xintong Liu, Yuanchen Fang, Yujun Tong, Zhanyu Ma, Zijin Yin.

**Figure 1.** The core ambition of unified models is a true synergy where understanding (U) and generation (G) mutually reinforce. However, current unification is predominantly one-directional (U → G), as an LMM's (e.g., QwenVL) strong reasoning is leveraged for high-fidelity generation. The reciprocal path (G → U), where generative faculties enhance understanding, remains critically overlooked. This work explores thi…

**Figure 2.** Overview of our Generation-to-Understanding (G…

**Figure 3.** Visualization of the two functional regimes of visual editing that enable our G…

**Figure 4.** Quantitative evaluations on our VisThink-Bench. Left: We report the accuracy change of our G…

**Figure 5.** Correlation between improvements and editing perfor…

**Figure 6.** Comparison of generated edit prompt behaviors

Original abstract

The long-standing goal of multimodal AI is to build unified models in which visual understanding and visual generation mutually enhance one another. Despite recent works such as BAGEL, BLIP3o achieves remarkable progress; In practice, however, this unification remains one-directional: understanding routinely guides generation, yet how and why generation can support understanding is rarely investigated. We revisit this asymmetry and propose Generation-to-Understanding (G2U) synergy, where visual generation becomes an explicit intermediate reasoning step. Our framework enables a model to perform controlled generative acts, such as detail enhancement, context expansion or structural visualisation, to produce self-generated visual thoughts, which are then fed back into the model to refine perception without retraining or external tools. Through a comprehensive evaluation on twelve benchmarks, this reversed information flow consistently improves multimodal understanding. We show that generative fidelity bounds perceptual gain and that distinct families of edit prompts govern transfer efficiency. We further analyse whether models can decide what to imagine. While they can produce plausible edits, these self-generated visual thoughts lack stable task alignment, revealing that current large multimodal models fall short of true self-reflection. This work exposes a missing mechanism in unified cognition and suggests that imagination is not the end of understanding but its beginning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper frames generation as an explicit intermediate reasoning step to boost multimodal understanding and reports gains on twelve benchmarks, but the causal link to the visual feedback itself is not isolated.

The main thing here is that they reverse the usual pipeline so a multimodal model generates its own visual thoughts through controlled edits and feeds those back to improve perception. They claim this works without retraining, produces consistent lifts across twelve benchmarks, and that better generative fidelity leads to bigger perceptual gains while different edit-prompt families change how much transfers. They also test self-chosen generations and find the outputs lack stable task alignment, so current models are not great at deciding what to imagine on their own.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a Generation-to-Understanding (G2U) framework for large multimodal models in which controlled generative acts (detail enhancement, context expansion, structural visualisation) produce self-generated visual thoughts that are fed back to refine perception without retraining. It reports consistent improvements across twelve benchmarks, claims that generative fidelity bounds perceptual gain and that families of edit prompts control transfer efficiency, and finds that current models produce plausible but task-misaligned edits when deciding what to imagine.

Significance. If the reported gains are shown to arise specifically from the visual feedback loop rather than prompt engineering, the work would be significant for demonstrating a concrete mechanism by which generation can support understanding in unified multimodal models, addressing the one-directional asymmetry noted in the introduction and providing empirical bounds on the relationship between generative fidelity and perceptual improvement.

major comments (2)

Evaluation section (the comprehensive evaluation on twelve benchmarks): the central claim that reversed information flow improves understanding requires an ablation that holds edit-prompt content, context length, and instruction-following fixed while removing the actual image-generation step. Without this control, improvements could reflect richer textual guidance rather than the generative visual feedback itself, directly undermining the premise that perceptual gains stem from the controlled generative acts.
Results and analysis sections: the statement that generative fidelity bounds perceptual gain is presented as a key finding, yet no quantitative correlation, regression, or controlled variation of fidelity (e.g., via noise injection or model scale) is described to establish the bounding relationship; this leaves the claimed bound as an observed trend rather than a demonstrated limit.
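
The controlled fidelity variation mentioned in the second comment could look roughly like the sketch below. It assumes images are nested lists of grey-scale values in [0, 1]; the function names, the noise levels in `sigmas`, and the fixed seed are hypothetical choices, not a protocol from the paper.

```python
import random
from typing import Dict, List, Sequence

Image = List[List[float]]


def add_gaussian_noise(image: Image, sigma: float, seed: int = 0) -> Image:
    """Return a copy of the image with zero-mean Gaussian noise, clipped to [0, 1]."""
    rng = random.Random(seed)  # fixed seed so every run degrades the image identically
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


def fidelity_ladder(image: Image,
                    sigmas: Sequence[float] = (0.0, 0.05, 0.1, 0.2)) -> Dict[float, Image]:
    """Graduated degradations of one generated image, keyed by noise level."""
    return {sigma: add_gaussian_noise(image, sigma) for sigma in sigmas}


# Tiny 2x2 example image (illustrative values only).
toy_image = [[0.2, 0.8], [0.5, 0.5]]
toy_ladder = fidelity_ladder(toy_image)
```

Feeding each rung of the ladder back into the model, with the prompt and question held fixed, would show how understanding accuracy changes as fidelity alone is lowered.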

minor comments (2)

Abstract: the sentence beginning 'Despite recent works such as BAGEL, BLIP3o achieves remarkable progress;' is grammatically incomplete and should be rephrased for clarity.
The term 'visual thoughts' is introduced without a precise operational definition or example in the early sections; adding a short illustrative figure or pseudocode would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We address each major comment below, acknowledging where revisions are needed to strengthen the claims, and outline the specific changes we will incorporate.

Referee: Evaluation section (the comprehensive evaluation on twelve benchmarks): the central claim that reversed information flow improves understanding requires an ablation that holds edit-prompt content, context length, and instruction-following fixed while removing the actual image-generation step. Without this control, improvements could reflect richer textual guidance rather than the generative visual feedback itself, directly undermining the premise that perceptual gains stem from the controlled generative acts.

Authors: We agree that this ablation is essential to isolate the contribution of the visual feedback loop. In the revised manuscript we will add a controlled experiment that replaces the generated images with either textual renderings of the same edits or fixed placeholder images, while exactly preserving edit-prompt content, context length, and instruction-following behavior. Performance differences between the visual-feedback condition and these textual/placebo controls will be reported to demonstrate that gains arise specifically from the self-generated visual thoughts.

revision: yes

Referee: Results and analysis sections: the statement that generative fidelity bounds perceptual gain is presented as a key finding, yet no quantitative correlation, regression, or controlled variation of fidelity (e.g., via noise injection or model scale) is described to establish the bounding relationship; this leaves the claimed bound as an observed trend rather than a demonstrated limit.

Authors: We acknowledge that the current presentation relies on comparative trends rather than formal quantitative evidence. In the revision we will add (i) Pearson and Spearman correlations between generative fidelity metrics (FID, LPIPS) and per-benchmark perceptual improvement deltas, and (ii) controlled fidelity-variation experiments that inject graduated Gaussian noise into the generated images while keeping all other factors fixed, thereby quantifying the bounding relationship.

revision: yes
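
The correlation analysis promised in (i) can be illustrated with a standard-library sketch. The helper names and the toy numbers are assumptions for illustration; they are not data from the paper. Note that FID and LPIPS are distances where lower means better, so a fidelity-gain relationship would show up as a negative correlation unless the scores are negated first.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: linear association between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy example: a higher-is-better fidelity score vs. accuracy gain per benchmark.
toy_fidelity = [1.0, 2.0, 3.0, 4.0]
toy_gain = [1.0, 4.0, 9.0, 16.0]
toy_pearson = pearson(toy_fidelity, toy_gain)
toy_spearman = spearman(toy_fidelity, toy_gain)
```

Reporting both coefficients is useful because a monotonic but non-linear relationship, as in the toy data, gives a near-perfect Spearman value while Pearson stays below 1.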

Circularity Check

0 steps flagged

No circularity: purely empirical protocol with no derivations or self-referential reductions

The paper describes an empirical intervention—using controlled generative acts (detail enhancement, context expansion, structural visualisation) to produce self-generated visual thoughts that are fed back into the model—followed by evaluation on twelve benchmarks. No equations, fitted parameters, or first-principles derivations appear in the abstract or described framework. Claims such as 'generative fidelity bounds perceptual gain' and 'distinct families of edit prompts govern transfer efficiency' are presented as observed outcomes of the evaluation rather than quantities derived from or defined in terms of themselves. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The work is therefore self-contained against external benchmarks and does not reduce any prediction to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the empirical effectiveness of self-generated visuals as reasoning aids. No explicit free parameters, new physical entities, or ad-hoc axioms are introduced in the abstract; the work relies on standard assumptions about multimodal model capabilities.

axioms (1)

Domain assumption: Multimodal models can execute controlled generative acts such as detail enhancement or context expansion without retraining or external tools. Invoked when describing the framework that produces self-generated visual thoughts.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 19 internal anchors

[1]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

In- structpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. In- structpix2pix: Learning to follow image editing instructions. InCVPR, 2023. 2

work page 2023
[3]

Lan- guage models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners. InNeurIPS, 2020. 4

work page 2020
[4]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Team Chameleon. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Sil- vio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025. 1, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Are we on the right way for evaluating large vision-language models? InNeurIPS, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? InNeurIPS, 2024. 4, 5, 7, 8

work page 2024
[7]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus- pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024. 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 1, 2, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Mme: A comprehensive evaluation bench- mark for multimodal large language models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation bench- mark for multimodal large language models. InICLR, 2025. 4, 5, 7

work page 2025
[11]

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Mul- timodal models with unified multi-granularity comprehen- sion and generation.arXiv preprint arXiv:2404.14396, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InCVPR, 2024. 4, 5, 7, 8

work page 2024
[13]

W., Li, L., Yang, Z., Wang, L., and Cheng, Y

Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, and Yu Cheng. Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark.arXiv preprint arXiv:2501.05444,

work page arXiv
[14]

DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic mul- timodal model.arXiv preprint arXiv:2511.05271, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Osten- dorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. InNeurIPS, 2024. 3

work page 2024
[16]

Viescore: Towards explainable metrics for conditional image synthesis evaluation

Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. InACL, 2024. 7

work page 2024
[17]

Seed-bench: Bench- marking multimodal large language models

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Bench- marking multimodal large language models. InCVPR, 2024. 4

work page 2024
[18]

Llava-onevision: Easy visual task transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. Transactions on Machine Learning Research, 2024. 6, 7

work page 2024
[19]

R-bench: Are your large multimodal model robust to real-world corruptions?IEEE Journal of Selected Topics in Signal Processing, 2025

Chunyi Li, Jianbo Zhang, Zicheng Zhang, Haoning Wu, Yuan Tian, Wei Sun, Guo Lu, Xiongkuo Min, Xiaohong Liu, Weisi Lin, et al. R-bench: Are your large multimodal model robust to real-world corruptions?IEEE Journal of Selected Topics in Signal Processing, 2025. 4, 5, 7, 8

work page 2025
[20]

Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding

Geng Li, Jinglin Xu, Yunzhen Zhao, and Yuxin Peng. Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding. InCVPR, 2025. 3

work page 2025
[21]

Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models

Weixin Liang, Lili Yu, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. InICLR Workshop, 2025. 2

work page 2025
[22]

Mmbench: Is your multi-modal model an all-around player? InECCV, 2024

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InECCV, 2024. 4, 5, 7

work page 2024
[23]

Visual agentic reinforcement fine-tuning.arXiv preprint arXiv:2505.14246, 2025

Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Ji- aqi Wang. Visual agentic reinforcement fine-tuning.arXiv preprint arXiv:2505.14246, 2025. 3

work page arXiv 2025
[24]

Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action

Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. InCVPR,

work page
[25]

Janusflow: Harmonizing autore- gression and rectified flow for unified multimodal under- standing and generation

Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autore- gression and rectified flow for unified multimodal under- standing and generation. InCVPR, 2025. 2

work page 2025
[26]

Dall·e 3 system card, 2023

OpenAI. Dall·e 3 system card, 2023. 1

work page 2023
[27]

Gpt-5, 2025

OpenAI. Gpt-5, 2025. 1

work page 2025
[28]

Transfer between Modalities with MetaQueries

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Trans- fer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025. 2, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Sdxl: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InICLR, 2023. 1

work page 2023
[30]

Cogcom: A visual language model with chain-of- manipulations reasoning

Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, et al. Cogcom: A visual language model with chain-of- manipulations reasoning. InICLR, 2024. 3

work page 2024
[31]

V-thinker: Interactive thinking with images.arXiv preprint arXiv:2511.04460, 2025

Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, et al. V-thinker: Interactive thinking with images.arXiv preprint arXiv:2511.04460, 2025. 3

work page arXiv 2025
[32]

Tokenflow: Unified image tokenizer for multi- modal understanding and generation

Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xin- glong Wu. Tokenflow: Unified image tokenizer for multi- modal understanding and generation. InCVPR, 2025. 2, 6, 7

work page 2025
[33]

2025.doi:10.48550/arXiv.2412.15188

Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, and Lili Yu. Lmfusion: Adapting pretrained language models for multimodal gener- ation.arXiv preprint arXiv:2412.15188, 2024. 1, 2, 6, 7

work page arXiv 2024
[34]

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space rea- soning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Jun- tao Li, Xiaoye Qu, et al. Openthinkimg: Learning to think with images via visual tool reinforcement learning.arXiv preprint arXiv:2505.08617, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Generative multimodal models are in-context learners

Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiy- ing Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. InCVPR, 2024. 2

work page 2024
[37]

Lego-puzzles: How good are mllms at multi-step spatial reasoning?arXiv preprint arXiv:2503.19990, 2025

Kexian Tang, Junyao Gao, Yanhong Zeng, Haodong Duan, Yanan Sun, Zhening Xing, Wenran Liu, Kaifeng Lyu, and Kai Chen. Lego-puzzles: How good are mllms at multi-step spatial reasoning?arXiv preprint arXiv:2503.19990, 2025. 4

work page arXiv 2025
[38]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Metamorph: Multimodal un- derstanding and generation via instruction tuning

Shengbang Tong, David Fan, Jiachen Li, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal un- derstanding and generation via instruction tuning. InICCV,

work page
[40]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024. 2, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Spatial457: A di- agnostic benchmark for 6d spatial reasoning of large mul- timodal models

Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, and Alan Yuille. Spatial457: A di- agnostic benchmark for 6d spatial reasoning of large mul- timodal models. InCVPR, 2025. 4

work page 2025
[42]

Janus: Decoupling visual encod- ing for unified multimodal understanding and generation

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encod- ing for unified multimodal understanding and generation. In CVPR, 2025. 2

work page 2025
[43]

Q-bench: A benchmark for general-purpose foundation models on low-level vision

Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, et al. Q-bench: A benchmark for general-purpose foundation models on low-level vision. In ICLR, 2023. 4

work page 2023
[44]

Vtool-r1: Vlms learn to think with images via reinforcement learning on multimodal tool use.arXiv preprint arXiv:2505.19255, 2025

Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, and Klara Nahrstedt. Vtool-r1: Vlms learn to think with images via reinforcement learning on multimodal tool use. arXiv preprint arXiv:2505.19255, 2025. 3

work page arXiv 2025
[45]

V?: Guided visual search as a core mechanism in multimodal llms

Penghao Wu and Saining Xie. V?: Guided visual search as a core mechanism in multimodal llms. InCVPR, 2024. 3

work page 2024
[46]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of- experts vision-language models for advanced multimodal understanding.arXiv preprint arXiv:2412.10302, 2024. 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Log- icvista: Multimodal llm logical reasoning benchmark in vi- sual contexts.arXiv preprint arXiv:2407.04973, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[48]

Show-o: One single transformer to unify multimodal understanding and generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. InICLR, 2024. 2

work page 2024
[49]

Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show- o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025. 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Kiva: Kid-inspired visual analogies for testing large multimodal models

Eunice Yiu, Maan Qraitem, Anisa Noor Majhi, Charlie Wong, Yutong Bai, Shiry Ginosar, Alison Gopnik, and Kate Saenko. Kiva: Kid-inspired visual analogies for testing large multimodal models. InICLR, 2024. 4, 5, 7

work page 2024
[51]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490, 2023. 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xi- aowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl.arXiv preprint arXiv:2505.15436, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[53]

arXiv preprint arXiv:2507.07998 , year=

Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, and Chen Wei. Pyvision: Agentic vision with dynamic tooling.arXiv preprint arXiv:2507.07998, 2025. 3

work page arXiv 2025
[54]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deep- eyes: Incentivizing” thinking with images” via reinforce- ment learning.arXiv preprint arXiv:2505.14362, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

Transfusion: Pre- dict the next token and diffuse images with one multi-modal model

Chunting Zhou, LILI YU, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Pre- dict the next token and diffuse images with one multi-modal model. InICLR, 2024. 2

