Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models
Pith reviewed 2026-05-20 18:41 UTC · model grok-4.3
The pith
Feeding a multimodal model's self-generated visual thoughts back into itself improves understanding across benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By making visual generation an explicit intermediate reasoning step, a model produces self-generated visual thoughts through acts such as detail enhancement, context expansion, or structural visualisation. These thoughts are fed back into the same model to refine perception, yielding consistent gains on multimodal understanding benchmarks. Generative fidelity bounds the size of those gains, and distinct edit-prompt families govern transfer efficiency.
What carries the argument
Generation-to-Understanding (G2U) synergy, the mechanism that converts controlled generative acts into self-generated visual thoughts for direct feedback into the same model.
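The loop described here can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the `generate` and `understand` callables, the prompt wording in `EDIT_PROMPTS`, and the choice to pass both the original image and the visual thought to the understanding call are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

# Hypothetical prompt text for the three generative acts named in the abstract.
EDIT_PROMPTS = {
    "detail_enhancement": "Sharpen and enlarge the region relevant to the question.",
    "context_expansion": "Extend the scene beyond the current image borders.",
    "structural_visualisation": "Draw the structure that the question refers to.",
}


@dataclass
class G2UResult:
    answer: str
    visual_thought: Any
    edit_family: str


def g2u_answer(
    image: Any,
    question: str,
    edit_family: str,
    generate: Callable[[Any, str], Any],
    understand: Callable[[List[Any], str], str],
) -> G2UResult:
    """Run one generation-to-understanding pass.

    Both callables are assumed to wrap the same unified model, so no
    retraining and no external tool is involved.
    """
    prompt = EDIT_PROMPTS[edit_family]
    # Step 1: the model edits the input image to produce a visual thought.
    visual_thought = generate(image, prompt)
    # Step 2: the visual thought is fed back alongside the original image.
    answer = understand([image, visual_thought], question)
    return G2UResult(answer, visual_thought, edit_family)
```

With stub callables in place of a real model, `g2u_answer("img", "How many cups?", "detail_enhancement", gen, und)` returns a `G2UResult` whose `visual_thought` is whatever `gen` produced. Keeping the edit family as an explicit argument makes it straightforward to compare families, which is the axis along which the paper reports differences in transfer efficiency.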
If this is right
- Generative fidelity directly limits the amount of perceptual improvement that can be obtained.
- Different families of edit prompts determine the efficiency with which generation benefits transfer to understanding.
- Models produce plausible visual edits yet lack stable task alignment when they must choose what to generate on their own.
- Unified multimodal models can achieve bidirectional enhancement between generation and understanding without retraining.
Where Pith is reading between the lines
- Models could run repeated internal cycles of generation and feedback to handle ambiguous or complex visual scenes.
- The same reversed-flow pattern might apply to other output modalities such as audio or structured text for self-refinement.
- Adding mechanisms to improve alignment between chosen generations and task goals could move models closer to genuine self-reflection.
Load-bearing premise
The observed gains in understanding come specifically from the controlled generative acts of detail enhancement, context expansion, or structural visualisation rather than from incidental effects of prompting or the evaluation setup.
What would settle it
Replace the self-generated visual thoughts with unrelated or randomly created images on the same twelve benchmarks and test whether the performance improvements disappear.
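One way to score such a control is to compare each condition's accuracy gain over a no-feedback baseline on the same items. The sketch below assumes per-item correctness has already been recorded for every condition; the condition names and the toy data in the usage note are illustrative and are not results from the paper.

```python
from typing import Dict, List


def accuracy(correct: List[bool]) -> float:
    """Fraction of items answered correctly (0.0 for an empty list)."""
    return sum(correct) / len(correct) if correct else 0.0


def gains_by_condition(
    results: Dict[str, List[bool]], baseline: str = "no_feedback"
) -> Dict[str, float]:
    """Accuracy gain of every non-baseline condition over the baseline."""
    base = accuracy(results[baseline])
    return {
        name: accuracy(correct) - base
        for name, correct in results.items()
        if name != baseline
    }
```

For example, with toy inputs `{"no_feedback": [True, False, False, False], "self_generated": [True, True, True, False], "random_image": [True, False, False, False]}`, the function reports a gain of 0.5 for `self_generated` and 0.0 for `random_image`. If the real random-image gain came out close to the self-generated gain, the improvement could not be attributed to the content of the visual thoughts.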
Original abstract
The long-standing goal of multimodal AI is to build unified models in which visual understanding and visual generation mutually enhance one another. Despite recent works such as BAGEL, BLIP3o achieves remarkable progress; In practice, however, this unification remains one-directional: understanding routinely guides generation, yet how and why generation can support understanding is rarely investigated. We revisit this asymmetry and propose Generation-to-Understanding (G2U) synergy, where visual generation becomes an explicit intermediate reasoning step. Our framework enables a model to perform controlled generative acts, such as detail enhancement, context expansion or structural visualisation, to produce self-generated visual thoughts, which are then fed back into the model to refine perception without retraining or external tools. Through a comprehensive evaluation on twelve benchmarks, this reversed information flow consistently improves multimodal understanding. We show that generative fidelity bounds perceptual gain and that distinct families of edit prompts govern transfer efficiency. We further analyse whether models can decide what to imagine. While they can produce plausible edits, these self-generated visual thoughts lack stable task alignment, revealing that current large multimodal models fall short of true self-reflection. This work exposes a missing mechanism in unified cognition and suggests that imagination is not the end of understanding but its beginning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a Generation-to-Understanding (G2U) framework for large multimodal models in which controlled generative acts (detail enhancement, context expansion, structural visualisation) produce self-generated visual thoughts that are fed back to refine perception without retraining. It reports consistent improvements across twelve benchmarks, claims that generative fidelity bounds perceptual gain and that families of edit prompts control transfer efficiency, and finds that current models produce plausible but task-misaligned edits when deciding what to imagine.
Significance. If the reported gains are shown to arise specifically from the visual feedback loop rather than prompt engineering, the work would be significant for demonstrating a concrete mechanism by which generation can support understanding in unified multimodal models, addressing the one-directional asymmetry noted in the introduction and providing empirical bounds on the relationship between generative fidelity and perceptual improvement.
major comments (2)
- Evaluation section (the comprehensive evaluation on twelve benchmarks): the central claim that reversed information flow improves understanding requires an ablation that holds edit-prompt content, context length, and instruction-following fixed while removing the actual image-generation step. Without this control, improvements could reflect richer textual guidance rather than the generative visual feedback itself, directly undermining the premise that perceptual gains stem from the controlled generative acts.
- Results and analysis sections: the statement that generative fidelity bounds perceptual gain is presented as a key finding, yet no quantitative correlation, regression, or controlled variation of fidelity (e.g., via noise injection or model scale) is described to establish the bounding relationship; this leaves the claimed bound as an observed trend rather than a demonstrated limit.
minor comments (2)
- Abstract: the sentence beginning 'Despite recent works such as BAGEL, BLIP3o achieves remarkable progress;' is grammatically incomplete and should be rephrased for clarity.
- The term 'visual thoughts' is introduced without a precise operational definition or example in the early sections; adding a short illustrative figure or pseudocode would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments on our manuscript. We address each major comment below, acknowledging where revisions are needed to strengthen the claims, and outline the specific changes we will incorporate.
Point-by-point responses
- Referee: Evaluation section (the comprehensive evaluation on twelve benchmarks): the central claim that reversed information flow improves understanding requires an ablation that holds edit-prompt content, context length, and instruction-following fixed while removing the actual image-generation step. Without this control, improvements could reflect richer textual guidance rather than the generative visual feedback itself, directly undermining the premise that perceptual gains stem from the controlled generative acts.
Authors: We agree that this ablation is essential to isolate the contribution of the visual feedback loop. In the revised manuscript we will add a controlled experiment that replaces the generated images with either textual renderings of the same edits or fixed placeholder images, while exactly preserving edit-prompt content, context length, and instruction-following behavior. Performance differences between the visual-feedback condition and these textual/placebo controls will be reported to demonstrate that gains arise specifically from the self-generated visual thoughts.
Revision: yes
- Referee: Results and analysis sections: the statement that generative fidelity bounds perceptual gain is presented as a key finding, yet no quantitative correlation, regression, or controlled variation of fidelity (e.g., via noise injection or model scale) is described to establish the bounding relationship; this leaves the claimed bound as an observed trend rather than a demonstrated limit.
Authors: We acknowledge that the current presentation relies on comparative trends rather than formal quantitative evidence. In the revision we will add (i) Pearson and Spearman correlations between generative fidelity metrics (FID, LPIPS) and per-benchmark perceptual improvement deltas, and (ii) controlled fidelity-variation experiments that inject graduated Gaussian noise into the generated images while keeping all other factors fixed, thereby quantifying the bounding relationship.
Revision: yes
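The correlation analysis promised in the second response can be illustrated with a small standard-library sketch. It assumes one fidelity score and one improvement delta per benchmark or per model; the function names are chosen for this example, and in practice `scipy.stats.pearsonr` and `scipy.stats.spearmanr` would serve the same purpose and also report p-values.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because lower FID and LPIPS values indicate higher fidelity, a fidelity bound of the kind the paper claims would show up as a negative correlation between these metrics and the improvement deltas. A strong Spearman coefficient with a weaker Pearson coefficient would suggest the relationship is monotone but not linear.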
Circularity Check
No circularity: purely empirical protocol with no derivations or self-referential reductions
The paper describes an empirical intervention—using controlled generative acts (detail enhancement, context expansion, structural visualisation) to produce self-generated visual thoughts that are fed back into the model—followed by evaluation on twelve benchmarks. No equations, fitted parameters, or first-principles derivations appear in the abstract or described framework. Claims such as 'generative fidelity bounds perceptual gain' and 'distinct families of edit prompts govern transfer efficiency' are presented as observed outcomes of the evaluation rather than quantities derived from or defined in terms of themselves. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The work is therefore self-contained against external benchmarks and does not reduce any prediction to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Multimodal models can execute controlled generative acts such as detail enhancement or context expansion without retraining or external tools.