arxiv: 2605.23897 · v1 · pith:2VO3PNEKnew · submitted 2026-05-22 · 💻 cs.CV · cs.AI · cs.CL

ETCHR: Editing To Clarify and Harness Reasoning

Pith reviewed 2026-05-25 04:19 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords image editing · visual reasoning · multimodal models · reasoning trajectories · decoupled editor · edit correctness · task families

The pith

A dedicated image editor trained on edit trajectories raises average Pass@1 by 4.6 to 5.5 points across three multimodal models, without retraining those models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current multimodal models struggle with visual reasoning because text-only reasoning chains cannot handle fine details or needed view changes, and existing image-based methods either use rigid tools or produce unreliable edits. ETCHR addresses this by creating a separate image editor that learns to turn abstract questions into useful image transformations. The editor is trained first by copying good edit sequences and then by using feedback from the understanding model to ensure edits stay correct even as reasoning gets deeper. Because the editor stays independent, it can be attached to any existing multimodal model without further training and delivers consistent gains on tasks that need precise perception or spatial changes.

Core claim

ETCHR is a question-conditioned image editor that is trained separately from the downstream understanding model using a two-stage process of supervised imitation on edit trajectories followed by reward-based enhancement for both edit correctness and final reasoning accuracy, allowing it to clarify visual inputs for improved performance across fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding tasks.
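The second training stage rewards both edit correctness and final reasoning accuracy. The text reviewed here does not say how the two signals are combined, so the following is only a minimal sketch that assumes a weighted sum. The names `EditOutcome`, `combined_reward`, `w_edit`, and `w_answer`, and the default weights, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditOutcome:
    """Judgements for one sampled edit (field names are hypothetical)."""
    edit_correct: bool    # did a judge VLM accept the edit as correct?
    answer_correct: bool  # did the downstream answer match the ground truth?


def combined_reward(outcome: EditOutcome,
                    w_edit: float = 0.5,
                    w_answer: float = 0.5) -> float:
    """Weighted sum of the two reward signals; the weights are assumed."""
    return (w_edit * float(outcome.edit_correct)
            + w_answer * float(outcome.answer_correct))


if __name__ == "__main__":
    # A correct edit that still leads to a wrong answer earns partial reward.
    print(combined_reward(EditOutcome(edit_correct=True, answer_correct=False)))
```

With equal weights, an edit that is correct but does not help the final answer earns half the maximum reward. Other combinations, such as a product or a gated reward, are equally plausible readings of the abstract.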

What carries the argument

The decoupled, reasoning-aware image editor that maps abstract questions to targeted visual transformations via two-stage training of imitation followed by VLM reward optimization.
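To make the decoupling concrete, here is a minimal sketch of how a separately trained editor could sit in front of an unmodified understanding model. The `Editor` and `Reader` interfaces, the `answer_with_editor` function, and the stub models are all hypothetical. The sketch also assumes the reader receives only the edited image, which the text reviewed here does not specify.

```python
from typing import Callable

Image = bytes  # placeholder image type for this sketch

# Hypothetical interfaces: both models take (image, question).
Editor = Callable[[Image, str], Image]
Reader = Callable[[Image, str], str]


def answer_with_editor(editor: Editor, reader: Reader,
                       image: Image, question: str) -> str:
    """Edit the image for this question, then let an unmodified reader answer."""
    edited = editor(image, question)
    return reader(edited, question)


def make_stub_editor(tag: bytes) -> Editor:
    """Stand-in editor that appends a marker instead of drawing on the image."""
    return lambda image, question: image + tag


# The same editor is reused with two different stub readers; no retraining step.
stub_editor = make_stub_editor(b"|boxed")
reader_a: Reader = lambda image, question: f"A saw {len(image)} bytes"
reader_b: Reader = lambda image, question: f"B saw {len(image)} bytes"

if __name__ == "__main__":
    for reader in (reader_a, reader_b):
        print(answer_with_editor(stub_editor, reader, b"img", "Which bar is tallest?"))
```

Because the editor and reader only meet at the function boundary, swapping the reader requires no change to the editor, which is the property the training-free claim depends on.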

Load-bearing premise

A dedicated image editor can be trained to map abstract questions to appropriate visual transformations and maintain edit correctness as reasoning depth increases, without joint optimization with the downstream understanding model.

What would settle it

Measuring whether the edits produced by the trained editor raise or lower the downstream model's accuracy on questions that specifically require detail focus or viewpoint changes, compared to the unedited baseline.
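One way to run this comparison is to score each question twice, once on the unedited image and once on the edited one, and count how many answers the edit fixes and how many it breaks. The sketch below assumes per-question correctness flags are available. The function name `count_flips` and its output keys are hypothetical.

```python
from typing import Dict, Sequence


def count_flips(baseline_correct: Sequence[bool],
                edited_correct: Sequence[bool]) -> Dict[str, float]:
    """Compare per-question correctness with and without the editor."""
    if len(baseline_correct) != len(edited_correct):
        raise ValueError("both runs must cover the same questions")
    pairs = list(zip(baseline_correct, edited_correct))
    fixed = sum(1 for b, e in pairs if not b and e)   # wrong -> right
    broken = sum(1 for b, e in pairs if b and not e)  # right -> wrong
    n = len(pairs)
    net = 100.0 * (fixed - broken) / n if n else 0.0
    return {"fixed": fixed, "broken": broken, "net_gain_points": net}


if __name__ == "__main__":
    # Illustrative flags only; not data from the paper.
    print(count_flips([False, False, True, True], [True, True, True, False]))
```

Reporting fixed and broken counts separately matters here: a positive net gain can hide a substantial number of questions that the edits make worse.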

Original abstract

Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine-grained focus or view transformations. The "think with images" paradigm narrows this gap, but existing approaches are either constrained by fixed predefined toolkits or produce noisy intermediate images from unified multimodal methods. We pursue a third option: using a dedicated image editing model and decoupling it from the understanding model. However, off-the-shelf image editors fail as reasoning assistants with two complementary gaps: a language-side gap, where editors trained as passive instruction-followers cannot map an abstract question to an appropriate visual transformation, and a generation-side gap, where edit correctness degrades as reasoning depth grows. Guided by this analysis, we introduce ETCHR (Editing To Clarify and Harness Reasoning), a question-conditioned, reasoning-aware image editor decoupled from the downstream understanding model and trained with a two-stage recipe targeted at the two gaps: Reasoning Imitation via supervised fine-tuning on edit trajectories, followed by Reasoning Enhancement with VLM-derived rewards for edit correctness and downstream reasoning accuracy. Since the editor is decoupled, ETCHR plugs into different open- and closed-source MLLMs in a training-free manner. Across five task families (fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding), ETCHR raises average Pass@1 from 55.95 to 60.77 (+4.82) with Qwen3-VL-8B, from 65.08 to 70.55 (+5.47) with Gemini-3.1-Flash-Lite, and from 76.55 to 81.16 (+4.61) with the 1T-parameter MoE model Kimi K2.5.
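The three reported gains can be checked against the before and after scores quoted in the abstract. The short script below recomputes each difference. The dictionary layout and the `gain` helper are illustrative choices, while the numbers are copied from the abstract.

```python
# (baseline Pass@1, Pass@1 with ETCHR, reported gain), as quoted in the abstract.
reported = {
    "Qwen3-VL-8B": (55.95, 60.77, 4.82),
    "Gemini-3.1-Flash-Lite": (65.08, 70.55, 5.47),
    "Kimi K2.5": (76.55, 81.16, 4.61),
}


def gain(before: float, after: float) -> float:
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(after - before, 2)


gains = {name: gain(before, after) for name, (before, after, _) in reported.items()}

if __name__ == "__main__":
    for name, (_, _, claimed) in reported.items():
        status = "consistent" if abs(gains[name] - claimed) < 1e-9 else "mismatch"
        print(f"{name}: recomputed {gains[name]:+.2f}, reported {claimed:+.2f} ({status})")
```

All three differences agree with the figures stated in the abstract, so the headline numbers are at least internally consistent.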

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces ETCHR, a question-conditioned image editor decoupled from downstream MLLMs. It addresses two gaps in off-the-shelf editors (language-side mapping from abstract questions to visual transformations; generation-side degradation with reasoning depth) via a two-stage training process: Reasoning Imitation (SFT on edit trajectories) followed by Reasoning Enhancement (VLM-derived rewards on edit correctness and downstream accuracy). The editor is claimed to plug in training-free to open- and closed-source MLLMs, yielding Pass@1 gains of +4.82 (Qwen3-VL-8B), +5.47 (Gemini-3.1-Flash-Lite), and +4.61 (Kimi K2.5) averaged across five task families: fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding.

Significance. If the decoupling and generalization claims hold under held-out reward models and rigorous controls, the work would offer a practical modular route to improve visual reasoning without retraining large MLLMs or relying on fixed toolkits. The explicit targeting of the two identified gaps and the two-stage recipe are conceptually clear; reproducible code or parameter-free derivations are not mentioned.

major comments (3)
  1. [Reasoning Enhancement stage (abstract and §3)] Reasoning Enhancement stage: the manuscript does not state whether the VLM supplying the downstream reasoning accuracy reward matches any of the three evaluation models (Qwen3-VL-8B, Gemini-3.1-Flash-Lite, Kimi K2.5) or is held out. This detail is load-bearing for the central claim that a single trained editor works training-free across arbitrary open- and closed-source MLLMs.
  2. [Evaluation / Results section] Results across five task families: the reported average Pass@1 gains lack any description of the exact baselines, number of evaluation runs, statistical significance tests, data splits, or variance. Without these, the numerical improvements cannot be assessed as robust support for the method's effectiveness.
  3. [Reasoning Enhancement and experiments] The claim that edit correctness is maintained as reasoning depth increases is central to closing the generation-side gap, yet no depth-stratified ablation or analysis is referenced to validate this assumption empirically.
minor comments (1)
  1. [Abstract] The abstract would benefit from a single sentence clarifying the reward VLM's relationship to the test models.
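Major comment 3 asks for a depth-stratified analysis. A minimal sketch of that computation follows, assuming each evaluated edit is logged with an integer reasoning depth and a pass/fail correctness judgement. How the paper defines depth is not stated in the text reviewed here, and `success_rate_by_depth` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_depth(records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Group (depth, edit_was_correct) records and return the pass rate per depth."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for depth, ok in records:
        totals[depth] += 1
        hits[depth] += int(ok)
    return {depth: hits[depth] / totals[depth] for depth in sorted(totals)}


if __name__ == "__main__":
    # Illustrative records only; not data from the paper.
    log = [(1, True), (1, True), (2, True), (2, False), (3, False)]
    for depth, rate in success_rate_by_depth(log).items():
        print(f"depth {depth}: {rate:.0%} of edits judged correct")
```

Running the same table for the off-the-shelf editor and for the trained editor would show directly whether the second training stage flattens the decline with depth.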

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Reasoning Enhancement stage (abstract and §3)] Reasoning Enhancement stage: the manuscript does not state whether the VLM supplying the downstream reasoning accuracy reward matches any of the three evaluation models (Qwen3-VL-8B, Gemini-3.1-Flash-Lite, Kimi K2.5) or is held out. This detail is load-bearing for the central claim that a single trained editor works training-free across arbitrary open- and closed-source MLLMs.

    Authors: The VLM providing the downstream reasoning accuracy reward is a held-out model distinct from Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5. This choice directly supports the generalization claim. We will revise the abstract and §3 to state this explicitly. revision: yes

  2. Referee: [Evaluation / Results section] Results across five task families: the reported average Pass@1 gains lack any description of the exact baselines, number of evaluation runs, statistical significance tests, data splits, or variance. Without these, the numerical improvements cannot be assessed as robust support for the method's effectiveness.

    Authors: We agree additional methodological details are required. The baselines are the three MLLMs with direct inference (no editing). All results are means over three independent runs with reported standard deviation; data splits follow the public benchmark protocols; significance is assessed via paired t-tests. We will add a dedicated table in the Results section documenting these elements and the variance. revision: yes

  3. Referee: [Reasoning Enhancement and experiments] The claim that edit correctness is maintained as reasoning depth increases is central to closing the generation-side gap, yet no depth-stratified ablation or analysis is referenced to validate this assumption empirically.

    Authors: We will add a new depth-stratified ablation (edit success rate versus reasoning depth) to the experiments section or appendix to empirically support the claim that correctness holds as depth grows. revision: yes
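The protocol described in response 2 (means over repeated runs, standard deviation, paired t-tests) can be sketched with the standard library alone. The rebuttal does not say what the pairing unit is, so this sketch assumes paired scores per benchmark or per run. The helper `paired_t_statistic` is hypothetical, and the numbers in the usage example are placeholders rather than results from the paper.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(baseline: Sequence[float], edited: Sequence[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    if len(baseline) != len(edited):
        raise ValueError("paired samples must have equal length")
    if len(baseline) < 2:
        raise ValueError("need at least two pairs")
    diffs = [e - b for b, e in zip(baseline, edited)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Placeholder scores for illustration only.
    base = [1.0, 2.0, 3.0]
    with_editor = [2.0, 4.0, 3.0]
    print(f"mean diff = {statistics.mean(e - b for b, e in zip(base, with_editor)):.2f}")
    print(f"t = {paired_t_statistic(base, with_editor):.3f} with {len(base) - 1} degrees of freedom")
```

With only three runs the test has two degrees of freedom, so the threshold for significance is high. Pairing over individual questions or benchmarks would give the test considerably more power.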

Circularity Check

0 steps flagged

No significant circularity; training stages and decoupling are independent of reported test gains.

Full rationale

The paper presents an empirical method with two explicit training stages (SFT on trajectories, then reward-based enhancement) for a decoupled editor, followed by training-free application to multiple MLLMs and measurement on external task families. No equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided text. The decoupling claim and cross-model gains are stated directly without reduction to the training VLM by construction. This is a standard non-circular empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The core premise that image edits can be learned to aid reasoning is treated as a domain assumption.

axioms (1)
  • domain assumption: Image editors can be conditioned on questions to produce edits that improve downstream multimodal reasoning accuracy.
    This premise underpins the entire two-stage training approach described in the abstract.

