ETCHR: Editing To Clarify and Harness Reasoning
Pith reviewed 2026-05-25 04:19 UTC · model grok-4.3
The pith
A dedicated image editor trained on reasoning trajectories raises the average Pass@1 of multimodal models by roughly 4.6 to 5.5 points across the three models tested, without retraining those models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ETCHR is a question-conditioned image editor trained separately from the downstream understanding model. Training has two stages: supervised imitation on edit trajectories, then reward-based enhancement that targets both edit correctness and final reasoning accuracy. The resulting editor clarifies visual inputs and improves performance across fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding tasks.
What carries the argument
The decoupled, reasoning-aware image editor, which maps abstract questions to targeted visual transformations and is trained in two stages: imitation first, then optimization against VLM-derived rewards.
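To make the decoupling concrete, here is a minimal Python sketch of how a separately trained editor could sit in front of any understanding model. All names (`Editor`, `Understander`, `answer_with_editing`, and the two stub classes) are hypothetical illustrations rather than the paper's API, and images are represented as plain byte strings.

```python
from typing import Protocol


class Editor(Protocol):
    """Hypothetical interface: question-conditioned image editor."""
    def edit(self, image: bytes, question: str) -> bytes: ...


class Understander(Protocol):
    """Hypothetical interface: any downstream MLLM, open or closed."""
    def answer(self, image: bytes, question: str) -> str: ...


def answer_with_editing(editor: Editor, model: Understander,
                        image: bytes, question: str) -> str:
    """Edit first, then answer. The understanding model is never retrained."""
    edited = editor.edit(image, question)
    return model.answer(edited, question)


# Stub implementations so the sketch runs end to end.
class TaggingEditor:
    def edit(self, image: bytes, question: str) -> bytes:
        # A real editor would return a transformed image (crop, box, new view).
        return b"edited:" + image


class EchoModel:
    def answer(self, image: bytes, question: str) -> str:
        return f"{len(image)} bytes seen for: {question}"


result = answer_with_editing(TaggingEditor(), EchoModel(), b"raw", "How many bars?")
print(result)
```

Because the two roles only meet at the `answer_with_editing` call, swapping the understanding model requires no change to the editor, which is the property the review's premise depends on.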
Load-bearing premise
A dedicated image editor can be trained to map abstract questions to appropriate visual transformations and maintain edit correctness as reasoning depth increases, without joint optimization with the downstream understanding model.
What would settle it
Measuring whether the edits produced by the trained editor raise or lower the downstream model's accuracy on questions that specifically require detail focus or viewpoint changes, compared to the unedited baseline.
Original abstract
Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine-grained focus or view transformations. The "think with images" paradigm narrows this gap, but existing approaches are either constrained by fixed predefined toolkits or produce noisy intermediate images from unified multimodal methods. We pursue a third option: using a dedicated image editing model decoupled from the understanding model. However, off-the-shelf image editors fail as reasoning assistants because of two complementary gaps: a language-side gap, where editors trained as passive instruction-followers cannot map an abstract question to an appropriate visual transformation, and a generation-side gap, where edit correctness degrades as reasoning depth grows. Guided by this analysis, we introduce ETCHR (Editing To Clarify and Harness Reasoning), a question-conditioned, reasoning-aware image editor decoupled from the downstream understanding model and trained with a two-stage recipe targeted at the two gaps: Reasoning Imitation via supervised fine-tuning on edit trajectories, followed by Reasoning Enhancement with VLM-derived rewards for edit correctness and downstream reasoning accuracy. Since the editor is decoupled, ETCHR plugs into different open- and closed-source MLLMs in a training-free manner. Across five task families (fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding), ETCHR raises average Pass@1 from 55.95 to 60.77 (+4.82) with Qwen3-VL-8B, from 65.08 to 70.55 (+5.47) with Gemini-3.1-Flash-Lite, and from 76.55 to 81.16 (+4.61) with the 1T-parameter MoE model Kimi K2.5.
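As a quick arithmetic check on the abstract's headline numbers, the short Python snippet below recomputes each gain from the quoted before and after averages. The dictionary layout and the `mean_gain` summary are conveniences added here, not quantities reported by the paper.

```python
# Average Pass@1 (baseline, with ETCHR) as quoted in the abstract.
reported = {
    "Qwen3-VL-8B": (55.95, 60.77),
    "Gemini-3.1-Flash-Lite": (65.08, 70.55),
    "Kimi K2.5": (76.55, 81.16),
}

# Gain per model in percentage points, rounded to two decimals.
gains = {name: round(after - before, 2)
         for name, (before, after) in reported.items()}

# Unweighted mean of the three gains (a derived summary, not a paper figure).
mean_gain = round(sum(gains.values()) / len(gains), 2)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
print(f"mean gain across models: +{mean_gain:.2f}")
```

The recomputed differences agree with the +4.82, +5.47, and +4.61 stated in the abstract.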
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ETCHR, a question-conditioned image editor decoupled from downstream MLLMs. It addresses two gaps in off-the-shelf editors (language-side mapping from abstract questions to visual transformations; generation-side degradation with reasoning depth) via a two-stage training process: Reasoning Imitation (SFT on edit trajectories) followed by Reasoning Enhancement (VLM-derived rewards on edit correctness and downstream accuracy). The editor is claimed to plug in training-free to open- and closed-source MLLMs, yielding Pass@1 gains of +4.82 (Qwen3-VL-8B), +5.47 (Gemini-3.1-Flash-Lite), and +4.61 (Kimi K2.5) averaged across five task families: fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding.
Significance. If the decoupling and generalization claims hold under held-out reward models and rigorous controls, the work would offer a practical modular route to improve visual reasoning without retraining large MLLMs or relying on fixed toolkits. The explicit targeting of the two identified gaps and the two-stage recipe are conceptually clear; reproducible code or parameter-free derivations are not mentioned.
major comments (3)
- [Reasoning Enhancement stage (abstract and §3)] The manuscript does not state whether the VLM supplying the downstream reasoning accuracy reward matches any of the three evaluation models (Qwen3-VL-8B, Gemini-3.1-Flash-Lite, Kimi K2.5) or is held out. This detail is load-bearing for the central claim that a single trained editor works training-free across arbitrary open- and closed-source MLLMs.
- [Evaluation / Results section] Results across five task families: the reported average Pass@1 gains lack any description of the exact baselines, number of evaluation runs, statistical significance tests, data splits, or variance. Without these, the numerical improvements cannot be assessed as robust support for the method's effectiveness.
- [Reasoning Enhancement and experiments] The claim that edit correctness is maintained as reasoning depth increases is central to closing the generation-side gap, yet no depth-stratified ablation or analysis is referenced to validate this assumption empirically.
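The depth-stratified analysis requested in the third major comment could be as simple as grouping edits by reasoning depth and reporting the success rate per group. The sketch below assumes a hypothetical record format of `(depth, edit_correct)` pairs; the function name and the toy records are illustrative placeholders, not data from the paper.

```python
from collections import defaultdict


def success_rate_by_depth(records):
    """Map each reasoning depth to the fraction of edits judged correct."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for depth, is_correct in records:
        totals[depth] += 1
        correct[depth] += int(is_correct)
    # Sort by depth so any trend is easy to read off.
    return {d: correct[d] / totals[d] for d in sorted(totals)}


# Placeholder records (depth, edit_correct); NOT results from the paper.
toy_records = [(1, True), (1, True), (2, True), (2, False), (3, False), (3, True)]
rates = success_rate_by_depth(toy_records)
print(rates)
```

A flat or gently declining curve for the trained editor, next to a steeper decline for an off-the-shelf editor, would be the kind of evidence that addresses the generation-side gap.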
minor comments (1)
- [Abstract] The abstract would benefit from a single sentence clarifying the reward VLM's relationship to the test models.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without altering its core claims.
- Referee: [Reasoning Enhancement stage (abstract and §3)] The manuscript does not state whether the VLM supplying the downstream reasoning accuracy reward matches any of the three evaluation models (Qwen3-VL-8B, Gemini-3.1-Flash-Lite, Kimi K2.5) or is held out. This detail is load-bearing for the central claim that a single trained editor works training-free across arbitrary open- and closed-source MLLMs.
  Authors: The VLM providing the downstream reasoning accuracy reward is a held-out model distinct from Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5. This choice directly supports the generalization claim. We will revise the abstract and §3 to state this explicitly.
  Revision: yes
- Referee: [Evaluation / Results section] Results across five task families: the reported average Pass@1 gains lack any description of the exact baselines, number of evaluation runs, statistical significance tests, data splits, or variance. Without these, the numerical improvements cannot be assessed as robust support for the method's effectiveness.
  Authors: We agree additional methodological details are required. The baselines are the three MLLMs with direct inference (no editing). All results are means over three independent runs with reported standard deviation; data splits follow the public benchmark protocols; significance is assessed via paired t-tests. We will add a dedicated table in the Results section documenting these elements and the variance.
  Revision: yes
- Referee: [Reasoning Enhancement and experiments] The claim that edit correctness is maintained as reasoning depth increases is central to closing the generation-side gap, yet no depth-stratified ablation or analysis is referenced to validate this assumption empirically.
  Authors: We will add a new depth-stratified ablation (edit success rate versus reasoning depth) to the experiments section or appendix to empirically support the claim that correctness holds as depth grows.
  Revision: yes
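The second response mentions paired t-tests over three independent runs. As a rough illustration of what that computation involves, the sketch below derives the paired t statistic from per-run scores using only the Python standard library. The function name and the per-run numbers are placeholders chosen for the example, not values from the paper.

```python
import math
import statistics


def paired_t_statistic(baseline, treated):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [t - b for b, t in zip(baseline, treated)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Placeholder per-run accuracies; NOT numbers from the paper.
baseline_runs = [50.0, 52.0, 51.0]
edited_runs = [54.0, 57.0, 58.0]
t_stat, dof = paired_t_statistic(baseline_runs, edited_runs)
print(f"t = {t_stat:.3f}, df = {dof}")
```

With only three runs the test has two degrees of freedom, so the two-sided 5% critical value is about 4.30; this is one reason the promised table of per-run variance matters for judging the gains.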
Circularity Check
No significant circularity; training stages and decoupling are independent of reported test gains.
The paper presents an empirical method with two explicit training stages (SFT on trajectories, then reward-based enhancement) for a decoupled editor, followed by training-free application to multiple MLLMs and measurement on external task families. No equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided text. The decoupling claim and cross-model gains are stated directly without reduction to the training VLM by construction. This is a standard non-circular empirical result.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: image editors can be conditioned on questions to produce edits that improve downstream multimodal reasoning accuracy.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: ETCHR trains a question-conditioned image editor in two stages... Reasoning Imitation via supervised fine-tuning... Reasoning Enhancement with VLM-derived rewards
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Reward Design... Editing Guidance Reward... Editing Correctness Reward... convex sum α=β=0.5
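The quoted passage mentions a convex sum with α=β=0.5 over two reward terms. A minimal Python sketch of such a combination follows, assuming both rewards are scalars on a common scale; that assumption, the function name, and the example inputs are illustrative, since the excerpt elides how the paper actually defines and scales each term.

```python
def combined_reward(guidance: float, correctness: float,
                    alpha: float = 0.5, beta: float = 0.5) -> float:
    """Convex combination of two scalar rewards (weights must sum to 1)."""
    if alpha < 0 or beta < 0 or abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha and beta must be non-negative and sum to 1")
    return alpha * guidance + beta * correctness


# Illustrative inputs only; not reward values from the paper.
r = combined_reward(guidance=0.8, correctness=0.4)
print(round(r, 3))
```

With equal weights the combined value is simply the average of the two terms, so neither term can dominate the training signal by weighting alone.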
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.