MUSE: A Unified Agentic Harness for MLLMs
Pith reviewed 2026-06-28 11:12 UTC · model grok-4.3
The pith
A unified execution harness around frozen MLLMs produces consistent gains by fixing scaffold issues rather than retraining the model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MUSE is a multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair, without any model retraining. When evaluated across diverse benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination using multiple state-of-the-art MLLMs, MUSE delivers consistent gains over the bare model in all settings, with the largest jumps on challenging instances. Further analysis reveals that many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits, and can be addressed through verifier-guided repair without touching the model.
What carries the argument
MUSE itself: a multimodal unified structured execution harness whose composable modules and verifier-guided repair mechanism wrap a frozen MLLM and improve its output without retraining.
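The text reviewed here does not spell out module interfaces, so the following is only a minimal sketch of what a parse, verify, and repair loop around a frozen model could look like. It uses the grid-maze task mentioned in the abstract as a toy case. Every name (`parse_moves`, `verify`, `run_harness`, `scripted_model`), the prompt wording, and the grid encoding are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[str]  # assumed encoding: '#' wall, '.' free, 'S' start, 'G' goal
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def parse_moves(raw: str) -> Optional[str]:
    """Structured parsing: keep the text after 'Answer:' and validate it."""
    _, sep, tail = raw.partition("Answer:")
    moves = tail.strip().upper()
    if not sep or not moves or any(m not in MOVES for m in moves):
        return None
    return moves


def find(grid: Grid, ch: str) -> Tuple[int, int]:
    """Locate the first cell containing ch."""
    for r, row in enumerate(grid):
        c = row.find(ch)
        if c != -1:
            return r, c
    raise ValueError(f"{ch!r} not in grid")


def verify(grid: Grid, moves: str) -> Optional[str]:
    """Deterministic verification: None if the path is valid, else an error."""
    r, c = find(grid, "S")
    for i, m in enumerate(moves):
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if not (0 <= r < len(grid) and 0 <= c < len(grid[0])) or grid[r][c] == "#":
            return f"move {i + 1} ({m}) hits a wall or leaves the grid"
    if grid[r][c] != "G":
        return "path does not end on the goal"
    return None


def run_harness(model: Callable[[str], str], grid: Grid, max_repairs: int = 2):
    """Query the frozen model, verify its answer, and feed errors back."""
    prompt = "Give moves from S to G as 'Answer: <UDLR string>'.\n" + "\n".join(grid)
    for attempt in range(max_repairs + 1):
        moves = parse_moves(model(prompt))
        error = "could not parse the answer" if moves is None else verify(grid, moves)
        if error is None:
            return moves, attempt  # attempt == number of repairs used
        # Verifier-guided repair: only the prompt changes, never the model.
        prompt += f"\nYour previous answer failed: {error}. Try again."
    return None, max_repairs


def scripted_model(replies: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an MLLM that replays fixed replies."""
    it = iter(replies)
    return lambda prompt: next(it)
```

With the grid `["S.#", "#.G"]` and a scripted model that first answers `Answer: DR` and then `Answer: RDR`, `run_harness` rejects the first path at the wall and returns `("RDR", 1)`. The point of the sketch is that the verifier is ordinary code, so the repair signal does not depend on the model judging its own output.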
If this is right
- Consistent gains appear across visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination benchmarks.
- The largest improvements occur on the most challenging instances.
- Verifier-guided repair corrects many failures without any change to the underlying model.
- The same harness produces gains when paired with multiple different state-of-the-art MLLMs.
Where Pith is reading between the lines
- If harness-level fixes continue to work, development priorities may shift toward scaffold design alongside model scaling.
- The same modular structure could be tried on language-only models or robotic control tasks to test generality.
- Expanding evaluation to real-world deployment settings would show whether the harness gains persist outside curated benchmarks.
Load-bearing premise
The reported gains are produced by the MUSE modules and verifier-guided repair rather than by differences in prompts, benchmark selection, or evaluation protocols.
What would settle it
An experiment that runs MUSE and the bare model with identical prompts and evaluation protocols on the original benchmarks and finds no performance difference, or one that shows the gains disappearing on a wider collection of multimodal tasks.
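A controlled comparison of this kind reduces to paired per-instance outcomes for the two systems on the same items. The sketch below is one way to summarise them: it reports the accuracy difference and an exact two-sided sign test on the discordant pairs (equivalent to an exact McNemar test). The function name and return layout are assumptions for illustration, and the paper does not state which test, if any, it uses.

```python
from math import comb
from typing import Dict, Sequence


def paired_comparison(bare: Sequence[bool], harness: Sequence[bool]) -> Dict[str, float]:
    """Compare per-instance correctness of two systems on the same items."""
    if len(bare) != len(harness) or not bare:
        raise ValueError("both systems must be scored on the same non-empty set")
    n = len(bare)
    wins = sum(h and not b for b, h in zip(bare, harness))    # only harness correct
    losses = sum(b and not h for b, h in zip(bare, harness))  # only bare correct
    delta = (sum(harness) - sum(bare)) / n
    # Under the null hypothesis, wins ~ Binomial(wins + losses, 0.5).
    k = wins + losses
    tail = sum(comb(k, i) for i in range(min(wins, losses) + 1)) / 2 ** k if k else 1.0
    return {"delta": delta, "wins": wins, "losses": losses,
            "p_value": min(1.0, 2 * tail)}
```

A result with `delta` near zero and a large `p_value` on the original benchmarks would be the "no performance difference" outcome described above. The test only addresses sampling noise; it says nothing about whether prompts and protocols were actually matched.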
Original abstract
Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a grid maze from a screenshot or selecting the correct puzzle piece. Rather than retraining the model, we ask a complementary question: how much capability can be elicited from a frozen MLLM purely by improving the execution scaffold around it? We introduce MUSE, a multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair, without any model retraining. We evaluate MUSE across diverse benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination, using multiple state-of-the-art MLLMs. MUSE delivers consistent gains over the bare model in all settings, with the largest jumps on challenging instances. Further analysis reveals that many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits, and can be addressed through verifier-guided repair without touching the model. These findings highlight the agentic multimodal harness as a critical yet underexplored design dimension, offering an orthogonal avenue for improving MLLMs beyond model-centric optimization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MUSE, a multimodal unified structured execution harness that wraps any frozen off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair. It evaluates the harness on benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination using multiple MLLMs, claiming consistent gains over the bare model (largest on challenging instances) and that many failures stem from harness-level issues addressable via verifier-guided repair without model changes.
Significance. If the reported gains are shown to be robustly attributable to the specific MUSE modules rather than prompt or protocol variations, the work would establish agentic multimodal harness design as a meaningful orthogonal axis for MLLM improvement, complementing model-centric approaches and highlighting execution scaffolds as an underexplored lever.
major comments (2)
- [Abstract / Evaluation description] The central attribution claim—that performance deltas arise from the MUSE modules (task representation, perception tools, deterministic verification, verifier-guided repair) rather than incidental prompt or output-format differences—requires explicit isolation of the bare-model baseline. The abstract states MUSE 'wraps a frozen MLLM' and delivers gains 'over the bare model,' but provides no statement that the bare baseline uses identical initial prompts, visual preprocessing, or structured output format before any verification step.
- [Further analysis paragraph] The analysis that 'many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits' is load-bearing for the paper's framing, yet the provided description contains no quantitative breakdown (e.g., fraction of errors fixed by verifier-guided repair alone, or per-module ablation) that would allow readers to assess how much of the observed gain is due to repair versus earlier scaffolding steps.
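One way to make the requested isolation concrete is to enumerate the configurations an ablation would need: the bare model, the full harness, and the full harness with one module removed at a time. The sketch below does only that. The module names follow the paper's list, but the flag layout, the class name, and the rule tying repair to verification are assumptions.

```python
from dataclasses import dataclass, fields, replace
from typing import Dict


@dataclass(frozen=True)
class HarnessConfig:
    # One switch per module named in the abstract (layout is hypothetical).
    task_representation: bool = False
    visual_processing: bool = False
    perception_tools: bool = False
    structured_parsing: bool = False
    deterministic_verification: bool = False
    verifier_guided_repair: bool = False


def ablation_grid() -> Dict[str, HarnessConfig]:
    """Return the bare, full, and leave-one-out configurations."""
    names = [f.name for f in fields(HarnessConfig)]
    full = HarnessConfig(**{n: True for n in names})
    grid = {"bare": HarnessConfig(), "full": full}
    for n in names:
        cfg = replace(full, **{n: False})
        if n == "deterministic_verification":
            # Assumption: repair consumes verifier output, so it is dropped too.
            cfg = replace(cfg, verifier_guided_repair=False)
        grid[f"full_minus_{n}"] = cfg
    return grid
```

Running every configuration with the same initial prompt and output format would separate module effects from prompt effects, which is the gap the first major comment points at.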
minor comments (1)
- [Abstract] The abstract refers to 'multiple state-of-the-art MLLMs' and 'diverse benchmarks' without naming the models or listing the specific benchmark suites and metrics used.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on baseline isolation and quantitative analysis. We address each major comment below.
Point-by-point responses
- Referee: [Abstract / Evaluation description] The central attribution claim, that performance deltas arise from the MUSE modules (task representation, perception tools, deterministic verification, verifier-guided repair) rather than incidental prompt or output-format differences, requires explicit isolation of the bare-model baseline. The abstract states MUSE 'wraps a frozen MLLM' and delivers gains 'over the bare model,' but provides no statement that the bare baseline uses identical initial prompts, visual preprocessing, or structured output format before any verification step.
  Authors: We agree that the abstract and evaluation description should explicitly isolate the bare-model baseline to support attribution to the MUSE modules. The bare baseline is the off-the-shelf MLLM prompted directly with the benchmark task query and image using its standard interface, without MUSE's task representation, visual processing, perception tools, structured parsing, verification, or repair. We will revise the abstract and evaluation section to state this configuration clearly, ensuring the comparison isolates the harness contributions.
  Revision: yes
- Referee: [Further analysis paragraph] The analysis that 'many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits' is load-bearing for the paper's framing, yet the provided description contains no quantitative breakdown (e.g., fraction of errors fixed by verifier-guided repair alone, or per-module ablation) that would allow readers to assess how much of the observed gain is due to repair versus earlier scaffolding steps.
  Authors: We acknowledge that the current analysis relies on qualitative case studies of verifier-detected harness issues (e.g., parsing or grounding errors) that are repaired without model changes. A quantitative breakdown of repair-only fixes versus other modules would strengthen the claim. We will add this analysis to the further analysis section using our existing experimental logs, including fractions of errors addressed by repair and per-module contributions where measurable.
  Revision: yes
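The breakdown promised in this response could be computed from per-instance logs along the following lines. This is a sketch under an assumed log schema: the keys `bare_correct`, `first_pass_correct`, and `final_correct` are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def repair_breakdown(logs: Iterable[Mapping[str, bool]]) -> Dict[str, float]:
    """Split the bare model's errors by the stage at which they were fixed."""
    counts: Counter = Counter()
    for rec in logs:
        if rec["bare_correct"]:
            counts["bare_correct"] += 1
        elif rec["first_pass_correct"]:
            counts["fixed_before_repair"] += 1  # scaffolding alone was enough
        elif rec["final_correct"]:
            counts["fixed_by_repair"] += 1      # needed verifier-guided repair
        else:
            counts["unresolved"] += 1
    bare_errors = sum(counts.values()) - counts["bare_correct"]
    if bare_errors == 0:
        return {}
    return {key: counts[key] / bare_errors
            for key in ("fixed_before_repair", "fixed_by_repair", "unresolved")}
```

The three fractions sum to one over the instances the bare model got wrong. Instances the bare model solved but the harness broke are not counted here, so a complete analysis would report those regressions separately.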
Circularity Check
No significant circularity; empirical engineering contribution on external benchmarks
Full rationale
The paper introduces MUSE as a composable harness around frozen MLLMs and reports empirical gains on diverse external benchmarks. No equations, fitted parameters, predictions derived from inputs, or self-citation chains are present in the provided text. The central claim rests on benchmark deltas rather than any derivation that reduces to its own inputs by construction. This is a standard self-contained empirical result with no load-bearing self-referential steps.