
arxiv: 2606.03005 · v1 · pith:P7JA5REZ · submitted 2026-06-02 · 💻 cs.CV · cs.AI

MUSE: A Unified Agentic Harness for MLLMs

Pith reviewed 2026-06-28 11:12 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords MUSE · MLLM · agentic harness · verifier-guided repair · execution scaffold · multimodal reasoning · frozen models · visual benchmarks

The pith

A unified execution harness around frozen MLLMs produces consistent gains by fixing scaffold issues rather than retraining the model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks how much capability can be elicited from a frozen multimodal large language model by improving only the execution scaffold around it. It presents MUSE as a composable harness that adds modules for task representation, visual processing, tool use, structured parsing, deterministic verification, and verifier-guided repair. Tests across benchmarks for spatial planning, perception, reasoning, and discrimination show steady improvements over the bare model, with the biggest lifts on hard cases. Analysis indicates that many failures trace to harness shortcomings addressable by repair steps rather than to limits inside the model itself. If correct, this identifies scaffold design as a direct lever for better results without the expense of model changes.

Core claim

MUSE is a multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair, without any model retraining. When evaluated across diverse benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination using multiple state-of-the-art MLLMs, MUSE delivers consistent gains over the bare model in all settings, with the largest jumps on challenging instances. Further analysis reveals that many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits, and can be addressed through verifier-guided repair without touching the model.

What carries the argument

MUSE, the multimodal unified structured execution harness with its composable modules and verifier-guided repair mechanism that wraps a frozen MLLM to improve output without retraining.
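
To make the verifier-guided repair idea concrete, here is a minimal Python sketch of such a loop on a toy grid maze. It is an illustration under stated assumptions, not the paper's implementation: the names `run_with_repair`, `parse_moves`, `verify_path`, `stub_model`, the `Answer:` output format, the grid, and the repair prompt wording are all hypothetical, and the stub stands in for a frozen MLLM.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical 3x3 maze: S = start (top-left), G = goal, # = wall.
GRID = ["S.#",
        "..#",
        "#.G"]
STEP = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


@dataclass
class Verdict:
    ok: bool
    feedback: str = ""


def parse_moves(raw: str) -> Optional[str]:
    """Structured parsing: extract a move string such as 'DRDR', or None."""
    match = re.search(r"Answer:\s*([UDLR]+)", raw)
    return match.group(1) if match else None


def verify_path(moves: str) -> Verdict:
    """Deterministic verification: simulate the moves on GRID."""
    r, c = 0, 0
    for i, mv in enumerate(moves, start=1):
        dr, dc = STEP[mv]
        r, c = r + dr, c + dc
        if not (0 <= r < len(GRID) and 0 <= c < len(GRID[0])):
            return Verdict(False, f"move {i} ({mv}) leaves the grid")
        if GRID[r][c] == "#":
            return Verdict(False, f"move {i} ({mv}) hits a wall")
    if GRID[r][c] != "G":
        return Verdict(False, "path does not end at the goal")
    return Verdict(True)


def run_with_repair(
    model: Callable[[str], str],
    prompt: str,
    parse: Callable[[str], Optional[str]],
    verify: Callable[[str], Verdict],
    max_repairs: int = 2,
) -> Tuple[Optional[str], int]:
    """Query the frozen model, verify, and re-prompt with verifier feedback.

    Returns (answer, repairs_used) on success, (None, max_repairs) otherwise.
    The model itself is never modified; only the prompt changes.
    """
    current = prompt
    for attempt in range(max_repairs + 1):
        raw = model(current)
        parsed = parse(raw)
        verdict = verify(parsed) if parsed is not None else Verdict(
            False, "output could not be parsed")
        if verdict.ok:
            return parsed, attempt
        current = (f"{prompt}\n\nPrevious output: {raw}\n"
                   f"Verifier feedback: {verdict.feedback}\n"
                   "Please give a corrected answer.")
    return None, max_repairs


def stub_model(prompt: str) -> str:
    """Stand-in for a frozen MLLM: wrong first, correct once given feedback."""
    return "Answer: DRDR" if "Verifier feedback" in prompt else "Answer: RRDD"


result = run_with_repair(stub_model, "Find a path from S to G.",
                         parse_moves, verify_path)
print(result)
```

In this toy run the stub's first path walks into a wall, the verifier reports which move failed, and the second attempt is accepted after one repair. The design point the sketch tries to capture is that the check is deterministic code rather than another model call, so the feedback is specific and the same model can be reused unchanged.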

If this is right

  • Consistent gains appear across visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination benchmarks.
  • The largest improvements occur on the most challenging instances.
  • Verifier-guided repair corrects many failures without any change to the underlying model.
  • The same harness produces gains when paired with multiple different state-of-the-art MLLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If harness-level fixes continue to work, development priorities may shift toward scaffold design alongside model scaling.
  • The same modular structure could be tried on language-only models or robotic control tasks to test generality.
  • Expanding evaluation to real-world deployment settings would show whether the harness gains persist outside curated benchmarks.

Load-bearing premise

The reported gains are produced by the MUSE modules and verifier-guided repair rather than by differences in prompts, benchmark selection, or evaluation protocols.

What would settle it

An experiment that applies MUSE with identical prompts and evaluation protocols to the original benchmarks and finds no performance difference from the bare model, or that shows the gains disappear on a wider collection of multimodal tasks.
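
One way such a settling experiment could be scored is a paired, per-instance comparison of the bare model and the harness on the same items. The sketch below is an assumption about how a reader might run that check, not something the paper reports: `paired_comparison` is a hypothetical helper, the correctness lists are made-up placeholders, and the test used is an exact two-sided sign test on discordant pairs (the exact form of McNemar's test).

```python
from math import comb
from typing import Dict, Sequence


def paired_comparison(bare_correct: Sequence[bool],
                      muse_correct: Sequence[bool]) -> Dict[str, float]:
    """Compare two systems on the same instances.

    gained: instances the harness gets right and the bare model gets wrong.
    lost:   instances the bare model gets right and the harness gets wrong.
    p_value: exact two-sided binomial test (p = 0.5) on discordant pairs.
    """
    if len(bare_correct) != len(muse_correct) or not bare_correct:
        raise ValueError("need two non-empty lists of equal length")
    gained = sum(1 for b, m in zip(bare_correct, muse_correct) if m and not b)
    lost = sum(1 for b, m in zip(bare_correct, muse_correct) if b and not m)
    n = gained + lost
    if n == 0:
        p_value = 1.0  # no discordant pairs: no evidence of any difference
    else:
        k = min(gained, lost)
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        p_value = min(1.0, 2 * tail)
    return {
        "gained": gained,
        "lost": lost,
        "accuracy_delta": (gained - lost) / len(bare_correct),
        "p_value": p_value,
    }


# Placeholder per-instance outcomes, purely illustrative.
bare = [True, False, False, False, True, False]
muse = [True, True, True, True, True, False]
report = paired_comparison(bare, muse)
print(report)
```

Pairing matters here because both systems see the same instances: only the items where they disagree carry information about the harness, and a result with no discordant pairs under identical prompts would be the null outcome described above.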

Original abstract

Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a grid maze from a screenshot or selecting the correct puzzle piece. Rather than retraining the model, we ask a complementary question: how much capability can be elicited from a frozen MLLM purely by improving the execution scaffold around it? We introduce MUSE, a multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair, without any model retraining. We evaluate MUSE across diverse benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination, using multiple state-of-the-art MLLMs. MUSE delivers consistent gains over the bare model in all settings, with the largest jumps on challenging instances. Further analysis reveals that many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits, and can be addressed through verifier-guided repair without touching the model. These findings highlight the agentic multimodal harness as a critical yet underexplored design dimension, offering an orthogonal avenue for improving MLLMs beyond model-centric optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MUSE, a multimodal unified structured execution harness that wraps any frozen off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair. It evaluates the harness on benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination using multiple MLLMs, claiming consistent gains over the bare model (largest on challenging instances) and that many failures stem from harness-level issues addressable via verifier-guided repair without model changes.

Significance. If the reported gains are shown to be robustly attributable to the specific MUSE modules rather than prompt or protocol variations, the work would establish agentic multimodal harness design as a meaningful orthogonal axis for MLLM improvement, complementing model-centric approaches and highlighting execution scaffolds as an underexplored lever.

major comments (2)
  1. [Abstract / Evaluation description] The central attribution claim—that performance deltas arise from the MUSE modules (task representation, perception tools, deterministic verification, verifier-guided repair) rather than incidental prompt or output-format differences—requires explicit isolation of the bare-model baseline. The abstract states MUSE 'wraps a frozen MLLM' and delivers gains 'over the bare model,' but provides no statement that the bare baseline uses identical initial prompts, visual preprocessing, or structured output format before any verification step.
  2. [Further analysis paragraph] The analysis that 'many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits' is load-bearing for the paper's framing, yet the provided description contains no quantitative breakdown (e.g., fraction of errors fixed by verifier-guided repair alone, or per-module ablation) that would allow readers to assess how much of the observed gain is due to repair versus earlier scaffolding steps.
minor comments (1)
  1. [Abstract] The abstract refers to 'multiple state-of-the-art MLLMs' and 'diverse benchmarks' without naming the models or listing the specific benchmark suites and metrics used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on baseline isolation and quantitative analysis. We address each major comment below.

  1. Referee: [Abstract / Evaluation description] The central attribution claim—that performance deltas arise from the MUSE modules (task representation, perception tools, deterministic verification, verifier-guided repair) rather than incidental prompt or output-format differences—requires explicit isolation of the bare-model baseline. The abstract states MUSE 'wraps a frozen MLLM' and delivers gains 'over the bare model,' but provides no statement that the bare baseline uses identical initial prompts, visual preprocessing, or structured output format before any verification step.

    Authors: We agree that the abstract and evaluation description should explicitly isolate the bare-model baseline to support attribution to the MUSE modules. The bare baseline is the off-the-shelf MLLM prompted directly with the benchmark task query and image using its standard interface, without MUSE's task representation, visual processing, perception tools, structured parsing, verification, or repair. We will revise the abstract and evaluation section to state this configuration clearly, ensuring the comparison isolates the harness contributions.

    revision: yes

  2. Referee: [Further analysis paragraph] The analysis that 'many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits' is load-bearing for the paper's framing, yet the provided description contains no quantitative breakdown (e.g., fraction of errors fixed by verifier-guided repair alone, or per-module ablation) that would allow readers to assess how much of the observed gain is due to repair versus earlier scaffolding steps.

    Authors: We acknowledge that the current analysis relies on qualitative case studies of verifier-detected harness issues (e.g., parsing or grounding errors) that are repaired without model changes. A quantitative breakdown of repair-only fixes versus other modules would strengthen the claim. We will add this analysis to the further analysis section using our existing experimental logs, including fractions of errors addressed by repair and per-module contributions where measurable.

    revision: yes
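
The quantitative breakdown requested in the second comment could be computed from per-instance logs along the following lines. This is a hedged sketch only: the record fields `bare_correct`, `pre_repair_correct`, and `final_correct`, the helper `attribute_fixes`, and the sample records are hypothetical and do not reflect the authors' actual log format or results.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def attribute_fixes(records: Iterable[Mapping[str, bool]]) -> Dict[str, float]:
    """Split bare-model errors by where (if anywhere) the harness fixed them.

    scaffold: already correct before any repair step ran.
    repair:   wrong before repair, correct after verifier-guided repair.
    unfixed:  still wrong at the end of the pipeline.
    Returns fractions of bare-model errors; empty dict if there were none.
    """
    counts: Counter = Counter()
    for rec in records:
        if rec["bare_correct"]:
            continue  # only bare-model failures are attributed
        if rec["pre_repair_correct"] and rec["final_correct"]:
            counts["scaffold"] += 1
        elif rec["final_correct"]:
            counts["repair"] += 1
        else:
            counts["unfixed"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: counts[key] / total
            for key in ("scaffold", "repair", "unfixed")}


# Placeholder log records, purely illustrative.
logs = [
    {"bare_correct": True, "pre_repair_correct": True, "final_correct": True},
    {"bare_correct": False, "pre_repair_correct": True, "final_correct": True},
    {"bare_correct": False, "pre_repair_correct": False, "final_correct": True},
    {"bare_correct": False, "pre_repair_correct": False, "final_correct": True},
    {"bare_correct": False, "pre_repair_correct": False, "final_correct": False},
]
breakdown = attribute_fixes(logs)
print(breakdown)
```

Reporting the three fractions side by side would let readers see how much of the gain comes from repair itself versus the earlier scaffolding steps, which is the distinction the referee asks for.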

Circularity Check

0 steps flagged

No significant circularity; empirical engineering contribution on external benchmarks

The paper introduces MUSE as a composable harness around frozen MLLMs and reports empirical gains on diverse external benchmarks. No equations, fitted parameters, predictions derived from inputs, or self-citation chains are present in the provided text. The central claim rests on benchmark deltas rather than any derivation that reduces to its own inputs by construction. This is a standard self-contained empirical result with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5759 in / 1086 out tokens · 27922 ms · 2026-06-28T11:12:41.357637+00:00 · methodology

