MUSE: A Unified Agentic Harness for MLLMs

Hailing Wang; Jianglin Lu; Mingyuan Zhang; Qihua Dong; Xu Ma; Yizhou Wang; Yun Fu

arxiv: 2606.03005 · v1 · pith:P7JA5REZ · submitted 2026-06-02 · 💻 cs.CV · cs.AI

Jianglin Lu, Hailing Wang, Xu Ma, Qihua Dong, Mingyuan Zhang, Yizhou Wang, Yun Fu

Pith reviewed 2026-06-28 11:12 UTC · model grok-4.3

keywords: MUSE, MLLM, agentic harness, verifier-guided repair, execution scaffold, multimodal reasoning, frozen models, visual benchmarks

The pith

A unified execution harness around frozen MLLMs produces consistent gains by fixing scaffold issues rather than retraining the model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks how much capability can be elicited from a frozen multimodal large language model by improving only the execution scaffold around it. It presents MUSE as a composable harness that adds modules for task representation, visual processing, tool use, structured parsing, deterministic verification, and verifier-guided repair. Tests across benchmarks for spatial planning, perception, reasoning, and discrimination show steady improvements over the bare model, with the biggest lifts on hard cases. Analysis indicates that many failures trace to harness shortcomings addressable by repair steps rather than to limits inside the model itself. If correct, this identifies scaffold design as a direct lever for better results without the expense of model changes.

Core claim

MUSE is a multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair, without any model retraining. When evaluated across diverse benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination using multiple state-of-the-art MLLMs, MUSE delivers consistent gains over the bare model in all settings, with the largest jumps on challenging instances. Further analysis reveals that many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits, and can be addressed through verifier-guided repair without touching the model.

What carries the argument

MUSE, the multimodal unified structured execution harness with its composable modules and verifier-guided repair mechanism that wraps a frozen MLLM to improve output without retraining.
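
The verify-then-repair cycle named above can be illustrated with a minimal sketch. This is not the authors' implementation: the maze encoding, the names `parse_moves`, `verify`, and `solve_with_repair`, the `max_repairs` budget, and the stub model are all assumptions made for illustration, loosely based on the grid-maze task mentioned in the paper's abstract.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[str]  # hypothetical encoding: '#' wall, '.' free, 'S' start, 'G' goal
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def find(grid: Grid, ch: str) -> Tuple[int, int]:
    """Return the (row, col) of the first cell containing ch."""
    for r, row in enumerate(grid):
        c = row.find(ch)
        if c != -1:
            return r, c
    raise ValueError(f"{ch!r} not in grid")


def parse_moves(raw: str) -> List[str]:
    """Structured parsing: keep only recognised move letters."""
    return [ch for ch in raw.upper() if ch in MOVES]


def verify(grid: Grid, moves: List[str]) -> Optional[str]:
    """Deterministic verifier: None if the path is valid, else an error message."""
    r, c = find(grid, "S")
    for i, m in enumerate(moves):
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        out_of_bounds = not (0 <= r < len(grid) and 0 <= c < len(grid[0]))
        if out_of_bounds or grid[r][c] == "#":
            return f"move {i} ({m}) hits a wall or leaves the grid at {(r, c)}"
    if (r, c) != find(grid, "G"):
        return f"path ends at {(r, c)}, not at the goal"
    return None


def solve_with_repair(
    grid: Grid,
    model: Callable[[Grid, Optional[str]], str],
    max_repairs: int = 2,
) -> Tuple[Optional[List[str]], int]:
    """Query the frozen model, verify, and feed verifier errors back for repair."""
    feedback: Optional[str] = None
    for attempt in range(max_repairs + 1):
        moves = parse_moves(model(grid, feedback))
        feedback = verify(grid, moves)
        if feedback is None:
            return moves, attempt  # attempt counts the repairs that were needed
    return None, max_repairs


# Stub standing in for a frozen MLLM: wrong at first, correct once it sees feedback.
def stub_model(grid: Grid, feedback: Optional[str]) -> str:
    return "R R D" if feedback is None else "D, R, R"


GRID = ["S.#", "..G"]
result = solve_with_repair(GRID, stub_model)
```

In this toy run the first answer walks into a wall, the verifier reports which move failed, and the second answer passes. The model itself is never modified, which is the property the harness framing relies on.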

If this is right

Consistent gains appear across visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination benchmarks.
The largest improvements occur on the most challenging instances.
Verifier-guided repair corrects many failures without any change to the underlying model.
The same harness produces gains when paired with multiple different state-of-the-art MLLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If harness-level fixes continue to work, development priorities may shift toward scaffold design alongside model scaling.
The same modular structure could be tried on language-only models or robotic control tasks to test generality.
Expanding evaluation to real-world deployment settings would show whether the harness gains persist outside curated benchmarks.

Load-bearing premise

The reported gains are produced by the MUSE modules and verifier-guided repair rather than by differences in prompts, benchmark selection, or evaluation protocols.

What would settle it

An experiment that applies MUSE with identical prompts and evaluation protocols to the original benchmarks and finds no performance difference from the bare model, or that shows the gains disappear on a wider collection of multimodal tasks.
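
One way to read "no performance difference" in such an experiment is a paired test over the same benchmark instances. The sketch below uses an exact McNemar test on per-instance correctness; the choice of test, the function name `mcnemar_exact`, and the toy outcome vectors are assumptions for illustration and do not come from the paper.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(bare: Sequence[bool], harness: Sequence[bool]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired per-instance correctness.

    Returns (b, c, p) where b counts instances only the bare model solved,
    c counts instances only the harnessed model solved, and p is the
    two-sided binomial p-value under the null that b and c are equally likely.
    """
    if len(bare) != len(harness):
        raise ValueError("outcome vectors must be paired, one entry per instance")
    b = sum(1 for x, y in zip(bare, harness) if x and not y)
    c = sum(1 for x, y in zip(bare, harness) if not x and y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy outcomes for eight instances (hypothetical, not results from the paper).
bare_ok = [True, False, False, False, True, False, False, False]
harness_ok = [True, True, True, True, False, True, True, False]
b, c, p = mcnemar_exact(bare_ok, harness_ok)
```

Only discordant instances carry information here, so a harness that flips many failures to successes while breaking few previously correct answers yields a small p-value, provided both arms share prompts and evaluation protocol.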

Original abstract

Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a grid maze from a screenshot or selecting the correct puzzle piece. Rather than retraining the model, we ask a complementary question: how much capability can be elicited from a frozen MLLM purely by improving the execution scaffold around it? We introduce MUSE, a multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair, without any model retraining. We evaluate MUSE across diverse benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination, using multiple state-of-the-art MLLMs. MUSE delivers consistent gains over the bare model in all settings, with the largest jumps on challenging instances. Further analysis reveals that many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits, and can be addressed through verifier-guided repair without touching the model. These findings highlight the agentic multimodal harness as a critical yet underexplored design dimension, offering an orthogonal avenue for improving MLLMs beyond model-centric optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MUSE wraps frozen MLLMs with a six-module harness including verification and repair, claiming gains on visual tasks, but the abstract gives no numbers or baseline details so attribution remains unclear.

The main point on this paper is that MUSE adds a composable harness around off-the-shelf MLLMs with modules for task representation, visual processing, perception tools, structured parsing, deterministic verification, and verifier-guided repair. The authors test it on frozen models across visual spatial planning, perception, reasoning, and discrimination benchmarks and report consistent improvements over the bare model, especially on harder cases. They frame many failures as harness problems rather than model limits.

The work is straightforward engineering. Naming the six modules and showing they can be applied without retraining is a clear contribution, and the perspective that scaffolding can unlock capability is useful for anyone deploying these models in practice. It aligns with existing agentic ideas but packages them specifically for multimodal settings.

The soft spot is the complete absence of numbers, baselines, ablations, or protocol details in the abstract. Without evidence that the bare-model comparison uses the same initial prompts, preprocessing, or output format, it is hard to separate the effect of the verifier-guided repair from simple changes in prompting or parsing. The stress-test concern lands here: the central claim about harness-level fixes needs isolation that the provided description does not supply.

This is aimed at engineers and researchers working on reliable MLLM use in robotics or interactive systems. Readers who want quantitative controls or statistical backing will need the full paper. It is worth sending for peer review because the topic is practical and the framing is coherent; a referee can require the missing comparisons and check whether the gains hold up under tighter controls.

Referee Report

2 major / 1 minor

Summary. The paper introduces MUSE, a multimodal unified structured execution harness that wraps any frozen off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair. It evaluates the harness on benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination using multiple MLLMs, claiming consistent gains over the bare model (largest on challenging instances) and that many failures stem from harness-level issues addressable via verifier-guided repair without model changes.

Significance. If the reported gains are shown to be robustly attributable to the specific MUSE modules rather than prompt or protocol variations, the work would establish agentic multimodal harness design as a meaningful orthogonal axis for MLLM improvement, complementing model-centric approaches and highlighting execution scaffolds as an underexplored lever.

major comments (2)

[Abstract / Evaluation description] The central attribution claim—that performance deltas arise from the MUSE modules (task representation, perception tools, deterministic verification, verifier-guided repair) rather than incidental prompt or output-format differences—requires explicit isolation of the bare-model baseline. The abstract states MUSE 'wraps a frozen MLLM' and delivers gains 'over the bare model,' but provides no statement that the bare baseline uses identical initial prompts, visual preprocessing, or structured output format before any verification step.
[Further analysis paragraph] The analysis that 'many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits' is load-bearing for the paper's framing, yet the provided description contains no quantitative breakdown (e.g., fraction of errors fixed by verifier-guided repair alone, or per-module ablation) that would allow readers to assess how much of the observed gain is due to repair versus earlier scaffolding steps.
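
The quantitative breakdown requested in the second comment could take roughly the following shape. The log schema (`bare_correct`, `first_pass_correct`, `final_correct`), the category names, and the toy records are hypothetical; the paper's actual logs and module boundaries are not described in the provided text.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Record:
    """Hypothetical per-instance log entry."""
    bare_correct: bool        # bare model, no harness
    first_pass_correct: bool  # harness output before any repair step
    final_correct: bool       # harness output after verifier-guided repair


CATEGORIES = ("fixed_before_repair", "fixed_by_repair", "unresolved")


def attribute(rec: Record) -> str:
    """Assign a bare-model failure to the earliest stage that fixes it."""
    if rec.first_pass_correct:
        return "fixed_before_repair"
    if rec.final_correct:
        return "fixed_by_repair"
    return "unresolved"


def breakdown(records: Iterable[Record]) -> Dict[str, float]:
    """Fraction of bare-model failures in each category."""
    failures = [r for r in records if not r.bare_correct]
    if not failures:
        return {}
    counts = Counter(attribute(r) for r in failures)
    return {name: counts[name] / len(failures) for name in CATEGORIES}


# Toy log (hypothetical): one bare success and four bare failures.
logs = [
    Record(True, True, True),
    Record(False, True, True),
    Record(False, True, True),
    Record(False, False, True),
    Record(False, False, False),
]
shares = breakdown(logs)
```

This sketch ignores regressions, that is, instances the bare model solved but the harness broke; a fuller analysis would report those separately.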

minor comments (1)

[Abstract] The abstract refers to 'multiple state-of-the-art MLLMs' and 'diverse benchmarks' without naming the models or listing the specific benchmark suites and metrics used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on baseline isolation and quantitative analysis. We address each major comment below.

Referee: [Abstract / Evaluation description] The central attribution claim—that performance deltas arise from the MUSE modules (task representation, perception tools, deterministic verification, verifier-guided repair) rather than incidental prompt or output-format differences—requires explicit isolation of the bare-model baseline. The abstract states MUSE 'wraps a frozen MLLM' and delivers gains 'over the bare model,' but provides no statement that the bare baseline uses identical initial prompts, visual preprocessing, or structured output format before any verification step.

Authors: We agree that the abstract and evaluation description should explicitly isolate the bare-model baseline to support attribution to the MUSE modules. The bare baseline is the off-the-shelf MLLM prompted directly with the benchmark task query and image using its standard interface, without MUSE's task representation, visual processing, perception tools, structured parsing, verification, or repair. We will revise the abstract and evaluation section to state this configuration clearly, ensuring the comparison isolates the harness contributions.

Revision: yes

Referee: [Further analysis paragraph] The analysis that 'many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits' is load-bearing for the paper's framing, yet the provided description contains no quantitative breakdown (e.g., fraction of errors fixed by verifier-guided repair alone, or per-module ablation) that would allow readers to assess how much of the observed gain is due to repair versus earlier scaffolding steps.

Authors: We acknowledge that the current analysis relies on qualitative case studies of verifier-detected harness issues (e.g., parsing or grounding errors) that are repaired without model changes. A quantitative breakdown of repair-only fixes versus other modules would strengthen the claim. We will add this analysis to the further analysis section using our existing experimental logs, including fractions of errors addressed by repair and per-module contributions where measurable.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical engineering contribution on external benchmarks

The paper introduces MUSE as a composable harness around frozen MLLMs and reports empirical gains on diverse external benchmarks. No equations, fitted parameters, predictions derived from inputs, or self-citation chains are present in the provided text. The central claim rests on benchmark deltas rather than any derivation that reduces to its own inputs by construction. This is a standard self-contained empirical result with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities can be identified from the provided text.

Reference graph

Works this paper leans on

67 extracted references · 23 linked inside Pith

[1]

Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023
[2]

Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

Pith/arXiv arXiv 2025
[3]

The claude 3 model family: Opus, sonnet, haiku

Anthropic. The claude 3 model family: Opus, sonnet, haiku. 2024. Technical Report

2024
[4]

Introducing claude haiku 4.5

Anthropic. Introducing claude haiku 4.5. 2025. https://www.anthropic.com/news/claude-haiku-4-5

2025
[5]

Introducing claude opus 4.7

Anthropic. Introducing claude opus 4.7. 2026. https://www.anthropic.com/news/claude-opus-4-7

2026
[6]

Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Pith/arXiv arXiv 2023
[7]

Harnessengineering

BirgittaBöckeler. Harnessengineering. April2026. Accessed: 2026-04, https://martinfowler.com/articles/exploring- gen-ai/harness-engineering.html

2026
[8]

I improved 15 llms at coding in one afternoon

Can Bölük. I improved 15 llms at coding in one afternoon. only the harness changed. February 2026. Accessed: 2026-04, https://blog.can.ac/2026/02/12/the-harness-problem/

2026
[9]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

1901
[10]

Comt: A novel benchmark for chain of multi-modal thought on large vision-language models

Zihui Cheng, Qiguang Chen, Jin Zhang, Hao Fei, Xiaocheng Feng, Wanxiang Che, Min Li, and Libo Qin. Comt: A novel benchmark for chain of multi-modal thought on large vision-language models. InProceedings of the AAAI Conference on Artificial Intelligence, 2025

2025
[11]

Cot referring: Improving referring expression tasks with grounded reasoning.arXiv preprint arXiv:2510.06243, 2026

Qihua Dong, Luis Figueroa, Handong Zhao, Kushal Kafle, Jason Kuen, Zhihong Ding, Scott Cohen, and Yun Fu. Cot referring: Improving referring expression tasks with grounded reasoning.arXiv preprint arXiv:2510.06243, 2026

arXiv 2026
[12]

Visual reasoning through tool-supervised reinforcement learning.arXiv preprint arXiv:2604.19945, 2026

Qihua Dong, Gozde Sahin, Pei Wang, Zhaowei Cai, Robik Shrestha, Hao Yang, and Davide Modolo. Visual reasoning through tool-supervised reinforcement learning.arXiv preprint arXiv:2604.19945, 2026

Pith/arXiv arXiv 2026
[13]

Ref-adv: Exploring MLLM visual reasoning in referring expression tasks

Qihua Dong, Kuo Yang, Lin Ju, Handong Zhao, Yitian Zhang, Yizhou Wang, Huimin Zeng, Jianglin Lu, and Yun Fu. Ref-adv: Exploring MLLM visual reasoning in referring expression tasks. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[14]

Blink: Multimodal large language models can see but not perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. InEuropean Conference on Computer Vision, pages 148–166. Springer, 2024

2024
[15]

The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024
[16]

Visual programming: Compositional visual reasoning without training

Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14953–14962, 2023

2023
[17]

Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Pith/arXiv arXiv 2024
[18]

Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714, 2023

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714, 2023. 11

Pith/arXiv arXiv 2023
[19]

Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, 2023

2023
[20]

Feedback descent: Open-ended text optimization via pairwise comparison.arXiv preprint arXiv:2511.07919, 2025

Yoonho Lee, Joseph Boen, and Chelsea Finn. Feedback descent: Open-ended text optimization via pairwise comparison.arXiv preprint arXiv:2511.07919, 2025

arXiv 2025
[21]

Meta-harness: End-to-end optimization of model harnesses.arXiv preprint arXiv:2603.28052, 2026

Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses.arXiv preprint arXiv:2603.28052, 2026

Pith/arXiv arXiv 2026
[22]

Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025

Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025

Pith/arXiv arXiv 2025
[23]

Agent harness engineering: A survey

Junjie Li, Xi Xiao, Yunbei Zhang, Chen Liu, Lin Zhao, Xiaoying Liao, Yingrui Ji, Janet Wang, Jianyang Gu, Yingqiang Ge, Weijie Xu, Xi Fang, Xiang Xu, Tianchen Zhao, Youngeun Kim, Tianyang Wang, Jihun Hamm, Smita Krishnaswamy, Jun Huan, and Chandan Reddy. Agent harness engineering: A survey. 2026. https://openreview.net/pdf?id=eONq7FdiHa

2026
[24]

Tir-bench: A comprehensive benchmark for agentic thinking-with-images reasoning.arXiv preprint arXiv:2511.01833, 2025

Ming Li, Jike Zhong, Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Yuxiang Lai, Wei Chen, Konstantinos Psounis, and Kaipeng Zhang. Tir-bench: A comprehensive benchmark for agentic thinking-with-images reasoning.arXiv preprint arXiv:2511.01833, 2025

arXiv 2025
[25]

Muse-autoskill: Self-evolving agents via skill creation, memory, management, and evaluation.arXiv preprint arXiv:2605.27366, 2026

Huawei Lin, Peng Li, Jie Song, Fuxin Jiang, and Tieying Zhang. Muse-autoskill: Self-evolving agents via skill creation, memory, management, and evaluation.arXiv preprint arXiv:2605.27366, 2026

Pith/arXiv arXiv 2026
[26]

Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses.arXiv preprint arXiv:2604.25850, 2026

Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Zhiheng Xi, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui, and Yu-Gang Jiang. Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses.arXiv preprint arXiv:2604.25850, 2026

Pith/arXiv arXiv 2026
[27]

Visual instruction tuning.Advances in neural information processing systems, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 2023

2023
[28]

Visualagentbench: Towards large multimodal models as visual foundation agents

Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Song XiXuan, Yifan Xu, Shudan Zhang, Hanyu Lai, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, and Jie Tang. Visualagentbench: To...

2025
[29]

Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, and Kevin P. Murphy. Autoharness: improving llm agents by automatically synthesizing a code harness.arXiv preprint arXiv:2603.03329, 2026

arXiv 2026
[30]

Representation potentials of foundation models for multimodal alignment: A survey

Jianglin Lu, Hailing Wang, Yi Xu, Yizhou Wang, Kuo Yang, and Yun Fu. Representation potentials of foundation models for multimodal alignment: A survey. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16669–16684, November 2025

2025
[31]

The indra representation hypothesis for multimodal alignment

Jianglin Lu, Hailing Wang, Kuo Yang, Yitian Zhang, Simon Jenni, and Yun Fu. The indra representation hypothesis for multimodal alignment. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026
[32]

Restore-r1: Efficient image restoration agents via reinforcement learning with multimodal llm perceptual feedback.arXiv preprint arXiv:2512.18599, 2026

Jianglin Lu, Yuanwei Wu, Ziyi Zhao, Hongcheng Wang, Felix Jimenez, Abrar Majeedi, and Yun Fu. Restore-r1: Efficient image restoration agents via reinforcement learning with multimodal llm perceptual feedback.arXiv preprint arXiv:2512.18599, 2026

Pith/arXiv arXiv 2026
[33]

Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 2023

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 2023

2023
[34]

Agent harness for large language model agents: A survey

Qianyu Meng, Yanan Wang, Liyi Chen, Yihang Li, Wei Wu, Wenyuan Jiang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, and Yao Hu. Agent harness for large language model agents: A survey. 2026. https://www.preprints.org/manuscript/202604.0428/v3

arXiv 2026
[35]

Code as agent harness.arXiv preprint arXiv:2605.18747, 2026

Xuying Ning, Katherine Tieu, Dongqi Fu, Tianxin Wei, Zihao Li, Yuanchen Bei, Jiaru Zou, Mengting Ai, Zhining Liu, Ting-Wei Li, Lingjie Chen, Yanjun Zhao, Ke Yang, Bingxuan Li, Cheng Qian, Gaotang Li, Xiao Lin, Zhichen Zeng, Ruizhong Qiu, Sirui Chen, Yifan Sun, Xiyuan Yang, Ruida Wang, Rui Pan, Chenyuan Yang, Dylan Zhang, Liri Fang, Zikun Cui, Yang Cao, Pa...

Pith/arXiv arXiv 2026
[36]

Harness engineering: Leveraging codex in an agent-first world

OpenAI. Harness engineering: Leveraging codex in an agent-first world. February 2026. Accessed: 2026-04, OpenAI Blog, https://openai.com/index/harness-engineering/

2026
[37]

Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko

Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko. Sat: Dynamic spatial aptitude training for multimodal language models.arXiv preprint arXiv:2412.07755, 2024

arXiv 2024
[38]

Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 2023

2023
[39]

Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

2023
[40]

Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 2023

2023
[41]

Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2026

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2026

Pith/arXiv arXiv 2026
[42]

Vipergpt: Visual inference via python execution for reasoning

Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11888–11898, 2023

2023
[43]

Agent systems with harness engineering

Xinyu Tang, Han Peng, Guoxin Chen, Yuze Shi, Zitao Su, Peiyu Liu, Wayne Xin Zhao, Yawen Li, and Zhe Xue. Agent systems with harness engineering. 2026. https://openreview.net/pdf?id=nM5tDHrQsx

2026
[44]

Eyes wide shut? exploring the visual shortcomings of multimodal llms

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9568–9578, 2024

2024
[45]

Heavyskill: Heavy thinking as the inner skill in agentic harness.arXiv preprint arXiv:2605.02396, 2026

Jianing Wang, Linsen Guo, Zhengyu Chen, Qi Guo, Hongyu Zang, Wenjie Shi, Haoxiang Ma, Xiangyu Xi, Xiaoyu Li, Wei Wang, and Xunliang Cai. Heavyskill: Heavy thinking as the inner skill in agentic harness.arXiv preprint arXiv:2605.02396, 2026

Pith/arXiv arXiv 2026
[46]

Monet: Reasoning in latent visual space beyond images and language.arXiv preprint arXiv:2511.21395, 2025

Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, and Yisen Wang. Monet: Reasoning in latent visual space beyond images and language.arXiv preprint arXiv:2511.21395, 2025

arXiv 2025
[47]

Visual chatgpt: Talking, drawing and editing with visual foundation models.arXiv preprint arXiv:2303.04671, 2023

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models.arXiv preprint arXiv:2303.04671, 2023

Pith/arXiv arXiv 2023
[48]

Aha moment revisited: Are vlms truly capable of self verification in inference-time scaling? arXiv preprint arXiv:2506.17417, 2025

Mingyuan Wu, Meitang Li, Jingcheng Yang, Jize Jiang, Kaizhuo Yan, Zhaoheng Li, Hanchao Yu, Minjia Zhang, and Klara Nahrstedt. Aha moment revisited: Are vlms truly capable of self verification in inference-time scaling? arXiv preprint arXiv:2506.17417, 2025

arXiv 2025
[49]

Vsp: Diagnosing the dual challenges of perception and reasoning in spatial planning tasks for mllms

Qiucheng Wu, Handong Zhao, Michael Saxon, Trung Bui, William Yang Wang, Yang Zhang, and Shiyu Chang. Vsp: Diagnosing the dual challenges of perception and reasoning in spatial planning tasks for mllms. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

2025
[50]

Adapting the interface, not the model: Runtime harness adaptation for deterministic llm agents.arXiv preprint arXiv:2605.22166, 2026

Tianshi Xu, Huifeng Wen, and Meng Li. Adapting the interface, not the model: Runtime harness adaptation for deterministic llm agents.arXiv preprint arXiv:2605.22166, 2026

Pith/arXiv arXiv 2026
[51]

Large language models as optimizers

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. InThe Twelfth International Conference on Learning Representations, 2024

2024
[52]

Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.arXiv preprint arXiv:2310.11441, 2023

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.arXiv preprint arXiv:2310.11441, 2023

Pith/arXiv arXiv 2023
[53]

Machine mental imagery: Empower multimodal reasoning with latent visual tokens.arXiv preprint arXiv:2506.17218, 2025

Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine mental imagery: Empower multimodal reasoning with latent visual tokens.arXiv preprint arXiv:2506.17218, 2025

Pith/arXiv arXiv 2025
[54]

Mm-react: Prompting chatgpt for multimodal reasoning and action.arXiv preprint arXiv:2303.11381, 2023

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action.arXiv preprint arXiv:2303.11381, 2023. 13

Pith/arXiv arXiv 2023
[55]

Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 2023

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 2023

2023
[56]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023

2023
[57]

Effective harnesses for long-running agents.Anthropic Engineering Blog, Nov, 2025

Justin Young. Effective harnesses for long-running agents.Anthropic Engineering Blog, Nov, 2025

2025
[58]

Justin Young. Effective harnesses for long-running agents. November 2025. Anthropic Engineering Blog, https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents

[59]

Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. MMMU-pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 15134–15186, 2025.

[60]

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496, 2024.

[61]

Haichao Zhang and Yun Fu. Vqtoken: Neural discrete token representation learning for extreme token reduction in video large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[62]

Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, and Yun Fu. Thinkjepa: Empowering latent world models with large vision-language reasoning model. arXiv preprint arXiv:2603.22281, 2026.

[63]

Haichao Zhang, Yi Xu, and Yun Fu. Out-of-sight embodied agents: Multimodal tracking, sensor fusion, and trajectory forecasting. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–14, 2026.

[64]

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.

[65]

Chen Zhao, Zhuoran Wang, Haoyang Li, Shifeng Bao, Guanlin Li, Youhe Feng, Yang Li, Jie Tang, and Jing Zhang. Action draft and verify: A self-verifying framework for vision-language-action model. arXiv preprint arXiv:2603.18091, 2026.

[66]

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

Appendix A Experimental Details A.1 Benchmarks The used benchmarks...
[67]

The color palette, atmospheric haze, and building density in A provide seamless edge continuation at both the right and bottom boundaries. Verdict: ✓ (rescued after repairs) Response: Opus 4.7 /BASE Answer: B Details: The main image shows dark smoke rising on the right side against a stormy sky, with buildings at lower left. Candidate B continues the smoke pl...
