SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

Baobao Chang; Elvis Zhang; Jason Zeng; Jialong Wu; Kean Shi; Kuan Li; Liang Chen; Michael Heinrich; Ming Wu; Qingyao Yang

arxiv: 2605.15777 · v1 · pith:YMQQ42OKnew · submitted 2026-05-15 · 💻 cs.AI

Kean Shi, Zihang Li, Tianyi Ma, Zengji Tu, Jialong Wu, Xinbo Xu, Qingyao Yang, Ruoyu Wu, Weichu Xie, Ming Wu, Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen, Kuan Li, Baobao Chang

Pith reviewed 2026-05-20 18:59 UTC · model grok-4.3

classification 💻 cs.AI

keywords SaaS-Bench · computer-using agents · LLM agents · professional workflows · benchmark · task completion · GUI agents

The pith

LLM-based computer-using agents complete fewer than 4% of realistic professional SaaS tasks end-to-end.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SaaS-Bench as a benchmark for evaluating computer-using agents in real software-as-a-service environments. It includes 106 tasks across 23 SaaS systems in six professional domains, requiring long-horizon interactions and coordination. Experiments with representative agents show success rates below 4% for the strongest models, pointing to weaknesses in planning, tracking states across applications, and recovering from errors. This evaluation matters because it tests agents on the kind of dynamic, multi-step work that professionals do daily in tools like project management and collaboration software. A sympathetic reader would conclude that current agent designs are not yet ready for complex real-world deployment.

Core claim

SaaS-Bench is introduced as a benchmark built on 23 deployable SaaS systems across six domains with 106 tasks grounded in realistic scenarios. These tasks involve long-horizon execution in both text and multimodal settings and use weighted verification checkpoints to measure completion and progress. Representative LLM-based agents struggle, with the strongest completing fewer than 4% of tasks end-to-end, revealing limitations in planning, state tracking, cross-application context maintenance, and error recovery.

What carries the argument

SaaS-Bench benchmark with its 106 tasks and weighted verification checkpoints, which evaluates agents on dynamic system states and cross-application coordination in professional SaaS environments.
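As a rough illustration of how weighted verification checkpoints can yield both a strict number and a partial-progress number, the sketch below scores a single task. The `Checkpoint` type, the example weights, and the rule that a task counts as resolved only when every checkpoint passes are assumptions made for illustration; this page does not state the paper's exact aggregation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """One verification check for a task (hypothetical structure)."""
    name: str
    weight: float
    passed: bool


def checkpoint_score(checks: list[Checkpoint]) -> float:
    """Weighted fraction of passed checks, in [0, 1]."""
    total = sum(c.weight for c in checks)
    if total <= 0:
        raise ValueError("total checkpoint weight must be positive")
    return sum(c.weight for c in checks if c.passed) / total


def resolved(checks: list[Checkpoint]) -> bool:
    """Strict completion; assumed here to require every check to pass."""
    return bool(checks) and all(c.passed for c in checks)


# Hypothetical task with made-up checks and weights.
task = [
    Checkpoint("invoice created in finance app", 2.0, True),
    Checkpoint("CRM record linked to invoice", 1.0, True),
    Checkpoint("follow-up ticket assigned", 1.0, False),
]
print(checkpoint_score(task))  # 0.75
print(resolved(task))          # False
```

Under these assumptions an agent can earn substantial partial credit while still counting as unresolved, which is one way a high checkpoint score and a very low end-to-end rate can coexist.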

If this is right

Current agents lack the ability to maintain context across multiple applications over long periods.
Error recovery is a critical missing capability for handling real workflows.
Both planning and state tracking need significant improvement to achieve practical utility.
The benchmark highlights the need for agents that can handle multimodal inputs effectively in GUI settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If true, this implies that future agent research should prioritize architectures with explicit memory or state management modules.
The results could motivate development of hybrid systems that combine LLM reasoning with rule-based automation for SaaS tasks.
Extending the benchmark to include more domains might reveal domain-specific strengths or weaknesses in agent performance.

Load-bearing premise

The assumption that the selected 106 tasks accurately represent realistic professional workflows and that the weighted checkpoints reliably indicate task success or partial progress.

What would settle it

A new agent design that achieves end-to-end completion on more than 20% of the 106 tasks would challenge the reported limitations of current approaches.
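To make the percentages concrete, the short sketch below converts them into task counts for a 106-task benchmark. The function names are illustrative, and the sketch assumes each task counts equally toward the end-to-end rate.

```python
TOTAL_TASKS = 106  # task count reported for SaaS-Bench


def max_tasks_below(percent: int, total: int = TOTAL_TASKS) -> int:
    """Largest count c with 100 * c < percent * total (strictly below)."""
    return (percent * total - 1) // 100


def min_tasks_above(percent: int, total: int = TOTAL_TASKS) -> int:
    """Smallest count c with 100 * c > percent * total (strictly above)."""
    return (percent * total) // 100 + 1


print(max_tasks_below(4))   # 4
print(min_tasks_above(20))  # 22
```

Read this way, "fewer than 4%" corresponds to at most 4 fully completed tasks out of 106, and "more than 20%" to at least 22.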

Figures

Figures reproduced from arXiv: 2605.15777 by Baobao Chang, Elvis Zhang, Jason Zeng, Jialong Wu, Kean Shi, Kuan Li, Liang Chen, Michael Heinrich, Ming Wu, Qingyao Yang, Ruoyu Wu, Tianyi Ma, Weichu Xie, Xinbo Xu, Zengji Tu, Zihang Li.

**Figure 1.** Leaderboard of SaaS-Bench. We report overall checkpoint scores (bar length) and resolved scores for seven frontier models across 106 long-horizon SaaS tasks. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]

Correspondence: Liang Chen <liangchen@unipat.ai>, Kuan Li <kuanli@unipat.ai>, Baobao Chang <chbb@pku.edu.cn>

**Figure 2.** SaaS-Bench provides a realistic benchmark for evaluating CUAs in deployable SaaS environments. It consists of 23 real SaaS systems organized into six professional domains, supporting 106 tasks that reflect real-world SaaS workflows.

**Figure 3.** Overview of SaaS-Bench. Agents receive natural-language task instructions and interact with locally deployed SaaS applications through browser-use. After execution, task outcomes are evaluated using verification tools, which are aggregated into resolved score and checkpoint score.

**Figure 4.** Task statistics of SaaS-Bench. (a) Nested donut showing the breakdown of SaaS-Bench tasks across the two evaluation modes (text-only and multimodal), six task domains, and the underlying SaaS applications. The outer ring quantifies how often each application is exercised, illustrating the diversity of real-world tools spanned by the benchmark. (b) Combined view of (top) the per-task application count and (…

**Figure 5.** Task synthesis pipeline of SaaS-Bench. Starting from domain-specific task seeds and occupational roles, SaaS-Bench synthesizes candidate tasks through an iterative Builder–Challenger–Refiner loop for template generation and instantiation. The generated tasks are then filtered by static rubric-based checking and execution check, ensuring that the final tasks are realistic, executable, and verifiable.

**Figure 6.** Pass@k average best scores (k = 1, 2, 3) for four models on SaaS-Bench across three evaluation splits: text-only, multimodal, and overall. Each bar is divided into three segments: the dark base represents pass@1, the mid-tone segment shows the incremental gain from pass@1 to pass@2, and the lightest segment shows the further gain to pass@3.

**Figure 7.** Left: Distribution of low-level actions emitted by Claude Opus 4.6 over the full benchmark; Right: categorization of failed verification checks by failure mode. Together the two panels link execution behaviour to the dominant failure types. [Plot text captured after the caption: panel (a) "Score vs. # apps"; axes "# distinct apps per task", "Operation length (steps, Opus)", "Avg. score (%)".]

**Figure 8.** Per-task score as a function of three structural complexity measures: ( [PITH_FULL_IMAGE:figures/full_fig_p009_8.png]

**Figure 9.** Per-domain composition of agent behaviour errors observed in the trajectories of Opus 4.6. [PITH_FULL_IMAGE:figures/full_fig_p010_9.png]

**Figure 10.** Average pass rate of verification check [PITH_FULL_IMAGE:figures/full_fig_p011_10.png]
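The Figure 6 caption refers to "pass@k average best scores" without defining them on this page. One plausible reading, assumed here and not confirmed by the source, is: for each task take the best score among its first k runs, then average over tasks. The function name and the run data below are hypothetical.

```python
def avg_best_at_k(runs_per_task: list[list[float]], k: int) -> float:
    """Mean over tasks of the best score among each task's first k runs."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not runs_per_task:
        raise ValueError("need at least one task")
    best_scores = []
    for scores in runs_per_task:
        if len(scores) < k:
            raise ValueError("every task needs at least k runs")
        best_scores.append(max(scores[:k]))
    return sum(best_scores) / len(best_scores)


# Made-up checkpoint scores: two tasks, three runs each.
runs = [
    [0.25, 0.5, 0.5],
    [0.0, 0.0, 0.75],
]
for k in (1, 2, 3):
    print(k, avg_best_at_k(runs, k))
# 1 0.125
# 2 0.25
# 3 0.625
```

Under this reading the value can only stay flat or grow as k increases, which matches the stacked-segment presentation described in the caption.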

Original abstract

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess capabilities of agents in realistic professional workflows. Software-as-a-Service (SaaS) environments are a natural choice for CUA evaluation, as they host a large share of modern digital work and naturally involve dynamic system states, cross-application coordination, domain-specific knowledge, and long-horizon dependencies. To this end, we introduce SaaS-Bench, a benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks require long-horizon execution, cover both text-only and multimodal settings, and are evaluated with weighted verification checkpoints that measure strict task completion and partial progress. Experiments show that representative LLM-based agents struggle on SaaS-Bench, with even the strongest model completing fewer than 4% of tasks end-to-end, exposing limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code are available at https://github.com/UniPat-AI/SaaS-Bench for reproduction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SaaS-Bench gives a more realistic testbed for agents on live professional tools, but the sub-4% success numbers rest on verification checkpoints whose construction is not yet clear enough to fully trust.

The main point is that this paper shows representative agents doing very poorly on a new set of tasks built from real SaaS tools, with the best ones finishing under 4 percent end to end. It suggests that planning, state tracking, and error recovery remain big hurdles when agents have to work across applications over long sequences.

What the paper gets right is the construction of the benchmark itself. They assembled 23 deployable SaaS systems across six professional domains and created 106 tasks grounded in realistic scenarios. These tasks involve long-horizon work, cross-application coordination, and both text and multimodal elements. That is more grounded than the isolated or simplified settings common in prior agent benchmarks. Making the code available helps others check the results or build on them.

The softer part is the evaluation method. The results rely on weighted verification checkpoints to judge full completion versus partial progress. The abstract does not provide much on how those checkpoints were selected, any agreement checks between raters, or tests of how the scores change if the weights shift. If the checkpoints do not fully capture critical state changes or if they under-penalize losses in context across apps, the low success rates could partly come from the scoring rules rather than the agents alone. The paper does flag the agent limitations, but the strength of that claim tracks how well the checkpoints reflect actual task success.

This work is for people studying computer-using agents who need evaluation settings closer to professional use. Readers who care about moving agent research past toy environments will find the setup and the reported gaps useful. It has enough substance to go to a serious referee, particularly since new benchmarks can shape what the field measures next. I would recommend putting it through peer review, with attention to the verification details in the revisions.
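The letter's question about how scores change if the weights shift can be probed with a simple perturbation test. The sketch below is an assumption-laden illustration, not the paper's procedure: the weights, pass/fail outcomes, noise level, and function names are all made up, and strict completion is assumed to mean that every checkpoint passes.

```python
import random


def checkpoint_score(weights: list[float], passed: list[bool]) -> float:
    """Weighted fraction of passed checks."""
    return sum(w for w, p in zip(weights, passed) if p) / sum(weights)


def perturb(weights: list[float], rel_noise: float, rng: random.Random) -> list[float]:
    """Scale each weight by a random factor in [1 - rel_noise, 1 + rel_noise]."""
    return [w * rng.uniform(1 - rel_noise, 1 + rel_noise) for w in weights]


def score_range(weights, passed, rel_noise=0.5, trials=1000, seed=0):
    """Min and max checkpoint score over randomly perturbed weightings."""
    if not 0 <= rel_noise < 1:
        raise ValueError("rel_noise must be in [0, 1) so weights stay positive")
    rng = random.Random(seed)
    scores = [
        checkpoint_score(perturb(weights, rel_noise, rng), passed)
        for _ in range(trials)
    ]
    return min(scores), max(scores)


# Hypothetical task: three checks, the last one failed.
weights = [2.0, 1.0, 1.0]
passed = [True, True, False]

low, high = score_range(weights, passed)
print(f"baseline checkpoint score: {checkpoint_score(weights, passed):.2f}")
print(f"range under perturbation:  {low:.2f} to {high:.2f}")
# Under the all-checks-must-pass assumption, the strict outcome ignores weights.
print("resolved:", all(passed))
```

One consequence worth noting: if strict completion really is "all checkpoints pass", then re-weighting can move only the partial-progress score, while the end-to-end rate depends on which checks exist and how each is verified.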

Referee Report

1 major / 1 minor

Summary. The paper introduces SaaS-Bench, a benchmark built on 23 deployable real-world SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks emphasize long-horizon execution, dynamic states, cross-application coordination, and both text-only and multimodal interactions. Evaluation uses weighted verification checkpoints to measure strict end-to-end task completion as well as partial progress. Experiments with representative LLM-based computer-use agents report that even the strongest model completes fewer than 4% of tasks end-to-end, highlighting limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code is released for reproduction.

Significance. If the tasks accurately reflect professional workflows and the verification method reliably distinguishes full completion from partial progress, the benchmark would fill a notable gap left by existing simplified web and GUI agent evaluations. The reported sub-4% success rates would then constitute a concrete, falsifiable signal of current agent shortcomings in realistic SaaS settings. The public code release is a clear strength that supports reproducibility and future extensions.

major comments (1)

[§3] §3 (Benchmark Construction) and the associated verification protocol: the manuscript does not provide quantitative details on how the weighted checkpoints were derived, how weights were assigned to sub-steps, inter-rater agreement for task grounding, or sensitivity analysis for missing critical state transitions. Because the central claim of <4% end-to-end success rests on these checkpoints accurately measuring strict completion rather than benchmark artifacts, this omission is load-bearing and requires explicit documentation or supplementary material.

minor comments (1)

The abstract contains a minor grammatical issue ('Code are available' should read 'Code is available').

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and commit to revisions that directly respond to the concerns raised about documentation of the verification protocol.

Referee: [§3] §3 (Benchmark Construction) and the associated verification protocol: the manuscript does not provide quantitative details on how the weighted checkpoints were derived, how weights were assigned to sub-steps, inter-rater agreement for task grounding, or sensitivity analysis for missing critical state transitions. Because the central claim of <4% end-to-end success rests on these checkpoints accurately measuring strict completion rather than benchmark artifacts, this omission is load-bearing and requires explicit documentation or supplementary material.

Authors: We agree that the current manuscript provides only a high-level description of the weighted verification checkpoints in §3 and that additional quantitative details are required to substantiate the evaluation protocol. In the revised version we will expand §3 with a new subsection that (1) explains the derivation process, including the use of domain-expert review to identify critical state transitions and assign weights proportionally to their impact on task completion; (2) reports the exact weighting scheme and the rationale for each weight value; (3) presents inter-rater agreement statistics (Cohen’s κ) obtained from the three annotators who independently grounded each task and its checkpoints; and (4) includes a sensitivity analysis (moved to the appendix) that perturbs checkpoint weights and omits selected state transitions to show that the reported sub-4% end-to-end success rate remains stable. These additions will be supported by new tables and will not alter any experimental results. We believe the expanded documentation will eliminate concerns about benchmark artifacts while preserving the paper’s central claims.

revision: yes
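The simulated rebuttal mentions Cohen's κ for three annotators. Cohen's κ is defined for a pair of raters, so with three annotators one common option is to average it over all rater pairs (Fleiss' κ is an alternative). The sketch below shows that pairwise average; the labels are invented, and nothing here reflects statistics from the paper.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(raters: list[list[int]]) -> float:
    """Average Cohen's kappa over every pair of raters."""
    pairs = list(combinations(raters, 2))
    if not pairs:
        raise ValueError("need at least two raters")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical judgments (1 = checkpoint judged valid, 0 = not) on 8 items.
r1 = [1, 1, 1, 1, 0, 0, 0, 0]
r2 = [1, 1, 1, 0, 0, 0, 0, 1]
r3 = [1, 1, 1, 1, 0, 0, 0, 0]

print(cohen_kappa(r1, r2))                          # 0.5
print(f"{mean_pairwise_kappa([r1, r2, r3]):.3f}")  # 0.667
```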

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are direct measurements

The paper introduces SaaS-Bench as a new collection of 106 tasks on 23 real SaaS systems and reports measured agent success rates (under 4% end-to-end) from direct experiments. No equations, fitted parameters, or derivations are present; the headline percentages are observations on the constructed benchmark rather than quantities forced by self-definition, renamed fits, or self-citation chains. Task design and weighted checkpoints are presented as independent engineering choices grounded in professional scenarios, with no reduction of the reported outcomes back to the inputs by construction. This is a standard empirical benchmark paper whose central claims remain falsifiable against external agent runs and do not rely on any load-bearing self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes an empirical benchmark and evaluation protocol; it does not introduce new mathematical axioms, free parameters, or postulated entities beyond standard assumptions about agent capabilities.

axioms (1)

Domain assumption: SaaS environments naturally involve dynamic system states, cross-application coordination, and long-horizon dependencies suitable for CUA evaluation.
Stated in the abstract as justification for choosing SaaS platforms over existing simplified benchmarks.

pith-pipeline@v0.9.0 · 5826 in / 1179 out tokens · 52831 ms · 2026-05-20T18:59:13.163215+00:00 · methodology


[1] [1]

Proceedings of the Sixth International Conference on Learning Representations (ICLR) , year =

Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration , author =. Proceedings of the Sixth International Conference on Learning Representations (ICLR) , year =

work page

[2] [2]

Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samuel Stevens and Boshi Wang and Huan Sun and Yu Su , booktitle =

work page

[3] [3]

2024 , url =

Jing Yu Koh and Robert Lo and Lawrence Jang and Vikram Duvvur and Ming Chong Lim and Po-Yu Huang and Graham Neubig and Shuyan Zhou and Ruslan Salakhutdinov and Daniel Fried , booktitle =. 2024 , url =

work page 2024

[4] [4]

2024 , url =

Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu , booktitle =. 2024 , url =

work page 2024

[5] [5]

An Illusion of Progress?

Tianci Xue and Weijian Qi and Tianneng Shi and Chan Hee Song and Boyu Gou and Dawn Song and Huan Sun and Yu Su , booktitle =. An Illusion of Progress?. 2025 , url =

work page 2025

[6] [6]

2025 , url =

Boyu Gou and others , booktitle =. 2025 , url =

work page 2025

[7] [7]

2026 , url =

Shibo Hao and others , journal =. 2026 , url =

work page 2026

[8] [8]

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Alexandre Drouin and Maxime Gasse and Massimo Caccia and Issam H. Laradji and Manuel Del Verme and Tom Marty and L. arXiv preprint arXiv:2403.07718 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track , year =

L. Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track , year =

work page

[10] [10]

Xu and Hao Zhu and Xuhui Zhou and Robert Lo and Abishek Sridhar and Xianyi Cheng and Tianyue Ou and Yonatan Bisk and Daniel Fried and Uri Alon and Graham Neubig , journal =

Shuyan Zhou and Frank F. Xu and Hao Zhu and Xuhui Zhou and Robert Lo and Abishek Sridhar and Xianyi Cheng and Tianyue Ou and Yonatan Bisk and Daniel Fried and Uri Alon and Graham Neubig , journal =

work page

[11] [11]

2026 , url =

Shuyan Zhou , journal =. 2026 , url =

work page 2026

[12] [12]

Browser-Use: Make Websites Accessible for

Magnus M\". Browser-Use: Make Websites Accessible for. 2024 , howpublished =

work page 2024

[13] Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust. A Real-World

[14] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su. 2024.

[15] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang. 2024.

[16] AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. 2025.

[17] Claude Opus 4.6. 2026.

[18] Claude Sonnet 4.6. 2026.

[19] Introducing GPT-5.4. 2026.

[20] Gemini 3.1 Pro Model Card. 2026.

[21] Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity. 2026.

[22] Kimi K2.5: Visual Agentic Intelligence. 2026.

[23] MiniMax M2.7: Early Echoes of Self-Evolution. 2026.

[24] DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. 2026.

[25] UI-TARS: Pioneering Automated GUI Interaction with Native Agents. 2025.

[26] OpenCUA: Open Foundations for Computer-Use Agents. 2025.

[27] Computer-Using Agent. 2025.

[28] Introducing Computer Use, a New Claude 3.5 Sonnet, and Claude 3.5 Haiku. 2024.

[29] WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. 2024.

[30] What is Software as a Service (SaaS)? 2025.

[31] Gartner Forecasts Worldwide Public Cloud End-User Spending to Total $723 Billion in 2025. 2024.

[32] TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks. arXiv preprint arXiv:2412.14161.

[33] Chameleon: Mixed-Modal Early-Fusion Foundation Models. arXiv preprint arXiv:2405.09818.

[34] Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model. arXiv preprint arXiv:2408.11039.

[35] Gemini 2.0: A new era for AI. 2024.

[36] Bagel: Unified Model for Image Understanding and Generation. 2025.

[37] Sora: Creating video from text. 2024.

[38] Veo 2: Google's most capable video generation model. 2024.

[39] Gemini 3: The next generation of AI models. 2025.

[40] GPT-4o: OpenAI's multimodal AI model. 2024.

[41] Sora 2: Advanced video generation. 2025.

[42] Veo 3: Google's next-generation video model. 2025.

[43] MMGR: Multi-Modal Generative Reasoning. 2025.

[44] Humanity's Last Exam. 2025.

[45] Analyzing Children's Art. 1969.

[46] The child's creation of a pictorial world. 2003.

[47] Beyond modularity: A developmental perspective on cognitive science. European Journal of Disorders of Communication, 1994.

[48] Development of visual perception. Wiley Interdisciplinary Reviews: Cognitive Science, 2010.

[49] Development of human visual function. Vision Research, 2011.

[50] Infant visual perception.

[51] P. Langley. Proceedings of the 17th International Conference on Machine Learning (ICML 2000), 2000.

[52] T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980.

[53] M. J. Kearns.

[54] Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983.

[55] R. O. Duda, P. E. Hart, D. G. Stork. Pattern Classification. 2000.

[56] Suppressed for Anonymity.

[57] A. Newell, P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. In Cognitive Skills and Their Acquisition, 1981.

[58] A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development, 1959.

[59] Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805.

[60] Qwen3-VL Technical Report. 2025.

[61] ByteDance Seed.

[62] InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. 2025.

[63] Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding. 2025.

[64] Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. 2024.

[65] MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation. 2025.

[66] Kimi-VL Technical Report. 2025.

[67] MiMo-VL Technical Report. 2025.

[68] GLM-V Team.

[69] EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework.

[70] HybridFlow: A Flexible and Efficient RLHF Framework. 2024.

[71] MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[72] MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. International Conference on Learning Representations.

[73] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394.

[74] BLINK: Multimodal Large Language Models Can See but Not Perceive. European Conference on Computer Vision, 2024.

[75] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. Advances in Neural Information Processing Systems.

[76] Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models. arXiv preprint arXiv:2510.13394.

[77] Object permanence in five-month-old infants. Cognition, 1985.

[78] Core knowledge. American Psychologist, 2000.

[79] MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark. arXiv preprint arXiv:2409.02813.

[80] Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset. arXiv preprint arXiv:2402.14804.