
arxiv: 2605.15777 · v1 · pith:YMQQ42OKnew · submitted 2026-05-15 · 💻 cs.AI

SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

Pith reviewed 2026-05-20 18:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords SaaS-Bench · computer-using agents · LLM agents · professional workflows · benchmark · task completion · GUI agents

The pith

LLM-based computer-using agents complete fewer than 4% of realistic professional SaaS tasks end-to-end.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SaaS-Bench as a benchmark for evaluating computer-using agents in real software-as-a-service environments. It includes 106 tasks across 23 SaaS systems in six professional domains, requiring long-horizon interactions and coordination. Experiments with representative agents show success rates below 4% for the strongest models, pointing to weaknesses in planning, tracking states across applications, and recovering from errors. This evaluation matters because it tests agents on the kind of dynamic, multi-step work that professionals do daily in tools like project management and collaboration software. A sympathetic reader would conclude that current agent designs are not yet ready for complex real-world deployment.

Core claim

SaaS-Bench is introduced as a benchmark built on 23 deployable SaaS systems across six domains with 106 tasks grounded in realistic scenarios. These tasks involve long-horizon execution in both text and multimodal settings and use weighted verification checkpoints to measure completion and progress. Representative LLM-based agents struggle, with the strongest completing fewer than 4% of tasks end-to-end, revealing limitations in planning, state tracking, cross-application context maintenance, and error recovery.
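The two metrics named above can be made concrete with a minimal sketch. This summary does not give the paper's scoring formula, so the `Checkpoint` type, the weight normalisation, and the rule that a task counts as resolved only when every checkpoint passes are assumptions for illustration. The checkpoint names and weights are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """One verification check (hypothetical structure, not taken from the paper)."""
    name: str
    weight: float  # relative importance; assumed positive
    passed: bool


def checkpoint_score(checks: list[Checkpoint]) -> float:
    """Partial progress: passed weight divided by total weight, in [0, 1]."""
    total = sum(c.weight for c in checks)
    if total <= 0:
        raise ValueError("checkpoint weights must sum to a positive value")
    return sum(c.weight for c in checks if c.passed) / total


def is_resolved(checks: list[Checkpoint]) -> bool:
    """Strict completion: assumed here to require every checkpoint to pass."""
    return bool(checks) and all(c.passed for c in checks)


# Illustrative task with made-up checkpoints and weights.
task = [
    Checkpoint("ticket_created", 2.0, True),
    Checkpoint("invoice_linked", 1.0, True),
    Checkpoint("status_updated", 1.0, False),
]
print(checkpoint_score(task))  # 0.75
print(is_resolved(task))       # False
```

Under these assumptions an agent can earn a substantial checkpoint score while the task still counts as unresolved, which is one way a low end-to-end rate and a higher partial-progress score could coexist.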

What carries the argument

The SaaS-Bench benchmark itself: 106 tasks scored with weighted verification checkpoints, used to evaluate agents on dynamic system states and cross-application coordination in professional SaaS environments.

If this is right

  • Current agents lack the ability to maintain context across multiple applications over long periods.
  • Error recovery is a critical missing capability for handling real workflows.
  • Both planning and state tracking need significant improvement to achieve practical utility.
  • The benchmark highlights the need for agents that can handle multimodal inputs effectively in GUI settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If true, this implies that future agent research should prioritize architectures with explicit memory or state management modules.
  • The results could motivate development of hybrid systems that combine LLM reasoning with rule-based automation for SaaS tasks.
  • Extending the benchmark to include more domains might reveal domain-specific strengths or weaknesses in agent performance.

Load-bearing premise

The assumption that the selected 106 tasks accurately represent realistic professional workflows and that the weighted checkpoints reliably indicate task success or partial progress.

What would settle it

A new agent design that achieves end-to-end completion on more than 20% of the 106 tasks would challenge the reported limitations of current approaches.
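As a worked check of what these percentages mean in task counts, the sketch below converts them using integer arithmetic. It assumes the end-to-end rate is the share of the 106 tasks resolved in a single run, which this summary does not confirm. The function names are hypothetical.

```python
TOTAL_TASKS = 106  # task count stated in the summary


def max_count_below(percent: int, total: int = TOTAL_TASKS) -> int:
    """Largest count c with c / total strictly below percent / 100."""
    return (percent * total - 1) // 100


def min_count_above(percent: int, total: int = TOTAL_TASKS) -> int:
    """Smallest count c with c / total strictly above percent / 100."""
    return (percent * total) // 100 + 1


print(max_count_below(4))   # 4  (4 / 106 is about 3.8%)
print(min_count_above(20))  # 22 (22 / 106 is about 20.8%)
```

On that reading, "fewer than 4%" corresponds to at most 4 resolved tasks, and the 20% bar corresponds to at least 22.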

Figures

Figures reproduced from arXiv: 2605.15777 by Baobao Chang, Elvis Zhang, Jason Zeng, Jialong Wu, Kean Shi, Kuan Li, Liang Chen, Michael Heinrich, Ming Wu, Qingyao Yang, Ruoyu Wu, Tianyi Ma, Weichu Xie, Xinbo Xu, Zengji Tu, Zihang Li.

Figure 1
Leaderboard of SAAS-BENCH. We report overall checkpoint scores (bar length) and resolved scores for seven frontier models across 106 long-horizon SaaS tasks.
Correspondence: Liang Chen <liangchen@unipat.ai>, Kuan Li <kuanli@unipat.ai>, Baobao Chang <chbb@pku.edu.cn>.
Figure 2
SAAS-BENCH provides a realistic benchmark for evaluating CUAs in deployable SaaS environments. It consists of 23 real SaaS systems organized into six professional domains, supporting 106 tasks that reflect real-world SaaS workflows.
Figure 3
Overview of SAAS-BENCH. Agents receive natural-language task instructions and interact with locally deployed SaaS applications through browser-use. After execution, task outcomes are evaluated using verification tools, which are aggregated into resolved score and checkpoint score.
Figure 4
Task statistics of SAAS-BENCH. (a) Nested donut showing the breakdown of SAAS-BENCH tasks across the two evaluation modes (text-only and multimodal), six task domains, and the underlying SaaS applications. The outer ring quantifies how often each application is exercised, illustrating the diversity of real-world tools spanned by the benchmark. (b) Combined view of (top) the per-task application count and (…
Figure 5
Task synthesis pipeline of SAAS-BENCH. Starting from domain-specific task seeds and occupational roles, SAAS-BENCH synthesizes candidate tasks through an iterative Builder–Challenger–Refiner loop for template generation and instantiation. The generated tasks are then filtered by static rubric-based checking and execution check, ensuring that the final tasks are realistic, executable, and verifiable.
Figure 6
Pass@k average best scores (k = 1, 2, 3) for four models on SAAS-BENCH across three evaluation splits: text-only, multimodal, and overall. Each bar is divided into three segments: the dark base represents pass@1, the mid-tone segment shows the incremental gain from pass@1 to pass@2, and the lightest segment shows the further gain to pass@3.
Figure 7
Left: Distribution of low-level actions emitted by Claude Opus 4.6 over the full benchmark; Right: categorization of failed verification checks by failure mode. Together the two panels link execution behaviour to the dominant failure types.
[Plot text spilled after the caption: panel "(a) Score vs. # apps", x-axis "# distinct apps per task", y-axis "Avg. score (%)"; a further panel with x-axis "Operation length (steps, Opus)".]
Figure 8
Per-task score as a function of three structural complexity measures: (…
Figure 9
Per-domain composition of agent behaviour errors observed in the trajectories of Opus 4.6.
Figure 10
Average pass rate of verification check…
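The Figure 6 caption refers to "pass@k average best scores". One plausible reading, sketched below, is that each task keeps the best score among its first k attempts and those best scores are averaged over tasks. This reading is an assumption: the summary does not define the metric, and `avg_best_at_k` and the scores are invented for illustration.

```python
def avg_best_at_k(scores_per_task: list[list[float]], k: int) -> float:
    """Mean over tasks of the best score among the first k attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not scores_per_task:
        raise ValueError("need at least one task")
    best = []
    for attempts in scores_per_task:
        if len(attempts) < k:
            raise ValueError("every task needs at least k attempts")
        best.append(max(attempts[:k]))
    return sum(best) / len(best)


# Made-up checkpoint scores: one row per task, one column per attempt.
runs = [
    [0.50, 0.75, 0.25],
    [0.00, 0.00, 0.50],
    [1.00, 0.50, 0.75],
    [0.25, 0.50, 0.50],
]
for k in (1, 2, 3):
    print(k, avg_best_at_k(runs, k))
# 1 0.4375
# 2 0.5625
# 3 0.6875
```

The differences between successive values (0.125 and 0.125 here) correspond to the incremental segments the caption describes for pass@2 and pass@3.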
Original abstract

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess capabilities of agents in realistic professional workflows. Software-as-a-Service (SaaS) environments are a natural choice for CUA evaluation, as they host a large share of modern digital work and naturally involve dynamic system states, cross-application coordination, domain-specific knowledge, and long-horizon dependencies. To this end, we introduce SaaS-Bench, a benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks require long-horizon execution, cover both text-only and multimodal settings, and are evaluated with weighted verification checkpoints that measure strict task completion and partial progress. Experiments show that representative LLM-based agents struggle on SaaS-Bench, with even the strongest model completing fewer than 4% of tasks end-to-end, exposing limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code are available at https://github.com/UniPat-AI/SaaS-Bench for reproduction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces SaaS-Bench, a benchmark built on 23 deployable real-world SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks emphasize long-horizon execution, dynamic states, cross-application coordination, and both text-only and multimodal interactions. Evaluation uses weighted verification checkpoints to measure strict end-to-end task completion as well as partial progress. Experiments with representative LLM-based computer-use agents report that even the strongest model completes fewer than 4% of tasks end-to-end, highlighting limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code is released for reproduction.

Significance. If the tasks accurately reflect professional workflows and the verification method reliably distinguishes full completion from partial progress, the benchmark would fill a notable gap left by existing simplified web and GUI agent evaluations. The reported sub-4% success rates would then constitute a concrete, falsifiable signal of current agent shortcomings in realistic SaaS settings. The public code release is a clear strength that supports reproducibility and future extensions.

major comments (1)
  1. [§3] §3 (Benchmark Construction) and the associated verification protocol: the manuscript does not provide quantitative details on how the weighted checkpoints were derived, how weights were assigned to sub-steps, inter-rater agreement for task grounding, or sensitivity analysis for missing critical state transitions. Because the central claim of <4% end-to-end success rests on these checkpoints accurately measuring strict completion rather than benchmark artifacts, this omission is load-bearing and requires explicit documentation or supplementary material.
minor comments (1)
  1. The abstract contains a minor grammatical issue ('Code are available' should read 'Code is available').

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and commit to revisions that directly respond to the concerns raised about documentation of the verification protocol.

Point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction) and the associated verification protocol: the manuscript does not provide quantitative details on how the weighted checkpoints were derived, how weights were assigned to sub-steps, inter-rater agreement for task grounding, or sensitivity analysis for missing critical state transitions. Because the central claim of <4% end-to-end success rests on these checkpoints accurately measuring strict completion rather than benchmark artifacts, this omission is load-bearing and requires explicit documentation or supplementary material.

    Authors: We agree that the current manuscript provides only a high-level description of the weighted verification checkpoints in §3 and that additional quantitative details are required to substantiate the evaluation protocol. In the revised version we will expand §3 with a new subsection that (1) explains the derivation process, including the use of domain-expert review to identify critical state transitions and assign weights in proportion to their impact on task completion; (2) reports the exact weighting scheme and the rationale for each weight value; (3) presents inter-rater agreement statistics (Cohen’s κ) obtained from the three annotators who independently grounded each task and its checkpoints; and (4) includes a sensitivity analysis, placed in the appendix, that perturbs checkpoint weights and omits selected state transitions to show that the reported sub-4% end-to-end success rate remains stable. These additions will be supported by new tables and will not alter any experimental results. We believe the expanded documentation will address concerns about benchmark artifacts while preserving the paper’s central claims. revision: yes
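The rebuttal mentions Cohen's κ for three annotators. Cohen's κ is defined for a pair of raters, so the sketch below computes it pairwise and averages over the three pairs. Averaging is one common convention and an assumption here, since the rebuttal does not say how the three annotators would be combined. The labels and function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(raters: list[list[str]]) -> float:
    """Average Cohen's kappa over all rater pairs (assumed convention)."""
    pairs = list(combinations(raters, 2))
    if not pairs:
        raise ValueError("need at least two raters")
    return sum(cohen_kappa(x, y) for x, y in pairs) / len(pairs)


# Made-up judgements on whether ten checkpoints are "critical" (c) or not (n).
r1 = list("ccccccnnnn")
r2 = list("cccccnnnnc")
r3 = list("ccccccnnnn")
print(round(cohen_kappa(r1, r2), 3))  # 0.583
print(round(mean_pairwise_kappa([r1, r2, r3]), 3))
```

A multi-rater statistic such as Fleiss' κ would be an alternative to pairwise averaging.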

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are direct measurements

Full rationale

The paper introduces SaaS-Bench as a new collection of 106 tasks on 23 real SaaS systems and reports measured agent success rates (under 4% end-to-end) from direct experiments. No equations, fitted parameters, or derivations are present; the headline percentages are observations on the constructed benchmark rather than quantities forced by self-definition, renamed fits, or self-citation chains. Task design and weighted checkpoints are presented as independent engineering choices grounded in professional scenarios, with no reduction of the reported outcomes back to the inputs by construction. This is a standard empirical benchmark paper whose central claims remain falsifiable against external agent runs and do not rely on any load-bearing self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes an empirical benchmark and evaluation protocol; it does not introduce new mathematical axioms, free parameters, or postulated entities beyond standard assumptions about agent capabilities.

axioms (1)
  • domain assumption: SaaS environments naturally involve dynamic system states, cross-application coordination, and long-horizon dependencies suitable for CUA evaluation.
    Stated in the abstract as justification for choosing SaaS platforms over existing simplified benchmarks.

pith-pipeline@v0.9.0 · 5826 in / 1179 out tokens · 52831 ms · 2026-05-20T18:59:13.163215+00:00 · methodology

