pith. machine review for the scientific record. sign in

arxiv: 2603.20633 · v3 · submitted 2026-03-21 · 💻 cs.AI

Recognition: no theorem link

Seed1.8 Model Card: Towards Generalized Real-World Agency

Authors on Pith no claims yet

Pith reviewed 2026-05-15 07:41 UTC · model grok-4.3

classification 💻 cs.AI
keywords Seed1.8foundation modelreal-world agencyagentic interfacemulti-turn interactiontool useGUI interactionmultimodal understanding
0
0 comments X

The pith

Seed1.8 is presented as a foundation model for generalized real-world agency through multi-turn interaction, tool use, and multi-step execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Seed1.8, a foundation model aimed at generalized real-world agency. It moves beyond single-turn prediction to support multi-turn interactions, tool use, and sequential execution. A reader would care if this allows AI systems to manage complex interactive tasks in practical settings instead of isolated queries. The model retains strong language and vision-language abilities while adding a unified agentic interface for search, code generation and execution, and GUI interaction. It also includes deployment features such as latency-aware inference with configurable thinking modes and optimized visual encoding for images and video.

Core claim

Seed1.8 is a foundation model aimed at generalized real-world agency that supports multi-turn interaction, tool use, and multi-step execution while keeping strong LLM and vision-language performance and offering a unified agentic interface for search, code generation and execution, and GUI interaction along with latency- and cost-aware inference options including configurable thinking modes and optimized visual encoding.

What carries the argument

The unified agentic interface combining search, code generation and execution, and GUI interaction to enable multi-turn and multi-step agentic behavior.

Load-bearing premise

That performance on the reported benchmarks and application-aligned workflows demonstrates true generalized real-world agency rather than results limited to controlled test settings.

What would settle it

A test in which Seed1.8 is placed in a novel uncontrolled real-world scenario requiring adaptation to unseen multi-step tasks and fails to complete them reliably without prior matching examples.

read the original abstract

We present Seed1.8, a foundation model aimed at generalized real-world agency: going beyond single-turn prediction to multi-turn interaction, tool use, and multi-step execution. Seed1.8 keeps strong LLM and vision-language performance while supporting a unified agentic interface-search, code generation and execution, and GUI interaction. For deployment, it offers latency- and cost-aware inference, including configurable thinking modes and optimized visual encoding for images and video. We report evaluations on standard benchmarks and application-aligned workflows spanning foundational skills, multimodal understanding, and agentic behavior. Seed1.8 is released to support further research and development on interactive, real-world use cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents Seed1.8, a foundation model aimed at generalized real-world agency. It extends beyond single-turn prediction to support multi-turn interaction, tool use, and multi-step execution while retaining strong LLM and vision-language capabilities. The model provides a unified agentic interface covering search, code generation/execution, and GUI interaction, along with deployment optimizations such as latency- and cost-aware inference, configurable thinking modes, and improved visual encoding. Evaluations are reported on standard benchmarks and application-aligned workflows spanning foundational skills, multimodal understanding, and agentic behavior. The model is released to facilitate further research on interactive real-world use cases.

Significance. If the reported evaluations establish robust generalization of multi-turn agency, tool use, and execution beyond curated settings, the work would represent a meaningful step toward practical, deployable agentic systems that integrate perception, reasoning, and action in dynamic environments. The emphasis on unified interfaces and inference optimizations addresses real deployment constraints, and the open release directly supports reproducibility and extension by the community.

major comments (1)
  1. [Abstract] Abstract: The central claim of 'generalized real-world agency' is load-bearing on the evaluations demonstrating transfer beyond controlled conditions, yet the description provides no metrics, methods, error analysis, environment diversity details, out-of-distribution test cases, or failure-recovery results under dynamic perturbations (e.g., interface changes or novel tool APIs). This leaves open whether performance reflects true generalization or overfitting to fixed workflows.
minor comments (1)
  1. [Abstract] The abstract mentions 'standard benchmarks and application-aligned workflows' without naming specific benchmarks, tasks, or baseline comparisons; adding these would improve clarity even in a model-card format.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We have addressed the major comment regarding the abstract by expanding it to better summarize the evaluation details supporting claims of generalized agency.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of 'generalized real-world agency' is load-bearing on the evaluations demonstrating transfer beyond controlled conditions, yet the description provides no metrics, methods, error analysis, environment diversity details, out-of-distribution test cases, or failure-recovery results under dynamic perturbations (e.g., interface changes or novel tool APIs). This leaves open whether performance reflects true generalization or overfitting to fixed workflows.

    Authors: We appreciate the referee's observation that the abstract could more explicitly convey the robustness of our evaluations. The original abstract was intentionally concise to provide a high-level overview while adhering to length limits. The full manuscript details evaluations across standard benchmarks and application-aligned workflows, including metrics for multi-turn interaction, tool use, multimodal understanding, environment diversity, out-of-distribution testing, error analysis, and failure-recovery under perturbations such as interface changes. To address this directly, we have revised the abstract to briefly reference these evaluation aspects, including mention of transfer beyond curated settings and configurable inference modes that support dynamic use cases. This change strengthens the abstract without altering the core claims. revision: yes

Circularity Check

0 steps flagged

No circularity: model card reports benchmark results without derivations or self-referential reductions

full rationale

The document is a model card describing Seed1.8 and its evaluations on standard benchmarks and workflows for skills, multimodal understanding, and agentic behavior. No equations, fitted parameters, uniqueness theorems, or ansatzes appear in the provided text. The central claim of generalized real-world agency is presented as an empirical outcome of reported evaluations rather than derived from any self-citation chain or definitional equivalence. All load-bearing statements reduce to external benchmark measurements, which are independent of the model's internal construction and do not collapse by construction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the existence and performance of the Seed1.8 model itself. No free parameters, axioms, or invented entities are explicitly defined in the abstract; the model and its 'unified agentic interface' are the primary elements introduced without independent evidence beyond the paper's statements.

pith-pipeline@v0.9.0 · 5396 in / 1033 out tokens · 45894 ms · 2026-05-15T07:41:07.689351+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

    cs.CL 2026-05 accept novelty 8.0

    CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...

  2. ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

    cs.CV 2026-05 unverdicted novelty 7.0

    ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.

  3. OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.

  4. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  5. Towards Long-horizon Agentic Multimodal Search

    cs.CV 2026-04 unverdicted novelty 6.0

    LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...

  6. Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

    cs.LG 2026-04 unverdicted novelty 6.0

    Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.

  7. Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.

  8. OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

    cs.CV 2026-04 unverdicted novelty 5.0

    OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.

  9. Valley3: Scaling Omni Foundation Models for E-commerce

    cs.AI 2026-05 unverdicted novelty 4.0

    Valley3 is an omni MLLM for e-commerce that uses a four-stage pre-training pipeline plus post-training for controllable reasoning and agentic search, outperforming baselines on e-commerce benchmarks while staying comp...

Reference graph

Works this paper leans on

120 extracted references · 120 canonical work pages · cited by 9 Pith papers · 10 internal anchors

  1. [1]

    Solving einstein’s equations on supercomputers.Computer, 32(12):52–58, 1999

    Gabrielle Allen, Tom Goodale, Gerd Lanfermann, Thomas Radke, Edward Seidel, Werner Benger, Hans-Christian Hege, Andre Merzky, Joan Masso, and John Shalf. Solving einstein’s equations on supercomputers.Computer, 32(12):52–58, 1999. 21

  2. [2]

    Open-world text-specified object counting.arXiv preprint arXiv:2306.01851, 2023

    Niki Amini-Naieni, Kiana Amini-Naieni, Tengda Han, and Andrew Zisserman. Open-world text-specified object counting.arXiv preprint arXiv:2306.01851, 2023

  3. [3]

    Amo-bench: Large language models still struggle in high school math competitions.arXiv preprint arXiv:2510.26768, 2025

    Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin, Junlin Liu, Xinxuan Lv, Dan Ma, Xuanlin Wang, Ziwen Wang, et al. Amo-bench: Large language models still struggle in high school math competitions.arXiv preprint arXiv:2510.26768, 2025

  4. [4]

    MathArena: Evaluating LLMs on Uncontaminated Math Competitions

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions.arXiv preprint arXiv:2505.23281, 2025

  5. [5]

    Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan.τ 2-bench: Evaluating conversational agents in a dual-control environment.arXiv preprint arXiv:2506.07982, 2025

  6. [6]

    Visual physics comprehension test, 2025

    Chase Brower. Visual physics comprehension test, 2025

  7. [7]

    Beyondaime: Advancing math reasoning evaluation beyond high school olympiads

    ByteDance-Seed. Beyondaime: Advancing math reasoning evaluation beyond high school olympiads. https://huggingface.co/datasets/ByteDance-Seed/BeyondAIME, 2025

  8. [8]

    Video simpleqa: Towards factuality evaluation in large video language models

    Meng Cao, Pengfei Hu, Yingyao Wang, Jihao Gu, Haoran Tang, Haoze Zhao, Jiahua Dong, Wangbo Yu, Ge Zhang, Ian Reid, and Xiaodan Liang. Video simpleqa: Towards factuality evaluation in large video language models. CoRR, abs/2503.18923, 2025

  9. [9]

    How people use chatgpt

    Aaron Chatterji, Thomas Cunningham, David J Deming, Zoe Hitzig, Christopher Ong, Carl Yan Shan, and Kevin Wadman. How people use chatgpt. Working Paper 34255, National Bureau of Economic Research, September 2025

  10. [10]

    Cg-bench: Clue-grounded question answering benchmark for long video understanding

    Guo Chen, Yicheng Liu, Yifei Huang, Baoqi Pei, Jilan Xu, Yuping He, Tong Lu, Yali Wang, and Limin Wang. Cg-bench: Clue-grounded question answering benchmark for long video understanding. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025, 2025

  11. [11]

    Livecc: Learning video LLM with streaming speech transcription at scale

    Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, and Mike Zheng Shou. Livecc: Learning video LLM with streaming speech transcription at scale. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 29083–29095, 2025

  12. [12]

    Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

  13. [13]

    Video-holmes: Can MLLM think like holmes for complex video reasoning?CoRR, abs/2505.21374, 2025

    Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can MLLM think like holmes for complex video reasoning?CoRR, abs/2505.21374, 2025

  14. [14]

    Pointarena: Probing multimodal grounding through language-guided pointing

    Long Cheng, Jiafei Duan, Yi Ru Wang, Haoquan Fang, Boyang Li, Yushan Huang, Elvis Wang, Ainaz Eftekhar, Jason Lee, Wentao Yuan, et al. Pointarena: Probing multimodal grounding through language-guided pointing. arXiv preprint arXiv:2505.09990, 2025

  15. [15]

    Simplevqa: Multimodal factuality evaluation for multimodal large language models

    Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, et al. Simplevqa: Multimodal factuality evaluation for multimodal large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4637–4646, 2025

  16. [16]

    Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees G. M. Snoek, and Yuki M. Asano. Tvbench: Redesigning video-language evaluation.CoRR, abs/2410.07752, 2024

  17. [17]

    Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms

    Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Baptist Mols, Lifeng Jin, Ed-Yeremai Hernandez-Cardona, Dean Lee, Jeremy Kritz, Willow E Primack, Summer Yue, and Chen Xing. Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms. InFindings of the Association for Computational Linguistics: ACL 2025, pages 18632–...

  18. [18]

    Deepresearch bench: A comprehensive benchmark for deep research agents.arXiv preprint arXiv:2506.11763, 2025

    Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents.arXiv preprint arXiv:2506.11763, 2025

  19. [19]

    Supergpqa: Scaling llm evaluation across 285 graduate disciplines.arXiv preprint arXiv:2502.14739, 2025

    Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, et al. Supergpqa: Scaling llm evaluation across 285 graduate disciplines.arXiv preprint arXiv:2502.14739, 2025. 22

  20. [20]

    Counting out time: Class agnostic video repetition counting in the wild

    Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Counting out time: Class agnostic video repetition counting in the wild. In2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, W A, USA, June 13-19, 2020, pages 10384–10393, 2020

  21. [21]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InI...

  22. [22]

    Vispeak: Visual instruction feedback in streaming videos.CoRR, abs/2503.12769, 2025

    Shenghao Fu, Qize Yang, Yuan-Ming Li, Yi-Xing Peng, Kun-Yu Lin, Xihan Wei, Jian-Fang Hu, Xiaohua Xie, and Wei-Shi Zheng. Vispeak: Visual instruction feedback in streaming videos.CoRR, abs/2503.12769, 2025

  23. [23]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. InEuropean Conference on Computer Vision, pages 148–166. Springer, 2024

  24. [24]

    Springer, 2012

    Eric Gourgoulhon.3+ 1 formalism in general relativity. Springer, 2012

  25. [25]

    Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pag...

  26. [26]

    Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark.arXiv preprint arXiv:2501.05444, 2025

    Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, and Yu Cheng. Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark.arXiv preprint arXiv:2501.05444, 2025

  27. [27]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

  28. [28]

    Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models

    Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 8450–8460, 2025

  29. [29]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos.arXiv preprint arXiv:2501.13826, 2025

  30. [30]

    Finsearchcomp: Towards a realistic, expert-level evaluation of financial search and reasoning.arXiv preprint arXiv:2509.13160, 2025

    Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, et al. Finsearchcomp: Towards a realistic, expert-level evaluation of financial search and reasoning.arXiv preprint arXiv:2509.13160, 2025

  31. [31]

    Online video understanding: Ovbench and videochat-online

    Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, and Limin Wang. Online video understanding: Ovbench and videochat-online. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 3328–3338, 2025

  32. [32]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

  33. [33]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InEuropean conference on computer vision, pages 235–251. Springer, 2016

  34. [34]

    Screenspot-pro: Gui grounding for professional high-resolution computer use

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. InProceedings of the 33rd ACM International Conference on Multimedia, pages 8778–8786, 2025

  35. [35]

    Mm-browsecomp: A comprehensive benchmark for multimodal browsing agents, 2025b.URL https://arxiv

    Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, et al. Mm-browsecomp: A comprehensive benchmark for multimodal browsing agents, 2025b.URL https://arxiv. org/abs/2508.13186. 23

  36. [36]

    Stream- ingbench: Assessing the gap for mllms to achieve streaming video understanding.CoRR, abs/2411.03628, 2024

    Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, and Maosong Sun. Stream- ingbench: Assessing the gap for mllms to achieve streaming video understanding.CoRR, abs/2411.03628, 2024

  37. [37]

    Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233. Springer, 2024

  38. [38]

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 8731–8772....

  39. [39]

    Charles, Xinyu Zhou, and Xu Sun

    Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y. Charles, Xinyu Zhou, and Xu Sun. Videoreasonbench: Can mllms perform vision-centric complex video reasoning?CoRR, abs/2505.23359, 2025

  40. [40]

    Mundim, Christian D

    Frank Löffler, Joshua Faber, Eloisa Bentivegna, Tanja Bode, Peter Diener, Roland Haas, Ian Hinder, Bruno C. Mundim, Christian D. Ott, Erik Schnetter, Gabrielle Allen, Manuela Campanelli, and Pablo Laguna. The Einstein Toolkit: A Community Computational Infrastructure for Relativistic Astrophysics.Class. Quantum Grav., 29(11):115001, 2012

  41. [41]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023

  42. [42]

    Towards robust mathematical reasoning

    Minh-Thang Luong, Dawsen Hwang, Hoang H Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, et al. Towards robust mathematical reasoning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35406–35430, 2025

  43. [43]

    Kor-bench: Benchmarking language models on knowledge-orthogonal reasoning tasks.arXiv preprint arXiv:2410.06526, 2024

    Kaijing Ma, Xinrun Du, Yunran Wang, Haoran Zhang, Zhoufutu Wen, Xingwei Qu, Jian Yang, Jiaheng Liu, Minghao Liu, Xiang Yue, et al. Kor-bench: Benchmarking language models on knowledge-orthogonal reasoning tasks.arXiv preprint arXiv:2410.06526, 2024

  44. [44]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

  45. [45]

    MINERVA: evaluating complex video reasoning.CoRR, abs/2505.00681, 2025

    Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl Vondrick, Mikhail Sirotenko, Cordelia Schmid, and Tobias Weyand. MINERVA: evaluating complex video reasoning.CoRR, abs/2505.00681, 2025

  46. [46]

    Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, and Jiaqi Wang. Ovo-bench: How far is your video-llms from real-world online video understanding? InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashvill...

  47. [47]

    Introducing SWE-bench Verified, August 2024

    OpenAI. Introducing SWE-bench Verified, August 2024. Accessed: 2025-12-10

  48. [48]

    Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations

    Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24838–24848, 2025

  49. [49]

    Teaching clip to count to ten

    Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching clip to count to ten. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3170–3180, 2023

  50. [50]

    Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E

    Shishir G. Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025

  51. [51]

    Humanity's Last Exam

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity’s last exam.arXiv preprint arXiv:2501.14249, 2025

  52. [52]

    Omnia de egotempo: Benchmarking temporal understanding of multi-modal llms in egocentric videos

    Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, and Federico Tombari. Omnia de egotempo: Benchmarking temporal understanding of multi-modal llms in egocentric videos. InIEEE/CVF Conference on 24 Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 24129–24138, 2025

  53. [53]

    Arc agi: The $1 million artificial general intelligence prize

    ARC Prize. Arc agi: The $1 million artificial general intelligence prize. https://arcprize.org/arc-agi/1/, 2024

  54. [54]

    Vcr-bench: A comprehensive evaluation framework for video chain-of-thought reasoning.CoRR, abs/2504.07956, 2025

    Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, and Feng Zhao. Vcr-bench: A comprehensive evaluation framework for video chain-of-thought reasoning.CoRR, abs/2504.07956, 2025

  55. [55]

    Phybench: Holistic evaluation of physical perception and reasoning in large language models

    Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu, et al. Phybench: Holistic evaluation of physical perception and reasoning in large language models. arXiv preprint arXiv:2504.16074, 2025

  56. [56]

    Vision language models are blind

    Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind. InProceedings of the Asian Conference on Computer Vision, pages 18–34, 2024

  57. [57]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling, 2024

  58. [58]

    Zerobench: An impossible visual benchmark for contemporary large multimodal models.arXiv preprint arXiv:2502.09696, 2025

    Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, et al. Zerobench: An impossible visual benchmark for contemporary large multimodal models.arXiv preprint arXiv:2502.09696, 2025

  59. [59]

    Xstest: A test suite for identifying exaggerated safety behaviours in large language models

    Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),...

  60. [60]

    TOMATO: assessing visual temporal reasoning capabilities in multimodal foundation models

    Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, and Arman Cohan. TOMATO: assessing visual temporal reasoning capabilities in multimodal foundation models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025, 2025

  61. [61]

    Optogenetic activators of apoptosis, necroptosis, and pyroptosis.Journal of Cell Biology, 221(6):e202109038, 2022

    Kateryna Shkarina, Eva Hasel de Carvalho, José Carlos Santos, Saray Ramos, Maria Leptin, and Petr Broz. Optogenetic activators of apoptosis, necroptosis, and pyroptosis.Journal of Cell Biology, 221(6):e202109038, 2022

  62. [62]

    DeepConsult: A deep research benchmark for consulting and business queries

    DeepConsult Team. DeepConsult: A deep research benchmark for consulting and business queries. https://github.com/youdotcom-oss/ydc-deep-research-evals, 2025. GitHub repository

  63. [63]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

  64. [64]

    Introducing terminal-bench 2.0 and harbor

    Terminal-Bench Team. Introducing terminal-bench 2.0 and harbor. https://www.tbench.ai/news/announcement- 2-0, nov 2025. Accessed: 2025-12-10

  65. [65]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advances in Neural Information Processing Systems, 37:87310–87356, 2024

    Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advances in Neural Information Processing Systems, 37:87310–87356, 2024

  66. [66]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024

  67. [67]

    Document understanding dataset and evaluation (dude)

    Jordy Van Landeghem, Rubèn Tito, Łukasz Borchmann, Michał Pietruszka, Pawel Joziak, Rafal Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Anckaert, Ernest Valveny, et al. Document understanding dataset and evaluation (dude). InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19528–19540, 2023

  68. [68]

    Vision Language Models are Biased

    An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, and Daeyoung Kim. Vision language models are biased.arXiv preprint arXiv:2505.23941, 2025

  69. [69]

    Muirbench: A comprehensive benchmark for robust multi-image understanding.arXiv preprint arXiv:2406.09411, 2024

    Fei Wang, Xingyu Fu, James Y Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. Muirbench: A comprehensive benchmark for robust multi-image understanding.arXiv preprint arXiv:2406.09411, 2024. 25

  70. [70]

    Measuring multimodal mathematical reasoning with math-vision dataset.Advances in Neural Information Processing Systems, 37:95095–95169, 2024

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset.Advances in Neural Information Processing Systems, 37:95095–95169, 2024

  71. [71]

    Lvbench: An extreme long video understanding benchmark.CoRR, abs/2406.08035, 2024

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Lvbench: An extreme long video understanding benchmark.CoRR, abs/2406.08035, 2024

  72. [72]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

  73. [73]

    Omnimmi: A comprehensive multi-modal interaction benchmark in streaming video contexts

    Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, and Zilong Zheng. Omnimmi: A comprehensive multi-modal interaction benchmark in streaming video contexts. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. Computer Vision Foundation / IEEE, 2025

  74. [74]

    Mmlongbench: Benchmarking long-context vision-language models effectively and thoroughly.arXiv preprint arXiv:2505.10610, 2025

    Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, et al. Mmlongbench: Benchmarking long-context vision-language models effectively and thoroughly.arXiv preprint arXiv:2505.10610, 2025

  75. [75]

    Aethercode: Evaluating llms’ ability to win in premier programming competitions.arXiv preprint arXiv:2508.16402, 2025

    Zihan Wang, Jiaze Chen, Zhicheng Liu, Markus Mak, Yidi Du, Geonsik Moon, Luoqi Xu, Aaron Tua, Kunshuo Peng, Jiayi Lu, et al. Aethercode: Evaluating llms’ ability to win in premier programming competitions.arXiv preprint arXiv:2508.16402, 2025

  76. [76]

    Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural Information Processing Systems, 37:113569–113697, 2024

    Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural Information Processing Systems, 37:113569–113697, 2024

  77. [77]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025

  78. [78]

    Widesearch: Benchmarking agentic broad info-seeking.arXiv preprint arXiv:2508.07999, 2025

    Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, et al. Widesearch: Benchmarking agentic broad info-seeking.arXiv preprint arXiv:2508.07999, 2025

  79. [79]

    Longvideobench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024

  80. [80]

    Logicvista: Multimodal llm logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973, 2024

    Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973, 2024

Showing first 80 references.