Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity
Pith reviewed 2026-07-02 18:54 UTC · model grok-4.3
The pith
Seed2.0 improves reliability on intricate long-horizon tasks by targeting long-tail knowledge and complex instruction following via a user-need evaluation system.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Seed2.0 takes a meaningful step toward solving complex, real-world tasks. Our approach begins with identifying users' genuine needs and constructing a reliable, forward-looking evaluation system by selecting and abstracting benchmarks grounded in these needs and in realistic, complex scenarios. Guided by this evaluation system, Seed2.0 targets two persistent challenges, long-tail knowledge and complex instruction following, substantially improving the model's reliability on intricate, long-horizon tasks. Beyond these, Seed2.0 delivers world-leading reasoning intelligence, visual understanding, and search capabilities that address the most common needs of a broad user base.
What carries the argument
The evaluation system that selects and abstracts benchmarks grounded in users' genuine needs and realistic complex scenarios, which then directs targeted improvements on long-tail knowledge and complex instruction following.
If this is right
- Substantially improves the model's reliability on intricate, long-horizon tasks.
- Delivers world-leading reasoning intelligence, visual understanding, and search capabilities.
- Addresses the most common needs of a broad user base.
- Begins to exhibit the ability to handle initial complex real-world tasks.
Where Pith is reading between the lines
- Similar user-need evaluation systems could be applied to other model families to close the gap between benchmark scores and actual deployment value.
- The emphasis on long-tail knowledge suggests that scaling data alone may be less effective than targeted curation for rare but high-value scenarios.
- If the approach scales, future model cards may routinely include real-world use-case logs rather than only standard benchmark tables.
Load-bearing premise
The constructed evaluation system, by selecting and abstracting benchmarks grounded in users' genuine needs and realistic complex scenarios, accurately measures and predicts real-world performance.
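One way to probe this premise is to check whether benchmark scores rank models in the same order as their observed real-world success. The sketch below computes a Spearman rank correlation in plain Python. The function names and the numbers in the usage lines are hypothetical illustrations, not values or methods from the model card.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx = sum(rx) / len(rx)
    my = sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Hypothetical data: one entry per model version (illustrative only).
benchmark_scores = [61.0, 68.5, 72.0, 80.5]
real_world_success = [0.40, 0.52, 0.49, 0.63]
print(spearman(benchmark_scores, real_world_success))
```

A value near 1 would be consistent with the premise. A value near 0 would suggest the evaluation system does not track deployment outcomes. With only a handful of models the estimate is noisy, so it is a sanity check rather than proof.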
What would settle it
A head-to-head test on unfiltered real-world complex tasks where Seed2.0 shows no measurable gain over earlier models in success rate or reliability would falsify the central claim.
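A head-to-head comparison of this kind can be sketched as a paired test over the same task set. The example below assumes each model's result on each task is recorded as pass/fail and applies an exact two-sided sign test to the tasks where the two models disagree. The function names and outcome lists are hypothetical, and the model card does not say which statistical test, if any, its authors use.

```python
from math import comb


def success_rate(outcomes):
    """Fraction of tasks solved; outcomes is a sequence of 0/1 or bool values."""
    return sum(1 for o in outcomes if o) / len(outcomes)


def paired_sign_test(new, old):
    """Exact two-sided sign test on tasks where exactly one model succeeded.

    Returns (wins, losses, p_value), where wins counts tasks solved only by
    the new model and losses counts tasks solved only by the old model.
    """
    if len(new) != len(old):
        raise ValueError("both models must be scored on the same tasks")
    wins = sum(1 for n, o in zip(new, old) if n and not o)
    losses = sum(1 for n, o in zip(new, old) if o and not n)
    k = wins + losses
    if k == 0:
        return wins, losses, 1.0  # the models never disagree
    # Under the null hypothesis each disagreement is a fair coin flip.
    tail = sum(comb(k, i) for i in range(min(wins, losses) + 1)) / 2 ** k
    return wins, losses, min(1.0, 2 * tail)


# Hypothetical per-task outcomes on the same eight tasks (1 = solved).
new_model = [1, 1, 1, 1, 0, 1, 1, 1]
old_model = [0, 1, 0, 0, 0, 1, 0, 1]
print(success_rate(new_model), success_rate(old_model))
print(paired_sign_test(new_model, old_model))
```

A large p-value on a sufficiently large, unfiltered task set would be the "no measurable gain" outcome described above. A small sample like the one shown cannot settle the question in either direction.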
Original abstract
We present Seed2.0, a model series that takes a meaningful step toward solving complex, real-world tasks. Our approach begins with identifying users' genuine needs and constructing a reliable, forward-looking evaluation system by selecting and abstracting benchmarks grounded in these needs and in realistic, complex scenarios. Guided by this evaluation system, Seed2.0 targets two persistent challenges, long-tail knowledge and complex instruction following, substantially improving the model's reliability on intricate, long-horizon tasks. Beyond these, Seed2.0 delivers world-leading reasoning intelligence, visual understanding, and search capabilities that address the most common needs of a broad user base. Through extensive real-world use cases documented in this model card, we demonstrate that Seed2.0 begins to exhibit the ability to handle initial complex real-world tasks, delivering greater value to hundreds of millions of users.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Seed2.0, a model series designed to tackle complex, real-world tasks. It outlines the construction of a reliable, forward-looking evaluation system derived from users' genuine needs and realistic complex scenarios. This system guides improvements targeting long-tail knowledge and complex instruction following, leading to claims of substantially improved reliability on intricate, long-horizon tasks. Additionally, Seed2.0 is asserted to deliver world-leading performance in reasoning intelligence, visual understanding, and search capabilities, as evidenced by real-world use cases that demonstrate its ability to handle initial complex real-world tasks for a broad user base.
Significance. If the internal evaluation system proves to be representative and predictive of real-world performance, and if the performance claims can be independently verified, the work could significantly advance the field by providing models better suited for practical, long-horizon applications. The emphasis on user-centric benchmark selection is a positive direction, but requires substantiation.
Major comments (1)
- [Abstract] The central claims of 'substantially improving the model's reliability' and delivering 'world-leading' capabilities lack any supporting quantitative data, specific benchmark names, selection criteria, scoring methods, or comparisons to baselines or prior models. This absence makes the assertions impossible to evaluate or reproduce based on the provided manuscript.
Simulated Author's Rebuttal
We thank the referee for their review and the opportunity to clarify the presentation of our work. We address the single major comment below.
Point-by-point responses
- Referee: [Abstract] The central claims of 'substantially improving the model's reliability' and delivering 'world-leading' capabilities lack any supporting quantitative data, specific benchmark names, selection criteria, scoring methods, or comparisons to baselines or prior models. This absence makes the assertions impossible to evaluate or reproduce based on the provided manuscript.
  Authors: We agree that the abstract, being a high-level summary, does not embed the quantitative details, benchmark names, or explicit comparisons. The manuscript body describes the user-centric benchmark selection process, the internal evaluation system, scoring, and real-world use cases that support the reliability and capability claims. In the revised version we will expand the abstract to include specific benchmark names, key quantitative results from the evaluation system, and brief comparisons to prior models, while retaining the model-card focus on practical use cases.
  Revision: yes
Circularity Check
No significant circularity detected
Full rationale
The provided text is a model card, not a technical paper presenting a derivation chain, equations, or first-principles results. It explicitly describes constructing an internal evaluation system from user needs as the starting point of the approach, then reports empirical outcomes guided by that system and self-documented real-world use cases. No load-bearing step reduces a claimed prediction or result to its inputs by construction, as there are no mathematical reductions, fitted parameters renamed as predictions, or self-citation chains invoking uniqueness theorems. The evaluation system is openly positioned as an author-constructed input rather than a hidden assumption that forces equivalence. This is the standard structure for model cards and does not match any of the enumerated circularity patterns.