Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity

Bytedance Seed

arxiv: 2607.00248 · v1 · pith:VD4LYTBSnew · submitted 2026-06-30 · 💻 cs.AI


Pith reviewed 2026-07-02 18:54 UTC · model grok-4.3

classification 💻 cs.AI

keywords model card · real-world tasks · long-tail knowledge · complex instruction following · reasoning intelligence · visual understanding · search capabilities · long-horizon tasks


The pith

Seed2.0 improves reliability on intricate long-horizon tasks by targeting long-tail knowledge and complex instruction following via a user-need evaluation system.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Seed2.0 as a model series that begins by identifying users' genuine needs and building an evaluation system from benchmarks abstracted from realistic complex scenarios. Guided by this system, the work focuses on two persistent issues, long-tail knowledge and complex instruction following, to raise performance on difficult tasks. It further claims leading results in reasoning intelligence, visual understanding, and search that match common user requirements. Documented real-world use cases are offered to show that the model can now manage initial complex tasks and provide value to a broad audience of hundreds of millions of users.

Core claim

Seed2.0 takes a meaningful step toward solving complex, real-world tasks. Our approach begins with identifying users' genuine needs and constructing a reliable, forward-looking evaluation system by selecting and abstracting benchmarks grounded in these needs and in realistic, complex scenarios. Guided by this evaluation system, Seed2.0 targets two persistent challenges, long-tail knowledge and complex instruction following, substantially improving the model's reliability on intricate, long-horizon tasks. Beyond these, Seed2.0 delivers world-leading reasoning intelligence, visual understanding, and search capabilities that address the most common needs of a broad user base.

What carries the argument

The evaluation system that selects and abstracts benchmarks grounded in users' genuine needs and realistic complex scenarios, which then directs targeted improvements on long-tail knowledge and complex instruction following.

If this is right

Substantially improves the model's reliability on intricate, long-horizon tasks.
Delivers world-leading reasoning intelligence, visual understanding, and search capabilities.
Addresses the most common needs of a broad user base.
Begins to exhibit the ability to handle initial complex real-world tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar user-need evaluation systems could be applied to other model families to close the gap between benchmark scores and actual deployment value.
The emphasis on long-tail knowledge suggests that scaling data alone may be less effective than targeted curation for rare but high-value scenarios.
If the approach scales, future model cards may routinely include real-world use-case logs rather than only standard benchmark tables.

Load-bearing premise

The constructed evaluation system, by selecting and abstracting benchmarks grounded in users' genuine needs and realistic complex scenarios, accurately measures and predicts real-world performance.

What would settle it

A head-to-head test on unfiltered real-world complex tasks where Seed2.0 shows no measurable gain over earlier models in success rate or reliability would falsify the central claim.
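
One way to make "no measurable gain" concrete is a paired comparison on a shared task set. The sketch below is illustrative only: the function name `paired_sign_test`, the pass/fail outcome lists, and the choice of an exact one-sided sign test (McNemar-style) are assumptions for the example, not anything the paper or this review specifies.

```python
from math import comb


def paired_sign_test(new_success, old_success):
    """Exact one-sided sign test on paired pass/fail task outcomes.

    Returns (wins, losses, p_value). p_value is the chance of seeing at
    least `wins` new-model-only successes among the discordant tasks if
    each discordant task were equally likely to favor either model.
    """
    if len(new_success) != len(old_success):
        raise ValueError("both models must be scored on the same tasks")
    pairs = list(zip(new_success, old_success))
    wins = sum(1 for n, o in pairs if n and not o)    # only the new model succeeds
    losses = sum(1 for n, o in pairs if o and not n)  # only the old model succeeds
    discordant = wins + losses
    if discordant == 0:
        # The models agree on every task: no evidence of a gain either way.
        return wins, losses, 1.0
    tail = sum(comb(discordant, k) for k in range(wins, discordant + 1))
    return wins, losses, tail / 2 ** discordant


# Hypothetical outcomes on ten tasks (invented data, not from the paper).
new_model = [True] * 8 + [False] * 2
old_model = [True] * 5 + [False] * 5
print(paired_sign_test(new_model, old_model))  # (3, 0, 0.125)
```

A large p-value on a sufficiently large, unfiltered task set would be the kind of result that counts against the central claim. With only ten tasks, as in this toy input, the test has little power either way.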

Original abstract

We present Seed2.0, a model series that takes a meaningful step toward solving complex, real-world tasks. Our approach begins with identifying users' genuine needs and constructing a reliable, forward-looking evaluation system by selecting and abstracting benchmarks grounded in these needs and in realistic, complex scenarios. Guided by this evaluation system, Seed2.0 targets two persistent challenges, long-tail knowledge and complex instruction following, substantially improving the model's reliability on intricate, long-horizon tasks. Beyond these, Seed2.0 delivers world-leading reasoning intelligence, visual understanding, and search capabilities that address the most common needs of a broad user base. Through extensive real-world use cases documented in this model card, we demonstrate that Seed2.0 begins to exhibit the ability to handle initial complex real-world tasks, delivering greater value to hundreds of millions of users.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Seed2.0 is a closed model card from ByteDance that claims gains on complex tasks via an internal eval system but supplies no public benchmarks or numbers to check.


Seed2.0 is a model card for ByteDance's latest series. The main claim is that by building an evaluation system around real user needs and complex scenarios, they've made meaningful gains on long-tail knowledge and complex instruction following, leading to better performance on intricate long-horizon tasks. They also position it as having strong reasoning, visual, and search abilities, backed by real-world examples.

The paper does a decent job laying out their philosophy for evaluation. Starting from users' genuine needs and abstracting benchmarks from realistic situations is a reasonable way to guide development for practical AI. Including documentation of actual use cases helps show what the model can do in the wild.

The main weakness is the missing details on the evaluation itself. There are no specific benchmarks named, no scores reported, no comparisons to other models, and no data or code released. All the assertions about substantial improvements and world-leading capabilities depend on this internal system without any external check. That makes it hard to judge how much progress has actually been made.

Readers who track commercial AI releases from major labs might find this interesting for seeing ByteDance's direction. Anyone looking for new methods, open benchmarks, or verifiable results will come away empty. It is not the kind of work that calls for serious peer review.

I would not recommend sending this to referees. It functions as a model card and should be treated as such.

Referee Report

1 major / 0 minor

Summary. The paper presents Seed2.0, a model series designed to tackle complex, real-world tasks. It outlines the construction of a reliable, forward-looking evaluation system derived from users' genuine needs and realistic complex scenarios. This system guides improvements targeting long-tail knowledge and complex instruction following, leading to claims of substantially improved reliability on intricate, long-horizon tasks. Additionally, Seed2.0 is asserted to deliver world-leading performance in reasoning intelligence, visual understanding, and search capabilities, as evidenced by real-world use cases that demonstrate its ability to handle initial complex real-world tasks for a broad user base.

Significance. If the internal evaluation system proves to be representative and predictive of real-world performance, and if the performance claims can be independently verified, the work could significantly advance the field by providing models better suited for practical, long-horizon applications. The emphasis on user-centric benchmark selection is a positive direction, but requires substantiation.

major comments (1)

[Abstract] The central claims of 'substantially improving the model's reliability' and delivering 'world-leading' capabilities lack any supporting quantitative data, specific benchmark names, selection criteria, scoring methods, or comparisons to baselines or prior models. This absence makes the assertions impossible to evaluate or reproduce based on the provided manuscript.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and the opportunity to clarify the presentation of our work. We address the single major comment below.


Referee: [Abstract] The central claims of 'substantially improving the model's reliability' and delivering 'world-leading' capabilities lack any supporting quantitative data, specific benchmark names, selection criteria, scoring methods, or comparisons to baselines or prior models. This absence makes the assertions impossible to evaluate or reproduce based on the provided manuscript.

Authors: We agree that the abstract, being a high-level summary, does not embed the quantitative details, benchmark names, or explicit comparisons. The manuscript body describes the user-centric benchmark selection process, the internal evaluation system, scoring, and real-world use cases that support the reliability and capability claims. In the revised version we will expand the abstract to include specific benchmark names, key quantitative results from the evaluation system, and brief comparisons to prior models, while retaining the model-card focus on practical use cases.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected


The provided text is a model card, not a technical paper presenting a derivation chain, equations, or first-principles results. It explicitly describes constructing an internal evaluation system from user needs as the starting point of the approach, then reports empirical outcomes guided by that system and self-documented real-world use cases. No load-bearing step reduces a claimed prediction or result to its inputs by construction, as there are no mathematical reductions, fitted parameters renamed as predictions, or self-citation chains invoking uniqueness theorems. The evaluation system is openly positioned as an author-constructed input rather than a hidden assumption that forces equivalence. This is the standard structure for model cards and does not match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; model training inherently involves many fitted parameters but none are named or justified here.

pith-pipeline@v0.9.1-grok · 5665 in / 1005 out tokens · 20036 ms · 2026-07-02T18:54:34.743511+00:00 · methodology


Reference graph

Works this paper leans on

185 extracted references · 87 canonical work pages · 30 internal anchors

[1]

https://lf3-static.bytednsdoc.com/obj/eden-cn/ lapzild-tss/ljhwZthlaukjlkulzlp/research/Seed-1.8-Modelcard.pdf, 2025

Seed-1.8 Model Card and Evaluation Overview. https://lf3-static.bytednsdoc.com/obj/eden-cn/ lapzild-tss/ljhwZthlaukjlkulzlp/research/Seed-1.8-Modelcard.pdf, 2025. Accessed: 2026-02-xx

2025
[2]

Open-world text-specified object counting.arXiv preprint arXiv:2306.01851, 2023

Niki Amini-Naieni, Kiana Amini-Naieni, Tengda Han, and Andrew Zisserman. Open-world text-specified object counting.arXiv preprint arXiv:2306.01851, 2023

work page arXiv 2023
[3]

The anthropic economic index: January 2026 report on ai work task evo- lution

Anthropic. The anthropic economic index: January 2026 report on ai work task evo- lution. Technical report, Anthropic, 2026. URL https://www.anthropic.com/research/ anthropic-economic-index-january-2026-report

2026
[4]

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions.arXiv preprint arXiv:2505.23281, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan.τ2-bench: Evaluating conversa- tional agents in a dual-control environment.arXiv preprint arXiv:2506.07982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

T. F. Bloom. Erdős problem #1051. URLhttps://www.erdosproblems.com/1051
[7]

Visual physics comprehension test, 2025

Chase Brower. Visual physics comprehension test, 2025

2025
[8]

Beyondaime: Advancing math reasoning evaluation beyond high school olympiads

ByteDance-Seed. Beyondaime: Advancing math reasoning evaluation beyond high school olympiads. https://huggingface.co/datasets/ByteDance-Seed/BeyondAIME, 2025

2025
[9]

Seed1.6.https://seed.bytedance.com/en/seed1_6, 2025

ByteDance Seed. Seed1.6.https://seed.bytedance.com/en/seed1_6, 2025. Accessed: 2026-02-08

2025
[10]

Introduction to techniques used in seed1.6

ByteDance Seed. Introduction to techniques used in seed1.6. https://seed.bytedance.com/en/blog/ introduction-to-techniques-used-in-seed1-6, 2025. Accessed: 2026-02-08

2025
[11]

Seed-coder open-sourced: Llm-based code data building method validated.https://seed

ByteDance Seed. Seed-coder open-sourced: Llm-based code data building method validated.https://seed. bytedance.com/en/blog/seed-coder-open-sourced-llm-based-code-data-building-method-validated ,
[12]

Accessed: 2026-02-08

2026
[13]

Seed research: Seed diffusion preview released — a diffusion language model delivering breakthrough 2,146 tokens/s inference speed

ByteDance Seed. Seed research: Seed diffusion preview released — a diffusion language model delivering breakthrough 2,146 tokens/s inference speed. https://seed.bytedance.com/blog/ seed-research-seed-diffusion-preview-released-a-diffusion-language-model-delivering-breakthrough-2-146-tokens-s-inference-speed ,
[14]

Accessed: 2026-02-08. 48

2026
[15]

Seed diffusion: A large-scale diffusion language model with high-speed inference

ByteDance Seed. Seed diffusion: A large-scale diffusion language model with high-speed inference. https://seed.bytedance.com/public_papers/ seed-diffusion-a-large-scale-diffusion-language-model-with-high-speed-inference , 2025. Accessed: 2026-02-08

2025
[16]

Seededit 3.0: Fast and high-quality generative image editing.https://seed.bytedance.com/ public_papers/seededit-3-0-fast-and-high-quality-generative-image-editing , 2025

ByteDance Seed. Seededit 3.0: Fast and high-quality generative image editing.https://seed.bytedance.com/ public_papers/seededit-3-0-fast-and-high-quality-generative-image-editing , 2025. Accessed: 2026- 02-08

2025
[17]

Seed-oss open-source models release

ByteDance Seed. Seed-oss open-source models release. https://seed.bytedance.com/en/blog/ seed-oss-open-source-models-release, 2025. Accessed: 2026-02-08

2025
[18]

Seedream 3.0 text-to-image model technical report released.https://seed.bytedance.com/ blog/seedream-3-0-text-to-image-model-technical-report-released, 2025

ByteDance Seed. Seedream 3.0 text-to-image model technical report released.https://seed.bytedance.com/ blog/seedream-3-0-text-to-image-model-technical-report-released, 2025. Accessed: 2026-02-08

2025
[19]

MORSE-500: A programmatically controllable video benchmark to stress-test multimodal reasoning.CoRR, abs/2506.05523, 2025

Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, Yongyuan Liang, Tom Goldstein, and Furong Huang. MORSE-500: A programmatically controllable video benchmark to stress-test multimodal reasoning.CoRR, abs/2506.05523, 2025

work page arXiv 2025
[20]

Video simpleqa: Towards factuality evaluation in large video language models.CoRR, abs/2503.18923, 2025

Meng Cao, Pengfei Hu, Yingyao Wang, Jihao Gu, Haoran Tang, Haoze Zhao, Jiahua Dong, Wangbo Yu, Ge Zhang, Ian Reid, and Xiaodan Liang. Video simpleqa: Towards factuality evaluation in large video language models.CoRR, abs/2503.18923, 2025

work page arXiv 2025
[21]

How people use chatgpt

Aaron Chatterji, Thomas Cunningham, David J Deming, Zoe Hitzig, Christopher Ong, Carl Yan Shan, and Kevin Wadman. How people use chatgpt. Working Paper 34255, National Bureau of Economic Research, September
[22]

URLhttp://www.nber.org/papers/w34255
[23]

Cg-bench: Clue-grounded question answering benchmark for long video understanding

Guo Chen, Yicheng Liu, Yifei Huang, Baoqi Pei, Jilan Xu, Yuping He, Tong Lu, Yali Wang, and Limin Wang. Cg-bench: Clue-grounded question answering benchmark for long video understanding. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025, 2025

2025
[24]

Seed- prover 1.5: Mastering undergraduate-level theorem proving via learning from experience, 2025

Jiangjie Chen, Wenxiang Chen, Jiacheng Du, Jinyi Hu, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Wenlei Shi, Zhihong Wang, Mingxuan Wang, Chenrui Wei, Shufa Wei, Huajian Xin, Fan Yang, Weihao Gao, Zheng Yuan, Tianyang Zhan, Zeyu Zheng, Tianxi Zhou, and Thomas Hanwen Zhu. Seed- prover 1.5: Mastering undergraduate-level theorem proving v...

work page arXiv 2025
[25]

Livecc: Learning video LLM with streaming speech transcription at scale

Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, and Mike Zheng Shou. Livecc: Learning video LLM with streaming speech transcription at scale. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 29083–29095, 2025

2025
[26]

Babyvision: Visual reasoning beyond language.arXiv preprint arXiv:2601.06521, 2026

Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Yiping Bao, et al. Babyvision: Visual reasoning beyond language.arXiv preprint arXiv:2601.06521, 2026

work page arXiv 2026
[27]

Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

2024
[28]

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.arXiv preprint arXiv:2503.09567, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can MLLM think like holmes for complex video reasoning?CoRR, abs/2505.21374, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Pointarena: Probing multimodal grounding through language-guided pointing

Long Cheng, Jiafei Duan, Yi Ru Wang, Haoquan Fang, Boyang Li, Yushan Huang, Elvis Wang, Ainaz Eftekhar, Jason Lee, Wentao Yuan, et al. Pointarena: Probing multimodal grounding through language-guided pointing. arXiv preprint arXiv:2505.09990, 2025

work page arXiv 2025
[31]

Simplevqa: Multimodal factuality evaluation for multimodal large language models

Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, et al. Simplevqa: Multimodal factuality evaluation for multimodal large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4637–4646, 2025

2025
[32]

Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees G. M. Snoek, and Yuki M. Asano. Tvbench: Redesigning video-language evaluation.CoRR, abs/2410.07752, 2024. 49

work page arXiv 2024
[33]

Longdocurl: a comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating

Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, et al. Longdocurl: a comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1135–...

2025
[34]

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms

Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Baptist Mols, Lifeng Jin, Ed-Yeremai Hernandez-Cardona, Dean Lee, Jeremy Kritz, Willow E Primack, Summer Yue, and Chen Xing. Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms. InFindings of the Association for Computational Linguistics: ACL 2025, pages 18632–...

2025
[36]

Nl2repo-bench: Towards long-horizon repository generation evaluation of coding agents

Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, Fei Hu, Zhaojian Li, et al. Nl2repo-bench: Towards long-horizon repository generation evaluation of coding agents. arXiv preprint arXiv:2512.12730, 2025

work page arXiv 2025
[37]

CL-bench: A Benchmark for Context Learning.arXiv e-prints, art

Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, Huaibing Xie, Jianglu Hu, Shaolei Wang, Weichao Wang, Yanling Xiao, Yiting Liu, Zenan Xu, Zhen Guo, Pluto Zhou, Tao Gui, Zuxuan Wu, Xipeng Qiu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Di Wang, and Shunyu Yao. CL-bench: A Benchmar...

work page doi:10.48550/arxiv.2602.03587 2026
[38]

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Mingxuan Du, BenfengXu, ChiweiZhu, XiaoruiWang, and ZhendongMao. Deepresearch bench: A comprehensive benchmark for deep research agents.arXiv preprint arXiv:2506.11763, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, et al. Supergpqa: Scaling llm evaluation across 285 graduate disciplines.arXiv preprint arXiv:2502.14739, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge.arXiv e-prints, art

Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De- An Huang, Yuke Zhu, and Anima Anandkumar. MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge.arXiv e-prints, art. arXiv:2206.08853, June 2022. doi: 10.48550/arXiv.2206.08853

work page doi:10.48550/arxiv.2206.08853 2022
[41]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InI...

2025
[42]

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, et al. Ocrbench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning.arXiv preprint arXiv:2501.00321, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Vispeak: Visual instruction feedback in streaming videos.CoRR, abs/2503.12769, 2025

Shenghao Fu, Qize Yang, Yuan-Ming Li, Yi-Xing Peng, Kun-Yu Lin, Xihan Wei, Jian-Fang Hu, Xiaohua Xie, and Wei-Shi Zheng. Vispeak: Visual instruction feedback in streaming videos.CoRR, abs/2503.12769, 2025

work page arXiv 2025
[44]

Blink: Multimodal large language models can see but not perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. InEuropean Conference on Computer Vision, pages 148–166. Springer, 2024

2024
[45]

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Gemini enterprise: Release notes and workspace agent integration updates

Google Cloud. Gemini enterprise: Release notes and workspace agent integration updates. Official Documentation,
[47]

URLhttps://docs.cloud.google.com/gemini/enterprise/docs/release-notes
[48]

Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pag...

2024
[49]

Seed1.5-VL Technical Report

D. Guo et al. Seed1.5-vl technical report.arXiv preprint arXiv:2505.07062, 2025. URLhttps://arxiv.org/ abs/2505.07062

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark.arXiv preprint arXiv:2501.05444, 2025

Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, and Yu Cheng. Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark.arXiv preprint arXiv:2501.05444, 2025

work page arXiv 2025
[51]

arXiv preprint arXiv:2509.26490 , year=

Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, et al. Vitabench: Benchmarking llm agents with versatile interactive tasks in real-world applications.arXiv preprint arXiv:2509.26490, 2025

work page arXiv 2025
[52]

Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models

Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 8450–8460, 2025

2025
[53]

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos.arXiv preprint arXiv:2501.13826, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Finsearchcomp: Towards a realistic, expert-level evaluation of financial search and reasoning.arXiv preprint arXiv:2509.13160, 2025

Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, et al. Finsearchcomp: Towards a realistic, expert-level evaluation of financial search and reasoning.arXiv preprint arXiv:2509.13160, 2025

work page arXiv 2025
[55]

Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs

Jen-Tse Huang, Dasen Dai, Jen-Yuan Huang, Youliang Yuan, Xiaoyuan Liu, Wenxuan Wang, Wenxiang Jiao, Pinjia He, and Zhaopeng Tu. Visfactor: Benchmarking fundamental visual cognition in multimodal large language models.arXiv preprint arXiv:2502.16435, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

Winning gold at imo 2025 with a model-agnostic verification-and-refinement pipeline.arXiv preprint arXiv:2507.15855, 2025

Yichen Huang and Lin F Yang. Winning gold at imo 2025 with a model-agnostic verification-and-refinement pipeline.arXiv preprint arXiv:2507.15855, 2025

work page arXiv 2025
[57]

Online video understanding: Ovbench and videochat-online

Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, and Limin Wang. Online video understanding: Ovbench and videochat-online. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 3328–3338, 2025

2025
[58]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[60]

Crossvid: A comprehensive benchmark for evaluating cross-video reasoning in multimodal large language models.CoRR, abs/2511.12263, 2025

Jingyao Li, Jingyun Wang, Molin Tan, Haochen Wang, Cilin Yan, Likun Shi, Jiayin Cai, Xiaolong Jiang, and Yao Hu. Crossvid: A comprehensive benchmark for evaluating cross-video reasoning in multimodal large language models.CoRR, abs/2511.12263, 2025

work page arXiv 2025
[61]

MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents.arXiv e-prints, art

Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, and Shilei Wen. MM-BrowseComp: A Comprehensive Benchmark for ...

work page doi:10.48550/arxiv.2508.13186 2025
[62]

Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements.arXiv e-prints, art

Yiming Liang, Yizhi Li, Yantao Du, Ge Zhang, Jiayi Zhou, Yuchen Wu, Yinzhu Piao, Denghui Cao, Tong Sun, Ziniu Li, Li Du, Bo Lei, Jiaheng Liu, Chenghua Lin, Zhaoxiang Zhang, Wenhao Huang, and Jiajun Zhang. Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements.arXiv e-prints, art. arXiv:2512.24867, December 2025. doi: 10.48550/arXiv.2512.24867

work page doi:10.48550/arxiv.2512.24867 2025
[63]

Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233. Springer, 2024

2024
[64] Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 8731–87...
[65] Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y. Charles, Xinyu Zhou, and Xu Sun. Videoreasonbench: Can mllms perform vision-centric complex video reasoning? CoRR, abs/2505.23359, 2025.
[66] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
[67] Minh-Thang Luong, Dawsen Hwang, Hoang H Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, et al. Towards robust mathematical reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35406–35430, 2025.
[68] Kaijing Ma, Xinrun Du, Yunran Wang, Haoran Zhang, Zhoufutu Wen, Xingwei Qu, Jian Yang, Jiaheng Liu, Minghao Liu, Xiang Yue, et al. Kor-bench: Benchmarking language models on knowledge-orthogonal reasoning tasks. arXiv preprint arXiv:2410.06526, 2024.
[69] Wentao Ma, Weiming Ren, Yiming Jia, Zhuofeng Li, Ping Nie, Ge Zhang, and Wenhu Chen. Videoeval-pro: Robust and realistic long video understanding evaluation. CoRR, abs/2505.14640, 2025.
[70] Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations. Advances in Neural Information Processing Systems, 37:95963–96010, 2024.
[71] Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xiaokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang, and Jie Tang. SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation. arXiv e-prints, art. arXiv:2406.14991, June 2024. doi: 10.48550/arXiv.2406.14991.
[72] Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, et al. Chartqapro: A more diverse and challenging benchmark for chart question answering. arXiv preprint arXiv:2504.05506, 2025.
[73] Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026.
[74] Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023.
[75] Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? arXiv e-prints, art. arXiv:2502.12115, February 2025. doi: 10.48550/arXiv.2502.12115.
[76] Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl Vondrick, Mikhail Sirotenko, Cordelia Schmid, and Tobias Weyand. MINERVA: evaluating complex video reasoning. CoRR, abs/2505.00681, 2025.
[77] Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, and Jiaqi Wang. Ovo-bench: How far is your video-llms from real-world online video understanding? In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashvill...
[78] OpenAI. The state of enterprise ai 2025: Adoption, depth, and workflow integration. Technical report, OpenAI, 2025. URL https://openai.com/index/the-state-of-enterprise-ai-2025-report/
[80] Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24838–24848, 2025.
