Benchmark Everything Everywhere All at Once

Bokang Yang; Dongming Wu; Peiwen Sun; Shiyun Xiong; Wencheng Han; Xiangyu Yue; Xiao-Hui Li; Yuang Ai

arxiv: 2606.06462 · v1 · pith:U237FBZ6new · submitted 2026-06-04 · 💻 cs.AI

Shiyun Xiong, Dongming Wu, Peiwen Sun, Yuang Ai, Bokang Yang, Wencheng Han, Xiao-Hui Li, Xiangyu Yue

Pith reviewed 2026-06-28 01:00 UTC · model grok-4.3

classification 💻 cs.AI

keywords: benchmark construction, autonomous agents, LLM evaluation, data annotation, quality control, multimodal models, scalable benchmarks, agentic systems

The pith

An autonomous agent can construct high-quality benchmarks for LLMs and MLLMs across many domains with minimal human input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Benchmark Agent as a system that takes over the full process of benchmark creation, from analyzing user needs and designing subtasks to annotating data and running quality checks. This matters because manual benchmark building is slow, hard to reuse, and leads to tests that stop distinguishing between top models once they saturate. The authors ran the agent on fifteen cases covering text, multimodal, and specialized reasoning tasks, then checked the outputs with human reviewers, LLM judges, and consistency tests. The results indicate the agent produces samples that meet expert standards while requiring little ongoing human work. The approach also surfaces observations about where current models still fail on domain-specific problems.

Core claim

Benchmark Agent is a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline, from user query analysis and subtask design through data annotation and quality control. Applied to generate fifteen representative benchmarks, it produces samples that the authors report as high-quality under human evaluation, LLM-as-a-judge assessment, and consistency checks, with only minimal human involvement.

What carries the argument

Benchmark Agent, the agentic system that manages the end-to-end pipeline of query analysis, subtask design, data annotation, and quality control.

If this is right

Benchmarks can be produced rapidly enough to stay ahead of model performance saturation.
Current models show clear weaknesses on certain domain-specific reasoning tasks when evaluated with the new samples.
The same agentic pipeline works for text understanding, multimodal understanding, and specialized reasoning scenarios.
Large numbers of reusable benchmarks become feasible without proportional increases in human labor.
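
One way to make "saturation" in the first point concrete is to track how much headroom a benchmark leaves and how widely it spreads model scores. The sketch below is an illustration only: the function name, the two thresholds, and the example scores are assumptions for this example, not values or methods taken from the paper.

```python
from statistics import pstdev


def saturation_report(scores, ceiling=1.0, headroom_tol=0.05, spread_tol=0.02):
    """Summarize how discriminative a benchmark still is.

    scores: mapping of model name -> accuracy in [0, ceiling].
    headroom_tol and spread_tol are illustrative thresholds, not paper values.
    """
    values = list(scores.values())
    if not values:
        raise ValueError("need at least one model score")
    headroom = ceiling - max(values)  # room left above the best model
    spread = pstdev(values)           # how far apart the models are
    return {
        "headroom": headroom,
        "spread": spread,
        "saturated": headroom < headroom_tol and spread < spread_tol,
    }


# Hypothetical scores: an older benchmark versus a freshly generated one.
old = saturation_report({"model_a": 0.97, "model_b": 0.96, "model_c": 0.98})
new = saturation_report({"model_a": 0.41, "model_b": 0.63, "model_c": 0.55})
print(old["saturated"], new["saturated"])
```

A benchmark flagged as saturated under a rule like this would be a candidate for regeneration.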

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Continual regeneration of benchmarks could keep evaluation sets discriminative even as models improve quickly.
Lowering the cost of creating domain-specific tests might encourage more targeted evaluations in new fields.
If the agent's judgments align closely with its underlying model, the benchmarks could systematically miss certain failure modes that human experts would notice.

Load-bearing premise

An LLM-driven agent can carry out data annotation and quality control at expert-human level without introducing undetected biases or low-quality samples.

What would settle it

Domain experts reviewing the generated benchmark samples identify a large share of flawed or biased items that the agent's quality-control steps did not catch.
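
The test described above amounts to estimating a flaw rate from an expert audit. A minimal sketch, assuming the audited items are a simple random sample and each gets a binary flawed/not-flawed verdict, uses the Wilson score interval to bound that rate. The function name and the audit counts are hypothetical and do not come from the paper.

```python
import math


def wilson_interval(flawed, audited, z=1.96):
    """Wilson score interval for the true flaw rate.

    flawed: number of audited items the experts marked as flawed.
    audited: total number of items the experts reviewed.
    z: normal quantile (1.96 corresponds to roughly 95% confidence).
    """
    if audited <= 0 or not 0 <= flawed <= audited:
        raise ValueError("need 0 <= flawed <= audited and audited > 0")
    p = flawed / audited
    denom = 1 + z * z / audited
    center = (p + z * z / (2 * audited)) / denom
    half = z * math.sqrt(p * (1 - p) / audited + z * z / (4 * audited * audited)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical audit: experts flag 12 of 200 sampled items.
low, high = wilson_interval(12, 200)
print(f"flaw rate between {low:.3f} and {high:.3f}")
```

If the lower bound of such an interval sat well above the rate the agent's own quality control implies, that would count against the load-bearing premise.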

Figures

Figures reproduced from arXiv: 2606.06462 by Bokang Yang, Dongming Wu, Peiwen Sun, Shiyun Xiong, Wencheng Han, Xiangyu Yue, Xiao-Hui Li, Yuang Ai.

**Figure 2.** Benchmark performance saturation on Qwen.

**Figure 3.** The overall pipeline of Benchmark Agent. It consists of two main components. Benchmark Planner first (i) decomposes user requirements into subtasks, then (ii) grounds each subtask to real datasets through transformability validation, and finally (iii) determines feasible allocations under global constraints. Benchmark Executor subsequently (i) performs sample-level planning, (ii) executes tool-based trans…
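
The caption does not say how the planner's allocation step is solved. As a hedged illustration only, the sketch below assumes each subtask has a capacity (how many source samples passed transformability validation), a positive weight, and a single global sample budget, then fills subtasks greedily so that none exceeds its capacity. All names and numbers are hypothetical.

```python
def allocate(total, capacity, weight):
    """Split a global sample budget across subtasks.

    total: number of benchmark samples to produce overall.
    capacity: dict subtask -> max samples its source data can supply.
    weight: dict subtask -> positive relative importance.
    Assumed scheme (not from the paper): hand out one sample at a time to the
    open subtask that is furthest below its weighted share.
    """
    if any(w <= 0 for w in weight.values()):
        raise ValueError("weights must be positive")
    if total > sum(capacity.values()):
        raise ValueError("infeasible: budget exceeds total capacity")
    alloc = dict.fromkeys(capacity, 0)
    for _ in range(total):
        # Subtasks that can still accept another sample.
        open_tasks = [n for n in capacity if alloc[n] < capacity[n]]
        # Lowest allocation-to-weight ratio wins; the name breaks ties.
        name = min(open_tasks, key=lambda n: (alloc[n] / weight[n], n))
        alloc[name] += 1
    return alloc


# Hypothetical plan: subtask "a" has little usable source data.
plan = allocate(10, {"a": 3, "b": 10, "c": 10}, {"a": 1, "b": 1, "c": 1})
print(plan)
```

The infeasibility check mirrors the idea of "feasible allocations": a budget that the grounded datasets cannot supply is rejected rather than silently under-filled.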

**Figure 4.** Failure cases observed during model evaluation. Human verification confirms that the annotations for these samples are correct, and the errors arise from model predictions.

… from the previous sample state to the newly produced fields. Although planning remains adaptive, it is explicitly constrained by the t_{i,j}, preventing uncontrolled divergence across samples. (ii) Execution. Each planned action is execute…

Original abstract

Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit measures of performance. However, their construction is labor-intensive and hard to reuse, raising concerns about sustainability and scalability. Moreover, existing benchmarks often quickly reach performance saturation after their release, resulting in insufficient discrimination among state-of-the-art models. To address these challenges, we introduce Benchmark Agent, a fully autonomous agentic system designed for benchmark building. Our framework orchestrates the complete benchmark construction pipeline, from user query analysis and subtask design to data annotation and quality control. To assess Benchmark Agent, we implement it to produce 15 representative benchmarks, spanning diverse evaluation scenarios, including text understanding, multimodal understanding, and domain-specific reasoning. Extensive experiments, including human evaluation, LLM-as-a-judge assessment, and consistency checks, demonstrate Benchmark Agent can generate high-quality benchmark samples with minimal human involvement. More importantly, through continual evaluation, we observe several insightful findings, including that current models struggle with certain domain-specific reasoning tasks. We believe that rapidly evolving benchmarks can contribute significantly to the research community. The preview and code will be publicly available at the demo page and code repository.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds an end-to-end agent for generating benchmarks from a user query, but its quality claims depend on evaluation methods that may miss the exact problems the system could introduce.

The main takeaway is an autonomous agent pipeline that takes a query, designs subtasks, annotates data, and runs quality control with almost no further human input. They used it to produce 15 benchmarks covering text, multimodal, and domain-specific tasks, then checked the results with human raters, an LLM judge, and consistency metrics.

What is new is the complete orchestration in one agentic loop rather than separate tools for data generation or filtering. The abstract positions this as a way to keep benchmarks from saturating quickly and to reduce the manual effort that currently limits how often new tests appear.

The system description itself is straightforward and directly targets a known bottleneck. The plan to release code and a demo is also useful for anyone who wants to try reproducing the pipeline.

The soft spot is the validation. The paper states that the experiments show high-quality output, yet the abstract gives no numbers on agreement rates, no breakdown of what the human raters were actually scoring, and no comparison against existing human-written benchmarks on the same tasks. The stress-test concern holds: if the human evaluators or the LLM judge lack the domain depth to spot subtle factual or reasoning errors, those checks would not catch systematic issues introduced during the agent's annotation step. Because the system is meant to run autonomously after the initial query, any undetected bias would stay in the released benchmark.

This is for researchers who build or maintain LLM evaluation suites and are looking for ways to increase throughput. A reader already working on automated data pipelines might pick up implementation ideas, but would still need to run their own checks before trusting the outputs.

It should go to peer review. The engineering approach is concrete enough to discuss, and referees can require tighter evidence on whether the generated benchmarks actually hold up under expert scrutiny.

Referee Report

2 major / 0 minor

Summary. The paper introduces Benchmark Agent, a fully autonomous agentic system that orchestrates the full benchmark construction pipeline for LLMs and MLLMs, including user query analysis, subtask design, data annotation, and quality control. The authors report implementing the system to generate 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning. They claim that extensive experiments using human evaluation, LLM-as-a-judge assessment, and consistency checks demonstrate that the system produces high-quality benchmark samples with minimal human involvement, and they report additional findings from continual evaluation on model performance limitations.

Significance. If the central claims are substantiated with rigorous evidence, the work could meaningfully advance sustainable benchmark creation by reducing labor intensity and enabling rapid iteration to avoid saturation. The planned public release of code and previews would be a concrete strength, supporting reproducibility and community use. However, the current presentation provides no quantitative results, error analysis, or dataset statistics, limiting assessment of whether the approach delivers expert-level output.

major comments (2)

[Abstract] The central claim that 'Benchmark Agent can generate high-quality benchmark samples with minimal human involvement' rests on unspecified experiments; no quantitative metrics, inter-rater agreement scores, error rates, or sample statistics are reported, rendering the validity of the high-quality output assertion impossible to assess.
[Experiments] (as described) The human evaluation and LLM-as-a-judge protocols are mentioned at a high level without specifying evaluator expertise across all 15 domains, the evaluation rubric, number of raters, or how subtle factual/reasoning errors were probed; this directly bears on whether the weakest assumption (reliable expert-level annotation without undetected bias) holds.
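
The agreement statistics requested in the first comment could be as simple as Cohen's kappa between a human rater and the LLM judge on the same samples. The sketch below is illustrative: the labels and data are invented for the example, and the paper may report a different statistic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected if each rater labeled at random with their own marginals.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on six generated samples.
human = ["ok", "ok", "flawed", "ok", "flawed", "ok"]
judge = ["ok", "ok", "flawed", "flawed", "flawed", "ok"]
print(round(cohens_kappa(human, judge), 3))
```

Reporting kappa per domain, rather than one pooled figure, would speak to the second comment's concern about evaluator expertise across all 15 domains.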

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The feedback highlights important gaps in the presentation of our experimental results and protocols. We agree that additional quantitative details and protocol specifications are required to fully substantiate the claims regarding benchmark quality and will revise the manuscript to address these points.

Referee: [Abstract] The central claim that 'Benchmark Agent can generate high-quality benchmark samples with minimal human involvement' rests on unspecified experiments; no quantitative metrics, inter-rater agreement scores, error rates, or sample statistics are reported, rendering the validity of the high-quality output assertion impossible to assess.

Authors: We acknowledge that the abstract does not include specific quantitative metrics or statistics. While the experiments section describes human evaluation, LLM-as-a-judge assessment, and consistency checks across the 15 benchmarks, we agree that key numbers (e.g., agreement scores, error rates, and dataset statistics) should be summarized upfront. In the revision we will update the abstract to report these quantitative findings and will add a dedicated results table or subsection with the requested statistics. revision: yes

Referee: [Experiments] (as described) The human evaluation and LLM-as-a-judge protocols are mentioned at a high level without specifying evaluator expertise across all 15 domains, the evaluation rubric, number of raters, or how subtle factual/reasoning errors were probed; this directly bears on whether the weakest assumption (reliable expert-level annotation without undetected bias) holds.

Authors: We agree that the current description of the evaluation protocols is insufficiently detailed. The revised manuscript will explicitly state the number of raters per benchmark, their domain expertise (including how experts were recruited for each of the 15 domains), the full evaluation rubric, and the procedures used to detect subtle factual or reasoning errors. We will also describe how the LLM-as-a-judge was validated against human judgments to address potential bias concerns. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems paper with no derivations or fitted quantities

The paper presents an engineering system (Benchmark Agent) for autonomous benchmark construction and reports empirical results from human/LLM evaluations on 15 generated benchmarks. No equations, parameters, uniqueness theorems, or derivation chains appear in the provided text. The central claim is an empirical demonstration of system performance rather than a mathematical reduction; evaluations are external to any self-referential input. This is a standard non-circular systems contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the effectiveness of the invented Benchmark Agent system and the assumption that LLMs can serve as competent annotators and judges.

axioms (1)

domain assumption: LLMs can perform reliable data annotation and quality control comparable to humans for benchmark construction.
The evaluation pipeline relies on LLM-as-a-judge assessment and agent-driven annotation without detailing error rates or human baselines.

invented entities (1)

Benchmark Agent (no independent evidence)
purpose: Autonomous orchestration of the full benchmark construction pipeline
The system is the primary new contribution introduced to solve the stated problems.

Reference graph

Works this paper leans on

64 extracted references · 33 canonical work pages · 17 internal anchors

[1]

Synthetic dialogue dataset generation using llm agents

Yelaman Abdullin, Diego Molla, Bahadorreza Ofoghi, John Yearwood, and Qingyang Li. Synthetic dialogue dataset generation using llm agents. InProceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), 2023

2023
[2]

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

System card: Claude opus 4 and claude sonnet 4

Anthropic. System card: Claude opus 4 and claude sonnet 4. https://www-cdn.anthropic. com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf, May 2025

2025
[4]

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, et al. Agent-x: Evaluating deep multimodal reasoning in vision-centric agentic tasks.arXiv preprint arXiv:2505.24876, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Benchagents: Multi-agent systems for structured benchmark creation.arXiv preprint arXiv:2410.22584, 2024

Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, and Vidhisha Balachan- dran. Benchagents: Multi-agent systems for structured benchmark creation.arXiv preprint arXiv:2410.22584, 2024

work page arXiv 2024
[8]

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Mllm-as-a-judge: Assessing multimodal llm-as-a- judge with vision-language benchmark

Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. Mllm-as-a-judge: Assessing multimodal llm-as-a- judge with vision-language benchmark. InICML, 2024

2024
[10]

Are we on the right way for evaluating large vision-language models?NeurIPS, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?NeurIPS, 2024

2024
[11]

Can large language models be an alternative to human evaluations?arXiv preprint arXiv:2305.01937, 2023

Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations?arXiv preprint arXiv:2305.01937, 2023. 10

work page arXiv 2023
[12]

CL-bench: A Benchmark for Context Learning.arXiv e-prints, art

Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, et al. Cl-bench: A benchmark for context learning. arXiv preprint arXiv:2602.03587, 2026

work page arXiv 2026
[13]

On path to multimodal generalist: General-level and general-bench

Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, Jiahao Meng, et al. On path to multimodal generalist: General-level and general-bench. InICML, 2025

2025
[14]

Gptscore: Evaluate as you desire

Jinlan Fu, See Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024

2024
[15]

Gemini 3 pro model card

Google DeepMind. Gemini 3 pro model card. https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf, December 2025

2025
[16]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[17]

Cerebellar output shapes cortical preparatory activity during motor adaptation.Nature Communications, 2025

Sharon Israely, Hugo Ninou, Ori Rajchert, Lee Elmaleh, Ran Harel, Firas Mawase, Jonathan Kadmon, and Yifat Prut. Cerebellar output shapes cortical preparatory activity during motor adaptation.Nature Communications, 2025

2025
[18]

Kimi K2: Open Agentic Intelligence

Team Kimi, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Dataflow: An llm-driven framework for unified data preparation and workflow automation in the era of data-centric ai.arXiv preprint arXiv:2512.16676, 2025

Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, et al. Dataflow: An llm-driven framework for unified data preparation and workflow automation in the era of data-centric ai.arXiv preprint arXiv:2512.16676, 2025

work page arXiv 2025
[20]

Act as human: Multimodal large language model data annotation with critical thinking.arXiv preprint arXiv:2511.09833, 2025

Lequan Lin, Dai Shi, Andi Han, Feng Chen, Qiuzheng Chen, Jiawen Li, Zhaoyang Li, Jiyuan Li, Zhenbang Sun, and Junbin Gao. Act as human: Multimodal large language model data annotation with critical thinking.arXiv preprint arXiv:2511.09833, 2025

work page arXiv 2025
[21]

Agentbench: Evaluating llms as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. InThe Twelfth International Conference on Learning Representations (ICLR), 2024

2024
[22]

Mmbench: Is your multi-modal model an all-around player? InECCV, 2024

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InECCV, 2024

2024
[23]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Arena learning: Build data flywheel for llms post-training via simulated chatbot arena.arXiv preprint arXiv:2407.10627, 2024

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Qingwei Lin, Jianguang Lou, Shifeng Chen, Yansong Tang, and Weizhu Chen. Arena learning: Build data flywheel for llms post-training via simulated chatbot arena.arXiv preprint arXiv:2407.10627, 2024

work page arXiv 2024
[25]

Unitcoder: Scalable iterative code synthesis with unit test guidance.arXiv preprint arXiv:2502.11460, 2025

Yichuan Ma, Yunfan Shao, Peiji Li, Demin Song, Qipeng Guo, Linyang Li, Xipeng Qiu, and Kai Chen. Unitcoder: Scalable iterative code synthesis with unit test guidance.arXiv preprint arXiv:2502.11460, 2025

work page arXiv 2025
[26]

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

Team MiroMind, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, et al. Mirothinker: Pushing the performance boundaries of open-source research agents via model, context, and interactive scaling.arXiv preprint arXiv:2511.11793, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Au- tonomous evaluation and refinement of digital agents.arXiv preprint arXiv:2404.06474, 2024

Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Au- tonomous evaluation and refinement of digital agents.arXiv preprint arXiv:2404.06474, 2024. 11

work page arXiv 2024
[28]

Benchmarkˆ 2: Systematic evaluation of llm benchmarks.arXiv preprint arXiv:2601.03986, 2026

Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, et al. Benchmarkˆ 2: Systematic evaluation of llm benchmarks.arXiv preprint arXiv:2601.03986, 2026

work page arXiv 2026
[29]

Autobench: Automatic testbench generation and evaluation using llms for hdl design

Ruidi Qiu, Grace Li Zhang, Rolf Drechsler, Ulf Schlichtmann, and Bing Li. Autobench: Automatic testbench generation and evaluation using llms for hdl design. InProceedings of the 2024 ACM/IEEE International Symposium on Machine Learning for CAD, 2024

2024
[30]

Qwen2 Technical Report

Team Qwen et al. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2(3), 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Qwen3.6-Plus: Towards real world agents, April 2026

Qwen Team. Qwen3.6-Plus: Towards real world agents, April 2026. URL https://qwen.ai/ blog?id=qwen3.6

2026
[32]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Neuronal dynamics of cerebellum and medial prefrontal cortex in adaptive motor timing.Nature Commu- nications, 2025

Zhong Ren, Xiaolu Wang, Milen Angelov, Chris I De Zeeuw, and Zhenyu Gao. Neuronal dynamics of cerebellum and medial prefrontal cortex in adaptive motor timing.Nature Commu- nications, 2025

2025
[34]

Tagal: Tabular data generation using agentic llm methods.arXiv preprint arXiv:2509.04152, 2025

Benoît Ronval, Pierre Dupont, and Siegfried Nijssen. Tagal: Tabular data generation using agentic llm methods.arXiv preprint arXiv:2509.04152, 2025

work page arXiv 2025
[35]

One-eval: An agentic system for automated and traceable llm evaluation.arXiv preprint arXiv:2603.09821, 2026

Chengyu Shen, Yanheng Hou, Minghui Pan, Runming He, Zhen Hao Wong, Meiyi Qiang, Zhou Liu, Hao Liang, Peichao Lai, Zeang Sheng, et al. One-eval: An agentic system for automated and traceable llm evaluation.arXiv preprint arXiv:2603.09821, 2026

work page arXiv 2026
[36]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InCVPR, 2019

2019
[38]

Learn-by- interact: A data-centric framework for self-adaptive agents in realistic environments.arXiv preprint arXiv:2501.10893, 2025

Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, and Sercan Ö Arık. Learn-by- interact: A data-centric framework for self-adaptive agents in realistic environments.arXiv preprint arXiv:2501.10893, 2025

work page arXiv 2025
[39]

Spacevista: All-scale visual spatial reasoning from mm to km.ICML, 2025

Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, et al. Spacevista: All-scale visual spatial reasoning from mm to km.ICML, 2025

2025
[40]

Roboos: A hierarchical embodied framework for cross-embodiment and multi-agent collaboration.arXiv preprint arXiv:2505.03673, 2025

Huajie Tan, Xiaoshuai Hao, Cheng Chi, Minglan Lin, Yaoxu Lyu, Mingyu Cao, Dong Liang, Zhuo Chen, Mengsi Lyu, Cheng Peng, et al. Roboos: A hierarchical embodied framework for cross-embodiment and multi-agent collaboration.arXiv preprint arXiv:2505.03673, 2025

work page arXiv 2025
[41]

Ai-researcher: Autonomous scientific innovation.arXiv preprint arXiv:2505.18705, 2025

Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. Ai-researcher: Autonomous scientific innovation.arXiv preprint arXiv:2505.18705, 2025

work page arXiv 2025
[42]

Qwen3.5: Accelerating productivity with native multimodal agents, February

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February
[43]

URLhttps://qwen.ai/blog?id=qwen3.5
[44]

Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration.NeurIPS, 2024

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration.NeurIPS, 2024

2024
[45]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.NeurIPS, 2024

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.NeurIPS, 2024

2024
[48]

Finevision: Open data is all you need,

Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, and Andrés Marafioti. Finevision: Open data is all you need,
[49]

URLhttps://arxiv.org/abs/2510.17269

work page internal anchor Pith review Pith/arXiv arXiv
[50]

Language prompt for autonomous driving

Dongming Wu, Wencheng Han, Yingfei Liu, Tiancai Wang, Cheng-zhong Xu, Xiangyu Zhang, and Jianbing Shen. Language prompt for autonomous driving. InAAAI, 2025

2025
[51]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

From Web to Pixels: Bringing Agentic Search into Visual Perception

Bokang Yang, Xinyi Sun, Kaituo Feng, Xingping Dong, Dongming Wu, and Xiangyu Yue. From web to pixels: Bringing agentic search into visual perception.arXiv preprint arXiv:2605.12497, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[53]

Swe-agent: Agent-computer interfaces enable automated software engineering.NeurIPS, 2024

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.NeurIPS, 2024

2024
[54]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InICLR, 2022

2022
[55]

Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark

Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. InACL, 2025

2025
[56]

Evaluation agent: Efficient and promptable evaluation framework for visual generative models

Fan Zhang, Shulin Tian, Ziqi Huang, Yu Qiao, and Ziwei Liu. Evaluation agent: Efficient and promptable evaluation framework for visual generative models. InACL, 2025

2025
[57] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In ECCV, 2024.
[58] Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? ICLR, 2025.
[59] Yue Zhao, Jin-Tao Wu, Jia-Bin Feng, Xin-Yu Cai, Xin-Tai Wang, Luxi Wang, Wei Xie, Yan Gu, Jun Liu, Wei Chen, et al. Dual and plasticity-dependent regulation of cerebello-zona incerta circuits on anxiety-like behaviors. Nature Communications, 2025.
[60] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. NeurIPS, 2023.
[61] Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, and Xing Xie. Dyval: Dynamic evaluation of large language models for reasoning tasks. In ICLR, 2024.
[62] Lianghui Zhu, Xinggang Wang, and Xinlong Wang. Judgelm: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631, 2023.
[63] Zeyu Zhu, Kevin Qinghong Lin, and Mike Zheng Shou. Paper2video: Automatic video generation from scientific papers. arXiv preprint arXiv:2510.05096, 2025.
[64] Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. Agent-as-a-judge: Evaluate agents with agents. ICML, 2024.

Appendix Contents

A Experiment Details
A.1 Benchmarks Generated from Benchmark Agent ...
