OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Boxian Ai; Jiahao Ying; Siyuan Liu; Wei Tang; Yixin Cao

Skill availability does not guarantee effective usage in LLM agents, and many popular skills fail to outperform base agents without skills.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-30 16:10 UTC pith:H5ICOP4S

Load-bearing objection: OpenSkillEval's main finding is that many popular skills add little or no value over base agents and that any gains depend heavily on the model and framework.

arXiv 2605.23657 v2 · pith:H5ICOP4S · submitted 2026-05-22 · cs.CL

Jiahao Ying, Boxian Ai, Wei Tang, Siyuan Liu, Yixin Cao

classification: cs.CL

keywords: LLM agents, skill evaluation, automatic benchmarking, open-source skills, agent frameworks, task generation, performance evaluation, downstream tasks

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OpenSkillEval, an automatic evaluation framework that builds realistic task instances from evolving real-world artifacts to test skills for LLM agents. It runs controlled comparisons across more than 600 dynamically generated tasks in five categories using 30 open-source skills and multiple models and frameworks. The evaluation establishes that skills are not always used effectively even when available, that any performance gains from skills vary sharply with the underlying model and agent framework, and that many publicly popular skills produce no consistent improvement over base agents. These patterns matter because developers and users need reliable ways to select and deploy skills under real cost and performance constraints rather than assuming availability equals value.

Core claim

OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 task instances and 30 open-source skills, the framework shows that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills.

What carries the argument

OpenSkillEval, an automatic evaluation framework that constructs dynamic task instances from real-world artifacts and enables unified comparisons of community skills.

Load-bearing premise

The automatically constructed task instances drawn from evolving real-world artifacts across the five categories serve as valid and representative proxies for practical downstream performance.

What would settle it

If direct tests on actual user-submitted tasks in the same five categories produce substantially different patterns of which skills improve performance and which do not, the central findings would not hold.

If this is right

Skill selection for agents should be tested against specific models and frameworks rather than based on availability or community popularity.
Agent frameworks require mechanisms to ensure available skills are actually applied effectively during task execution.
Dynamic, task-grounded evaluation is necessary to assess skill quality instead of relying on static benchmarks or reported usage.
Many community skills may require redesign or filtering to deliver consistent gains across different agent setups.
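
The first implication can be made concrete with a minimal Python sketch of a per-configuration comparison. This is not the paper's evaluation code: the record layout, the model, framework, and skill names, and the scores are all hypothetical placeholders. The sketch pairs each skill-augmented run with the base run on the same task and averages the difference for every (model, framework, skill) triple.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (model, framework, skill, task_id, score).
# skill=None marks the base agent run without any skill.
RUNS = [
    ("model-a", "framework-x", None, "t1", 3.0),
    ("model-a", "framework-x", None, "t2", 4.0),
    ("model-a", "framework-x", "skill-1", "t1", 4.0),
    ("model-a", "framework-x", "skill-1", "t2", 4.5),
    ("model-b", "framework-x", None, "t1", 4.0),
    ("model-b", "framework-x", None, "t2", 4.0),
    ("model-b", "framework-x", "skill-1", "t1", 3.5),
    ("model-b", "framework-x", "skill-1", "t2", 4.0),
]


def skill_deltas(runs):
    """Mean per-task score difference (skill minus base) for each
    (model, framework, skill) configuration, paired on task_id."""
    base = {}
    skilled = defaultdict(dict)
    for model, framework, skill, task, score in runs:
        if skill is None:
            base[(model, framework, task)] = score
        else:
            skilled[(model, framework, skill)][task] = score

    deltas = {}
    for (model, framework, skill), by_task in skilled.items():
        # Only tasks that also have a base run can be paired.
        diffs = [
            score - base[(model, framework, task)]
            for task, score in by_task.items()
            if (model, framework, task) in base
        ]
        if diffs:
            deltas[(model, framework, skill)] = mean(diffs)
    return deltas


DELTAS = skill_deltas(RUNS)
```

On this toy data the same skill raises the mean score for one model and lowers it for the other, so a skill would be chosen per configuration rather than globally. A real audit would substitute the benchmark's own per-task scores.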

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent frameworks could incorporate automatic skill validation steps to filter out ineffective additions before deployment.
Extending the evaluation to track token costs alongside performance would clarify practical trade-offs for users selecting skills.
The results suggest that open skill repositories might benefit from standardized usage testing before listing skills as recommended.
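
One way to read the cost extension above is as a Pareto comparison between quality and token spend. The sketch below is an illustration under stated assumptions, not something the paper or Pith implements: the option names, scores, and token counts are invented, and "cost" is reduced to a single token total per option.

```python
def pareto_front(options):
    """Return the names of options that no other option dominates.

    Each option is (name, score, tokens). Option A dominates option B when
    A scores at least as high and uses at most as many tokens, and is
    strictly better on at least one of the two.
    """
    front = []
    for name, score, tokens in options:
        dominated = any(
            other_score >= score
            and other_tokens <= tokens
            and (other_score > score or other_tokens < tokens)
            for other_name, other_score, other_tokens in options
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical mean quality score and mean token usage per option.
OPTIONS = [
    ("base", 3.6, 12000),
    ("skill-1", 4.1, 30000),
    ("skill-2", 3.5, 25000),
    ("skill-3", 4.1, 45000),
]

FRONT = pareto_front(OPTIONS)
```

Here `skill-2` is dominated by the base agent (lower score, more tokens) and `skill-3` by `skill-1` (same score, more tokens), so only the base agent and `skill-1` remain as defensible choices.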

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

OpenSkillEval's main finding is that many popular skills add little or no value over base agents and that any gains depend heavily on the model and framework.

The paper's central result is that skill availability does not guarantee better performance. Across more than 600 tasks and 30 open skills, many community favorites do not beat the unaugmented agent, and improvements vary sharply by base model and agent setup.

What is new is the OpenSkillEval framework for building dynamic tasks directly from real-world artifacts in five categories: presentation generation, front-end design, posters, data visualization, and reports. It then runs controlled comparisons under unified conditions instead of relying on fixed benchmarks. That scale and the automatic construction approach are a clear step beyond most prior agent evaluations.

The work does a solid job surfacing the model and framework interactions and documenting cases where popular skills underperform. The negative findings on skill effectiveness are the part that stands out as useful for practitioners.

The soft spot is the task generation process. The claims rest on these auto-constructed instances serving as valid proxies, yet the paper gives limited detail on generation rules, human validation, or correlation with actual deployment outcomes. If the distillation step favors certain structures or omits edge cases, the observed non-monotonic effects could partly reflect the benchmark rather than general properties of the skill ecosystem.

This is for researchers and engineers who select or design skills for LLM agents and want evidence-based guidance on when they help. The questions are practical and the evaluation is large enough to warrant a serious referee, even with the need for tighter validation of the tasks.

Referee Report

3 major / 2 minor

Summary. The paper introduces OpenSkillEval, an automatic evaluation framework for auditing open-source skills for LLM agents. It dynamically constructs over 600 task instances from evolving real-world artifacts across five categories (presentation generation, front-end web design, poster generation, data visualization, report generation), collects 30 community skills, and evaluates interactions with state-of-the-art models and agent frameworks. The central empirical claims are that skill availability does not guarantee effective usage, that augmentation benefits depend strongly on the underlying model and framework, and that many popular skills fail to outperform base agents without skills.

Significance. If the generated tasks are valid proxies, the work offers timely empirical evidence on the open skill ecosystem, showing that static assumptions about skill benefits are unreliable and underscoring the value of dynamic, task-grounded evaluation over fixed benchmarks. The release of benchmark resources via the project website is a positive contribution that supports reproducibility and follow-on work.

major comments (3)

[Task construction / evaluation methodology] Task construction section: the manuscript provides no human validation, inter-rater agreement metrics, or correlation analysis with real deployment outcomes for the automatically distilled task instances. This is load-bearing because all headline claims (non-guaranteed usage, model/framework dependence, popular skills underperforming) rest entirely on performance deltas measured on these >600 instances.
[Results / analysis] Results and analysis sections: no description of statistical controls, significance testing, or error analysis is given for the reported performance differences across models, frameworks, and skills. Without these, the claim that many publicly popular skills "do not consistently outperform base agents" cannot be rigorously supported.
[Skill collection and task generation] Skill selection and task generation rules: the criteria for choosing the 30 skills and the precise distillation process from real-world artifacts are not detailed, leaving open the possibility that observed interactions are artifacts of the benchmark construction rather than intrinsic properties of the ecosystem.
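
For the inter-rater agreement requested in the first major comment, one common statistic is Cohen's kappa. The sketch below assumes a setup the manuscript does not describe: two hypothetical annotators each label a sample of generated task instances as "valid" or "invalid". The labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical validity labels for ten sampled task instances.
RATER_A = ["valid", "valid", "valid", "invalid", "invalid",
           "valid", "invalid", "valid", "valid", "invalid"]
RATER_B = ["valid", "valid", "invalid", "invalid", "invalid",
           "valid", "valid", "valid", "valid", "invalid"]

KAPPA = cohens_kappa(RATER_A, RATER_B)
```

The raters agree on 8 of 10 items, but kappa comes out lower (about 0.58) because part of that agreement is expected by chance given how often each rater says "valid".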

minor comments (2)

[Abstract] The abstract would benefit from briefly naming the specific models and frameworks evaluated to convey experimental scope.
[Figures] Figures comparing skill-augmented vs. base performance should include error bars or confidence intervals where applicable for clearer interpretation of deltas.
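
The confidence intervals suggested for the figures could be obtained with a percentile bootstrap over per-task deltas. This is a minimal sketch, assuming paired per-task score differences (skill minus base) are available; the numbers, the resample count, and the fixed seed are illustrative choices, not values from the paper.

```python
import random
from statistics import mean


def bootstrap_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of paired deltas."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(deltas)
    # Resample tasks with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    lower_idx = int(n_resamples * alpha / 2)
    upper_idx = n_resamples - 1 - lower_idx
    return means[lower_idx], means[upper_idx]


# Hypothetical per-task deltas for one skill under one model and framework.
DELTAS = [0.5, -0.5, 1.0, 0.0, 0.5, 1.5, -1.0, 0.5]

LOW, HIGH = bootstrap_ci(DELTAS)
```

If the resulting interval straddles zero, the plotted gain for that skill should not be read as a reliable improvement over the base agent.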

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and describe the revisions we will make to strengthen the paper.

Referee: [Task construction / evaluation methodology] Task construction section: the manuscript provides no human validation, inter-rater agreement metrics, or correlation analysis with real deployment outcomes for the automatically distilled task instances. This is load-bearing because all headline claims (non-guaranteed usage, model/framework dependence, popular skills underperforming) rest entirely on performance deltas measured on these >600 instances.

Authors: We acknowledge that the original manuscript does not report human validation, inter-rater agreement, or correlation with real deployment outcomes. The framework was intentionally designed to be fully automatic to scale with evolving artifacts. In the revision we will add a dedicated subsection on task construction that describes the automated quality checks performed during distillation and will include an explicit limitations paragraph discussing the absence of human validation and real-world outcome correlation. This will better contextualize the empirical claims without altering the automatic nature of the benchmark. revision: yes
Referee: [Results / analysis] Results and analysis sections: no description of statistical controls, significance testing, or error analysis is given for the reported performance differences across models, frameworks, and skills. Without these, the claim that many publicly popular skills "do not consistently outperform base agents" cannot be rigorously supported.

Authors: We agree that the absence of statistical controls and significance testing weakens the presentation of the results. In the revised manuscript we will augment the Results section with paired statistical tests (e.g., Wilcoxon signed-rank tests) between skill-augmented and base-agent conditions, report standard errors or confidence intervals, and add a short error-analysis subsection that examines representative failure cases across models and frameworks. revision: yes
Referee: [Skill collection and task generation] Skill selection and task generation rules: the criteria for choosing the 30 skills and the precise distillation process from real-world artifacts are not detailed, leaving open the possibility that observed interactions are artifacts of the benchmark construction rather than intrinsic properties of the ecosystem.

Authors: We will expand the Skill Collection and Task Generation sections to specify the exact selection criteria for the 30 skills (GitHub popularity thresholds, domain relevance, and community usage signals) and to provide a step-by-step description of the distillation pipeline, including concrete examples of how real-world artifacts are transformed into the >600 task instances. revision: yes
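
The paired test named in the second response can be sketched in pure Python. This is a hand-rolled, exact two-sided Wilcoxon signed-rank test for small samples, written for illustration only; it is not the authors' analysis code, and the scores are hypothetical. It drops zero differences and gives tied absolute differences their average rank.

```python
def wilcoxon_signed_rank(skill_scores, base_scores):
    """Exact two-sided Wilcoxon signed-rank test on paired scores.

    Returns (w_plus, p_value), where w_plus is the sum of ranks of the
    positive differences. Ties are detected by exact equality, so scores
    should be on a grid (for example multiples of 0.5) to avoid float noise.
    """
    diffs = [s - b for s, b in zip(skill_scores, base_scores) if s != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Average ranks of |diff|, stored doubled so they remain integers.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # twice the average of ranks i+1..j+1
        i = j + 1

    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)

    # Null distribution: each rank is signed + or - with probability 1/2.
    counts = {0: 1}
    for r in ranks2:
        updated = {}
        for s, c in counts.items():
            updated[s] = updated.get(s, 0) + c
            updated[s + r] = updated.get(s + r, 0) + c
        counts = updated

    # Two-sided p-value: outcomes at least as far from the centre as observed.
    deviation = abs(2 * w_plus2 - total2)
    extreme = sum(c for s, c in counts.items() if abs(2 * s - total2) >= deviation)
    return w_plus2 / 2, extreme / 2 ** n


# Hypothetical paired scores on six tasks (1-5 scale).
BASE = [3.0, 3.0, 2.0, 2.5, 1.5, 1.0]
SKILL = [3.5, 4.0, 3.5, 4.5, 4.0, 4.0]

W_PLUS, P_VALUE = wilcoxon_signed_rank(SKILL, BASE)
```

The exact enumeration is only practical for modest sample sizes per configuration; with hundreds of tasks a library implementation with a normal approximation would be the more usual choice.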

Circularity Check

0 steps flagged

No significant circularity; evaluation is externally grounded

The paper's central claims derive from direct empirical measurements of agent performance on >600 task instances constructed from external real-world artifacts and 30 community-contributed skills. No equations, fitted parameters, or predictions are defined in terms of the target results; the task-generation process and skill collection are independent of the reported deltas. No self-citation chains, uniqueness theorems, or ansatzes are invoked to force the conclusions. The derivation chain remains self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the contribution is an empirical evaluation framework built on standard LLM agent components.

pith-pipeline@v0.9.1-grok · 5814 in / 1073 out tokens · 28885 ms · 2026-06-30T16:10:37.297214+00:00 · methodology

Original abstract

Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost-performance trade-offs. In this paper, we present OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It further collects and organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval-Web/.

Figures

Figures reproduced from arXiv: 2605.23657 by Boxian Ai, Jiahao Ying, Siyuan Liu, Wei Tang, Yixin Cao.

**Figure 3.** Trajectory-level analysis of how different agent access and follow provided skills.

**Figure 4.** Token usage across agents and tasks. Mean completion tokens (left) and uncached input ...

**Figure 5.** Skill performance versus cost across tasks and agent systems. Each subplot corresponds to ...

**Figure 6.** Impact of skills on stylistic diversity relative to the ...

**Figure 7.** Comparison of skill-augmented and no-skills settings on reasoning intensive tasks.

**Figure 8.** Web-based interface for human evaluation of generated task instances.

**Figure 9.** Artifact inspection interface used in human evaluation. The system provides task-specific ...

**Figure 10.** Skill performance versus cost across tasks and agent systems. Each subplot corresponds ...

**Figure 11.** Impact of web design skills on stylistic diversity relative to the ...

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Skill Coverage: A Test Adequacy Metric for Agent Skills
cs.AI · 2026-06 · unverdicted · novelty 6.0

Skill coverage is a binary test adequacy metric that extracts observable behavior constraints from skill documents and judges whether trajectories provide sufficient evidence to cover each constraint, revealing 39.90-...

Skill Coverage: A Test Adequacy Metric for Agent Skills
cs.AI · 2026-06 · conditional · novelty 6.0

Skill coverage measures which natural-language skill constraints an LLM agent trajectory exercises and passes, revealing low coverage on SkillsBench and enabling a 16% recovery of failed tasks via targeted skill emphasis.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1]

GPT-5.4 thinking system card

OpenAI. GPT-5.4 thinking system card. Technical report, OpenAI, March 2026. URL https: //deploymentsafety.openai.com/gpt-5-4-thinking/gpt-5-4-thinking.pdf

work page 2026
[2]

System card: Claude Opus 4.6

Anthropic. System card: Claude Opus 4.6. Technical report, Anthropic, February 2026. URL https://www-cdn.anthropic.com/14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5. pdf

work page 2026
[3]

Claude code by anthropic | ai coding agent, terminal, ide

Anthropic. Claude code by anthropic | ai coding agent, terminal, ide. https://www. anthropic.com/claude-code, 2025

work page 2025
[4]

Codex by openai | ai coding agent.https://openai.com/codex/, 2025

OpenAI. Codex by openai | ai coding agent.https://openai.com/codex/, 2025

work page 2025
[5]

Equipping agents for the real world with agent skills

Anthropic. Equipping agents for the real world with agent skills. https://www.anthropic. com/engineering/equipping-agents-for-the-real-world-with-agent-skills , 2025

work page 2025
[6]

Harbor: A framework for evaluating and optimizing agents and models in container environments, January 2026

Harbor Framework Team. Harbor: A framework for evaluating and optimizing agents and models in container environments, January 2026. URL https://github.com/ harbor-framework/harbor. 14

work page 2026
[7]

Pptagent: Generating and evaluating presentations beyond text-to-slides

Hao Zheng, Xinyan Guan, Hao Kong, Wenkai Zhang, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. Pptagent: Generating and evaluating presentations beyond text-to-slides. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14413–14429, 2025

work page 2025
[8]

Frabench and ufeval: Unified fine-grained evaluation with task and aspect generalization,

Shibo Hong, Jiahao Ying, Haiyuan Liang, Mengdi Zhang, Jun Kuang, Jiazheng Zhang, and Yixin Cao. Frabench and ufeval: Unified fine-grained evaluation with task and aspect generalization,

work page
[9]

URLhttps://arxiv.org/abs/2505.12795

work page arXiv
[10]

Webarena: A realistic web environment for build- ing autonomous agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for build- ing autonomous agents. InThe Twelfth International Conference on Learning Representations, 2023

work page 2023
[11]

GPT-5.3-Codex system card

OpenAI. GPT-5.3-Codex system card. https://openai.com/index/ gpt-5-3-codex-system-card/, 2026

work page 2026
[12]

Gemini CLI

Google. Gemini CLI. https://github.com/google-gemini/gemini-cli, 2025. Ac- cessed: 2026-05-02

work page 2025
[13]

Gemini 3.1 pro model card

Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/ model-cards/gemini-3-1-pro/, 2026

work page 2026
[14]

Kimi code CLI.https://github.com/MoonshotAI/kimi-cli, 2025

Moonshot AI. Kimi code CLI.https://github.com/MoonshotAI/kimi-cli, 2025

work page 2025
[15]

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Ha...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

MiniMax M2.7: Early echoes of self-evolution

MiniMax. MiniMax M2.7: Early echoes of self-evolution. https://www.minimax.io/news/ minimax-m27-en, 2026

work page 2026
[17]

Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

work page 2026
[18]

GLM-5: from Vibe Coding to Agentic Engineering

Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Intuitive or dependent? investigating LLMs’ behavior style to conflicting prompts

Jiahao Ying, Yixin Cao, Kai Xiong, Long Cui, Yidong He, and Yongbin Liu. Intuitive or dependent? investigating LLMs’ behavior style to conflicting prompts. In Lun-Wei Ku, 15 Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4221–4246, Bangkok, T...

work page 2024
[20]

Why claude code skills don’t activate and how to fix it, 2026

Ivan Seleznov. Why claude code skills don’t activate and how to fix it, 2026. Medium blog post

work page 2026
[21]

OckBench: Measuring the Efficiency of LLM Reasoning

Zheng Du, Hao Kang, Song Han, Tushar Krishna, and Ligeng Zhu. Ockbench: Measuring the efficiency of llm reasoning.arXiv preprint arXiv:2511.05722, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

arXiv preprint arXiv:2404.01292 (2024)

Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Abhinav Shrivastava, and Tom Goldstein. Measuring style similarity in diffusion models.arXiv preprint arXiv:2404.01292, 2024

work page arXiv 2024
[23]

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward, 2026. URLhttps://arxiv.org/abs/2602.12430

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems, 2026. URL https://arxiv.org/abs/ 2603.02766

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

AutoSkill: Experience-driven lifelong learning via skill self-evolution.arXiv preprint arXiv:2603.01145,

Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, Bo Zhang, and Liang He. Autoskill: Experience-driven lifelong learning via skill self-evolution, 2026. URLhttps://arxiv.org/abs/2603.01145

work page arXiv 2026
[26]

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

Boyuan Zheng, Michael Y . Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. Skillweaver: Web agents can self-improve by discovering and honing skills, 2025. URL https://arxiv.org/ abs/2504.07079

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, X...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

PinchBench: Real-world benchmarks for AI coding agents

PinchBench Contributors. PinchBench: Real-world benchmarks for AI coding agents. https: //github.com/pinchbench/skill, 2026. GitHub repository

work page 2026
[29]

Wildclawbench, 2026

Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Jingyi Yang, Penghui Yang, Zhixiong Zhang, Xilin Wei, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, and Yuhang Zang. Wildclawbench, 2026. URL https://github.com/InternLM/ WildClawBench

work page 2026
[30]

Swe-bench: Can language models resolve real-world github issues? 2023

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? 2023

work page 2023
[31]

Agentbench: Evaluating llms as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. InThe Twelfth International Conference on Learning Representations, 2023

work page 2023
[32]

Toward generalizable evaluation in the llm era: A survey beyond benchmarks

Yixin Cao, Shibo Hong, Xinze Li, Jiahao Ying, Yubo Ma, Haiyuan Liang, Yantao Liu, Zijun Yao, Xiaozhi Wang, Dan Huang, Wenxuan Zhang, Lifu Huang, Muhao Chen, Lei Hou, Qianru Sun, Xingjun Ma, Zuxuan Wu, Min-Yen Kan, David Lo, Qi Zhang, Heng Ji, Jing Jiang, Juanzi Li, Aixin Sun, Xuanjing Huang, Tat-Seng Chua, and Yu-Gang Jiang. Toward generalizable evaluatio...

work page arXiv 2025
[33]

Automating dataset updates towards reliable and timely evaluation of large language models

Jiahao Ying, Yixin Cao, Yushi Bai, Qianru Sun, Bo Wang, Wei Tang, Zhaojun Ding, Yizhe Yang, Xuanjing Huang, and Shuicheng Yan. Automating dataset updates towards reliable and timely evaluation of large language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Sy...

work page
[34]

URL https://proceedings.neurips.cc/paper_ files/paper/2024/file/1e89c12621c0315373f20f0aeabe5dbe-Paper-Datasets_ and_Benchmarks_Track.pdf

doi: 10.52202/079017-0544. URL https://proceedings.neurips.cc/paper_ files/paper/2024/file/1e89c12621c0315373f20f0aeabe5dbe-Paper-Datasets_ and_Benchmarks_Track.pdf

work page doi:10.52202/079017-0544 2024
[35]

EvoWiki: Evaluating LLMs on evolving knowledge

Wei Tang, Yixin Cao, Yang Deng, Jiahao Ying, Bo Wang, Yizhe Yang, Yuyue Zhao, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, and Yong Liao. EvoWiki: Evaluating LLMs on evolving knowledge. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics...

work page doi:10.18653/v1/2025.acl-long.47 2025
[36]

Livebench: A challenging, contamination-free LLM benchmark

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Venkat Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination-free LLM benchmark. InThe...

work page 2025
[37]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024. URL https://arxiv.org/abs/ 2403.07974. 17 A Technical Appendices and Supplementary Material A.1 Experimental Environment We...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

application

Data Visualization { // Meta -- required "application": "data-visualization", "case_id": "case-climate-trends", "language": "en", 19 // Style -- optional (omit to test agent autonomy) "style": { "theme": "scientific", "audience": "researchers and policy makers", "tone": "clean, publication-ready" }, // Goal -- required (one insight per case; chart_type ch...

work page
[39]

application

Poster Generation { // Meta -- required "application": "poster-generation", "case_id": "case-01-data-report", "language": "en", // Poster constraints -- optional "poster": { "aspect_ratio": "landscape",// landscape | portrait | square | A0-landscape | ... "audience": "data-report", "tone": "data-forward, professional", }, // Content brief -- optional "bri...

work page 2025
[40]

application

Presentation Generation { // Meta -- required "application": "ppt-generation", "case_id": "case-01-internal-review", "language": "en", // Deck constraints -- optional "deck": { 20 "aspect_ratio": "16:9",// default 16:9 "slide_count": 6,// omit to let agent decide "audience": "internal product review", "tone": "professional, concise" }, // Content brief --...

work page
[41]

application

Report Generation { // Meta -- required "application": "report-generation", "case_id": "case-01-sales-analysis", "language": "en", // Report constraints -- optional "report": { "type": "sales-report", "audience": "management", "tone": "professional, data-forward" }, // Content brief -- optional "brief": { "title": "2024 Q4 Sales Performance Report", "one_...

work page 2024
[42]

application

Web Design { // Meta -- required "application": "web-design", "case_id": "case-01-landing-page", "language": "en", // Site constraints -- optional "site": { "type": "landing-page", "page_count": 2,// omit to let agent decide "audience": "developers and technical decision-makers", "tone": "modern, professional, bold", "responsive": true,// default true "da...

work page
[43]

expressed

Data Visualization Insight Expression single image Evaluate the **insight expression** of this data visualization. The visualization was created to convey a specific insight: **Goal insight**: {insight} Criteria: - Does the chosen visualization type effectively communicate this insight? - Can the reader **actually** understand the key message at a glance,...

work page
[44]

score": <1-5>,

Poster Generation Design single image Evaluate the **visual design quality** of this poster/infographic. 27 Criteria: - Color scheme: harmonious palette, appropriate for the topic and tone - Layout: clean alignment, proper spacing, clear visual hierarchy - Typography: readable fonts, clear size hierarchy (title > heading > body) - Consistency: unified sty...

work page
[45]

score": <1-5>,

PPT Generation Content per-slide image Evaluate the **content quality** of this presentation slide. Judge how effectively this slide delivers its key message to the reader. Criteria: - Key message: does the slide have a clear takeaway that the reader can grasp? - Information density: appropriate amount of content (not too crowded, not too sparse) - Clarit...

Report Generation: Content Quality (report text only)

Evaluate the **content quality** of this report across two aspects: writing quality AND analysis depth.
A. Writing & Structure:
- Organization: clear headings, logical flow, well-structured executive summary
- Clarity: well-written, grammatically correct, easy to understand
- Information density: appropri...
"score": <1-5>,

Web Design: Visual Design (per-page multi-image; full + crops)

Evaluate the **visual design execution quality** of this web page.
Criteria:
- Color & typography: harmonious palette, readable fonts, clear heading hierarchy (h1 > h2 > body), consistent font sizing
- Layout & structure: well-organized sections, clear information hierarchy, consistent grid alig...
"score": <1-5>,
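The judge prompts above ask for a `"score"` field on a 1-5 scale, and some are applied per slide or per page. The sketch below shows one way such replies could be parsed and combined. The function name `parse_score`, the sample replies, and the `reason` field are hypothetical, and averaging per-slide scores is an assumption rather than the aggregation rule stated in the paper.

```python
import re
from statistics import mean


def parse_score(reply: str) -> int:
    """Extract an integer "score" field from a judge reply and check its range."""
    match = re.search(r'"score"\s*:\s*(\d+)', reply)
    if match is None:
        raise ValueError("no score field found in judge reply")
    score = int(match.group(1))
    if not 1 <= score <= 5:
        raise ValueError(f"score {score} is outside the 1-5 scale")
    return score


# Hypothetical judge replies for three slides of one deck.
replies = [
    '{"score": 4, "reason": "clear takeaway"}',
    'Here is my rating: {"score": 3, "reason": "too dense"}',
    '{"score": 5, "reason": "concise"}',
]

scores = [parse_score(r) for r in replies]
# Averaging is an assumption made for illustration only.
deck_score = mean(scores)
print(scores, deck_score)
```

Rejecting out-of-range or missing scores, instead of silently clamping them, makes malformed judge output visible before it reaches any aggregate.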

[1]

OpenAI. GPT-5.4 thinking system card. Technical report, OpenAI, March 2026. URL https://deploymentsafety.openai.com/gpt-5-4-thinking/gpt-5-4-thinking.pdf

[2]

Anthropic. System card: Claude Opus 4.6. Technical report, Anthropic, February 2026. URL https://www-cdn.anthropic.com/14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5.pdf

[3]

Anthropic. Claude code by anthropic | ai coding agent, terminal, ide. https://www.anthropic.com/claude-code, 2025

[4]

OpenAI. Codex by openai | ai coding agent. https://openai.com/codex/, 2025

[5]

Anthropic. Equipping agents for the real world with agent skills. https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills, 2025

[6]

Harbor Framework Team. Harbor: A framework for evaluating and optimizing agents and models in container environments, January 2026. URL https://github.com/harbor-framework/harbor.

[7]

Hao Zheng, Xinyan Guan, Hao Kong, Wenkai Zhang, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. Pptagent: Generating and evaluating presentations beyond text-to-slides. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14413–14429, 2025

[8]

Shibo Hong, Jiahao Ying, Haiyuan Liang, Mengdi Zhang, Jun Kuang, Jiazheng Zhang, and Yixin Cao. Frabench and ufeval: Unified fine-grained evaluation with task and aspect generalization. URL https://arxiv.org/abs/2505.12795

[10]

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2023

[11]

OpenAI. GPT-5.3-Codex system card. https://openai.com/index/gpt-5-3-codex-system-card/, 2026

[12]

Google. Gemini CLI. https://github.com/google-gemini/gemini-cli, 2025. Accessed: 2026-05-02

[13]

Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, 2026

[14]

Moonshot AI. Kimi code CLI. https://github.com/MoonshotAI/kimi-cli, 2025

[15]

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Ha...

[16]

MiniMax. MiniMax M2.7: Early echoes of self-evolution. https://www.minimax.io/news/minimax-m27-en, 2026

[17]

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

[18]

Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026

[19]

Jiahao Ying, Yixin Cao, Kai Xiong, Long Cui, Yidong He, and Yongbin Liu. Intuitive or dependent? investigating LLMs’ behavior style to conflicting prompts. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4221–4246, Bangkok, T...

[20]

Ivan Seleznov. Why claude code skills don’t activate and how to fix it, 2026. Medium blog post

[21]

Zheng Du, Hao Kang, Song Han, Tushar Krishna, and Ligeng Zhu. Ockbench: Measuring the efficiency of llm reasoning. arXiv preprint arXiv:2511.05722, 2025

[22]

Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Abhinav Shrivastava, and Tom Goldstein. Measuring style similarity in diffusion models. arXiv preprint arXiv:2404.01292, 2024

[23]

Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward, 2026. URL https://arxiv.org/abs/2602.12430

[24]

Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems, 2026. URL https://arxiv.org/abs/2603.02766

[25]

Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, Bo Zhang, and Liang He. Autoskill: Experience-driven lifelong learning via skill self-evolution, 2026. URL https://arxiv.org/abs/2603.01145

[26]

Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. Skillweaver: Web agents can self-improve by discovering and honing skills, 2025. URL https://arxiv.org/abs/2504.07079

[27]

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, X...

[28]

PinchBench Contributors. PinchBench: Real-world benchmarks for AI coding agents. https://github.com/pinchbench/skill, 2026. GitHub repository

[29]

Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Jingyi Yang, Penghui Yang, Zhixiong Zhang, Xilin Wei, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, and Yuhang Zang. Wildclawbench, 2026. URL https://github.com/InternLM/WildClawBench

[30]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? 2023

[31]

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. In The Twelfth International Conference on Learning Representations, 2023

[32]

Yixin Cao, Shibo Hong, Xinze Li, Jiahao Ying, Yubo Ma, Haiyuan Liang, Yantao Liu, Zijun Yao, Xiaozhi Wang, Dan Huang, Wenxuan Zhang, Lifu Huang, Muhao Chen, Lei Hou, Qianru Sun, Xingjun Ma, Zuxuan Wu, Min-Yen Kan, David Lo, Qi Zhang, Heng Ji, Jing Jiang, Juanzi Li, Aixin Sun, Xuanjing Huang, Tat-Seng Chua, and Yu-Gang Jiang. Toward generalizable evaluatio...

[33]

Jiahao Ying, Yixin Cao, Yushi Bai, Qianru Sun, Bo Wang, Wei Tang, Zhaojun Ding, Yizhe Yang, Xuanjing Huang, and Shuicheng Yan. Automating dataset updates towards reliable and timely evaluation of large language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Sy...

doi: 10.52202/079017-0544. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/1e89c12621c0315373f20f0aeabe5dbe-Paper-Datasets_and_Benchmarks_Track.pdf

[35]

Wei Tang, Yixin Cao, Yang Deng, Jiahao Ying, Bo Wang, Yizhe Yang, Yuyue Zhao, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, and Yong Liao. EvoWiki: Evaluating LLMs on evolving knowledge. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics...

doi: 10.18653/v1/2025.acl-long.47

[36]

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Venkat Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination-free LLM benchmark. In The...

[37]

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024. URL https://arxiv.org/abs/2403.07974.

A Technical Appendices and Supplementary Material

A.1 Experimental Environment

We...

Data Visualization

{
  // Meta -- required
  "application": "data-visualization",
  "case_id": "case-climate-trends",
  "language": "en",
  // Style -- optional (omit to test agent autonomy)
  "style": {
    "theme": "scientific",
    "audience": "researchers and policy makers",
    "tone": "clean, publication-ready"
  },
  // Goal -- required (one insight per case; chart_type ch...

Poster Generation

{
  // Meta -- required
  "application": "poster-generation",
  "case_id": "case-01-data-report",
  "language": "en",
  // Poster constraints -- optional
  "poster": {
    "aspect_ratio": "landscape", // landscape | portrait | square | A0-landscape | ...
    "audience": "data-report",
    "tone": "data-forward, professional",
  },
  // Content brief -- optional
  "bri...
