pith. sign in

arxiv: 2605.23657 · v1 · pith:H5ICOP4Snew · submitted 2026-05-22 · 💻 cs.CL

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Pith reviewed 2026-05-25 04:15 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM agentsskill evaluationopen-source skillsagent frameworksdynamic task generationbenchmarkingskill augmentation
0
0 comments X

The pith

Many publicly popular skills for LLM agents do not consistently outperform base agents without skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OpenSkillEval as a framework that automatically builds realistic task instances from live real-world artifacts in five application areas to test how skills interact with different LLM models and agent setups. It evaluates over 600 tasks against 30 open-source skills and finds that simply having skills available does not ensure they get used effectively. The benefit of adding skills turns out to depend heavily on which model and framework is underneath, and many popular skills add little or no advantage over running the base agent alone. This matters for anyone trying to pick or build skills because the open ecosystem is expanding without clear signals on which ones actually help in practice. The work pushes for ongoing, task-grounded testing instead of relying on static benchmarks.

Core claim

OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications and collects community-contributed skills for controlled comparison under unified settings. Using more than 600 dynamically generated task instances and 30 open-source skills, the evaluation shows that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills.

What carries the argument

OpenSkillEval, an automatic evaluation framework that dynamically generates task instances from real-world artifacts for side-by-side testing of skill-augmented LLM agent systems.

If this is right

  • Skill selection must be done per model and per agent framework rather than treating skills as plug-and-play improvements.
  • Skill authors should test their instructions across multiple base models instead of a single one.
  • Agent frameworks need better mechanisms to decide when to invoke a skill versus running without one.
  • The open skill ecosystem would benefit from continuous re-evaluation as models and artifacts evolve.
  • Base agents without skills can remain competitive choices when cost or reliability is prioritized.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Platforms hosting skills could add automated quality checks that rerun evaluations whenever new models appear.
  • The same generation approach could be applied to other agent domains such as code editing or scientific workflows to test generality.
  • If skills are to be treated as modular components, the field may need interface standards that reduce model-specific tuning.
  • The observed variance suggests that some skills might be better reframed as lightweight prompt templates rather than full workflows.

Load-bearing premise

The dynamically generated task instances drawn from real-world artifacts are realistic enough proxies for actual downstream user tasks.

What would settle it

A controlled study in which real users complete the same categories of tasks with and without the evaluated skills and report measurably higher success rates or lower effort for the popular skills would falsify the claim that many skills fail to outperform base agents.

Figures

Figures reproduced from arXiv: 2605.23657 by Boxian Ai, Jiahao Ying, Siyuan Liu, Wei Tang, Yixin Cao.

Figure 1
Figure 1. Figure 1: Overview of the OpenSkillEval framework. The framework supports automatic test case [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Trajectory-level analysis of how different agent access and follow provided skills. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Token usage across agents and tasks. Mean completion tokens (left) and uncached input [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Skill performance versus cost across tasks and agent systems. Each subplot corresponds to [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Impact of skills on stylistic diversity relative to the [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of skill-augmented and no-skills settings on reasoning intensive tasks. Cost Analysis. Beyond their impact on artifact quality, we further analyze the cost implications of skill augmentation. As shown in [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Web-based interface for human evaluation of generated task instances. [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Artifact inspection interface used in human evaluation. The system provides task-specific [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Skill performance versus cost across tasks and agent systems. Each subplot corresponds [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Impact of web design skills on stylistic diversity relative to the [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗
read the original abstract

Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost-performance trade-offs. In this paper, we present \textsc{OpenSkillEval}, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, \textsc{OpenSkillEval} automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It further collects and organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval-Web/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces OpenSkillEval, an automatic evaluation framework that dynamically constructs over 600 task instances from real-world artifacts across five categories (presentation generation, front-end web design, poster generation, data visualization, report generation). It organizes 30 community-contributed open-source skills for controlled comparison against base agents using state-of-the-art models and frameworks, concluding that skill availability does not guarantee effective usage, that augmentation benefits depend strongly on the underlying model and agent framework, and that many popular skills fail to consistently outperform base agents without skills.

Significance. If the task instances prove representative, the work would offer timely empirical evidence on the practical limitations of the expanding open skill ecosystem for LLM agents, underscoring the value of dynamic, task-grounded evaluation over static benchmarks and supplying actionable insights for skill design and selection.

major comments (2)
  1. [Methods (task construction)] Methods (task construction): The central claims rest on 600+ dynamically generated instances derived from real-world artifacts, yet the manuscript describes no external validation (expert ratings, comparison to logged user sessions, or hold-out real tasks) to confirm that the automatic construction process yields faithful proxies for downstream user tasks; without this, observed non-improvements and model/framework interactions risk being artifacts of the generation procedure rather than general properties of the skill ecosystem.
  2. [Evaluation setup] Evaluation setup (§ on experimental design): No details are provided on statistical controls, variance estimation across task instances, or error analysis for the reported comparisons; this leaves the strength of the model- and framework-dependent interaction claims difficult to assess.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a brief explicit statement of the five task categories and the exact number of skills per category to improve readability.
  2. [Conclusion] Project website link is provided but the manuscript does not indicate whether the generated task instances and skill implementations are released as open resources.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that directly strengthen the claims regarding task fidelity and statistical rigor.

read point-by-point responses
  1. Referee: [Methods (task construction)] The central claims rest on 600+ dynamically generated instances derived from real-world artifacts, yet the manuscript describes no external validation (expert ratings, comparison to logged user sessions, or hold-out real tasks) to confirm that the automatic construction process yields faithful proxies for downstream user tasks; without this, observed non-improvements and model/framework interactions risk being artifacts of the generation procedure rather than general properties of the skill ecosystem.

    Authors: We agree that external validation would further substantiate the representativeness of the generated tasks. The construction procedure directly ingests and adapts real-world artifacts (e.g., actual slide decks, web pages, and data tables) rather than synthesizing from scratch, which we argue already provides a stronger proxy than static benchmarks. Nevertheless, to address the concern explicitly, we will add a targeted expert validation study on a random subset of tasks in the revised manuscript. revision: yes

  2. Referee: [Evaluation setup] Evaluation setup (§ on experimental design): No details are provided on statistical controls, variance estimation across task instances, or error analysis for the reported comparisons; this leaves the strength of the model- and framework-dependent interaction claims difficult to assess.

    Authors: We acknowledge that the current manuscript omits explicit statistical controls and variance reporting. In the revision we will include (1) details of the statistical tests performed to assess model–framework–skill interactions, (2) per-category variance and confidence intervals across the 600+ instances, and (3) a qualitative error analysis highlighting representative failure modes. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation independent of inputs

full rationale

The paper presents an empirical auditing framework that constructs task instances from external real-world artifacts and compares skill-augmented agents against base agents using community-contributed skills. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations are present in the derivation of the central claims. Results are measured directly on held-out generated instances rather than reducing to the construction process by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5814 in / 1065 out tokens · 26967 ms · 2026-05-25T04:15:58.017298+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 6 internal anchors

  1. [1]

    GPT-5.4 thinking system card

    OpenAI. GPT-5.4 thinking system card. Technical report, OpenAI, March 2026. URL https: //deploymentsafety.openai.com/gpt-5-4-thinking/gpt-5-4-thinking.pdf

  2. [2]

    System card: Claude Opus 4.6

    Anthropic. System card: Claude Opus 4.6. Technical report, Anthropic, February 2026. URL https://www-cdn.anthropic.com/14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5. pdf

  3. [3]

    Claude code by anthropic | ai coding agent, terminal, ide

    Anthropic. Claude code by anthropic | ai coding agent, terminal, ide. https://www. anthropic.com/claude-code, 2025

  4. [4]

    Codex by openai | ai coding agent.https://openai.com/codex/, 2025

    OpenAI. Codex by openai | ai coding agent.https://openai.com/codex/, 2025

  5. [5]

    Equipping agents for the real world with agent skills

    Anthropic. Equipping agents for the real world with agent skills. https://www.anthropic. com/engineering/equipping-agents-for-the-real-world-with-agent-skills , 2025. 14

  6. [6]

    Harbor: A framework for evaluating and optimizing agents and models in container environments, January 2026

    Harbor Framework Team. Harbor: A framework for evaluating and optimizing agents and models in container environments, January 2026. URL https://github.com/ harbor-framework/harbor

  7. [7]

    Pptagent: Generating and evaluating presentations beyond text-to-slides

    Hao Zheng, Xinyan Guan, Hao Kong, Wenkai Zhang, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. Pptagent: Generating and evaluating presentations beyond text-to-slides. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14413–14429, 2025

  8. [8]

    Frabench and ufeval: Unified fine-grained evaluation with task and aspect generalization,

    Shibo Hong, Jiahao Ying, Haiyuan Liang, Mengdi Zhang, Jun Kuang, Jiazheng Zhang, and Yixin Cao. Frabench and ufeval: Unified fine-grained evaluation with task and aspect generalization,

  9. [9]

    URLhttps://arxiv.org/abs/2505.12795

  10. [10]

    Webarena: A realistic web environment for build- ing autonomous agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for build- ing autonomous agents. InThe Twelfth International Conference on Learning Representations, 2023

  11. [11]

    GPT-5.3-Codex system card

    OpenAI. GPT-5.3-Codex system card. https://openai.com/index/ gpt-5-3-codex-system-card/, 2026

  12. [12]

    Gemini CLI

    Google. Gemini CLI. https://github.com/google-gemini/gemini-cli, 2025. Ac- cessed: 2026-05-02

  13. [13]

    Gemini 3.1 pro model card

    Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/ model-cards/gemini-3-1-pro/, 2026

  14. [14]

    Kimi code CLI.https://github.com/MoonshotAI/kimi-cli, 2025

    Moonshot AI. Kimi code CLI.https://github.com/MoonshotAI/kimi-cli, 2025

  15. [15]

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Ha...

  16. [16]

    MiniMax M2.7: Early echoes of self-evolution

    MiniMax. MiniMax M2.7: Early echoes of self-evolution. https://www.minimax.io/news/ minimax-m27-en, 2026

  17. [17]

    Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

  18. [18]

    Intuitive or dependent? investigating LLMs’ behavior style to conflicting prompts

    Jiahao Ying, Yixin Cao, Kai Xiong, Long Cui, Yidong He, and Yongbin Liu. Intuitive or dependent? investigating LLMs’ behavior style to conflicting prompts. In Lun-Wei Ku, 15 Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4221–4246, Bangkok, T...

  19. [19]

    Why claude code skills don’t activate and how to fix it, 2026

    Ivan Seleznov. Why claude code skills don’t activate and how to fix it, 2026. Medium blog post

  20. [20]

    Ockbench: Measuring the efficiency of llm reasoning.arXiv preprint arXiv:2511.05722, 2025

    Zheng Du, Hao Kang, Song Han, Tushar Krishna, and Ligeng Zhu. Ockbench: Measuring the efficiency of llm reasoning.arXiv preprint arXiv:2511.05722, 2025

  21. [21]

    Measuring style similarity in diffusion models.arXiv preprint arXiv:2404.01292, 2024

    Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Abhinav Shrivastava, and Tom Goldstein. Measuring style similarity in diffusion models.arXiv preprint arXiv:2404.01292, 2024

  22. [22]

    Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

    Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward, 2026. URLhttps://arxiv.org/abs/2602.12430

  23. [23]

    EvoSkill: Automated Skill Discovery for Multi-Agent Systems

    Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems, 2026. URL https://arxiv.org/abs/ 2603.02766

  24. [24]

    Autoskill: Experience-driven lifelong learning via skill self-evolution, 2026

    Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, Bo Zhang, and Liang He. Autoskill: Experience-driven lifelong learning via skill self-evolution, 2026. URLhttps://arxiv.org/abs/2603.01145

  25. [25]

    SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

    Boyuan Zheng, Michael Y . Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. Skillweaver: Web agents can self-improve by discovering and honing skills, 2025. URL https://arxiv.org/ abs/2504.07079

  26. [26]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, X...

  27. [27]

    PinchBench: Real-world benchmarks for AI coding agents

    PinchBench Contributors. PinchBench: Real-world benchmarks for AI coding agents. https: //github.com/pinchbench/skill, 2026. GitHub repository

  28. [28]

    Wildclawbench, 2026

    Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Jingyi Yang, Penghui Yang, Zhixiong Zhang, Xilin Wei, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, and Yuhang Zang. Wildclawbench, 2026. URL https://github.com/InternLM/ WildClawBench

  29. [29]

    Swe-bench: Can language models resolve real-world github issues? 2023

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? 2023

  30. [30]

    Agentbench: Evaluating llms as agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. InThe Twelfth International Conference on Learning Representations, 2023

  31. [31]

    Toward generalizable evaluation in the llm era: A survey beyond benchmarks, 2025

    Yixin Cao, Shibo Hong, Xinze Li, Jiahao Ying, Yubo Ma, Haiyuan Liang, Yantao Liu, Zijun Yao, Xiaozhi Wang, Dan Huang, Wenxuan Zhang, Lifu Huang, Muhao Chen, Lei Hou, Qianru Sun, Xingjun Ma, Zuxuan Wu, Min-Yen Kan, David Lo, Qi Zhang, Heng Ji, Jing Jiang, Juanzi Li, Aixin Sun, Xuanjing Huang, Tat-Seng Chua, and Yu-Gang Jiang. Toward generalizable evaluatio...

  32. [32]

    Automating dataset updates towards reliable and timely evaluation of large language models

    Jiahao Ying, Yixin Cao, Yushi Bai, Qianru Sun, Bo Wang, Wei Tang, Zhaojun Ding, Yizhe Yang, Xuanjing Huang, and Shuicheng Yan. Automating dataset updates towards reliable and timely evaluation of large language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Sy...

  33. [33]

    URL https://proceedings.neurips.cc/paper_ files/paper/2024/file/1e89c12621c0315373f20f0aeabe5dbe-Paper-Datasets_ and_Benchmarks_Track.pdf

    doi: 10.52202/079017-0544. URL https://proceedings.neurips.cc/paper_ files/paper/2024/file/1e89c12621c0315373f20f0aeabe5dbe-Paper-Datasets_ and_Benchmarks_Track.pdf

  34. [34]

    EvoWiki: Evaluating LLMs on evolving knowledge

    Wei Tang, Yixin Cao, Yang Deng, Jiahao Ying, Bo Wang, Yizhe Yang, Yuyue Zhao, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, and Yong Liao. EvoWiki: Evaluating LLMs on evolving knowledge. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics...

  35. [35]

    Livebench: A challenging, contamination-free LLM benchmark

    Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Venkat Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination-free LLM benchmark. InThe...

  36. [36]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024. URL https://arxiv.org/abs/ 2403.07974. 17 A Technical Appendices and Supplementary Material A.1 Experimental Environment We...

  37. [37]

    application

    Data Visualization { // Meta -- required "application": "data-visualization", "case_id": "case-climate-trends", "language": "en", 19 // Style -- optional (omit to test agent autonomy) "style": { "theme": "scientific", "audience": "researchers and policy makers", "tone": "clean, publication-ready" }, // Goal -- required (one insight per case; chart_type ch...

  38. [38]

    application

    Poster Generation { // Meta -- required "application": "poster-generation", "case_id": "case-01-data-report", "language": "en", // Poster constraints -- optional "poster": { "aspect_ratio": "landscape",// landscape | portrait | square | A0-landscape | ... "audience": "data-report", "tone": "data-forward, professional", }, // Content brief -- optional "bri...

  39. [39]

    application

    Presentation Generation { // Meta -- required "application": "ppt-generation", "case_id": "case-01-internal-review", "language": "en", // Deck constraints -- optional "deck": { 20 "aspect_ratio": "16:9",// default 16:9 "slide_count": 6,// omit to let agent decide "audience": "internal product review", "tone": "professional, concise" }, // Content brief --...

  40. [40]

    application

    Report Generation { // Meta -- required "application": "report-generation", "case_id": "case-01-sales-analysis", "language": "en", // Report constraints -- optional "report": { "type": "sales-report", "audience": "management", "tone": "professional, data-forward" }, // Content brief -- optional "brief": { "title": "2024 Q4 Sales Performance Report", "one_...

  41. [41]

    application

    Web Design { // Meta -- required "application": "web-design", "case_id": "case-01-landing-page", "language": "en", // Site constraints -- optional "site": { "type": "landing-page", "page_count": 2,// omit to let agent decide "audience": "developers and technical decision-makers", "tone": "modern, professional, bold", "responsive": true,// default true "da...

  42. [42]

    expressed

    Data Visualization Insight Expression single image Evaluate the **insight expression** of this data visualization. The visualization was created to convey a specific insight: **Goal insight**: {insight} Criteria: - Does the chosen visualization type effectively communicate this insight? - Can the reader **actually** understand the key message at a glance,...

  43. [43]

    score": <1-5>,

    Poster Generation Design single image Evaluate the **visual design quality** of this poster/infographic. 27 Criteria: - Color scheme: harmonious palette, appropriate for the topic and tone - Layout: clean alignment, proper spacing, clear visual hierarchy - Typography: readable fonts, clear size hierarchy (title > heading > body) - Consistency: unified sty...

  44. [44]

    score": <1-5>,

    PPT Generation Content per-slide image Evaluate the **content quality** of this presentation slide. Judge how effectively this slide delivers its key message to the reader. Criteria: - Key message: does the slide have a clear takeaway that the reader can grasp? - Information density: appropriate amount of content (not too crowded, not too sparse) - Clarit...

  45. [45]

    score": <1-5>,

    Report Generation Content Quality report text only Evaluate the **content quality** of this report across two aspects: writing quality AND analysis depth. A. Writing & Structure: - Organization: clear headings, logical flow, well-structured executive summary - Clarity: well-written, grammatically correct, easy to understand - Information density: appropri...

  46. [46]

    score": <1-5>,

    Web Design Visual Design per-page multi-image (full + crops) Evaluate the **visual design execution quality** of this web page. Criteria: - Color & typography: harmonious palette, readable fonts, clear heading hierarchy (h1 > h2 > body), consistent font sizing - Layout & structure: well-organized sections, clear information hierarchy, consistent grid alig...