Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

Bocheng Huang; Bo Li; Fei Sun; Huhai Hong; Huimei Wang; Jiarui Jiang; Liping Su; Qianyu Yao; Ruoqiong Wu; Shu Quan

arxiv: 2606.11830 · v1 · pith:ZDWTQRPHnew · submitted 2026-06-10 · 💻 cs.AI

Qianyu Yao, Fei Sun, Bocheng Huang, Wei Chen, Jiarui Jiang, Shu Quan, Yifei Chen, Wenjie Xu, Bo Li, Liping Su, Ruoqiong Wu, Huhai Hong, Huimei Wang

Pith reviewed 2026-06-27 09:53 UTC · model grok-4.3

classification 💻 cs.AI

keywords AI agents · skill augmentation · medical research analysis · transcriptomics · NSCLC · human evaluation · quality assessment · biomarker task

The pith

Skill-augmented AI agents produce directionally higher expert-rated quality than native AI in a transcriptomic biomarker task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether autonomous access to a medical research skill package improves the quality of AI-generated outputs for analyzing transcriptomic data in a non-small cell lung cancer immunotherapy biomarker task. Six model backbones generated 21 anonymized outputs that were rated for overall quality by four non-expert reviewers and two blinded experts. Skill-augmented outputs received higher mean ratings from experts (5.50 versus 5.11) and non-experts (4.72 versus 4.47), but the differences were not statistically significant and expert agreement was low. The work concludes that the directional signal exists but is too small and noisy to count as confirmatory evidence. A sympathetic reader would care because the test directly checks whether adding structured skills can reduce common AI failures like omitted steps or overstated claims in real biomedical analysis.

Core claim

In this exploratory multi-model human evaluation, autonomous access to a medical research skill package was associated with higher mean expert overall quality ratings for skill-augmented outputs (5.50) compared with native-AI outputs (5.11), with a parallel directional effect in non-expert ratings; the differences did not reach statistical significance, expert single-rating agreement was low, and model-specific effects were heterogeneous.

What carries the argument

Autonomous access to a medical research skill package that supplies structured guidance on analytical steps, implemented through an AI agent to generate the outputs.

If this is right

The directional quality signal holds across both expert and non-expert reviewer groups.
Skill augmentation can be applied across multiple underlying model backbones.
Future work must address low expert agreement and add biological-validity checks.
The current sample size and rating noise prevent treating the result as confirmatory.
Model-specific effects vary, indicating that benefits may not be uniform across all AI systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If human ratings track perceived polish more than factual accuracy, skill augmentation may improve presentation without fixing underlying analytical errors.
The same skill-access approach could be tested in other data-heavy biomedical domains such as proteomics or clinical trial design.
A sample several times larger would be needed to detect a 0.39-point difference with adequate power given the observed rating variability.
Combining the agent outputs with automated validation against public databases could provide an objective check independent of human ratings.
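The sample-size point above can be made concrete with a rough calculation. The sketch below uses the standard normal-approximation formula for a two-sided, two-sample comparison of means with equal arms. The 0.39-point difference comes from the paper; the standard deviations are hypothetical placeholders, since the rating spread is not reported here, and the function name `n_per_group` is illustrative only.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate outputs needed per arm to detect a mean difference `delta`.

    Normal approximation, equal arm sizes, common standard deviation `sd`.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = z.inv_cdf(power)          # quantile for the target power
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# Hypothetical standard deviations (assumptions, not values from the paper).
for sd in (0.5, 0.75, 1.0):
    print(f"sd={sd}: about {n_per_group(0.39, sd)} outputs per arm")
```

The formula assumes independent, reliably measured ratings. With expert agreement as low as reported, the effective noise would be larger, so any number from this sketch is best read as a lower bound.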

Load-bearing premise

Human expert and non-expert ratings of output quality are a reliable and representative measure of the actual analytical soundness of the AI-generated transcriptomic research outputs.

What would settle it

A larger study that measures the same outputs against independent biological ground truth, such as known correct biomarker associations from the literature, and finds no quality advantage or a reversal for skill-augmented versions.

Figures

Figures reproduced from arXiv: 2606.11830 by Bocheng Huang, Bo Li, Fei Sun, Huhai Hong, Huimei Wang, Jiarui Jiang, Liping Su, Qianyu Yao, Ruoqiong Wu, Shu Quan, Wei Chen, Wenjie Xu, Yifei Chen.

**Figure 2.** Overall quality scores by generation strategy. Boxplots show expert-rated overall quality…

**Figure 3.** Expert methodological quality and non-expert reviewer perceived risk by generation strat…

**Figure 4.** Descriptive model-specific skill-minus-native differences. Bars show the model-level dif…

Original abstract

Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher-quality AI-generated transcriptomic research-analysis outputs compared with native AI without skills.

Methods. We conducted an exploratory multi-model human evaluation using a non-small cell lung cancer immunotherapy biomarker task. Six model backbones were tested. The evaluation included 21 anonymized outputs: 9 native-AI outputs and 12 skill-augmented outputs generated through an AI agent implementation represented by OpenClaw. Four non-expert biomedical reviewers and two blinded experts evaluated each output, with two ratings from each reviewer type. The primary outcome was expert-rated overall quality.

Results. Skill-augmented outputs showed directionally higher expert overall quality than native-AI outputs (mean 5.50 vs 5.11; difference=0.39; bootstrap 95% CI, -0.04 to 0.90; Welch p=0.156). Non-expert reviewer quality showed the same direction (mean 4.72 vs 4.47; difference=0.26; bootstrap 95% CI, -0.25 to 0.80; Welch p=0.373). Expert agreement was limited (single-rating ICC=-0.15), and model-specific effects were descriptive and heterogeneous.

Conclusions. Autonomous skill access showed a directional quality signal in this exploratory sample, but the signal was smaller than expert-rating noise and should not be interpreted as confirmatory evidence. The findings primarily motivate larger evaluations of skill-augmented AI agents with stronger reliability controls, platform replication, and biological-validity assessment.
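To show what the two statistics in the Results involve, here is a minimal sketch of a Welch comparison and a percentile bootstrap interval for a difference in group means. The ratings are made-up placeholders, not the paper's data, and the helper names `welch_t` and `bootstrap_ci` are illustrative. The paper does not say which bootstrap variant it used, so the percentile method here is an assumption.

```python
import random
from math import sqrt
from statistics import mean, variance

# Made-up placeholder ratings (NOT the paper's data): 12 skill-augmented, 9 native.
skill = [6.0, 5.0, 5.5, 6.5, 4.5, 5.5, 6.0, 5.0, 5.5, 6.0, 4.5, 6.5]
native = [5.0, 5.5, 4.0, 5.0, 6.0, 4.5, 5.5, 5.0, 4.5]


def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom for mean(a) - mean(b)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)  # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def bootstrap_ci(a, b, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling each group separately."""
    rng = random.Random(seed)
    diffs = sorted(
        mean(rng.choices(a, k=len(a))) - mean(rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    lo = diffs[round((1 - level) / 2 * n_boot)]
    hi = diffs[round((1 + level) / 2 * n_boot) - 1]
    return lo, hi


t, df = welch_t(skill, native)
lo, hi = bootstrap_ci(skill, native)
print(f"difference = {mean(skill) - mean(native):.2f}, Welch t = {t:.2f}, df = {df:.1f}")
print(f"bootstrap 95% CI: ({lo:.2f}, {hi:.2f})")
```

A two-sided p-value would come from the t distribution with `df` degrees of freedom. The standard library has no t CDF, so that step is left out. An interval that includes zero, like the one the paper reports, means the data are compatible with no difference between strategies.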

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This exploratory evaluation finds a small directional edge for skill-augmented agents, but the negative expert ICC makes the quality ratings too noisy to put much weight on that signal.

The main thing to take from this paper is that skill-augmented AI outputs scored a bit higher on expert overall quality than native ones (5.50 vs 5.11), yet the difference is small, non-significant, and rests on ratings that agree worse than chance.

The work applies an existing skill-augmentation approach to a concrete transcriptomic biomarker task and runs a human evaluation across six model backbones. It does a solid job of staying transparent: the abstract and results section report the bootstrap CIs that cross zero, the Welch p-value of 0.156, the same directional pattern in non-expert ratings, and the single-rating ICC of -0.15 between the two blinded experts. That level of candor about measurement problems is useful and not automatic in this area.

The clearest limitation is the ICC itself. When expert agreement is negative, the 0.39-point gap cannot be confidently read as evidence about analytical soundness rather than rater noise or bias. The total of 21 outputs and the exploratory framing compound that, so the directional claim stays weak even if the authors correctly avoid over-interpreting it. No new agent techniques or frameworks appear; this is an application plus evaluation.

The paper is mainly for groups already working on AI agents for biomedical research who want a realistic case study of evaluation difficulties. It will not change practice on its own, but the explicit discussion of reliability issues could help others design stronger follow-ups.

I would send it to peer review. The authors already flag the core problems, so referees could focus on tightening the interpretation and suggesting concrete fixes for rating reliability in future studies.

Referee Report

2 major / 1 minor

Summary. The manuscript reports an exploratory multi-model human evaluation comparing skill-augmented AI agents (OpenClaw implementation) to native AI models on a transcriptomic biomarker analysis task for NSCLC immunotherapy. Across six model backbones and 21 anonymized outputs evaluated by four non-expert and two expert reviewers, it finds a directional but non-significant difference favoring skill-augmented outputs on expert overall quality (means 5.50 vs 5.11; difference 0.39; bootstrap 95% CI -0.04 to 0.90; Welch p=0.156), a similar non-significant direction for non-expert ratings, limited expert agreement (single-rating ICC=-0.15), and heterogeneous model-specific effects. The paper concludes that the directional signal motivates larger evaluations with improved reliability controls rather than constituting confirmatory evidence.

Significance. If the directional quality signal were replicated in larger studies with reliable expert ratings and biological validity checks, the work would provide initial empirical support for the benefit of autonomous skill packages in AI agents for biomedical research tasks. In its current form, however, the exploratory design, small sample, non-significant tests, and measurement issues limit the contribution to motivating future research rather than establishing efficacy of skill augmentation.

major comments (2)

[Results] Results (expert overall quality comparison): The primary outcome is expert-rated overall quality, yet the reported single-rating ICC of -0.15 between the two blinded experts indicates agreement worse than chance. This directly undermines attribution of the observed 0.39-point difference to differences in analytical soundness of the transcriptomic outputs, as the signal is smaller than rater noise; the bootstrap CI and p-value are computed on an unreliable measure.
[Methods] Methods (rater design): With only two experts and negative ICC, the evaluation lacks a reliable primary outcome measure. The manuscript should address whether additional expert raters, consensus procedures, or alternative validity anchors (e.g., biological accuracy checks) are feasible within the exploratory scope, as this measurement problem is more fundamental than sample size or the non-significant p-value.
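Both major comments turn on the negative single-rating ICC, so a small worked sketch of how such a value arises may help. The code below computes a one-way random-effects, single-rating ICC, often written ICC(1,1). The paper does not state which ICC form it used, so this choice is an assumption, and the toy ratings are invented for illustration.

```python
def icc_oneway_single(ratings):
    """One-way random-effects, single-rating ICC, i.e. ICC(1,1).

    `ratings` is a list of rows, one row per output, each holding k ratings.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-output mean square: how much outputs differ from one another.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    # Within-output mean square: how much raters differ on the same output.
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Invented toy ratings: two ratings per output.
agree = [(4, 4), (5, 5), (6, 6)]           # raters match on every output
oppose = [(5, 6), (6, 5), (5, 6), (6, 5)]  # raters differ, outputs look identical on average

print(icc_oneway_single(agree))   # 1.0
print(icc_oneway_single(oppose))  # -1.0
```

Under this form, the ICC goes negative whenever raters disagree about the same output more than the outputs differ from one another. In that regime a between-group gap in mean ratings is hard to separate from rater variation.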

minor comments (1)

[Abstract] Abstract and Conclusions: The cautious framing ('should not be interpreted as confirmatory evidence') is appropriate but could be strengthened by explicitly linking the ICC value to the size of the observed difference in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the critical issue of inter-rater reliability in our exploratory evaluation. We agree that the negative ICC represents a fundamental measurement limitation that prevents strong attribution of the directional signal to skill augmentation. The manuscript is already framed as non-confirmatory and primarily motivational for future work; we will revise to address the comments explicitly while preserving the exploratory scope.

Referee: [Results] Results (expert overall quality comparison): The primary outcome is expert-rated overall quality, yet the reported single-rating ICC of -0.15 between the two blinded experts indicates agreement worse than chance. This directly undermines attribution of the observed 0.39-point difference to differences in analytical soundness of the transcriptomic outputs, as the signal is smaller than rater noise; the bootstrap CI and p-value are computed on an unreliable measure.

Authors: We agree that an ICC of -0.15 indicates agreement worse than chance and that this measurement unreliability is a core limitation. The observed 0.39-point difference is indeed smaller than the rater noise, which is why the manuscript already states that the findings 'should not be interpreted as confirmatory evidence' and that the signal 'was smaller than expert-rating noise.' We will revise the Results and Discussion sections to more explicitly note that the directional difference cannot be reliably attributed to skill augmentation given the ICC, and we will strengthen the language that the bootstrap CI and p-value are computed on an unreliable measure. This revision will be made without altering the reported statistics or the exploratory conclusion.

Revision: yes

Referee: [Methods] Methods (rater design): With only two experts and negative ICC, the evaluation lacks a reliable primary outcome measure. The manuscript should address whether additional expert raters, consensus procedures, or alternative validity anchors (e.g., biological accuracy checks) are feasible within the exploratory scope, as this measurement problem is more fundamental than sample size or the non-significant p-value.

Authors: We acknowledge that two experts yield insufficient reliability for a primary outcome and that this issue is more fundamental than sample size. Within the resource constraints of this exploratory study, additional expert raters or real-time consensus procedures were not feasible during data collection. We will revise the Methods and Discussion to explicitly address feasibility by (1) stating the practical limits of the current design, (2) outlining that future studies could incorporate consensus rating or more raters, and (3) proposing biological validity anchors such as cross-checking outputs against established NSCLC biomarker literature. These additions will clarify the exploratory scope without claiming the current data overcome the reliability problem.

Revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical human-evaluation study

The paper reports an exploratory multi-model human evaluation of AI-generated transcriptomic outputs. Primary outcomes are direct mean comparisons of expert and non-expert quality ratings (5.50 vs 5.11; 4.72 vs 4.47) with bootstrap CIs, Welch p-values, and single-rating ICC. No equations, fitted parameters, predictive models, or derivation chains appear in the abstract or described methods. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The central claim is a directional empirical signal whose validity rests on the rating data themselves rather than any reduction to prior inputs. This matches the default non-circular case for an observational evaluation study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation depends on the domain assumption that the chosen NSCLC task and reviewer ratings capture meaningful differences in research analysis quality; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Human expert and non-expert ratings provide a valid and generalizable measure of AI output quality in biomedical research analysis.
Primary outcome and conclusions rest directly on these ratings.

pith-pipeline@v0.9.1-grok · 5905 in / 1110 out tokens · 21068 ms · 2026-06-27T09:53:34.988702+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 5 linked inside Pith

[1] Singhal, K., Azizi, S., Tu, T., et al. (2023). Large language models encode clinical knowledge. Nature, 620, 172-180. https://www.nature.com/articles/s41586-023-06291-2

[2] Moor, M., Banerjee, O., Abad, Z. S. H., et al. (2023). Foundation models for generalist medical artificial intelligence. Nature, 616, 259-265. https://www.nature.com/articles/s41586-023-05881-4

[3] Li, M., Song, F., Yu, B., et al. (2023). API-Bank: A comprehensive benchmark for tool-augmented LLMs. arXiv. https://arxiv.org/abs/2304.08244

[4] Patil, S. G., Zhang, T., Wang, X., & Gonzalez, J. E. (2023). Gorilla: Large language model connected with massive APIs. arXiv. https://arxiv.org/abs/2305.15334

[5] Qin, Y., Liang, S., Ye, Y., et al. (2023). ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv. https://arxiv.org/abs/2307.16789

[6] Shen, Y., Song, K., Tan, X., et al. (2023). TaskBench: Benchmarking large language models for task automation. arXiv. https://arxiv.org/abs/2311.18760

[7] Li, H., Mu, C., Chen, J., Ren, S., Cui, Z., Zhang, Y., Bai, L., Hu, S., et al. (2026). AgentSkillOS: Organizing, orchestrating, and benchmarking agent skills at ecosystem scale. arXiv. https://arxiv.org/abs/2603.02176

[8] Zheng, Y., Zhang, Z., Ma, C., Yu, Y., Zhu, J., Dong, B., & Zhu, H. (2026). SkillRouter: Skill routing for LLM agents at scale. arXiv. https://arxiv.org/abs/2603.22455

[9] Li, D., Li, Z., Du, H., Wu, X., Gui, S., Kuang, Y., & Sun, L. (2026). Graph of Skills: Dependency-aware structural retrieval for massive agent skills. arXiv. https://arxiv.org/abs/2604.05333

[10] Wang, J., Ming, Y., Ke, Z., Joty, S., Albarghouthi, A., & Sala, F. (2026). SkillOrchestra: Learning to route agents via skill transfer. arXiv. https://arxiv.org/abs/2602.19672

[11] Li, X., Chen, W., Liu, Y., Zheng, S., Chen, X., He, Y., Li, Y., You, B., Shen, H., Sun, J., Wang, S., Zeng, Q., Wang, D., Zhao, X., Wang, Y., Ben Chaim, R., Di, Z., Gao, Y., He, J., et al. (2026). SkillsBench: Benchmarking how well agent skills work across diverse tasks. arXiv. https://arxiv.org/abs/2602.12670

[12] Li, F., Tagkopoulos, P., & Tagkopoulos, I. (2025). SkillFlow: Scalable and efficient agent skill retrieval system. arXiv. https://arxiv.org/abs/2504.06188