COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation

Dongrui Liu; Jing Shao; Leitao Yuan; Tianyi Zhou; Xia Hu

arXiv: 2605.31264 · v1 · pith:FI7TNJIDnew · submitted 2026-05-29 · 💻 cs.AI · cs.CL · cs.LG

Tianyi Zhou, Dongrui Liu, Leitao Yuan, Jing Shao, Xia Hu

Pith reviewed 2026-06-28 22:27 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG

keywords AI skill generation · expert knowledge distillation · LLM agents · skill packages · trace-to-skill · person-grounded skills · agent deployment

The pith

COLLEAGUE.SKILL automates conversion of expert traces into versioned AI skill packages with capability and behavior tracks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a system that takes heterogeneous materials from a person or role and automatically produces structured skill packages for LLM agents. These packages separate a capability track covering practices, mental models, and decision heuristics from a bounded behavior track covering communication style, interaction rules, and correction history. The resulting packages support inspection, natural-language updates, rollback, and installation across agent hosts. This approach addresses the difficulty of embedding actionable human expertise into agents when that expertise exists only in scattered traces rather than clean instructions. A sympathetic reader would care because it offers an end-to-end workflow for creating portable, correctable representations of individual judgment and style.

Core claim

Given materials from a target person or role, COLLEAGUE.SKILL produces a versioned skill package with two coordinated tracks: a capability track for practices, mental models, and decision heuristics, and a bounded behavior track for communication style, interaction rules, and correction history. The package can be inspected, invoked, updated through natural-language feedback, rolled back, installed across agent hosts, and optionally prepared for controlled distribution.
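The package shape described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `SkillVersion`, `SkillPackage`, `correct`, and `rollback` are hypothetical, and the step that interprets natural-language feedback (presumably LLM-mediated in the real system) is replaced here by a rule the caller supplies. Modeling rollback as re-appending an earlier snapshot, so that history is never destroyed, is also an assumption of this sketch.

```python
from dataclasses import dataclass, field, replace
from typing import List, Tuple


@dataclass(frozen=True)
class SkillVersion:
    """One immutable snapshot of a skill package (hypothetical schema)."""
    capability: Tuple[str, ...]        # practices, mental models, decision heuristics
    behavior: Tuple[str, ...]          # communication style, interaction rules
    corrections: Tuple[str, ...] = ()  # correction history


@dataclass
class SkillPackage:
    name: str
    versions: List[SkillVersion] = field(default_factory=list)

    def current(self) -> SkillVersion:
        return self.versions[-1]

    def correct(self, feedback: str, track: str, rule: str) -> int:
        """Log the feedback and add `rule` to one track as a new version."""
        if track not in ("capability", "behavior"):
            raise ValueError(f"unknown track: {track}")
        cur = self.current()
        updated = replace(
            cur,
            **{track: getattr(cur, track) + (rule,)},
            corrections=cur.corrections + (feedback,),
        )
        self.versions.append(updated)
        return len(self.versions) - 1

    def rollback(self, version: int) -> int:
        """Restore an earlier snapshot by re-appending it; history stays intact."""
        self.versions.append(self.versions[version])
        return len(self.versions) - 1


# Illustrative data only; the rules below are invented for the example.
pkg = SkillPackage(
    "reviewer",
    [SkillVersion(("Check edge cases first",), ("Be concise",))],
)
v1 = pkg.correct("Too terse in reviews", "behavior", "Explain each request")
v2 = pkg.rollback(0)
```

Keeping snapshots immutable and the version list append-only means every correction and every rollback stays inspectable, which is the property the package contract emphasizes.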

What carries the argument

The versioned skill package with its capability track and bounded behavior track, produced by the trace-to-skill distillation workflow.

Load-bearing premise

Heterogeneous traces contain sufficient actionable knowledge that can be automatically extracted into inspectable, correctable skill packages without significant loss or distortion of the original expertise.

What would settle it

Deploy the generated skill packages in agents and check whether their decisions on new scenarios drawn from the same domain match the original expert's choices, or whether natural-language corrections fail to produce accurate updates to the package.
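The first half of that check reduces to an agreement rate over held-out scenarios. The sketch below is one hedged way to compute it; the paper reports no such evaluation, and the function name, scenario IDs, and decision labels are all made up for illustration.

```python
from typing import Mapping


def agreement_rate(expert: Mapping[str, str], agent: Mapping[str, str]) -> float:
    """Fraction of shared scenarios where the agent's choice equals the expert's."""
    shared = expert.keys() & agent.keys()
    if not shared:
        raise ValueError("no scenarios in common")
    matches = sum(expert[s] == agent[s] for s in shared)
    return matches / len(shared)


# Hypothetical held-out scenarios from the same domain as the traces.
expert_choices = {"s1": "approve", "s2": "reject", "s3": "request changes", "s4": "approve"}
agent_choices = {"s1": "approve", "s2": "reject", "s3": "approve", "s4": "approve"}

rate = agreement_rate(expert_choices, agent_choices)  # 3 of 4 match -> 0.75
```

Exact-match agreement is the simplest possible fidelity measure; open-ended decisions would likely need expert ratings instead, as the referee report suggests.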

Figures

Figures reproduced from arXiv: 2605.31264 by Dongrui Liu, Jing Shao, Leitao Yuan, Tianyi Zhou, Xia Hu.

**Figure 1.** The deployed COLLEAGUE.SKILL architecture. The core path begins with traces of a target person or role: work documents and review comments for a colleague, public interviews and long-form writings for a public figure, or private interaction records for a relationship preset. Collectors and parsers normalize this material into local knowledge directories. Analyzers extract evidence about durable capab…

**Figure 2.** Application presets layered on the COLLEAGUE.SKILL person-grounded skill pipeline. The shared artifact workflow branches into colleague, celebrity, and relationship presets with different evidence scopes, governance requirements, and invocation aliases. These presets are domain specializations of the same person-grounded artifact workflow, not separate systems. They avoid duplicating the pipeline when a ne…

**Figure 3.** Lifecycle loop for generated skills. Corrections and patches create new versions while preserving…

**Figure 4.** Observed public deployment counters on 2026-05-28. Counts summarize repository activity, gallery…

Original abstract

LLM agents are increasingly expected not only to complete isolated tasks, but also to carry bounded representations of human expertise, judgment, and interaction style. Building such person-grounded agents remains difficult because actionable knowledge associated with a person or role is usually embedded in heterogeneous traces rather than written as clean instructions. Existing memory and persona systems capture fragments of this evidence, while skill frameworks provide portable packaging formats; however, there is no end-to-end workflow for distilling these traces into inspectable, correctable, and agent-usable skills. We present an automated trace-to-skill distillation system for generating person-grounded AI skills via expert knowledge distillation. Given materials from a target person or role, COLLEAGUE.SKILL produces a versioned skill package with two coordinated tracks: a capability track for practices, mental models, and decision heuristics, and a bounded behavior track for communication style, interaction rules, and correction history. The package can be inspected, invoked, updated through natural-language feedback, rolled back, installed across agent hosts, and optionally prepared for controlled distribution. We describe the artifact contract, generation workflow, correction lifecycle, deployment surface, and domain presets implemented in the open-source system. At the time of writing, the public repository has approximately 18.5k GitHub stars; the gallery lists 215 skills from 165 contributors and more than 100k cumulative stars across listed skill cards. The system illustrates how person-grounded skills can be represented as portable, correctable packages rather than opaque prompts or hidden memories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A concrete workflow for turning expert traces into versioned skill packages, but no tests to show the distillation actually preserves the expertise.

The paper's main point is that it gives a full workflow for distilling heterogeneous traces into inspectable, correctable skill packages with two tracks: one for capabilities and mental models, one for bounded behaviors and interaction style. That fills a practical gap between raw memory fragments and portable agent skills.

It does a solid job spelling out the artifact contract, the generation steps, the correction lifecycle via natural language, and how the packages can be installed or rolled back. The open-source release and reported usage numbers show the authors have built something people are actually trying.

The clear weakness is the total absence of any validation. No expert ratings on fidelity, no downstream task comparisons, no checks on whether the LLM extraction loses or distorts the original judgment. The whole utility claim sits on the untested assumption that the process works without significant loss.

This is aimed at people building or deploying LLM agents who need a packaging format for person-specific behaviors. A practitioner could pull useful implementation ideas from the workflow and presets.

It deserves peer review because the system is concrete and already has traction; referees can push for the missing evaluation or help scope what the paper actually demonstrates.

Referee Report

2 major / 1 minor

Summary. The manuscript presents COLLEAGUE.SKILL, an automated trace-to-skill distillation system that converts heterogeneous materials from a target person or role into versioned skill packages. Each package contains two coordinated tracks—a capability track capturing practices, mental models, and decision heuristics, and a bounded behavior track capturing communication style, interaction rules, and correction history. The paper describes the artifact contract, generation workflow, correction lifecycle, deployment surface, and domain presets, and reports community metrics for the open-source implementation (18.5k GitHub stars, 215 skills from 165 contributors).

Significance. If the distillation process reliably extracts actionable expertise into inspectable and correctable packages without material distortion, the work could offer a structured alternative to opaque prompts or fragmented memory systems for building person-grounded LLM agents. The emphasis on versioning, natural-language correction, and cross-host portability addresses practical deployment needs, and the reported adoption metrics suggest the framework has already seen community uptake.

major comments (2)

[Abstract] The central claim that the system produces 'inspectable, correctable, and agent-usable skills' that faithfully capture person-grounded expertise rests entirely on description of the workflow and artifact contract; no empirical validation (expert fidelity ratings, downstream agent performance comparisons, ablation on trace heterogeneity, or error analysis) is reported anywhere in the manuscript.

[The manuscript as a whole] The assumption that LLM-mediated extraction from heterogeneous traces incurs no significant loss or distortion is load-bearing for the claimed utility, yet remains untested; without such checks the two-track structure and correction lifecycle cannot be shown to improve upon existing memory or persona systems.

minor comments (1)

[Abstract] The GitHub adoption statistics are presented without any accompanying analysis of how contributor volume or star counts correlate with skill correctness or usability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for identifying the need to clarify the scope and evidential basis of the manuscript. We address each major comment below and commit to revisions that better frame the contribution as a systems description while acknowledging the absence of empirical validation.

Referee: [Abstract] The central claim that the system produces 'inspectable, correctable, and agent-usable skills' that faithfully capture person-grounded expertise rests entirely on description of the workflow and artifact contract; no empirical validation (expert fidelity ratings, downstream agent performance comparisons, ablation on trace heterogeneity, or error analysis) is reported anywhere in the manuscript.

Authors: We agree that the manuscript provides no empirical validation of skill fidelity, downstream performance, or error characteristics. The paper is a systems description of the artifact contract, workflow, correction lifecycle, and open-source implementation, with community adoption (18.5k GitHub stars, 215 skills) offered as indirect evidence of practical utility rather than controlled evaluation. We will revise the abstract and introduction to explicitly state that the work presents a distillation framework and portable package format, not validated extraction accuracy. A new 'Limitations and Future Work' section will be added to discuss the need for expert fidelity studies, agent performance benchmarks, and ablation experiments on trace heterogeneity.

Revision: yes

Referee: [The manuscript as a whole] The assumption that LLM-mediated extraction from heterogeneous traces incurs no significant loss or distortion is load-bearing for the claimed utility, yet remains untested; without such checks the two-track structure and correction lifecycle cannot be shown to improve upon existing memory or persona systems.

Authors: The manuscript does not assert that extraction incurs no loss or distortion; the design of the two-track structure and natural-language correction mechanism is intended to surface and mitigate such issues through human inspection and rollback. Nevertheless, we acknowledge that no comparative evaluation against memory or persona systems is provided, so claims of improvement remain untested. In revision we will add an explicit discussion of related memory and persona approaches, state the untested assumptions regarding extraction fidelity, and outline planned empirical comparisons as future work.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely descriptive system design with no derivations or fitted predictions

The manuscript presents an engineering artifact (trace-to-skill distillation workflow, two-track package contract, correction lifecycle) without equations, parameter fitting, uniqueness theorems, or any claimed first-principles derivations. No step reduces a result to its own inputs by construction, and no self-citations are invoked as load-bearing mathematical justification. The central claim is an existence and design statement whose validity is independent of the paper's own text; external validation (user studies, fidelity metrics) is simply absent, which is a correctness issue rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No mathematical model, free parameters, or new physical entities are introduced. The work rests on the domain assumption that expert traces are rich enough to support automated distillation into correctable packages.

axioms (1)

Domain assumption: Actionable knowledge associated with a person or role is usually embedded in heterogeneous traces rather than written as clean instructions.
Stated directly in the abstract as the motivating premise for the distillation system.

