arxiv: 2605.04970 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.AI

Recognition: unknown

Skill Neologisms: Towards Skill-based Continual Learning

Antonin Berthon , Nicolas Astorga , Mihaela van der Schaar

Authors on Pith no claims yet

Pith reviewed 2026-05-08 17:22 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords skill neologismscontinual learninglarge language modelssoft tokenszero-shot compositionvocabulary extensionparameter-efficient adaptation

0 comments

The pith

Skill neologisms let LLMs gain new abilities by adding optimized soft tokens to the vocabulary without any weight updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces skill neologisms as a method to extend large language models to new skills in a continual learning setting. These are soft tokens added to the vocabulary and trained on skill-specific data to boost performance on targeted abilities. The approach avoids catastrophic forgetting that comes with fine-tuning and the limited scope of context-only prompting. Experiments demonstrate that the resulting tokens improve the intended skill, combine effectively with unfamiliar skills, and can be merged zero-shot even when trained independently. This setup is presented as a scalable route to accumulating skills modularly over time.

Core claim

Pre-trained LLMs already contain tokens tied to procedural knowledge. New skill neologisms can be learned to raise performance on chosen skills while staying composable with out-of-distribution skills. Independently trained neologisms can be combined at inference time without further optimization, supporting the idea of skill-based continual learning without parameter changes.

What carries the argument

Skill neologisms, which are soft tokens added to the model's vocabulary and optimized via gradient descent on skill-specific objectives, function as the extensible units that enhance and allow composition of abilities.

Load-bearing premise

That newly optimized skill tokens can be inserted and combined without lowering performance on unrelated tasks or creating interference among skills.

What would settle it

A test showing that inserting several skill neologisms causes measurable drops on the base model's original tasks or on the individual skills themselves would disprove the non-interference property.

Figures

Figures reproduced from arXiv: 2605.04970 by Antonin Berthon, Mihaela van der Schaar, Nicolas Astorga.

**Figure 1.** Figure 1: Overview of Skill Neologisms. et al., 2023). In-context learning has shown some success at skill composition in simple settings (Chen et al., 2023; Levy et al., 2023; Xu et al., 2024), but it does not adapt as well as PEFT methods (Liu et al., 2022a), and does not scale because of effective context limitations (Hsieh et al., 2024). Prompt tuning (Lester et al., 2021) can adapt models to new tasks by only … view at source ↗

**Figure 2.** Figure 2: Overview of Skill Neologisms. (A) We consider pretrained model endowed with a set of implicit skills learned during pretraining. (B) A skill-centered dataset contains snippets of text that require at least the skill of interest, composed with pretraining skills. (C) Skill neologisms appends new token embeddings to the model’s vocabulary and embedding matrix, which are trained on the skill-centered dataset … view at source ↗

**Figure 3.** Figure 3: Accuracy on XOR and XNOR completion tasks across open-source models under different prompts. Only Examples provides three input–output examples before the query. + Text Description adds a natural-language description of the operation (e.g., ”output 1 iff the input bits differ” for XOR). + Keyword adds only the operation name (”XOR” or ”XNOR”). Results are averaged over N = 100 samples; error bars show stan… view at source ↗

**Figure 4.** Figure 4: Accuracy on 2-combinations of skills mixing Snew with Σtrain (in-distribution) or Sheld-out (out-of-distribution). Dotted lines show the average accuracy across all Sheld-out. PT: Prompt Tuning. erty 3). This distinguishes skill neologisms from LoRA and Prompt Tuning-like approaches, which cannot be composed after independent training without retraining on the joint task. We compare against in-context lea… view at source ↗

**Figure 5.** Figure 5: shows the average accuracy across different Sheld-out for Skill Neologisms and N for ICL, for different sequence lengths (increasing task difficulty). Traces for individual runs are shown Figure A3 in the Appendix. Skill neologisms significantly outperform ICL across all sequence lengths. This demonstrates that skill neologisms successfully capture reusable procedural knowledge that transfers zero-shot t… view at source ↗

**Figure 6.** Figure 6: Effect of skill token length. Accuracy on 2-skill combinations with Sheld-out = ADD for varying skill token length l. Takeaway: The limited capacity of skill neologisms acts as an inductive bias to learn more composable skill representations. 5.3.2. COMPOSITION COMPLEXITY IN TRAINING SET The complexity of skill combinations in the training set provides another source of inductive bias: exposing skill tok… view at source ↗

read the original abstract

Modern LLMs show mastery over an ever-growing range of skills, as well as the ability to compose them flexibly. However, extending model capabilities to new skills in a scalable manner is an open-problem: fine-tuning and parameter-efficient variants risk catastrophic forgetting, while context-based approaches have limited expressiveness and are constrained by the model's effective context. We explore skill neologisms--i.e., soft tokens integrated in the model's vocabulary and optimized to improve capabilities over a specific skill--as a way to selectively extend model capabilities to new skills without weight updates. We first observe that off-the-shelf pre-trained LLMs already demonstrate tokens associated with procedural knowledge. We then show that skill neologisms can be learned to improve model capabilities on specific skills while being composable with out-of-distribution skills, and that independently trained skill neologisms can be composed zero-shot. These results suggest that skill neologisms may provide a scalable path towards skill-based continual learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes skill neologisms as soft tokens added to a pre-trained LLM's vocabulary and optimized (without weight updates) to extend capabilities on targeted skills. It first notes that off-the-shelf models already contain tokens linked to procedural knowledge, then reports that learned neologisms improve performance on specific skills, compose with out-of-distribution skills, and that independently trained neologisms can be combined zero-shot, suggesting a path to scalable skill-based continual learning.

Significance. If the empirical claims hold, the approach offers a modular alternative to fine-tuning (which risks forgetting) and context-only methods (which are limited in expressiveness). The zero-shot composability of separately optimized neologisms would be a notable strength for additive skill extension in LLMs.

major comments (3)

[Abstract and §4] Abstract and §4 (Experiments): the central claims of improved targeted skills, composability, and zero-shot composition of independently trained neologisms are asserted without any reported quantitative metrics, baselines, task definitions, or error bars; this directly undermines verification of the data-to-claim link for the no-interference assumption.
[§3 and §5] §3 (Method) and §5 (Composition results): the optimization of neologism embeddings is described at a high level but contains no analysis or ablation showing that the learned embeddings affect only the target skill rather than globally altering attention patterns or token associations; this is load-bearing for the claim that independently trained neologisms can be added without degrading unrelated tasks.
[§5.2] §5.2 (Zero-shot composition): the evaluation of multi-neologism composition reports positive outcomes on the target skills but does not measure or control for performance degradation on a held-out set of unrelated tasks after insertion, leaving the weakest assumption (no interference) untested.

minor comments (2)

[Abstract] The abstract would be strengthened by naming the specific skills used and at least one key quantitative result.
[§3] Clarify the precise parameterization of a skill neologism (e.g., whether it is a single embedding vector or a short sequence) and how it is inserted into the vocabulary matrix.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the detailed review. We agree that the manuscript would benefit from more rigorous quantitative reporting and additional analyses to support the claims about skill neologisms. We will revise the paper accordingly to address these points.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): the central claims of improved targeted skills, composability, and zero-shot composition of independently trained neologisms are asserted without any reported quantitative metrics, baselines, task definitions, or error bars; this directly undermines verification of the data-to-claim link for the no-interference assumption.

Authors: We appreciate this observation. Upon review, we recognize that while the experiments section includes some performance indicators, the presentation lacks explicit baselines, error bars, and clear task definitions in the abstract and summary. In the revised manuscript, we will add quantitative metrics with baselines (e.g., standard prompting and other PEFT methods), report means and standard deviations over multiple seeds, and provide precise task definitions. This will better substantiate the claims and the no-interference aspect. revision: yes
Referee: [§3 and §5] §3 (Method) and §5 (Composition results): the optimization of neologism embeddings is described at a high level but contains no analysis or ablation showing that the learned embeddings affect only the target skill rather than globally altering attention patterns or token associations; this is load-bearing for the claim that independently trained neologisms can be added without degrading unrelated tasks.

Authors: We agree that demonstrating the locality of the effect is important. We will include new ablations in the revised version, such as comparing attention maps before and after neologism insertion, and measuring performance on unrelated tasks to show that changes are skill-specific rather than global. revision: yes
Referee: [§5.2] §5.2 (Zero-shot composition): the evaluation of multi-neologism composition reports positive outcomes on the target skills but does not measure or control for performance degradation on a held-out set of unrelated tasks after insertion, leaving the weakest assumption (no interference) untested.

Authors: This is a valid point. We will expand the experiments in §5.2 to evaluate the composed neologisms on a held-out set of unrelated tasks, reporting any changes in performance to directly test for interference. If degradation is observed, we will discuss it; otherwise, it will support the claim. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical claims on skill neologisms

full rationale

The paper advances an empirical method for extending LLM capabilities via optimized soft tokens (skill neologisms) without weight updates. Its central results consist of experimental observations: off-the-shelf models already encode procedural knowledge in certain tokens, learned neologisms improve targeted skills, and independently trained neologisms compose zero-shot with out-of-distribution skills. No derivation chain, first-principles prediction, or equation is presented that reduces to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no fitted parameters are relabeled as independent predictions. The claims remain falsifiable through external evaluation on held-out tasks and do not rely on self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The central innovation is the introduction of skill neologisms; no explicit free parameters, standard axioms, or additional invented entities are detailed in the abstract.

invented entities (1)

skill neologism no independent evidence
purpose: Soft token integrated into LLM vocabulary and optimized for a specific skill to extend capabilities without weight updates
Newly proposed construct to enable selective, composable skill extension.

pith-pipeline@v0.9.0 · 5466 in / 1030 out tokens · 30141 ms · 2026-05-08T17:22:13.566506+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 13 canonical work pages · 5 internal anchors

[1]

Phi-4 Technical Report

Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905,

work page internal anchor Pith review arXiv
[2]

A theory for emergence of complex skills in language models

Arora, S. and Goyal, A. A theory for emergence of complex skills in language models.arXiv preprint arXiv:2307.15936,

work page arXiv
[3]

Skills-in-context prompting: Unlocking compositionality in large language models.arXiv preprint arXiv:2308.00304,

Chen, J., Pan, X., Yu, D., Song, K., Wang, X., Yu, D., and Chen, J. Skills-in-context prompting: Unlocking compositionality in large language models.arXiv preprint arXiv:2308.00304,

work page arXiv
[4]

W., Grau-Moya, J., Ruoss, A., Orseau, L., and Hutter, M

Genewein, T., Li, K. W., Grau-Moya, J., Ruoss, A., Orseau, L., and Hutter, M. Understanding prompt tuning and in-context learning via meta-learning.arXiv preprint arXiv:2505.17010,

work page arXiv
[5]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review arXiv
[6]

STAT: Skill- targeted adaptive training

He, Y ., Panigrahi, A., Lin, Y ., and Arora, S. STAT: Skill- targeted adaptive training. InThe 5th Workshop on Math- ematical Reasoning and AI at NeurIPS 2025,

2025
[7]

Taco: Topics in algorithmic code generation dataset.arXiv preprint arXiv:2312.14852, 2023

Li, R., Fu, J., Zhang, B.-W., Huang, T., Sun, Z., Lyu, C., Liu, G., Jin, Z., and Li, G. Taco: Topics in algorithmic code generation dataset.arXiv preprint arXiv:2312.14852,

work page arXiv
[8]

Ministral 3

Liu, A. H., Khandelwal, K., Subramanian, S., Jouault, V ., Rastogi, A., Sad ´e, A., Jeffares, A., Jiang, A., Cahill, A., Gavaudan, A., et al. Ministral 3.arXiv preprint arXiv:2601.08584,

work page internal anchor Pith review arXiv
[9]

Backpropagation through time and the brain.Current Opinion in Neurobiology, 55:82–89, 2019

ISSN 2666-6510. doi: https://doi.org/10.1016/j. aiopen.2023.08.012. Liu, Z., Liu, Q., Guo, T., Chen, J., Huang, S., Zhao, X., Tang, J., Luo, W., and Weng, J. Xes3g5m: A knowledge tracing benchmark dataset with auxiliary information. NeurIPS,

work page doi:10.1016/j 2023
[10]

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Qi, X., Zeng, Y ., Xie, T., Chen, P.-Y ., Jia, R., Mittal, P., and Henderson, P. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693,

work page internal anchor Pith review arXiv
[11]

Qwen2.5 Technical Report

URL https: //arxiv.org/abs/2412.15115. Radevski, G., Gashteovski, K., Hong, G., Lawrence, C., and Glava ˇs, G. Compositional steering of large lan- guage models with steering tokens.arXiv preprint arXiv:2601.05062,

work page internal anchor Pith review arXiv
[12]

and Ros ´a, A

Sastre, I. and Ros ´a, A. Memory tokens: Large language models can generate reversible sentence embeddings. arXiv preprint arXiv:2506.15001,

work page arXiv
[13]

M., Turner, R

Wang, Z., Lamb, A., Saveliev, E., Cameron, P., Zaykov, Y ., Hern´andez-Lobato, J. M., Turner, R. E., Baraniuk, R. G., Barton, C., Jones, S. P., et al. Instructions and guide for diagnostic questions: The neurips 2020 education challenge.arXiv preprint arXiv:2007.12061,

work page arXiv 2020
[14]

Mme-reasoning: A comprehensive benchmark for logical reasoning in mllms.arXiv preprint arXiv:2505.21327, 2025

Yuan, J., Peng, T., Jiang, Y ., Lu, Y ., Zhang, R., Feng, K., Fu, C., Chen, T., Bai, L., Zhang, B., et al. Mme-reasoning: A comprehensive benchmark for logical reasoning in mllms. arXiv preprint arXiv:2505.21327,

work page arXiv
[15]

Zhao et al

aims to elicit compositional abilities in LLMs by providing in-context a description of skills and step-by-step explanation on how to compose them. Zhao et al. (2024) show that training LLMs 11 Skill Neologisms: Towards Skill-based Continual Learning 0.0 0.5 1.0 C3(Snew, Σtrain) C3(Snew, Sheld-out) 0.32 0.70 0.70 LoRA PT Skill Neologisms 0.0 0.5 1.0 LoRA ...

2024
[16]

Didolkar et al

aims to improve model capabilities by uncovering specific skills lacking from the model, and targeting these skills via either reweighting or synthetic data augmentations. Didolkar et al. (2024) demonstrated that LLMs have the ability to describe skills required by a given task, while Kaur et al. (2025) leveraged such metacognition abilities of LLMs to cr...

2024
[17]

In prompt compression, memory tokens (Sastre & Ros´a, 2025; Kuratov et al.,

represents tools via tokens integrated in the model vocabulary. In prompt compression, memory tokens (Sastre & Ros´a, 2025; Kuratov et al.,

2025
[18]

Recently, Radevski et al

replace prompts with gist tokens that preserve downstream model behavior. Recently, Radevski et al. (2026) proposed learning composable steering tokens for behavioral alignment. To the best of our knowledge, our work is the first to learn composable soft tokens that encapsulate specific procedural knowledge. C. Extended Limitations Skill-centered dataset ...

2026