Skill Neologisms: Towards Skill-based Continual Learning

Antonin Berthon; Mihaela van der Schaar; Nicolas Astorga

REVIEW 2 major objections 2 minor 15 references

Skill neologisms let LLMs learn new skills through optimized soft tokens that compose zero-shot without weight updates.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-20 23:28 UTC pith:SPXEBG6S

Load-bearing objection: Skill neologisms let you train soft tokens for separate skills and compose them zero-shot, but the work stops short of testing true sequential addition over time.

arXiv:2605.04970v2 · pith:SPXEBG6S · submitted 2026-05-06 · cs.LG, cs.AI

keywords: skill neologisms · continual learning · zero-shot composition · soft tokens · large language models · Skill-Mix benchmark · procedural knowledge · catastrophic forgetting

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces skill neologisms as soft tokens added to an LLM vocabulary and tuned to strengthen performance on one targeted skill. These tokens can be optimized separately for distinct skills and then used together at inference time on tasks that require their combination. The method is shown to work on a synthetic task where neologisms improve specific skills and combine with unfamiliar ones, and it is checked on the Skill-Mix benchmark for natural language. This route avoids the forgetting that comes with fine-tuning and the context limits of prompting, offering a modular way to keep expanding what a model can do.

Core claim

Skill neologisms are soft tokens integrated in the model's vocabulary and optimized to improve capabilities over a specific skill. Pre-trained LLMs already exhibit tokens associated with procedural knowledge. On a controlled synthetic task, skill neologisms can be learned to improve model capabilities on specific skills while being composable with out-of-distribution skills, and independently trained skill neologisms can be composed zero-shot. This zero-shot composition is validated on the Skill-Mix benchmark in a natural language setting, suggesting skill neologisms provide a scalable path towards skill-based continual learning.

What carries the argument

Skill neologisms: soft tokens added to the vocabulary and optimized for one procedural skill, then composed at test time.
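A minimal sketch of the mechanism described above: one new embedding row is appended to a frozen table, and gradient steps touch only that row. Everything concrete here is an assumption made for illustration, not the authors' setup: the toy mean-pool-plus-logistic "model", the cross-entropy loss, the single token per skill, and the names `train_neologism`, `base`, and `readout` are all hypothetical. The review itself notes that the paper's actual loss and token count are not reported.

```python
import math
import random

def train_neologism(base_table, readout, examples, dim, steps=200, lr=0.5, seed=0):
    """Append one trainable row to a frozen embedding table and fit only that row.

    base_table: frozen embedding rows (never modified here).
    readout:    frozen weight vector standing in for the rest of the model.
    examples:   list of (token_ids, label); id len(base_table) is the new token.
    """
    rng = random.Random(seed)
    new_id = len(base_table)
    new_row = [rng.uniform(-0.01, 0.01) for _ in range(dim)]  # only trainable parameters

    def embed(tok):
        return new_row if tok == new_id else base_table[tok]

    def predict(token_ids):
        # Toy frozen "model" (assumption): mean-pool embeddings, logistic readout.
        pooled = [sum(embed(t)[k] for t in token_ids) / len(token_ids) for k in range(dim)]
        z = sum(w * p for w, p in zip(readout, pooled))
        return 1.0 / (1.0 + math.exp(-z))

    def loss():
        eps = 1e-9
        total = 0.0
        for x, y in examples:
            p = predict(x)
            total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        return total / len(examples)

    history = [loss()]
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in examples:
            share = x.count(new_id) / len(x)  # weight of the new row in the pooled vector
            err = predict(x) - y              # d(loss)/d(logit) for cross-entropy
            for k in range(dim):
                grad[k] += err * readout[k] * share / len(examples)
        for k in range(dim):
            new_row[k] -= lr * grad[k]        # the update touches the new row only
        history.append(loss())
    return new_row, history

# Toy usage: token id 2 is the neologism; rows 0 and 1 stay frozen.
base = [[0.2, -0.1], [0.0, 0.3]]
readout = [1.0, -1.0]
examples = [([0, 2], 1), ([1, 2], 1), ([0, 1], 0)]
row, history = train_neologism(base, readout, examples, dim=2)
```

Because the base rows and the readout are never written to, behaviour on inputs that do not contain the new token is unchanged by construction, which is the property the no-forgetting argument relies on.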

Load-bearing premise

That each skill neologism can be optimized on its own without creating interference that only shows up when many are added one after another over time.

What would settle it

A clear drop in performance on tasks that combine several skills, once a long sequence of independently trained skill neologisms has been added to the vocabulary.

If this is right

LLMs can gain new skills by adding tokens rather than retraining weights, avoiding catastrophic forgetting.
Skills learned separately can be combined on the fly for tasks that require more than one at once.
The approach scales by letting each new skill be acquired without touching previous ones.
Models can handle skill combinations never seen together during any training step.
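The composition step implied by the points above can be sketched as simple bookkeeping. If each skill is trained in its own run, every run would plausibly place its new rows at the same index (the end of the base vocabulary), so combining them requires assigning distinct ids before they can appear in one prompt. This is an illustrative assumption rather than the paper's procedure: the helper names `merge_neologisms` and `compose_prompt`, and the choice to prefix the prompt with the skill tokens, are hypothetical.

```python
def merge_neologisms(base_vocab_size, trained):
    """Assign distinct ids to neologism rows that were trained in separate runs.

    trained: dict mapping a skill name to its list of embedding rows. Each run is
             assumed to have used ids starting at base_vocab_size, so they collide.
    Returns (extra_rows, id_map), where id_map[skill] lists the new global ids.
    """
    extra_rows, id_map = [], {}
    next_id = base_vocab_size
    for skill in sorted(trained):                 # sorted for a deterministic layout
        rows = trained[skill]
        id_map[skill] = list(range(next_id, next_id + len(rows)))
        extra_rows.extend(list(r) for r in rows)  # copied as-is; nothing is re-trained
        next_id += len(rows)
    return extra_rows, id_map

def compose_prompt(token_ids, skills, id_map):
    """Prefix a tokenized prompt with the ids of every requested skill neologism."""
    prefix = [i for s in skills for i in id_map[s]]
    return prefix + list(token_ids)

# Toy usage with a hypothetical base vocabulary of 100 tokens.
trained = {"xor": [[0.1, 0.2]], "add": [[0.3, 0.4], [0.5, 0.6]]}
extra, ids = merge_neologisms(100, trained)
prompt = compose_prompt([5, 6], ["xor", "add"], ids)
```

Nothing in the merge step changes the learned vectors, which is what makes the composition zero-shot: whether the resulting prompt actually elicits both skills is the empirical question the paper tests.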

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This token-based modularity might extend to chaining skills into longer procedures if composition remains stable at greater depth.
Vocabulary growth through neologisms could eventually reduce reliance on very long contexts for skill recall.
Automatic selection of which skills to encode as neologisms could become a practical next step for open-ended model expansion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes skill neologisms—soft tokens added to an LLM's vocabulary and optimized to encode specific skills—as a method to extend model capabilities without weight updates or catastrophic forgetting. It reports that pre-trained LLMs already contain tokens linked to procedural knowledge, demonstrates on a controlled synthetic task that independently optimized skill neologisms improve targeted skills and compose zero-shot with out-of-distribution skills, and validates zero-shot composition on the Skill-Mix benchmark for natural language tasks.

Significance. If the empirical findings hold under more rigorous testing, the work identifies a potentially scalable, modular route to skill acquisition that sidesteps the stability-plasticity dilemma in continual learning. The zero-shot composition result is a concrete strength, as it shows that separate optimization runs can yield composable embeddings without joint training.

major comments (2)

[§4] (Experimental evaluation): The central claim concerns skill-based continual learning, yet the reported protocol trains each skill neologism in an independent run and then composes the resulting embeddings; no experiments add neologisms sequentially while re-testing prior compositions after each addition. Interference effects (e.g., embedding crowding or loss of skill specificity) that would appear only under sequential accumulation therefore remain unexamined.
[§3.2] (Optimization of skill neologisms): The manuscript does not report the precise loss used to optimize the neologism embeddings, the number of tokens allocated per skill, or any regularization that would prevent interference with the existing vocabulary; without these details the claim that optimization can be performed independently is difficult to evaluate.
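The sequential protocol requested in the first major comment could look roughly like the sketch below: skills are added one at a time, and every previously tested pair is re-scored after each addition so that late-appearing interference becomes visible. This is a hedged illustration, not an experiment from the paper. The `evaluate` callback, the `tolerance` threshold, and the stub scorer `crowding_stub` (which simply invents a small accuracy decay per extra token) are all hypothetical.

```python
from itertools import combinations

def sequential_accumulation(skills, evaluate, tolerance=0.05):
    """Add skills one at a time and re-test every earlier pair after each addition.

    evaluate(active, pair) -> accuracy in [0, 1] is a caller-supplied, hypothetical
    callback that scores the composition `pair` with the neologisms in `active` loaded.
    Returns (log, regressions): log[step][pair] holds accuracies, and regressions
    lists (step, pair, first_score, score) when a pair drops by more than tolerance.
    """
    active, log, first_seen, regressions = [], [], {}, []
    for step, skill in enumerate(skills):
        active.append(skill)
        scores = {}
        for pair in combinations(active, 2):
            score = evaluate(tuple(active), pair)
            scores[pair] = score
            baseline = first_seen.setdefault(pair, score)  # score when the pair first appeared
            if baseline - score > tolerance:
                regressions.append((step, pair, baseline, score))
        log.append(scores)
    return log, regressions

def crowding_stub(active, pair):
    # Made-up scorer for demonstration only: accuracy decays as tokens accumulate.
    return 0.9 - 0.04 * (len(active) - 2)

log, regressions = sequential_accumulation(["xor", "add", "neg", "rev"], crowding_stub)
```

With a real evaluator in place of the stub, an empty `regressions` list over a long skill sequence would support the load-bearing premise, while a growing list would be the kind of evidence the falsifier describes.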

minor comments (2)

[Abstract] The abstract and introduction could more explicitly separate the synthetic-task results from the Skill-Mix validation so that readers can assess the degree of generalization.
[Figures] Figure captions and axis labels should state the exact metric (e.g., accuracy or normalized score) and the number of random seeds used for each bar or curve.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of our experimental design and implementation details. We address each major comment below and outline the revisions we will make to the manuscript.

Referee: [§4] §4 (Experimental evaluation): The central claim concerns skill-based continual learning, yet the reported protocol trains each skill neologism in an independent run and then composes the resulting embeddings; no experiments add neologisms sequentially while re-testing prior compositions after each addition. Interference effects (e.g., embedding crowding or loss of skill specificity) that would appear only under sequential accumulation therefore remain unexamined.

Authors: We agree that sequential addition of neologisms with re-testing of prior compositions would provide a more complete test of interference effects in a continual learning setting. Our experiments were designed to first establish that independently optimized skill neologisms can improve targeted skills and compose zero-shot, which directly supports the core idea of modular skill acquisition without weight updates. This isolates the contribution of the neologism approach from joint training effects. In the revised manuscript, we will add a dedicated paragraph in §4 and the discussion section acknowledging this limitation, explaining why independent optimization was prioritized, and outlining future work on sequential accumulation to examine potential crowding or specificity loss.

Revision: partial

Referee: [§3.2] (Optimization of skill neologisms): The manuscript does not report the precise loss used to optimize the neologism embeddings, the number of tokens allocated per skill, or any regularization that would prevent interference with the existing vocabulary; without these details the claim that optimization can be performed independently is difficult to evaluate.

Authors: We thank the referee for identifying this gap in reporting. These implementation details were omitted from the main text. In the revised version of §3.2, we will explicitly state the loss function used to optimize the neologism embeddings, the number of tokens allocated per skill, and any regularization terms employed to limit interference with the existing vocabulary. This will make the independent optimization procedure fully transparent and reproducible.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

The paper is entirely empirical and presents no closed-form derivation, first-principles equations, or parameter-fitting procedure whose outputs are then relabeled as predictions. Claims rest on direct measurements of zero-shot composition performance on a controlled synthetic task and the external Skill-Mix benchmark; these quantities are not defined in terms of the authors' own fitted values or prior self-citations. No load-bearing step reduces to a self-definition, fitted-input renaming, or uniqueness theorem imported from the same authors. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that soft-token optimization can isolate skill representations without side effects on the base model, plus standard transformer assumptions about token embeddings and attention.

free parameters (1)

skill neologism embedding vectors
These are learned parameters optimized per skill; their values are not derived from first principles.

axioms (1)

domain assumption: Pre-trained LLMs already contain tokens associated with procedural knowledge that can be extended via soft tokens.
Stated as an initial observation in the abstract.

invented entities (1)

skill neologism (no independent evidence)
purpose: A soft token that encodes a specific skill for zero-shot composition.
Newly introduced construct without external falsifiable evidence provided in the abstract.

pith-pipeline@v0.9.0 · 5720 in / 1306 out tokens · 38893 ms · 2026-05-20T23:28:47.199286+00:00 · methodology

Original abstract

Modern LLMs show mastery over an ever-growing range of skills, as well as the ability to compose them flexibly. However, extending model capabilities to new skills in a scalable manner is an open problem: fine-tuning and parameter-efficient variants risk catastrophic forgetting, while context-based approaches have limited expressiveness and are constrained by the model's effective context. We explore skill neologisms--soft tokens integrated in the model's vocabulary and optimized to improve capabilities over a specific skill--as a way to selectively acquire new skills without weight updates. We first observe that pre-trained LLMs already exhibit tokens associated with procedural knowledge. We then show on a controlled synthetic task that skill neologisms can be learned to improve model capabilities on specific skills while being composable with out-of-distribution skills, and that independently trained skill neologisms can be composed zero-shot. Finally, we validate zero-shot composition of independently learned skill neologisms on the more realistic natural language setting of the Skill-Mix benchmark. These results suggest that skill neologisms may provide a scalable path towards skill-based continual learning.

Figures

Figures reproduced from arXiv: 2605.04970 by Antonin Berthon, Mihaela van der Schaar, Nicolas Astorga.

**Figure 1.** Overview of Skill Neologisms. In-context learning has shown some success at skill composition in simple settings (Chen et al., 2023; Levy et al., 2023; Xu et al., 2024), but it does not adapt as well as PEFT methods (Liu et al., 2022a), and does not scale because of effective context limitations (Hsieh et al., 2024). Prompt tuning (Lester et al., 2021) can adapt models to new tasks by only …

**Figure 2.** Overview of Skill Neologisms. (A) We consider a pretrained model endowed with a set of implicit skills learned during pretraining. (B) A skill-centered dataset contains snippets of text that require at least the skill of interest, composed with pretraining skills. (C) Skill neologisms append new token embeddings to the model's vocabulary and embedding matrix, which are trained on the skill-centered dataset …

**Figure 3.** Accuracy on XOR and XNOR completion tasks across open-source models under different prompts. Only Examples provides three input–output examples before the query. + Text Description adds a natural-language description of the operation (e.g., "output 1 iff the input bits differ" for XOR). + Keyword adds only the operation name ("XOR" or "XNOR"). Results are averaged over N = 100 samples; error bars show stan…

**Figure 4.** Accuracy on 2-combinations of skills mixing S_new with Σ_train (in-distribution) or S_held-out (out-of-distribution). Dotted lines show the average accuracy across all S_held-out. PT: Prompt Tuning. This distinguishes skill neologisms from LoRA and Prompt Tuning-like approaches, which cannot be composed after independent training without retraining on the joint task. We compare against in-context lea…

**Figure 5.** Average accuracy across different S_held-out for Skill Neologisms and N for ICL, for different sequence lengths (increasing task difficulty). Traces for individual runs are shown in Figure A3 in the Appendix. Skill neologisms significantly outperform ICL across all sequence lengths. This demonstrates that skill neologisms successfully capture reusable procedural knowledge that transfers zero-shot t…

**Figure 6.** Effect of skill token length. Accuracy on 2-skill combinations with S_held-out = ADD for varying skill token length l. Takeaway: The limited capacity of skill neologisms acts as an inductive bias to learn more composable skill representations. 5.3.2. Composition complexity in training set. The complexity of skill combinations in the training set provides another source of inductive bias: exposing skill tok…

**Figure 7.** Learning two target skills on Skill-Mix. "No training": vanilla Llama3.2-3B-Instruct model. "Trained on S_new": each model is trained on an S_new-centered dataset mixing S_new with other skills. "Using both skill neologisms": combining the independently learned skill tokens zero-shot. "Both Skills" corresponds to successfully producing a text that includes both skills. Grading is done with GPT-5. Prompts for i…

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 4 internal anchors

[1]

Phi-4 Technical Report

Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905.

[2]

A theory for emergence of complex skills in language models

Arora, S. and Goyal, A. A theory for emergence of complex skills in language models. arXiv preprint arXiv:2307.15936, 2023.

[3]

Skills-in-context prompting: Unlocking compositionality in large language models

Chen, J., Pan, X., Yu, D., Song, K., Wang, X., Yu, D., and Chen, J. Skills-in-context prompting: Unlocking compositionality in large language models. arXiv preprint arXiv:2308.00304.

[4]

Understanding prompt tuning and in-context learning via meta-learning

Genewein, T., Li, K. W., Grau-Moya, J., Ruoss, A., Orseau, L., and Hutter, M. Understanding prompt tuning and in-context learning via meta-learning. arXiv preprint arXiv:2505.17010.

[5]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[6]

Taco: Topics in algorithmic code generation dataset

Li, R., Fu, J., Zhang, B.-W., Huang, T., Sun, Z., Lyu, C., Liu, G., Jin, Z., and Li, G. Taco: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852.

[7]

Ministral 3

Liu, A. H., Khandelwal, K., Subramanian, S., Jouault, V., Rastogi, A., Sadé, A., Jeffares, A., Jiang, A., Cahill, A., Gavaudan, A., et al. Ministral 3. arXiv preprint arXiv:2601.08584.

[8]

Compositional Steering of Large Language Models with Steering Tokens

Radevski, G., Gashteovski, K., Hong, G., Lawrence, C., and Glavaš, G. Compositional steering of large language models with steering tokens. arXiv preprint arXiv:2601.05062.

[9]

Instructions and Guide for Diagnostic Questions: The NeurIPS 2020 Education Challenge

Wang, Z., Lamb, A., Saveliev, E., Cameron, P., Zaykov, Y., Hernández-Lobato, J. M., Turner, R. E., Baraniuk, R. G., Barton, C., Jones, S. P., et al. Instructions and guide for diagnostic questions: The NeurIPS 2020 education challenge. arXiv preprint arXiv:2007.12061, 2020.

[10]

MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

Yuan, J., Peng, T., Jiang, Y., Lu, Y., Zhang, R., Feng, K., Fu, C., Chen, T., Bai, L., Zhang, B., et al. MME-Reasoning: A comprehensive benchmark for logical reasoning in MLLMs. arXiv preprint arXiv:2505.21327.

[11]

Extended Results A.1

A. Extended Results. A.1. Model pre-training. Figure A1 shows the accuracy of M_pretrain after pre-training (same as Table 4), across sequence lengths and operations. Sequence lengths {2,3,4,6,8} are in-distribution, while lengths {5,7,9} were held-out from pre-training data. The model successfully ...

[12]

Zhao et al

aims to elicit compositional abilities in LLMs by providing in-context descriptions of skills and step-by-step explanations on how to compose them. Zhao et al. (2024) show that training LLMs on skill-rich synthetic datasets improve compositional abilities, even on held-out skills unseen during training. STAT (He et al.,

[13]

Didolkar et al

aims to improve model capabilities by uncovering specific skills lacking from the model, and targeting these skills via either reweighting or synthetic data augmentations. Didolkar et al. (2024) demonstrated that LLMs have the ability ...

[14]

In prompt compression, memory tokens (Sastre & Rosá, 2025; Kuratov et al.,

represents tools via tokens integrated in the model vocabulary. In prompt compression, memory tokens (Sastre & Rosá, 2025; Kuratov et al.,

[15]

Recently, Radevski et al

replace prompts with gist tokens that preserve downstream model behavior. Recently, Radevski et al. (2026) proposed learning composable steering tokens for behavioral alignment. To the best of our knowledge, our work is the first to learn composable soft tokens that encapsulate specific procedural knowledge. C. Extended Limitations Skill-centered dataset ...
