Compositional Steering of Large Language Models with Steering Tokens

Carolin Lawrence; Giwon Hong; Goran Glavaš; Gorjan Radevski; Kiril Gashteovski

arXiv: 2601.05062 · v2 · submitted 2026-01-08 · cs.CL · cs.AI · cs.LG

Gorjan Radevski, Kiril Gashteovski, Giwon Hong, Carolin Lawrence, Goran Glavaš

Pith reviewed 2026-05-16 16:09 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG

keywords compositional steering · steering tokens · multi-behavior control · LLM steering · input token embeddings · zero-shot composition · self-distillation · constraint satisfaction

The pith

Steering tokens distilled from natural-language instructions, combined through a learned composition token, give stronger multi-behavior control of LLMs than instructions, activation steering, or LoRA merging.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to make LLMs produce outputs that satisfy several constraints at once, such as specific length, format, structure, and language. It does this by turning natural language instructions into dedicated input tokens through self-distillation, then training one additional composition token on pairs of behaviors. The composition token learns to combine behaviors and generalizes to mixes never seen in training, including new behaviors and different counts of behaviors. Experiments across model families show these tokens outperform plain instructions, activation-space steering, and LoRA merging on verifiable constraints, and they add further gains when used together with instructions.

Core claim

Individual behaviors expressed as instructions are distilled into steering tokens that live in the input token space. A separate composition token is then trained on pairs of these behaviors so that it captures the operation of composition itself. Because the steers reside in token space, arbitrary combinations become possible at inference time without further training, and the composition token extends to unseen behaviors and unseen numbers of behaviors. This yields stronger adherence to multiple verifiable constraints than competing single-behavior or merging techniques.
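
To make the token-space idea concrete, here is a minimal sketch, not the authors' implementation. It assumes that each steering token and the composition token is a single learned embedding vector, and that a composed steer is built by interleaving behavior tokens with the composition token after the prompt embeddings. The function name `build_steered_input`, the interleaved layout, and the toy 2-dimensional vectors are all assumptions made for illustration; the abstract does not say where the tokens are placed.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def build_steered_input(
    prompt_embeddings: Sequence[Vector],
    steering_tokens: Dict[str, Vector],
    composition_token: Vector,
    behaviors: Sequence[str],
) -> List[Vector]:
    """Append steering tokens for `behaviors` to the prompt embeddings.

    Hypothetical layout (an assumption, not taken from the paper):
    prompt, steer_1, <comp>, steer_2, <comp>, steer_3, ...
    """
    steered = [list(vec) for vec in prompt_embeddings]
    for index, name in enumerate(behaviors):
        if index > 0:
            # The same composition token joins every adjacent pair, so no
            # new parameters are needed for three or more behaviors.
            steered.append(list(composition_token))
        steered.append(list(steering_tokens[name]))
    return steered


# Toy 2-dimensional embeddings, purely illustrative.
prompt = [[0.1, 0.2], [0.3, 0.4]]
steers = {"short": [1.0, 0.0], "json": [0.0, 1.0], "french": [1.0, 1.0]}
comp = [0.5, 0.5]

pair = build_steered_input(prompt, steers, comp, ["short", "json"])
triple = build_steered_input(prompt, steers, comp, ["short", "json", "french"])
print(len(pair), len(triple))
```

Under this assumed layout, moving from two to three behaviors only lengthens the input sequence; nothing is retrained, which is the property the core claim depends on.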

What carries the argument

Compositional steering tokens: input tokens that encode single behaviors plus a learned composition token that represents their combination and generalizes beyond the training pairs.

If this is right

Steering tokens generalize to compositions containing behaviors absent from the training data.
The same composition token works for groups of two, three, or more behaviors without retraining.
Pairing steering tokens with ordinary natural-language instructions produces additive improvements over either alone.
The advantage holds across different LLM architectures for constraints on length, format, structure, and language.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The token-space approach may reduce the need for per-composition fine-tuning by allowing new mixes to be assembled on the fly.
Extending the same idea to non-verifiable behaviors such as tone or factual accuracy would test whether the composition mechanism remains reliable outside measurable constraints.
If the composition token can be trained on higher-order tuples, the method could scale to richer instruction sets without exponential growth in training data.
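
For a rough sense of scale on the last point, the sketch below counts how many distinct unordered compositions exist for a given number of behaviors. The choice of 8 behaviors is arbitrary and not taken from the paper; the counts are plain binomial coefficients.

```python
from math import comb


def composition_counts(num_behaviors: int) -> dict:
    """Count distinct unordered compositions by cardinality (size >= 2)."""
    return {k: comb(num_behaviors, k) for k in range(2, num_behaviors + 1)}


counts = composition_counts(8)  # 8 behaviors is an arbitrary example
pairs_only = counts[2]
all_sizes = sum(counts.values())
print(f"pairs: {pairs_only}, all compositions of size >= 2: {all_sizes}")
# prints "pairs: 28, all compositions of size >= 2: 247"
```

With 8 behaviors there are 28 pairs but 247 compositions of any size, so a composition token that transfers from pairs to larger groups would avoid training on most of that space.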

Load-bearing premise

The composition token learns a general notion of how behaviors combine and applies that notion correctly to combinations and behaviors never encountered during training.

What would settle it

A test set of three-behavior combinations in which no constituent pair appeared during composition-token training; if the outputs fail to satisfy all three constraints at once on this set, the generalization claim is falsified.
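
A test set of this kind could be built along the following lines. This is a sketch under stated assumptions: the behavior names and training pairs are hypothetical, and `held_out_triples` is an illustrative helper, not something described in the paper.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set, Tuple


def held_out_triples(
    behaviors: Iterable[str],
    training_pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return triples in which no constituent pair was seen in training."""
    seen: Set[FrozenSet[str]] = {frozenset(pair) for pair in training_pairs}
    triples = []
    for triple in combinations(sorted(behaviors), 3):
        constituent_pairs = (frozenset(p) for p in combinations(triple, 2))
        if not any(pair in seen for pair in constituent_pairs):
            triples.append(triple)
    return triples


# Hypothetical behavior names and training pairs, for illustration only.
behaviors = ["bullets", "french", "json", "short", "uppercase"]
training_pairs = [("short", "json"), ("french", "bullets"), ("json", "uppercase")]

test_triples = held_out_triples(behaviors, training_pairs)
print(test_triples)
# prints [('bullets', 'short', 'uppercase'), ('french', 'short', 'uppercase')]
```

Pairs are stored as frozensets so that the order in which a training pair was written does not matter. Only 2 of the 10 possible triples survive the filter here, which shows how quickly the held-out set shrinks as more pairs are used for training.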

Original abstract

Deploying LLMs in real-world applications requires controllable output that satisfies multiple desiderata at the same time. While existing work extensively addresses LLM steering for a single behavior, compositional steering (i.e., steering LLMs simultaneously towards multiple behaviors) remains an underexplored problem. In this work, we propose compositional steering tokens for multi-behavior steering. We first embed individual behaviors, expressed as natural language instructions, into dedicated tokens via self-distillation. Contrary to most prior work, which operates in the activation space, our behavior steers live in the space of input tokens, enabling more effective zero-shot composition. We then train a dedicated composition token on pairs of behaviors and show that it successfully captures the notion of composition: it generalizes well to unseen compositions, including those with unseen behaviors as well as those with an unseen number of behaviors. Our experiments across different LLM architectures show that steering tokens lead to superior multi-behavior steering of verifiable constraints (e.g., length, format, structure, language) compared to competing approaches (instructions, activation steering, and LoRA merging). Moreover, we show that steering tokens complement natural language instructions, with their combination resulting in further gains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

Steering tokens plus a learned composition token give a clean way to handle multiple behaviors at once, but the generalization to unseen counts still needs the numbers to back it up.

The core move here is putting individual behaviors into dedicated input tokens via self-distillation, then training one extra composition token on pairs so it can handle new combinations and even new numbers of behaviors without retraining. That is actually new relative to the activation-space steering papers they cite, and it lets them keep everything in the token space where composition is more straightforward. They also show the tokens work alongside plain instructions and give extra gains when combined, which is the kind of practical detail that matters for deployment. The experiments claim better results than instructions, activation steering, and LoRA merging on verifiable constraints like length, format, and language, and they test across a few model families. That framing is useful for anyone who needs simultaneous constraints rather than one at a time.

The soft spot is the generalization claim. The composition token is trained only on pairs, yet it is supposed to handle triples and unseen behaviors; if the training regime or the metrics do not cleanly separate memorization from a real composition operator, the gains could be narrower than they look. The abstract gives no numbers or controls, so it is hard to judge how much the chosen constraints (length, format) drive the results versus a general mechanism.

This is for groups working on controllable generation or multi-objective alignment. A reader who already follows steering work will pick up the token-space angle quickly. I would send it to peer review because the problem is real and the approach is distinct, even if the experiments will need careful checking on the zero-shot cardinality part.

Referee Report

2 major / 1 minor

Summary. The paper proposes compositional steering tokens for multi-behavior control of LLMs. Behaviors expressed as natural-language instructions are first distilled into dedicated input tokens; a separate composition token is then trained on pairs of these behavior tokens. The authors claim this composition token learns a general composition operator that zero-shot generalizes to unseen behaviors and to unseen cardinalities (e.g., three-way compositions after pair-only training). Experiments across LLM families reportedly show that steering tokens outperform natural-language instructions, activation steering, and LoRA merging on verifiable constraints (length, format, structure, language) and that combining steering tokens with instructions yields further gains.

Significance. If the generalization result is robust, the work would supply a token-space alternative to activation-space steering that supports compositional multi-constraint control with comparatively little additional training. Such a mechanism could meaningfully improve controllable generation for applications that must satisfy several simultaneous, verifiable requirements.

major comments (2)

[Abstract and Experiments] The load-bearing claim that the composition token generalizes to unseen cardinalities (e.g., three behaviors after training only on pairs) is stated in the abstract but lacks any description of the training regime, the exact evaluation protocol, or the metrics used to distinguish genuine composition from memorization of specific pairs. The experimental section must therefore supply (i) confirmation that no higher-cardinality examples were seen during training of the composition token and (ii) quantitative results showing that downstream constraint satisfaction, not merely token-level accuracy, improves under these conditions.
[Experiments] The superiority claims versus instructions, activation steering, and LoRA merging are asserted without any numerical results, baseline configurations, or statistical controls visible in the abstract. The experimental section must report effect sizes, variance across runs, and ablation controls that isolate the contribution of the composition token itself.

minor comments (1)

[Abstract] The phrase 'self-distillation' is used without a reference or short procedural outline; a one-sentence clarification or citation would improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We will revise the manuscript to clarify the training regime, evaluation protocol, and quantitative results as requested, strengthening the presentation of our generalization claims and experimental comparisons.

Referee: [Abstract and Experiments] The load-bearing claim that the composition token generalizes to unseen cardinalities (e.g., three behaviors after training only on pairs) is stated in the abstract but lacks any description of the training regime, the exact evaluation protocol, or the metrics used to distinguish genuine composition from memorization of specific pairs. The experimental section must therefore supply (i) confirmation that no higher-cardinality examples were seen during training of the composition token and (ii) quantitative results showing that downstream constraint satisfaction, not merely token-level accuracy, improves under these conditions.

Authors: We agree that these details require explicit expansion. The composition token was trained exclusively on pairs of behavior tokens with no higher-cardinality examples present. In the revised version we will add a dedicated subsection confirming this training regime, describing the zero-shot evaluation on unseen cardinalities (3+ behaviors), and reporting metrics based on verifiable downstream constraint satisfaction (exact length/format/structure compliance rates) rather than token-level accuracy. This protocol is designed to distinguish composition from memorization. revision: yes
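
As an illustration of the kind of metric this response describes, the sketch below computes per-constraint compliance rates and the joint rate at which every constraint holds at once. The two checkers (`max_words`, `is_json_object`) and the sample outputs are made up for this example; they are not the paper's constraints, evaluation code, or results.

```python
import json
from typing import Callable, Dict, List

Check = Callable[[str], bool]


def max_words(limit: int) -> Check:
    """Length constraint: at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def is_json_object(text: str) -> bool:
    """Format constraint: the whole output parses as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def compliance_rates(outputs: List[str], checks: Dict[str, Check]) -> Dict[str, float]:
    """Per-constraint pass rates plus the rate at which all checks pass."""
    if not outputs:
        raise ValueError("outputs must be non-empty")
    results = [{name: check(text) for name, check in checks.items()} for text in outputs]
    rates = {name: sum(r[name] for r in results) / len(outputs) for name in checks}
    rates["all"] = sum(all(r.values()) for r in results) / len(outputs)
    return rates


# Made-up outputs, not results from the paper.
outputs = [
    '{"answer": "yes"}',
    '{"answer": "a much longer reply than the limit allows for"}',
    "yes",
    '["yes"]',
]
checks = {"length": max_words(5), "format": is_json_object}
rates = compliance_rates(outputs, checks)
print(rates)
# prints {'length': 0.75, 'format': 0.5, 'all': 0.25}
```

The joint rate (0.25) is lower than either individual rate, which is why a multi-behavior evaluation should report the joint figure and not only per-constraint averages.
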
Referee: [Experiments] The superiority claims versus instructions, activation steering, and LoRA merging are asserted without any numerical results, baseline configurations, or statistical controls visible in the abstract. The experimental section must report effect sizes, variance across runs, and ablation controls that isolate the contribution of the composition token itself.

Authors: We will fully expand the experimental section to include all numerical results with effect sizes, standard deviations across runs, complete baseline configurations, and ablation studies that isolate the composition token. Statistical significance tests will also be reported to support the superiority claims over instructions, activation steering, and LoRA merging. revision: yes
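
One way such a comparison could be reported is sketched below: the mean and standard deviation of per-seed score differences, plus an exact sign-flip permutation test on the paired differences. The scores are invented for illustration, `paired_summary` is a hypothetical helper, and the permutation test is one reasonable choice of significance test, not necessarily the one the authors would use.

```python
from itertools import product
from statistics import mean, stdev
from typing import Sequence


def paired_summary(method: Sequence[float], baseline: Sequence[float]) -> dict:
    """Summarize per-seed score differences between two methods."""
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need two equally long sequences with at least 2 runs")
    diffs = [m - b for m, b in zip(method, baseline)]
    observed = mean(diffs)
    # Exact sign-flip permutation test: under the null hypothesis each
    # paired difference is equally likely to be positive or negative.
    flipped_means = [
        mean([sign * d for sign, d in zip(signs, diffs)])
        for signs in product((1, -1), repeat=len(diffs))
    ]
    # Small tolerance so floating-point noise does not drop exact ties.
    extreme = sum(abs(value) >= abs(observed) - 1e-12 for value in flipped_means)
    return {
        "mean_diff": observed,
        "std_diff": stdev(diffs),
        "p_value": extreme / len(flipped_means),
    }


# Invented joint-compliance scores over five seeds, not results from the paper.
steering_tokens = [0.72, 0.75, 0.70, 0.74, 0.73]
instructions = [0.65, 0.69, 0.66, 0.68, 0.66]

summary = paired_summary(steering_tokens, instructions)
print(summary)
```

With five seeds that all favor one method, the smallest two-sided p-value this test can produce is 2/32 = 0.0625, so a handful of runs cannot by itself support a significance claim at the usual 0.05 level.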

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on experimental comparisons without self-referential reductions

The paper presents no mathematical derivations, equations, or fitted parameters that reduce to their own inputs by construction. All central claims regarding the superiority of steering tokens for multi-behavior steering and the generalization of the composition token to unseen behaviors and cardinalities are supported exclusively by empirical experiments comparing against baselines (instructions, activation steering, LoRA merging). No self-citations, ansatzes, or uniqueness theorems are invoked in a load-bearing way that would make results tautological; the work is self-contained through verifiable experimental outcomes rather than definitional or fitted circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach assumes behaviors expressed as natural language can be faithfully distilled into token embeddings that remain composable; no independent evidence for this assumption is supplied in the abstract.

axioms (1)

domain assumption: Individual behaviors can be embedded into dedicated tokens via self-distillation without loss of controllability
Core premise that allows moving steering into token space; stated in the first paragraph of the abstract.

invented entities (1)

composition token (no independent evidence)
purpose: Captures the notion of composition across multiple behaviors
New token trained on behavior pairs; no external falsifiable handle provided in abstract.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Skill Neologisms: Towards Skill-based Continual Learning
cs.LG 2026-05 unverdicted novelty 6.0

Skill neologisms are optimized soft tokens that improve LLM performance on targeted skills without weight updates and allow zero-shot composition for continual learning.
Skill Neologisms: Towards Skill-based Continual Learning
cs.LG 2026-05 unverdicted novelty 6.0

Skill neologisms are optimized soft tokens that enhance specific LLM skills and support zero-shot composition on synthetic and Skill-Mix tasks.