DeCoVec: Building Decoding Space based Task Vector for Large Language Models via In-Context Learning
Pith reviewed 2026-05-10 15:42 UTC · model grok-4.3
The pith
DeCoVec steers LLMs by adding the difference between few-shot and zero-shot logit distributions directly into decoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeCoVec builds a task vector in decoding space by subtracting the zero-shot output logit distribution from the few-shot one. Adding this vector to the logits at each decoding step steers the model toward more accurate task behavior, producing consistent gains over standard few-shot prompting on TruthfulQA, Math-500, and AQUA-RAT for models from 0.5B to 9B parameters while suppressing degeneration and logical errors.
What carries the argument
The DeCoVec vector: the difference between the few-shot and zero-shot output logit distributions, added to the model's next-token logits during generation to steer behavior.
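The mechanism can be illustrated with a minimal sketch. It is not the authors' implementation and rests on three assumptions the text here does not confirm: the vector is recomputed at every decoding step, the few-shot logits are the base being steered, and the next token is chosen greedily. All names (`task_vector`, `steered_logits`, `greedy_decode`, `LogitFn`, and the toy logit functions) are hypothetical, and the toy numbers are made up.

```python
from typing import Callable, List, Sequence

Logits = List[float]
# Hypothetical interface: maps the token ids generated so far to next-token logits.
LogitFn = Callable[[Sequence[int]], Logits]


def task_vector(fewshot_logits: Logits, zeroshot_logits: Logits) -> Logits:
    """Few-shot logits minus zero-shot logits, element-wise over the vocabulary."""
    if len(fewshot_logits) != len(zeroshot_logits):
        raise ValueError("both logit vectors must cover the same vocabulary")
    return [f - z for f, z in zip(fewshot_logits, zeroshot_logits)]


def steered_logits(base_logits: Logits, vector: Logits) -> Logits:
    """Add the task vector to the base logits, with no scaling or normalization."""
    if len(base_logits) != len(vector):
        raise ValueError("base logits and task vector must have the same length")
    return [b + v for b, v in zip(base_logits, vector)]


def greedy_decode(fewshot_fn: LogitFn, zeroshot_fn: LogitFn,
                  eos_id: int, max_new_tokens: int = 16) -> List[int]:
    """Greedy decoding in which every step's logits are shifted by the task vector."""
    generated: List[int] = []
    for _ in range(max_new_tokens):
        few = fewshot_fn(generated)
        zero = zeroshot_fn(generated)
        # ASSUMPTION: the few-shot logits are the base that gets steered.
        scores = steered_logits(few, task_vector(few, zero))
        next_id = max(range(len(scores)), key=scores.__getitem__)
        generated.append(next_id)
        if next_id == eos_id:
            break
    return generated


# Toy 3-token vocabulary (id 2 is end-of-sequence); the numbers are made up.
def toy_fewshot(generated: Sequence[int]) -> Logits:
    return [1.0, 1.5, 0.0] if not generated else [0.0, 0.0, 2.0]


def toy_zeroshot(generated: Sequence[int]) -> Logits:
    return [2.0, 1.0, 0.0]


print(greedy_decode(toy_fewshot, toy_zeroshot, eos_id=2))  # prints [1, 2]
```

In the toy run, the zero-shot model prefers token 0 at the first step, the few-shot model mildly prefers token 1, and the added difference widens that preference before the end-of-sequence token is selected at the second step.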
If this is right
- The steered outputs show higher accuracy on truthfulness and math reasoning benchmarks than standard few-shot prompting.
- Generation degeneration and logical flaws are reduced without any model modification.
- Performance remains stable across different orders of the in-context examples.
- No additional input tokens are required beyond those in the few-shot prompt.
- The approach applies across a range of model sizes from 0.5B to 9B parameters.
Where Pith is reading between the lines
- The same logit-difference construction might be tested on tasks beyond the three benchmarks, such as code completion or summarization.
- DeCoVec could be combined with other non-invasive steering methods that operate on logits or activations.
- If task directions prove linearly separable in logit space, similar subtractions might work for other representations, such as hidden states at earlier layers.
Load-bearing premise
The difference between few-shot and zero-shot logit distributions encodes a stable task direction that improves performance when added during decoding without creating new errors.
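One hypothetical way to probe this premise, not something the paper is reported to do, is to compare the task vectors obtained from different demonstration sets: if they encode a stable direction, their cosine similarity should be high. The sketch below assumes the vectors are already available as plain lists of floats; `cosine_similarity` and the placeholder vectors are illustrative names and made-up values.

```python
import math
from typing import List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Placeholder task vectors from three hypothetical demonstration sets.
vec_set_a = [1.0, -0.5, 0.0]
vec_set_b = [2.0, -1.0, 0.0]   # same direction as set A, twice the magnitude
vec_set_c = [0.5, 1.0, 0.0]    # orthogonal to set A

print(round(cosine_similarity(vec_set_a, vec_set_b), 6))  # prints 1.0
print(round(cosine_similarity(vec_set_a, vec_set_c), 6))  # prints 0.0
```

High similarity would be consistent with a shared task direction but would not by itself show that adding the vector improves accuracy.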
What would settle it
If adding the computed DeCoVec vector to decoding logits produces no accuracy gain or lowers performance compared with plain few-shot prompting on TruthfulQA, Math-500, or AQUA-RAT, the central claim would be falsified.
Original abstract
Task vectors, representing directions in model or activation spaces that encode task-specific behaviors, have emerged as a promising tool for steering large language models (LLMs). However, existing approaches typically require fine-tuning or invasive manipulation of internal states, limiting their flexibility and scalability. We propose DeCoVec (Decoding Space based Task Vector), a training-free and non-invasive framework that constructs task vectors directly in the decoding space by leveraging in-context learning (ICL). Specifically, DeCoVec captures the task essence as the difference between the output logit distributions of few-shot and zero-shot prompts, then steers generation by injecting this vector into the decoding process. Experiments across seven LLMs (0.5B–9B) on TruthfulQA, Math-500, and AQUA-RAT show that DeCoVec consistently outperforms standard few-shot baselines, with gains up to +5.50 average accuracy. Further analysis demonstrates that DeCoVec effectively suppresses generation degeneration and logical flaws while exhibiting strong robustness to demonstration ordering, all without incurring additional input token costs. Our method offers a training-free and non-invasive solution for LLM steering without requiring weight updates or auxiliary models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DeCoVec, a training-free framework that constructs task vectors directly in the decoding space of LLMs by computing the difference between output logit distributions from few-shot and zero-shot prompts via in-context learning. This vector is then injected into the decoding process to steer generation. Experiments on TruthfulQA, Math-500, and AQUA-RAT across seven models (0.5B–9B) report consistent accuracy gains over standard few-shot baselines (up to +5.50 points), with additional claims of reduced degeneration, logical flaw suppression, and robustness to demonstration ordering, all without extra token costs or weight updates.
Significance. If the central mechanism holds, DeCoVec offers a lightweight, non-invasive alternative to activation-space or fine-tuning-based task vectors, which could improve practical ICL steering across model scales. The multi-model, multi-task evaluation and emphasis on decoding-space operations are strengths, as is the absence of auxiliary models. However, the modest gains and unresolved questions about generality limit the immediate impact; stronger evidence that the difference vector encodes reusable task properties (rather than demo artifacts) would elevate its significance.
major comments (3)
- [§3] §3 (Method), the vector construction and injection: the difference logit_fewshot − logit_zeroshot is presented as the core task vector, yet the manuscript provides no explicit equation for scaling, normalization, or the precise addition formula during decoding; without these details the claimed parameter-free property and the link from vector to observed gains cannot be verified or reproduced.
- [§4.2] §4.2 (Main results), Tables 1–2: reported accuracy improvements (including the +5.50 peak) lack error bars, multiple random seeds, or statistical significance tests; this is load-bearing because the central claim of consistent outperformance over few-shot baselines rests on these numbers being reliable rather than variance-driven.
- [§4.3] §4.3 (Robustness analysis): while ordering robustness is shown, there is no ablation on alternative demonstration selections or contents; because the vector is derived from concrete few-shot examples, this omission leaves open the possibility that gains partly reflect demo-specific biases rather than a generalizable task direction, directly testing the weakest assumption in the central claim.
minor comments (2)
- [Abstract] Abstract: the phrase 'gains up to +5.50 average accuracy' is ambiguous—clarify whether the average is across models, tasks, or both.
- [§5] §5 (Further analysis): the claim that DeCoVec suppresses 'generation degeneration and logical flaws' would benefit from a quantitative metric or side-by-side example table rather than qualitative description alone.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the updated manuscript.
Point-by-point responses
- Referee: [§3] §3 (Method), the vector construction and injection: the difference logit_fewshot − logit_zeroshot is presented as the core task vector, yet the manuscript provides no explicit equation for scaling, normalization, or the precise addition formula during decoding; without these details the claimed parameter-free property and the link from vector to observed gains cannot be verified or reproduced.
Authors: We agree that the method description would be strengthened by an explicit formulation. In the revised manuscript we will add the equation defining the task vector as v_task = logit_fewshot − logit_zeroshot and state that this vector is added directly to the output logits at each decoding step (logits' = logits + v_task) with no scaling or normalization applied. We will also include pseudocode in §3 to make the parameter-free claim and the injection procedure fully reproducible.
Revision: yes
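Written out, the formulation in this response could look as follows. The symbols z, the step index t, and the choice of base logits are assumptions made for illustration; the response does not say which logits the vector is added to. The last line is a consequence that holds only if the base logits are the few-shot logits.

```latex
\begin{align}
  v_{\text{task}}^{(t)} &= z_{\text{few}}^{(t)} - z_{\text{zero}}^{(t)}
    && \text{(task vector at decoding step } t\text{)} \\
  \tilde{z}^{(t)} &= z^{(t)} + v_{\text{task}}^{(t)}
    && \text{(injection, no scaling or normalization)} \\
  \operatorname{softmax}\bigl(\tilde{z}^{(t)}\bigr)_i
    &\propto \frac{p_{\text{few}}^{(t)}(i)^{2}}{p_{\text{zero}}^{(t)}(i)}
    && \text{(only if } z^{(t)} = z_{\text{few}}^{(t)}\text{)}
\end{align}
```

The third line follows because exp(2 z_few,i − z_zero,i) = exp(z_few,i)² / exp(z_zero,i), and the softmax normalizers of the two distributions are constant in i, so they vanish on renormalization. Under that assumption, tokens that the demonstrations make more likely are boosted and tokens they make less likely are suppressed.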
- Referee: [§4.2] §4.2 (Main results), Tables 1–2: reported accuracy improvements (including the +5.50 peak) lack error bars, multiple random seeds, or statistical significance tests; this is load-bearing because the central claim of consistent outperformance over few-shot baselines rests on these numbers being reliable rather than variance-driven.
Authors: This is a valid point; single-run results leave open the possibility that observed gains are affected by sampling variance. We will rerun the main experiments across at least three random seeds, report mean accuracy with standard-deviation error bars in Tables 1 and 2, and add paired statistical significance tests (e.g., t-tests) against the few-shot baselines to substantiate the reliability of the improvements.
Revision: yes
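The reporting promised here can be sketched with the Python standard library. The accuracies below are placeholders invented for the example, not results from the paper, and `paired_t_statistic` and `mean_and_std` are hypothetical helper names. The sketch computes only the paired t statistic; turning it into a p-value needs a t distribution with n − 1 degrees of freedom, which a statistics package such as SciPy (`scipy.stats.ttest_rel`) provides.

```python
import math
import statistics
from typing import List, Tuple


def mean_and_std(values: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(baseline: List[float], steered: List[float]) -> float:
    """Paired t statistic for per-seed accuracy differences (steered - baseline)."""
    if len(baseline) != len(steered):
        raise ValueError("both conditions need one accuracy per seed")
    if len(baseline) < 2:
        raise ValueError("at least two seeds are required")
    diffs = [s - b for s, b in zip(steered, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# PLACEHOLDER accuracies for three seeds; not taken from the paper.
fewshot_acc = [50.0, 52.0, 51.0]
decovec_acc = [53.0, 54.0, 55.0]

print(mean_and_std(fewshot_acc))                                 # prints (51.0, 1.0)
print(round(paired_t_statistic(fewshot_acc, decovec_acc), 3))    # prints 5.196
```

With only three seeds there are two degrees of freedom, so even a large t statistic gives limited evidence; more seeds would tighten the estimate.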
- Referee: [§4.3] §4.3 (Robustness analysis): while ordering robustness is shown, there is no ablation on alternative demonstration selections or contents; because the vector is derived from concrete few-shot examples, this omission leaves open the possibility that gains partly reflect demo-specific biases rather than a generalizable task direction, directly testing the weakest assumption in the central claim.
Authors: We concur that testing robustness to demonstration content and selection is necessary to support the claim that the vector encodes reusable task properties. We will add an ablation study that constructs DeCoVec using multiple distinct but valid demonstration sets for each task and reports the resulting accuracy; this will directly examine whether performance gains persist across different example choices rather than being tied to specific demonstration artifacts.
Revision: yes
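A small harness for summarizing such an ablation might look like the sketch below. It assumes each demonstration set has already been evaluated and yields one correct/incorrect flag per test question; the function names and the flags are placeholders, not data from the paper.

```python
from typing import Dict, List, Tuple


def accuracy(flags: List[bool]) -> float:
    """Percentage of correct answers."""
    if not flags:
        raise ValueError("need at least one evaluated example")
    return 100.0 * sum(flags) / len(flags)


def demo_set_spread(results: Dict[str, List[bool]]) -> Tuple[Dict[str, float], float]:
    """Per-demonstration-set accuracy and the max-min spread across sets."""
    if not results:
        raise ValueError("need at least one demonstration set")
    per_set = {name: accuracy(flags) for name, flags in results.items()}
    spread = max(per_set.values()) - min(per_set.values())
    return per_set, spread


# PLACEHOLDER correctness flags for two hypothetical demonstration sets.
toy_results = {
    "demo_set_a": [True, True, False, True],
    "demo_set_b": [True, False, False, True],
}

per_set, spread = demo_set_spread(toy_results)
print(per_set)  # prints {'demo_set_a': 75.0, 'demo_set_b': 50.0}
print(spread)   # prints 25.0
```

A small spread across demonstration sets would support the reusable-task-direction reading; a large one would point toward demo-specific effects.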
Circularity Check
No circularity: task vector defined observationally and tested empirically
full rationale
The paper defines DeCoVec as the direct difference between few-shot and zero-shot output logit distributions, then injects this vector during decoding and measures accuracy gains on held-out benchmarks. No equations, fitted parameters, or self-citations are shown that reduce the claimed steering improvement to a tautology or input by construction. The central claim remains an empirical hypothesis about generalizability, evaluated against standard few-shot baselines across multiple models and tasks. The evaluation is therefore self-contained and checked against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the difference between the output logit distributions of few-shot and zero-shot prompts captures the task essence.