arxiv: 2604.11129 · v1 · submitted 2026-04-13 · 💻 cs.CL

DeCoVec: Building Decoding Space based Task Vector for Large Language Models via In-Context Learning

Pith reviewed 2026-05-10 15:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords DeCoVec · task vectors · in-context learning · decoding space · LLM steering · few-shot prompting · logit distributions · non-invasive steering

The pith

DeCoVec steers LLMs by adding the difference between few-shot and zero-shot logit distributions directly into decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeCoVec as a way to capture task-specific behavior in the decoding space of large language models without training or internal changes. It treats the task essence as the vector difference between the model's output logit distributions on few-shot prompts versus zero-shot prompts. Injecting this vector back into the decoding step improves generation on reasoning and truthfulness tasks across multiple model sizes. The method also reduces certain output flaws and works regardless of example order while adding no extra input tokens.

Core claim

DeCoVec builds a task vector in decoding space by subtracting the zero-shot output logit distribution from the few-shot one. Adding this vector to the logits at each decoding step steers the model toward more accurate task behavior, producing consistent gains over standard few-shot prompting on TruthfulQA, Math-500, and AQUA-RAT for models from 0.5B to 9B parameters while suppressing degeneration and logical errors.
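
The construction above can be written compactly. This is a sketch under stated assumptions, not the paper's own notation: z_t^fs and z_t^zs stand for the next-token logits at decoding step t under the few-shot and zero-shot prompts, the steered logits are assumed to start from the few-shot logits, and \lambda is the strength hyperparameter named in the Figure 4 caption (\lambda = 1 would correspond to adding the vector unscaled).

```latex
\begin{align}
v_t &= z_t^{\mathrm{fs}} - z_t^{\mathrm{zs}}
  && \text{(task vector at step } t\text{)} \\
\tilde{z}_t &= z_t^{\mathrm{fs}} + \lambda\, v_t
  = (1+\lambda)\, z_t^{\mathrm{fs}} - \lambda\, z_t^{\mathrm{zs}}
  && \text{(steered logits, assumed base: few-shot)} \\
\operatorname{softmax}(\tilde{z}_t)_i &\propto
  \frac{\bigl(p_t^{\mathrm{fs}}(i)\bigr)^{1+\lambda}}{\bigl(p_t^{\mathrm{zs}}(i)\bigr)^{\lambda}}
  && \text{(same rule in probability space)}
\end{align}
```

The last line follows because logits equal log-probabilities up to an additive constant that softmax ignores. Under these assumptions, tokens that the demonstrations make more likely than the zero-shot prompt does are amplified, and tokens the demonstrations make less likely are suppressed.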

What carries the argument

The DeCoVec vector itself: the difference between few-shot and zero-shot output logit distributions, added to the model's next-token logits during generation to steer behavior.
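
A single decoding step of this mechanism can be illustrated with a toy example. The sketch below is an assumption-laden illustration rather than the authors' implementation: the function names `decovec_step`, `softmax`, and `argmax` are hypothetical, the logits are made-up numbers over a three-token vocabulary, and the vector is assumed to be added to the few-shot logits with a strength `lam`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decovec_step(few_shot_logits, zero_shot_logits, lam=1.0):
    """Return steered logits for one decoding step.

    Assumption: the task vector is added to the few-shot logits, scaled by lam.
    """
    task_vector = [f - z for f, z in zip(few_shot_logits, zero_shot_logits)]
    return [f + lam * v for f, v in zip(few_shot_logits, task_vector)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits over a 3-token vocabulary (illustration only).
few_shot = [2.0, 1.8, 0.1]
zero_shot = [2.5, 0.5, 0.2]
steered = decovec_step(few_shot, zero_shot, lam=1.0)

print(argmax(few_shot), argmax(steered))  # 0 1
print([round(p, 3) for p in softmax(steered)])
```

In this toy case the demonstrations raise token 1 by 1.3 logits and lower token 0 by 0.5 relative to the zero-shot prompt, so adding the vector moves the top choice from token 0 to token 1. Setting `lam=0.0` reduces the step to plain few-shot decoding.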

If this is right

  • The steered outputs show higher accuracy on truthfulness and math reasoning benchmarks than standard few-shot prompting.
  • Generation degeneration and logical flaws are reduced without any model modification.
  • Performance remains stable across different orders of the in-context examples.
  • No additional input tokens are required beyond those in the few-shot prompt.
  • The approach applies across a range of model sizes from 0.5B to 9B parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same logit-difference construction might be tested on tasks beyond the three benchmarks, such as code completion or summarization.
  • DeCoVec could be combined with other non-invasive steering methods that operate on logits or activations.
  • If task directions prove linearly separable in logit space, similar subtractions might work for other output representations like hidden states at earlier layers.

Load-bearing premise

The difference between few-shot and zero-shot logit distributions encodes a stable task direction that improves performance when added during decoding without creating new errors.
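
One possible way to probe the "stable task direction" part of this premise is to build the vector from several different demonstration sets and check how well the resulting directions agree. The sketch below is a hypothetical diagnostic, not something the paper reports: `task_vector`, `cosine`, and `mean_pairwise_cosine` are illustrative names, and the logits are made-up numbers.

```python
import math
from itertools import combinations


def task_vector(few_shot_logits, zero_shot_logits):
    """Few-shot minus zero-shot logits for one decoding step."""
    return [f - z for f, z in zip(few_shot_logits, zero_shot_logits)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Made-up logits: one zero-shot step and three few-shot steps, each few-shot
# step standing in for a different demonstration set (illustration only).
zero_shot = [2.5, 0.5, 0.2]
few_shot_by_demo_set = [
    [2.0, 1.8, 0.1],
    [2.1, 1.6, 0.2],
    [1.9, 1.9, 0.0],
]
vectors = [task_vector(fs, zero_shot) for fs in few_shot_by_demo_set]
stability = mean_pairwise_cosine(vectors)
print(round(stability, 3))
```

A value near 1 would suggest that different demonstration sets push the logits in a similar direction, while a value near 0 would suggest the vector mostly reflects the specific examples. Agreement in direction is only a proxy; it does not by itself show that accuracy improves.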

What would settle it

If adding the computed DeCoVec vector to decoding logits produces no accuracy gain or lowers performance compared with plain few-shot prompting on TruthfulQA, Math-500, or AQUA-RAT, the central claim would be falsified.
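
This criterion could be made operational by comparing steered and few-shot accuracy over matched runs. The sketch below assumes one accuracy value per random seed for each method and computes the mean paired difference and a paired t statistic; the function name `paired_t_statistic` is hypothetical, and the accuracy values are placeholders, not results from the paper.

```python
import math
import statistics


def paired_t_statistic(baseline, steered):
    """Mean paired difference and paired t statistic (steered minus baseline)."""
    if len(baseline) != len(steered) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 runs")
    diffs = [s - b for s, b in zip(steered, baseline)]
    mean_diff = statistics.mean(diffs)
    spread = statistics.stdev(diffs)  # sample standard deviation
    if spread == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean_diff, mean_diff / (spread / math.sqrt(len(diffs)))


# Placeholder per-seed accuracies (NOT taken from the paper).
few_shot_acc = [0.50, 0.52, 0.51]
steered_acc = [0.53, 0.54, 0.55]

mean_diff, t_stat = paired_t_statistic(few_shot_acc, steered_acc)
print(round(mean_diff, 4), round(t_stat, 3))
```

A mean difference at or below zero on any of the three benchmarks would match the falsification condition stated above. Turning the t statistic into a p-value requires the t distribution with n − 1 degrees of freedom; `scipy.stats.ttest_rel` is one existing routine that does this.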

Figures

Figures reproduced from arXiv: 2604.11129 by Feiyang Li, Yile Wang.

Figure 1: Schematic of traditional methods vs. ours. (a) Traditional task vector in model space. (b)-(c) Our DeCoVec via in-context learning in decoding space.
Figure 2: Overview of task vector methods across spaces and our DeCoVec. Top: Comparison of task vector construction and steering. (i) Weight space vector. (ii) Activation space vector. (iii) Our proposed DeCoVec in the decoding space. Bottom: Illustration of DeCoVec pipeline. (a) Context construction. (b) Task vector building and steering. (c) Resulting improvement in decoding output (correct ✓ vs. incorrect ✗).
Figure 3: Impact of demonstration settings on TruthfulQA. Solid lines: DeCoVec; dashed lines: few-shot baselines. Left: Sensitivity to candidate pool size. Right: Scalability with respect to the number of shots.
Figure 4: Sensitivity analysis on hyperparameter λ. (a) TruthfulQA remains robust to large λ (≥ 1.0); (b,c) Math-500 and AQUA-RAT performance degrades if λ is poorly calibrated. Appendix A lists the calibrated values.

    Table extracted alongside this caption:

    Methods       LR      DG      CA      CM      Total Err.
    KATE          35.4%   9.4%    6.5%    5.3%    56.6%
    w/ DeCoVec    27.0%   7.3%    8.8%    6.9%    50.0%
    ∆             ↓8.4%   ↓2.1%   ↑2.3%   ↑1.6%   ↓6.6%
Figure 5: The specific prompt used for automated error classification using an LLM evaluator.
Figure 7: Prompt templates used for TruthfulQA dataset
Figure 8: Prompt templates used for Math-500 dataset
Figure 9: Prompt templates used for AQUA-RAT dataset experiments
Figure 11: Generation output of the Qwen2-7B model on the Math500 algebraic simplification problem under the Zero-shot setting.

    Text extracted alongside this caption:

    Solution 2
    441 + 2(21)(19) + 361 = x
    441 + 816 + 361 = x
    1618 = x
    #### 1618
    Analysis: [CM] Misapplies the algebraic structure. Direct multiplication leads to errors (798 → 816)
Figure 12: Generation output of the Qwen2-7B model on the same algebraic problem under the Few-shot (KATE) setting.

    Text extracted alongside this caption:

    Solution 3
    441 and 361 are both perfect squares. 441 = 21^2 and 361 = 19^2. Let's substitute these values into the expression:
    441 + 2(21)(19) + 361 = x
    21^2 + 2(21)(19) + 19^2 = x
    (21 + 19)^2 = x
    40^2 = x
    1600 = x
    So, x = 1600.
    #### 1600
    Analysis: Avoids [CM]. Recognizes the algebraic structure, simplifyi…
Figure 13: Generation output of the Qwen2-7B model steered by DeCoVec on the algebraic problem, showing the correct symbolic reasoning and an analysis of how it avoids previous errors
Figure 14: A representative question from the Math500
Figure 16: Generation output of the Qwen2-7B model on the same modular arithmetic problem under the Few-shot (KATE) setting.

    Text extracted alongside this caption:

    Solution 3
    27.47 = 2747 cents
    35.23 = 3523 cents
    37.37 = 3737 cents
    26.52 = 2652 cents
    2747 + 3523 + 3737 + 2652 = 12659 cents
    12659 / 100 = 126.59
    126.59 - 126 = 0.59
    0.59 * 100 = 59 cents
    #### 59
    Analysis: Avoids [CM]. Converts to integer cents before summing, eliminating floating-point erro…
Figure 17: Generation output of the Qwen2-7B model steered by DeCoVec on the modular arithmetic problem, showing the adoption of an integer-based strategy and an analysis of the successful reasoning
Original abstract

Task vectors, representing directions in model or activation spaces that encode task-specific behaviors, have emerged as a promising tool for steering large language models (LLMs). However, existing approaches typically require fine-tuning or invasive manipulation of internal states, limiting their flexibility and scalability. We propose DeCoVec (Decoding Space based Task Vector), a training-free and non-invasive framework that constructs task vectors directly in the decoding space by leveraging in-context learning (ICL). Specifically, DeCoVec captures the task essence as the difference between the output logit distributions of few-shot and zero-shot prompts, then steers generation by injecting this vector into the decoding process. Experiments across seven LLMs (0.5B–9B) on TruthfulQA, Math-500, and AQUA-RAT show that DeCoVec consistently outperforms standard few-shot baselines, with gains up to +5.50 average accuracy. Further analysis demonstrates that DeCoVec effectively suppresses generation degeneration and logical flaws while exhibiting strong robustness to demonstration ordering, all without incurring additional input token costs. Our method offers a training-free and non-invasive solution for LLM steering without requiring weight updates or auxiliary models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes DeCoVec, a training-free framework that constructs task vectors directly in the decoding space of LLMs by computing the difference between output logit distributions from few-shot and zero-shot prompts via in-context learning. This vector is then injected into the decoding process to steer generation. Experiments on TruthfulQA, Math-500, and AQUA-RAT across seven models (0.5B–9B) report consistent accuracy gains over standard few-shot baselines (up to +5.50 points), with additional claims of reduced degeneration, logical flaw suppression, and robustness to demonstration ordering, all without extra token costs or weight updates.

Significance. If the central mechanism holds, DeCoVec offers a lightweight, non-invasive alternative to activation-space or fine-tuning-based task vectors, which could improve practical ICL steering across model scales. The multi-model, multi-task evaluation and emphasis on decoding-space operations are strengths, as is the absence of auxiliary models. However, the modest gains and unresolved questions about generality limit the immediate impact; stronger evidence that the difference vector encodes reusable task properties (rather than demo artifacts) would elevate its significance.

major comments (3)
  1. [§3] §3 (Method), the vector construction and injection: the difference logit_fewshot − logit_zeroshot is presented as the core task vector, yet the manuscript provides no explicit equation for scaling, normalization, or the precise addition formula during decoding; without these details the claimed parameter-free property and the link from vector to observed gains cannot be verified or reproduced.
  2. [§4.2] §4.2 (Main results), Tables 1–2: reported accuracy improvements (including the +5.50 peak) lack error bars, multiple random seeds, or statistical significance tests; this is load-bearing because the central claim of consistent outperformance over few-shot baselines rests on these numbers being reliable rather than variance-driven.
  3. [§4.3] §4.3 (Robustness analysis): while ordering robustness is shown, there is no ablation on alternative demonstration selections or contents; because the vector is derived from concrete few-shot examples, this omission leaves open the possibility that gains partly reflect demo-specific biases rather than a generalizable task direction, directly testing the weakest assumption in the central claim.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'gains up to +5.50 average accuracy' is ambiguous—clarify whether the average is across models, tasks, or both.
  2. [§5] §5 (Further analysis): the claim that DeCoVec suppresses 'generation degeneration and logical flaws' would benefit from a quantitative metric or side-by-side example table rather than qualitative description alone.
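
For minor comment 2, one commonly used proxy for repetitive degeneration is the share of repeated n-grams in a generated output. The sketch below is only an illustration of such a metric and is not the measure used in the paper: the function name `repeated_ngram_rate` is hypothetical, tokenization by whitespace is an assumption, and the example strings are made up.

```python
def repeated_ngram_rate(tokens, n=3):
    """Fraction of n-grams that duplicate an n-gram seen elsewhere in the output.

    Returns 0.0 when the sequence is too short to contain any n-gram.
    """
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(total)]
    return 1.0 - len(set(ngrams)) / total


# Made-up outputs, split on whitespace (illustration only).
looping = "the answer is the answer is the answer is".split()
clean = "x equals sixteen hundred".split()

print(round(repeated_ngram_rate(looping), 3))  # 0.571
print(repeated_ngram_rate(clean))              # 0.0
```

Reporting such a rate for few-shot and steered outputs side by side would give a number to accompany the qualitative examples. It captures only surface repetition, so logical flaws would still need a separate measure.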

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the updated manuscript.

  1. Referee: [§3] §3 (Method), the vector construction and injection: the difference logit_fewshot − logit_zeroshot is presented as the core task vector, yet the manuscript provides no explicit equation for scaling, normalization, or the precise addition formula during decoding; without these details the claimed parameter-free property and the link from vector to observed gains cannot be verified or reproduced.

    Authors: We agree that the method description would be strengthened by an explicit formulation. In the revised manuscript we will add the equation defining the task vector as v_task = logit_few-shot − logit_zero-shot and state that this vector is added directly to the output logits at each decoding step (logits' = logits + v_task) with no scaling or normalization applied. We will also include pseudocode in §3 to make the parameter-free claim and the injection procedure fully reproducible. revision: yes

  2. Referee: [§4.2] §4.2 (Main results), Tables 1–2: reported accuracy improvements (including the +5.50 peak) lack error bars, multiple random seeds, or statistical significance tests; this is load-bearing because the central claim of consistent outperformance over few-shot baselines rests on these numbers being reliable rather than variance-driven.

    Authors: This is a valid point; single-run results leave open the possibility that observed gains are affected by sampling variance. We will rerun the main experiments across at least three random seeds, report mean accuracy with standard-deviation error bars in Tables 1 and 2, and add paired statistical significance tests (e.g., t-tests) against the few-shot baselines to substantiate the reliability of the improvements. revision: yes

  3. Referee: [§4.3] §4.3 (Robustness analysis): while ordering robustness is shown, there is no ablation on alternative demonstration selections or contents; because the vector is derived from concrete few-shot examples, this omission leaves open the possibility that gains partly reflect demo-specific biases rather than a generalizable task direction, directly testing the weakest assumption in the central claim.

    Authors: We concur that testing robustness to demonstration content and selection is necessary to support the claim that the vector encodes reusable task properties. We will add an ablation study that constructs DeCoVec using multiple distinct but valid demonstration sets for each task and reports the resulting accuracy; this will directly examine whether performance gains persist across different example choices rather than being tied to specific demonstration artifacts. revision: yes
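
Response 1 above mentions pseudocode for the injection procedure. One way it could look is sketched below, under assumptions that are not confirmed by the source: the function name `decovec_greedy_decode` and the `next_token_logits` callable are hypothetical, decoding is greedy, the vector is added to the few-shot logits, and the generated token is appended to both contexts. `lam=1.0` corresponds to the unscaled addition described in the response, and `toy_model` is a made-up stand-in for an LLM.

```python
from typing import Callable, List, Sequence

LogitFn = Callable[[Sequence[int]], List[float]]


def decovec_greedy_decode(next_token_logits: LogitFn,
                          few_shot_ids: Sequence[int],
                          zero_shot_ids: Sequence[int],
                          eos_id: int,
                          lam: float = 1.0,
                          max_new_tokens: int = 32) -> List[int]:
    """Greedy decoding with the few-shot minus zero-shot vector added each step."""
    few_ctx, zero_ctx = list(few_shot_ids), list(zero_shot_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        z_few = next_token_logits(few_ctx)    # logits given demonstrations
        z_zero = next_token_logits(zero_ctx)  # logits without demonstrations
        # Assumed injection rule: few-shot logits plus lam times the difference.
        steered = [f + lam * (f - z) for f, z in zip(z_few, z_zero)]
        token = max(range(len(steered)), key=steered.__getitem__)
        generated.append(token)
        if token == eos_id:
            break
        # Both contexts receive the same generated token.
        few_ctx.append(token)
        zero_ctx.append(token)
    return generated


def toy_model(context):
    """Hypothetical stand-in for an LLM over a 4-token vocabulary.

    ids: 0 = question, 1 = answer, 2 = end-of-sequence, 3 = demonstration marker.
    """
    if context[-1] == 1:
        return [0.0, 0.0, 5.0, -5.0]   # after the answer, prefer to stop
    if 3 in context:
        return [0.2, 1.2, 1.0, -5.0]   # with demonstrations
    return [0.2, 0.2, 1.5, -5.0]       # without demonstrations


output = decovec_greedy_decode(toy_model, few_shot_ids=[3, 0],
                               zero_shot_ids=[0], eos_id=2)
print(output)  # [1, 2]
```

As written, the sketch queries the model twice per generated token, once per context. With a real model the two contexts would typically be batched and their key-value caches reused, which is an implementation detail outside what the source describes.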

Circularity Check

0 steps flagged

No circularity: task vector defined observationally and tested empirically

The paper defines DeCoVec as the direct difference between few-shot and zero-shot output logit distributions, then injects this vector during decoding and measures accuracy gains on held-out benchmarks. No equations, fitted parameters, or self-citations are shown that would reduce the claimed steering improvement to a tautology or to an input by construction. The central claim remains an empirical hypothesis about generalizability, evaluated against standard few-shot baselines across multiple models and tasks. The argument is self-contained and is checked against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that the empirical logit difference between few-shot and zero-shot prompts isolates task-specific information usable for steering; no free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: The difference between the output logit distributions of few-shot and zero-shot prompts captures the task essence.
    This definition is the load-bearing step that turns an observable difference into a steering vector.

pith-pipeline@v0.9.0 · 5525 in / 1467 out tokens · 49901 ms · 2026-05-10T15:42:23.722275+00:00 · methodology

