Beyond Multiple Choice: Evaluating Steering Vectors for Summarization

Carsten Eickhoff; Joschka Braun; Seyed Ali Bahrainian

arxiv: 2505.24859 · v3 · submitted 2025-05-30 · 💻 cs.LG · cs.CL


Joschka Braun, Carsten Eickhoff, Seyed Ali Bahrainian

Pith reviewed 2026-05-19 12:21 UTC · model grok-4.3


keywords: steering vectors, abstractive summarization, language model control, sentiment control, toxicity reduction, readability adjustment, control-quality trade-off


The pith

Steering vectors control targeted properties in summaries but trigger repetition and hallucinations at high strengths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether steering vectors can control topical focus, sentiment, toxicity, and readability in summaries generated from conversation, news, and scientific paper datasets. It shows that these vectors direct the output toward the desired properties when added to model activations during generation. Stronger steering, however, often produces repetitive or factually incorrect summaries, while simple prompting preserves quality but gives less precise control. Combining the two approaches at moderate steering levels achieves the most effective control without major quality losses. This matters for practical summarization with language models, where both accuracy and specific attributes such as positive tone or low toxicity are needed.

Core claim

Steering vectors, which add a learned bias to language model activations at inference time, can effectively control properties such as topical focus, sentiment, toxicity, and readability in abstractive summaries. Evaluations on the SAMSum, NEWTS, and arXiv datasets show that steering achieves the desired control, but high strengths lead to degenerate repetition and factual hallucinations. Prompting alone offers weaker control but preserves quality, and combining the two provides the strongest control with the best efficacy-quality trade-off at moderate strengths. Steering vectors therefore face a critical control-quality trade-off in free-form generation tasks.

What carries the argument

Steering vectors that add a learned bias to language model activations at inference time to adjust output properties like topical focus and sentiment.
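
As a rough illustration of the mechanism described above, the following minimal sketch adds a scaled vector to a small matrix of activations. It is not the authors' implementation: the function name `apply_steering`, the scalar `strength` multiplier, the choice to steer every token position, and the toy numbers are all assumptions made for illustration. The review does not say how the vector is learned or at which layer it is applied.

```python
from typing import List


def apply_steering(
    hidden: List[List[float]],
    steering_vector: List[float],
    strength: float,
) -> List[List[float]]:
    """Return activations with strength * steering_vector added at every position.

    `hidden` is a (positions x hidden_size) matrix of activations.
    Steering all positions with one scalar strength is an assumption here.
    """
    if any(len(row) != len(steering_vector) for row in hidden):
        raise ValueError("steering vector must match the hidden size")
    return [
        [h + strength * v for h, v in zip(row, steering_vector)]
        for row in hidden
    ]


# Toy values, chosen only to make the arithmetic easy to follow.
hidden = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
vector = [1.0, 0.0, -1.0]

steered = apply_steering(hidden, vector, strength=2.0)
unchanged = apply_steering(hidden, vector, strength=0.0)
```

With `strength=0.0` the activations are returned unchanged, which corresponds to the unsteered baseline; larger magnitudes push every position further along the vector's direction.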

If this is right

Targeted control over sentiment or toxicity becomes possible in summary generation without retraining the model.
Moderate steering strengths should be used to avoid introducing repetition or hallucinations.
Hybrid methods combining steering with prompting deliver superior results compared to either alone.
The approach applies across different domains including dialogues, news, and scientific texts.
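
One simple way to watch for the degenerate repetition mentioned in the points above is to measure how many n-grams in a summary are repeats. The sketch below is a hypothetical proxy, not the metric used in the paper: the function name `repeated_ngram_fraction`, whitespace tokenization, and the default `n=3` are assumptions.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of n-grams that duplicate an earlier n-gram in the same text.

    Returns 0.0 when the text is too short to contain any n-gram.
    """
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


# Made-up strings, used only to exercise the function.
clean = repeated_ngram_fraction("the cat sat on the mat", n=2)
loopy = repeated_ngram_fraction("very good very good very good", n=2)
```

Tracking such a score across steering strengths would give a crude view of where output starts to loop, though it says nothing about factual hallucination.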

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The trade-off might be mitigated by developing adaptive steering strengths that adjust based on the input.
Similar control-quality issues could appear in other free-form tasks like question answering or creative writing.
Future work could test whether larger models reduce the hallucination effect at high steering levels.

Load-bearing premise

That the chosen datasets and automatic metrics for topical focus, sentiment, toxicity, and readability accurately isolate the effects of steering vectors without confounding influences from summary length, source content, or metric limitations.

What would settle it

A direct check on the arXiv dataset would settle it: if summaries generated with high-strength steering vectors show higher rates of factual hallucination than baselines, as verified against the source content by fact-checking tools or human raters, the claimed trade-off holds; if they do not, it fails.

Original abstract

Steering vectors are a lightweight method for controlling text properties by adding a learned bias to language model activations at inference time. While predominantly studied for multiple-choice and toy tasks, their effectiveness in free-form generation remains largely unexplored. Moving "Beyond Multiple Choice," we evaluate steering vectors for controlling topical focus, sentiment, toxicity, and readability in abstractive summaries across the SAMSum, NEWTS, and arXiv datasets. We find that steering effectively controls targeted properties, but high steering strengths consistently induce degenerate repetition and factual hallucinations. Prompting alone preserves summary quality but offers weaker control. Combining both methods yields the strongest control and the most favorable efficacy-quality trade-off at moderate steering strengths. Our work demonstrates that steering vectors face a critical control-quality trade-off in free-form generation, and that hybrid approaches offer the best balance in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript evaluates steering vectors as a method for controlling topical focus, sentiment, toxicity, and readability in abstractive summarization on the SAMSum, NEWTS, and arXiv datasets. It claims that steering achieves effective control over these properties but that high steering strengths induce degenerate repetition and factual hallucinations, while prompting alone provides weaker control; hybrid steering-plus-prompting approaches are reported to yield the strongest control and best efficacy-quality trade-off at moderate strengths.

Significance. If the empirical results hold after addressing potential confounds, the work would usefully extend steering-vector research from multiple-choice settings to free-form generation, documenting a control-quality trade-off and identifying hybrid methods as a practical mitigation. This could inform deployment decisions for controllable summarization systems.

major comments (1)

[Abstract] The central claims, that steering 'effectively controls targeted properties' and that hybrids provide the 'most favorable efficacy-quality trade-off', rest on the assumption that the automatic metrics for topical focus, sentiment, toxicity, and readability isolate the steering intervention. The abstract supplies no information on length normalization, content-matched controls, or statistical tests. Yet summary length is known to correlate with readability and toxicity scores, and source content can leak into topical measures; without such controls the reported effects could be artifacts.
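
The length confound raised in this comment could be probed with a basic correlation check between summary length and a metric score. The sketch below is only one possible diagnostic under stated assumptions: `word_lengths` and `pearson` are hypothetical helper names, length is measured in whitespace-separated words, and the data is made up to exercise the code rather than taken from the paper.

```python
import math
from typing import List, Sequence


def word_lengths(summaries: Sequence[str]) -> List[int]:
    """Length of each summary in whitespace-separated words."""
    return [len(s.split()) for s in summaries]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up summaries and scores, not results from the paper.
toy_summaries = [
    "short one",
    "a somewhat longer summary here",
    "an even longer summary with quite a few more words",
]
toy_scores = [0.1, 0.4, 0.9]

r = pearson(word_lengths(toy_summaries), toy_scores)
```

A correlation near zero on real outputs would weaken the length-artifact concern, while a strong correlation would suggest that length-normalized or length-matched comparisons are needed before attributing score changes to steering.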

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one quantitative result (e.g., delta in metric score or trade-off curve point) to ground the qualitative statements.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for highlighting the need for greater transparency in the abstract regarding evaluation controls. We address this point below and will revise the manuscript accordingly.


Referee: [Abstract] The central claims, that steering 'effectively controls targeted properties' and that hybrids provide the 'most favorable efficacy-quality trade-off', rest on the assumption that the automatic metrics for topical focus, sentiment, toxicity, and readability isolate the steering intervention. The abstract supplies no information on length normalization, content-matched controls, or statistical tests. Yet summary length is known to correlate with readability and toxicity scores, and source content can leak into topical measures; without such controls the reported effects could be artifacts.

Authors: We agree that the abstract would benefit from explicit mention of these methodological safeguards to better support the central claims. In the revised abstract we will note that metrics for readability and toxicity are length-normalized, that content-matched controls are used to isolate steering effects from source leakage, and that statistical significance testing is applied to the reported differences. These controls are described in the experimental setup of the full paper; adding a concise reference in the abstract will clarify that the observed control-quality trade-off is not an artifact of unaccounted confounds. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation study with no derivations


The paper is an empirical evaluation of steering vectors for controlling properties in abstractive summarization across SAMSum, NEWTS, and arXiv datasets. The abstract reports experimental observations on control effectiveness, quality trade-offs, and issues like repetition at high strengths, without any equations, first-principles derivations, fitted parameters presented as predictions, or self-referential definitions. No self-citations, uniqueness theorems, or ansatzes are invoked. All claims rest on direct experimental results against external benchmarks rather than any reduction to inputs by construction, satisfying the criteria for a self-contained non-circular analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, axioms, or invented entities; the work rests on standard assumptions of ML evaluation such as metric validity for abstract properties.

pith-pipeline@v0.9.0 · 5642 in / 1059 out tokens · 58603 ms · 2026-05-19T12:21:03.004253+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation between the paper passage and the cited Recognition theorem: unclear

Steering vectors are a lightweight method for controlling text properties by adding a learned bias to language model activations at inference time... We find that steering effectively controls targeted properties, but high steering strengths consistently induce degenerate repetition and factual hallucinations.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
cs.CL · 2026-01 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.