Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

Amirali Abdullah; Fazl Barez; Luke Marks; Narmeen Oozeer; Shreyans Jain

arXiv: 2505.24535 · v3 · submitted 2025-05-30 · cs.LG · cs.AI · cs.CL


Pith reviewed 2026-05-19 12:51 UTC · model grok-4.3


keywords: K-Steering · multi-attribute control · LLM steering · non-linear interventions · activation-based control · compositional benchmarks · behavioral steering


The pith

K-Steering uses gradients from one non-linear classifier to control multiple attributes in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to solve the problem of interference when steering multiple behaviors in LLMs at the same time. Traditional linear methods treat activation changes as simple additions and require tuning a separate vector for each behavior. K-Steering instead trains a single classifier that can label multiple behaviors from the model's hidden activations. It then uses the gradients of this classifier to create intervention directions on the fly. If successful, this would let users combine any set of desired behaviors dynamically without retraining or managing multiple vectors.
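
To make the contrast concrete, here is a minimal sketch of the linear, additive steering described above, in the style of contrastive activation addition (CAA): one difference-of-means vector per attribute, each added to the activation with its own strength. The function names, toy activations, and strengths are illustrative assumptions, not code from the paper.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(pos_acts, neg_acts):
    """Difference of means between activations with and without the attribute."""
    mean_pos = mean_vector(pos_acts)
    mean_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def linear_steer(activation, vectors, alphas):
    """Add each attribute vector, scaled by its own hand-tuned strength."""
    out = list(activation)
    for vec, alpha in zip(vectors, alphas):
        out = [o + alpha * v for o, v in zip(out, vec)]
    return out


# Toy 2-dimensional activations (hypothetical data).
formal = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
casual = steering_vector([[0.0, 1.0]], [[1.0, 0.0]])
steered = linear_steer([0.0, 0.0], [formal, casual], [0.5, 1.0])
```

Each attribute needs its own stored vector and its own strength, and combining attributes is plain summation, which is the additivity assumption K-Steering is meant to avoid.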

Core claim

The central discovery is that a non-linear multi-label classifier trained on hidden activations can provide reliable gradient-based directions for intervening in LLM generations. These directions support the simultaneous control of multiple behavioral attributes in a unified way. The approach removes the need for linearity assumptions and per-attribute tuning, and it is evaluated on two new benchmarks designed for compositional control.

What carries the argument

K-Steering, which computes intervention directions as gradients with respect to a trained multi-label classifier on the model's activations.

Load-bearing premise

The gradient directions derived from the multi-label classifier will combine without introducing unexpected interference when multiple attributes are targeted together.

What would settle it

An experiment where K-Steering is applied to a set of mutually conflicting attributes and the resulting generations are judged as less coherent or more interfered than those from linear steering would falsify the claim.

Figures

Figures reproduced from arXiv: 2505.24535 by Amirali Abdullah, Fazl Barez, Luke Marks, Narmeen Oozeer, Shreyans Jain.

**Figure 2.** Illustration of our evaluation setup for comparing CAA, DCT and K-Steering. In Step 1, we perform …

**Figure 3.** Steering scores across steps for 3 groups of …

Original abstract

Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in activation space and require per-attribute tuning. We introduce K-Steering, a unified and flexible approach that trains a single non-linear multi-label classifier on hidden activations and computes intervention directions via gradients at inference time. This avoids linearity assumptions, removes the need for storing and tuning separate attribute vectors, and allows dynamic composition of behaviors without retraining. To evaluate our method, we propose two new benchmarks, ToneBank and DebateMix, targeting compositional behavioral control. Empirical results across 3 model families, validated by both activation-based classifiers and LLM-based judges, demonstrate that K-Steering outperforms strong baselines in accurately steering multiple behaviors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

K-Steering replaces linear vectors with gradients from one non-linear classifier and adds two new benchmarks, but the tests skip direct checks on conflicting attribute pairs.


The main takeaway is that this work moves past linear steering by training a single non-linear multi-label classifier on activations and deriving intervention directions from its gradients during inference. That lets them combine behaviors dynamically without per-attribute vectors or heavy tuning, and they back it with results on three different model families plus two new benchmarks called ToneBank and DebateMix.

What the paper does well is show consistent outperformance over strong baselines using both internal classifier metrics and external LLM judges. The setup avoids some of the additivity assumptions that limit linear methods, and the benchmarks target compositional control in a way that previous work did not emphasize as directly. The approach feels practical for deployment scenarios where multiple attributes need to be satisfied at the same time.

The soft spots center on how reliably the gradients compose when attributes are not independent. Since the classifier is non-linear, the direction for one label can shift based on the others, which risks new interference when you combine them. The reported benchmarks do not appear to include deliberate tests on conflicting pairs, such as incompatible tones or opposing debate stances, so the claim that this handles interference better remains partly untested. The training procedure for the classifier and the exact data splits also stay light on detail, which affects how much weight to put on the numbers. Citations cover the linear steering literature without missing obvious priors.

Overall the thinking is clear and the experiments are multi-model, which helps. This paper is for researchers focused on inference-time alignment and behavioral control in LLMs. It deserves a serious referee because the method is a genuine alternative and the benchmarks are fresh, even if revisions will likely need to address the conflict cases more explicitly.

Referee Report

2 major / 2 minor

Summary. The paper introduces K-Steering, which trains a single non-linear multi-label classifier on hidden activations of LLMs and computes intervention directions from gradients at inference time. This is presented as a unified method for controlling multiple behavioral attributes without linearity assumptions or per-attribute vector storage/tuning. Two new benchmarks (ToneBank and DebateMix) are proposed for evaluating compositional control, with empirical results across three model families claiming outperformance over baselines when measured by both activation-based classifiers and LLM judges.

Significance. If the results hold after addressing the gaps in experimental detail and composition testing, the approach could meaningfully extend inference-time steering beyond linear methods by enabling dynamic, gradient-based composition from a single classifier. The new benchmarks targeting multi-attribute interference are a clear positive contribution to the literature on controllable generation.

major comments (2)

[Abstract and §4 (Method)] The central claim that gradients from the non-linear multi-label classifier compose reliably without new interference is load-bearing, yet the manuscript provides no explicit results or ablations on deliberately conflicting attribute pairs (e.g., mutually exclusive tones in ToneBank or opposing debate positions in DebateMix). Because the classifier is non-linear, the gradient for one label at a given activation can depend on the values of other labels, so simple summation or selection of gradients risks cross-term effects that linear methods avoid by construction.
[§5 (Empirical Evaluation)] The reported outperformance is difficult to interpret because the manuscript supplies no details on classifier architecture, training procedure, data splits, or potential post-hoc choices. This directly affects verification of whether the gradient directions support the claimed advantage in compositional settings.
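
One way to make the cross-term concern precise is a second-order expansion. The sketch below is a reading of the objection, not a derivation taken from the paper: $g_i$ denotes the classifier logit for attribute $i$, $H_i$ its Hessian at activation $a$, and $\delta_i$, $\delta_j$ are hypothetical steering offsets for two attributes.

```latex
\begin{aligned}
g_i(a + \delta_i + \delta_j) - g_i(a)
  &\approx \nabla g_i(a)^\top \delta_i + \nabla g_i(a)^\top \delta_j \\
  &\quad + \tfrac{1}{2}\,\delta_i^\top H_i\,\delta_i
         + \delta_i^\top H_i\,\delta_j
         + \tfrac{1}{2}\,\delta_j^\top H_i\,\delta_j .
\end{aligned}
```

For a linear probe, $H_i = 0$ and $\nabla g_i$ is the same at every activation, so the only interference is the fixed overlap $\nabla g_i^\top \delta_j$. For a non-linear classifier the terms involving $H_i$ and $\delta_j$ are generally non-zero, and $\nabla g_i$ itself varies with $a$, which is why ablations on conflicting pairs would be informative.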

minor comments (2)

[Method] Clarify in the method section how the multi-label outputs are converted into a single intervention direction vector (e.g., which labels are active, whether gradients are summed or selected, and the step size or number of steps used).
[Appendix] Add a table or appendix entry listing the exact hyper-parameters and training details for the classifier across the three model families to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and revise the manuscript where needed to strengthen the presentation.


Referee: [Abstract and §4] The central claim that gradients from the non-linear multi-label classifier compose reliably without new interference is load-bearing, yet the manuscript provides no explicit results or ablations on deliberately conflicting attribute pairs (e.g., mutually exclusive tones in ToneBank or opposing debate positions in DebateMix). Because the classifier is non-linear, the gradient for one label at a given activation can depend on the values of other labels, so simple summation or selection of gradients risks cross-term effects that linear methods avoid by construction.

Authors: We appreciate the referee's emphasis on verifying composition under conflict. ToneBank and DebateMix were constructed to include interfering attribute combinations, and the reported results already reflect performance under such conditions. Nevertheless, to directly address potential cross-term effects, the revised manuscript will add a dedicated ablation subsection in §5 that isolates mutually exclusive tones and opposing debate positions. These new results will quantify any additional interference introduced by gradient composition relative to linear baselines, using both activation classifiers and LLM judges.

Revision: yes

Referee: [§5] The reported outperformance is difficult to interpret because the manuscript supplies no details on classifier architecture, training procedure, data splits, or potential post-hoc choices. This directly affects verification of whether the gradient directions support the claimed advantage in compositional settings.

Authors: We agree that these experimental details are required for reproducibility and proper interpretation. The revised manuscript expands §5 with the classifier architecture (MLP with hidden dimensions and activation functions), training procedure (optimizer, learning rate, epochs, and multi-label loss), data splits (train/validation/test ratios and sampling strategy), and any post-hoc decisions. These details are also moved to a new appendix section for completeness.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in K-Steering derivation or claims


The paper introduces K-Steering as a method that trains one non-linear multi-label classifier on hidden activations then derives intervention directions from gradients at inference time, avoiding per-attribute vectors. This is evaluated empirically on the newly proposed ToneBank and DebateMix benchmarks across three model families, with results validated by activation classifiers and LLM judges showing outperformance versus baselines. No derivation chain reduces a claimed result to its inputs by construction, no self-citations are load-bearing for uniqueness or ansatzes, and no fitted parameters are relabeled as independent predictions. The approach is self-contained against external benchmarks and does not invoke prior author theorems to force its choices.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the existence of a trainable non-linear classifier whose gradients produce effective steering directions; this introduces free parameters in classifier design and training that are not enumerated in the abstract.

free parameters (1)

Classifier architecture and training procedure
Choice of non-linear model, loss function, and optimization details for the multi-label classifier are required to produce the intervention directions.
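
As an illustration of one such free choice, the sketch below shows a sigmoid binary cross-entropy loss averaged over labels, a common option for multi-label classifiers. The page does not state which loss the paper uses, so both the loss and the function name `bce_with_logits` are assumptions made for illustration.

```python
import math


def bce_with_logits(logits, labels):
    """Mean binary cross-entropy over labels, computed stably from raw logits.

    Uses max(x, 0) - x*y + log(1 + exp(-|x|)), which avoids overflow
    for logits of large magnitude.
    """
    total = 0.0
    for x, y in zip(logits, labels):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# One example with K = 3 labels, two of them active (hypothetical values).
loss = bce_with_logits([2.0, -1.0, 0.5], [1.0, 0.0, 1.0])
```

Each label gets an independent sigmoid, so several attributes can be active at once. A softmax loss would instead force the labels to compete, which changes the gradients used for steering.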



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We formalize the K-Steering intervention as a′_i = a_i − α ∇_{a_i} L(g_ϕ(a_i)), where L maximizes logits for target classes and minimizes logits for avoid classes.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: Train an MLP for multi-label classification with two hidden layers (256 units, ReLU) and output layer of size K.
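
The two quoted passages can be combined into a small sketch of the update a′ = a − α ∇_a L(g_ϕ(a)), using a ReLU MLP with two hidden layers and a hand-written backward pass with respect to the input. This is a pure-Python illustration under assumptions. The loss is taken to be the mean avoid logit minus the mean target logit, the weights are random rather than trained, and the hidden width is shrunk from 256 to 16 to keep the demo fast. All function names are hypothetical.

```python
import random


def matvec(W, v, b):
    """Return W @ v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def matTvec(W, v):
    """Return W^T @ v, used to push a gradient back through a linear layer."""
    return [sum(W[r][c] * v[r] for r in range(len(W))) for c in range(len(W[0]))]


def init_layer(rng, n_out, n_in):
    """Random weights and zero biases (a stand-in for a trained classifier)."""
    scale = 1.0 / n_in ** 0.5
    W = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def forward(params, a):
    """Run the MLP; return both pre-activations and the K output logits."""
    (W1, b1), (W2, b2), (W3, b3) = params
    z1 = matvec(W1, a, b1)
    z2 = matvec(W2, [max(z, 0.0) for z in z1], b2)
    logits = matvec(W3, [max(z, 0.0) for z in z2], b3)
    return z1, z2, logits


def steering_loss(logits, target, avoid):
    """Assumed loss: lower means higher target logits and lower avoid logits."""
    loss = 0.0
    if target:
        loss -= sum(logits[k] for k in target) / len(target)
    if avoid:
        loss += sum(logits[k] for k in avoid) / len(avoid)
    return loss


def loss_and_grad(params, a, target, avoid):
    """Return the loss and its gradient with respect to the activation a."""
    (W1, _), (W2, _), (W3, _) = params
    z1, z2, logits = forward(params, a)
    g = [0.0] * len(logits)                      # dL/dlogits
    for k in target:
        g[k] -= 1.0 / len(target)
    for k in avoid:
        g[k] += 1.0 / len(avoid)
    g = matTvec(W3, g)                           # back through the output layer
    g = [gi if z > 0.0 else 0.0 for gi, z in zip(g, z2)]   # ReLU mask
    g = matTvec(W2, g)                           # back through hidden layer 2
    g = [gi if z > 0.0 else 0.0 for gi, z in zip(g, z1)]   # ReLU mask
    return steering_loss(logits, target, avoid), matTvec(W1, g)


def k_steer(params, a, target, avoid, alpha=0.1, steps=3):
    """Apply a' = a - alpha * grad repeatedly, recomputing the gradient each step."""
    for _ in range(steps):
        _, g = loss_and_grad(params, a, target, avoid)
        a = [ai - alpha * gi for ai, gi in zip(a, g)]
    return a


rng = random.Random(0)
d_model, hidden, n_labels = 8, 16, 4
params = [init_layer(rng, hidden, d_model),
          init_layer(rng, hidden, hidden),
          init_layer(rng, n_labels, hidden)]
activation = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
steered = k_steer(params, activation, target=[0, 2], avoid=[1])
```

The target and avoid sets are chosen per call, so any combination of labels can be requested from the same classifier. Because the gradient is recomputed at the updated activation on every step, the direction for one label can drift as others are pushed. That is the composition behaviour the referee asks the authors to test on conflicting pairs.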

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

[1]

Refusal in Language Models Is Mediated by a Single Direction

Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717. Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, and 1 others. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint a...

[2]

Safe RLHF: Safe Reinforcement Learning from Human Feedback

Safe RLHF: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773. Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164. Sunipa Dev and Jeff Phillip...

[3]

A unified understanding and evaluation of steering methods, 2026

Collective Constitutional AI: Aligning a language model with public input. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 1395–1417. Shawn Im and Yixuan Li. 2025. A unified understanding and evaluation of steering methods. arXiv preprint arXiv:2502.02716. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men...

[4]

Steering Llama 2 via Contrastive Activation Addition

Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681. Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, and David Krueger. 2024. Towards reliable evaluation of behavior steering interventions in LLMs. arXiv preprint arXiv:2410.17245. Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2...
[5]

A forward pass through a 3-layer MLP: O(d_seq · (d_model · H + H^2 + H · C))

[6]

A backward pass to compute gradients w.r.t. the input activation (same cost as forward)

[7]

An activation update: O(d_seq · d_model)

These steps are repeated independently for each of the N iterations, with no reuse of computation between steps. This is because each iteration performs a new forward and backward pass based on the updated activation vector, followed by a gradient descent step. As a result, the total cost scales linearly with N. T...
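
The three per-step costs above can be turned into a rough operation count. The helper below is a back-of-the-envelope sketch that drops constant factors. The function and argument names are assumptions, with `hidden` standing for H and `n_classes` for C.

```python
def k_steering_cost(d_seq, d_model, hidden, n_classes, n_steps):
    """Approximate multiply-add count for n_steps of gradient-based steering."""
    # Forward pass through the 3-layer MLP, applied at every sequence position.
    forward = d_seq * (d_model * hidden + hidden ** 2 + hidden * n_classes)
    # The backward pass w.r.t. the input is stated to cost the same as forward.
    backward = forward
    # Element-wise update of the activation.
    update = d_seq * d_model
    # No computation is reused across steps, so the total is linear in n_steps.
    return n_steps * (forward + backward + update)


# Hypothetical sizes: 32 positions, d_model = 4096, H = 256, C = 6 labels.
one_step = k_steering_cost(32, 4096, 256, 6, 1)
ten_steps = k_steering_cost(32, 4096, 256, 6, 10)
```

With the hidden width fixed, the d_model · H term dominates for large models, and doubling the number of steps doubles the cost.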
[8]

Be a clear and well-formed debatable question or statement

[9]

Be style-neutral (able to be approached well using any of the debate styles)

[10]

Have sufficient complexity to allow for nuanced arguments

[11]

Avoid numbering or special formatting

[12]

Be suitable for formal debate settings

Focus on creating questions where the SAME question can be approached in meaningfully different ways depending on which debate style is used to argue the position. These should be questions where reasonable people might disagree, and where multiple debate techniques could be effectively employed. We show the distribu...

[13]

Be a clear and well-formed question ending with a question mark

[14]

Be tone-neutral (able to be answered well in any of the tones)

[15]

Avoid numbering or special formatting

Focus on creating questions where the SAME question can receive meaningfully different responses depending on which tone is used to answer. We include a count of examples by category in Table 11.

F Dataset Labels

F.1 ToneBank

TONEBANK: We select six diverse tone categories, described for language model prompting as below:
[16]

Expert: formal, authoritative, using technical terminology

Table 9: Final layer activation classifier scores caused by steering with different methods on TruthfulQA questions.

Attribute        K-Steering (%)   CAA (%)   DCT (%)
Bias             0.52             0.36      0.04
Refusal          0.43             0.37      0.11
Toxicity         0.52             0.41      0.35
Unhelpfulness    0.57             0.49      0.31

Category / Count: civil_liberties 34, human_rights 37, sci...

[17]

Empathetic: warm, supportive, focusing on emotional understanding

[18]

Cautious: hedging, acknowledging limitations, presenting multiple perspectives

[19]

Casual: conversational, informal, using colloquial language

[20]

Concise: brief, minimal, avoiding elaboration

F.2 DebateMix

DEBATEMIX: We construct a dataset of debate questions that can be answered using the following ten styles:
[21]

Reductio ad Absurdum: Extend opponent’s logic to absurd extremes to reveal flaws

[22]

Appeal to Precedent: Cite past rulings or history to justify present stance

[23]

Straw Man Reframing: Oversimplify opponent’s view to refute an easier version

[24]

Burden of Proof Shift: Demand opponent disprove your claim to shift burden

[25]

Analogy Construction: Use relatable analogies to clarify and support your point

[26]

Concession and Pivot: Concede a minor point, then redirect to stronger arguments

[27]

Empirical Grounding: Rely on data, studies, and statistics to support your case

[28]

Moral Framing: Frame issue in terms of ethics and moral values

[29]

Refutation by Distinction: Highlight key differences that invalidate opponent’s logic

[30]

makes absolutely no sense; the model generated text that is not even valid English

Circular Anticipation: Preemptively address and rebut expected counterarguments.

These are classical rhetoric and logical techniques, refer to Toulmin (2003); Walton (2008) for more details. We describe the creation of both datasets in Appendix E. We give the full prompts used to direct models to respond in these debate and tone styles in Appendix G, alo...
[31]

the target attributes are not present at all

No extra text.

Instead of using the judge output as the score we take a weighted average of the logits of the integers 0 to 100 in the 20 largest logits (the most that can be accessed via the OpenAI API). We sample with a temperature of 0. We set a threshold for this score between 30 and 60 and sample 20 generations from the steered model at a given α. If...
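
A logit-weighted judge score of this kind can be sketched as follows. The sketch assumes the judge returns a mapping from its top candidate tokens to log-probabilities, and it reads "weighted average" as a probability-weighted mean over the candidates that parse as integers from 0 to 100. Both the input format and that interpretation are assumptions, and `expected_judge_score` is a hypothetical helper name.

```python
import math


def expected_judge_score(top_logprobs):
    """Probability-weighted mean of the integer scores among the top tokens.

    top_logprobs maps candidate token strings to log-probabilities.
    Tokens that are not integers in [0, 100] are ignored, and the
    remaining probability mass is renormalised.
    """
    weights = {}
    for token, logprob in top_logprobs.items():
        text = token.strip()
        if text.isascii() and text.isdigit() and 0 <= int(text) <= 100:
            score = int(text)
            weights[score] = weights.get(score, 0.0) + math.exp(logprob)
    total = sum(weights.values())
    if total == 0.0:
        return None  # no usable integer token among the candidates
    return sum(score * w for score, w in weights.items()) / total


# Hypothetical judge output for a single generation.
example = {"70": math.log(0.6), "80": math.log(0.2), "n/a": math.log(0.2)}
score = expected_judge_score(example)
```

Compared with taking the single sampled token as the score, this gives a continuous value that reflects how the judge's probability mass is spread across nearby ratings.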
[32]

No extra text. We use the token output as the score.

L Steering Score Judge

We evaluate steering effectiveness by using GPT-4o-Mini to assess whether a generation has been successfully steered away from an undesired style and optionally toward a desired one.

Prompting Setup. The judge receives:
• A short rubric describing the debate styles involved (botha...
[33]

Ground your arguments in historical examples, established case law, or widely accepted previous decisions

**The government has the authority to dictate what information is true and what is not**: This implies precedent1 Prompt: RESPOND USING APPEAL TO PRECEDENT DEBATE STYLE. Ground your arguments in historical examples, established case law, or widely accepted previous decisions. Reference specific historical situations, court rulings, scientific discoveries, ...

[34]

hold them accountable

**Effectiveness in Reducing Misinformation:** A study published in the Journal of Communication (2019) found that government-mand moral1 Prompt: RESPOND USING MORAL FRAMING DEBATE STYLE. Position the issue within a framework of ethical principles, values, and moral imperatives rather than pragmatic concerns. Identify the core moral values at stake such as ...

[35]

**self-care is key**: keep doin’ the things that made u feel good in therapy, like exercisin’, meditatin’, or just chillin’ with friends. 2. ** Table 28: Prompts and completions for question 3 (styles 1–4). Tone Style Example # Prompt and Completion Expert3 Prompt: RESPOND AS AN AUTHORITATIVE EXPERT. Use advanced terminology and industry-specific jargon....

[36]

I" statements**: Instead of saying

**Debt Repayment Prioritization**: Focus on high-interest debts first, such as credit card balances. This Empathetic4 Prompt: RESPOND WITH DEEP EMOTIONAL UNDERSTANDING. Begin by validating the feelings that might accompany this question. Use warm, compassionate language that acknowledges the human experience behind the query. Express genuine concern and ...

[37]

**If the government 3 3.05 If we assume that the government’s primary goal is to protect public health and safety, then it’s reasonable to consider restricting free speech in cases where misinformation could lead to harm. However, if we take this logic to its extreme, we might conclude that the government should have the authority to restrict all forms of...

[38]

The healthcare system in the United States is multifaceted, and opinions on its flaws and potential reforms vary widely

**Access to care:** The US has made significant progress in increasing access 8 4.00 I’ll start by acknowledging the complexity of the question. The healthcare system in the United States is multifaceted, and opinions on its flaws and potential reforms vary widely. To address this question, I’ll present some data and insights. The US healthcare system is ...

[39]

carbon footprints

**Unrealistic expectations 5 3.60 I’d like to start by acknowledging the importance of transparency in corporate practices, particularly when it comes to environmental impact. If we follow this logic, then, by requiring international corporations to disclose their carbon footprints publicly, we can gain a better understanding of their environmental fo...

[40]

**Transparency and Accountability**: Public disclosure of carbon footprints would provide stakeholders with a comprehensive understanding of a corporation’s environmental impact 7 2.75 The case of the British monarch The case of the British monarch The case of the British monarch The case of the British monarch The case of the British monarch The case of ...
