A mechanistic understanding of alignment algorithms: A case study on DPO and toxicity

Lee, A · 2024 · arXiv 2401.01967

9 Pith papers cite this work. Polarity classification is still being indexed.


citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Linear probes for Othello board states factor into tensor-product structure with square and color embeddings composed by a binding matrix, from which the linear probes can be directly recovered.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while preserving safety.

CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification

cs.CL · 2026-04-16 · unverdicted · novelty 6.0

CausalDetox identifies minimal attention heads causally linked to toxicity via Probability of Necessity and Sufficiency, then applies targeted inference-time steering or fine-tuning to reduce toxic generation while preserving fluency and achieving faster selection.

Why Do Large Language Models Generate Harmful Content?

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.

Task Vectors, Learned Not Extracted: Performance Gains and Mechanistic Insight

cs.CL · 2025-09-29 · unverdicted · novelty 6.0

Task vectors that are learned directly outperform extracted task vectors for in-context learning, and the analysis adds mechanistic insight into linear propagation and key attention circuits.

Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment

cs.LG · 2025-05-30 · unverdicted · novelty 6.0

Disentangled Safety Adapters decouple safety computations from task-optimized LLMs via lightweight adapters, yielding up to 53% better AUC on safety tasks and dynamic inference-time alignment with reduced performance trade-offs.

Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control

cs.LG · 2026-02-07 · unverdicted · novelty 5.0

ShaPO improves LLM safety robustness over standard preference optimization by enforcing worst-case objectives via selective geometry control at token and reward levels.

citing papers explorer

Tracing Persona Vectors Through LLM Pretraining · cs.CL · 2026-05-13 · unverdicted · none · ref 11
Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions · cs.LG · 2026-05-11 · unverdicted · none · ref 11
Refusal in Language Models Is Mediated by a Single Direction · cs.LG · 2024-06-17 · accept · none · ref 145
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs · cs.CL · 2026-04-30 · unverdicted · none · ref 13
CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification · cs.CL · 2026-04-16 · unverdicted · none · ref 4
Why Do Large Language Models Generate Harmful Content? · cs.AI · 2026-04-13 · unverdicted · none · ref 9
Task Vectors, Learned Not Extracted: Performance Gains and Mechanistic Insight · cs.CL · 2025-09-29 · unverdicted · none · ref 2
Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment · cs.LG · 2025-05-30 · unverdicted · none · ref 23
Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control · cs.LG · 2026-02-07 · unverdicted · none · ref 13
