
reward-lens: A Mechanistic Interpretability Library for Reward Models

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Every RLHF-trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit -- logit lens, direct logit attribution, activation patching, sparse autoencoders -- was built for generative LLMs whose primitives all project onto a vocabulary unembedding. Reward models replace that with a scalar regression head, breaking each tool. We present reward-lens, an open-source library that ports this toolkit to reward models, organised around one observation: the reward head's weight vector $w_r$ is the natural axis for every interpretability question. The library provides a Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE feature attribution, cross-model comparison, and five theory-grounded extensions (distortion index, divergence-aware patching, misalignment cascade detection, reward-term conflict analysis, concept-vector analysis). A ten-method adapter protocol covers Llama, Mistral, Gemma-2, and ArmoRM multi-objective heads, with a generic adapter for any HuggingFace sequence classification model. We validate on two production reward models across ~695 RewardBench pairs. The central empirical finding is negative: linear attribution does not predict causal patching effects (mean Spearman $\rho = -0.256$ on Skywork, $-0.027$ on ArmoRM). The framework treats this disagreement as a property to expose, not a bug -- motivating a design that keeps observational and causal views first-class and directly comparable.
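The abstract's organising observation, that the reward head's weight vector $w_r$ serves as the readout axis, can be illustrated with a toy NumPy sketch. This is not the reward-lens API: the function names (`reward_lens`, `component_attribution`, `spearman_rho`), the random data, and the simplification that the final residual state is a plain sum of component outputs (no final normalisation layer, no bias) are all assumptions made for illustration.

```python
import numpy as np

def reward_lens(hidden_by_layer, w_r, bias=0.0):
    """Read out a scalar 'reward so far' at each layer by projecting onto w_r."""
    return hidden_by_layer @ w_r + bias

def component_attribution(component_outputs, w_r):
    """Linear attribution: each component's contribution along w_r."""
    return component_outputs @ w_r

def spearman_rho(x, y):
    """Spearman rank correlation (assumes no tied values)."""
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Synthetic stand-ins for a real model's activations (hypothetical data).
rng = np.random.default_rng(0)
d_model, n_components = 16, 6
w_r = rng.normal(size=d_model)
component_outputs = rng.normal(size=(n_components, d_model))

# Toy residual stream: the state after component k is the sum of outputs 0..k.
hidden_by_layer = np.cumsum(component_outputs, axis=0)
reward = float(hidden_by_layer[-1] @ w_r)

lens_readout = reward_lens(hidden_by_layer, w_r)
attributions = component_attribution(component_outputs, w_r)

# Placeholder for measured causal effects; a real run would obtain these
# by activation patching, not by random sampling.
patch_effects = rng.normal(size=n_components)
rho = spearman_rho(attributions, patch_effects)
```

In this linear toy the attributions sum exactly to the final reward. The abstract's negative finding is that, on real reward models, such linear scores do not track the causal effects measured by patching, which is why the two views are kept separate and compared by rank correlation.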


representative citing papers

HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

HARVE removes the component of the reward-head vector aligned with a multi-directional hacking subspace from residual streams using a small set of contrastive examples, improving robustness on RewardHackBench across eight models without fine-tuning while preserving general capability.

citing papers explorer


  • HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models cs.LG · 2026-06-02 · unverdicted · none · ref 32 · internal anchor
