reward-lens: A Mechanistic Interpretability Library for Reward Models

Mohammed Suhail B Nadaf

arXiv: 2604.26130 · v1 · submitted 2026-04-28 · cs.LG · cs.AI

Pith reviewed 2026-05-07 16:13 UTC · model grok-4.3

keywords: reward models, mechanistic interpretability, activation patching, RLHF, attribution, scalar head, reward hacking

The pith

Reward models break standard interpretability tools and need versions centered on their scalar reward head weight vector.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reward models replace the vocabulary unembedding of language models with a scalar head, which breaks tools like logit lens and activation patching that were designed for generative LLMs. The paper introduces reward-lens, a library that ports these methods by treating the reward head's weight vector as the organizing axis for attribution, patching, and probes. It supplies component attribution, three-mode activation patching, reward-hacking suites, and several theory extensions, all validated across production models and RewardBench pairs. The central result is negative: linear attribution scores show no positive correlation with causal effects measured by patching. The library treats this mismatch as a property worth exposing rather than a flaw to ignore.

Core claim

The reward-lens framework adapts mechanistic interpretability to reward models by making the reward head weight vector the common reference axis for every analysis. This permits direct porting of activation patching, feature attribution, and cross-model comparison while adding new probes for reward-term conflicts and misalignment cascades. Tests on the Skywork and ArmoRM models across ~695 RewardBench pairs show that linear attribution does not predict patching outcomes, with mean Spearman correlations of -0.256 on Skywork and -0.027 on ArmoRM.

What carries the argument

The reward head's weight vector, which serves as the natural projection axis for turning internal activations into scalar reward contributions.
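To make the projection idea concrete, here is a minimal sketch on a toy residual stream. It assumes the final hidden state is a plain sum of component writes and that no final normalisation sits between the stream and the head; the component names, the bias term, and all numbers are hypothetical and are not the reward-lens API.

```python
# Toy sketch: decompose a scalar reward into per-component contributions along w_r.
# All names and numbers are hypothetical; this is not the reward-lens API.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Reward head: r = w_r . h + bias (one linear unit instead of a vocabulary unembedding).
w_r = [0.5, -1.0, 2.0]
bias = 0.1

# Assumption: the final hidden state is the sum of what each component wrote.
component_outputs = {
    "embed":  [1.0, 0.0, 0.0],
    "attn_0": [0.0, 0.5, 0.25],
    "mlp_0":  [0.2, -0.5, 1.0],
}

h_final = [sum(vals) for vals in zip(*component_outputs.values())]
reward = dot(w_r, h_final) + bias

# Because the head is linear, each component's direct contribution
# is simply its projection onto w_r.
contributions = {name: dot(w_r, out) for name, out in component_outputs.items()}

for name, value in contributions.items():
    print(f"{name:>7}: {value:+.3f}")
print(f"reward = {reward:.3f}")
```

The contributions plus the bias sum exactly to the reward, which is what makes this an observational decomposition. It says nothing about what happens downstream when a component is changed.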

If this is right

Observational attribution and causal patching views can be kept first-class and directly comparable in reward model analysis.
A single adapter protocol supports multiple architectures including Llama, Mistral, Gemma-2, and multi-objective heads.
New metrics such as distortion index and misalignment cascade detection become available for reward model inspection.
Linear methods alone are insufficient to understand how reward models assign scores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The mismatch between linear and causal measures may point to nonlinear interactions inside reward models that standard probes miss.
The same centering approach could transfer to other scalar-output models outside RLHF.
Exposing these disagreements could help identify concrete reward hacking patterns before deployment.
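The first extension above can be illustrated with a deliberately small toy. In this hypothetical 1-D model, component A writes to the residual stream and a later component B reacts nonlinearly to what A wrote. The construction is an illustration only and is not taken from the paper or the library.

```python
# Toy 1-D model where direct linear attribution and patching disagree in sign.
# Hypothetical construction for illustration; not from the paper or the library.

W_R = 1.0  # scalar "reward head" weight on a 1-D residual stream


def relu(z):
    return max(0.0, z)


def forward(x, a_override=None):
    """Return (reward, a). Component A writes a; component B reads the stream after A."""
    a = 1.0 * x if a_override is None else a_override  # component A's write
    stream_after_a = x + a
    b = -2.0 * relu(stream_after_a)                    # component B reacts nonlinearly
    reward = W_R * (x + a + b)
    return reward, a


clean_reward, a_clean = forward(x=1.0)
_, a_corrupt = forward(x=0.0)

# Observational view: A's direct contribution along the head weight.
direct_attribution = W_R * a_clean

# Causal view: swap in A's corrupted activation and rerun everything downstream.
patched_reward, _ = forward(x=1.0, a_override=a_corrupt)
patching_effect = clean_reward - patched_reward

print("direct attribution:", direct_attribution)
print("patching effect:   ", patching_effect)
```

Here A looks positive under direct attribution, yet removing it raises the reward, because B overcompensates for whatever A writes. Whether real reward models behave this way is exactly the open question; the toy only shows that the two views are free to disagree.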

Load-bearing premise

The adapted three-mode activation patching correctly isolates causal effects in the reward model's forward pass rather than introducing artifacts from the scalar head.

What would settle it

A consistently positive Spearman correlation between linear attribution scores and patching effect sizes on additional reward models or input pairs would undermine the reported negative finding. A correlation near zero would remain consistent with it.
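For readers who want to run this check themselves, the following is a minimal standard-library sketch of a Spearman rank correlation between per-component attribution scores and patching effects for one pair. The two score lists are made-up placeholders, not data from the paper, and the helper names are hypothetical.

```python
# Spearman rank correlation from scratch (standard library only).
# The score lists below are invented placeholders, not results from the paper.

def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# One hypothetical pair: a score per component from each method.
attribution = [0.9, 0.4, 0.1, -0.2, 0.6]
patching = [-0.3, 0.2, 0.5, 0.1, -0.1]

rho = spearman(attribution, patching)
print(f"rho = {rho:.3f}")
```

Averaging such per-pair values over a benchmark gives the kind of mean ρ the paper reports. A sustained positive mean on new models or pairs is what would count against the negative finding.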

Figures

Figures reproduced from arXiv: 2604.26130 by Mohammed Suhail B Nadaf.

**Figure 1.** Reward Lens dashboard for a representative Skywork helpfulness pair: layer-wise prefer…

**Figure 2.** Attribution vs. patching, Skywork helpfulness. Late-MLP components dominate attribu…

**Figure 3.** Cross-model trajectory overlay. Normalised preference differential as a function of frac…

**Figure 4.** Hacking-detector effect sizes. Positive d rewards the biased variant; negative d penalises it.

5.7 Concept-level structure: We extract six linear concept directions from contrastive preference pairs (confidence, formality, agreement, verbosity, hedging, and helpfulness) and report each concept's cosine alignment with $w_r$, the slope of its causal dose-response curve under residual-stream addition $h \leftarrow h + \alpha v$…

**Figure 5.** Concept dose-response curves. Skywork responds linearly at unit scale; ArmoRM responds…
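
The residual-stream addition quoted alongside Figure 4 has a simple closed form in one special case. As a sketch, assume the concept vector $v$ is added to the final hidden state $h$ that the reward head reads directly, and that the head is $r(h) = w_r^\top h + b$ (the bias $b$ is an assumption and cancels in the slope):

$$
r(h + \alpha v) = w_r^\top h + b + \alpha\, w_r^\top v,
\qquad
\frac{\partial r}{\partial \alpha} = w_r^\top v = \lVert w_r \rVert\, \lVert v \rVert \cos(w_r, v).
$$

Under that assumption the dose-response curve is a straight line whose slope is fixed by the concept's cosine alignment with $w_r$. When $v$ is added at an earlier layer, the downstream blocks sit between the intervention and the head, so the curve need not be linear and the slope has to be measured rather than read off $w_r$.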

Original abstract

Every RLHF-trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit -- logit lens, direct logit attribution, activation patching, sparse autoencoders -- was built for generative LLMs whose primitives all project onto a vocabulary unembedding. Reward models replace that with a scalar regression head, breaking each tool. We present reward-lens, an open-source library that ports this toolkit to reward models, organised around one observation: the reward head's weight vector $w_r$ is the natural axis for every interpretability question. The library provides a Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE feature attribution, cross-model comparison, and five theory-grounded extensions (distortion index, divergence-aware patching, misalignment cascade detection, reward-term conflict analysis, concept-vector analysis). A ten-method adapter protocol covers Llama, Mistral, Gemma-2, and ArmoRM multi-objective heads, with a generic adapter for any HuggingFace sequence classification model. We validate on two production reward models across ~695 RewardBench pairs. The central empirical finding is negative: linear attribution does not predict causal patching effects (mean Spearman $\rho = -0.256$ on Skywork, $-0.027$ on ArmoRM). The framework treats this disagreement as a property to expose, not a bug -- motivating a design that keeps observational and causal views first-class and directly comparable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The library adapts interpretability tools to reward models and reports that linear attribution fails to match patching results, but the patching adaptation itself needs stronger validation.

The punchline is that reward-lens gives researchers a working set of tools for mechanistic interpretability on reward models, and it finds that standard linear attribution does not predict what activation patching actually changes in the scalar output. On Skywork and ArmoRM the Spearman correlations come out near zero or negative, and the authors treat the mismatch as a feature worth exposing rather than a flaw in one method.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces reward-lens, an open-source library that ports mechanistic interpretability tools (logit lens, activation patching, SAEs) to reward models by centering analysis on the reward head weight vector w_r. It supplies component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE attribution, cross-model adapters for Llama/Mistral/Gemma-2/ArmoRM, and five theory-grounded extensions. Validation on Skywork and ArmoRM using ~695 RewardBench pairs yields the central negative result that linear attribution fails to predict causal patching effects (mean Spearman ρ = -0.256 on Skywork, -0.027 on ArmoRM), which the authors treat as a feature motivating separate observational and causal views.

Significance. If the negative correlation result holds after proper validation of the causal tools, it would demonstrate that standard linear attribution methods are unreliable for predicting causal effects in reward models, with direct implications for detecting reward hacking and misalignment in RLHF. The open-source library with broad HuggingFace adapters and reproducible code for multiple model families constitutes a practical strength that lowers barriers for follow-up work.

major comments (2)

[Empirical validation paragraph and three-mode patching description] The central negative finding (linear attribution vs. patching mismatch) is load-bearing on the correctness of the adapted three-mode activation patching. The manuscript replaces the vocabulary unembedding with the scalar regression head w_r but provides no explicit checks, synthetic recovery experiments, or controls demonstrating that the scalar difference computation isolates feature-specific causal effects rather than global head artifacts or baseline sensitivities.
[Validation results] The reported mean Spearman ρ values (-0.256, -0.027) are presented without error bars, per-pair distributions, or statistical significance tests. This omission prevents assessment of whether the negative correlation is robust or could be explained by variance in the ~695 RewardBench pairs or implementation details of the scalar head.

minor comments (2)

[Abstract] The abstract states that the library provides 'five theory-grounded extensions' but does not enumerate them; an explicit list would improve clarity.
[Introduction] Notation for the reward head weight vector w_r is introduced without an equation reference in the opening paragraphs; adding a numbered definition would aid readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, particularly on the need for stronger validation of the causal patching methods and statistical reporting. We address each major comment below and will incorporate revisions to enhance the rigor of our empirical claims.

Referee: [Empirical validation paragraph and three-mode patching description] The central negative finding (linear attribution vs. patching mismatch) is load-bearing on the correctness of the adapted three-mode activation patching. The manuscript replaces the vocabulary unembedding with the scalar regression head w_r but provides no explicit checks, synthetic recovery experiments, or controls demonstrating that the scalar difference computation isolates feature-specific causal effects rather than global head artifacts or baseline sensitivities.

Authors: We agree that additional validation is warranted to confirm that the three-mode activation patching isolates the intended causal effects. The adaptation is straightforward: the scalar reward difference is computed as w_r · (h_patched - h_clean), where h denotes the hidden states at the patched layer, which follows directly from the linearity of the reward head. However, to demonstrate that this does not capture global artifacts, we will add synthetic recovery experiments in the revised manuscript. Specifically, we will construct controlled interventions on known directions in activation space (e.g., by adding scaled versions of w_r itself or orthogonal vectors) and verify that the patching recovers the expected reward delta without spurious effects from baseline shifts. We will also include controls comparing patched effects to those from random activation perturbations of matched magnitude. These will be presented in a new 'Validation of Causal Tools' subsection.

revision: yes

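
The synthetic recovery check described in this response can be sketched on a toy linear head. The sketch assumes the intervention is applied to the final hidden state that the head reads, so the expected reward delta is known in closed form; the vectors, the scale `alpha`, and the helper names are hypothetical.

```python
# Toy synthetic-recovery check for a linear reward head.
# Assumption: the perturbation is applied to the final hidden state read by the head.
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add_scaled(h, direction, alpha):
    """Return h + alpha * direction."""
    return [hi + alpha * di for hi, di in zip(h, direction)]


def reward(h, w_r, bias=0.0):
    return dot(w_r, h) + bias


w_r = [0.5, -1.0, 2.0]
h_clean = [1.0, 0.3, -0.4]
alpha = 0.25

# Intervention 1: push along w_r itself. Expected delta is alpha * ||w_r||^2.
h_along = add_scaled(h_clean, w_r, alpha)
delta_along = reward(h_along, w_r) - reward(h_clean, w_r)
expected_along = alpha * dot(w_r, w_r)

# Intervention 2: push along a direction orthogonal to w_r. Expected delta is 0.
v_orth = [2.0, 1.0, 0.0]  # 0.5*2.0 + (-1.0)*1.0 + 2.0*0.0 == 0
h_orth = add_scaled(h_clean, v_orth, alpha)
delta_orth = reward(h_orth, w_r) - reward(h_clean, w_r)

print("along w_r :", delta_along, "expected", expected_along)
print("orthogonal:", delta_orth, "expected 0.0")
print("recovered :", math.isclose(delta_along, expected_along, abs_tol=1e-9))
```

A real version of this control would apply the same interventions at intermediate layers, where no closed form is available, and compare the measured deltas against matched-magnitude random perturbations as the response proposes.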
Referee: [Validation results] The reported mean Spearman ρ values (-0.256, -0.027) are presented without error bars, per-pair distributions, or statistical significance tests. This omission prevents assessment of whether the negative correlation is robust or could be explained by variance in the ~695 RewardBench pairs or implementation details of the scalar head.

Authors: We acknowledge this presentation gap. In the revision, we will augment the results section with: (i) standard error bars computed across the ~695 pairs, (ii) a distribution plot (e.g., histogram or boxplot) of the per-pair Spearman ρ values to show the spread, and (iii) a statistical test, such as a Wilcoxon signed-rank test or bootstrap confidence interval, to evaluate whether the mean ρ is significantly different from zero. These statistics will be computed using the existing experimental code and added to both the main text and the appendix for full transparency.

revision: yes
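
One of the proposed statistics, a bootstrap confidence interval on the mean per-pair ρ, can be sketched with the standard library alone. The per-pair values below are synthetic draws used only to exercise the code; they are not the paper's data, and the function name and defaults are hypothetical.

```python
# Percentile-bootstrap confidence interval for a mean of per-pair Spearman rhos.
# The rho values are synthetic placeholders, not results from the paper.
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(values), lower, upper


# Synthetic stand-in for per-pair rhos, clipped to the valid range [-1, 1].
data_rng = random.Random(42)
per_pair_rho = [
    max(-1.0, min(1.0, data_rng.gauss(-0.25, 0.3))) for _ in range(200)
]

mean_rho, ci_lower, ci_upper = bootstrap_mean_ci(per_pair_rho)
excludes_zero = ci_upper < 0 or ci_lower > 0

print(f"mean rho = {mean_rho:.3f}, 95% CI = [{ci_lower:.3f}, {ci_upper:.3f}]")
print("interval excludes zero:", excludes_zero)
```

If the interval excludes zero, the mean correlation is distinguishable from no relationship at the chosen level. An interval that straddles zero would support the weaker reading that attribution simply fails to predict patching.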

Circularity Check

0 steps flagged

No significant circularity; central result is an independent empirical comparison

full rationale

The paper's load-bearing claim is the negative empirical result that linear attribution fails to predict causal patching effects (Spearman ρ near zero or negative) when both are run on external production reward models (Skywork, ArmoRM) using ~695 RewardBench pairs. This comparison does not reduce to any fitted parameter, self-definition, or self-citation chain within the paper. The library ports existing tools (logit lens, activation patching, SAEs) by replacing the vocabulary unembedding with the scalar head weight vector w_r; the mismatch between observational and causal views is treated as the finding itself rather than a quantity derived from prior assumptions. No equations or sections exhibit self-definitional loops, fitted-input predictions, or uniqueness theorems imported from the authors' own prior work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that mechanistic interpretability primitives remain meaningful when the unembedding is replaced by a scalar regression head; no free parameters or new entities are introduced.

axioms (1)

domain assumption: The reward head's weight vector w_r is the natural axis for every interpretability question in reward models.
This observation organizes the entire library design and all provided tools.
