reward-lens: A Mechanistic Interpretability Library for Reward Models
Pith reviewed 2026-05-07 16:13 UTC · model grok-4.3
The pith
Reward models break standard interpretability tools and need versions centered on their scalar reward head weight vector.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The reward-lens framework adapts mechanistic interpretability to reward models by making the reward head weight vector the common reference axis for every analysis. This permits direct porting of activation patching, feature attribution, and cross-model comparison while adding new probes for reward-term conflicts and misalignment cascades. Tests on the Skywork and ArmoRM reward models across roughly 695 RewardBench pairs show that linear attribution does not predict patching outcomes: the mean Spearman correlations are -0.256 and -0.027, respectively.
What carries the argument
The reward head's weight vector, which serves as the natural projection axis for turning internal activations into scalar reward contributions.
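To make the projection idea concrete, here is a minimal sketch of a reward-lens style readout, assuming a purely linear head and ignoring any final normalisation layer a real model would apply first. The function name `reward_lens`, the toy vectors, and the layer layout are illustrative assumptions, not the library's API.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def reward_lens(hidden_by_layer: Sequence[Sequence[float]],
                w_r: Sequence[float],
                bias: float = 0.0) -> List[float]:
    """Project each layer's hidden state (at the scored token) onto w_r.

    Hypothetical helper: it reads off the scalar reward the head would
    assign if the forward pass stopped at that layer.
    """
    return [dot(h, w_r) + bias for h in hidden_by_layer]


# Toy reward head direction and per-layer hidden states (made-up numbers).
w_r = [0.5, -1.0, 2.0]
hidden_by_layer = [
    [0.0, 0.0, 0.0],   # embedding output
    [1.0, 0.0, 0.5],   # after layer 1
    [1.0, -1.0, 1.0],  # after layer 2 (final)
]

lens = reward_lens(hidden_by_layer, w_r)
# Per-layer increments: how much each layer moved the projected reward.
deltas = [later - earlier for earlier, later in zip(lens, lens[1:])]

print(lens)    # [0.0, 1.5, 3.5]
print(deltas)  # [1.5, 2.0]
```

Reading the increments rather than the raw projections shows which layers move the score, which is the observational view that the patching experiments are later compared against.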
If this is right
- Observational attribution and causal patching views can be kept first-class and directly comparable in reward model analysis.
- A single adapter protocol supports multiple architectures including Llama, Mistral, Gemma-2, and multi-objective heads.
- New metrics such as distortion index and misalignment cascade detection become available for reward model inspection.
- Linear methods alone are insufficient to understand how reward models assign scores.
Where Pith is reading between the lines
- The mismatch between linear and causal measures may point to nonlinear interactions inside reward models that standard probes miss.
- The same centering approach could transfer to other scalar-output models outside RLHF.
- Exposing these disagreements could help identify concrete reward hacking patterns before deployment.
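The first point above can be illustrated with a toy model, offered only as a sketch under strong assumptions: a two-dimensional residual stream, one upstream component, and one downstream ReLU block. All names and numbers are hypothetical. The upstream component writes in a direction orthogonal to the reward direction, so its direct (linear) attribution is zero, yet patching it changes the reward because the nonlinear block reads what it wrote.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def add(a: Sequence[float], b: Sequence[float]) -> List[float]:
    return [x + y for x, y in zip(a, b)]


W_R = [1.0, 0.0]  # toy reward head direction


def downstream(resid: Sequence[float]) -> List[float]:
    """Nonlinear block: reads dimension 1, writes into dimension 0."""
    return [3.0 * max(0.0, resid[1]), 0.0]


def reward(component_out: Sequence[float]) -> float:
    """Toy forward pass: the component writes to the stream, then the block runs."""
    resid = add([0.0, 0.0], component_out)
    final = add(resid, downstream(resid))
    return dot(W_R, final)


clean_out = [0.0, 1.0]    # component output on the clean input
corrupt_out = [0.0, 0.0]  # component output on the corrupted input

# Observational view: project the component's own output onto w_r.
direct_attr = dot(W_R, clean_out)

# Causal view: swap in the corrupted output and rerun the forward pass.
patch_effect = reward(corrupt_out) - reward(clean_out)

print(direct_attr)   # 0.0
print(patch_effect)  # -3.0
```

The two numbers disagree by construction here. Whether real reward models disagree for the same reason is exactly the open question the bullet raises.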
Load-bearing premise
The adapted three-mode activation patching correctly isolates causal effects in the reward model's forward pass rather than introducing artifacts from the scalar head.
What would settle it
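A consistently positive Spearman correlation between linear attribution scores and patching effect sizes, replicated on additional reward models or input pairs, would undermine the reported negative finding. A correlation near zero would not, since that is what the paper already reports for ArmoRM (-0.027).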
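For readers who want to run this check themselves, the following is a minimal sketch of a Spearman rank correlation between per-component attribution scores and patching effects for a single input pair. It is written with the standard library only; the score lists are invented for illustration and the helper names are assumptions, not taken from the paper's code.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-component scores for one chosen/rejected pair.
attribution = [0.1, 0.4, 0.2, 0.9]
patch_effects = [0.8, 0.3, 0.5, 0.1]

rho = spearman(attribution, patch_effects)
print(round(rho, 6))  # -1.0
```

Averaging `rho` over all pairs would give a mean of the kind quoted in the abstract, under the assumption that the paper computes one correlation per pair.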
Original abstract
Every RLHF-trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit -- logit lens, direct logit attribution, activation patching, sparse autoencoders -- was built for generative LLMs whose primitives all project onto a vocabulary unembedding. Reward models replace that with a scalar regression head, breaking each tool. We present reward-lens, an open-source library that ports this toolkit to reward models, organised around one observation: the reward head's weight vector $w_r$ is the natural axis for every interpretability question. The library provides a Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE feature attribution, cross-model comparison, and five theory-grounded extensions (distortion index, divergence-aware patching, misalignment cascade detection, reward-term conflict analysis, concept-vector analysis). A ten-method adapter protocol covers Llama, Mistral, Gemma-2, and ArmoRM multi-objective heads, with a generic adapter for any HuggingFace sequence classification model. We validate on two production reward models across ~695 RewardBench pairs. The central empirical finding is negative: linear attribution does not predict causal patching effects (mean Spearman $\rho = -0.256$ on Skywork, $-0.027$ on ArmoRM). The framework treats this disagreement as a property to expose, not a bug -- motivating a design that keeps observational and causal views first-class and directly comparable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces reward-lens, an open-source library that ports mechanistic interpretability tools (logit lens, activation patching, SAEs) to reward models by centering analysis on the reward head weight vector w_r. It supplies component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE attribution, cross-model adapters for Llama/Mistral/Gemma-2/ArmoRM, and five theory-grounded extensions. Validation on Skywork and ArmoRM using ~695 RewardBench pairs yields the central negative result that linear attribution fails to predict causal patching effects (mean Spearman ρ = -0.256 on Skywork, -0.027 on ArmoRM), which the authors treat as a feature motivating separate observational and causal views.
Significance. If the negative correlation result holds after proper validation of the causal tools, it would demonstrate that standard linear attribution methods are unreliable for predicting causal effects in reward models, with direct implications for detecting reward hacking and misalignment in RLHF. The open-source library with broad HuggingFace adapters and reproducible code for multiple model families constitutes a practical strength that lowers barriers for follow-up work.
major comments (2)
- [Empirical validation paragraph and three-mode patching description] The central negative finding (linear attribution vs. patching mismatch) is load-bearing on the correctness of the adapted three-mode activation patching. The manuscript replaces the vocabulary unembedding with the scalar regression head w_r but provides no explicit checks, synthetic recovery experiments, or controls demonstrating that the scalar difference computation isolates feature-specific causal effects rather than global head artifacts or baseline sensitivities.
- [Validation results] The reported mean Spearman ρ values (-0.256, -0.027) are presented without error bars, per-pair distributions, or statistical significance tests. This omission prevents assessment of whether the negative correlation is robust or could be explained by variance in the ~695 RewardBench pairs or implementation details of the scalar head.
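The kind of synthetic recovery check the first major comment asks for can be sketched as follows, under the assumption of a linear reward head applied to a single final hidden state. Shifting that state by alpha times w_r should change the reward by alpha times the squared norm of w_r, and shifting it along a direction orthogonal to w_r should change nothing. The vectors and the value of `alpha` are made up. The identity holds exactly only for the representation fed directly into the head; a patch at an earlier layer passes through further nonlinear layers, which this sketch does not model.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def reward(h: Sequence[float], w_r: Sequence[float], bias: float = 0.0) -> float:
    """Linear reward head applied to a final hidden state."""
    return dot(w_r, h) + bias


def shift(h: Sequence[float], direction: Sequence[float], alpha: float) -> List[float]:
    """Return h + alpha * direction."""
    return [hi + alpha * di for hi, di in zip(h, direction)]


w_r = [2.0, -1.0, 0.5]       # toy head weights
h_clean = [0.3, 0.7, -1.2]   # toy final hidden state
alpha = 0.25

# Intervention along w_r: the expected reward change is alpha * ||w_r||^2.
expected = alpha * dot(w_r, w_r)
observed = reward(shift(h_clean, w_r, alpha), w_r) - reward(h_clean, w_r)

# Control: remove the w_r component from a probe vector (one Gram-Schmidt step).
probe = [1.0, 1.0, 1.0]
coeff = dot(probe, w_r) / dot(w_r, w_r)
v_orth = [p - coeff * w for p, w in zip(probe, w_r)]
observed_orth = reward(shift(h_clean, v_orth, alpha), w_r) - reward(h_clean, w_r)

print(abs(observed - expected) < 1e-9)  # True
print(abs(observed_orth) < 1e-9)        # True
```

A patching implementation that fails either check on such controlled inputs would be measuring something other than the intended reward change.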
minor comments (2)
- [Abstract] The abstract states that the library provides 'five theory-grounded extensions' but does not enumerate them; an explicit list would improve clarity.
- [Introduction] Notation for the reward head weight vector w_r is introduced without an equation reference in the opening paragraphs; adding a numbered definition would aid readers.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, particularly on the need for stronger validation of the causal patching methods and statistical reporting. We address each major comment below and will incorporate revisions to enhance the rigor of our empirical claims.
Point-by-point responses
- Referee: [Empirical validation paragraph and three-mode patching description] The central negative finding (linear attribution vs. patching mismatch) is load-bearing on the correctness of the adapted three-mode activation patching. The manuscript replaces the vocabulary unembedding with the scalar regression head w_r but provides no explicit checks, synthetic recovery experiments, or controls demonstrating that the scalar difference computation isolates feature-specific causal effects rather than global head artifacts or baseline sensitivities.
  Authors: We agree that additional validation is warranted to confirm that the three-mode activation patching isolates the intended causal effects. The adaptation is straightforward: the scalar reward difference is computed as w_r · (h_patched - h_clean), where h are the hidden states at the patched layer, which follows directly from the linearity of the reward head. However, to demonstrate that this does not capture global artifacts, we will add synthetic recovery experiments in the revised manuscript. Specifically, we will construct controlled interventions on known directions in activation space (e.g., by adding scaled versions of w_r itself or orthogonal vectors) and verify that the patching recovers the expected reward delta without spurious effects from baseline shifts. We will also include controls comparing patched effects to those from random activation perturbations of matched magnitude. These will be presented in a new 'Validation of Causal Tools' subsection.
  Revision: yes
- Referee: [Validation results] The reported mean Spearman ρ values (-0.256, -0.027) are presented without error bars, per-pair distributions, or statistical significance tests. This omission prevents assessment of whether the negative correlation is robust or could be explained by variance in the ~695 RewardBench pairs or implementation details of the scalar head.
  Authors: We acknowledge this presentation gap. In the revision, we will augment the results section with: (i) standard error bars computed across the 695 pairs, (ii) a distribution plot (e.g., histogram or boxplot) of the per-pair Spearman ρ values to show the spread, and (iii) a statistical test, such as a Wilcoxon signed-rank test or bootstrap confidence interval, to evaluate whether the mean ρ is significantly different from zero. These statistics will be computed using the existing experimental code and added to both the main text and the appendix for full transparency.
  Revision: yes
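The statistics promised in the second response can be sketched as below: a standard error and a percentile bootstrap confidence interval for the mean of per-pair Spearman ρ values. This is a minimal illustration with invented ρ values and a hypothetical helper name, not the authors' experimental code. It covers the bootstrap option only, not the Wilcoxon test.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def standard_error(values: Sequence[float]) -> float:
    """Standard error of the mean, using the sample standard deviation."""
    return statistics.stdev(values) / math.sqrt(len(values))


def bootstrap_mean_ci(values: Sequence[float],
                      n_boot: int = 2000,
                      level: float = 0.95,
                      seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(values)
    means: List[float] = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-pair Spearman rho values (not data from the paper).
rhos = [-0.41, -0.10, 0.05, -0.33, -0.52, 0.12,
        -0.27, -0.08, -0.36, 0.02, -0.19, -0.44]

mean_rho = statistics.fmean(rhos)
se = standard_error(rhos)
ci_low, ci_high = bootstrap_mean_ci(rhos)

print(f"mean rho = {mean_rho:.3f}, SE = {se:.3f}")
print(f"95% bootstrap CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

If the interval excludes zero, the mean ρ differs from zero at roughly the chosen level. An interval that straddles zero would support the weaker reading that attribution is simply uninformative about patching effects.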
Circularity Check
No significant circularity; central result is an independent empirical comparison
full rationale
The paper's load-bearing claim is the negative empirical result that linear attribution fails to predict causal patching effects (Spearman ρ near zero or negative) when both are run on external production reward models (Skywork, ArmoRM) using ~695 RewardBench pairs. This comparison does not reduce to any fitted parameter, self-definition, or self-citation chain within the paper. The library ports existing tools (logit lens, activation patching, SAEs) by replacing the vocabulary unembedding with the scalar head weight vector w_r; the mismatch between observational and causal views is treated as the finding itself rather than a quantity derived from prior assumptions. No equations or sections exhibit self-definitional loops, fitted-input predictions, or uniqueness theorems imported from the authors' own prior work. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The reward head's weight vector w_r is the natural axis for every interpretability question in reward models.