Fast Multi-dimensional Refusal Subspaces via RFM-AGOP

Thomas Winninger

arxiv: 2607.02396 · v1 · pith:5F4N3DC4new · submitted 2026-07-02 · 💻 cs.AI · cs.LG

Pith reviewed 2026-07-03 13:22 UTC · model grok-4.3

keywords refusal subspaces · multi-dimensional subspaces · recursive feature machine · LLM activation steering · model safety · probe initialization · Qwen models

The pith

Adapting the Recursive Feature Machine with probe-informed initialization locates multi-dimensional refusal subspaces in large language models within seconds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that refusal behavior in LLMs lives in multi-dimensional subspaces rather than single directions, and that existing extraction methods are too slow for practical use on reasoning models. It shows that an efficient adaptation of the Recursive Feature Machine algorithm, initialized with a probe, recovers these subspaces quickly on both reasoning and non-reasoning models. A sympathetic reader would care because this could make activation steering and monitoring for safety feasible at scale where current approaches are prohibitive. The method also outperforms alternatives on an ablation task. The authors note that further work is needed to compare subspaces across methods.

Core claim

By adapting the Recursive Feature Machine algorithm with a probe-informed initialization, the multi-dimensional refusal subspace can be identified in seconds on both reasoning models such as Qwen 3 and non-reasoning models such as Qwen 2.5, while also achieving better performance on the ablation task than prior alternatives.

What carries the argument

RFM-AGOP, the adapted Recursive Feature Machine algorithm with probe-informed initialization, which efficiently computes the multi-dimensional subspace encoding refusal.
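To make the mechanism concrete, here is a minimal NumPy sketch of a Recursive Feature Machine loop driven by the average gradient outer product (AGOP). It is not the paper's implementation, which the abstract does not describe. Everything specific is an assumption for illustration: the Gaussian kernel, the trace normalization, the default bandwidth, the function name `rfm_agop_subspace`, and above all the probe-informed start `M = I + d * w w^T`, where `w` is a hypothetical probe direction such as a difference of class means.

```python
import numpy as np


def mahalanobis_sq(X, Z, M):
    """Pairwise squared distances (x - z)^T M (x - z) for a symmetric M."""
    XM = X @ M
    xx = np.sum(XM * X, axis=1)[:, None]
    zz = np.sum((Z @ M) * Z, axis=1)[None, :]
    return np.maximum(xx + zz - 2.0 * XM @ Z.T, 0.0)


def rfm_agop_subspace(X, y, probe_dir, k=2, n_iters=5, bandwidth=None, reg=1e-3):
    """Return (basis, M): the top-k eigenvectors of the final AGOP matrix M.

    X: (n, d) activations, y: (n,) targets, probe_dir: (d,) probe direction.
    """
    n, d = X.shape
    L2 = float(d) if bandwidth is None else bandwidth ** 2
    w = probe_dir / np.linalg.norm(probe_dir)
    # ASSUMPTION: probe-informed start, identity plus extra weight on the probe direction.
    M = np.eye(d) + d * np.outer(w, w)
    for _ in range(n_iters):
        # ASSUMPTION: rescale so trace(M) = d, keeping the kernel scale comparable.
        M = M * d / np.trace(M)
        K = np.exp(-mahalanobis_sq(X, X, M) / (2.0 * L2))
        # Kernel ridge regression with the current feature matrix M.
        alpha = np.linalg.solve(K + reg * np.eye(n), y)
        # Gradient of f(x) = sum_j alpha_j K(x, x_j) at each training point x_i:
        # grad f(x_i) = -(1 / L2) * M @ sum_j alpha_j K_ij (x_i - x_j)
        W = K * alpha[None, :]
        diffs = W.sum(axis=1)[:, None] * X - W @ X
        G = -(diffs @ M) / L2
        # Average gradient outer product over the training points.
        M = G.T @ G / n
    _, eigvecs = np.linalg.eigh(M)  # eigenvalues come back in ascending order
    return eigvecs[:, ::-1][:, :k], M
```

The returned `basis` has orthonormal columns, so it can be passed directly to a projection-based ablation. Each iteration costs one n-by-n linear solve, which is cheap when the number of activation samples is modest; whether that matches the cost profile reported in the paper cannot be checked from the abstract.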

If this is right

The subspace can be extracted in seconds rather than requiring prohibitive computation on long reasoning traces.
The approach outperforms existing methods on the ablation task.
RFM could serve as a cheap and scalable complement to existing subspace-extraction techniques if the results hold.
The method applies equally to reasoning and non-reasoning models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the recovered subspace proves causal, the speed gain would allow routine safety checks during model serving rather than offline analysis only.
The same initialization trick might extend to extracting subspaces for other behaviors that current methods find expensive to isolate.
Direct comparisons of the subspaces produced by RFM-AGOP versus prior techniques would clarify whether they capture overlapping or distinct aspects of refusal.
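
One standard way to run the subspace comparison mentioned in the last point is to compute principal angles between the two bases. The sketch below is a generic linear-algebra recipe, not something the paper reports; the function name `principal_angles` and the inputs `A` and `B` (matrices whose columns span each subspace) are illustrative assumptions.

```python
import numpy as np


def principal_angles(A, B):
    """Principal angles in radians, ascending, between the column spans of A and B."""
    Qa, _ = np.linalg.qr(A)  # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis for span(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    sigma = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(sigma, -1.0, 1.0))
```

Angles near zero indicate shared directions, while angles near pi/2 indicate directions that one method finds and the other does not.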

Load-bearing premise

The subspace recovered by the method actually encodes causal refusal behavior rather than some correlated but non-causal direction in activation space.

What would settle it

Steering or ablating along the recovered subspace produces no measurable change in the model's refusal rate on harmful queries.
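
The ablation half of this test is commonly implemented as an orthogonal projection that removes the subspace component from each activation vector. The abstract does not give the paper's exact protocol, so the following is only a sketch under that assumption; `ablate_subspace`, `H` (a batch of activations, one per row), and `B` (a matrix whose columns span the candidate subspace) are hypothetical names.

```python
import numpy as np


def ablate_subspace(H, B):
    """Remove from each row of H its component in the column span of B."""
    Q, _ = np.linalg.qr(B)  # orthonormalize so that Q @ Q.T is a projector
    return H - (H @ Q) @ Q.T
```

Comparing refusal rates on harmful prompts with and without this projection applied to the model's activations is one way to check whether the subspace matters for the behavior.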

Original abstract

Steering and monitoring activations in Large Language Models (LLMs) are increasingly used for both safety and interpretability. Early work assumed behaviours are encoded along single linear directions, but recent findings suggest complex behaviours, such as the refusal to answer harmful queries, live in multi-dimensional subspaces. However, existing methods for extracting these subspaces are computationally expensive, which becomes prohibitive on reasoning models who produce long reasoning traces. By adapting the Recursive Feature Machine (RFM) algorithm -- which can be computed efficiently -- with a probe-informed initialization, we are able to identify the multi-dimensional refusal subspace in seconds, on reasoning (Qwen 3) and non-reasoning (Qwen 2.5) models. While RFM allows for faster subspace identification, it also showed better performances on the ablation task than its alternatives. More work is planned to better understand the relations between subspaces found by different methods. If confirmed, RFM could be a cheap and scalable complement to existing subspace-extraction methods in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RFM-AGOP gives a fast way to pull multi-dimensional refusal subspaces from LLMs, but the evidence that those directions actually control refusal is still thin.

The main point is that the paper adapts the Recursive Feature Machine algorithm with a probe-informed initialization to extract refusal subspaces in seconds on both reasoning and non-reasoning Qwen models. That speed is the concrete advance over prior methods, which the abstract describes as prohibitively expensive on long reasoning traces.

The paper does well to show that the approach runs quickly across model types, and it reports better ablation results than the alternatives it compares against. The efficiency claim is straightforward and addresses a real bottleneck for anyone working with activation steering on complex behaviors.

The soft spots sit in the validation. The abstract gives no details on the ablation protocol, metrics, or controls for dimensionality and capability preservation, so it is hard to judge how meaningful the performance edge is. The authors themselves flag that more work is needed to compare subspaces across methods, which leaves the causal status open. The load-bearing assumption is that the recovered directions encode refusal rather than simply correlating with the probe labels; without intervention results that isolate the effect, that link stays provisional.

This is aimed at the LLM safety and interpretability group that needs quicker tools for multi-dimensional behaviors. A reader focused on practical extraction methods would get value from the speed if the ablations check out.

It deserves peer review. The efficiency contribution is specific enough to warrant checking the full methods and results even with the open questions on causality.

Referee Report

2 major / 1 minor

Summary. The paper introduces RFM-AGOP, an adaptation of the Recursive Feature Machine (RFM) algorithm with probe-informed initialization, to efficiently extract multi-dimensional refusal subspaces in LLMs. It claims this identifies the subspaces in seconds on both reasoning (Qwen 3) and non-reasoning (Qwen 2.5) models while showing better performance than alternatives on an ablation task, positioning it as a scalable complement to existing methods for safety and interpretability.

Significance. If the efficiency and performance claims hold with proper validation, the approach would offer a computationally cheap method for subspace extraction on complex models, addressing a key bottleneck for reasoning LLMs and enabling more practical steering/monitoring applications.

major comments (2)

[Abstract] The claim that RFM-AGOP 'showed better performances on the ablation task' is load-bearing for the performance superiority assertion, yet the manuscript provides no description of the ablation protocol, metric (e.g., refusal rate on harmful vs. benign prompts), controls, or dataset details, preventing evaluation of whether the comparison supports the central claim.
[Abstract/Results] The note that 'more work is planned to better understand the relations between subspaces found by different methods' indicates the recovered subspace's causal responsibility for refusal behavior (vs. mere correlation with probe labels) is not established, which is required to substantiate that the method solves the intended safety/interpretability problem rather than recovering a correlated direction.

minor comments (1)

[Title/Abstract] 'RFM-AGOP' appears in the title, but the abstract expands only 'RFM' and neither defines AGOP nor cites the base RFM algorithm.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate where revisions will be made to improve clarity and substantiation of our claims.

Referee: [Abstract] The claim that RFM-AGOP 'showed better performances on the ablation task' is load-bearing for the performance superiority assertion, yet the manuscript provides no description of the ablation protocol, metric (e.g., refusal rate on harmful vs. benign prompts), controls, or dataset details, preventing evaluation of whether the comparison supports the central claim.

Authors: We agree that the abstract does not include sufficient detail on the ablation protocol to allow independent evaluation. The full manuscript describes the ablation using refusal rate as the primary metric on a balanced set of harmful and benign prompts with matched controls, but this information is not summarized in the abstract. We will revise the abstract to briefly specify the metric, high-level protocol, and dataset characteristics so that the performance claim can be assessed from the abstract alone.

revision: yes

Referee: [Abstract/Results] The note that 'more work is planned to better understand the relations between subspaces found by different methods' indicates the recovered subspace's causal responsibility for refusal behavior (vs. mere correlation with probe labels) is not established, which is required to substantiate that the method solves the intended safety/interpretability problem rather than recovering a correlated direction.

Authors: The ablation experiments measure the functional impact of removing the extracted subspace on refusal rates, which provides evidence that the subspace is more directly tied to the behavior than probe-label correlation alone. We nevertheless accept the referee's point that this does not constitute a full causal demonstration (e.g., via activation patching or counterfactual interventions). The sentence about planned future work accurately reflects the current scope; we will revise the abstract and discussion to explicitly frame the ablation results as evidence of functional utility rather than complete causal proof, while retaining the honest statement about additional work needed.

revision: partial

Circularity Check

0 steps flagged

No circularity: adaptation of external RFM algorithm with empirical claims

The paper describes an adaptation of the pre-existing Recursive Feature Machine (RFM) algorithm using probe-informed initialization to extract refusal subspaces, claiming computational efficiency and better ablation performance. No equations, fitting procedures, or derivation steps are shown that reduce a claimed prediction or result to its own inputs by construction. The central claim rests on reported empirical outcomes rather than a self-definitional loop or load-bearing self-citation chain. The provisional note on future work regarding subspace relations further indicates the argument is not closed by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that RFM recovers causally relevant subspaces and that the probe initialization does not introduce circularity.

pith-pipeline@v0.9.1-grok · 5696 in / 1097 out tokens · 35876 ms · 2026-07-03T13:22:02.811265+00:00 · methodology
