Rational Sparse Autoencoder

Naiyu Yin; Yue Yu

arxiv: 2606.14990 · v2 · pith:NRX72TCJnew · submitted 2026-06-12 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-27 04:26 UTC · model grok-4.3

keywords: sparse autoencoders, rational activations, mechanistic interpretability, language models, activation functions, reconstruction metrics, feature interpretability

The pith

Trainable rational functions replace fixed nonlinearities in sparse autoencoders and improve reconstruction and downstream metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Rational Sparse Autoencoder (RSAE) to relax the constraint of fixed encoder nonlinearities such as ReLU, JumpReLU, and TopK, each of which hard-codes a particular sparsity mechanism. It replaces the activation with a trainable rational function that can approximate those primitives on compact domains while adapting to the geometry of the pre-activations. The approach is a two-stage pipeline: an initialization that copies weights from a baseline SAE, fits rational coefficients by relaxed Remez exchange on synthetic data, and calibrates scale parameters, followed by standard fine-tuning. Experiments on residual-stream activations from three language models show strict gains on reconstruction metrics and downstream-behavior metrics while maintaining interpretability under sparse probing. A sympathetic reader would care because SAEs are central to extracting interpretable features from models, and a more flexible activation class could shift the achievable reconstruction-sparsity frontier.

Core claim

The RSAE replaces the fixed encoder activation with a trainable rational function. It is realized through a two-stage pipeline that copies baseline SAE weights, obtains rational coefficients via relaxed Remez exchange on synthetic data, calibrates scale parameters, and then fine-tunes under the standard sparsity-regularized reconstruction objective. On residual-stream activations of three open-weight language models and across three baseline activation families, the RSAE strictly improves both reconstruction-side metrics and downstream-behavior metrics after fine-tuning, without sacrificing feature-level interpretability under sparse probing, with gains consistent across host models, baseline activation families, and sparsity levels.
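As a rough illustration of the fine-tuning objective named above, the sketch below computes a reconstruction error plus a sparsity penalty for a single-layer SAE with a pluggable encoder activation. The L1 form of the penalty, the function name `sae_loss`, and all shapes are assumptions made for illustration; the paper only says "standard sparsity-regularised reconstruction objective".

```python
import numpy as np


def sae_loss(x, W_enc, b_enc, W_dec, b_dec, activation, lam):
    """Reconstruction MSE plus an L1 sparsity penalty (assumed form)."""
    pre = x @ W_enc + b_enc          # pre-activations, shape (batch, n_latents)
    z = activation(pre)              # latent code after the encoder nonlinearity
    x_hat = z @ W_dec + b_dec        # reconstruction, shape (batch, d_model)
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(z), axis=1))
    return mse + lam * l1, mse, l1


# Toy data: random stand-ins for residual-stream activations and SAE weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
W_enc = rng.normal(size=(4, 16))
W_dec = rng.normal(size=(16, 4))
b_enc = np.zeros(16)
b_dec = np.zeros(4)

relu = lambda t: np.maximum(t, 0.0)
total, mse, l1 = sae_loss(x, W_enc, b_enc, W_dec, b_dec, relu, lam=1e-3)
```

Because `activation` is just a callable, swapping the fixed nonlinearity for a rational one changes nothing else in the objective, which is the sense in which the RSAE can reuse the baseline weights.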

What carries the argument

The trainable rational function used as the encoder nonlinearity, which supplies a richer function class that uniformly approximates the activation primitives of existing SAE families while adapting to observed pre-activation geometry.
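A minimal sketch of such an activation, assuming the pole-free "safe-Padé" form that the paper's appendix describes, in which the denominator is 1 plus a sum of |b_j| |t|^j terms. The function name `safe_pade` and the example coefficients are illustrative choices, not values from the paper.

```python
from typing import Sequence


def safe_pade(t: float, a: Sequence[float], b: Sequence[float]) -> float:
    """Evaluate r(t) = P(t) / Q(t) with a pole-free denominator.

    P(t) = a[0] + a[1] t + ... + a[p] t^p
    Q(t) = 1 + |b[0]| |t| + ... + |b[q-1]| |t|^q   (always >= 1)
    """
    numerator = sum(a_k * t**k for k, a_k in enumerate(a))
    denominator = 1.0 + sum(abs(b_j) * abs(t) ** (j + 1) for j, b_j in enumerate(b))
    return numerator / denominator


# With a = (0, 1) and no denominator terms, the activation is the identity.
identity = [safe_pade(t, a=(0.0, 1.0), b=()) for t in (-2.0, 0.0, 3.0)]

# A type-(2, 1) example: r(t) = (t + t^2) / (1 + |t|).
# For t >= 0 this equals t exactly; at t = -1 it is 0.
values = [safe_pade(t, a=(0.0, 1.0, 1.0), b=(1.0,)) for t in (-1.0, 0.0, 2.0)]
```

Even this tiny type-(2, 1) example already looks ReLU-like on part of the line, which gives some intuition for why low-degree rationals are a plausible starting point for imitating SAE nonlinearities.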

If this is right

RSAE achieves a better reconstruction-versus-sparsity trade-off than fixed-activation baselines.
Downstream-behavior metrics improve while feature interpretability under sparse probing is preserved.
Gains remain consistent across all tested host language models, baseline activation families, and sparsity levels.
The method adds only a small number of scalar parameters per autoencoder and completes in minutes on consumer hardware.
The rational activation can be initialized to match any of the three standard SAE families before adaptation.
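To make the last point concrete, here is a hedged sketch of initialising rational coefficients to imitate ReLU on [−1, 1]. It deliberately swaps in a simple linearised least-squares fit, not the relaxed Remez exchange the paper uses, and it uses the signed standard-Padé denominator. The function name, grid size, and type (3, 2) are assumptions for illustration.

```python
import numpy as np


def fit_rational_linearised(t, y, p, q):
    """Least-squares solve of P(t) - y * (Q(t) - 1) ≈ y for the coefficients.

    P has degree p, Q(t) = 1 + b_1 t + ... + b_q t^q (signed, standard-Pade form).
    This is a linearised L2 fit, NOT the relaxed Remez exchange.
    """
    P_cols = np.stack([t**k for k in range(p + 1)], axis=1)
    Q_cols = np.stack([-y * t**j for j in range(1, q + 1)], axis=1)
    design = np.concatenate([P_cols, Q_cols], axis=1)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = design @ coef - y     # equals P(t) - y * Q(t) on the grid
    return coef[: p + 1], coef[p + 1 :], residual


t = np.linspace(-1.0, 1.0, 401)      # synthetic grid on the compact domain
y = np.maximum(t, 0.0)               # ReLU target
a, b, residual = fit_rational_linearised(t, y, p=3, q=2)
```

A signed denominator fitted this way is not guaranteed to stay away from zero on the domain, which is one reason a pole-free parameterisation is attractive before handing the coefficients to gradient-based fine-tuning.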

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Rational activations could be tested as drop-in replacements in other sparse coding or dictionary-learning settings beyond SAEs.
The same two-stage Remez-plus-fine-tuning pipeline might be applied to learn other parametric activation families for interpretability tools.
If rational functions adapt to pre-activation geometry, they may surface feature types that fixed nonlinearities systematically miss in language-model residuals.
Scaling the approach to larger models would require checking whether the added parameters remain negligible relative to total SAE size.

Load-bearing premise

The two-stage initialization with relaxed Remez on synthetic data plus scale calibration, followed by standard fine-tuning, reliably yields a rational function that outperforms the original fixed nonlinearity without introducing new instabilities or metric-specific artifacts.
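One way to picture the scale-calibration step is sketched below. This is an assumption about what calibration could look like, not the paper's procedure: a rational fitted on [−1, 1] is wrapped as t ↦ s · r(t / s), with s chosen from a quantile of the observed absolute pre-activations. The helper names `calibrate_scale` and `rescaled` are hypothetical.

```python
import numpy as np


def calibrate_scale(pre_activations, quantile=0.999):
    """Pick s so that most |pre-activations| / s fall inside [-1, 1]."""
    s = float(np.quantile(np.abs(pre_activations), quantile))
    return max(s, 1e-8)  # guard against an all-zero batch


def rescaled(activation, s):
    """Wrap a unit-domain activation r as t -> s * r(t / s)."""
    return lambda t: s * activation(t / s)


relu = lambda t: np.maximum(t, 0.0)
pre = np.array([-30.0, -5.0, 0.0, 2.0, 12.0, 40.0])
s = calibrate_scale(pre, quantile=1.0)   # here: the max absolute value, 40.0
out = rescaled(relu, s)(pre)
```

ReLU is positively homogeneous, so the wrapper leaves it unchanged; for a rational approximation the wrapper matters, because the approximation is only trusted on the interval it was fitted on.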

What would settle it

Apply the RSAE pipeline to residual-stream activations from one of the three tested language models; if after fine-tuning the reconstruction error or any downstream-behavior metric fails to improve over the corresponding baseline SAE on the same data, the central empirical claim is false.
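The test described above can be phrased as a small check. In this sketch the metric names `mse` and `delta_ce` and all numbers are made up for illustration; only the logic (every listed metric must improve strictly) reflects the claim being tested.

```python
from typing import Dict, List

# Hypothetical metric names; for both, lower is better.
STRICT_METRICS = ("mse", "delta_ce")


def failing_metrics(baseline: Dict[str, float], rsae: Dict[str, float]) -> List[str]:
    """Return the metrics on which the RSAE does NOT strictly beat the baseline."""
    return [name for name in STRICT_METRICS if not rsae[name] < baseline[name]]


# Illustrative numbers only, not results from the paper.
baseline = {"mse": 0.120, "delta_ce": 0.050}
rsae_run = {"mse": 0.100, "delta_ce": 0.050}
print(failing_metrics(baseline, rsae_run))  # -> ['delta_ce']
```

A tie counts as a failure here because the claim is one of strict improvement; a non-empty list on any tested model, family, or sparsity level would contradict it.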

Figures

Figures reproduced from arXiv: 2606.14990 by Naiyu Yin, Yue Yu.

**Figure 1.** Rational approximation of SAE activation primitives (ReLU and JumpReLU) on [−1, 1]. Best-MSE rational fits of ReLU (figure (a) and figure (b)) and JumpReLU with θ = 0.1 (figure (c) and figure (d)) under three procedures: the relaxed Remez exchange (red for standard-Padé and purple for safe-Padé), the L2 fit (blue), and the smoothed L∞ fit (green). Figure (b) and figure (d) zoom into the kink and the jump…

**Figure 2.** Pareto fronts on Pythia-160m for all three baseline activation families. Subfigures (a), (b), (c) plot MSE vs. ℓ0, and subfigures (d), (e), (f) plot MSE vs. alive. Subfigures (a) and (d) use ReLU as the teacher, (b) and (e) use JumpReLU, and (c) and (f) use TopK. The black star is the teacher SAE; the red curve traces the RSAE Pareto front under a λ sweep. The green sweet-zone marks the strict-Pareto-domin…

**Figure 3.** L2 MSE versus rational degree on [−1, 1]. Each curve traces the best L2 MSE attained by one fitter as the type (p, q) is swept across {(3, 2), (5, 4), . . . , (19, 18)}. (a) On ReLU, standard-Padé Remez (red) decays near-exponentially with degree until type (15, 14), beyond which numerical conditioning of the linearised exchange dominates and the curve flattens. The pole-free safe-Padé L2 (blue) and L∞ (g…

Original abstract

Sparse autoencoders (SAEs) are standard tools for mechanistic interpretability, but current SAE families are constrained by fixed encoder nonlinearities such as ReLU, JumpReLU, and TopK. This hard-codes a particular sparsity mechanism into the model and can distort the reconstruction-versus-sparsity trade-off. We introduce the Rational Sparse Autoencoder (RSAE), which replaces the fixed encoder activation with a trainable rational function. Rational activations are flexible enough to uniformly approximate the activation primitives used by existing SAE families on compact domains (for TopK, the thresholded gate obtained after a separating top-k threshold is supplied), while also providing a richer function class for adapting to the observed pre-activation geometry. We realise this idea through a two-stage pipeline: an initialisation procedure that copies the pre-trained baseline SAE weights, plugs in rational coefficients obtained by the relaxed Remez exchange on synthetic data, and calibrates the scale parameters along with the rational coefficients; followed by a fine-tuning step under the standard sparsity-regularised reconstruction objective. Empirically, on residual-stream activations of three open-weight language models and across all three baseline activation families, the RSAE strictly improves on it after the fine-tuning step, both on reconstruction-side metrics and on downstream-behaviour metrics, without sacrificing feature-level interpretability under sparse probing. These gains are consistent across host language models, across baseline activation families, and across the full range of baseline sparsity we tested, while the upgrade itself adds only a handful of scalar parameters per autoencoder and runs in minutes on a single consumer GPU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RSAE replaces fixed SAE nonlinearities with a trainable rational activation and reports gains after fine-tuning, but without a fine-tuned fixed baseline control the attribution to the new form stays uncertain.

The main takeaway is that RSAE introduces a trainable rational function as the encoder activation in sparse autoencoders, which can approximate the standard fixed ones like ReLU or TopK. This flexibility is the core novelty, and the two-stage initialization with relaxed Remez on synthetic data followed by fine-tuning is a solid way to plug it in without starting from scratch.

The paper does a few things well. It shows that rational functions provide a richer class while keeping the parameter count low, just a handful of scalars per autoencoder. The empirical tests cover three language models and three baseline families, with reported strict improvements on reconstruction and downstream-behavior metrics and no loss of interpretability under sparse probing. The consistency across sparsity levels is a plus, and the upgrade runs quickly on consumer hardware.

The soft spot is the missing control. The improvements come after fine-tuning the RSAE under the standard objective, but there is no parallel run where the original baseline SAEs with fixed activations get the same fine-tuning treatment. Without that, it's hard to know whether the rational activation is responsible or if any extra optimization would produce similar lifts. The abstract and stress-test note both point to this gap, and it directly affects how much credit the new activation deserves. If the full paper has that control, it would strengthen things considerably; otherwise the attribution stays provisional.

This work is aimed at researchers in mechanistic interpretability who rely on SAEs for feature dictionaries. Anyone tuning sparsity mechanisms or looking for better reconstruction-sparsity trade-offs could find it useful to try. It shows clear thinking on how to relax the fixed nonlinearity constraint.

I would recommend sending it for peer review. The idea is worth checking out in detail, even with the control issue that needs addressing in revision.

Referee Report

1 major / 2 minor

Summary. The paper proposes Rational Sparse Autoencoders (RSAEs), which replace the fixed encoder nonlinearity (ReLU, JumpReLU, TopK) in standard SAEs with a trainable rational function. It describes a two-stage procedure (copying baseline weights and initializing rational coefficients via relaxed Remez exchange on synthetic data plus scale calibration, then fine-tuning under the standard sparsity-regularized reconstruction loss) and claims that the resulting RSAEs strictly outperform the original baselines on reconstruction and downstream-behavior metrics, without losing interpretability under sparse probing, across three language models, three activation families, and a range of sparsity levels, while adding only a few scalar parameters per autoencoder.

Significance. If the attribution to the rational activation class holds after proper controls, the work would provide a lightweight, more flexible activation primitive that can approximate existing SAE nonlinearities while adapting to data geometry. This could meaningfully improve the reconstruction-sparsity frontier in mechanistic interpretability without requiring architectural overhauls, and the minimal parameter overhead plus rapid fine-tuning would make it practically attractive.

major comments (1)

[Experimental results / evaluation protocol] The central empirical claim attributes performance gains to the trainable rational activation, yet the described pipeline fine-tunes only the RSAE (baseline weights + rational init) and reports no control in which the original baseline SAEs (fixed nonlinearity) receive identical fine-tuning steps and hyperparameters. Without this comparison, it is impossible to separate the effect of the rational function class from the effect of additional optimization on the linear weights. This directly undermines the attribution in the strongest claim and must be addressed with new experiments.

minor comments (2)

[Abstract / introduction] The abstract states that rational functions 'uniformly approximate the activation primitives used by existing SAE families on compact domains' but does not specify the domain size or the approximation error achieved for each baseline family; a short quantitative statement or reference to a supplementary table would clarify the claim.
[Methods] Notation for the rational function (numerator/denominator degrees, coefficient vectors) is introduced without an explicit equation; adding a numbered equation in the methods section would improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the experimental protocol. The point raised is valid and we will address it directly with additional controls in the revision.

Referee: [Experimental results / evaluation protocol] The central empirical claim attributes performance gains to the trainable rational activation, yet the described pipeline fine-tunes only the RSAE (baseline weights + rational init) and reports no control in which the original baseline SAEs (fixed nonlinearity) receive identical fine-tuning steps and hyperparameters. Without this comparison, it is impossible to separate the effect of the rational function class from the effect of additional optimization on the linear weights. This directly undermines the attribution in the strongest claim and must be addressed with new experiments.

Authors: We agree that the current experiments do not isolate the contribution of the rational activation from the effect of additional fine-tuning on the linear weights. In the revised manuscript we will add the requested control: each baseline SAE (with its original fixed nonlinearity) will be fine-tuned for the same number of steps, using identical optimizer settings, learning rate schedule, sparsity coefficient, and batch size as the corresponding RSAE. We will report reconstruction, sparsity, and downstream metrics for these fine-tuned baselines alongside the RSAE results. This will allow a direct comparison that attributes any remaining gains to the trainable rational function.

Revision: yes
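The control described in this response amounts to a pair of runs that share every hyperparameter except the activation. A minimal sketch of that bookkeeping follows; the class name `FineTuneRun`, its fields, and the placeholder values are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, replace, asdict


@dataclass(frozen=True)
class FineTuneRun:
    activation: str          # "fixed" (original nonlinearity) or "rational"
    steps: int
    learning_rate: float
    lr_schedule: str
    sparsity_coeff: float
    batch_size: int
    seed: int


def control_pair(rsae_run: FineTuneRun):
    """Return (control, treatment) runs that differ only in the activation."""
    control = replace(rsae_run, activation="fixed")
    return control, rsae_run


# Placeholder hyperparameters, not values from the paper.
rsae = FineTuneRun("rational", steps=10_000, learning_rate=1e-4,
                   lr_schedule="cosine", sparsity_coeff=1e-3,
                   batch_size=4096, seed=0)
control, treatment = control_pair(rsae)
differing = [k for k in asdict(control) if asdict(control)[k] != asdict(treatment)[k]]
```

Deriving the control from the treatment with `replace` makes it hard for the two configurations to drift apart, so any remaining metric gap can be attributed to the activation rather than to extra optimization.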

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks, not self-definition or fitted inputs

The paper describes a two-stage initialization (relaxed Remez on synthetic data plus scale calibration) followed by fine-tuning under the standard objective, then reports empirical gains on reconstruction, sparsity, and probing metrics across three models and three activation families. No quoted equations or steps reduce these measured improvements to quantities defined by the fitted rational coefficients themselves, nor do any self-citations serve as load-bearing uniqueness theorems. The initialization draws on standard approximation theory rather than prior author work, and the results are presented as direct comparisons against fixed-nonlinearity baselines on held-out data. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Abstract-only review supplies insufficient detail to enumerate free parameters, axioms, or invented entities beyond the obvious trainable rational coefficients; no explicit background assumptions are stated.

free parameters (1)

rational function coefficients
Trainable scalars in the rational activation that are fitted during initialization and fine-tuning.
