pith. machine review for the scientific record.

arxiv: 2605.11887 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:31 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG
keywords: sparse autoencoders · Qwen models · model steering · mechanistic interpretability · LLM development · feature activation · post-training optimization · safety data synthesis

The pith

Sparse autoencoders on Qwen models function as reusable interfaces for steering outputs, analyzing benchmarks, synthesizing data, and guiding fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Qwen-Scope, an open collection of sparse autoencoders trained across multiple Qwen3 and Qwen3.5 models. It shows these autoencoders can move beyond explaining model internals after the fact to serve as active tools in four workflows: directing generation by activating chosen feature directions, using feature patterns to diagnose what benchmarks actually measure, supporting data filtering and creation for safety, and supplying training signals that reduce repetition or code-switching. A reader would care if this holds because it converts representation-level decompositions into direct levers for model behavior without retraining from scratch. The core demonstration is that one set of learned features can support inference control, evaluation proxies, data work, and optimization objectives at once.
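To make the first of those workflows concrete, here is a minimal sketch of inference-time feature steering via a forward hook. The stand-in layer, decoder matrix, feature index, and scale are illustrative assumptions, not Qwen-Scope's actual interface.

```python
# Minimal sketch of SAE feature steering via a forward hook (illustrative; not Qwen-Scope's API).
import torch
import torch.nn as nn

d_model, d_sae = 64, 512  # toy sizes; real dictionaries are far larger

# Stand-in for a pretrained SAE decoder: one direction in residual-stream space per feature.
W_dec = torch.randn(d_sae, d_model)
W_dec = W_dec / W_dec.norm(dim=-1, keepdim=True)  # unit-norm rows, a common convention

feature_id = 15   # placeholder; in practice chosen from the trained SAE (Fig. 3 names feature 6159)
alpha = 8.0       # steering strength; a negative value suppresses the feature instead

def steering_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output.
    return output + alpha * W_dec[feature_id]

layer = nn.Linear(d_model, d_model)  # stand-in for the hooked transformer sublayer
handle = layer.register_forward_hook(steering_hook)
steered = layer(torch.randn(2, 10, d_model))  # (batch, seq, d_model)
handle.remove()
```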

Core claim

SAEs serve not only as post-hoc analysis tools but also as reusable representation-level interfaces for diagnosing, controlling, evaluating, and improving large language models, demonstrated by inference-time steering via feature directions, evaluation analysis through activated features, data-centric tasks including toxicity classification and safety synthesis, and incorporation of SAE signals into supervised fine-tuning and reinforcement learning to address code-switching and repetition.

What carries the argument

The Qwen-Scope collection: 14 groups of sparse autoencoders across seven Qwen3 and Qwen3.5 variants, which decompose activations into sparse features usable for development operations.
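For orientation, a toy version of that decomposition is sketched below, assuming a TopK-style encoder (one common SAE variant; the suite's actual architectures and hyperparameters are not specified in this summary).

```python
# Toy TopK sparse autoencoder: keep only the k largest feature pre-activations per input.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)  # activations -> feature pre-activations
        self.dec = nn.Linear(d_sae, d_model)  # sparse features -> reconstruction
        self.k = k

    def forward(self, h: torch.Tensor):
        pre = self.enc(h)
        top = torch.topk(pre, self.k, dim=-1)  # indices and values of the k largest features
        f = torch.zeros_like(pre).scatter_(-1, top.indices, torch.relu(top.values))
        return self.dec(f), f                  # reconstruction, sparse code

sae = TopKSAE(d_model=64, d_sae=1024, k=32)
recon, feats = sae(torch.randn(4, 64))  # feats has at most 32 nonzeros per row
```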

If this is right

  • Feature directions enable direct control over language, concepts, and preferences during inference without any weight updates.
  • Activated features supply a representation-level proxy for measuring benchmark redundancy and capability coverage (a toy version of this overlap computation is sketched after this list).
  • The same features support practical data tasks such as multilingual toxicity detection and safety-oriented data synthesis.
  • SAE-derived signals can be added to supervised fine-tuning and reinforcement learning objectives to reduce repetition and code-switching.
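A toy version of the overlap computation referenced in the second bullet, assuming activated-feature sets are compared by Jaccard overlap; the paper's actual aggregation in §4.2 may differ.

```python
# Toy redundancy proxy between two benchmarks via overlap of activated feature sets.
import torch

def active_feature_set(feats: torch.Tensor, threshold: float = 0.0) -> set:
    # feats: (num_examples, d_sae) SAE activations collected while the model answers a benchmark
    return set(torch.nonzero(feats.max(dim=0).values > threshold).flatten().tolist())

def jaccard(a: set, b: set) -> float:
    return len(a & b) / max(len(a | b), 1)

# Synthetic stand-ins for two benchmarks' activation matrices (sparse by construction).
bench_a = active_feature_set(torch.rand(100, 1024) * (torch.rand(100, 1024) > 0.95))
bench_b = active_feature_set(torch.rand(100, 1024) * (torch.rand(100, 1024) > 0.95))
print(f"feature overlap: {jaccard(bench_a, bench_b):.2f}")  # high overlap suggests redundancy
```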

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If feature stability holds across releases, developers could maintain persistent control interfaces rather than re-deriving them for each model update.
  • The approach opens the possibility of automated auditing pipelines that flag capability gaps directly from feature activation patterns.
  • Wider adoption might shift emphasis from full retraining toward targeted edits at the representation level when undesirable behaviors appear.

Load-bearing premise

The learned SAE features remain stable, interpretable, and causally effective enough to support reliable steering and training signals across uses without introducing new unintended behaviors.

What would settle it

A controlled test in which activating an SAE feature direction intended to reduce repetition instead increases repetition rate or produces new inconsistencies on held-out prompts.
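A hedged sketch of that test, with `generate` and `steer` as stand-ins for the model's generation function and the steering intervention; neither is the paper's API.

```python
# Does the repetition-reducing intervention ever backfire on held-out prompts?
from collections import Counter

def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of n-grams that are repeats of an earlier n-gram."""
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return sum(c - 1 for c in Counter(grams).values()) / len(grams)

def intervention_helps(prompts, generate, steer) -> bool:
    base = sum(repetition_ratio(generate(p)) for p in prompts) / len(prompts)
    steered = sum(repetition_ratio(generate(p, intervention=steer)) for p in prompts) / len(prompts)
    return steered < base  # the reusable-interface claim is falsified if steering makes it worse
```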

Figures

Figures reproduced from arXiv: 2605.11887 by An Yang, Baosong Yang, Boyi Deng, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jialong Tang, Jingren Zhou, Qian Cao, Ruize Gao, Tianhao Li, Xiaodong Deng, Xuancheng Ren, Xu Wang, Yaoning Wang, Yubo Ma, Yu Wan.

Figure 1. Overview of Qwen-Scope. SAEs trained on Qwen3/3.5 serve as a common interface for four practical directions: interpretable steering at inference time, capability-aware benchmark analysis, feature-guided data workflows, and targeted post-training for model improvement.
Figure 2. Illustration of the two-step SAE-based steering pipeline: (1) contrastive feature identification, …
Figure 3. SAE features provide interpretable handles for model analysis and control. Left: SAE activations can be used to diagnose undesirable generation behavior. When the model is prompted in English, the response unexpectedly mixes in Chinese text. Ranking SAE features by activation strength reveals a highly activated Chinese-language feature (id: 6159). Suppressing this feature during generation removes the unexpected la…
Figure 4. Illustration of the proposed SAE-based benchmark analysis framework, covering feature …
Figure 5. Spearman rank correlation between performance-based redundancy …
Figure 6. Feature overlap matrix for eight benchmarks. Entry …
Figure 7. Overview of the SAE-based toxicity classification pipeline. Feature discovery is performed on a …
Figure 8. Layer-wise F1 of the SAE-based toxicity classifier on English. Left: Qwen3-1.7B. Right: Qwen3-8B. The curves report held-out F1 across layers using top-K toxic-biased SAE features discovered in English, with K ∈ {1, 2, 5, 10}. Star markers indicate the best F1 layer for each model. Without training any additional classifier, the existing SAE can be used directly for classification, achieving an F1 score ab…
Figure 9. Cross-lingual structure and transfer of toxic SAE features. Panel (a) shows the overlap of top-10 toxic SAE features in Qwen3-1.7B, and panel (b) shows the same for Qwen3-8B. Panel (c) shows the layer-wise mean overlap, with shaded bands indicating the interquartile range (IQR) across language pairs. Panel (d) shows the best held-out F1 for each test language using features discovered in English. Panel (e)…
Figure 10. A proxy for selecting high-performing layers before evaluation. Row 1 shows Qwen3-1.7B, and Row 2 shows Qwen3-8B; columns correspond to English, Russian, French, and Chinese. In each subplot, the curve shows held-out F1 over layers, the yellow star marks the best evaluation layer, and the yellow cross marks the layer whose strongest discovered feature most clearly separates toxic from clean examples. top1…
Figure 11. Relative improvement from multi-layer composition. Left: Qwen3-1.7B. Right: Qwen3-8B. For each language, the bar shows the relative improvement of the best multi-layer classifier over the single-layer baseline, where layers are ranked by the top1-diff and a small number of top-ranked layers are combined. Multi-layer composition is most useful as a targeted robustness mechanism, improving harder languages …
Figure 12. Macro-average best F1 across languages under different toxic-feature selection sizes. Left: Qwen3-1.7B. Right: Qwen3-8B. For each toxic-feature selection size, each bar reports the macro-average of the best held-out F1 over layers, computed across 13 languages. The dashed blue line denotes the baseline setting with select data = 2000. For both models, using only 10% of the original toxic-feature data achi…
Figure 13. Overview of the SAE-feature-driven safety data synthesis pipeline. Top: Conventional safety SFT data, built from broad human-designed safety categories, can miss long-tail unsafe behaviors, so post-training improves refusal mainly for covered cases. Bottom: We pass safety SFT data through SAE to identify safety-relevant features that are missing, then use these features as synthesis targets. The resulting…
Figure 14. Coverage of target safety features under different data construction strategies. The curve shows how target-feature coverage grows as we increase the number of naturally sampled examples in each data category. Star markers indicate two matched-budget synthetic alternatives. Feature-driven synthesis nearly saturates the target feature set with a small budget, while natural sampling and random security-rela…
Figure 15. Overview of the Sparse Autoencoder-guided Supervised Finetuning (SASFT). SASFT operates in two steps: First, it identifies language-specific features in LLMs (left), then leverages these features as training signals to reduce code-switching behavior (right). Most existing works leverage SAEs for inference-time activation steering, which modifies the model’s intermediate representations without updating it…
Figure 16. Examples of unexpected code-switching to Chinese, Russian, and Korean.
Figure 17. Analysis of the language feature and its role in code-switching.
Figure 18. Overview of SAE-guided DAPO with rare-negative augmentation. The policy model generates …
Figure 19. Activation values of a repetition feature and a randomly selected feature over token positions in …
Figure 20. SAE feature steering controls repetition ratio across layers. Amplifying the repetition feature …
Figure 22. Repetition ratio during RL training for Qwen3-1.7B, Qwen3-8B, and Qwen3-30B-A3. Compared …
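Figure 15 is the one place where features become training signals rather than inference-time edits. A minimal sketch of what such an objective could look like, assuming cross-entropy plus a penalty on the unwanted language feature's activation; the loss form and weight `lam` are assumptions, not the authors' exact SASFT formulation.

```python
# Hedged sketch of an SAE-guided SFT objective in the spirit of SASFT (Fig. 15).
import torch
import torch.nn.functional as F

def sasft_loss(logits: torch.Tensor, labels: torch.Tensor,
               sae_feats: torch.Tensor, feature_id: int, lam: float = 0.1) -> torch.Tensor:
    # logits: (batch, seq, vocab); labels: (batch, seq); sae_feats: (batch, seq, d_sae)
    ce = F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100)
    # Penalize activation of the unwanted language feature on response tokens, so the
    # model learns weights that avoid it rather than being patched at inference time.
    penalty = sae_feats[..., feature_id].clamp(min=0).mean()
    return ce + lam * penalty
```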
Original abstract

Large language models have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque, limiting our ability to inspect, control, and systematically improve them. This opacity motivates a growing body of research in mechanistic interpretability, with sparse autoencoders (SAEs) emerging as one of the most promising tools for decomposing model activations into sparse, interpretable feature representations. We introduce Qwen-Scope, an open-source suite of SAEs built on the Qwen model family, comprising 14 groups of SAEs across 7 model variants from the Qwen3 and Qwen3.5 series, covering both dense and mixture-of-expert architectures. Built on top of these SAEs, we show that SAEs can go beyond post-hoc analysis to serve as practical interfaces for model development along four directions: (i) inference-time steering, where SAE feature directions control language, concepts, and preferences without modifying model weights; (ii) evaluation analysis, where activated SAE features provide a representation-level proxy for benchmark redundancy and capability coverage; (iii) data-centric workflows, where SAE features support multilingual toxicity classification and safety-oriented data synthesis; and (iv) post-training optimization, where SAE-derived signals are incorporated into supervised fine-tuning and reinforcement learning objectives to mitigate undesirable behaviors such as code-switching and repetition. Together, these results demonstrate that SAEs can serve not only as post-hoc analysis tools, but also as reusable representation-level interfaces for diagnosing, controlling, evaluating, and improving large language models. By open-sourcing Qwen-Scope, we aim to support mechanistic research and accelerate practical workflows that connect model internals to downstream behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Qwen-Scope, an open-source suite of 14 groups of sparse autoencoders (SAEs) trained on 7 variants from the Qwen3 and Qwen3.5 model families (dense and MoE). It claims that these SAEs can function as reusable representation-level interfaces for four practical model development tasks: (i) inference-time steering of language, concepts, and preferences (§4.1), (ii) representation-level analysis of benchmark redundancy and capability coverage (§4.2), (iii) multilingual toxicity classification and safety-oriented data synthesis (§4.3), and (iv) incorporation of SAE-derived signals into SFT and RL objectives to reduce code-switching and repetition (§4.4). The central thesis is that SAEs extend beyond post-hoc interpretability to support diagnosing, controlling, evaluating, and improving LLMs.

Significance. If the empirical demonstrations hold under systematic validation, the work would be significant for mechanistic interpretability and LLM engineering. It provides the first large-scale open SAE release for the Qwen family and illustrates concrete workflows that connect sparse features directly to downstream behaviors such as steering and post-training. The open-sourcing of the SAE suite itself is a clear strength that lowers barriers for reproducible research on representation-level interfaces.

major comments (3)
  1. [§4.1] Steering: The demonstrations of feature-based control do not report quantitative metrics for cross-prompt stability of activated SAE features or side-effect rates (e.g., capability degradation on held-out tasks or perplexity increase) relative to non-SAE baselines (a toy side-effect check is sketched after this list). Without these, it remains unclear whether the steering effects are robust enough to support the claim of reusable interfaces.
  2. [§4.4] Post-training: The incorporation of SAE-derived signals into SFT and RL objectives lacks ablation studies that isolate the contribution of the SAE features from generic regularization or data-filtering effects. This is load-bearing for the claim that SAE signals specifically mitigate undesirable behaviors such as code-switching.
  3. [§4.2–4.3] Evaluation and data workflows: The evaluation and data-workflow sections provide no systematic sensitivity analysis on feature selection (e.g., top-k vs. thresholded activation) or cross-model stability of the same SAE features, which is required to substantiate that the SAEs function as reliable, reusable interfaces rather than case-by-case tools.
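A toy version of the side-effect check requested in major comment 1, with `nll` as a stand-in for a forward pass returning mean per-token negative log-likelihood; the protocol is an assumption, not the referee's or the authors' specification.

```python
# Perplexity shift on a held-out corpus with steering on vs. off.
import math

def perplexity_shift(texts, nll, steer) -> float:
    # Per-text averaging for brevity; token-weighted averaging is more standard.
    base = math.exp(sum(nll(t) for t in texts) / len(texts))
    steered = math.exp(sum(nll(t, intervention=steer) for t in texts) / len(texts))
    return steered - base  # large positive shifts indicate capability side-effects
```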
minor comments (2)
  1. [Abstract / §3] The abstract states that Qwen-Scope comprises 14 groups of SAEs but does not specify the exact layer indices, dictionary sizes, or training hyperparameters; these details should be added to the main text or a dedicated appendix for reproducibility.
  2. [Throughout] Figure captions and section headings occasionally use inconsistent terminology (e.g., “SAE feature directions” vs. “activated features”); standardizing this would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback, which helps strengthen the empirical grounding of Qwen-Scope as reusable representation-level interfaces. We address each major comment below and will incorporate the suggested analyses into the revised manuscript.

Point-by-point responses
  1. Referee: [§4.1] Steering: The demonstrations of feature-based control do not report quantitative metrics for cross-prompt stability of activated SAE features or side-effect rates (e.g., capability degradation on held-out tasks or perplexity increase) relative to non-SAE baselines. Without these, it remains unclear whether the steering effects are robust enough to support the claim of reusable interfaces.

    Authors: We agree that quantitative metrics for robustness are necessary to support the reusable-interface claim. In the revised manuscript we will add cross-prompt stability measurements (feature activation consistency across prompt paraphrases and domains) and side-effect evaluations, including perplexity shifts on held-out corpora and performance changes on standard benchmarks (e.g., subsets of MMLU and GSM8K) relative to non-SAE baselines such as direct activation steering and prompt-based interventions (a toy stability measurement is sketched after these responses). revision: yes

  2. Referee: [§4.4] Post-training: The incorporation of SAE-derived signals into SFT and RL objectives lacks ablation studies that isolate the contribution of the SAE features from generic regularization or data-filtering effects. This is load-bearing for the claim that SAE signals specifically mitigate undesirable behaviors such as code-switching.

    Authors: We concur that isolating the SAE-specific contribution is essential. The revised version will include ablation experiments that compare the full SAE-signal objective against (i) generic regularization baselines (KL penalty, entropy regularization) and (ii) data-filtering approaches without SAE features. These ablations will quantify the incremental benefit of the SAE-derived signals on code-switching and repetition metrics. revision: yes

  3. Referee: [§4.2–4.3] Evaluation and data workflows: The evaluation and data-workflow sections provide no systematic sensitivity analysis on feature selection (e.g., top-k vs. thresholded activation) or cross-model stability of the same SAE features, which is required to substantiate that the SAEs function as reliable, reusable interfaces rather than case-by-case tools.

    Authors: We will add systematic sensitivity analyses in the revised sections 4.2 and 4.3, comparing top-k, thresholded, and hybrid feature-selection methods on benchmark redundancy and toxicity-classification tasks. For cross-model stability we will report quantitative overlap and transfer metrics across the Qwen3/Qwen3.5 variants; because SAEs are trained per model, full feature identity transfer is limited, but we will document the degree of stability that exists within the model family to support the reusability argument. revision: yes
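A toy version of the stability measurement promised in response 1, assuming stability is scored as how often the steering feature ranks among the top-k activated features across paraphrases; the authors' actual metric may differ.

```python
# Cross-prompt stability score for a steering feature (assumed metric).
import torch

def stability(feats_per_paraphrase, feature_id: int, top_k: int = 50) -> float:
    hits = 0
    for feats in feats_per_paraphrase:  # each: (seq_len, d_sae) SAE activations for one paraphrase
        top = torch.topk(feats.max(dim=0).values, top_k).indices
        hits += int(feature_id in top.tolist())
    return hits / len(feats_per_paraphrase)

# Example with random stand-in activations for five paraphrases of one prompt:
score = stability([torch.rand(20, 1024) for _ in range(5)], feature_id=15)
```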

Circularity Check

0 steps flagged

No circularity: empirical applications of standard SAEs with no self-referential derivations

full rationale

The paper introduces an open-source collection of SAEs trained on Qwen models and demonstrates four empirical use cases (steering, evaluation, data workflows, post-training) via experiments. No equations, derivations, or first-principles results are presented that reduce to fitted parameters or inputs defined by the same data. The central claim that SAEs can function as reusable interfaces rests on new empirical demonstrations rather than any self-definition, fitted-input prediction, or load-bearing self-citation chain. Standard SAE training procedures are invoked without redefinition or circular justification, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is primarily an engineering release and empirical demonstration rather than a theoretical derivation; it inherits standard assumptions from the SAE literature without introducing new free parameters, axioms, or invented entities in the abstract.

axioms (1)
  • Domain assumption: Sparse autoencoders trained on model activations yield interpretable and controllable features.
    Invoked implicitly when claiming that feature directions can be used for steering and optimization; this is a standard but unproven premise in mechanistic interpretability.

pith-pipeline@v0.9.0 · 5661 in / 1359 out tokens · 97759 ms · 2026-05-13T05:31:33.528845+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 13 internal anchors
