pith. machine review for the scientific record.

arxiv: 2605.11887 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:31 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG
keywords: sparse autoencoders · Qwen models · model steering · mechanistic interpretability · LLM development · feature activation · post-training optimization · safety data synthesis

The pith

Sparse autoencoders on Qwen models function as reusable interfaces for steering outputs, analyzing benchmarks, synthesizing data, and guiding fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Qwen-Scope, an open collection of sparse autoencoders trained across multiple Qwen3 and Qwen3.5 models. It shows these autoencoders can move beyond explaining model internals after the fact to serve as active tools in four workflows: directing generation by activating chosen feature directions, using feature patterns to diagnose what benchmarks actually measure, supporting data filtering and creation for safety, and supplying training signals that reduce repetition or code-switching. A reader would care if this holds because it converts representation-level decompositions into direct levers for model behavior without retraining from scratch. The core demonstration is that one set of learned features can support inference control, evaluation proxies, data work, and optimization objectives at once.
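To make the first of those workflows concrete, here is a minimal sketch of inference-time feature steering via a forward hook. The stand-in layer, decoder matrix, feature index, and scale are illustrative assumptions, not Qwen-Scope's actual interface.

```python
# Minimal sketch of SAE feature steering via a forward hook (illustrative; not Qwen-Scope's API).
import torch
import torch.nn as nn

d_model, d_sae = 64, 512  # toy sizes; real dictionaries are far larger

# Stand-in for a pretrained SAE decoder: one direction in residual-stream space per feature.
W_dec = torch.randn(d_sae, d_model)
W_dec = W_dec / W_dec.norm(dim=-1, keepdim=True)  # unit-norm rows, a common convention

feature_id = 15   # placeholder; in practice chosen from the trained SAE (Fig. 3 names feature 6159)
alpha = 8.0       # steering strength; a negative value suppresses the feature instead

def steering_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output.
    return output + alpha * W_dec[feature_id]

layer = nn.Linear(d_model, d_model)  # stand-in for the hooked transformer sublayer
handle = layer.register_forward_hook(steering_hook)
steered = layer(torch.randn(2, 10, d_model))  # (batch, seq, d_model)
handle.remove()
```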

Core claim

SAEs serve not only as post-hoc analysis tools but also as reusable representation-level interfaces for diagnosing, controlling, evaluating, and improving large language models, demonstrated by inference-time steering via feature directions, evaluation analysis through activated features, data-centric tasks including toxicity classification and safety synthesis, and incorporation of SAE signals into supervised fine-tuning and reinforcement learning to address code-switching and repetition.

What carries the argument

The Qwen-Scope collection: 14 groups of sparse autoencoders across seven Qwen3 and Qwen3.5 variants, which decompose activations into sparse features usable for development operations.
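For orientation, a toy version of that decomposition is sketched below, assuming a TopK-style encoder (one common SAE variant; the suite's actual architectures and hyperparameters are not specified in this summary).

```python
# Toy TopK sparse autoencoder: keep only the k largest feature pre-activations per input.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)  # activations -> feature pre-activations
        self.dec = nn.Linear(d_sae, d_model)  # sparse features -> reconstruction
        self.k = k

    def forward(self, h: torch.Tensor):
        pre = self.enc(h)
        top = torch.topk(pre, self.k, dim=-1)  # indices and values of the k largest features
        f = torch.zeros_like(pre).scatter_(-1, top.indices, torch.relu(top.values))
        return self.dec(f), f                  # reconstruction, sparse code

sae = TopKSAE(d_model=64, d_sae=1024, k=32)
recon, feats = sae(torch.randn(4, 64))  # feats has at most 32 nonzeros per row
```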

If this is right

  • Feature directions enable direct control over language, concepts, and preferences during inference without any weight updates.
  • Activated features supply a representation-level proxy for measuring benchmark redundancy and capability coverage (a toy version of this overlap computation is sketched after this list).
  • The same features support practical data tasks such as multilingual toxicity detection and safety-oriented data synthesis.
  • SAE-derived signals can be added to supervised fine-tuning and reinforcement learning objectives to reduce repetition and code-switching.
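A toy version of the overlap computation referenced in the second bullet, assuming activated-feature sets are compared by Jaccard overlap; the paper's actual aggregation in §4.2 may differ.

```python
# Toy redundancy proxy between two benchmarks via overlap of activated feature sets.
import torch

def active_feature_set(feats: torch.Tensor, threshold: float = 0.0) -> set:
    # feats: (num_examples, d_sae) SAE activations collected while the model answers a benchmark
    return set(torch.nonzero(feats.max(dim=0).values > threshold).flatten().tolist())

def jaccard(a: set, b: set) -> float:
    return len(a & b) / max(len(a | b), 1)

# Synthetic stand-ins for two benchmarks' activation matrices (sparse by construction).
bench_a = active_feature_set(torch.rand(100, 1024) * (torch.rand(100, 1024) > 0.95))
bench_b = active_feature_set(torch.rand(100, 1024) * (torch.rand(100, 1024) > 0.95))
print(f"feature overlap: {jaccard(bench_a, bench_b):.2f}")  # high overlap suggests redundancy
```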

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If feature stability holds across releases, developers could maintain persistent control interfaces rather than re-deriving them for each model update.
  • The approach opens the possibility of automated auditing pipelines that flag capability gaps directly from feature activation patterns.
  • Wider adoption might shift emphasis from full retraining toward targeted edits at the representation level when undesirable behaviors appear.

Load-bearing premise

The learned SAE features remain stable, interpretable, and causally effective enough to support reliable steering and training signals across uses without introducing new unintended behaviors.

What would settle it

A controlled test in which activating an SAE feature direction intended to reduce repetition instead increases repetition rate or produces new inconsistencies on held-out prompts.
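A hedged sketch of that test, with `generate` and `steer` as stand-ins for the model's generation function and the steering intervention; neither is the paper's API.

```python
# Does the repetition-reducing intervention ever backfire on held-out prompts?
from collections import Counter

def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of n-grams that are repeats of an earlier n-gram."""
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return sum(c - 1 for c in Counter(grams).values()) / len(grams)

def intervention_helps(prompts, generate, steer) -> bool:
    base = sum(repetition_ratio(generate(p)) for p in prompts) / len(prompts)
    steered = sum(repetition_ratio(generate(p, intervention=steer)) for p in prompts) / len(prompts)
    return steered < base  # the reusable-interface claim is falsified if steering makes it worse
```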

Figures

Figures reproduced from arXiv: 2605.11887 by An Yang, Baosong Yang, Boyi Deng, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jialong Tang, Jingren Zhou, Qian Cao, Ruize Gao, Tianhao Li, Xiaodong Deng, Xuancheng Ren, Xu Wang, Yaoning Wang, Yubo Ma, Yu Wan.

Figure 1. Overview of Qwen-Scope. SAEs trained on Qwen3/3.5 serve as a common interface for four practical directions: interpretable steering at inference time, capability-aware benchmark analysis, feature-guided data workflows, and targeted post-training for model improvement.
Figure 2. Illustration of the two-step SAE-based steering pipeline: (1) contrastive feature identification, …
Figure 3. SAE features provide interpretable handles for model analysis and control. Left: SAE activations can be used to diagnose undesirable generation behavior. When the model is prompted in English, the response unexpectedly mixes in Chinese text. Ranking SAE features by activation strength reveals a highly activated Chinese-language feature (id: 6159). Suppressing this feature during generation removes the unexpected la…
Figure 4. Illustration of the proposed SAE-based benchmark analysis framework, covering feature …
Figure 5. Spearman rank correlation between performance-based redundancy …
Figure 6. Feature overlap matrix for eight benchmarks. Entry …
Figure 7. Overview of the SAE-based toxicity classification pipeline. Feature discovery is performed on a …
Figure 8. Layer-wise F1 of the SAE-based toxicity classifier on English. Left: Qwen3-1.7B. Right: Qwen3-8B. The curves report held-out F1 across layers using top-K toxic-biased SAE features discovered in English, with K ∈ {1, 2, 5, 10}. Star markers indicate the best F1 layer for each model. Without training any additional classifier, the existing SAE can be used directly for classification, achieving an F1 score ab…
Figure 9. Cross-lingual structure and transfer of toxic SAE features. Panel (a) shows the overlap of top-10 toxic SAE features in Qwen3-1.7B, and panel (b) shows the same for Qwen3-8B. Panel (c) shows the layer-wise mean overlap, with shaded bands indicating the interquartile range (IQR) across language pairs. Panel (d) shows the best held-out F1 for each test language using features discovered in English. Panel (e)…
Figure 10. A proxy for selecting high-performing layers before evaluation. Row 1 shows Qwen3-1.7B, and Row 2 shows Qwen3-8B; columns correspond to English, Russian, French, and Chinese. In each subplot, the curve shows held-out F1 over layers, the yellow star marks the best evaluation layer, and the yellow cross marks the layer whose strongest discovered feature most clearly separates toxic from clean examples. top1…
Figure 11. Relative improvement from multi-layer composition. Left: Qwen3-1.7B. Right: Qwen3-8B. For each language, the bar shows the relative improvement of the best multi-layer classifier over the single-layer baseline, where layers are ranked by the top1-diff and a small number of top-ranked layers are combined. Multi-layer composition is most useful as a targeted robustness mechanism, improving harder languages …
Figure 12. Macro-average best F1 across languages under different toxic-feature selection sizes. Left: Qwen3-1.7B. Right: Qwen3-8B. For each toxic-feature selection size, each bar reports the macro-average of the best held-out F1 over layers, computed across 13 languages. The dashed blue line denotes the baseline setting with select data = 2000. For both models, using only 10% of the original toxic-feature data achi…
Figure 13. Overview of the SAE-feature-driven safety data synthesis pipeline. Top: Conventional safety SFT data, built from broad human-designed safety categories, can miss long-tail unsafe behaviors, so post-training improves refusal mainly for covered cases. Bottom: We pass safety SFT data through SAE to identify safety-relevant features that are missing, then use these features as synthesis targets. The resulting…
Figure 14. Coverage of target safety features under different data construction strategies. The curve shows how target-feature coverage grows as we increase the number of naturally sampled examples in each data category. Star markers indicate two matched-budget synthetic alternatives. Feature-driven synthesis nearly saturates the target feature set with a small budget, while natural sampling and random security-rela…
Figure 15. Overview of the Sparse Autoencoder-guided Supervised Finetuning (SASFT). SASFT operates in two steps: First, it identifies language-specific features in LLMs (left), then leverages these features as training signals to reduce code-switching behavior (right). Most existing works leverage SAEs for inference-time activation steering, which modifies the model’s intermediate representations without updating it…
Figure 16. Examples of unexpected code-switching to Chinese, Russian, and Korean.
Figure 17. Analysis of the language feature and its role in code-switching.
Figure 18. Overview of SAE-guided DAPO with rare-negative augmentation. The policy model generates …
Figure 19. Activation values of a repetition feature and a randomly selected feature over token positions in …
Figure 20. SAE feature steering controls repetition ratio across layers. Amplifying the repetition feature …
Figure 22. Repetition ratio during RL training for Qwen3-1.7B, Qwen3-8B, and Qwen3-30B-A3. Compared …
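Figure 15 is the one place where features become training signals rather than inference-time edits. A minimal sketch of what such an objective could look like, assuming cross-entropy plus a penalty on the unwanted language feature's activation; the loss form and weight `lam` are assumptions, not the authors' exact SASFT formulation.

```python
# Hedged sketch of an SAE-guided SFT objective in the spirit of SASFT (Fig. 15).
import torch
import torch.nn.functional as F

def sasft_loss(logits: torch.Tensor, labels: torch.Tensor,
               sae_feats: torch.Tensor, feature_id: int, lam: float = 0.1) -> torch.Tensor:
    # logits: (batch, seq, vocab); labels: (batch, seq); sae_feats: (batch, seq, d_sae)
    ce = F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100)
    # Penalize activation of the unwanted language feature on response tokens, so the
    # model learns weights that avoid it rather than being patched at inference time.
    penalty = sae_feats[..., feature_id].clamp(min=0).mean()
    return ce + lam * penalty
```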
Original abstract

Large language models have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque, limiting our ability to inspect, control, and systematically improve them. This opacity motivates a growing body of research in mechanistic interpretability, with sparse autoencoders (SAEs) emerging as one of the most promising tools for decomposing model activations into sparse, interpretable feature representations. We introduce Qwen-Scope, an open-source suite of SAEs built on the Qwen model family, comprising 14 groups of SAEs across 7 model variants from the Qwen3 and Qwen3.5 series, covering both dense and mixture-of-expert architectures. Built on top of these SAEs, we show that SAEs can go beyond post-hoc analysis to serve as practical interfaces for model development along four directions: (i) inference-time steering, where SAE feature directions control language, concepts, and preferences without modifying model weights; (ii) evaluation analysis, where activated SAE features provide a representation-level proxy for benchmark redundancy and capability coverage; (iii) data-centric workflows, where SAE features support multilingual toxicity classification and safety-oriented data synthesis; and (iv) post-training optimization, where SAE-derived signals are incorporated into supervised fine-tuning and reinforcement learning objectives to mitigate undesirable behaviors such as code-switching and repetition. Together, these results demonstrate that SAEs can serve not only as post-hoc analysis tools, but also as reusable representation-level interfaces for diagnosing, controlling, evaluating, and improving large language models. By open-sourcing Qwen-Scope, we aim to support mechanistic research and accelerate practical workflows that connect model internals to downstream behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Qwen-Scope, an open-source suite of 14 groups of sparse autoencoders (SAEs) trained on 7 variants from the Qwen3 and Qwen3.5 model families (dense and MoE). It claims that these SAEs can function as reusable representation-level interfaces for four practical model development tasks: (i) inference-time steering of language, concepts, and preferences (§4.1), (ii) representation-level analysis of benchmark redundancy and capability coverage (§4.2), (iii) multilingual toxicity classification and safety-oriented data synthesis (§4.3), and (iv) incorporation of SAE-derived signals into SFT and RL objectives to reduce code-switching and repetition (§4.4). The central thesis is that SAEs extend beyond post-hoc interpretability to support diagnosing, controlling, evaluating, and improving LLMs.

Significance. If the empirical demonstrations hold under systematic validation, the work would be significant for mechanistic interpretability and LLM engineering. It provides the first large-scale open SAE release for the Qwen family and illustrates concrete workflows that connect sparse features directly to downstream behaviors such as steering and post-training. The open-sourcing of the SAE suite itself is a clear strength that lowers barriers for reproducible research on representation-level interfaces.

major comments (3)
  1. [§4.1] Steering: The demonstrations of feature-based control do not report quantitative metrics for cross-prompt stability of activated SAE features or side-effect rates (e.g., capability degradation on held-out tasks or perplexity increase) relative to non-SAE baselines (a toy side-effect check is sketched after this list). Without these, it remains unclear whether the steering effects are robust enough to support the claim of reusable interfaces.
  2. [§4.4] Post-training: The incorporation of SAE-derived signals into SFT and RL objectives lacks ablation studies that isolate the contribution of the SAE features from generic regularization or data-filtering effects. This is load-bearing for the claim that SAE signals specifically mitigate undesirable behaviors such as code-switching.
  3. [§4.2–4.3] Evaluation and data workflows: The evaluation and data-workflow sections provide no systematic sensitivity analysis on feature selection (e.g., top-k vs. thresholded activation) or cross-model stability of the same SAE features, which is required to substantiate that the SAEs function as reliable, reusable interfaces rather than case-by-case tools.
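A toy version of the side-effect check requested in major comment 1, with `nll` as a stand-in for a forward pass returning mean per-token negative log-likelihood; the protocol is an assumption, not the referee's or the authors' specification.

```python
# Perplexity shift on a held-out corpus with steering on vs. off.
import math

def perplexity_shift(texts, nll, steer) -> float:
    # Per-text averaging for brevity; token-weighted averaging is more standard.
    base = math.exp(sum(nll(t) for t in texts) / len(texts))
    steered = math.exp(sum(nll(t, intervention=steer) for t in texts) / len(texts))
    return steered - base  # large positive shifts indicate capability side-effects
```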
minor comments (2)
  1. [Abstract / §3] The abstract states that Qwen-Scope comprises 14 groups of SAEs but does not specify the exact layer indices, dictionary sizes, or training hyperparameters; these details should be added to the main text or a dedicated appendix for reproducibility.
  2. [Throughout] Figure captions and section headings occasionally use inconsistent terminology (e.g., “SAE feature directions” vs. “activated features”); standardizing this would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback, which helps strengthen the empirical grounding of Qwen-Scope as reusable representation-level interfaces. We address each major comment below and will incorporate the suggested analyses into the revised manuscript.

Point-by-point responses
  1. Referee: [§4.1] Steering: The demonstrations of feature-based control do not report quantitative metrics for cross-prompt stability of activated SAE features or side-effect rates (e.g., capability degradation on held-out tasks or perplexity increase) relative to non-SAE baselines. Without these, it remains unclear whether the steering effects are robust enough to support the claim of reusable interfaces.

    Authors: We agree that quantitative metrics for robustness are necessary to support the reusable-interface claim. In the revised manuscript we will add cross-prompt stability measurements (feature activation consistency across prompt paraphrases and domains) and side-effect evaluations, including perplexity shifts on held-out corpora and performance changes on standard benchmarks (e.g., subsets of MMLU and GSM8K) relative to non-SAE baselines such as direct activation steering and prompt-based interventions (a toy stability measurement is sketched after these responses). revision: yes

  2. Referee: [§4.4] Post-training: The incorporation of SAE-derived signals into SFT and RL objectives lacks ablation studies that isolate the contribution of the SAE features from generic regularization or data-filtering effects. This is load-bearing for the claim that SAE signals specifically mitigate undesirable behaviors such as code-switching.

    Authors: We concur that isolating the SAE-specific contribution is essential. The revised version will include ablation experiments that compare the full SAE-signal objective against (i) generic regularization baselines (KL penalty, entropy regularization) and (ii) data-filtering approaches without SAE features. These ablations will quantify the incremental benefit of the SAE-derived signals on code-switching and repetition metrics. revision: yes

  3. Referee: [§4.2–4.3] Evaluation and data workflows: The evaluation and data-workflow sections provide no systematic sensitivity analysis on feature selection (e.g., top-k vs. thresholded activation) or cross-model stability of the same SAE features, which is required to substantiate that the SAEs function as reliable, reusable interfaces rather than case-by-case tools.

    Authors: We will add systematic sensitivity analyses in the revised sections 4.2 and 4.3, comparing top-k, thresholded, and hybrid feature-selection methods on benchmark redundancy and toxicity-classification tasks. For cross-model stability we will report quantitative overlap and transfer metrics across the Qwen3/Qwen3.5 variants; because SAEs are trained per model, full feature identity transfer is limited, but we will document the degree of stability that exists within the model family to support the reusability argument. revision: yes
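A toy version of the stability measurement promised in response 1, assuming stability is scored as how often the steering feature ranks among the top-k activated features across paraphrases; the authors' actual metric may differ.

```python
# Cross-prompt stability score for a steering feature (assumed metric).
import torch

def stability(feats_per_paraphrase, feature_id: int, top_k: int = 50) -> float:
    hits = 0
    for feats in feats_per_paraphrase:  # each: (seq_len, d_sae) SAE activations for one paraphrase
        top = torch.topk(feats.max(dim=0).values, top_k).indices
        hits += int(feature_id in top.tolist())
    return hits / len(feats_per_paraphrase)

# Example with random stand-in activations for five paraphrases of one prompt:
score = stability([torch.rand(20, 1024) for _ in range(5)], feature_id=15)
```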

Circularity Check

0 steps flagged

No circularity: empirical applications of standard SAEs with no self-referential derivations

full rationale

The paper introduces an open-source collection of SAEs trained on Qwen models and demonstrates four empirical use cases (steering, evaluation, data workflows, post-training) via experiments. No equations, derivations, or first-principles results are presented that reduce to fitted parameters or inputs defined by the same data. The central claim that SAEs can function as reusable interfaces rests on new empirical demonstrations rather than any self-definition, fitted-input prediction, or load-bearing self-citation chain. Standard SAE training procedures are invoked without redefinition or circular justification, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is primarily an engineering release and empirical demonstration rather than a theoretical derivation; it inherits standard assumptions from the SAE literature without introducing new free parameters, axioms, or invented entities in the abstract.

axioms (1)
  • Domain assumption: Sparse autoencoders trained on model activations yield interpretable and controllable features.
    Invoked implicitly when claiming that feature directions can be used for steering and optimization; this is a standard but unproven premise in mechanistic interpretability.

pith-pipeline@v0.9.0 · 5661 in / 1359 out tokens · 97759 ms · 2026-05-13T05:31:33.528845+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 13 internal anchors
