Connor Kissane, Robert Krzyzanowski, Arthur Conmy, and Neel Nanda

URL https://arxiv · 2024 · DOI 10.18653/v1/2024.blackboxnlp-1.19

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 · dataset 1 · method 1

citation-polarity summary

background 1 · use dataset 1 · use method 1

representative citing papers

The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws

cs.LG · 2026-05-11 · unverdicted · novelty 8.0

Manifold curvature and intrinsic dimension predict layerwise SAE width exponents and asymptotic floors across Gemma models, with cross-model transfer of the geometric regression, establishing a transferable geometric law instead of a universal scaling law.

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

LOCA identifies an average of six minimal interpretable changes in intermediate representations that causally induce refusal on otherwise successful jailbreaks for Gemma and Llama models.

The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE features capture some grounding dimensions.

Minimizing Collateral Damage in Activation Steering

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.

BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking

cs.LG · 2026-01-31 · unverdicted · novelty 6.0

Blocking a fixed set of latent features during fine-tuning reduces emergent misalignment by up to 95% across six domains with no loss in target task performance.

ConceptTracer: Interactive Analysis of Concept Saliency and Selectivity in Neural Representations

cs.LG · 2026-04-08 · unverdicted · novelty 5.0

ConceptTracer supplies an interactive interface and saliency/selectivity metrics to locate concept-responsive neurons in neural representations, shown on TabPFN.

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 4.0

Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
