Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

Hyunjin Cho; Jaehyung Kim; Youngji Roh

arxiv: 2606.08236 · v1 · pith:ZN4TQ2AEnew · submitted 2026-06-06 · 💻 cs.CL · cs.LG

Pith reviewed 2026-06-27 19:46 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords: unsupervised feature discovery, mechanistic interpretability, rate-distortion objective, semantic embeddings, attribution signatures, continuation distribution, language model auditing

The pith

Unsupervised clustering of LLM continuations by aligning semantic embeddings with mechanistic attribution signatures reveals distinct modes and actionable factors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces distribution-level unsupervised feature discovery for auditing language model continuation distributions. It represents each sampled continuation with a semantic embedding and a prefix-to-continuation attribution signature, then optimizes a rate-distortion objective to balance semantic coherence, mechanistic consistency, and cluster granularity. This joint clustering produces groups that single-view baselines miss and supplies interventional evidence linking cluster signatures to controllable internal factors. The method operates without manually chosen target outputs, complementing target-conditioned circuit analysis by handling heterogeneity across the full continuation distribution.

Core claim

Distribution-level unsupervised feature discovery clusters sampled continuations using semantic content and sequence-level mechanistic attributions by optimizing a rate-distortion objective on semantic embeddings and prefix-to-continuation attribution signatures, exposing continuation modes missed by baselines and providing evidence that cluster signatures are actionable mechanistic factors.

What carries the argument

Rate-distortion objective applied jointly to semantic embeddings and prefix-to-continuation attribution signatures to trade off coherence, consistency, and granularity.
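
To make the trade-off concrete, the following is a minimal sketch of a soft, deterministic-annealing-style clustering loop over two views. The abstract gives no equations, so every specific below is an assumption rather than the authors' implementation: the combined distortion `gamma * d_sem + (1 - gamma) * d_attr`, the use of squared Euclidean distance in both views, the role given to the inverse temperature `beta`, and the hypothetical function name `rd_cluster`.

```python
import math
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def weighted_mean(vectors, weights):
    """Weighted average of vectors; weights must have positive total mass."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(dim)]


def rd_cluster(sem, attr, probs, k, beta, gamma, iters=50, seed=0):
    """Soft-cluster continuations using two views (ASSUMED objective).

    sem, attr : per-continuation semantic / attribution vectors
    probs     : sampling weights of the continuations (should sum to 1)
    beta      : inverse temperature; larger values give harder assignments
    gamma     : weight on the semantic view, 1 - gamma on the attribution view
    """
    rng = random.Random(seed)
    n = len(sem)
    init = rng.sample(range(n), k)
    sem_c = [list(sem[i]) for i in init]
    attr_c = [list(attr[i]) for i in init]
    prior = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # Assignment step: p(cluster | item) proportional to prior * exp(-beta * D).
        resp = []
        for i in range(n):
            d = [gamma * sq_dist(sem[i], sem_c[j])
                 + (1.0 - gamma) * sq_dist(attr[i], attr_c[j])
                 for j in range(k)]
            d_min = min(d)  # subtract the minimum for numerical stability
            w = [prior[j] * math.exp(-beta * (d[j] - d_min)) for j in range(k)]
            z = sum(w)
            resp.append([x / z for x in w])
        # Update step: probability-weighted centroids in both views.
        for j in range(k):
            wj = [probs[i] * resp[i][j] for i in range(n)]
            mass = sum(wj)
            if mass > 1e-12:
                sem_c[j] = weighted_mean(sem, wj)
                attr_c[j] = weighted_mean(attr, wj)
            prior[j] = max(mass, 1e-12)  # floor keeps empty clusters from producing zeros
        total = sum(prior)
        prior = [p / total for p in prior]
    return resp, prior
```

In this sketch, raising `beta` sharpens assignments and tends to support more distinct clusters, while `gamma` shifts weight between the semantic and attribution views. Whether the paper's β and γ play exactly these roles cannot be confirmed from the text available here.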

If this is right

Discovered clusters expose continuation modes that single-view baselines miss.
Cluster signatures correspond to actionable mechanistic factors under intervention.
The approach provides a scalable audit of mechanisms underlying a model's continuation distribution.
It complements target-conditioned circuit analysis by addressing heterogeneity across the continuation distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could extend mechanistic interpretability from single prompts to entire output distributions at scale.
It may support identification of latent behavioral patterns without requiring predefined targets or human-specified behaviors.
Applications could include automated detection of failure modes or biases across a model's response space.
Similar joint objectives might be tested on other attribution or embedding types to refine mechanistic feature discovery.

Load-bearing premise

The rate-distortion objective applied to semantic embeddings and attribution signatures will produce clusters whose mechanistic signatures are both consistent and causally actionable under intervention.

What would settle it

The core claim would be undermined if steering interventions on the discovered cluster signatures produced no measurable shift in continuation probabilities relative to random or baseline clusters, or if the joint clusters showed no improvement over semantic-only or attribution-only baselines in exposing distinct modes.
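
One hedged way to operationalize the first half of this test is a permutation test on per-prompt steering effects. The sketch below assumes each effect is a scalar (for example, a change in target log-probability) measured once for signature-selected features and once for a matched random baseline. The function name `permutation_pvalue` is hypothetical, and the choice of test is ours, not the paper's.

```python
import random


def permutation_pvalue(treated, baseline, n_perm=10000, seed=0):
    """One-sided permutation test for mean(treated) > mean(baseline).

    Returns the observed mean difference and a p-value computed with the
    usual +1 correction so that it is never exactly zero.
    """
    rng = random.Random(seed)
    n_t = len(treated)
    n_b = len(baseline)
    observed = sum(treated) / n_t - sum(baseline) / n_b
    pooled = list(treated) + list(baseline)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of treated vs. baseline
        diff = sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / n_b
        if diff >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)
```

A large p-value for signature-selected versus random features would be the "no measurable shift" outcome described above. The same function can be applied to joint-cluster versus single-view effects.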

Figures

Figures reproduced from arXiv: 2606.08236 by Hyunjin Cho, Jaehyung Kim, Youngji Roh.

**Figure 1.** Misalignment between semantic and mechanistic spaces. Top: An example triplet where (A1, A2) share semantic meaning (cosine similarity) but use different internal mechanisms (attributions, L1 distance), while (A2, A3) differ semantically but use similar mechanisms. Bottom: Silhouette scores (Rousseeuw, 1987) for K-means partitions learned in one view and evaluated in both views; a partition that is clean i…

**Figure 2.** Overview of distribution-level unsupervised feature discovery. Continuation Sampling: For a prefix $x = x_{1:M}$, we sample continuations $y^{(n)} = y^{(n)}_{1:T_n}$ with probability weights $P_n$. Dual Representation: Each continuation is represented by a semantic embedding $e_n$ and a mechanistic embedding $a_n$, where $a_n$ aggregates prefix-feature effects over the continuation span. Joint Clustering via Rate-Distortion: Rate–d…

**Figure 3.** Cluster mechanistic similarity across β values. Jaccard similarity of top-100 mechanistic features between clusters at β ∈ {0.5, 0.75, 1.0} for the prompt “Who sang the song ‘If god was one of us?’”. The notation $C^K_i$ denotes cluster index i among the K clusters produced at a given β. Higher cross-β similarity reflects parent–child structure across resolutions; a score-labeled version is provided in [PIT…

**Figure 4.** Cluster structure dynamics across γ values. Each plot is drawn by sweeping γ ∈ [0.1, 0.9], keeping a consistent rate by varying β ≈ 0.9. Dashed ellipses indicate K-means boundaries in each space with the same number of clusters k as RD-clustering. Panel titles: Attribution Space: Before and After Split; Semantic Space: Before and After Split; C4: Knowledge Cutoff.

**Figure 7.** Mean ℓ1 norm of the attribution vector at each generated-token position. (a) ℓ1 distance. (b) Top-100 Jaccard overlap.

**Figure 8.** Token-wise attribution similarity as a function of positional gap. Real pairs are compared against random continuations from the same prefix; lower ℓ1 and higher Jaccard indicate greater similarity.

**Figure 9.** Lower-triangular heatmap of top-100 Jaccard overlap between token-level attribution vectors, averaged across the Qwen3-4B consistency subset.

**Figure 10.** Dynamic source-mass decomposition by target position. The cached prefix-only curve uses the saved full-continuation attribution, while the dynamic prefix and history curves recompute attribution for each target token using the original prefix plus preceding generated tokens as source context.

**Figure 11.** Span-mode steering comparison on the Qwen3-4B scaled subset. The full continuation span gives the strongest average signed response, while single-token spans are noisier but remain directionally aligned. (a) Attribution-selected features. (b) Random baseline features.

**Figure 12.** Average token-level steering effect, measured as demeaned target-logit change across continuation positions. Attribution-selected features produce a more coherent signed response than matched random feature sets.

**Figure 14.** Reasoning step-pair validation heatmap for Qwen3-0.6B. Rows are target steps j, columns are previous source steps i < j, and cells show the mean absolute Spearman correlation $|\rho_s|$ across GSM8K and MATH500. Question. What is the length, in units, of the radius of a sphere whose volume and surface area, in cubic units and square units, respectively, are numerically equal? Committed previous reasoning steps.…

**Figure 15.** Fixed target-cluster effects. Rows are prior source steps i < j, columns are fixed clusters of target-step continuations, column widths are target-cluster probability mass, and color is the projected Spearman correlation.

**Figure 16.** Local cluster dynamics across previous reasoning steps. Columns are source steps i < j, node heights are local cluster probability mass, ribbons track adjacent continuation-overlap mass, and node colors show local $\rho_s$.
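
Several captions above rely on two similarity measures between attribution vectors: top-100 Jaccard overlap and ℓ1 distance. The sketch below shows one plausible reading of both. Ranking features by absolute magnitude is an assumption (the captions do not say whether the ranking is signed), and the helper names are hypothetical.

```python
def top_k_features(vec, k=100):
    """Indices of the k entries with the largest absolute value (ASSUMED ranking)."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    return set(order[:k])


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def l1_distance(u, v):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))


# Toy attribution vectors (illustrative values only, not data from the paper).
u = [3.0, -2.0, 0.1, 0.0]
v = [2.5, 0.0, -1.5, 0.2]
overlap = jaccard(top_k_features(u, k=2), top_k_features(v, k=2))
gap = l1_distance(u, v)
```

Higher `overlap` and lower `gap` both indicate more similar mechanisms, which matches the reading given in the Figure 8 caption.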

Original abstract

As large language models are increasingly deployed in high-stakes settings, there is a growing need for tools that audit not only model outputs but also the internal computations that produce them. Circuit analysis is a central approach in mechanistic interpretability, but it is typically target-conditioned, explaining a single prompt paired with a chosen completion. This target-conditioned setup can obscure heterogeneity across a model's continuation distribution. We introduce distribution-level unsupervised feature discovery, which clusters sampled continuations using both semantic content and sequence-level mechanistic attributions, without manually specifying target outputs. Our method represents each continuation with a semantic embedding and a prefix-to-continuation attribution signature, then optimizes a rate-distortion objective that trades off semantic coherence, mechanistic consistency, and cluster granularity. Across clustering and steering analyses, the discovered clusters expose continuation modes that single-view baselines miss and provide interventional evidence that cluster signatures correspond to actionable mechanistic factors. Overall, our approach complements circuit analysis and behavioral evaluation by providing a scalable audit of the mechanisms underlying a model's continuation distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper clusters LLM continuations by jointly optimizing semantic embeddings and attribution signatures in a rate-distortion objective, which is a distinct distribution-level move, but the interventional claims rest on evidence that the objective itself does not guarantee.

The paper's main contribution is an unsupervised clustering method that represents each continuation with a semantic embedding and a prefix-to-continuation attribution signature, then optimizes a rate-distortion objective balancing coherence in both views against cluster granularity. This produces groups that single-view baselines miss and supplies some steering results suggesting the signatures can be acted on.

What works is the framing. Standard circuit analysis is target-conditioned and can miss heterogeneity across the continuation distribution. Aligning the two views without manual targets is a reasonable way to surface modes at scale, and the rate-distortion trade-off is a standard tool that fits the unsupervised setting.

The soft spot is the jump from cluster consistency to actionable mechanisms. The objective only enforces agreement within clusters on the two signals; it contains no term that enforces causal effect size, robustness, or separation from confounders. The stress-test note is correct on this point. Any claim that the signatures are manipulable therefore depends entirely on the steering analyses, and those analyses could reflect post-hoc selection rather than a direct consequence of the clustering. The abstract gives no equations, no details on how attributions are computed, no baseline comparisons, and no controls, so it is impossible to tell how much the reported results depend on those choices.

This work is aimed at researchers in mechanistic interpretability and safety evaluation who want tools that operate over many outputs rather than single prompts. A reader already working on attribution methods or distribution-level auditing would get the most from it.

I would send the paper to peer review. The problem it addresses is real and the joint-view idea is distinct enough to merit referee time, though the authors will need to show that the interventional results follow from the clustering rather than from later selection.

Referee Report

2 major / 2 minor

Summary. The paper introduces distribution-level unsupervised feature discovery for LLMs. It represents sampled continuations via semantic embeddings and prefix-to-continuation attribution signatures, then optimizes a rate-distortion objective trading off semantic coherence, mechanistic consistency, and granularity. The central claim is that the resulting clusters reveal continuation modes missed by single-view baselines and supply interventional evidence that the cluster signatures identify actionable mechanistic factors, thereby complementing target-conditioned circuit analysis.

Significance. If the interventional results hold and are shown to follow from the joint objective, the work supplies a scalable audit of mechanisms across a model's continuation distribution rather than isolated prompts. The joint semantic-mechanistic clustering is a clear methodological contribution; explicit credit is due for attempting to move beyond coherence metrics to steering-based validation.

major comments (2)

§3 (rate-distortion objective): the objective enforces intra-cluster coherence across the two views but contains no explicit term or post-optimization constraint for causal effect size or robustness to distribution shift. The manuscript must demonstrate that the reported steering effects are larger (or more consistent) for joint clusters than for post-hoc selection of high-coherence clusters from either view alone; otherwise the actionability claim does not follow from the optimization.
§5.3 (steering analyses): the interventional evidence is load-bearing, yet the text supplies no quantitative comparison of effect sizes or success rates between joint clusters and the single-view baselines that the abstract claims are inferior. Without these controls it is impossible to rule out that the observed steering arises from the attribution signatures themselves rather than from the joint rate-distortion clustering.

minor comments (2)

Abstract: the phrase 'provide interventional evidence' should be qualified by the specific steering metric and baseline comparison used in the experiments.
§3.1 (notation): the precise definition of the attribution signature (e.g., which layers or heads are aggregated) should be stated once in a single equation or table for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments correctly identify that stronger quantitative controls are needed to link the joint optimization to the interventional results. We respond to each major comment below and will revise the manuscript accordingly.

Referee: §3 (rate-distortion objective): the objective enforces intra-cluster coherence across the two views but contains no explicit term or post-optimization constraint for causal effect size or robustness to distribution shift. The manuscript must demonstrate that the reported steering effects are larger (or more consistent) for joint clusters than for post-hoc selection of high-coherence clusters from either view alone; otherwise the actionability claim does not follow from the optimization.

Authors: We agree that the rate-distortion objective optimizes for coherence and consistency but does not contain an explicit term for causal effect size. The manuscript currently reports steering results only for the joint clusters and does not include the requested post-hoc comparison against high-coherence selections from single views. In the revision we will add this comparison, reporting effect sizes and consistency metrics for joint clusters versus post-hoc selections from each view alone. This will be placed in an expanded §3 and cross-referenced in the steering section.
Revision: yes

Referee: §5.3 (steering analyses): the interventional evidence is load-bearing, yet the text supplies no quantitative comparison of effect sizes or success rates between joint clusters and the single-view baselines that the abstract claims are inferior. Without these controls it is impossible to rule out that the observed steering arises from the attribution signatures themselves rather than from the joint rate-distortion clustering.

Authors: We concur that the absence of direct quantitative comparisons between joint clusters and single-view baselines leaves the superiority claim under-supported. The current §5.3 presents steering results for the discovered clusters but does not tabulate success rates or effect-size differences against the single-view baselines. We will revise §5.3 to include these side-by-side metrics (effect sizes, success rates, and statistical tests) for joint versus single-view clusters, thereby isolating the contribution of the joint rate-distortion procedure.
Revision: yes

Circularity Check

0 steps flagged

No circularity: clustering objective defines coherence; interventional claims rest on separate experiments

The paper defines its method as representing continuations via semantic embeddings and prefix-to-continuation attribution signatures, then optimizing a rate-distortion objective trading off semantic coherence, mechanistic consistency, and granularity. This objective enforces intra-cluster agreement in the two views by construction, but the central claims (exposing modes missed by baselines; cluster signatures being actionable mechanistic factors) are supported by comparative clustering results and steering/intervention analyses, which are not reduced to the objective itself. No self-citations, uniqueness theorems, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the provided text. The derivation is self-contained as an unsupervised procedure whose empirical claims are externally validated rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations or implementation details, so no free parameters, axioms, or invented entities can be identified.

Reference graph

Works this paper leans on

26 extracted references · 6 canonical work pages

[1]

Bogdan, P

URL https://openreview.net/forum?id=8RCmNLeeXx. Bogdan, P. C., Macar, U., Nanda, N., and Conmy, A. Thought anchors: Which llm reasoning steps matter?, 2025. URL https://arxiv.org/abs/2506.19143. Boley, D. Principal direction divisive partitioning. Data Min. Knowl. Discov., 2(4):325–344, December 1998. ISSN 1384-5810. doi: 10.1023/A:1009740529316. U...

Pith/arXiv arXiv 2025
[2]

Cywiński, B., Ryd, E., Wang, R., Rajamanoharan, S., Nanda, N., Conmy, A., and Marks, S

URL https://openreview.net/forum?id=89ia77nZ8u. Cywiński, B., Ryd, E., Wang, R., Rajamanoharan, S., Nanda, N., Conmy, A., and Marks, S. Eliciting secret knowledge from language models, 2025. URL https://arxiv.org/abs/2510.01070. Dhillon, I. S., Mallela, S., and Kumar, R. A divisive information-theoretic feature clustering algorithm for text classifica...

work page doi:10.1145/1015330.1015408 2025
[3]

Geiger, A., Ibeling, D., Zur, A., Chaudhary, M., Chauhan, S., Huang, J., Arora, A., Wu, Z., Goodman, N., Potts, C., and Icard, T

URL https://openreview.net/forum?id=tcsZt9ZNKD. Geiger, A., Ibeling, D., Zur, A., Chaudhary, M., Chauhan, S., Huang, J., Arora, A., Wu, Z., Goodman, N., Potts, C., and Icard, T. Causal abstraction: A theoretical foundation for mechanistic interpretability. Journal of Machine Learning Research, 26(83):1–64, 2025. URL http://jmlr.org/papers/v26/23-0058.htm...

Pith/arXiv arXiv 2025
[4]

Jacovi, A

URL https://openreview.net/forum?id=F76bwRSLeK. Jacovi, A. and Goldberg, Y. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4198–4205, Online, Jul...

work page doi:10.18653/v1/2020.acl-main 2020
[5]

acl-main.386/

URL https://aclanthology.org/2020.acl-main.386/. Joshi, S., Mueller, A., Klindt, D., Brendel, W., Reizinger, P., and Sridhar, D. Causality is key for interpretability claims to generalise, 2026. URL https://arxiv.org/abs/2602.16698. Lee, B. W., Padhi, I., Ramamurthy, K. N., Miehling,...

arXiv 2020
[6]

Min, S., Michael, J., Hajishirzi, H., and Zettlemoyer, L

URL https://openreview.net/forum?id=-h6WAS6eE4. Min, S., Michael, J., Hajishirzi, H., and Zettlemoyer, L. AmbigQA: Answering ambiguous open-domain questions. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5783–5797, Online, November 2020. Assoc...

work page doi:10.18653/v1/2020.emnlp-main 2020
[7]

Zoom in: An introduction to circuits

URL https://aclanthology.org/2020.emnlp-main.466/. Nanda, N. Attribution patching: Activation patching at industrial scale. https://www.neelnanda.io/mechanistic-interpretability/attribution-patching, Feb 2023. Accessed: 2026-01-29. Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. Progress measures for grokking via mechanistic interpre...

work page doi:10.23915/distill.00024.001 2020
[8]

Deterministic annealing for clustering, compression, classification, regression, and related optimization problems

doi: 10.1109/5.726788. Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987. ISSN 0377-0427. doi: https://doi.org/10.1016/0377-0427(87)90125-7. URL https://www.sciencedirect.com/science/article/pii/0377042787901257. Sabo, K. and Scitovski,...

work page doi:10.1109/5.726788 1987
[9]

Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models

URL https://aclanthology.org/2024.blackboxnlp-1.25/. Tang, T., Luo, W., Huang, H., Zhang, D., Wang, X., Zhao, X., Wei, F., and Wen, J.-R. Language-specific neurons: The key to multilingual capabilities in large language models. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), P...

work page doi:10.18653/v1/2024.acl-long.309 2024
[10]

findings-acl.33/

URL https://aclanthology.org/2024.findings-acl.33/. Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., Rouillard, L., Mesnard, T., Cideron, G., bastien Grill, J., Ramos, S., Yvinec, E., Casbon, M., Pot, E., Penchev, I., Liu, G., Visin, F., Kenealy, K., Beyer, L., Zhai, X., Tsit...

Pith/arXiv arXiv 2024
[11]

cc/paper_files/paper/2020/file/ 92650b2e92217715fe312e6fa7b90d82-Paper

URL https://proceedings.neurips.cc/paper_files/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf. Wang, K. R., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations,

2020
[12]

URL https://openreview.net/forum?id=NpsVSN6o4ul. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li...

Pith/arXiv arXiv 2025
[13]

Concatenate prefix and continuation: full = prefix ⊕ continuation (footnote 3: https://github.com/decoderesearch/circuit-tracer)
[14]

Run a forward pass through the embedding model
[15]

Extract hidden states for only continuation tokens (which attend to the prefix through self-attention)
[16]

Mean pool the continuation hidden states
[17]

L2-normalize the resulting embedding. This captures the semantic meaning of the continuation in the context of the prefix, rather than treating them independently.

B.4. Normalization

We normalize both semantic and attribution embeddings before clustering to ensure comparable scales across continuations.

Semantic Embedding: Spherical Normalization. Semantic ...
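
The five embedding steps listed above (concatenate, forward pass, keep only continuation hidden states, mean-pool, L2-normalize) can be sketched as follows. The forward pass is stubbed out: `hidden_states` is assumed to be the per-token output of some embedding model run on prefix ⊕ continuation, and the function name `continuation_embedding` is hypothetical.

```python
import math


def continuation_embedding(hidden_states, prefix_len):
    """Mean-pool the continuation tokens' hidden states, then L2-normalize.

    hidden_states : list of per-token vectors for prefix + continuation
                    (ASSUMED to come from a forward pass on the full sequence)
    prefix_len    : number of prefix tokens to exclude from pooling
    """
    continuation = hidden_states[prefix_len:]
    if not continuation:
        raise ValueError("continuation is empty; nothing to pool")
    dim = len(continuation[0])
    mean = [sum(tok[d] for tok in continuation) / len(continuation)
            for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Leave an all-zero vector unchanged instead of dividing by zero.
    return [x / norm for x in mean] if norm > 0 else mean
```

Because the continuation tokens attended to the prefix during the forward pass, pooling only over them still yields a prefix-conditioned representation, which is the stated motivation for steps 1 to 3.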
[18]

Sparsity preservation: RMS normalization preserves the sparsity pattern and relative magnitudes within each vector, only standardizing the overall scale
[19]

Robustness to dimensionality: RMS is independent of dimensionality $d_a$, whereas the L2 norm grows with $\sqrt{d_a}$ for vectors with similar per-coordinate magnitudes
[20]

When Normalization is Applied. Normalization is applied once after computing embeddings and attributions, before the clustering algorithm runs

Interpretability: After RMS normalization, attribution values can still be interpreted as relative contributions: a feature with value 2.0 contributes twice as much as one with value 1.0. When Normalization is Applied. Normalization is applied once after computing embeddings and attributions, before the clustering algorithm runs. Cluster centers inherit the nor...
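
The properties claimed for RMS normalization (sparsity preserved, ratios preserved, scale independent of dimensionality) can be checked with a short sketch. The function name `rms_normalize` and the handling of all-zero vectors are assumptions, since the surrounding text is truncated.

```python
import math


def rms_normalize(vec):
    """Divide a vector by its root-mean-square value.

    Zeros stay zero and ratios between entries are unchanged. An all-zero
    vector is returned as a copy (ASSUMED convention).
    """
    rms = math.sqrt(sum(x * x for x in vec) / len(vec))
    if rms == 0.0:
        return list(vec)
    return [x / rms for x in vec]


def l2_norm(vec):
    """Euclidean norm, used here only to contrast with RMS."""
    return math.sqrt(sum(x * x for x in vec))


sparse = [0.0, 2.0, 0.0, 4.0]
normalized = rms_normalize(sparse)

# Constant vectors of different length: equal per-coordinate magnitude,
# so the RMS value is the same while the L2 norm grows with sqrt(dimension).
short_l2, long_l2 = l2_norm([1.0] * 4), l2_norm([1.0] * 16)
```

Here `normalized` keeps its two zero entries and the 2:1 ratio between its non-zero entries, while `short_l2` and `long_l2` differ by a factor of two even though both vectors have the same per-coordinate magnitude.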

2004
[21]

all previous steps

and MATH500 (Lightman et al., 2023) using the Qwen3-0.6B transcoders. For each example, we roll out up to 12 reasoning steps (delimited by “\n\n”) and sample 64 candidate traces per step. The committed reasoning path used for later prefixes is deterministic: at each step, we select the candidate with the highest normalized sequence probability. For every ...
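
The committed-path rule described above (pick the candidate with the highest normalized sequence probability at each step) might look like the sketch below. The text does not define "normalized", so length normalization by mean token log-probability is an assumption, as are the helper names and the toy log-probabilities.

```python
def normalized_logprob(token_logprobs):
    """Mean per-token log-probability (ASSUMED meaning of 'normalized')."""
    return sum(token_logprobs) / len(token_logprobs)


def select_committed_step(candidates):
    """Return the (text, token_logprobs) pair with the highest normalized score.

    candidates : list of (text, token_logprobs) tuples for one reasoning step
    """
    return max(candidates, key=lambda cand: normalized_logprob(cand[1]))


# Toy candidates with made-up log-probabilities, for illustration only.
candidates = [
    ("short step", [-0.5, -0.5]),
    ("longer step", [-0.2, -0.3, -0.4, -0.3]),
    ("unlikely step", [-2.0]),
]
chosen_text, _ = select_committed_step(candidates)
```

Under this assumption the longer candidate is chosen even though its total log-probability is lower than that of the short one, which is the usual reason to length-normalize when comparing steps of different length.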

2023
[22]

Recall the formulas for the volume and surface area of a sphere
[23]

The volume is $V = \frac{4}{3}\pi r^3$ and the surface area is $A = 4\pi r^2$
[24]

Set the numerical equality $\frac{4}{3}\pi r^3 = 4\pi r^2$ and solve for $r$
[25]

Divide both sides by $4\pi r^2$, assuming $r \neq 0$
[26]

Candidate continuations for target step $j = 7$

Re-check the division of $\frac{4}{3}\pi r^3 = 4\pi r^2$ by $4\pi r^2$. Candidate continuations for target step $j = 7$.

| p | Continuation text | (C1, ..., C6) |
| 0.923 | Left side: the volume term divided by the surface-area term gives $\frac{1}{3} r$. | (3,2,3,2,3,3) |
| 0.876 | Left side becomes $\frac{1}{3} r$, and the right side is 1. So this simplifies to $\frac{1}{3} r = 1$; multiplying both sides by 3 gives $r = 3$. | (2,1... |
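
For reference, the algebra that the committed steps and the second candidate continuation walk through can be written out in full. This only restates the worked example above and adds a check of the result:

```latex
\[
\frac{4}{3}\pi r^{3} = 4\pi r^{2}
\;\Longrightarrow\;
\frac{\tfrac{4}{3}\pi r^{3}}{4\pi r^{2}} = 1 \quad (r \neq 0)
\;\Longrightarrow\;
\frac{r}{3} = 1
\;\Longrightarrow\;
r = 3 .
\]
\[
\text{Check: } V = \tfrac{4}{3}\pi \cdot 3^{3} = 36\pi , \qquad A = 4\pi \cdot 3^{2} = 36\pi .
\]
```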
