Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms
Pith reviewed 2026-06-27 19:46 UTC · model grok-4.3
The pith
Unsupervised clustering of LLM continuations by aligning semantic embeddings with mechanistic attribution signatures reveals distinct modes and actionable factors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Distribution-level unsupervised feature discovery clusters sampled continuations on two views at once: semantic embeddings and prefix-to-continuation attribution signatures. Optimizing a rate-distortion objective over both views exposes continuation modes that single-view baselines miss, and steering analyses provide evidence that the cluster signatures are actionable mechanistic factors.
What carries the argument
Rate-distortion objective applied jointly to semantic embeddings and prefix-to-continuation attribution signatures to trade off coherence, consistency, and granularity.
If this is right
- Discovered clusters expose continuation modes that single-view baselines miss.
- Cluster signatures correspond to actionable mechanistic factors under intervention.
- The approach provides a scalable audit of mechanisms underlying a model's continuation distribution.
- It complements target-conditioned circuit analysis by addressing heterogeneity across the continuation distribution.
Where Pith is reading between the lines
- The method could extend mechanistic interpretability from single prompts to entire output distributions at scale.
- It may support identification of latent behavioral patterns without requiring predefined targets or human-specified behaviors.
- Applications could include automated detection of failure modes or biases across a model's response space.
- Similar joint objectives might be tested on other attribution or embedding types to refine mechanistic feature discovery.
Load-bearing premise
The rate-distortion objective applied to semantic embeddings and attribution signatures will produce clusters whose mechanistic signatures are both consistent and causally actionable under intervention.
What would settle it
The claim would be undermined if steering interventions on the discovered cluster signatures produced no measurable shift in continuation probabilities relative to random or baseline clusters, or if the joint clusters showed no improvement over semantic-only or attribution-only baselines in exposing distinct modes.
Original abstract
As large language models are increasingly deployed in high-stakes settings, there is a growing need for tools that audit not only model outputs but also the internal computations that produce them. Circuit analysis is a central approach in mechanistic interpretability, but it is typically target-conditioned, explaining a single prompt paired with a chosen completion. This target-conditioned setup can obscure heterogeneity across a model's continuation distribution. We introduce distribution-level unsupervised feature discovery, which clusters sampled continuations using both semantic content and sequence-level mechanistic attributions, without manually specifying target outputs. Our method represents each continuation with a semantic embedding and a prefix-to-continuation attribution signature, then optimizes a rate-distortion objective that trades off semantic coherence, mechanistic consistency, and cluster granularity. Across clustering and steering analyses, the discovered clusters expose continuation modes that single-view baselines miss and provide interventional evidence that cluster signatures correspond to actionable mechanistic factors. Overall, our approach complements circuit analysis and behavioral evaluation by providing a scalable audit of the mechanisms underlying a model's continuation distribution.
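The abstract describes the objective only in words, so the following is a minimal sketch of what a joint rate-distortion score over the two views could look like. Everything here is an assumption for illustration: the function name `joint_rd_objective`, the squared-distance distortion, the entropy-based rate term, and the weights `lam_attr` and `beta` are hypothetical and are not taken from the paper.

```python
import numpy as np

def joint_rd_objective(sem, attr, labels, lam_attr=1.0, beta=0.1):
    """Hypothetical joint score: two-view distortion plus a rate penalty.

    sem:    (n, d_s) semantic embeddings
    attr:   (n, d_a) attribution signatures
    labels: (n,) integer cluster assignments
    """
    n = len(labels)
    distortion_sem = 0.0
    distortion_attr = 0.0
    rate = 0.0
    for k in np.unique(labels):
        mask = labels == k
        p_k = mask.sum() / n
        # Distortion: squared distance of each member to its cluster centroid,
        # computed separately in the semantic and the attribution view.
        distortion_sem += ((sem[mask] - sem[mask].mean(axis=0)) ** 2).sum()
        distortion_attr += ((attr[mask] - attr[mask].mean(axis=0)) ** 2).sum()
        # Rate: entropy (in nats) of the cluster-size distribution,
        # which grows as the partition becomes finer.
        rate -= p_k * np.log(p_k)
    return (distortion_sem + lam_attr * distortion_attr) / n + beta * rate

# Synthetic data: two groups that are separated in BOTH views.
rng = np.random.default_rng(0)
sem = np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(5, 0.1, (20, 4))])
attr = np.vstack([rng.normal(0, 0.1, (20, 6)), rng.normal(-5, 0.1, (20, 6))])
good = np.array([0] * 20 + [1] * 20)   # matches the true grouping
bad = np.array([0, 1] * 20)            # mixes the two groups

print(joint_rd_objective(sem, attr, good) < joint_rd_objective(sem, attr, bad))
```

Under these assumptions, a partition that is coherent in both views scores lower than one that mixes groups, while `beta` penalizes additional clusters; the paper's actual objective may differ in every term.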
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces distribution-level unsupervised feature discovery for LLMs. It represents sampled continuations via semantic embeddings and prefix-to-continuation attribution signatures, then optimizes a rate-distortion objective trading off semantic coherence, mechanistic consistency, and granularity. The central claim is that the resulting clusters reveal continuation modes missed by single-view baselines and supply interventional evidence that the cluster signatures identify actionable mechanistic factors, thereby complementing target-conditioned circuit analysis.
Significance. If the interventional results hold and are shown to follow from the joint objective, the work supplies a scalable audit of mechanisms across a model's continuation distribution rather than isolated prompts. The joint semantic-mechanistic clustering is a clear methodological contribution; explicit credit is due for attempting to move beyond coherence metrics to steering-based validation.
major comments (2)
- [§3] §3 (rate-distortion objective): the objective enforces intra-cluster coherence across the two views but contains no explicit term or post-optimization constraint for causal effect size or robustness to distribution shift. The manuscript must demonstrate that the reported steering effects are larger (or more consistent) for joint clusters than for post-hoc selection of high-coherence clusters from either view alone; otherwise the actionability claim does not follow from the optimization.
- [§5.3] §5.3 (steering analyses): the interventional evidence is load-bearing, yet the text supplies no quantitative comparison of effect sizes or success rates between joint clusters and the single-view baselines that the abstract claims are inferior. Without these controls it is impossible to rule out that the observed steering arises from the attribution signatures themselves rather than from the joint rate-distortion clustering.
minor comments (2)
- [Abstract] Abstract: the phrase 'provide interventional evidence' should be qualified by the specific steering metric and baseline comparison used in the experiments.
- [§3.1] Notation: the precise definition of the attribution signature (e.g., which layers or heads are aggregated) should be stated once in a single equation or table for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments correctly identify that stronger quantitative controls are needed to link the joint optimization to the interventional results. We respond to each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [§3] §3 (rate-distortion objective): the objective enforces intra-cluster coherence across the two views but contains no explicit term or post-optimization constraint for causal effect size or robustness to distribution shift. The manuscript must demonstrate that the reported steering effects are larger (or more consistent) for joint clusters than for post-hoc selection of high-coherence clusters from either view alone; otherwise the actionability claim does not follow from the optimization.
  Authors: We agree that the rate-distortion objective optimizes for coherence and consistency but does not contain an explicit term for causal effect size. The manuscript currently reports steering results only for the joint clusters and does not include the requested post-hoc comparison against high-coherence selections from single views. In the revision we will add this comparison, reporting effect sizes and consistency metrics for joint clusters versus post-hoc selections from each view alone. This will be placed in an expanded §3 and cross-referenced in the steering section.
  Revision: yes
- Referee: [§5.3] §5.3 (steering analyses): the interventional evidence is load-bearing, yet the text supplies no quantitative comparison of effect sizes or success rates between joint clusters and the single-view baselines that the abstract claims are inferior. Without these controls it is impossible to rule out that the observed steering arises from the attribution signatures themselves rather than from the joint rate-distortion clustering.
  Authors: We concur that the absence of direct quantitative comparisons between joint clusters and single-view baselines leaves the superiority claim under-supported. The current §5.3 presents steering results for the discovered clusters but does not tabulate success rates or effect-size differences against the single-view baselines. We will revise §5.3 to include these side-by-side metrics (effect sizes, success rates, and statistical tests) for joint versus single-view clusters, thereby isolating the contribution of the joint rate-distortion procedure.
  Revision: yes
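The side-by-side comparison requested above could be computed roughly as follows. This is a sketch only: the helper names are hypothetical, Cohen's d and a two-sided permutation test are one reasonable choice among many and are not confirmed as the authors' planned statistics, and the numbers are synthetic placeholders rather than results from the paper.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference between two samples (pooled variance)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def permutation_pvalue(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of means."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = abs(perm[: len(a)].mean() - perm[len(a):].mean())
        count += int(diff >= observed)
    # Add-one correction keeps the estimate strictly positive.
    return (count + 1) / (n_perm + 1)

# SYNTHETIC placeholder steering shifts (not data from the paper):
# one value per cluster, e.g. change in target continuation probability.
rng = np.random.default_rng(1)
joint_shift = rng.normal(0.30, 0.05, 30)
single_view_shift = rng.normal(0.10, 0.05, 30)

d = cohens_d(joint_shift, single_view_shift)
p = permutation_pvalue(joint_shift, single_view_shift)
print(f"Cohen's d = {d:.2f}, permutation p = {p:.4f}")
```

Reporting both an effect size and a test statistic separates "how large is the difference" from "could it be chance", which is the distinction the referee's comment turns on.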
Circularity Check
No circularity: clustering objective defines coherence; interventional claims rest on separate experiments
Full rationale
The paper defines its method as representing continuations via semantic embeddings and prefix-to-continuation attribution signatures, then optimizing a rate-distortion objective trading off semantic coherence, mechanistic consistency, and granularity. This objective enforces intra-cluster agreement in the two views by construction, but the central claims (exposing modes missed by baselines; cluster signatures being actionable mechanistic factors) are supported by comparative clustering results and steering/intervention analyses, which are not reduced to the objective itself. No self-citations, uniqueness theorems, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the provided text. The derivation is self-contained as an unsupervised procedure whose empirical claims are externally validated rather than tautological.
Footnote 3: https://github.com/decoderesearch/circuit-tracer
1. Concatenate prefix and continuation: full = prefix ⊕ continuation.
2. Run a forward pass through the embedding model.
3. Extract hidden states for only the continuation tokens (which attend to the prefix through self-attention).
4. Mean-pool the continuation hidden states.
5. L2-normalize the resulting embedding.
This captures the semantic meaning of the continuation in the context of the prefix, rather than treating them independently.
B.4. Normalization
We normalize both semantic and attribution embeddings before clustering to ensure comparable scales across continuations.
Semantic Embedding: Spherical Normalization. Semantic ...
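The embedding steps listed before the normalization subsection (slice out the continuation, mean-pool, L2-normalize) can be sketched as below. The function name and the random array standing in for real hidden states are assumptions; in practice `hidden_states` would come from a forward pass of the embedding model over prefix ⊕ continuation, which this sketch does not perform.

```python
import numpy as np

def continuation_embedding(hidden_states, prefix_len):
    """Pool continuation-token hidden states into one unit-length vector.

    hidden_states: (seq_len, d) array for the full prefix + continuation input
    prefix_len:    number of prefix tokens at the start of the sequence
    """
    # Keep only continuation positions; in a real model these states already
    # depend on the prefix through self-attention.
    cont = hidden_states[prefix_len:]
    if cont.shape[0] == 0:
        raise ValueError("continuation is empty")
    pooled = cont.mean(axis=0)            # mean pooling over tokens
    norm = np.linalg.norm(pooled)         # L2 norm of the pooled vector
    return pooled / norm if norm > 0 else pooled

# Stand-in for model output: 3 prefix tokens, 4 continuation tokens, d = 4.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(7, 4))
emb = continuation_embedding(hidden, prefix_len=3)
print(emb.shape, np.linalg.norm(emb))
```

Because the result has unit L2 norm, dot products between two such embeddings are cosine similarities, which is consistent with the spherical normalization the excerpt names for the semantic view.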
- Sparsity preservation: RMS normalization preserves the sparsity pattern and relative magnitudes within each vector, only standardizing the overall scale.
- Robustness to dimensionality: RMS is independent of dimensionality d_a, whereas the L2 norm grows with √d_a for vectors with similar per-coordinate magnitudes.
- Interpretability: After RMS normalization, attribution values can still be interpreted as relative contributions; a feature with value 2.0 contributes twice as much as one with value 1.0.
When Normalization is Applied. Normalization is applied once after computing embeddings and attributions, before the clustering algorithm runs. Cluster centers inherit the nor...
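The three properties claimed for RMS normalization can be checked numerically. The sketch below assumes RMS normalization means dividing a vector by the square root of its mean squared entry; the excerpt does not show the paper's exact formula, and the `eps` guard against all-zero vectors is an addition of this sketch.

```python
import numpy as np

def rms_normalize(x, eps=1e-12):
    """Divide x by its root-mean-square value (assumed definition)."""
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    return x / (rms + eps)   # eps avoids division by zero for empty signatures

# Sparsity and relative magnitudes survive the rescaling.
sig = np.array([0.0, 2.0, 0.0, 1.0])
out = rms_normalize(sig)
print(out)                        # zeros stay zero
print(out[1] / out[3])            # ratio between active features is unchanged

# Dimensionality: vectors with similar per-coordinate magnitude have the
# same RMS regardless of length, while their L2 norm grows with sqrt(d).
for d in (16, 1024):
    v = np.ones(d)
    print(d, np.sqrt(np.mean(v ** 2)), np.linalg.norm(v))
```

The last loop illustrates why an L2 norm would shrink per-coordinate values in high-dimensional attribution vectors, whereas RMS keeps them on a comparable scale across dimensionalities.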
... and MATH500 (Lightman et al., 2023) using the Qwen3-0.6B transcoders. For each example, we roll out up to 12 reasoning steps (delimited by "\n\n") and sample 64 candidate traces per step. The committed reasoning path used for later prefixes is deterministic: at each step, we select the candidate with the highest normalized sequence probability. For every ...
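The deterministic commit rule described above (pick the candidate with the highest normalized sequence probability at each step) might look like the sketch below. The excerpt does not define "normalized", so this assumes length normalization, that is, the geometric mean of the token probabilities; the function names and the example log-probabilities are hypothetical.

```python
import math

def normalized_sequence_prob(token_logprobs):
    """Geometric mean of token probabilities: exp(mean of log-probs).

    ASSUMPTION: 'normalized' is taken to mean length-normalized.
    """
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def commit_step(candidates):
    """Return the (text, token_logprobs) pair with the highest normalized prob."""
    return max(candidates, key=lambda c: normalized_sequence_prob(c[1]))

# Hypothetical candidates for one reasoning step.
candidates = [
    ("short", [-0.1, -0.1]),       # total log-prob -0.2
    ("long", [-0.05] * 6),         # total log-prob -0.3, but higher per token
    ("unlikely", [-2.0, -1.5]),
]
text, logprobs = commit_step(candidates)
print(text, round(normalized_sequence_prob(logprobs), 3))
```

With raw sequence probability the shorter candidate would win because it has fewer tokens to multiply; length normalization removes that bias, which is one plausible reason for normalizing before committing a step.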
- Recall the formulas for the volume and surface area of a sphere.
- The volume is $V = \frac{4}{3}\pi r^3$ and the surface area is $A = 4\pi r^2$.
- Set the numerical equality $\frac{4}{3}\pi r^3 = 4\pi r^2$ and solve for $r$.
- Divide both sides by $4\pi r^2$, assuming $r \neq 0$.
- Re-check the division of $\frac{4}{3}\pi r^3 = 4\pi r^2$ by $4\pi r^2$.
Candidate continuations for target step $j = 7$.
p | Continuation text | $(C_1, \dots, C_6)$
0.923 | Left side: the volume term divided by the surface-area term gives $\frac{1}{3}r$. | (3,2,3,2,3,3)
0.876 | Left side becomes $\frac{1}{3}r$, and the right side is 1. So this simplifies to $\frac{1}{3}r = 1$; multiplying both sides by 3 gives $r = 3$. | (2,1...