arxiv: 2606.08236 · v1 · pith:ZN4TQ2AEnew · submitted 2026-06-06 · 💻 cs.CL · cs.LG

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

Pith reviewed 2026-06-27 19:46 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords: unsupervised feature discovery · mechanistic interpretability · rate-distortion objective · semantic embeddings · attribution signatures · continuation distribution · language model auditing

The pith

Unsupervised clustering of LLM continuations by aligning semantic embeddings with mechanistic attribution signatures reveals distinct modes and actionable factors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces distribution-level unsupervised feature discovery for auditing language model continuation distributions. It represents each sampled continuation with a semantic embedding and a prefix-to-continuation attribution signature, then optimizes a rate-distortion objective to balance semantic coherence, mechanistic consistency, and cluster granularity. This joint clustering produces groups that single-view baselines miss and supplies interventional evidence linking cluster signatures to controllable internal factors. The method operates without manually chosen target outputs, complementing target-conditioned circuit analysis by handling heterogeneity across the full continuation distribution.

Core claim

Distribution-level unsupervised feature discovery clusters sampled continuations using semantic content and sequence-level mechanistic attributions by optimizing a rate-distortion objective on semantic embeddings and prefix-to-continuation attribution signatures, exposing continuation modes missed by baselines and providing evidence that cluster signatures are actionable mechanistic factors.

What carries the argument

Rate-distortion objective applied jointly to semantic embeddings and prefix-to-continuation attribution signatures to trade off coherence, consistency, and granularity.
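
A minimal sketch of what such a joint rate-distortion clustering could look like, assuming a deterministic-annealing style update. The summary above does not give the paper's exact objective, so the names `rd_cluster`, `beta` (rate versus distortion trade-off), and `gamma` (weight between the semantic and attribution distortions) are illustrative assumptions, as is the use of squared Euclidean distance in both views.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def weighted_mean(vectors, weights):
    """Weighted centroid of a list of vectors."""
    total = sum(weights)
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(len(vectors[0]))]

def rd_cluster(sem, mech, probs, init_assign, k, beta=5.0, gamma=0.5, iters=50):
    """Soft clustering over two views (hypothetical helper, not the paper's code).

    sem, mech   : per-continuation semantic and attribution vectors
    probs       : sampling weights of the continuations (assumed to sum to 1)
    init_assign : initial hard cluster index for each continuation
    Returns q, where q[n][c] is the soft assignment of continuation n to cluster c.
    """
    n = len(sem)
    q = [[1.0 if init_assign[i] == c else 0.0 for c in range(k)] for i in range(n)]
    for _ in range(iters):
        # Cluster masses and per-view centers under the current assignment.
        mass = [sum(probs[i] * q[i][c] for i in range(n)) for c in range(k)]
        live = [c for c in range(k) if mass[c] > 1e-12]
        sem_c, mech_c = {}, {}
        for c in live:
            w = [probs[i] * q[i][c] for i in range(n)]
            sem_c[c] = weighted_mean(sem, w)
            mech_c[c] = weighted_mean(mech, w)
        # Boltzmann update: q(c|n) is proportional to mass_c * exp(-beta * distortion).
        for i in range(n):
            d = {c: gamma * sq_dist(sem[i], sem_c[c])
                    + (1.0 - gamma) * sq_dist(mech[i], mech_c[c]) for c in live}
            d_min = min(d.values())  # subtract the minimum for numerical stability
            scores = [mass[c] * math.exp(-beta * (d[c] - d_min)) if c in d else 0.0
                      for c in range(k)]
            z = sum(scores)
            q[i] = [s / z for s in scores]
    return q

# Toy data: two groups that are separated in both views.
sem = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mech = [[1.0, 0.0], [0.9, 0.0], [0.0, 1.0], [0.0, 0.9]]
q = rd_cluster(sem, mech, probs=[0.25] * 4, init_assign=[0, 0, 0, 1], k=2)
```

In this sketch a larger `beta` yields harder, finer-grained assignments, and moving `gamma` toward 0 or 1 recovers a single-view clustering, which is the kind of baseline the joint objective is compared against.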

If this is right

  • Discovered clusters expose continuation modes that single-view baselines miss.
  • Cluster signatures correspond to actionable mechanistic factors under intervention.
  • The approach provides a scalable audit of mechanisms underlying a model's continuation distribution.
  • It complements target-conditioned circuit analysis by addressing heterogeneity across the continuation distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend mechanistic interpretability from single prompts to entire output distributions at scale.
  • It may support identification of latent behavioral patterns without requiring predefined targets or human-specified behaviors.
  • Applications could include automated detection of failure modes or biases across a model's response space.
  • Similar joint objectives might be tested on other attribution or embedding types to refine mechanistic feature discovery.

Load-bearing premise

The rate-distortion objective applied to semantic embeddings and attribution signatures will produce clusters whose mechanistic signatures are both consistent and causally actionable under intervention.

What would settle it

If steering interventions on the discovered cluster signatures produce no measurable shift in continuation probabilities relative to random or baseline clusters, or if the joint clusters show no improvement over semantic-only or attribution-only baselines in exposing distinct modes.
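
One way to make this falsification test concrete is a label-shuffling comparison between steering effects measured for discovered cluster signatures and for random or baseline feature sets. The sketch below is an assumption about how such a check might be run, not the paper's protocol; `permutation_pvalue` and the example effect values are hypothetical.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def permutation_pvalue(cluster_effects, baseline_effects, n_perm=2000, seed=0):
    """One-sided permutation test for mean(cluster) > mean(baseline).

    Returns the observed mean difference and the fraction of label shuffles
    whose difference is at least as large (with a +1 correction).
    """
    rng = random.Random(seed)
    observed = mean(cluster_effects) - mean(baseline_effects)
    pooled = list(cluster_effects) + list(baseline_effects)
    n_c = len(cluster_effects)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mean(pooled[:n_c]) - mean(pooled[n_c:]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical per-prompt shifts in continuation probability after steering.
cluster = [0.80, 0.90, 1.00, 1.10, 0.95, 1.05]
baseline = [0.00, 0.10, -0.10, 0.05, -0.05, 0.02]
diff, p = permutation_pvalue(cluster, baseline)
```

A large p-value here would correspond to the "no measurable shift relative to random or baseline clusters" outcome described above.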

Figures

Figures reproduced from arXiv: 2606.08236 by Hyunjin Cho, Jaehyung Kim, Youngji Roh.

Figure 1
Figure 1: Misalignment between semantic and mechanistic spaces. Top: An example triplet where (A1, A2) share semantic meaning (cosine similarity) but use different internal mechanisms (attributions, L1 distance), while (A2, A3) differ semantically but use similar mechanisms. Bottom: Silhouette scores (Rousseeuw, 1987) for K-means partitions learned in one view and evaluated in both views; a partition that is clean i…
Figure 1
Figure 1: Overview of distribution-level unsupervised feature discovery. Continuation Sampling: For a prefix $x = x_{1:M}$, we sample continuations $y^{(n)} = y^{(n)}_{1:T_n}$ with probability weights $P_n$. Dual Representation: Each continuation is represented by a semantic embedding $e_n$ and a mechanistic embedding $a_n$, where $a_n$ aggregates prefix-feature effects over the continuation span. Joint Clustering via Rate-Distortion: Rate–d…
Figure 3
Figure 3: Cluster mechanistic similarity across β values. Jaccard similarity of top-100 mechanistic features between clusters at β ∈ {0.5, 0.75, 1.0} for the prompt “Who sang the song ‘If god was one of us?’”. The notation $C^K_i$ denotes cluster index $i$ among the $K$ clusters produced at a given β. Higher cross-β similarity reflects parent–child structure across resolutions; a score-labeled version is provided in …
Figure 4
Figure 4: Cluster structure dynamics across γ values. Each plot is drawn by sweeping γ ∈ [0.1, 0.9] while keeping the rate consistent by varying β (≈ 0.9). Dashed ellipses indicate K-means boundaries in each space with the same number of clusters k as RD-clustering. (Panel titles: Attribution Space: Before and After Split; Semantic Space: Before and After Split; C4: Knowledge Cutoff.)
Figure 6
Figure 7
Figure 7: Mean ℓ1 norm of the attribution vector at each generated-token position. (a) ℓ1 distance. (b) Top-100 Jaccard overlap.
Figure 8
Figure 8: Token-wise attribution similarity as a function of positional gap. Real pairs are compared against random continuations from the same prefix; lower ℓ1 and higher Jaccard indicate greater similarity.
Figure 9
Figure 9: Lower-triangular heatmap of top-100 Jaccard overlap between token-level attribution vectors, averaged across the Qwen3-4B consistency subset.
Figure 10
Figure 10: Dynamic source-mass decomposition by target position. The cached prefix-only curve uses the saved full-continuation attribution, while the dynamic prefix and history curves recompute attribution for each target token using the original prefix plus preceding generated tokens as source context.
Figure 11
Figure 11: Span-mode steering comparison on the Qwen3-4B scaled subset. The full continuation span gives the strongest average signed response, while single-token spans are noisier but remain directionally aligned. (a) Attribution-selected features. (b) Random baseline features.
Figure 12
Figure 12: Average token-level steering effect, measured as demeaned target-logit change across continuation positions. Attribution-selected features produce a more coherent signed response than matched random feature sets. D. Limitations and Future Work. We introduce an unsupervised feature-discovery framework that clusters continuations while jointly controlling semantic variance and mechanistic variance. Despite …
Figure 13
Figure 14
Figure 14: Reasoning step-pair validation heatmap for Qwen3-0.6B. Rows are target steps $j$, columns are previous source steps $i < j$, and cells show the mean absolute Spearman correlation $|\rho_s|$ across GSM8K and MATH500. Question. What is the length, in units, of the radius of a sphere whose volume and surface area, in cubic units and square units, respectively, are numerically equal? Committed previous reasoning steps.…
Figure 15
Figure 15: Fixed target-cluster effects. Rows are prior source steps $i < j$, columns are fixed clusters of target-step continuations, column widths are target-cluster probability mass, and color is the projected Spearman correlation.
Figure 16
Figure 16: Local cluster dynamics across previous reasoning steps. Columns are source steps $i < j$, node heights are local cluster probability mass, ribbons track adjacent continuation-overlap mass, and node colors show local $\rho_s$. E.4. AmbigQA on Qwen3-8B: Full steering results.
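
The top-100 Jaccard overlap used in the Figure 3, 7, and 9 captions can be sketched as follows. Whether the paper ranks features by absolute or signed attribution is not stated in the captions, so ranking by absolute value is an assumption here, and the function names are illustrative.

```python
def top_k_features(attribution, k=100):
    """Indices of the k features with the largest absolute attribution (assumed ranking)."""
    order = sorted(range(len(attribution)),
                   key=lambda i: abs(attribution[i]), reverse=True)
    return set(order[:k])

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy attribution vectors with k=2 instead of 100.
sig_a = [5.0, 4.0, 0.0, 0.0, 1.0]
sig_b = [5.0, 0.0, 4.0, 0.0, 1.0]
overlap = jaccard(top_k_features(sig_a, k=2), top_k_features(sig_b, k=2))
```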
Original abstract

As large language models are increasingly deployed in high-stakes settings, there is a growing need for tools that audit not only model outputs but also the internal computations that produce them. Circuit analysis is a central approach in mechanistic interpretability, but it is typically target-conditioned, explaining a single prompt paired with a chosen completion. This target-conditioned setup can obscure heterogeneity across a model's continuation distribution. We introduce distribution-level unsupervised feature discovery, which clusters sampled continuations using both semantic content and sequence-level mechanistic attributions, without manually specifying target outputs. Our method represents each continuation with a semantic embedding and a prefix-to-continuation attribution signature, then optimizes a rate-distortion objective that trades off semantic coherence, mechanistic consistency, and cluster granularity. Across clustering and steering analyses, the discovered clusters expose continuation modes that single-view baselines miss and provide interventional evidence that cluster signatures correspond to actionable mechanistic factors. Overall, our approach complements circuit analysis and behavioral evaluation by providing a scalable audit of the mechanisms underlying a model's continuation distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces distribution-level unsupervised feature discovery for LLMs. It represents sampled continuations via semantic embeddings and prefix-to-continuation attribution signatures, then optimizes a rate-distortion objective trading off semantic coherence, mechanistic consistency, and granularity. The central claim is that the resulting clusters reveal continuation modes missed by single-view baselines and supply interventional evidence that the cluster signatures identify actionable mechanistic factors, thereby complementing target-conditioned circuit analysis.

Significance. If the interventional results hold and are shown to follow from the joint objective, the work supplies a scalable audit of mechanisms across a model's continuation distribution rather than isolated prompts. The joint semantic-mechanistic clustering is a clear methodological contribution; explicit credit is due for attempting to move beyond coherence metrics to steering-based validation.

major comments (2)
  1. [§3] §3 (rate-distortion objective): the objective enforces intra-cluster coherence across the two views but contains no explicit term or post-optimization constraint for causal effect size or robustness to distribution shift. The manuscript must demonstrate that the reported steering effects are larger (or more consistent) for joint clusters than for post-hoc selection of high-coherence clusters from either view alone; otherwise the actionability claim does not follow from the optimization.
  2. [§5.3] §5.3 (steering analyses): the interventional evidence is load-bearing, yet the text supplies no quantitative comparison of effect sizes or success rates between joint clusters and the single-view baselines that the abstract claims are inferior. Without these controls it is impossible to rule out that the observed steering arises from the attribution signatures themselves rather than from the joint rate-distortion clustering.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'provide interventional evidence' should be qualified by the specific steering metric and baseline comparison used in the experiments.
  2. [§3.1] Notation: the precise definition of the attribution signature (e.g., which layers or heads are aggregated) should be stated once in a single equation or table for reproducibility.
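
To illustrate what a single stated definition of the attribution signature would pin down, here is a minimal sketch that aggregates per-token attribution vectors over the continuation span by summation and compares two signatures by L1 distance. Summation as the aggregation rule is an assumption (the overview figure caption only says the signature "aggregates prefix-feature effects over the continuation span"), and the function names are hypothetical.

```python
def attribution_signature(token_attributions):
    """Sum per-token attribution vectors into one sequence-level signature.

    token_attributions: list with one vector per continuation token, each
    holding the attribution of every prefix feature to that token.
    """
    dim = len(token_attributions[0])
    return [sum(tok[i] for tok in token_attributions) for i in range(dim)]

def l1_distance(u, v):
    """L1 distance between two signatures of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Two continuation tokens, three prefix features.
signature = attribution_signature([[1.0, 0.0, 2.0], [0.5, 0.0, -1.0]])
```

Stating which features, layers, and aggregation rule feed this vector would let a reader reproduce the signature exactly.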

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments correctly identify that stronger quantitative controls are needed to link the joint optimization to the interventional results. We respond to each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3] §3 (rate-distortion objective): the objective enforces intra-cluster coherence across the two views but contains no explicit term or post-optimization constraint for causal effect size or robustness to distribution shift. The manuscript must demonstrate that the reported steering effects are larger (or more consistent) for joint clusters than for post-hoc selection of high-coherence clusters from either view alone; otherwise the actionability claim does not follow from the optimization.

    Authors: We agree that the rate-distortion objective optimizes for coherence and consistency but does not contain an explicit term for causal effect size. The manuscript currently reports steering results only for the joint clusters and does not include the requested post-hoc comparison against high-coherence selections from single views. In the revision we will add this comparison, reporting effect sizes and consistency metrics for joint clusters versus post-hoc selections from each view alone. This will be placed in an expanded §3 and cross-referenced in the steering section. revision: yes

  2. Referee: [§5.3] §5.3 (steering analyses): the interventional evidence is load-bearing, yet the text supplies no quantitative comparison of effect sizes or success rates between joint clusters and the single-view baselines that the abstract claims are inferior. Without these controls it is impossible to rule out that the observed steering arises from the attribution signatures themselves rather than from the joint rate-distortion clustering.

    Authors: We concur that the absence of direct quantitative comparisons between joint clusters and single-view baselines leaves the superiority claim under-supported. The current §5.3 presents steering results for the discovered clusters but does not tabulate success rates or effect-size differences against the single-view baselines. We will revise §5.3 to include these side-by-side metrics (effect sizes, success rates, and statistical tests) for joint versus single-view clusters, thereby isolating the contribution of the joint rate-distortion procedure. revision: yes
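
The side-by-side metrics promised in these responses could be computed along the following lines. This is a sketch under stated assumptions: Cohen's d is one common effect-size choice and the success criterion (effect above a threshold) is hypothetical; neither is confirmed as the authors' planned metric.

```python
import math

def cohens_d(x, y):
    """Standardized mean difference between two samples, using the pooled std."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    pooled = math.sqrt(((len(x) - 1) * vx + (len(y) - 1) * vy)
                       / (len(x) + len(y) - 2))
    return (mx - my) / pooled

def success_rate(effects, threshold=0.0):
    """Fraction of steering runs whose effect exceeds the threshold."""
    return sum(1 for e in effects if e > threshold) / len(effects)

# Hypothetical steering effects for joint clusters and one single-view baseline.
joint = [2.0, 4.0, 6.0]
single_view = [1.0, 3.0, 5.0]
d = cohens_d(joint, single_view)
```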

Circularity Check

0 steps flagged

No circularity: clustering objective defines coherence; interventional claims rest on separate experiments

full rationale

The paper defines its method as representing continuations via semantic embeddings and prefix-to-continuation attribution signatures, then optimizing a rate-distortion objective trading off semantic coherence, mechanistic consistency, and granularity. This objective enforces intra-cluster agreement in the two views by construction, but the central claims (exposing modes missed by baselines; cluster signatures being actionable mechanistic factors) are supported by comparative clustering results and steering/intervention analyses, which are not reduced to the objective itself. No self-citations, uniqueness theorems, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the provided text. The derivation is self-contained as an unsupervised procedure whose empirical claims are externally validated rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations or implementation details, so no free parameters, axioms, or invented entities can be identified.



Reference graph

Works this paper leans on

26 extracted references · 6 canonical work pages

  1. [1]

    Bogdan, P

    URL https://openreview.net/forum?id=8RCmNLeeXx. Bogdan, P. C., Macar, U., Nanda, N., and Conmy, A. Thought anchors: Which llm reasoning steps matter?, 2025. URL https://arxiv.org/abs/2506.19143. Boley, D. Principal direction divisive partitioning. Data Min. Knowl. Discov., 2(4):325–344, December 1998. ISSN 1384-5810. doi: 10.1023/A:1009740529316. U...

  2. [2]

    Cywiński, B., Ryd, E., Wang, R., Rajamanoharan, S., Nanda, N., Conmy, A., and Marks, S

    URL https://openreview.net/forum?id=89ia77nZ8u. Cywiński, B., Ryd, E., Wang, R., Rajamanoharan, S., Nanda, N., Conmy, A., and Marks, S. Eliciting secret knowledge from language models, 2025. URL https://arxiv.org/abs/2510.01070. Dhillon, I. S., Mallela, S., and Kumar, R. A divisive information-theoretic feature clustering algorithm for text classifica...

  3. [3]

    Geiger, A., Ibeling, D., Zur, A., Chaudhary, M., Chauhan, S., Huang, J., Arora, A., Wu, Z., Goodman, N., Potts, C., and Icard, T

    URL https://openreview.net/forum?id=tcsZt9ZNKD. Geiger, A., Ibeling, D., Zur, A., Chaudhary, M., Chauhan, S., Huang, J., Arora, A., Wu, Z., Goodman, N., Potts, C., and Icard, T. Causal abstraction: A theoretical foundation for mechanistic interpretability. Journal of Machine Learning Research, 26(83):1–64, 2025. URL http://jmlr.org/papers/v26/23-0058.htm...

  4. [4]

    Jacovi, A

    URL https://openreview.net/forum?id=F76bwRSLeK. Jacovi, A. and Goldberg, Y. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4198–4205, Online, Jul...

  5. [5]

    acl-main.386/

    URL https://aclanthology.org/2020.acl-main.386/. Joshi, S., Mueller, A., Klindt, D., Brendel, W., Reizinger, P., and Sridhar, D. Causality is key for interpretability claims to generalise, 2026. URL https://arxiv.org/abs/2602.16698. Lee, B. W., Padhi, I., Ramamurthy, K. N., Miehling,...

  6. [6]

    Min, S., Michael, J., Hajishirzi, H., and Zettlemoyer, L

    URL https://openreview.net/forum?id=-h6WAS6eE4. Min, S., Michael, J., Hajishirzi, H., and Zettlemoyer, L. AmbigQA: Answering ambiguous open-domain questions. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5783–5797, Online, November 2020. Assoc...

  7. [7]

    Zoom in: An introduction to circuits

    URL https://aclanthology.org/2020.emnlp-main.466/. Nanda, N. Attribution patching: Activation patching at industrial scale. https://www.neelnanda.io/mechanistic-interpretability/attribution-patching, Feb 2023. Accessed: 2026-01-29. Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. Progress measures for grokking via mechanistic interpre...

  8. [8]

    Deterministic annealing for clustering, compression, classification, regression, and related optimization problems

    doi: 10.1109/5.726788. Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987. ISSN 0377-0427. doi: https://doi.org/10.1016/0377-0427(87)90125-7. URL https://www.sciencedirect.com/science/article/pii/0377042787901257. Sabo, K. and Scitovski,...

  9. [9]

    Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models

    URL https://aclanthology.org/2024.blackboxnlp-1.25/. Tang, T., Luo, W., Huang, H., Zhang, D., Wang, X., Zhao, X., Wei, F., and Wen, J.-R. Language-specific neurons: The key to multilingual capabilities in large language models. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), P...

  10. [10]

    findings-acl.33/

    URL https://aclanthology.org/2024.findings-acl.33/. Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., Rouillard, L., Mesnard, T., Cideron, G., bastien Grill, J., Ramos, S., Yvinec, E., Casbon, M., Pot, E., Penchev, I., Liu, G., Visin, F., Kenealy, K., Beyer, L., Zhai, X., Tsit...

  11. [11]

    cc/paper_files/paper/2020/file/ 92650b2e92217715fe312e6fa7b90d82-Paper

    URL https://proceedings.neurips.cc/paper_files/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf. Wang, K. R., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations,

  12. [12]

    URL https://openreview.net/forum?id=NpsVSN6o4ul. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li...

  13. [13]

    Concatenate prefix and continuation: full = prefix ⊕ continuation (footnote 3: https://github.com/decoderesearch/circuit-tracer)

  14. [14]

    Run a forward pass through the embedding model

  15. [15]

    Extract hidden states for only continuation tokens (which attend to the prefix through self-attention)

  16. [16]

    Mean pool the continuation hidden states

  17. [17]

    L2-normalize the resulting embedding. This captures the semantic meaning of the continuation in the context of the prefix, rather than treating them independently. B.4. Normalization. We normalize both semantic and attribution embeddings before clustering to ensure comparable scales across continuations. Semantic Embedding: Spherical Normalization. Semantic ...

  18. [18]

    Sparsity preservation: RMS normalization preserves the sparsity pattern and relative magnitudes within each vector, only standardizing the overall scale

  19. [19]

    Robustness to dimensionality: RMS is independent of dimensionality $d_a$, whereas L2 norm grows with $\sqrt{d_a}$ for vectors with similar per-coordinate magnitudes

  20. [20]

    When Normalization is Applied. Normalization is applied once after computing embeddings and attributions, before the clustering algorithm runs

    Interpretability: After RMS normalization, attribution values can still be interpreted as relative contributions: a feature with value 2.0 contributes twice as much as one with value 1.0. When Normalization is Applied. Normalization is applied once after computing embeddings and attributions, before the clustering algorithm runs. Cluster centers inherit the nor...

  21. [21]

    all previous steps

    and MATH500 (Lightman et al., 2023) using the Qwen3-0.6B transcoders. For each example, we roll out up to 12 reasoning steps (delimited by “\n\n”) and sample 64 candidate traces per step. The committed reasoning path used for later prefixes is deterministic: at each step, we select the candidate with the highest normalized sequence probability. For every ...

  22. [22]

    Recall the formulas for the volume and surface area of a sphere

  23. [23]

    The volume is $V = \frac{4}{3}\pi r^3$ and the surface area is $A = 4\pi r^2$

  24. [24]

    Set the numerical equality $\frac{4}{3}\pi r^3 = 4\pi r^2$ and solve for $r$

  25. [25]

    Divide both sides by $4\pi r^2$, assuming $r \neq 0$

  26. [26]

    Candidate continuations for target step $j = 7$

    Re-check the division of $\frac{4}{3}\pi r^3 = 4\pi r^2$ by $4\pi r^2$. Candidate continuations for target step $j = 7$ (columns: p; continuation text; (C1, ..., C6)):
      0.923; Left side: the volume term divided by the surface-area term gives $\frac{1}{3}r$.; (3,2,3,2,3,3)
      0.876; Left side becomes $\frac{1}{3}r$, and the right side is 1. So this simplifies to $\frac{1}{3}r = 1$; multiplying both sides by 3 gives $r = 3$.; (2,1...
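
The pooling and normalization steps quoted in entries 14 to 20 above (mean pooling of continuation hidden states, L2 normalization of the semantic embedding, RMS normalization of the attribution vector) can be sketched in plain Python. This is an illustrative sketch under the assumption that hidden states and attributions are available as lists of floats; the function names and the `eps` guard are not from the paper.

```python
import math

def mean_pool(hidden_states):
    """Average the hidden states of the continuation tokens, dimension by dimension."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[i] for h in hidden_states) / n for i in range(dim)]

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (spherical normalization)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

def rms_normalize(v, eps=1e-12):
    """Scale a vector to unit root-mean-square; zeros and ratios are preserved."""
    rms = math.sqrt(sum(x * x for x in v) / len(v))
    return [x / max(rms, eps) for x in v]

# Semantic view: pool two continuation-token states, then normalize.
semantic = l2_normalize(mean_pool([[3.0, 0.0], [3.0, 8.0]]))

# Mechanistic view: a sparse attribution vector keeps its zero pattern.
attribution = rms_normalize([3.0, 0.0, 4.0, 0.0])
```

Because RMS divides by the vector length before taking the square root, the resulting scale does not grow with the number of attribution features, which matches the dimensionality argument in entry 19.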