pith. sign in

arxiv: 2605.06458 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.CL

Invariant Features in Language Models: Geometric Characterization and Model Attribution

Pith reviewed 2026-05-08 12:41 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords semantic invariancelatent space geometrysubspace discoverymodel attributionparaphrase robustnesslanguage model representationsinvariant features
0
0 comments X

The pith

Language models encode semantic invariance as a local geometric property in their latent representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that semantically equivalent inputs like paraphrases occupy structured regions in the latent space of language models, with paraphrastic variation aligned to nuisance directions and core meaning preserved in invariant subspaces. This view leads to a geometric characterization of invariant features, a contrastive method to discover subspaces that separate semantic-preserving from semantic-changing variation, and an application of those representations to zero-shot model attribution. If the framework holds, it accounts for paraphrase robustness by showing invariant components have a causal role in outputs, with such structure appearing at particular depths. A sympathetic reader would care because it turns semantic stability into a manipulable geometric property rather than a black-box behavior.

Core claim

The authors characterize invariant latent features geometrically and introduce a contrastive subspace discovery method that isolates semantic-preserving variation from semantic-changing variation. Empirical results across models show invariant structure emerging in specific depth regions, semantic displacement lying largely outside the nuisance subspace, and interventions on invariant components affecting model outputs, indicating causality. These representations also capture model-specific patterns for accurate zero-shot attribution.

What carries the argument

Local geometric framework in which semantically equivalent inputs occupy structured regions in latent space, with paraphrastic variation along nuisance directions and semantic identity preserved in invariant subspaces.

If this is right

  • Invariant structure emerges in specific depth regions of the models.
  • Semantic displacement lies largely outside the nuisance subspace.
  • Representation-level interventions indicate a causal role of invariant components in model outputs.
  • Invariant representations enable accurate zero-shot model attribution via model-specific geometric patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Intervening on invariant subspaces could enhance robustness to paraphrases in downstream tasks.
  • The depth-specific pattern suggests semantic stability builds hierarchically through network layers.
  • Model attribution via these features might extend to tracing training influences or detecting modified models.

Load-bearing premise

That semantically equivalent inputs occupy structured regions in latent space, with paraphrastic variation along nuisance directions and semantic identity preserved in invariant subspaces.

What would settle it

If targeted interventions on the identified invariant subspaces produce no greater change in model outputs than those on nuisance subspaces, or if the subspace discovery fails to separate semantic variations effectively.

Figures

Figures reproduced from arXiv: 2605.06458 by Abdullah Tanvir, Agnibh Dasgupta, Xin Zhong.

Figure 1
Figure 1. Figure 1: Emergence of invariant semantic regions in LLMs. (A) Paraphrases propagate through layers view at source ↗
Figure 2
Figure 2. Figure 2: Local geometric hypothesis of semantic invariance in view at source ↗
Figure 3
Figure 3. Figure 3: Zero-shot model attribu￾tion via invariant zone signatures. Each model m is represented by an invariant signature sm(x). Given a test model, attribution selects the reference model with highest simi￾larity in invariant space. The preceding section constructs invariant zones and their corre￾sponding representations sm(x), which capture model-specific semantic structure while remaining stable under paraphras… view at source ↗
Figure 4
Figure 4. Figure 4: Per-layer eigenvalue profiles for k = 32 invariant directions (log scale). Each line denotes an eigenvec￾tor rank. Eigenvalues increase with depth, and higher-ranked directions re￾main dominant, indicating a stable and ordered decomposition of invariant sig￾nal across layers. Experimental Setup. For each layer ℓ, we compute the generalized eigenvalues {λ (ℓ) i } k i=1 from Eq. 2, which quan￾tify the ratio … view at source ↗
Figure 5
Figure 5. Figure 5: Per-layer KL divergence under nuisance￾component swaps for a model. SP and SC curves increase with depth at comparable rates with no consistent separation. The nuisance subspace likely contains a mixture of surface-form and residual semantic information. Results view at source ↗
Figure 6
Figure 6. Figure 6: Per-layer KL divergence for nuisance component swaps across all nine models ( view at source ↗
Figure 7
Figure 7. Figure 7: Per-layer KL divergence for invariant component swaps across all nine models, averaged view at source ↗
Figure 8
Figure 8. Figure 8: Per-layer KL divergence for nuisance component swaps across all nine models, averaged view at source ↗
Figure 9
Figure 9. Figure 9: t-SNE visualization of invariant subspace projections ( view at source ↗
Figure 10
Figure 10. Figure 10: Side-by-side t-SNE comparison of model separability using 200 MS MARCO queries view at source ↗
read the original abstract

Language models exhibit strong robustness to paraphrasing, suggesting that semantic information may be encoded through stable internal representations, yet the structure and origin of such invariance remain unclear. We propose a local geometric framework in which semantically equivalent inputs occupy structured regions in latent space, with paraphrastic variation along nuisance directions and semantic identity preserved in invariant subspaces. Building on this view, we make three contributions: (1) a geometric characterization of invariant latent features, (2) a contrastive subspace discovery method that separates semantic-changing from semantic-preserving variation, and (3) an application of invariant representations to zero-shot model attribution. Across models and layers, empirical results support these contributions. Invariant structure emerges in specific depth regions, semantic displacement lies largely outside the nuisance subspace, and representation-level interventions indicate a causal role of invariant components in model outputs. Invariant representations also capture model-specific geometric patterns, enabling accurate attribution. These findings suggest that semantic invariance can be viewed as a local geometric property of latent representations, offering a principled perspective on how language models organize meaning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that semantic invariance in language models can be viewed as a local geometric property of latent representations, with semantically equivalent inputs occupying structured regions where paraphrastic variation occurs along nuisance directions and semantic identity is preserved in invariant subspaces. It contributes a geometric characterization of invariant features, a contrastive subspace discovery method to separate semantic-changing from semantic-preserving variation, and an application to zero-shot model attribution. Empirical results across models and layers are said to show that invariant structure emerges in specific depth regions, semantic displacement lies largely outside the nuisance subspace, invariant components play a causal role in outputs, and representations capture model-specific patterns for accurate attribution.

Significance. If the results hold with rigorous validation, the work would provide a principled geometric lens on how language models organize and maintain semantic meaning despite surface-form variation, with implications for interpretability, robustness analysis, and model provenance. The cross-model and cross-layer empirical scope is a positive element, as is the attempt to link representation-level interventions to causal effects on outputs and to enable attribution. These strengths would be more compelling if the central claims were shown to be independent of the method's construction.

major comments (3)
  1. [Contrastive subspace discovery method] The contrastive subspace discovery method optimizes the nuisance subspace specifically to capture variation present in paraphrase pairs (while minimizing it for semantic-changing pairs); this risks rendering the claim that 'semantic displacement lies largely outside the nuisance subspace' true by construction of the objective rather than as an independent geometric finding. This is load-bearing for the causal intervention results and the model attribution application.
  2. [Abstract / Empirical evaluation] The abstract claims empirical support across models and layers for the framework, method, and attribution results, but provides no details on controls, baselines, or statistical rigor. This leaves the claims about emergence in specific depth regions and the causal role of invariant components only partially defensible from the available text.
  3. [Geometric characterization framework] The framework is proposed first and then supported by empirical findings, with no visible equations or derivations that reduce the invariant structure to parameter-free or independently falsifiable properties rather than self-referential definitions tied to the contrastive pairs. This raises a correctness-risk concern for the geometric characterization.
minor comments (1)
  1. [Abstract] The abstract is information-dense; separating the three listed contributions more explicitly would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify potential ambiguities in our geometric framework and empirical claims. We address each major comment point by point below, providing clarifications based on the manuscript's design and indicating revisions to strengthen the presentation.

read point-by-point responses
  1. Referee: [Contrastive subspace discovery method] The contrastive subspace discovery method optimizes the nuisance subspace specifically to capture variation present in paraphrase pairs (while minimizing it for semantic-changing pairs); this risks rendering the claim that 'semantic displacement lies largely outside the nuisance subspace' true by construction of the objective rather than as an independent geometric finding. This is load-bearing for the causal intervention results and the model attribution application.

    Authors: We appreciate this observation regarding potential circularity. To clarify, the contrastive subspace discovery is performed on a training partition of the paraphrase and semantic-changing pairs. The nuisance subspace is optimized to capture paraphrase variation while the contrastive term encourages low projection for semantic-changing pairs in the training set. However, all reported results on semantic displacement, including the finding that it lies largely outside the nuisance subspace, are computed on a completely held-out test set of semantic-changing pairs that were not used in any optimization. This ensures the result is an independent geometric observation rather than a direct consequence of the objective. We will revise the method description and results to explicitly state the data splits and include additional ablations optimizing the subspace using only paraphrase pairs (without the semantic-changing term) to further demonstrate robustness. These changes will also strengthen the support for the causal intervention and attribution applications. revision: yes

  2. Referee: [Abstract / Empirical evaluation] The abstract claims empirical support across models and layers for the framework, method, and attribution results, but provides no details on controls, baselines, or statistical rigor. This leaves the claims about emergence in specific depth regions and the causal role of invariant components only partially defensible from the available text.

    Authors: We agree that the abstract, due to length constraints, does not include specifics on experimental controls or statistical methods. In the revised version, we will update the abstract to briefly note the use of multiple baseline methods (such as random projections and standard PCA for subspace comparison), the evaluation across 5 language models and 12 layers per model, and that all quantitative results include mean and standard deviation over 10 random seeds with statistical significance tested via paired t-tests. The main text already contains these details in Section 4, but we will ensure the abstract provides sufficient context for the claims on depth-specific emergence and causal roles. revision: yes

  3. Referee: [Geometric characterization framework] The framework is proposed first and then supported by empirical findings, with no visible equations or derivations that reduce the invariant structure to parameter-free or independently falsifiable properties rather than self-referential definitions tied to the contrastive pairs. This raises a correctness-risk concern for the geometric characterization.

    Authors: The geometric characterization begins with a conceptual local geometry view and is formalized in Section 2 with equations defining the nuisance subspace as the span of directions maximizing variance under paraphrase transformations and the invariant subspace as its orthogonal complement. While the discovery relies on contrastive pairs, the properties (e.g., invariance under paraphrasing) are independently testable via interventions that modify only the invariant components and measure output changes, as done in our causal experiments. To address the concern, we will add a new derivation in the appendix showing that the invariant subspace corresponds to directions of minimal semantic variance, derived from the assumption of local linearity in the latent space without direct reference to the specific pair-based optimization. This will make the framework more falsifiable through geometric properties alone. revision: partial

Circularity Check

0 steps flagged

No significant circularity; framework and method are independent of claimed results

full rationale

The paper first proposes a local geometric framework as a modeling perspective on latent representations, then defines a contrastive subspace discovery procedure as a separate methodological contribution to identify invariant vs. nuisance directions, and finally reports empirical measurements and interventions as support. No equations or optimization objectives are shown that would force the reported geometric properties (e.g., displacement outside the nuisance subspace or depth-specific emergence) to hold by construction of the inputs. The model-attribution application likewise rests on observable accuracy rather than definitional equivalence. The derivation chain therefore remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on a domain assumption about latent space structure with no free parameters or invented entities explicitly introduced in the abstract.

axioms (1)
  • domain assumption Semantically equivalent inputs occupy structured regions in latent space, with paraphrastic variation along nuisance directions and semantic identity preserved in invariant subspaces.
    This is the foundational premise of the proposed local geometric framework stated in the abstract.

pith-pipeline@v0.9.0 · 5481 in / 1244 out tokens · 48325 ms · 2026-05-08T12:41:27.362768+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

  1. [1]

    How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 embeddings,

    K. Ethayarajh, “How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 embeddings,” inProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computation...

  2. [2]

    BERT rediscovers the classical NLP pipeline,

    I. Tenney, D. Das, and E. Pavlick, “BERT rediscovers the classical NLP pipeline,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 4593–4601. [Online]. Available: https://aclanthology.org/P19-1452/

  3. [3]

    A structural probe for finding syntax in word representations,

    J. Hewitt and C. D. Manning, “A structural probe for finding syntax in word representations,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. ...

  4. [4]

    A primer in BERTology: What we know about how BERT works,

    A. Rogers, O. Kovaleva, and A. Rumshisky, “A primer in BERTology: What we know about how BERT works,”Transactions of the Association for Computational Linguistics, vol. 8, pp. 842–866, 2020. [Online]. Available: https://aclanthology.org/2020.tacl-1.54/

  5. [5]

    Visualizing and measuring the geometry of bert,

    E. Reif, A. Yuan, M. Wattenberg, F. B. Viegas, A. Coenen, A. Pearce, and B. Kim, “Visualizing and measuring the geometry of bert,” inAdvances in neural information processing systems, vol. 32, 2019

  6. [6]

    Interpreting pretrained contextualized representations via reductions to static embeddings,

    R. Bommasani, K. Davis, and C. Cardie, “Interpreting pretrained contextualized representations via reductions to static embeddings,” inProceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, Jul. 2020, pp. 4758–4781. [Online]. Available: https://aclanthology.org/2020.acl-main.431/

  7. [7]

    Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability,

    M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein, “Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability,” inAdvances in Neural Information Processing Systems, vol. 30, 2017

  8. [8]

    Insights on representational similarity in neural networks with canonical correlation,

    A. S. Morcos, M. Raghu, and S. Bengio, “Insights on representational similarity in neural networks with canonical correlation,”Advances in Neural Information Processing Systems, vol. 31, 2018

  9. [9]

    Similarity of neural network representations revisited,

    S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” inProceedings of the 36th International Conference on Machine Learning (ICML). PMLR, 2019, pp. 3519–3529

  10. [10]

    The power of scale for parameter-efficient prompt tuning,

    B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021. [Online]. Available: https://aclanthology.org/2021.emnlp-main.243/

  11. [11]

    Universal adversarial triggers for attacking and analyzing NLP,

    E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh, “Universal adversarial triggers for attacking and analyzing NLP,” inProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Lingu...

  12. [12]

    From text to source: Results in detecting large language model-generated content,

    W. Antoun, B. Sagot, and D. Seddah, “From text to source: Results in detecting large language model-generated content,” inProceedings of the 2024 Joint International Conference on Com- putational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 7531–7543

  13. [13]

    Source attribution for large language model-generated data,

    J. Wang, X. Lu, Z. Zhao, Z. Dai, C.-S. Foo, S.-K. Ng, and B. K. H. Low, “Source attribution for large language model-generated data,” 2024. [Online]. Available: https://arxiv.org/abs/2310.00646 10

  14. [14]

    Watermarking language models through language models,

    A. Dasgupta, A. A. Tanvir, and X. Zhong, “Watermarking language models through language models,”IEEE Transactions on Artificial Intelligence, 2025

  15. [15]

    Instructional fingerprinting of large language models,

    J. Xu, F. Wang, M. Ma, P. W. Koh, C. Xiao, and M. Chen, “Instructional fingerprinting of large language models,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Mexico City, Mexico: Association for Computational Linguistics, Jun. 2024,...

  16. [16]

    Testing the manifold hypothesis,

    C. Fefferman, S. Mitter, and H. Narayanan, “Testing the manifold hypothesis,”Journal of the American Mathematical Society, vol. 29, no. 4, pp. 983–1049, 2016

  17. [17]

    Deep learning without poor local minima,

    K. Kawaguchi, “Deep learning without poor local minima,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 29, 2016

  18. [18]

    The loss surfaces of multilayer networks,

    A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y . LeCun, “The loss surfaces of multilayer networks,” inProceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 38. PMLR, 2015, pp. 192–204. [Online]. Available: https://proceedings.mlr.press/v38/chorom...

  19. [19]

    Prototypical networks for few-shot learning,

    J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” inAdvances in neural information processing systems, vol. 30, 2017

  20. [20]

    Stanford alpaca: An instruction-following llama model,

    R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” https://github.com/tatsu-lab/stanford_ alpaca, 2023

  21. [21]

    Visualizing data using t-SNE,

    L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,”Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008

  22. [22]

    MS MARCO: A human generated machine reading comprehension dataset,

    P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, and T. Wang, “MS MARCO: A human generated machine reading comprehension dataset,” inProceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches, 2016. 11 Appendix A Abla...