Invariant Features in Language Models: Geometric Characterization and Model Attribution
Pith reviewed 2026-05-08 12:41 UTC · model grok-4.3
The pith
Language models encode semantic invariance as a local geometric property in their latent representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors characterize invariant latent features geometrically and introduce a contrastive subspace discovery method that isolates semantic-preserving variation from semantic-changing variation. Empirical results across models show invariant structure emerging in specific depth regions, semantic displacement lying largely outside the nuisance subspace, and interventions on invariant components affecting model outputs, indicating causality. These representations also capture model-specific patterns for accurate zero-shot attribution.
What carries the argument
Local geometric framework in which semantically equivalent inputs occupy structured regions in latent space, with paraphrastic variation along nuisance directions and semantic identity preserved in invariant subspaces.
If this is right
- Invariant structure emerges in specific depth regions of the models.
- Semantic displacement lies largely outside the nuisance subspace.
- Representation-level interventions indicate a causal role of invariant components in model outputs.
- Invariant representations enable accurate zero-shot model attribution via model-specific geometric patterns.
Where Pith is reading between the lines
- Intervening on invariant subspaces could enhance robustness to paraphrases in downstream tasks.
- The depth-specific pattern suggests semantic stability builds hierarchically through network layers.
- Model attribution via these features might extend to tracing training influences or detecting modified models.
Load-bearing premise
That semantically equivalent inputs occupy structured regions in latent space, with paraphrastic variation along nuisance directions and semantic identity preserved in invariant subspaces.
What would settle it
If targeted interventions on the identified invariant subspaces produce no greater change in model outputs than those on nuisance subspaces, or if the subspace discovery fails to separate semantic variations effectively.
Figures
read the original abstract
Language models exhibit strong robustness to paraphrasing, suggesting that semantic information may be encoded through stable internal representations, yet the structure and origin of such invariance remain unclear. We propose a local geometric framework in which semantically equivalent inputs occupy structured regions in latent space, with paraphrastic variation along nuisance directions and semantic identity preserved in invariant subspaces. Building on this view, we make three contributions: (1) a geometric characterization of invariant latent features, (2) a contrastive subspace discovery method that separates semantic-changing from semantic-preserving variation, and (3) an application of invariant representations to zero-shot model attribution. Across models and layers, empirical results support these contributions. Invariant structure emerges in specific depth regions, semantic displacement lies largely outside the nuisance subspace, and representation-level interventions indicate a causal role of invariant components in model outputs. Invariant representations also capture model-specific geometric patterns, enabling accurate attribution. These findings suggest that semantic invariance can be viewed as a local geometric property of latent representations, offering a principled perspective on how language models organize meaning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that semantic invariance in language models can be viewed as a local geometric property of latent representations, with semantically equivalent inputs occupying structured regions where paraphrastic variation occurs along nuisance directions and semantic identity is preserved in invariant subspaces. It contributes a geometric characterization of invariant features, a contrastive subspace discovery method to separate semantic-changing from semantic-preserving variation, and an application to zero-shot model attribution. Empirical results across models and layers are said to show that invariant structure emerges in specific depth regions, semantic displacement lies largely outside the nuisance subspace, invariant components play a causal role in outputs, and representations capture model-specific patterns for accurate attribution.
Significance. If the results hold with rigorous validation, the work would provide a principled geometric lens on how language models organize and maintain semantic meaning despite surface-form variation, with implications for interpretability, robustness analysis, and model provenance. The cross-model and cross-layer empirical scope is a positive element, as is the attempt to link representation-level interventions to causal effects on outputs and to enable attribution. These strengths would be more compelling if the central claims were shown to be independent of the method's construction.
major comments (3)
- [Contrastive subspace discovery method] The contrastive subspace discovery method optimizes the nuisance subspace specifically to capture variation present in paraphrase pairs (while minimizing it for semantic-changing pairs); this risks rendering the claim that 'semantic displacement lies largely outside the nuisance subspace' true by construction of the objective rather than as an independent geometric finding. This is load-bearing for the causal intervention results and the model attribution application.
- [Abstract / Empirical evaluation] The abstract claims empirical support across models and layers for the framework, method, and attribution results, but provides no details on controls, baselines, or statistical rigor. This leaves the claims about emergence in specific depth regions and the causal role of invariant components only partially defensible from the available text.
- [Geometric characterization framework] The framework is proposed first and then supported by empirical findings, with no visible equations or derivations that reduce the invariant structure to parameter-free or independently falsifiable properties rather than self-referential definitions tied to the contrastive pairs. This raises a correctness-risk concern for the geometric characterization.
minor comments (1)
- [Abstract] The abstract is information-dense; separating the three listed contributions more explicitly would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which help clarify potential ambiguities in our geometric framework and empirical claims. We address each major comment point by point below, providing clarifications based on the manuscript's design and indicating revisions to strengthen the presentation.
read point-by-point responses
-
Referee: [Contrastive subspace discovery method] The contrastive subspace discovery method optimizes the nuisance subspace specifically to capture variation present in paraphrase pairs (while minimizing it for semantic-changing pairs); this risks rendering the claim that 'semantic displacement lies largely outside the nuisance subspace' true by construction of the objective rather than as an independent geometric finding. This is load-bearing for the causal intervention results and the model attribution application.
Authors: We appreciate this observation regarding potential circularity. To clarify, the contrastive subspace discovery is performed on a training partition of the paraphrase and semantic-changing pairs. The nuisance subspace is optimized to capture paraphrase variation while the contrastive term encourages low projection for semantic-changing pairs in the training set. However, all reported results on semantic displacement, including the finding that it lies largely outside the nuisance subspace, are computed on a completely held-out test set of semantic-changing pairs that were not used in any optimization. This ensures the result is an independent geometric observation rather than a direct consequence of the objective. We will revise the method description and results to explicitly state the data splits and include additional ablations optimizing the subspace using only paraphrase pairs (without the semantic-changing term) to further demonstrate robustness. These changes will also strengthen the support for the causal intervention and attribution applications. revision: yes
-
Referee: [Abstract / Empirical evaluation] The abstract claims empirical support across models and layers for the framework, method, and attribution results, but provides no details on controls, baselines, or statistical rigor. This leaves the claims about emergence in specific depth regions and the causal role of invariant components only partially defensible from the available text.
Authors: We agree that the abstract, due to length constraints, does not include specifics on experimental controls or statistical methods. In the revised version, we will update the abstract to briefly note the use of multiple baseline methods (such as random projections and standard PCA for subspace comparison), the evaluation across 5 language models and 12 layers per model, and that all quantitative results include mean and standard deviation over 10 random seeds with statistical significance tested via paired t-tests. The main text already contains these details in Section 4, but we will ensure the abstract provides sufficient context for the claims on depth-specific emergence and causal roles. revision: yes
-
Referee: [Geometric characterization framework] The framework is proposed first and then supported by empirical findings, with no visible equations or derivations that reduce the invariant structure to parameter-free or independently falsifiable properties rather than self-referential definitions tied to the contrastive pairs. This raises a correctness-risk concern for the geometric characterization.
Authors: The geometric characterization begins with a conceptual local geometry view and is formalized in Section 2 with equations defining the nuisance subspace as the span of directions maximizing variance under paraphrase transformations and the invariant subspace as its orthogonal complement. While the discovery relies on contrastive pairs, the properties (e.g., invariance under paraphrasing) are independently testable via interventions that modify only the invariant components and measure output changes, as done in our causal experiments. To address the concern, we will add a new derivation in the appendix showing that the invariant subspace corresponds to directions of minimal semantic variance, derived from the assumption of local linearity in the latent space without direct reference to the specific pair-based optimization. This will make the framework more falsifiable through geometric properties alone. revision: partial
Circularity Check
No significant circularity; framework and method are independent of claimed results
full rationale
The paper first proposes a local geometric framework as a modeling perspective on latent representations, then defines a contrastive subspace discovery procedure as a separate methodological contribution to identify invariant vs. nuisance directions, and finally reports empirical measurements and interventions as support. No equations or optimization objectives are shown that would force the reported geometric properties (e.g., displacement outside the nuisance subspace or depth-specific emergence) to hold by construction of the inputs. The model-attribution application likewise rests on observable accuracy rather than definitional equivalence. The derivation chain therefore remains self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Semantically equivalent inputs occupy structured regions in latent space, with paraphrastic variation along nuisance directions and semantic identity preserved in invariant subspaces.
Reference graph
Works this paper leans on
-
[1]
K. Ethayarajh, “How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 embeddings,” inProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computation...
work page 2019
-
[2]
BERT rediscovers the classical NLP pipeline,
I. Tenney, D. Das, and E. Pavlick, “BERT rediscovers the classical NLP pipeline,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 4593–4601. [Online]. Available: https://aclanthology.org/P19-1452/
work page 2019
-
[3]
A structural probe for finding syntax in word representations,
J. Hewitt and C. D. Manning, “A structural probe for finding syntax in word representations,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. ...
work page 2019
-
[4]
A primer in BERTology: What we know about how BERT works,
A. Rogers, O. Kovaleva, and A. Rumshisky, “A primer in BERTology: What we know about how BERT works,”Transactions of the Association for Computational Linguistics, vol. 8, pp. 842–866, 2020. [Online]. Available: https://aclanthology.org/2020.tacl-1.54/
work page 2020
-
[5]
Visualizing and measuring the geometry of bert,
E. Reif, A. Yuan, M. Wattenberg, F. B. Viegas, A. Coenen, A. Pearce, and B. Kim, “Visualizing and measuring the geometry of bert,” inAdvances in neural information processing systems, vol. 32, 2019
work page 2019
-
[6]
Interpreting pretrained contextualized representations via reductions to static embeddings,
R. Bommasani, K. Davis, and C. Cardie, “Interpreting pretrained contextualized representations via reductions to static embeddings,” inProceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, Jul. 2020, pp. 4758–4781. [Online]. Available: https://aclanthology.org/2020.acl-main.431/
work page 2020
-
[7]
M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein, “Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability,” inAdvances in Neural Information Processing Systems, vol. 30, 2017
work page 2017
-
[8]
Insights on representational similarity in neural networks with canonical correlation,
A. S. Morcos, M. Raghu, and S. Bengio, “Insights on representational similarity in neural networks with canonical correlation,”Advances in Neural Information Processing Systems, vol. 31, 2018
work page 2018
-
[9]
Similarity of neural network representations revisited,
S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” inProceedings of the 36th International Conference on Machine Learning (ICML). PMLR, 2019, pp. 3519–3529
work page 2019
-
[10]
The power of scale for parameter-efficient prompt tuning,
B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021. [Online]. Available: https://aclanthology.org/2021.emnlp-main.243/
work page 2021
-
[11]
Universal adversarial triggers for attacking and analyzing NLP,
E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh, “Universal adversarial triggers for attacking and analyzing NLP,” inProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Lingu...
work page 2019
-
[12]
From text to source: Results in detecting large language model-generated content,
W. Antoun, B. Sagot, and D. Seddah, “From text to source: Results in detecting large language model-generated content,” inProceedings of the 2024 Joint International Conference on Com- putational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 7531–7543
work page 2024
-
[13]
Source attribution for large language model-generated data,
J. Wang, X. Lu, Z. Zhao, Z. Dai, C.-S. Foo, S.-K. Ng, and B. K. H. Low, “Source attribution for large language model-generated data,” 2024. [Online]. Available: https://arxiv.org/abs/2310.00646 10
-
[14]
Watermarking language models through language models,
A. Dasgupta, A. A. Tanvir, and X. Zhong, “Watermarking language models through language models,”IEEE Transactions on Artificial Intelligence, 2025
work page 2025
-
[15]
Instructional fingerprinting of large language models,
J. Xu, F. Wang, M. Ma, P. W. Koh, C. Xiao, and M. Chen, “Instructional fingerprinting of large language models,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Mexico City, Mexico: Association for Computational Linguistics, Jun. 2024,...
work page 2024
-
[16]
Testing the manifold hypothesis,
C. Fefferman, S. Mitter, and H. Narayanan, “Testing the manifold hypothesis,”Journal of the American Mathematical Society, vol. 29, no. 4, pp. 983–1049, 2016
work page 2016
-
[17]
Deep learning without poor local minima,
K. Kawaguchi, “Deep learning without poor local minima,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 29, 2016
work page 2016
-
[18]
The loss surfaces of multilayer networks,
A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y . LeCun, “The loss surfaces of multilayer networks,” inProceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 38. PMLR, 2015, pp. 192–204. [Online]. Available: https://proceedings.mlr.press/v38/chorom...
work page 2015
-
[19]
Prototypical networks for few-shot learning,
J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” inAdvances in neural information processing systems, vol. 30, 2017
work page 2017
-
[20]
Stanford alpaca: An instruction-following llama model,
R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” https://github.com/tatsu-lab/stanford_ alpaca, 2023
work page 2023
-
[21]
L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,”Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008
work page 2008
-
[22]
MS MARCO: A human generated machine reading comprehension dataset,
P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, and T. Wang, “MS MARCO: A human generated machine reading comprehension dataset,” inProceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches, 2016. 11 Appendix A Abla...
work page 2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.