Relational Linear Properties in Language Models: An Empirical Investigation

Emanuele Marconato; Giovanni Valer; Luigi Gresele; Marco Bronzini

arxiv: 2605.22532 · v1 · pith:I6WVSZTWnew · submitted 2026-05-21 · 💻 cs.LG

Pith reviewed 2026-05-22 07:41 UTC · model grok-4.3

classification 💻 cs.LG

keywords relational linearity · language models · probing methods · KL divergence · embeddings · linear maps · paraphrased queries

The pith

A linear map can predict an object's embedding from its subject's for any fixed relation in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the idea that language model representations encode relations in a linear fashion, so that for a given relation the unembedding of an object can be recovered from the embedding of its subject through a linear transformation. It introduces a probing technique based on Kullback-Leibler divergence to measure the strength of this property and applies it across multiple models and datasets. The results show that the degree of linearity differs between models, follows recognizable patterns from one layer to the next, and changes when the same relation is expressed in different words. A reader would care because this structure could explain how models store and retrieve factual associations internally.

Core claim

Relational linearity holds to varying degrees: for a fixed relation, the unembedding of an object can be predicted from the embedding of its subject by a linear map. The paper demonstrates this using a new probing method based on Kullback-Leibler divergence that compares predicted and observed output distributions. Experiments across four datasets reveal that the property varies across models, exhibits layer-wise patterns, and is affected differently by paraphrased versions of the relational query.

What carries the argument

The KL-divergence probing method, which finds the linear map that minimizes the divergence between the model's actual next-token distribution for the object and the distribution obtained by transforming the subject embedding.
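As a rough illustration of the idea described above, the following sketch fits a linear map by minimizing KL divergence between target next-token distributions and the distributions obtained from transformed embeddings. It is not the paper's implementation: the names `H`, `P`, `U`, `A`, `b`, and `fit_kl_probe` are hypothetical, the data are synthetic, and plain gradient descent is an assumed choice of optimizer.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_kl(p, q, eps=1e-12):
    """Mean KL(p || q) over rows."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def fit_kl_probe(H, P, U, steps=1000, lr=0.1):
    """Fit (A, b) so that softmax((H @ A + b) @ U.T) approximates P in KL.

    H: (n, d) embeddings, P: (n, k) target distributions,
    U: (k, d) fixed unembedding matrix (assumed known and frozen).
    """
    n, d = H.shape
    A = np.zeros((d, d))
    b = np.zeros(d)
    for _ in range(steps):
        Q = softmax((H @ A + b) @ U.T)
        G = (Q - P) / n      # gradient of the mean KL w.r.t. the logits
        GE = G @ U           # gradient w.r.t. the mapped embeddings
        A -= lr * (H.T @ GE)
        b -= lr * GE.sum(axis=0)
    return A, b

# Synthetic stand-ins for model quantities (assumptions, not real model data).
rng = np.random.default_rng(0)
d, k, n = 8, 20, 200
H = rng.normal(size=(n, d))               # plays the role of context embeddings f(s)
U = rng.normal(size=(k, d)) / np.sqrt(d)  # plays the role of the unembedding matrix
A_true = rng.normal(size=(d, d))
P = softmax((H @ A_true) @ U.T)           # targets generated by a linear map

kl_before = mean_kl(P, softmax(np.zeros((n, k))))  # A = 0, b = 0 gives uniform output
A, b = fit_kl_probe(H, P, U)
kl_after = mean_kl(P, softmax((H @ A + b) @ U.T))
print(f"mean KL before fitting: {kl_before:.3f}, after: {kl_after:.3f}")
```

Because the logits are linear in `A` and `b`, this toy objective is convex, so a small fixed step size is enough here; on real model activations a different optimizer and regularization may be needed.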

If this is right

The strength of relational linearity differs from one language model to another.
The property shows systematic changes across successive layers that align with where linguistic information is processed.
Paraphrasing the way a relation is stated in the input query alters the measured degree of linearity.
The new probing approach evaluates the property more efficiently than methods that rely on Jacobian approximations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the linear maps prove stable, they could be used to edit specific relational facts inside a model by adjusting only the map rather than the full parameters.
The same measurement approach might be applied to test linearity for other kinds of structured knowledge beyond binary relations.
Layer-wise differences could guide the choice of which internal states to inspect or modify when studying model reasoning.

Load-bearing premise

The KL-divergence probing method accurately captures the degree of relational linearity without introducing its own biases or requiring Jacobian approximations.

What would settle it

A concrete falsifier would be to find that, for a large collection of relations and subject-object pairs, the lowest KL divergence achieved by any linear map is no smaller than the divergence achieved by a random map or by using the subject embedding unchanged.
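The comparison described above can be written as a small decision rule over per-relation KL values. This is only a sketch: `falsifier_check` is a hypothetical helper, the strict "no relation improves" criterion is one possible reading of the falsifier, and the numbers passed in are placeholders rather than results from the paper.

```python
import numpy as np

def falsifier_check(kl_fitted, kl_random, kl_identity):
    """Compare the best fitted-map KL per relation against two baselines.

    Each argument holds one KL value per relation:
    kl_fitted   - lowest KL reached by a fitted linear map,
    kl_random   - KL obtained with a random linear map,
    kl_identity - KL obtained with the subject embedding left unchanged.
    """
    kl_fitted = np.asarray(kl_fitted, dtype=float)
    best_baseline = np.minimum(np.asarray(kl_random, dtype=float),
                               np.asarray(kl_identity, dtype=float))
    gap = best_baseline - kl_fitted  # > 0 where the fitted map beats both baselines
    return {
        "mean_gap": float(gap.mean()),
        "fraction_improved": float((gap > 0).mean()),
        # Assumed criterion: falsified only if no relation improves at all.
        "falsified": bool(np.all(gap <= 0)),
    }

# Placeholder values for three relations, for illustration only.
report = falsifier_check([0.1, 0.3, 0.2], [1.5, 1.2, 1.9], [0.9, 1.4, 0.8])
print(report)
```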

Figures

Figures reproduced from arXiv: 2605.22532 by Emanuele Marconato, Giovanni Valer, Luigi Gresele, Marco Bronzini.

**Figure 1.** Relational linearity of final-layer representations for the TENSE dataset. For Llama-3.1 (two left columns) and Gemma-2 (two right columns), we compare the embeddings f(s ⌢ q), computed from contexts concatenated with the query q = “What is the tense of the previous sentence?”, with embeddings obtained by KL-based Relational Probe (KL-RP) on the context-only embeddings f(s), for contexts s from the TENSE da…

**Figure 2.** Relational linear probing (KL-RP) with different paraphrases. The graphs show F1(LLM) for both models.

… improvements in middle layers (with major drops in the TENSE and TRUTH datasets). Last layers often increase in dKL but not substantially enough to lower F1 scores. Overall, middle layers for all datasets show high strong and weak relational linear probing for Llama-3.1. Instead, Gemma-2 representations i…

**Figure 3.** Relational linear probing across layers (normalized into the range 0–100), comparing Llama-3.1 and Gemma-2.

… progressively in Gemma-2, indicating that its middle layers may capture more abstract representations of this latent property, and (ii) Llama-3.1 appears to encode relational linearity earlier, as changes in query surface form have no effect across all layers.

5. Limitations. Our work presents some l…

**Figure 4.** Sensitivity of LRE when changing hyperparameters. F1(LLM) scores on TRUTH using different configurations of LRE hyperparameters, (a) varying ℓ, β, and ρ, or (b) with a fixed layer ℓ.

B.3. Randomly permuted embeddings for baseline. The random baseline is defined by rearranging the embeddings with π, a random permutation of {1, …, N}, so that the randomized training data is: D_q^rand = f(s_π(i)), p(· | …
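The permuted-embedding baseline mentioned in the B.3 fragment can be sketched as follows. The source text is truncated, so the exact pairing is an assumption: here embedding f(s_π(i)) is paired with the i-th target distribution, which breaks the correspondence between contexts and outputs. `permuted_pairs`, `H`, and `P` are hypothetical names, and the arrays are toy data.

```python
import numpy as np

def permuted_pairs(H, P, seed=0):
    """Pair embedding f(s_pi(i)) with target distribution i, for a random permutation pi."""
    rng = np.random.default_rng(seed)
    pi = rng.permutation(H.shape[0])  # random permutation of the N row indices
    return H[pi], P, pi

# Toy data: 6 contexts with 2-dimensional embeddings and 3-way target distributions.
H = np.arange(12, dtype=float).reshape(6, 2)
P = np.full((6, 3), 1.0 / 3.0)
H_rand, P_rand, pi = permuted_pairs(H, P)
print(pi)
```

A probe trained on `(H_rand, P_rand)` then serves as a reference point: any fit quality it reaches cannot come from a genuine context-to-output relationship.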

**Figure 5.** Histograms of probabilities for Llama-3.1 and Gemma-2. The x-axis reports max p(y | q˜ ⌢ s ⌢ ˜t) when prompted with the query. The y-axis is the frequency in log scale. We observe that Gemma-2 tends to be more overconfident than Llama-3.1, collapsing output probabilities to high confidence regions two times more in order of magnitude than low confidence intervals.

Original abstract

Linear properties are ubiquitous in the representations of language models; however, testing them experimentally remains a challenging task. This work focuses on relational linearity: the hypothesis that, for a fixed relation (e.g., "plays"), the unembedding of an object (e.g., "trumpet") can be predicted from the embedding of its subject (e.g., "Miles Davis") by a linear map. We present an experimental method to test the formulation of relational linearity by Marconato et al. (2025). Specifically, we introduce a probing method, based on Kullback-Leibler divergence, to evaluate this property and examine its variation across layers and paraphrased relational queries. It is also more efficient than previous work; for example, it avoids the crude Jacobian approximations used in Linear Relational Embeddings by Hernandez et al. (2024). Our findings across four datasets show that relational linearity varies across models, exhibits layer-wise patterns consistent with prior observations about linguistic information in model representations, and is differently affected by changes in how the relation is phrased.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a KL-divergence probe for relational linearity that skips Jacobian approximations and reports layer and paraphrase effects, but the probe's link to isolated linear predictability of embeddings needs tighter validation.

The main point is that this work tests relational linearity with a KL-divergence probe and finds the property changes across models, layers, and paraphrased queries on four datasets. It builds directly on the Marconato et al. formulation and avoids the approximations in Hernandez et al. by minimizing divergence between the model's output distribution and a target tied to the object token. The layer-wise patterns line up with known results on where linguistic structure appears in transformers, and the paraphrase sensitivity shows the measured linearity is not fixed to one wording of the relation. That combination is the actual new piece here, and the efficiency claim holds if the probe runs as described. The empirical consistency across datasets gives a practical starting map for where these relations sit in different models.

The soft spot is the one the stress test flags. The KL is taken over the full token distribution after unembedding and softmax, so it is possible for low divergence to arise from probability mass patterns that do not require the subject-to-object map to be strictly linear in embedding space. A side-by-side with plain least-squares regression on the same embeddings would have clarified how much the probe isolates the claimed property, but that check is missing. The citation pattern is straightforward and the datasets are standard, so nothing looks off there.

This is the sort of paper interpretability researchers will want to read when they are building tools for knowledge editing or controlled generation. A reader who already works with linear probes or relational embeddings will get usable data points and a method worth trying on their own models. It is coherent enough on its own terms to go to a serious referee, even though the probe validation will need work in revision. I would send it out for review rather than desk reject.

Referee Report

1 major / 1 minor

Summary. The paper empirically investigates relational linearity in language models: the hypothesis that, for a fixed relation, the unembedding of an object can be linearly predicted from the embedding of its subject. It introduces a KL-divergence-based probing method to test this property (building on Marconato et al. 2025), applies it across four datasets and multiple models, and reports that the property varies across models, exhibits layer-wise patterns, and is affected by paraphrased queries. The method is presented as more efficient than prior work by avoiding Jacobian approximations.

Significance. If the central empirical claims hold after validation of the probe, the work would contribute a scalable test for relational structure in LM representations and document its variation, which could inform mechanistic interpretability and the design of relation-aware interventions. The efficiency claim and layer-wise findings align with existing observations about where linguistic information is encoded.

major comments (1)

[Experimental method] The KL-divergence probing method (described in the experimental method section) minimizes KL between the model's output distribution (conditioned on subject + relation) and a target distribution derived from the object while varying the linear map. This approach remains coupled to the full unembedding matrix and softmax geometry; a non-linear perturbation that preserves marginal token probabilities could produce low KL without satisfying the claimed linear relation between subject embedding and object unembedding. No section reports a direct comparison (e.g., least-squares R² on the same embeddings) that would confirm the probe isolates the intended linear property rather than output-distribution match.
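The direct comparison requested in this comment could look roughly like the sketch below, which fits an ordinary least-squares map between two sets of embeddings and reports R². The function `linear_r2` and the synthetic arrays are assumptions for illustration; they are not taken from the paper, and a real check would use held-out subject and object embeddings rather than in-sample fit.

```python
import numpy as np

def linear_r2(X, Y):
    """Fit Y ~ [X, 1] @ B by ordinary least squares; return in-sample R^2 over all dims."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    ss_res = np.sum((Y - X1 @ B) ** 2)
    ss_tot = np.sum((Y - Y.mean(axis=0)) ** 2)
    return float(1.0 - ss_res / ss_tot)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))                   # stand-in for subject embeddings
A = rng.normal(size=(16, 16))
Y_linear = X @ A + 0.1 * rng.normal(size=(300, 16))  # linearly related targets
Y_nonlinear = X ** 2                                 # targets with no linear signal

r2_linear = linear_r2(X, Y_linear)
r2_nonlinear = linear_r2(X, Y_nonlinear)
print(f"R^2 (linear targets): {r2_linear:.3f}")
print(f"R^2 (squared targets): {r2_nonlinear:.3f}")
```

A high R² alongside a low KL would suggest the probe is tracking linear structure in embedding space, whereas a low R² with a low KL would point to output-distribution effects of the kind the comment describes.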

minor comments (1)

[Abstract] The abstract and introduction would benefit from a brief explicit statement of the precise mathematical formulation of relational linearity being tested (e.g., the exact linear map and how the target distribution is constructed).

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive review of our manuscript on relational linear properties in language models. We address the major comment on the experimental method below and will incorporate additional validation to strengthen the claims.

Referee: [Experimental method] The KL-divergence probing method (described in the experimental method section) minimizes KL between the model's output distribution (conditioned on subject + relation) and a target distribution derived from the object while varying the linear map. This approach remains coupled to the full unembedding matrix and softmax geometry; a non-linear perturbation that preserves marginal token probabilities could produce low KL without satisfying the claimed linear relation between subject embedding and object unembedding. No section reports a direct comparison (e.g., least-squares R² on the same embeddings) that would confirm the probe isolates the intended linear property rather than output-distribution match.

Authors: We appreciate the referee's point regarding potential confounding factors in our KL-divergence probing method. The approach optimizes a linear map to align the model's output distribution (conditioned on subject plus relation) with a target derived from the object, thereby testing whether a linear transformation suffices for the relational prediction under the model's unembedding and softmax. This is intended to directly evaluate the hypothesis from Marconato et al. (2025) in a manner that is more efficient than Jacobian-based alternatives. That said, we acknowledge that the probe is coupled to the output geometry and that non-linear perturbations preserving token probabilities could in principle yield low KL without a strict linear relation in embedding space. To provide independent confirmation, we will add a direct least-squares comparison (including R² metrics) on the same subject and object embeddings in the revised experimental method section. This will help isolate the linear property from output-distribution effects.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical test of external hypothesis

The paper introduces a KL-divergence probing method as an experimental tool to evaluate the relational linearity hypothesis formulated in prior work (Marconato et al. 2025). No equations, fitted parameters, or results are shown to reduce the measured linearity to a quantity defined by the same experiment or to a self-citation chain that bears the central claim. The method is motivated independently as an efficiency improvement over Jacobian approximations, and findings are reported from direct experiments on four datasets. The derivation chain consists of hypothesis testing rather than self-referential construction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the prior formulation of relational linearity from Marconato et al. (2025) and the validity of KL divergence as a linearity metric; no new free parameters, axioms, or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Language model hidden states contain extractable linear relational structure for fixed relations.
This is the hypothesis being tested rather than proved; it is invoked when the probing method is applied to evaluate the linear map prediction.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We introduce a probing method, based on Kullback-Leibler divergence, to evaluate this property and examine its variation across layers and paraphrased relational queries."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "KL-RP builds on a linear model... minimizing the Kullback-Leibler (KL) divergence"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 9 internal anchors

[1]

The CLEF-2025 CheckThat! Lab: Subjectivity, fact-checking, claim normalization, and retrieval

Alam, F., Struß, J. M., Chakraborty, T., Dietze, S., Hafid, S., Korre, K., Muti, A., Nakov, P., Ruggeri, F., Schellhammer, S., Setty, V., Sundriyal, M., Todorov, K., and V., V. The CLEF-2025 CheckThat! Lab: Subjectivity, fact-checking, claim normalization, and retrieval. In Hauff, C., Macdonald, C., Jannach, D., Kazai, G., Nardini, F. M., Pinelli, ...

work page 2025
[2]

Identifying linear relational concepts in large language models

Chanin, D., Hunter, A., and Camburu, O.-M. Identifying linear relational concepts in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 1524–1535,

work page 2024
[3]

Toy Models of Superposition

https://transformer-circuits.pub/2021/framework/index.html. Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., et al. Toy models of superposition. arXiv preprint arXiv:2209.10652,

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

Ferrando, J., Sarti, G., Bisazza, A., and Costa-Jussà, M. R. A primer on the inner workings of transformer-based language models. arXiv preprint arXiv:2405.00208,

work page arXiv
[5]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team, Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Inspecting and editing knowledge representations in language models

Hernandez, E., Li, B. Z., and Andreas, J. Inspecting and editing knowledge representations in language models. arXiv preprint arXiv:2304.00740,

work page arXiv
[8]

Interpreting key mechanisms of factual recall in transformer-based language models

Lv, A., Chen, Y., Zhang, K., Wang, Y., Liu, L., Wen, J.-R., Xie, J., and Yan, R. Interpreting key mechanisms of factual recall in transformer-based language models. arXiv preprint arXiv:2403.19521,

work page arXiv
[9]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Marks, S. and Tegmark, M. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. ArXiv, abs/2310.06824,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Language models implement simple word2vec-style vector arithmetic

URL https://api.semanticscholar.org/CorpusID:263831277. Merullo, J., Eickhoff, C., and Pavlick, E. Language models implement simple word2vec-style vector arithmetic. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 5030–5047,

work page 2024
[11]

The Linear Representation Hypothesis and the Geometry of Large Language Models

Accessed: 2025-27-06. Park, K., Choe, Y. J., and Veitch, V. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658,

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Tracing Relational Knowledge Recall in Large Language Models

Popovič, N. and Färber, M. Tracing relational knowledge recall in large language models. arXiv preprint arXiv:2604.19934,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Linear Representations of Hierarchical Concepts in Language Models

Sakata, M., Heinzerling, B., Ito, T., Yokoi, S., and Inui, K. Linear representations of hierarchical concepts in language models. arXiv preprint arXiv:2604.07886,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Open Problems in Mechanistic Interpretability

Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., Bushnaq, L., Goldowsky-Dill, N., Heimersheim, S., Ortega, A., Bloom, J., et al. Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

A survey on sparse autoencoders: Interpreting the internal mechanisms of large language models

Shu, D., Wu, X., Zhao, H., Rai, D., Yao, Z., Liu, N., and Du, M. A survey on sparse autoencoders: Interpreting the internal mechanisms of large language models. arXiv preprint arXiv:2503.05613,

work page arXiv
[16]

Steering Language Models With Activation Engineering

Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Locating and extracting relational concepts in large language models

Wang, Z., Whyte, B., and Xu, C. Locating and extracting relational concepts in large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 4818–4832,

work page 2024
[18]

A. Additional Related Work. Linear Properties of Language Models. The latent representations of transformer-based models, also known as the residual stream, form a high-dimensional vector space that aggregates the outputs of all hidden layers (Elhage et al., 2021; Roeder et al., 2021; ...

work page 2021
[19]

Hernandez et al

provides first empirical evidence that certain relational mappings in language models can be approximated by linear transformations. Hernandez et al. (2024) frame LRE to inspect relational linear properties, testing it on curated datasets of annotated (subject, relation, object) triplets across 47 relations. Subsequent work by Chanin et al. (2024) inver...

work page 2024
[20]

This work empirically investigates the formulation of relational linearity proposed by Marconato et al

likewise rely on annotated subject–object pairs or relation-type labels to supervise or validate the linear structure. This work empirically investigates the formulation of relational linearity proposed by Marconato et al. (2025), which is grounded in the mechanisms by which language models compute next-token probability distributions. Our approach enable...

work page 2025
[21]

Preliminary experiments involving this method have been conducted using Llama-3.1

This resembles the linear probing technique introduced by Popovič & Färber (2026). Preliminary experiments involving this method have been conducted using Llama-3.1. Table 6 presents the results, showing how the two approaches reach close performance in terms of F1(LLM), but when considering dKL, KL-RP outperforms SVM, as it is specifically trained to...

work page 2026
[22]

All results refer to the middle layer, i.e., ℓ = 16 for Llama-3.1 and ℓ = 13 for Gemma-2

Table 7. Results of different paraphrases. All results refer to the middle layer, i.e., ℓ = 16 for Llama-3.1 and ℓ = 13 for Gemma-2. − marks invalid measurements.
Column groups: Llama-3.1 LANG | Llama-3.1 TRUTH | Gemma-2 LANG | Gemma-2 TRUTH; each group reports F1(GT), F1(LLM), dKL.
i: 0.98 0.98 0.06 | 0.86 0.94 0.04 | 0.51 0.62 2.19 | 0.72 0.90 0.42
ii: – 0.90 0.08 | – 0.93 0.04 | – 0....

work page 2025
