SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

Aldeida Aleti; Chunyang Chen; Hongyu Zhang; Jian Gu

arxiv: 2606.32022 · v1 · pith:4B2Y7CHJnew · submitted 2026-06-30 · 💻 cs.LG · cs.CL

Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

Pith reviewed 2026-07-01 06:03 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords: semantic reference frame · residual stream dynamics · language models · Voronoi diagram · minimum-action path · parameter efficiency · bi-invertibility · discrete spline

The pith

Semantic Reference Frames anchor residual streams to produce stable semantic coordinates and a minimum-action canonical trace linked to parameter efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Semantic Reference Frames (SemRF) to address inconsistency in semantic measurement across layers by fixing anchors and synchronizing via pseudo-inverse tying. Under restricted bi-invertibility, this yields stable basis coordinates with distortion bounds and near-identity changes. Residual computation is then viewed as a depthwise trajectory on a semantic Voronoi diagram, with a margin-relaxed tube whose canonical trace is the unique minimum-action path obeying a discrete spline equation away from constraints. This setup provides a conditional connection between lower trace complexity and fewer semantic degrees of freedom, potentially relating to parameter efficiency. The approach separates measurement from dynamics for clearer analysis of model computation.

Core claim

SemRF fixes anchors and measures states against them to separate semantic measurement from residual dynamics. Pseudo-inverse tying synchronizes embedding and unembedding. Under restricted bi-invertibility, it produces stable semantic-basis coordinates, distortion bounds, and near-identity changes. The anchors define a semantic Voronoi diagram and a margin-relaxed tube in which the canonical trace is the unique minimum-action path obeying a discrete spline equation away from active constraints. This gives a conditional link to parameter efficiency through lower semantic degrees of freedom.
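
A minimal numerical sketch of what pseudo-inverse tying could look like, under assumptions that are not taken from the paper: the readout `U` is restricted to a small set of anchors (one row per anchor, full row rank), the embedding `E` is defined as its Moore-Penrose pseudo-inverse, and all sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_anchors = 8, 3   # hypothetical sizes: 3 anchors in an 8-dim residual space

# Hypothetical readout restricted to the chosen anchors (one row per anchor).
U = rng.normal(size=(n_anchors, d_model))

# Assumed form of pseudo-inverse tying: the embedding is the pseudo-inverse of U.
E = np.linalg.pinv(U)       # shape (d_model, n_anchors); column j is anchor j

# Synchronization on the span: reading out anchor j returns the j-th unit vector.
sync_error = np.linalg.norm(U @ E - np.eye(n_anchors))

# Coordinates of a residual state survive an embed/readout round trip unchanged.
h = rng.normal(size=d_model)
coords = U @ h
roundtrip = U @ (E @ coords)

print(f"sync error: {sync_error:.2e}")
print("coordinates stable under round trip:", np.allclose(roundtrip, coords))
```

With ordinary weight tying (`E = U.T`) the product `U @ E` is generally not the identity, which is the kind of readout disagreement the frame is meant to remove.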

What carries the argument

Semantic Reference Frame (SemRF) with anchor-based synchronization via pseudo-inverse tying, inducing a semantic Voronoi diagram and margin-relaxed tube whose canonical trace is the minimum-action path.

If this is right

Stable semantic-basis coordinates and distortion bounds across layers.
Near-identity changes in residual computation.
Canonical trace as the unique minimum-action path inside the tube, obeying a discrete spline equation away from active constraints.
Lower trace complexity implies piecewise-linear compressibility and fewer semantic degrees of freedom.
Conditional link to parameter efficiency among admissible model settings.
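
One way to make "trace complexity" concrete is to count knots, meaning layers where the trace bends. The sketch below assumes a knot is an interior layer whose discrete second difference exceeds a tolerance; that definition, the function name `count_knots`, and the tolerance are illustrative assumptions rather than the paper's definitions.

```python
import numpy as np

def count_knots(trace, tol=1e-8):
    """Count interior layers where the discrete curvature exceeds tol.

    trace: sequence of per-layer coordinates, shape (n_layers,) or (n_layers, d).
    """
    trace = np.asarray(trace, dtype=float)
    second_diff = np.diff(trace, n=2, axis=0).reshape(len(trace) - 2, -1)
    curvature = np.linalg.norm(second_diff, axis=1)
    return int(np.sum(curvature > tol))

straight = np.linspace(0.0, 1.0, 6)                # no bends
bent = np.array([0.0, 1.0, 2.0, 2.0, 2.0, 2.0])    # a single bend at layer 2
print(count_knots(straight), count_knots(bent))    # 0 1
```

A trace with few knots can be stored as its knot positions plus linear interpolation, which is the sense in which low curvature would imply piecewise-linear compressibility.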

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

SemRF could be used to compare residual trajectories across different model architectures for shared semantic patterns.
Minimizing action in the tube might suggest new regularization techniques during training.
The discrete spline obedience might allow efficient computation of optimal traces without full simulation.

Load-bearing premise

The guarantees require controlled interface error and small projection residual under explicit tube constraints.

What would settle it

Finding a case where the minimum-action path inside a nonempty margin-relaxed tube with positive quadratic weight is not unique or fails to obey the discrete spline equation away from constraints would falsify the claim.
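
The uniqueness claim can be probed numerically on a toy problem. The sketch below assumes a one-dimensional coordinate, fixed endpoints, no active tube constraints, and a quadratic action of the form alpha * (sum of squared steps) + beta * (sum of squared second differences); this page does not reproduce the paper's exact action, so that form and the names `min_action_path`, `alpha`, and `beta` are assumptions.

```python
import numpy as np

def min_action_path(x_start, x_end, n_layers, alpha=1.0, beta=1.0):
    """Minimize alpha*||D1 x||^2 + beta*||D2 x||^2 with both endpoints fixed."""
    n = n_layers
    D1 = np.diff(np.eye(n), n=1, axis=0)   # first-difference operator, shape (n-1, n)
    D2 = np.diff(np.eye(n), n=2, axis=0)   # second-difference operator, shape (n-2, n)
    H = alpha * D1.T @ D1 + beta * D2.T @ D2

    interior = np.arange(1, n - 1)
    boundary = np.array([0, n - 1])
    x_b = np.array([x_start, x_end], dtype=float)

    # Stationarity on interior layers: H_ii x_i + H_ib x_b = 0.
    H_ii = H[np.ix_(interior, interior)]
    H_ib = H[np.ix_(interior, boundary)]

    x = np.empty(n)
    x[boundary] = x_b
    x[interior] = np.linalg.solve(H_ii, -H_ib @ x_b)
    return x, H_ii

x, H_ii = min_action_path(0.0, 1.0, n_layers=6)
print(np.round(x, 3))
print("smallest Hessian eigenvalue:", np.linalg.eigvalsh(H_ii).min())
```

In this toy setting the interior Hessian is positive definite whenever `alpha > 0`, so the minimizer is unique, and with only the endpoints pinned it coincides with straight-line interpolation. A counterexample of the kind described above would need a nonempty tube and positive quadratic weight but a singular or indefinite Hessian on the free layers.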

Original abstract

Residual-stream analysis asks how language-model computation evolves across depth, but intermediate decoding requires comparable readout coordinates across layers. If embedding anchors and unembedding readout disagree on the chosen span, apparent motion may reflect measurement drift rather than computation. We introduce Semantic Reference Frames (SemRF), an anchor-based formalism separating semantic measurement from residual dynamics. A SemRF fixes anchors and measures states against them. Pseudo-inverse tying gives exact synchronization; under restricted bi-invertibility, SemRF yields stable semantic-basis coordinates, distortion bounds, and near-identity changes. With the frame fixed, residual computation becomes a depthwise semantic trajectory. The anchors induce a semantic Voronoi diagram: distance, or evidence such as logits, assigns each layer to a coarse cell, while coordinates retain within-cell motion and margins. We define layerwise steps, contribution profiles, and imbalance diagnostics, then use the Voronoi trace to define a margin-relaxed tube. The canonical trace is the minimum-action path inside this tube; when nonempty with positive quadratic weight, it is unique and obeys a discrete spline equation away from active constraints. Excess action controls step, curvature, and profile mismatch. Low curvature implies piecewise-linear compressibility and local knowledge density: lower trace complexity means fewer semantic knots. Through the parameter-to-trajectory map, this gives a conditional link to parameter efficiency: among admissible settings fitting data, lower-action and lower-complexity traces use fewer semantic degrees of freedom. The guarantees require controlled interface error and small projection residual under explicit tube constraints.
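
The cell assignment described in the abstract can be illustrated with a small sketch. It assumes Euclidean distance to the anchors and defines the margin as the gap between the second-nearest and nearest anchor distances; both choices, and the name `voronoi_trace`, are assumptions made for illustration.

```python
import numpy as np

def voronoi_trace(states, anchors):
    """Assign each layer's state to its nearest anchor and report the margin.

    states:  (n_layers, d) per-layer states measured in the fixed frame
    anchors: (n_anchors, d) anchor points, with n_anchors >= 2
    Returns (cells, margins); margin = second-nearest distance - nearest distance.
    """
    dists = np.linalg.norm(states[:, None, :] - anchors[None, :, :], axis=2)
    cells = np.argmin(dists, axis=1)
    ordered = np.sort(dists, axis=1)
    margins = ordered[:, 1] - ordered[:, 0]
    return cells, margins

anchors = np.array([[0.0, 0.0], [4.0, 0.0]])
states = np.array([[0.5, 0.0], [1.5, 0.5], [2.5, 0.5], [3.5, 0.0]])
cells, margins = voronoi_trace(states, anchors)
print(cells)                 # [0 0 1 1]
print(np.round(margins, 3))
```

The cell sequence gives the coarse trace, while the margins shrink as the state approaches the cell boundary between layers 1 and 2, which is the within-cell information the coordinates are said to retain.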

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SemRF gives a fresh coordinate system and trajectory formalism for residual streams, but its stability and efficiency claims rest on unverified assumptions about interface error and projection residuals.

The paper's core move is to fix semantic anchors via pseudo-inverse tying so that readout coordinates stay comparable across layers, then treat the residual stream as a trajectory inside a margin-relaxed Voronoi tube whose canonical path is the minimum-action spline. That separation of measurement drift from actual computation is the genuinely new piece; the Voronoi trace and discrete spline equation away from constraints do not appear in the earlier residual-stream literature cited.

It does a clean job laying out layerwise diagnostics, contribution profiles, and the conditional link from low trace complexity to fewer semantic degrees of freedom. If the math goes through, the framework could give interpretability researchers a more principled way to compare trajectories and compressibility.

The soft spot is exactly where the stress-test flags it: every guarantee (stable coordinates, unique min-action path, efficiency implication) is conditioned on restricted bi-invertibility plus controlled interface error and small projection residual under the tube constraints. The abstract states these requirements but supplies neither a proof that typical embedding/unembedding pairs satisfy them nor any empirical check on real residual vectors. If the projection residual routinely exceeds the tube margin, the Voronoi cells and spline uniqueness become undefined for observed data. Without the full derivations or experiments, it is impossible to tell whether this is a minor technicality or a load-bearing gap.

This is for mechanistic interpretability groups already working on residual-stream geometry. A reader who wants new formal tools will get value from the definitions even if the empirical claims need work. It deserves a serious referee to check whether the assumptions can be validated or relaxed.

Referee Report

2 major / 0 minor

Summary. The paper introduces Semantic Reference Frames (SemRF) as an anchor-based formalism to separate semantic measurement from residual-stream dynamics in language models. It claims that pseudo-inverse tying and restricted bi-invertibility yield stable semantic-basis coordinates, distortion bounds, and near-identity changes across layers. With the frame fixed, it defines layerwise steps, contribution profiles, and a semantic Voronoi diagram; the canonical trace is the unique minimum-action path inside a margin-relaxed tube and obeys a discrete spline equation away from active constraints. Excess action controls mismatch diagnostics, and lower trace complexity is linked conditionally to parameter efficiency via fewer semantic degrees of freedom. All guarantees require controlled interface error and small projection residual under explicit tube constraints.

Significance. If the conditioning assumptions hold and the framework applies to trained models, SemRF could supply a geometric and variational lens on depthwise computation, with the minimum-action spline and Voronoi trace offering principled diagnostics for layerwise imbalance and compressibility. The conditional efficiency link, if made quantitative, would connect trajectory complexity directly to semantic degrees of freedom among data-fitting settings.

major comments (2)

[Abstract, final sentence] The central claims (stable coordinates, a unique canonical trace obeying the discrete spline equation, and the conditional parameter-efficiency link) are explicitly conditioned on 'controlled interface error and small projection residual under explicit tube constraints.' The manuscript supplies neither a proof that these hold for typical embedding/unembedding pairs (e.g., that ||(I - U^+ U) h_l|| remains below the tube margin for observed residual-stream vectors) nor empirical verification on trained models. If the residual routinely exceeds the margin, the Voronoi cells and spline equation become undefined for real trajectories, collapsing the uniqueness and efficiency conclusions.
[Abstract] The parameter-to-trajectory map is said to give a 'conditional link to parameter efficiency' via lower-action and lower-complexity traces using fewer semantic degrees of freedom. Without an explicit derivation showing how the quadratic action or knot count bounds the number of free parameters (or an empirical correlation on concrete models), the efficiency claim reduces to a restatement of the fitting assumption and is not yet falsifiable.
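
The residual check named in the first comment could be run along the following lines. This is a sketch on synthetic data: `U` stands for a readout restricted to the chosen anchors, `hidden_states` for per-layer residual vectors, and `tube_margin` for a threshold, all of which are hypothetical placeholders rather than quantities taken from the manuscript.

```python
import numpy as np

def projection_residuals(U, hidden_states):
    """Return ||(I - U^+ U) h_l|| for each row h_l of hidden_states."""
    P = np.linalg.pinv(U) @ U                      # orthogonal projector onto the row space of U
    off_span = hidden_states - hidden_states @ P   # P is symmetric, so row-wise h @ P equals P @ h
    return np.linalg.norm(off_span, axis=1)

rng = np.random.default_rng(0)
U = rng.normal(size=(3, 8))                        # 3 anchors, 8-dim residual space (synthetic)

in_span = rng.normal(size=(5, 3)) @ U              # 5 synthetic layer states inside the span
hidden_states = in_span + 0.01 * rng.normal(size=(5, 8))   # plus a small off-span perturbation

tube_margin = 0.1                                  # hypothetical threshold
residuals = projection_residuals(U, hidden_states)
violations = np.flatnonzero(residuals > tube_margin)

print(np.round(residuals, 4))
print("layers exceeding the margin:", violations.tolist())
```

On a trained model the same function would be applied to recorded residual-stream activations; a large set of violating layers would indicate that the tube construction does not cover the observed trajectories.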

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the conditioning of our claims and the parameter-efficiency link. We respond point by point below and indicate the revisions we will make.

Referee: [Abstract, final sentence] The central claims (stable coordinates, a unique canonical trace obeying the discrete spline equation, and the conditional parameter-efficiency link) are explicitly conditioned on 'controlled interface error and small projection residual under explicit tube constraints.' The manuscript supplies neither a proof that these hold for typical embedding/unembedding pairs (e.g., that ||(I - U^+ U) h_l|| remains below the tube margin for observed residual-stream vectors) nor empirical verification on trained models. If the residual routinely exceeds the margin, the Voronoi cells and spline equation become undefined for real trajectories, collapsing the uniqueness and efficiency conclusions.

Authors: The SemRF framework is developed under the stated conditioning assumptions precisely to guarantee stable coordinates, uniqueness of the canonical trace, and well-defined Voronoi cells and spline equation. The manuscript presents these results as holding inside the regime of controlled interface error and small projection residual; it does not claim the conditions are automatically satisfied by arbitrary embedding/unembedding pairs. We will revise the abstract to foreground the conditioning more explicitly and add a limitations subsection that (i) states the scope of the guarantees and (ii) sketches practical checks for the projection residual on concrete models, thereby clarifying that applicability to trained networks requires separate verification.
Revision: yes

Referee: [Abstract] The parameter-to-trajectory map is said to give a 'conditional link to parameter efficiency' via lower-action and lower-complexity traces using fewer semantic degrees of freedom. Without an explicit derivation showing how the quadratic action or knot count bounds the number of free parameters (or an empirical correlation on concrete models), the efficiency claim reduces to a restatement of the fitting assumption and is not yet falsifiable.

Authors: The efficiency statement is framed as conditional on the parameter-to-trajectory map among admissible data-fitting settings, where lower trace complexity (fewer knots) corresponds to fewer semantic degrees of freedom. While the variational origin of this mapping is given, we acknowledge that an explicit quantitative relation between quadratic action or knot count and the number of free parameters is not derived in the current text. We will revise the relevant section to supply a more detailed derivation of this relationship and to indicate how empirical correlations could be tested, thereby making the claim more directly falsifiable.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained with explicit conditioning

The provided abstract defines SemRF via new constructs (anchors, pseudo-inverse tying, Voronoi cells, margin-relaxed tube, canonical trace as min-action path) and states uniqueness and the spline equation as consequences under the stated conditions of restricted bi-invertibility, nonempty tube, and positive quadratic weight. The parameter-efficiency link is explicitly labeled conditional on data-fitting admissible settings and controlled interface error/small projection residual, without reducing any prediction to a fitted quantity by construction or invoking self-citations. No load-bearing step equates an output to its input definition; the claims remain independent of the target results once the assumptions are granted.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The abstract invokes restricted bi-invertibility and controlled interface error as prerequisites for the guarantees; no free parameters or invented entities are explicitly named, but the entire construction rests on the existence of suitable anchors satisfying these conditions.

axioms (2)

domain assumption: Restricted bi-invertibility of the embedding-unembedding interface.
Invoked to obtain stable coordinates and near-identity changes (abstract).
domain assumption: Controlled interface error and small projection residual under explicit tube constraints.
Stated as necessary for all guarantees (final sentence of abstract).

Reference graph

Works this paper leans on

17 extracted references · 13 canonical work pages · 7 internal anchors

[1]

Eliciting Latent Predictions from Transformers with the Tuned Lens

Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112,
[2]

Towards automated circuit discovery for mechanistic interpretability

https://transformer-circuits.pub/2023/monosemantic-features/index.html. Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36:16318–16352,

2023
[3]

Toy Models of Superposition

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1(1):12, 2021a. Nelson Elhage, Christopher Olah, Neel Nanda, et al. A mathematical framework for transformer circuits. Transformer Cir...
[4]

Equivalence of context and parameter updates in modern transformer blocks

Adrian Goldwaser, Michael Munn, Javier Gonzalvo, and Benoit Dherin. Equivalence of context and parameter updates in modern transformer blocks. arXiv preprint arXiv:2511.17864,
[5]

Vocabulary-defined semantics: Latent space clustering for improving in-context learning

Jian Gu, Aldeida Aleti, Chunyang Chen, and Hongyu Zhang. Vocabulary-defined semantics: Latent space clustering for improving in-context learning. arXiv preprint arXiv:2401.16184,
[6]

A semantic-aware layer-freezing approach to computation-efficient fine-tuning of language models

Jian Gu, Aldeida Aleti, Chunyang Chen, and Hongyu Zhang. A semantic-aware layer-freezing approach to computation-efficient fine-tuning of language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8019–8033,

2025
[7]

Training Compute-Optimal Large Language Models

Jian Gu, Aldeida Aleti, Chunyang Chen, and Hongyu Zhang. Rethinking weight tying: Pseudo-inverse tying for stable lm training and updates, 2026a. Jian Gu, Aldeida Aleti, Chunyang Chen, and Hongyu Zhang. Beyond neural incompatibility: Cross-scale knowledge transfer in large language models through latent semantic alignment. In Findings of the Association ...
[8]

The representational geometry of number

Zhimin Hu, Lanhao Niu, and Sashank Varma. The representational geometry of number. arXiv preprint arXiv:2602.06843,
[9]

Semantic tube prediction: Beating llm data efficiency with jepa

Hai Huang, Yann LeCun, and Randall Balestriero. Semantic tube prediction: Beating llm data efficiency with jepa. arXiv preprint arXiv:2602.22617,
[10]

Characterizing stable regions in the residual stream of llms

Jett Janiak, Jacek Karwowski, Chatrik Singh Mangat, Giorgi Giglemiani, Nora Petrova, and Stefan Heimersheim. Characterizing stable regions in the residual stream of llms. arXiv preprint arXiv:2409.17113,
[11]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,
[12]

Residual stream analysis with multi-layer saes

Tim Lawson, Lucy Farnik, Conor Houghton, and Laurence Aitchison. Residual stream analysis with multi-layer saes. arXiv preprint arXiv:2409.04185,
[13]

In-context Learning and Induction Heads

URL https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens. Accessed 2026-02-22. Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895,
[14]

The Linear Representation Hypothesis and the Geometry of Large Language Models

Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658,
[15]

Language models as knowledge bases?

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. Language models as knowledge bases? In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2463–2473,

2019
[16]

Interpretability at scale: Identifying causal mechanisms in alpaca

URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html. Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah Goodman. Interpretability at scale: Identifying causal mechanisms in alpaca. Advances in neural information processing systems, 36:78205–78226,

2024
[17]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405,
