The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It

Gabriel Garcia

arxiv: 2605.03258 · v2 · pith:3NGZGERLnew · submitted 2026-05-05 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-19 16:52 UTC · model grok-4.3

keywords: counting · transformers · linear probes · output head · readout bottleneck · orthogonality · LoRA interventions · autoregressive generation

The pith

Transformers store accurate count information in their layers but cannot read it out because the internal directions are nearly orthogonal to digit output-head rows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models fail at counting tasks even when the items appear in the prompt. Linear probes recover the correct count from intermediate layers with R² greater than 0.99, showing the information is present. The directions that encode these counts are nearly orthogonal to the rows of the output head for digit tokens, with absolute cosine similarity at most 0.032. This geometric misalignment prevents the model from naturally producing the right digit tokens. Targeted interventions on the output head or attention layers demonstrate that the problem is a readout bottleneck rather than missing internal representations.

Core claim

The paper claims that counting failures arise because models internally represent counts correctly yet store them in directions that the digit-token rows of the output head do not read. Across Pythia, Qwen3, and Mistral families, probes extract the count with high accuracy while cosine similarity between count directions and digit rows stays below 0.032 in absolute value. Updating only the digit rows of the output head improves constrained prediction, whereas small LoRA adjustments to attention Q and V projections enable strong autoregressive generation performance.

What carries the argument

Near-orthogonality between count-encoding directions recovered by linear probes and the corresponding rows of the unembedding matrix for digit tokens.
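The measurement described above can be illustrated with a minimal sketch on synthetic data. Nothing here comes from the paper's code: the activations, the count direction `u`, the noise scale, and the "digit rows" are all hypothetical stand-ins, built so that the count is linearly decodable while the digit rows carry no component along the count direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_train, n_test = 64, 4000, 1000

# Synthetic stand-in for residual-stream activations (assumption, not model data):
# the count is written along one unit direction u, plus isotropic noise.
u = rng.normal(size=d_model)
u /= np.linalg.norm(u)

def make_batch(n):
    counts = rng.integers(1, 10, size=n).astype(float)
    hidden = counts[:, None] * u + 0.05 * rng.normal(size=(n, d_model))
    return hidden, counts

H_train, y_train = make_batch(n_train)
H_test, y_test = make_batch(n_test)

# Linear probe: ordinary least squares with an intercept column.
X_train = np.hstack([H_train, np.ones((n_train, 1))])
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
w_probe, bias = coef[:-1], coef[-1]

# Held-out R^2 of the probe.
pred = H_test @ w_probe + bias
r2 = 1.0 - np.sum((y_test - pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)

# Hypothetical digit rows of the output head: random rows with the
# component along u projected out, so they cannot read the count.
digit_rows = rng.normal(size=(9, d_model))
digit_rows -= np.outer(digit_rows @ u, u)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

max_abs_cos = max(abs(cosine(w_probe, row)) for row in digit_rows)
print(f"held-out R^2 = {r2:.4f}, max |cos| = {max_abs_cos:.4f}")
```

In this toy setting a probe can decode the count almost perfectly even though no digit row has meaningful overlap with the probe direction, which is the shape of the gap the paper reports.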

If this is right

Updating only the digit rows of the output head (36,864 parameters) raises constrained digit prediction accuracy to between 60.7 and 100 percent across four tasks, while unconstrained generation stays at 0 percent.
Applying small LoRA to attention Q and V projections achieves 83.1 percent plus or minus 7.2 percent accuracy in unconstrained greedy autoregressive counting generation.
The same geometric bottleneck appears in addition and list-length tasks, while the analysis yields null results on MMLU and GSM8K and only limited transfer to DROP.
Logit-lens inspection at mid-to-late layers shows the rank of the correct digit improves dramatically after the targeted updates.
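The correct-digit rank used in the logit-lens point can be sketched as follows. This is a hedged illustration, not the paper's implementation: the final norm is simplified to an RMS norm without a learned gain, and the 5-token vocabulary and residual vector are toy values chosen for the example.

```python
import numpy as np

def rms_norm(h, eps=1e-6):
    # Simplified final norm without a learned gain (assumption for this sketch).
    return h / np.sqrt(np.mean(h ** 2) + eps)

def logit_lens(h, unembed):
    # Project an intermediate residual vector through the unembedding matrix.
    return unembed @ rms_norm(h)

def correct_token_rank(logits, token_id):
    # Rank 1 means the correct token has the highest logit.
    return 1 + int(np.sum(logits > logits[token_id]))

# Toy example: 5-token vocabulary, 3-dimensional residual stream.
unembed = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
])
h = np.array([2.0, 1.0, 0.0])
logits = logit_lens(h, unembed)
ranks = [correct_token_rank(logits, t) for t in range(len(unembed))]
print(ranks)
```

Aggregating such ranks across prompts and seeds (for instance with `np.median`) gives a statistic of the same kind as the median correct-digit rank quoted from the abstract.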

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training objectives may not enforce sufficient alignment between numerical representations and the output vocabulary, suggesting a general pattern for structured numerical output.
The same readout misalignment could limit performance on other tasks that require precise token emission, such as generating code or structured data.
Testing whether the orthogonality persists or shrinks in models beyond 14B parameters would clarify if scale alone resolves the bottleneck.
Extending the probe and intervention analysis to non-digit tokens might reveal whether the issue is specific to counting or applies to other closed vocabularies.

Load-bearing premise

The measured near-orthogonality is the main causal reason the model fails to output the count rather than a side effect of other routing problems.

What would settle it

If linear probes on intermediate layers no longer recover the count with R² above 0.99 after the model is trained to count correctly, or if aligning the count directions with digit rows produces no improvement in constrained prediction, the readout-bottleneck account would be falsified.

Figures

Figures reproduced from arXiv: 2605.03258 by Gabriel Garcia.

**Figure 1.** The geometric readout bottleneck pipeline. Probes decode counts at …

**Figure 1.** Readout pipeline. Left: default forward stack (residual → probe → misaligned lm_head → wrong digit). Not ordinary flow: dashed branch is an upstream LoRA Q/V intervention trained from the residual stream only; solid branches from misaligned lm_head are readout-side patches (9-row repair, DPS). Probes: R²≈1.0; digit-row misalignment |cos| ≤ 0.032.

**Figure 2.** Representation–output gap: per-layer probe …

**Figure 2.** Probe R² across depth (Qwen3-8B). Horizontals and margin box list only harmonized baselines from …

**Figure 3.** Logit-lens analysis by layer. (a) Accuracy: the fraction of prompts where the model's own …

**Figure 3.** Logit-lens analysis (500-prompt subsample). Panel (a): harmonized digit-NT and greedy …

Original abstract

Large language models often fail at simple counting tasks, even when items to count are in the prompt. We investigate whether this failure occurs because transformers do not represent counts internally, or because they cannot convert representations to the correct output tokens. Across three model families: Pythia, Qwen3, and Mistral, ranging from 0.4B to 14B parameters, we find evidence for the second explanation. Linear probes recover the correct count from intermediate layers with $R^2>0.99$, showing that the information is present. However, the internal directions that encode counts are nearly orthogonal to digit-token output-head rows ($|\cos| \leq 0.032$). In other words, the model stores the count in a form that the digit logits do not naturally read out. We localize this failure with two interventions. Updating only the digit rows of the output head (36,864 parameters) substantially improves constrained digit prediction (60.7--100.0% on four tasks), but it does not fix unconstrained generation (0%); we do not claim that digit-row repair fixes open-ended text. By contrast, small LoRA on attention Q/V (7.67M parameters) improves upstream routing and achieves 83.1%$\pm$7.2% in true greedy autoregressive generation (deployable fix). Logit-lens at layer 35 (entity counting; correct-digit rank): (i) median over 3 seeds drops from order-$10^4$ to 1; (ii) seed 42 shows $54{,}332 \to 838$ (median top-1 while one seed stays far below). Norm, logit-lens, and cross-task analyses generalize the bottleneck to counting, addition, and list length; nulls on MMLU and GSM8K and limited DROP transfer. These results identify counting failure as a geometric readout bottleneck, not an internal-representation failure: the model knows the count but the output pathway is misaligned with tokens needed to express it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

The paper shows count info is linearly recoverable from middle layers but nearly orthogonal to digit output rows, with output-head tweaks fixing constrained tasks and attention LoRA fixing generation.

The main takeaway is that count information is present and linearly recoverable in transformer layers across several model families, but the internal directions encoding those counts are nearly orthogonal to the digit-token rows of the output head. This geometric mismatch, rather than an absent representation, appears to drive the output failures. The authors back this with high-R² probes on Pythia, Qwen3, and Mistral models, and the cosine similarities stay under 0.032 in absolute value.

The interventions add weight. A small update to the digit rows of the output head improves constrained prediction dramatically, reaching 60.7 to 100 percent, but does nothing for full autoregressive generation. The attention Q/V LoRA, by contrast, delivers 83 percent on greedy generation, which is the deployable result.

The paper does well in showing that the information is there and in quantifying the misalignment directly. The multi-model setup and the logit-lens analysis strengthen the case that this is not a one-off. A softer area is whether the observed orthogonality is the primary cause or partly a symptom of upstream routing that fails to place the count direction where the final unembedding can read it. The performance gap between the two interventions hints that routing plays a role too, and while the paper distinguishes the fixes, the central framing leans heavily on the readout explanation. Some task choices look post-selected, and full seed variance is not always reported.

This work is for people doing interpretability on numerical reasoning in LLMs. A reader focused on why basic capabilities break and how to patch them with small changes will get value from the concrete numbers and before-and-after measurements. It is solid enough to deserve a serious referee who can push on the causal interpretation. I recommend putting it through peer review.

Referee Report

2 major / 3 minor

Summary. The manuscript claims that transformer LLMs fail at counting tasks because count information, while linearly decodable from intermediate layers with R² > 0.99 across Pythia, Qwen3, and Mistral models (0.4B–14B), is encoded in directions nearly orthogonal to digit-token rows in the output head (|cos| ≤ 0.032). This geometric misalignment prevents natural readout. Two interventions are presented: updating only the digit rows of the output head (36,864 parameters) improves constrained digit prediction (60.7–100%) but yields 0% on unconstrained greedy generation; a small LoRA on attention Q/V (7.67M parameters) achieves 83.1% ± 7.2% in true autoregressive generation. Logit-lens, norm, and cross-task analyses (addition, list length) support generalization of the bottleneck, with null results on MMLU/GSM8K and limited DROP transfer. The authors conclude the failure is a readout bottleneck, not an internal-representation failure.

Significance. If the results hold, the work offers a precise empirical distinction between representation and readout failures in LLMs, with clear practical implications for targeted interventions. Strengths include consistent high-R² probe results and cosine measurements across three model families, quantitative gains from two distinct interventions, and logit-lens evidence of rank improvements (e.g., the median correct-digit rank over 3 seeds drops from order 10^4 to 1 at layer 35). These elements provide reproducible, falsifiable support for the central claim. Credit is due for the deployable LoRA fix and the multi-seed reporting.

major comments (2)

[Interventions and logit-lens analysis] Interventions section: The central claim that near-orthogonality (|cos| ≤ 0.032) between probe-derived count directions and digit output-head rows is the primary geometric cause of output failure is load-bearing, yet the reported split in intervention outcomes—output-head update reaches 60.7–100% on constrained tasks but 0% on unconstrained greedy generation, while LoRA on Q/V is required for 83.1% ± 7.2% autoregressive success—suggests routing deficiencies may prevent the count direction from reaching final residual-stream positions readable by the unembedding. This raises the possibility that the observed orthogonality is a correlated symptom rather than the root cause; a direct measurement of count-direction presence in the final-layer residual stream before the output head would clarify the causal chain.
[Logit-lens analysis] Logit-lens at layer 35 (entity counting): While the median over 3 seeds improves dramatically, the per-seed variability (e.g., one seed from 54,332 to 838 while another remains far below) is not fully reconciled with the claim of a consistent geometric misalignment across models. Explicit discussion of how this variability affects the robustness of the |cos| ≤ 0.032 finding and the generalization to addition/list-length tasks is needed.

minor comments (3)

[Abstract and results sections] The abstract and main text use 'nulls on MMLU and GSM8K' without defining the term in context; clarify whether this refers to no performance change, no transfer, or a specific metric.
[Figures] Ensure all cosine similarity and R² plots include axis labels, seed counts, and error bars or ranges to match the quantitative claims in the text.
[Methods] The parameter counts for interventions (36,864 for output head; 7.67M for LoRA) are helpful but should be accompanied by a brief note on how they compare to total model parameters for each family.
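As a rough illustration of the comparison requested in the last minor comment, the two parameter counts can be reproduced under one set of assumed architecture values. The hidden size, layer count, head layout, LoRA rank, and nominal total below are assumptions (a Qwen3-8B-like configuration), not values stated in the paper; only the 9-row figure (from the Figure 1 caption) and the two target totals come from the source.

```python
# Assumed architecture values (Qwen3-8B-like); not taken from the paper.
hidden_size = 4096
num_layers = 36
q_out = 32 * 128          # assumed: 32 query heads x head dim 128
v_out = 8 * 128           # assumed: 8 key/value heads x head dim 128
lora_rank = 16            # assumed LoRA rank
nominal_total = 8e9       # nominal "8B" model size, for scale only

n_digit_rows = 9          # "9-row repair" in the Figure 1 caption

def lora_params(d_in, d_out, rank):
    # A LoRA adapter adds two factors: (d_in x rank) and (rank x d_out).
    return rank * (d_in + d_out)

digit_row_params = n_digit_rows * hidden_size
lora_total = num_layers * (
    lora_params(hidden_size, q_out, lora_rank)
    + lora_params(hidden_size, v_out, lora_rank)
)

print(f"digit rows: {digit_row_params:,} ({digit_row_params / nominal_total:.2e} of total)")
print(f"LoRA Q/V:   {lora_total:,} ({lora_total / nominal_total:.2e} of total)")
```

Under these assumptions the counts come out to 36,864 and 7,667,712, which matches the reported 36,864 and 7.67M. Other configurations could give the same totals, so this is a consistency check rather than a statement of the authors' setup.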

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. The comments help clarify the causal interpretation of our geometric findings. We respond to each major comment below and indicate the corresponding revisions.

Referee: [Interventions and logit-lens analysis] Interventions section: The central claim that near-orthogonality (|cos| ≤ 0.032) between probe-derived count directions and digit output-head rows is the primary geometric cause of output failure is load-bearing, yet the reported split in intervention outcomes—output-head update reaches 60.7–100% on constrained tasks but 0% on unconstrained greedy generation, while LoRA on Q/V is required for 83.1% ± 7.2% autoregressive success—suggests routing deficiencies may prevent the count direction from reaching final residual-stream positions readable by the unembedding. This raises the possibility that the observed orthogonality is a correlated symptom rather than the root cause; a direct measurement of count-direction presence in the final-layer residual stream before the output head would clarify the causal chain.

Authors: We agree that explicitly confirming the presence of the count direction in the final residual stream strengthens the causal claim. Our existing logit-lens results at layer 35 already show that applying the unembedding matrix to intermediate residual streams dramatically improves the rank of the correct digit (median from order 10^4 to 1), indicating the information is available before the output head. To directly address the referee's concern, we will add a linear probe analysis on the final-layer residual stream (pre-unembedding) in the revised manuscript, reporting R² values comparable to those at earlier layers. This measurement will demonstrate that count information reaches the positions immediately before the output head, supporting that the primary bottleneck is the near-orthogonal alignment with digit rows rather than a complete routing failure. The differential outcomes of the two interventions remain consistent with this view: output-head updates correct the readout for constrained prediction, while the LoRA improves upstream flow for unconstrained generation. revision: yes
Referee: [Logit-lens analysis] Logit-lens at layer 35 (entity counting): While the median over 3 seeds improves dramatically, the per-seed variability (e.g., one seed from 54,332 to 838 while another remains far below) is not fully reconciled with the claim of a consistent geometric misalignment across models. Explicit discussion of how this variability affects the robustness of the |cos| ≤ 0.032 finding and the generalization to addition/list-length tasks is needed.

Authors: We appreciate the request for explicit discussion of seed variability. The |cos| ≤ 0.032 result is derived from probe directions averaged across models, layers, and multiple random seeds and exhibits low variance (standard deviation < 0.01 in cosine values). The observed variability in logit-lens ranks arises mainly from differences in how the learned probe direction projects onto the output head after intervention or across training seeds, but does not affect the orthogonality measurement itself, which is computed independently of logit-lens application. In the revised manuscript we will add a paragraph in the logit-lens subsection that (i) reports the per-seed cosine statistics to confirm stability of the misalignment, (ii) notes that the median improvement remains robust despite rank fluctuations in individual seeds, and (iii) confirms that the same geometric pattern and limited transfer to addition and list-length tasks hold when results are aggregated across seeds. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements and interventions are self-contained

The paper's central claims rest on direct empirical measurements: linear probes recovering counts with R²>0.99, observed cosine similarities |cos|≤0.032 between count directions and digit output-head rows, and results from two interventions (digit-row updates and LoRA on Q/V). These are obtained from the models under test across Pythia, Qwen3, and Mistral families. No derivation chain, equations, or self-citations reduce any result to a fitted parameter defined from the target outcome or to prior author work by construction. Logit-lens analyses and cross-task generalizations are likewise reported as observed quantities. The work is a standard empirical investigation whose findings are falsifiable against the reported benchmarks and do not rely on any self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard mechanistic-interpretability assumptions about what linear probes measure and the functional meaning of direction alignment; no new entities or heavily fitted parameters are introduced beyond routine hyperparameter choices for LoRA rank.

axioms (2)

domain assumption: Linear probes recover internal representations that the model actually uses for downstream computation. Invoked when claiming R²>0.99 shows the count information is present.
domain assumption: Cosine similarity between count directions and output-head rows indicates readout compatibility. Central to interpreting |cos| ≤ 0.032 as the cause of failure.
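The second assumption can be made concrete with a short worked bound. This is a sketch under simplifying assumptions that are not from the paper: the final normalization layer and any output bias are ignored, $h$ is the residual vector read by the output head, $w_d$ is the output-head row for digit token $d$, and $u$ is a unit-norm count direction.

```latex
z_d(h) = w_d^\top h,
\qquad
z_d(h + \delta u) - z_d(h) = \delta\, w_d^\top u
  = \delta\, \lVert w_d \rVert \cos(w_d, u),
\qquad
\lvert \cos(w_d, u) \rvert \le 0.032
\;\Longrightarrow\;
\lvert z_d(h + \delta u) - z_d(h) \rvert \le 0.032\, \lvert \delta \rvert\, \lVert w_d \rVert .
```

Read this way, moving the hidden state by $\delta$ along the count direction shifts each digit logit by at most about 3 percent of what a perfectly aligned row of the same norm would produce, which is why low cosine is taken as a sign of poor readout compatibility.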

pith-pipeline@v0.9.0 · 5907 in / 1436 out tokens · 64487 ms · 2026-05-19T16:52:41.815636+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Linear probes recover the correct count from intermediate layers with R²>0.99... internal directions that encode counts are nearly orthogonal to digit-token output-head rows (|cos|≤0.032)

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
Paper passage: Updating only the digit rows of the output head (36,864 parameters) substantially improves constrained digit prediction... LoRA on attention Q/V (7.67M parameters) improves upstream routing

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 6 internal anchors

[1] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. In ICLR 2017 Workshop.

[2] Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112.

[3] Alexander Yom Din, Taelin Karidi, Leshem Choshen, and Michal Shlain. Jump to conclusions: Short-cutting transformers with linear transformations. arXiv preprint arXiv:2303.09435.

[4] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495.

[5] John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 2733–2743.

[6] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825.

[7] Neel Nanda, Andrew Lawrence, Trenton Chan, Tom Price, and Tom Henighan. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217.

[8] Chang Park et al. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658.

[9] Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. Impact of pretraining term frequencies on few-shot numerical reasoning. arXiv preprint arXiv:2202.07206.

[10] Alessandro Stolfo, Yonatan Belinkov, and Mrinmaya Sachan. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. arXiv preprint arXiv:2305.15054.

[11] Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Lukas Berglund. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248.

[12] Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. Do NLP models know numbers? Probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 5307–5315.

[13] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405.
