MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models

Xueping Gao

arXiv: 2606.07978 · v1 · pith:HBGOI66Rnew · submitted 2026-06-06 · 💻 cs.CL

Pith reviewed 2026-06-27 20:06 UTC · model grok-4.3

classification 💻 cs.CL

keywords late crystallization · factual knowledge · language models · knowledge interventions · residual stream · layer analysis · hallucination mitigation · model editing

The pith

Factual knowledge in language models crystallizes abruptly in the final layers rather than emerging gradually.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that factual knowledge does not accumulate steadily through a model's layers but instead appears suddenly near the end. Across multiple model families and scales, a substantial share of correct answers stays out of the top predictions until late depths, and this pattern holds specifically for factual recall rather than other tasks like sentiment classification. The timing of crystallization directly influences which editing methods succeed on a given model. Layer normalization is shown to be central to the process, and a distinction emerges between knowledge that can be computed and knowledge that is memorized.

Core claim

Factual knowledge does not gradually emerge across layers but crystallizes abruptly at the final layers. Across five model families, 26.8%–93.4% of correct answers never enter top-10 predictions at any intermediate layer, with late emergence (>80% depth) consistent across architectures. Tuned lens rules out probe artifacts, and the pattern is far stronger for factual questions than for sentiment classification. This leads to a crystallization-guided intervention principle, under which the more effective method depends on how strongly a model crystallizes late, plus a computability-memorization spectrum and the finding that LayerNorm scaling (×1.2) yields +11.8% MC1 with zero inference overhead.

What carries the argument

Late Crystallization, the abrupt surfacing of factual knowledge in the final layers of the residual stream, which determines intervention success and is measured by when correct answers first enter top-10 predictions.
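
The measurement described above can be sketched as follows. This is a minimal illustration and not the MechLens implementation: it assumes per-layer vocabulary logits have already been decoded from the residual stream (for example by a tuned lens), and the names `first_entry_layer`, `relative_depth`, and `layer_logits` are hypothetical.

```python
from typing import Optional, Sequence


def first_entry_layer(
    layer_logits: Sequence[Sequence[float]],
    target_id: int,
    k: int = 10,
) -> Optional[int]:
    """Return the 1-based index of the first layer whose top-k contains target_id.

    layer_logits[i] holds the decoded vocabulary logits at layer i + 1
    (assumed to come from a lens applied to the residual stream).
    Returns None if the target never enters the top-k.
    """
    for layer_index, logits in enumerate(layer_logits, start=1):
        target_score = logits[target_id]
        # Rank = number of tokens scoring strictly higher than the target.
        rank = sum(1 for score in logits if score > target_score)
        if rank < k:
            return layer_index
    return None


def relative_depth(layer: int, num_layers: int) -> float:
    """Depth of a layer as a fraction of the full stack (1.0 = final layer)."""
    return layer / num_layers


# Toy example: 4 layers, vocabulary of 5 tokens, target token id 3, top-2 cutoff.
toy_logits = [
    [2.0, 1.0, 0.5, 0.1, 0.0],
    [2.0, 1.5, 0.5, 0.2, 0.0],
    [1.0, 2.0, 0.5, 0.4, 0.0],
    [0.5, 1.0, 0.2, 3.0, 0.0],
]
entry = first_entry_layer(toy_logits, target_id=3, k=2)
```

In the toy data the target only reaches the top-2 at the last layer, which is the pattern the paper calls late crystallization; a sample whose first entry has relative depth above 0.8 would count as late under the paper's cutoff.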

If this is right

CAA outperforms DoLa on moderate-crystallization models such as Llama and Mistral.
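On high-crystallization models such as Qwen the performance ordering reverses directionally (+25.4% vs. +15.5% MC1), though the difference does not reach significance at the 0.05 level (p=0.069).
Scaling LayerNorm by a factor of 1.2 raises multiple-choice accuracy (+11.8% MC1) with no inference-time cost.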
Computable knowledge crystallizes earlier (around layer 22 in a 28-layer model) than memorized facts (layer 28).
The late-crystallization pattern is specific to factual recall and does not appear in sentiment classification.
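
One plausible reading of the "no inference-time cost" claim for LayerNorm scaling is that the factor is folded into the stored gain weights once, so the forward pass runs the same operations as before. The sketch below illustrates that reading only. The source does not say which normalization layers are scaled or whether the bias is included, so scaling the gain of a single layer and leaving the bias untouched is an assumption, and `layer_norm` and `scale_gain` are hypothetical helpers.

```python
import math
from typing import List, Sequence


def layer_norm(
    x: Sequence[float],
    gain: Sequence[float],
    bias: Sequence[float],
    eps: float = 1e-5,
) -> List[float]:
    """Standard layer normalization over one activation vector."""
    mean = sum(x) / len(x)
    variance = sum((value - mean) ** 2 for value in x) / len(x)
    inv_std = 1.0 / math.sqrt(variance + eps)
    return [g * (value - mean) * inv_std + b for value, g, b in zip(x, gain, bias)]


def scale_gain(gain: Sequence[float], factor: float = 1.2) -> List[float]:
    """Fold a constant factor into the gain vector (assumption: bias is left as is)."""
    return [factor * g for g in gain]


x = [0.5, -1.0, 2.0, 0.0]
gain = [1.0, 0.5, 2.0, 1.5]
bias = [0.1, 0.0, -0.2, 0.3]

base = layer_norm(x, gain, bias)
scaled = layer_norm(x, scale_gain(gain, 1.2), bias)
```

With the factor folded into `gain`, the normalized component of every output is multiplied by 1.2 while the bias term is unchanged, and no extra multiplication is needed at inference.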

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Knowledge editing techniques could gain efficiency by targeting only the layers at which crystallization occurs for each model and task.
Training objectives might be designed to shift more knowledge toward earlier crystallization, potentially improving reliability on factual queries.
The same measurement approach could be applied to non-text modalities to test whether crystallization timing is a general property of transformer residual streams.

Load-bearing premise

That the absence of a correct answer from top-10 predictions at intermediate layers means the knowledge is truly not yet present rather than stored in a form the metric does not capture.

What would settle it

Demonstrating a model family or scale in which the majority of factual answers from MMLU or similar benchmarks enter the top-10 predictions before 50% network depth would falsify the claim of consistent late crystallization.
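
The falsification condition above can be written as a simple check over per-sample first-entry depths. This is a sketch under stated assumptions: depths are fractions of the network (1.0 is the final layer), `None` marks an answer that never enters the top-10, "majority" is read as more than half of all samples, and both function names are hypothetical.

```python
from typing import Optional, Sequence


def share_entering_before(
    first_entry_depths: Sequence[Optional[float]],
    depth_cutoff: float = 0.5,
) -> float:
    """Fraction of samples whose answer enters the top-k before depth_cutoff."""
    if not first_entry_depths:
        raise ValueError("need at least one sample")
    early = sum(
        1 for depth in first_entry_depths
        if depth is not None and depth < depth_cutoff
    )
    return early / len(first_entry_depths)


def late_crystallization_falsified(
    first_entry_depths: Sequence[Optional[float]],
    depth_cutoff: float = 0.5,
) -> bool:
    """True when a majority of answers enter the top-k before the cutoff depth."""
    return share_entering_before(first_entry_depths, depth_cutoff) > 0.5


# Toy data: most answers surface at or near the final layer.
toy_depths = [1.0, 1.0, 0.9, 0.4, None]
early_share = share_entering_before(toy_depths)
falsified = late_crystallization_falsified(toy_depths)
```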

Figures

Figures reproduced from arXiv: 2606.07978 by Xueping Gao.

**Figure 1.** Late Crystallization: Correct answers remain invisible in logit space throughout intermediate layers, then abruptly “crystallize” into top predictions at the final layer.

**Figure 2.** Late Crystallization across architectures. Each row: a TruthfulQA sample; each column: a layer. Color: …

**Figure 3.** FEP depth distribution across architectures (817 samples each). Qwen concentrates at the final layer …

**Figure 4.** Computability–Memorization Spectrum across architectures. Blue: computable knowledge (logical …

Original abstract

Understanding where LLMs store factual knowledge is critical for hallucination mitigation. We systematically quantify Late Crystallization: factual knowledge does not gradually emerge across layers but "crystallizes" abruptly at the final layers. Across five model families (Pythia, Gemma, Qwen2.5, Llama-3.1, Mistral; 0.5–14B), 26.8%–93.4% of correct answers never enter top-10 predictions at any intermediate layer, with late emergence (>80% depth) consistent across architectures. Cross-scale (Qwen2.5-14B) and cross-benchmark (MMLU: 98.2%) results confirm generality; tuned lens rules out probe artifacts. A sentiment-classification control (0.5% for Qwen vs. 85.9% factual; 2.0% for Mistral vs. 26.8%) confirms the phenomenon is specific to factual recall. Late Crystallization yields a crystallization-guided intervention principle: CAA outperforms DoLa on moderate-crystallization models (Llama, Mistral; p<0.001), with a directionally consistent reversal on high-crystallization Qwen (+25.4% vs. +15.5% MC1, p=0.069). LayerNorm ablation shows crystallization is intrinsic to the residual stream; LN scaling (×1.2) yields +11.8% MC1 with zero inference overhead. We further reveal a Computability-Memorization Spectrum: computable knowledge crystallizes earlier (layer 22.1/28) than memorized facts (28.0/28). We release MechLens supporting five model families.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Late crystallization is a measurable pattern in the tuned-lens data across models, but the linear top-10 metric leaves open whether knowledge is truly absent earlier or just encoded differently.

The one thing to know is that factual answers often stay out of tuned-lens top-10 until the final layers, with 26.8-93.4% never appearing earlier, and this split lines up with which intervention (CAA or DoLa) works better on different models. The computability-memorization difference and the simple LN scaling gain are the other concrete observations.

The paper does a solid job running the same check on five families and two scales, adding the sentiment control to show the effect is not generic, and releasing MechLens so others can replicate. The intervention results give a direct use case, and the LN adjustment is low-cost enough to try.

The soft spot is the measurement itself. The claim that knowledge has not crystallized rests on the tuned lens recovering the token if the information is present; if facts sit in non-linear combinations the affine map misses, the late signal is partly an artifact. The paper tests probe choice but not whether the decoder is information-theoretically complete. The 80% depth and top-10 cutoffs are free parameters whose effect on the percentages is not shown, and the abstract gives no error bars or dataset sizes.

This is for interpretability and editing researchers who want layer-wise patterns and a rule of thumb for method choice. It has enough new cross-model counts and a usable downstream claim to deserve referee time, even if the metric interpretation needs tightening.

Referee Report

1 major / 1 minor

Summary. The paper claims that factual knowledge in LLMs does not emerge gradually across layers but crystallizes abruptly in the final layers. Across five model families (0.5-14B), 26.8%-93.4% of correct answers never appear in tuned-lens top-10 predictions at intermediate layers, with >80% depth emergence consistent; this is specific to factual recall (vs. sentiment control) and affects intervention choice (CAA outperforms DoLa on moderate-crystallization models). Additional claims include a computability-memorization spectrum and a LayerNorm ablation showing intrinsic residual-stream effects. MechLens is released.

Significance. If the measurement is valid, the late-crystallization finding supplies a mechanistic account of why certain interventions succeed and offers a practical editing principle (e.g., LN scaling). Cross-model and cross-benchmark consistency plus the public release of MechLens are concrete strengths that would aid reproducibility and follow-up work.

major comments (1)

[Abstract (quantification paragraph) and tuned-lens results] The central claim that non-appearance in top-10 at layers <80% depth means knowledge has not yet crystallized requires that the affine tuned lens recovers the correct token whenever the relevant information is present in the residual stream. The sentiment-classification control and cross-model results address probe choice but do not test whether non-linear feature interactions or directions misaligned with the lens objective could produce an artifactual late signal. A direct test (e.g., whether the lens achieves near-ceiling recovery on a set of facts known to be present early by other probes) is needed to support the interpretation.

minor comments (1)

[Abstract] Abstract omits dataset sizes, number of prompts, error bars, and statistical-test details for the reported percentages and p-values.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting a key assumption in our measurement approach. We address the concern regarding the tuned lens below and outline a planned revision to strengthen the supporting evidence.

Referee: [Abstract (quantification paragraph) and tuned-lens results] The central claim that non-appearance in top-10 at layers <80% depth means knowledge has not yet crystallized requires that the affine tuned lens recovers the correct token whenever the relevant information is present in the residual stream. The sentiment-classification control and cross-model results address probe choice but do not test whether non-linear feature interactions or directions misaligned with the lens objective could produce an artifactual late signal. A direct test (e.g., whether the lens achieves near-ceiling recovery on a set of facts known to be present early by other probes) is needed to support the interpretation.

Authors: We agree that the interpretation of late crystallization rests on the tuned lens being able to recover the correct token when relevant information is present in the residual stream. The tuned lens is an affine map trained to reconstruct the final-layer vocabulary distribution from each intermediate residual stream, which is the established method for this type of analysis. The sentiment-classification control (0.5–2.0% late emergence vs. 26.8–85.9% for facts) and cross-model/cross-scale consistency reduce the chance of a lens-specific artifact, as any systematic misalignment would be expected to appear across tasks. Nevertheless, we acknowledge that this does not directly rule out non-linear interactions or misaligned directions. We will add a direct validation in the revision: on a subset of facts, we will train an alternative linear probe on early-layer activations using ground-truth labels and compare its recovery rate to the tuned lens, reporting whether the lens achieves near-ceiling performance on facts independently shown to be present early.

revision: yes
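
The comparison step of the validation proposed in this exchange can be sketched as a conditional recovery rate. This is an illustration only: it assumes the probe and the lens have already been evaluated on the same facts at the same early layer, producing one boolean per fact, and `lens_recovery_on_probe_hits` is a hypothetical name. How the probe is trained is outside the sketch.

```python
from typing import Sequence


def lens_recovery_on_probe_hits(
    probe_hits: Sequence[bool],
    lens_hits: Sequence[bool],
) -> float:
    """Among facts an independent probe recovers early, the share the lens also recovers."""
    if len(probe_hits) != len(lens_hits):
        raise ValueError("probe_hits and lens_hits must have the same length")
    # Keep only the lens outcomes for facts the probe shows to be present.
    paired = [lens for probe, lens in zip(probe_hits, lens_hits) if probe]
    if not paired:
        raise ValueError("the probe recovered no facts, so the check is undefined")
    return sum(paired) / len(paired)


# Toy data: one entry per fact, all measured at the same early layer.
probe_hits = [True, True, True, False, True]
lens_hits = [True, False, True, False, True]
recovery = lens_recovery_on_probe_hits(probe_hits, lens_hits)
```

A value close to 1.0 would support reading a lens miss as absence of the fact; a low value would suggest the lens overlooks information that another linear readout can find.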

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical counts of top-10 emergence

The paper quantifies late crystallization via direct layer-wise counts of whether correct answers enter the top-10 under tuned-lens decoding, with no equations, fitted parameters, or self-citations that reduce the reported percentages or intervention comparisons to the inputs by construction. The tuned-lens usage is presented as an external control for probe choice rather than a self-referential definition, and the crystallization-guided intervention results are compared against independently measured crystallization depths rather than being forced by them. This is a standard empirical measurement pipeline with no load-bearing self-definition or renaming of known results.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The top-10 and 80%-depth thresholds function as implicit measurement choices whose justification is not supplied.

free parameters (2)

top-10 prediction threshold: used to decide whether an answer has 'entered' at a layer.
>80% depth cutoff: defines 'late emergence'.
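
Because both thresholds are free choices, a sensitivity sweep would show how much the reported percentages depend on them. The sketch below is one way to run such a sweep, not something the paper reports: it assumes a per-sample list of 0-based ranks of the correct token at each layer, counts a sample as late when its first entry is strictly deeper than the cutoff, and treats samples that never enter as not late. All names are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple


def first_entry_depth(ranks: Sequence[int], k: int) -> Optional[float]:
    """Relative depth (1.0 = final layer) at which the rank first drops below k."""
    num_layers = len(ranks)
    for layer_index, rank in enumerate(ranks, start=1):
        if rank < k:
            return layer_index / num_layers
    return None


def late_share(all_ranks: Sequence[Sequence[int]], k: int, depth_cutoff: float) -> float:
    """Share of samples whose first top-k entry lies strictly beyond depth_cutoff."""
    late = 0
    for ranks in all_ranks:
        depth = first_entry_depth(ranks, k)
        if depth is not None and depth > depth_cutoff:
            late += 1
    return late / len(all_ranks)


def sweep(
    all_ranks: Sequence[Sequence[int]],
    ks: Sequence[int],
    cutoffs: Sequence[float],
) -> Dict[Tuple[int, float], float]:
    """Late share for every combination of top-k threshold and depth cutoff."""
    return {(k, c): late_share(all_ranks, k, c) for k in ks for c in cutoffs}


# Toy data: three samples, five layers each.
toy_ranks = [
    [50, 40, 30, 12, 0],
    [30, 8, 3, 1, 0],
    [100, 90, 15, 9, 0],
]
results = sweep(toy_ranks, ks=[10, 20], cutoffs=[0.5, 0.8])
```

In the toy data the late share at the 0.8 cutoff falls from one in three samples to zero when the threshold moves from top-10 to top-20, which is the kind of dependence the ledger flags as unreported.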

axioms (1)

domain assumption: Tuned lens suffices to rule out probe artifacts. Invoked to validate that late emergence is not an artifact of the measurement method.

