The Generalization Ridge: Information Flow in Natural Language Generation

Chunyuan Deng; Hanjie Chen; Ruidi Chang

arxiv: 2507.05387 · v5 · submitted 2025-07-07 · 💻 cs.CL

The Generalization Ridge: Information Flow in Natural Language Generation

Ruidi Chang , Chunyuan Deng , Hanjie Chen This is my paper

Pith reviewed 2026-05-19 05:26 UTC · model grok-4.3

classification 💻 cs.CL

keywords generalization ridgeinformation flowtransformer layersmutual informationnatural language generationmemorizationpredictive informationlayer-wise analysis

0 comments

The pith

Transformer language models exhibit a generalization ridge where predictive information peaks in intermediate layers before declining as they shift toward memorization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InfoRidge to track how the mutual information between hidden layer representations and target outputs changes across depth in transformer models during training for natural language generation. It reports that this quantity rises to a peak in middle layers and then falls in later layers, forming what the authors call a generalization ridge. The pattern appears consistently across models and datasets and is interpreted as a transition point where layers move from capturing generalizable patterns to fitting training specifics more closely. Complementary checks using residual scaling, attention patterns, and multi-token decoding support the same trend. A sympathetic reader would care because this layer-wise view could clarify why intermediate representations often generalize better than final ones and point to practical levers for controlling generalization behavior.

Core claim

Predictive information, measured as mutual information between hidden representations and target outputs, follows a non-monotonic trajectory across transformer layers: it increases through early and middle layers to form a generalization ridge and then decreases in the final layers, marking a transition from generalization to memorization.

What carries the argument

InfoRidge, the information-theoretic framework that computes and tracks mutual information between each layer's hidden states and the target outputs across training steps.

If this is right

Intermediate layers carry the bulk of the task-relevant information that supports generalization.
Final layers increasingly encode training-set specifics at the expense of broader patterns.
The ridge pattern persists through multiple decoding steps in generation tasks.
Attention patterns and residual connections can be used to characterize the functional specialization of layers around the ridge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the ridge location can be predicted from model size or task type, it may become possible to intervene at specific depths to favor generalization over memorization.
The same non-monotonic information flow might appear in non-transformer architectures, offering a broader test of whether depth-induced specialization is architecture-independent.
Training procedures that explicitly preserve or amplify the ridge region could improve out-of-distribution performance without changing model size.

Load-bearing premise

The estimated mutual information between hidden representations and target outputs serves as a reliable proxy for generalization ability rather than being dominated by estimation bias or model-specific artifacts.

What would settle it

Re-running the mutual-information measurements with an alternative, less biased estimator or on a new dataset where generalization performance on held-out examples fails to correlate with the location of the observed peak would falsify the claim that the ridge reflects a genuine generalization-memorization transition.

Figures

Figures reproduced from arXiv: 2507.05387 by Chunyuan Deng, Hanjie Chen, Ruidi Chang.

**Figure 2.** Figure 2: Evolution of predictive information I(Zℓ; Y ), with lighter curves indicating later epochs. Each curve exhibits a three-phase trend: early layers rise, mid layers peak, and late layers decline [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Truncating GPT–2 to 8 layers removes the MI peak; 9–layer variants begin to exhibit a [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: I(Zℓ; Y ) rises in final layers during overfitting (LLaMA, ECQA). Discussion on Overfitting Scenario To probe overfitting dynamics, we intentionally finetuned the model beyond the optimal point. In [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Middle transformer blocks are key to encoding generalizable task-relevant information (GPT-2 Small, Synthetic). To understand how information accumulates across the network, we compute Incremental Information Gain (I(∆Zℓ; Y ))—the mutual information between each residual transition and the target label embedding [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Residual scaling coefficients βℓ for Qwen-2.5-0.5B and LLaMA-3.1-8B. A similar pattern consistently holds across architectures, with βfinal layer decreasing under OOD training. 8 [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Attention map across layers (GPT-2 Small, Synthetic). At the generalization ridge layers, [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Predictive information I(Z; Y ) across different models and datasets exhibits an information peak, indicating a generalization ridge. In cases where the task is too simple relative to model capacity—such as the synthetic arithmetic task with LLaMA—this trend reflects an overfitting regime. Lighter line colors represent later training epochs. Each curve shows the mean across five random seeds (0, 1, 2, 3, 4… view at source ↗

**Figure 9.** Figure 9: Incremental information gain I(∆Z; Y ) across different models and datasets with ~96% CI error bars. Across all models, we observe that the largest incremental information gain consistently occurs in intermediate layers—further supporting the emergence of a generalization ridge. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: Residual scaling coefficients βℓ across all transformer layers. ID training emphasizes later layers, while OOD training shifts weight toward middle layers, aligning with the generalization ridge observed via InfoRidge. Each curve shows the mean across five random seeds (0, 1, 2, 3, 42), and the shaded region denotes 1-sigma error bar. F Incremental Information Gain Case Study To further illustrate how int… view at source ↗

read the original abstract

Transformer-based language models have achieved state-of-the-art performance in natural language generation (NLG), yet their internal mechanisms for synthesizing task-relevant information remain insufficiently understood. While prior studies suggest that intermediate layers often yield more generalizable representations than final layers, how this generalization ability emerges and propagates across layers during training remains unclear. We propose InfoRidge, an information-theoretic framework, to characterize how predictive information-the mutual information between hidden representations and target outputs-varies across depth during training. Our experiments across various models and datasets reveal a consistent non-monotonic trend: predictive information peaks in intermediate layers-forming a generalization ridge-before declining in final layers, reflecting a transition between generalization and memorization. To further investigate this phenomenon, we conduct a set of complementary analyses that leverage residual scaling and attention pattern to characterize layer-wise functional specialization. We further validate our findings with multiple-token generation experiments, verifying that the observed ridge phenomenon persists across decoding steps. Together, these findings offer new insights into the internal mechanisms of transformers and underscore the critical role of intermediate layers in supporting generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper spots a non-monotonic peak in estimated predictive mutual information in middle transformer layers but the MI numbers may be driven by layer-wise changes in representation statistics rather than real generalization.

read the letter

The main observation is a consistent ridge where mutual information between hidden states and targets rises to a peak in intermediate layers then falls off, which the authors tie to a generalization-to-memorization shift. They introduce the InfoRidge framing and show the pattern across several models and datasets, plus some extra checks using residual scaling, attention maps, and multi-token decoding steps. That consistency and the follow-up probes are the parts that actually add something concrete to the mechanistic interpretability literature on layer roles. The work is not just re-stating prior hints about middle layers; it tries to quantify the information flow during training in a systematic way. The soft spot is exactly the one the stress-test flags. Mutual information estimators are known to be sensitive to shifts in hidden-state norms, effective dimension, and entropy, all of which change with depth in transformers. Without seeing the specific estimator, any bias corrections, or ablations that hold those factors fixed, the ridge could be an artifact of the measurement rather than evidence about what the layers are actually doing. The abstract and the reported experiments do not appear to close that gap, so the interpretation stays provisional. This is the sort of paper that would interest people already working on layer-wise analysis or training dynamics. A reader who wants new quantities to plot or new hypotheses to test could pull useful ideas from it, even if the current evidence is not yet tight. It deserves peer review because the pattern is worth checking with stronger controls on the estimator; referees could push for the missing diagnostics and either confirm or correct the claim.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the InfoRidge framework to analyze predictive mutual information I(h_l; target) between hidden representations and outputs in transformer-based NLG models. Experiments across models and datasets report a consistent non-monotonic trend in which this information peaks at intermediate layers (the generalization ridge) before declining in final layers, interpreted as a generalization-to-memorization transition. Complementary analyses using residual scaling and attention patterns characterize layer-wise specialization, and the ridge is shown to persist in multi-token generation experiments.

Significance. If the reported ridge reflects genuine information content rather than estimator artifacts, the work would provide useful empirical insights into depth-dependent information flow in transformers and the special role of intermediate layers for generalization. The multi-model and multi-dataset consistency, together with the residual-scaling and attention analyses plus the multi-token validation, constitute a reasonably broad empirical base that could inform architecture and training choices focused on mid-depth representations.

major comments (2)

[Abstract and §3] Abstract and §3 (InfoRidge framework): the central claim that the observed non-monotonic curve reflects a genuine generalization-to-memorization transition rests on the assumption that the chosen mutual-information estimator faithfully tracks predictive information content. No description is given of the estimator (MINE, InfoNCE, or other), its hyperparameters, or any controls for layer-dependent changes in hidden-state norm, effective dimensionality, or entropy. Because these statistics are known to vary systematically with depth in transformers and to bias standard neural MI estimators, the ridge could be an artifact of the measurement procedure; this issue is load-bearing for the headline result.
[§5] §5 (complementary analyses): the residual-scaling and attention-pattern experiments are presented as supporting evidence for layer-wise functional specialization, yet the manuscript reports neither quantitative effect sizes nor statistical significance tests for the differences across layers. Without these, it is unclear how strongly the auxiliary analyses corroborate the ridge finding or rule out alternative explanations.

minor comments (2)

[Introduction] The notation I(h_l ; target) is introduced without an explicit definition or reference to the precise conditioning (e.g., whether targets are the next token or the full sequence).
[Figures] Figure captions should state the number of runs or seeds used to generate error bars or shaded regions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (InfoRidge framework): the central claim that the observed non-monotonic curve reflects a genuine generalization-to-memorization transition rests on the assumption that the chosen mutual-information estimator faithfully tracks predictive information content. No description is given of the estimator (MINE, InfoNCE, or other), its hyperparameters, or any controls for layer-dependent changes in hidden-state norm, effective dimensionality, or entropy. Because these statistics are known to vary systematically with depth in transformers and to bias standard neural MI estimators, the ridge could be an artifact of the measurement procedure; this issue is load-bearing for the headline result.

Authors: We agree that the current manuscript lacks a sufficient description of the mutual information estimator and does not explicitly address potential layer-dependent biases in hidden-state statistics. This is a valid concern given the load-bearing nature of the result. In the revised version we will expand §3 with a full specification of the estimator (including method, hyperparameters, and training procedure), normalization steps applied to hidden states, separate entropy estimates, and controls for effective dimensionality. We will also report results from at least one alternative estimator to assess robustness. These additions will allow readers to evaluate whether the ridge is likely an artifact. revision: yes
Referee: [§5] §5 (complementary analyses): the residual-scaling and attention-pattern experiments are presented as supporting evidence for layer-wise functional specialization, yet the manuscript reports neither quantitative effect sizes nor statistical significance tests for the differences across layers. Without these, it is unclear how strongly the auxiliary analyses corroborate the ridge finding or rule out alternative explanations.

Authors: We accept the referee's observation that the complementary analyses in §5 currently lack quantitative effect sizes and statistical significance tests. In the revision we will augment the section with effect-size measures (e.g., Cohen's d) and appropriate significance tests (paired t-tests or non-parametric equivalents) for the reported differences in residual scaling and attention patterns across layers. These statistics will be added both to the text and to the relevant figures. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the reported generalization ridge

full rationale

The paper defines InfoRidge as an information-theoretic framework for tracking mutual information I(h_l; target) across transformer layers and reports an empirical non-monotonic peak from experiments on multiple models and datasets. No equations, fitted parameters, or self-citations are shown that would make the ridge a direct algebraic consequence of the measurement procedure or prior author work. The central observation is presented as an experimental finding rather than a derivation that reduces to its own inputs by construction, rendering the analysis self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; full experimental details and any parameter choices are unavailable.

axioms (1)

domain assumption Mutual information between hidden representations and target outputs can be meaningfully estimated from finite samples
Central to the proposed InfoRidge framework

invented entities (1)

InfoRidge framework no independent evidence
purpose: Characterize layer-wise predictive information during training
Newly proposed in the paper

pith-pipeline@v0.9.0 · 5712 in / 1158 out tokens · 38180 ms · 2026-05-19T05:26:06.645860+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

predictive information peaks in intermediate layers—forming a generalization ridge—before declining in final layers

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 5 internal anchors

[1]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017
[2]

A survey of natural language generation

Chenhe Dong, Yinghui Li, Haifan Gong, Miaoxin Chen, Junxin Li, Ying Shen, and Min Yang. A survey of natural language generation. ACM Computing Surveys, 55(8):1–38, 2022

work page 2022
[3]

Linguistic knowledge and transferability of contextual representations

Nelson F Liu, Matt Gardner, Yonatan Belinkov, Matthew E Peters, and Noah A Smith. Linguistic knowledge and transferability of contextual representations. In Proceedings of NAACL-HLT, pages 1073–1094, 2019

work page 2019
[4]

The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives

Elena V oita, Rico Sennrich, and Ivan Titov. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. arXiv preprint arXiv:1909.01380, 2019

work page arXiv 1909
[5]

Intrinsic dimension of data representations in deep neural networks

Alessio Ansuini, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[6]

Read between the layers: Leveraging multi-layer representations for rehearsal-free continual learning with pre-trained models

Kyra Ahrens, Hans Hergen Lehmann, Jae Hee Lee, and Stefan Wermter. Read between the layers: Leveraging multi-layer representations for rehearsal-free continual learning with pre-trained models. arXiv preprint arXiv:2312.08888, 2023

work page arXiv 2023
[7]

Intermediate layer classifiers for ood generalization

Arnas Uselis and Seong Joon Oh. Intermediate layer classifiers for ood generalization. arXiv preprint arXiv:2504.05461, 2025

work page arXiv 2025
[8]

Not all layers of llms are necessary during inference

Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, and Zhongyuan Wang. Not all layers of llms are necessary during inference. arXiv preprint arXiv:2403.02181, 2024

work page arXiv 2024
[9]

Exploring concept depth: How large language models acquire knowledge and concept at different layers? arXiv preprint arXiv:2404.07066, 2024

Mingyu Jin, Qinkai Yu, Jingyuan Huang, Qingcheng Zeng, Zhenting Wang, Wenyue Hua, Haiyan Zhao, Kai Mei, Yanda Meng, Kaize Ding, et al. Exploring concept depth: How large language models acquire knowledge and concept at different layers? arXiv preprint arXiv:2404.07066, 2024

work page arXiv 2024
[10]

Layer by Layer: Uncovering Hidden Representations in Language Models

Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Measures of entropy from data using infinitely divisible kernels

Luis Gonzalo Sanchez Giraldo, Murali Rao, and Jose C Principe. Measures of entropy from data using infinitely divisible kernels. IEEE Transactions on Information Theory, 61(1):535–548, 2014

work page 2014
[12]

Self-adaptive scaling for learnable residual structure

Fenglin Liu, Meng Gao, Yuanxin Liu, and Kai Lei. Self-adaptive scaling for learnable residual structure. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 862–870, 2019

work page 2019
[13]

Laurel: Learned augmented residual layer

Gaurav Menghani, Ravi Kumar, and Sanjiv Kumar. Laurel: Learned augmented residual layer. arXiv preprint arXiv:2411.07501, 2024

work page arXiv 2024
[14]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[15]

Analyzing the Structure of Attention in a Transformer Language Model

Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language model. arXiv preprint arXiv:1906.04284, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1906
[16]

Opening the Black Box of Deep Neural Networks via Information

Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[17]

Estimating information flow in deep neural networks

Ziv Goldfeld. Estimating information flow in deep neural networks. InInternational Conference on Machine Learning, 2019. 10

work page 2019
[18]

How transferable are features in deep neural networks? Advances in neural information processing systems, 27, 2014

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in neural information processing systems, 27, 2014

work page 2014
[19]

On the use of modality-specific large-scale pre-trained encoders for multimodal sentiment analysis

Atsushi Ando, Ryo Masumura, Akihiko Takashima, Satoshi Suzuki, Naoki Makishima, Keita Suzuki, Takafumi Moriya, Takanori Ashihara, and Hiroshi Sato. On the use of modality-specific large-scale pre-trained encoders for multimodal sentiment analysis. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 739–746. IEEE, 2023

work page 2022
[20]

Information-theoretic generalization bounds for deep neural networks

Haiyun He, Christina Lee Yu, and Ziv Goldfeld. Information-theoretic generalization bounds for deep neural networks. arXiv preprint arXiv:2404.03176, 2024

work page arXiv 2024
[21]

Optimal transport: old and new, volume 338

Cédric Villani et al. Optimal transport: old and new, volume 338. Springer, 2008

work page 2008
[22]

Language models are unsupervised multitask learners

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019

work page 2019
[23]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Meta llama 3.1 8b

Meta AI. Meta llama 3.1 8b. https://huggingface.co/meta-llama/Llama-3.1-8B ,

work page
[25]

Accessed: 2025-05-12

work page 2025
[26]

Hamilton

Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L. Hamilton. Clutrr: A diagnostic benchmark for inductive reasoning from text. Empirical Methods of Natural Language Processing (EMNLP), 2019

work page 2019
[27]

Explanations for commonsenseqa: New dataset and models

Shourya Aggarwal, Divyanshu Mandowara, Vishwajeet Agrawal, Dinesh Khandelwal, Parag Singla, and Dinesh Garg. Explanations for commonsenseqa: New dataset and models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), p...

work page 2021
[28]

Parameter-efficient transfer learning for nlp

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019

work page 2019
[29]

Reft: Representation finetuning for language models.Advances in Neural Information Processing Systems, 37:63908–63962, 2024

Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D Manning, and Christopher Potts. Reft: Representation finetuning for language models.Advances in Neural Information Processing Systems, 37:63908–63962, 2024

work page 2024
[30]

Ravel: Evaluat- ing interpretability methods on disentangling language model representations

Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, and Atticus Geiger. Ravel: Evaluat- ing interpretability methods on disentangling language model representations. arXiv preprint arXiv:2402.17700, 2024

work page arXiv 2024
[31]

Learning distribution-wise control in repre- sentation space for language models

Chunyuan Deng, Ruidi Chang, and Hanjie Chen. Learning distribution-wise control in repre- sentation space for language models. arXiv preprint arXiv:2506.06686, 2025

work page arXiv 2025
[32]

Locating and editing factual associations in gpt

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359–17372, 2022

work page 2022
[33]

gen_train23_test2to10

Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah Goodman. Inter- pretability at scale: Identifying causal mechanisms in alpaca. Advances in neural information processing systems, 36:78205–78226, 2023. 11 A Mathematical Details for Matrix-Based Information Estimation and Theoretical Foundations We employ the matrix-based Rényi entrop...

work page 2023
[34]

All assets were used in compliance with their respective licenses, and no proprietary or restricted resources were employed in our experiments

The Synthetic Arithmetic dataset is custom-designed by the authors and does not rely on any external or licensed data sources. All assets were used in compliance with their respective licenses, and no proprietary or restricted resources were employed in our experiments. E Full Quantitative Results with Confidence Estimates To provide a comprehensive view ...

work page

[1] [1]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017

[2] [2]

A survey of natural language generation

Chenhe Dong, Yinghui Li, Haifan Gong, Miaoxin Chen, Junxin Li, Ying Shen, and Min Yang. A survey of natural language generation. ACM Computing Surveys, 55(8):1–38, 2022

work page 2022

[3] [3]

Linguistic knowledge and transferability of contextual representations

Nelson F Liu, Matt Gardner, Yonatan Belinkov, Matthew E Peters, and Noah A Smith. Linguistic knowledge and transferability of contextual representations. In Proceedings of NAACL-HLT, pages 1073–1094, 2019

work page 2019

[4] [4]

The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives

Elena V oita, Rico Sennrich, and Ivan Titov. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. arXiv preprint arXiv:1909.01380, 2019

work page arXiv 1909

[5] [5]

Intrinsic dimension of data representations in deep neural networks

Alessio Ansuini, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems, 32, 2019

work page 2019

[6] [6]

Read between the layers: Leveraging multi-layer representations for rehearsal-free continual learning with pre-trained models

Kyra Ahrens, Hans Hergen Lehmann, Jae Hee Lee, and Stefan Wermter. Read between the layers: Leveraging multi-layer representations for rehearsal-free continual learning with pre-trained models. arXiv preprint arXiv:2312.08888, 2023

work page arXiv 2023

[7] [7]

Intermediate layer classifiers for ood generalization

Arnas Uselis and Seong Joon Oh. Intermediate layer classifiers for ood generalization. arXiv preprint arXiv:2504.05461, 2025

work page arXiv 2025

[8] [8]

Not all layers of llms are necessary during inference

Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, and Zhongyuan Wang. Not all layers of llms are necessary during inference. arXiv preprint arXiv:2403.02181, 2024

work page arXiv 2024

[9] [9]

Exploring concept depth: How large language models acquire knowledge and concept at different layers? arXiv preprint arXiv:2404.07066, 2024

Mingyu Jin, Qinkai Yu, Jingyuan Huang, Qingcheng Zeng, Zhenting Wang, Wenyue Hua, Haiyan Zhao, Kai Mei, Yanda Meng, Kaize Ding, et al. Exploring concept depth: How large language models acquire knowledge and concept at different layers? arXiv preprint arXiv:2404.07066, 2024

work page arXiv 2024

[10] [10]

Layer by Layer: Uncovering Hidden Representations in Language Models

Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Measures of entropy from data using infinitely divisible kernels

Luis Gonzalo Sanchez Giraldo, Murali Rao, and Jose C Principe. Measures of entropy from data using infinitely divisible kernels. IEEE Transactions on Information Theory, 61(1):535–548, 2014

work page 2014

[12] [12]

Self-adaptive scaling for learnable residual structure

Fenglin Liu, Meng Gao, Yuanxin Liu, and Kai Lei. Self-adaptive scaling for learnable residual structure. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 862–870, 2019

work page 2019

[13] [13]

Laurel: Learned augmented residual layer

Gaurav Menghani, Ravi Kumar, and Sanjiv Kumar. Laurel: Learned augmented residual layer. arXiv preprint arXiv:2411.07501, 2024

work page arXiv 2024

[14] [14]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[15] [15]

Analyzing the Structure of Attention in a Transformer Language Model

Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language model. arXiv preprint arXiv:1906.04284, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1906

[16] [16]

Opening the Black Box of Deep Neural Networks via Information

Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[17] [17]

Estimating information flow in deep neural networks

Ziv Goldfeld. Estimating information flow in deep neural networks. InInternational Conference on Machine Learning, 2019. 10

work page 2019

[18] [18]

How transferable are features in deep neural networks? Advances in neural information processing systems, 27, 2014

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in neural information processing systems, 27, 2014

work page 2014

[19] [19]

On the use of modality-specific large-scale pre-trained encoders for multimodal sentiment analysis

Atsushi Ando, Ryo Masumura, Akihiko Takashima, Satoshi Suzuki, Naoki Makishima, Keita Suzuki, Takafumi Moriya, Takanori Ashihara, and Hiroshi Sato. On the use of modality-specific large-scale pre-trained encoders for multimodal sentiment analysis. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 739–746. IEEE, 2023

work page 2022

[20] [20]

Information-theoretic generalization bounds for deep neural networks

Haiyun He, Christina Lee Yu, and Ziv Goldfeld. Information-theoretic generalization bounds for deep neural networks. arXiv preprint arXiv:2404.03176, 2024

work page arXiv 2024

[21] [21]

Optimal transport: old and new, volume 338

Cédric Villani et al. Optimal transport: old and new, volume 338. Springer, 2008

work page 2008

[22] [22]

Language models are unsupervised multitask learners

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019

work page 2019

[23] [23]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[24] [24]

Meta llama 3.1 8b

Meta AI. Meta llama 3.1 8b. https://huggingface.co/meta-llama/Llama-3.1-8B ,

work page

[25] [25]

Accessed: 2025-05-12

work page 2025

[26] [26]

Hamilton

Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L. Hamilton. Clutrr: A diagnostic benchmark for inductive reasoning from text. Empirical Methods of Natural Language Processing (EMNLP), 2019

work page 2019

[27] [27]

Explanations for commonsenseqa: New dataset and models

Shourya Aggarwal, Divyanshu Mandowara, Vishwajeet Agrawal, Dinesh Khandelwal, Parag Singla, and Dinesh Garg. Explanations for commonsenseqa: New dataset and models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), p...

work page 2021

[28] [28]

Parameter-efficient transfer learning for nlp

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019

work page 2019

[29] [29]

Reft: Representation finetuning for language models.Advances in Neural Information Processing Systems, 37:63908–63962, 2024

Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D Manning, and Christopher Potts. Reft: Representation finetuning for language models.Advances in Neural Information Processing Systems, 37:63908–63962, 2024

work page 2024

[30] [30]

Ravel: Evaluat- ing interpretability methods on disentangling language model representations

Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, and Atticus Geiger. Ravel: Evaluat- ing interpretability methods on disentangling language model representations. arXiv preprint arXiv:2402.17700, 2024

work page arXiv 2024

[31] [31]

Learning distribution-wise control in repre- sentation space for language models

Chunyuan Deng, Ruidi Chang, and Hanjie Chen. Learning distribution-wise control in repre- sentation space for language models. arXiv preprint arXiv:2506.06686, 2025

work page arXiv 2025

[32] [32]

Locating and editing factual associations in gpt

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359–17372, 2022

work page 2022

[33] [33]

gen_train23_test2to10

Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah Goodman. Inter- pretability at scale: Identifying causal mechanisms in alpaca. Advances in neural information processing systems, 36:78205–78226, 2023. 11 A Mathematical Details for Matrix-Based Information Estimation and Theoretical Foundations We employ the matrix-based Rényi entrop...

work page 2023

[34] [34]

All assets were used in compliance with their respective licenses, and no proprietary or restricted resources were employed in our experiments

The Synthetic Arithmetic dataset is custom-designed by the authors and does not rely on any external or licensed data sources. All assets were used in compliance with their respective licenses, and no proprietary or restricted resources were employed in our experiments. E Full Quantitative Results with Confidence Estimates To provide a comprehensive view ...

work page