pith. machine review for the scientific record.

arxiv: 2502.02013 · v2 · submitted 2025-02-04 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links

Layer by Layer: Uncovering Hidden Representations in Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 16:25 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords language models · intermediate layers · representation quality · information theory · geometric analysis · perturbation invariance · downstream tasks · embeddings

The pith

Intermediate layers in language models often encode richer representations than the final layer for downstream tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the standard practice of using only final-layer outputs from LLMs by showing that intermediate layers frequently capture more useful features for a variety of tasks. It develops metrics grounded in information theory, geometry, and robustness to input changes to measure how each layer trades off compression against preservation of relevant signals. Experiments across 32 embedding tasks, covering transformers and state-space models in both language and vision, demonstrate consistent advantages for mid-depth layers. A reader would care because this finding questions default feature extraction methods and points to simple ways to improve performance from existing models without retraining.

Core claim

The authors establish that intermediate layers balance information compression and signal preservation more effectively than the final layer, leading to stronger representations that improve results on downstream tasks. Their unified metrics quantify these properties layer by layer and confirm the pattern holds across architectures and domains through extensive testing on 32 tasks.

What carries the argument

Unified framework of representation quality metrics based on information theory, geometry, and invariance to input perturbations, which tracks the compression-preservation trade-off at each depth.
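The page does not reproduce the paper's exact metric definitions, so the following is only an illustrative sketch of two stand-in diagnostics in the same spirit: effective rank (Roy and Vetterli, 2007) as a geometric measure of how many directions a layer's representation actually uses, and mean cosine similarity under input perturbation as an invariance score. All array shapes and noise levels below are hypothetical.

```python
import numpy as np

def effective_rank(X):
    """Effective rank (Roy & Vetterli, 2007): exponentiated entropy of the
    normalized singular-value distribution of the centered embedding matrix.
    A geometric proxy for how many directions a layer really uses."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def invariance_score(X, X_perturbed):
    """Mean cosine similarity between embeddings of clean and perturbed
    inputs; values near 1 indicate a perturbation-invariant layer."""
    num = (X * X_perturbed).sum(axis=1)
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(X_perturbed, axis=1)
    return float((num / den).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))           # stand-in for one layer's embeddings
Xp = X + 0.1 * rng.normal(size=X.shape)  # embeddings of perturbed inputs
print(effective_rank(X), invariance_score(X, Xp))
```

Comparing these two numbers layer by layer is one concrete way to track the compression-preservation trade-off the framework describes: compression pushes effective rank down, while preservation of task-relevant signal should keep invariance high.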

If this is right

  • Mid-layer embeddings improve accuracy on text and vision embedding tasks compared with final-layer use.
  • The same layer-wise pattern appears in both transformer and state-space model families.
  • Final-layer embeddings are not reliably optimal for feature extraction across tasks.
  • Selecting representations from intermediate depths becomes a viable direction for more robust embeddings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Users could routinely extract and compare activations from several layers before choosing the best one for a given task.
  • The compression-preservation balance observed here may appear in other neural architectures beyond language models.
  • Task-specific layer selection might allow lighter inference by skipping deeper computations in some applications.
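The first extension above, sweeping layers and probing each before committing to one, can be sketched on synthetic data. Everything below is invented for illustration: the "layers" are random matrices carrying a label signal at different noise levels (peaking mid-depth by construction), and the probe is a nearest-class-centroid classifier rather than anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, n_layers = 400, 8, 6
y = rng.integers(0, 2, size=n)
signal = np.where(y[:, None] == 1, 1.0, -1.0) * np.ones((n, d))

# Toy stand-in for hidden states: each "layer" carries the label signal
# at a different signal-to-noise ratio, best at mid-depth by construction.
noise_scale = [6.0, 4.0, 2.5, 1.5, 3.0, 5.0]
layers = [signal + s * rng.normal(size=(n, d)) for s in noise_scale]

def probe_accuracy(X, y, split=200):
    """Nearest-class-centroid probe: fit centroids on the first `split`
    rows, report accuracy on the rest."""
    Xtr, ytr, Xte, yte = X[:split], y[:split], X[split:], y[split:]
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    pred = (np.linalg.norm(Xte - c1, axis=1)
            < np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return (pred == yte).mean()

scores = [probe_accuracy(X, y) for X in layers]
best = int(np.argmax(scores))
print(f"best layer: {best}, accuracies: {np.round(scores, 3)}")
```

With real models the same loop would run over actual hidden states (e.g. mean-pooled token activations per layer), with the winning layer chosen on a validation split rather than the test set.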

Load-bearing premise

The metrics from information theory, geometry, and perturbation invariance accurately reflect the qualities that determine usefulness for real downstream tasks.

What would settle it

An experiment on a new collection of tasks where final-layer embeddings match or exceed every intermediate layer on all metrics and actual task performance.

read the original abstract

From extracting features to generating text, the outputs of large language models (LLMs) typically rely on the final layers, following the conventional wisdom that earlier layers capture only low-level cues. However, our analysis shows that intermediate layers can encode even richer representations, often improving performance on a range of downstream tasks. To explain and quantify these hidden-layer properties, we propose a unified framework of representation quality metrics based on information theory, geometry, and invariance to input perturbations. Our framework highlights how each layer balances information compression and signal preservation, revealing why mid-depth embeddings can exceed the last layer's performance. Through extensive experiments on 32 text-embedding tasks across various architectures (transformers, state-space models) and domains (language, vision), we demonstrate that intermediate layers consistently provide stronger features, challenging the standard view on final-layer embeddings and opening new directions on using mid-layer representations for more robust and accurate representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that intermediate layers in LLMs and related architectures often encode richer representations than final layers for downstream tasks. It introduces a unified framework of representation-quality metrics grounded in information theory, geometry, and invariance to input perturbations, and supports the claim with experiments across 32 text-embedding tasks, multiple model families (transformers, state-space models), and domains (language and vision).

Significance. If the empirical findings and metric framework hold under scrutiny, the work would meaningfully challenge the default practice of relying on final-layer embeddings and could shift feature-extraction pipelines toward mid-layer representations, with potential gains in robustness and accuracy on embedding-based tasks.

minor comments (3)
  1. [§3] Abstract and §3: the claim that the metrics are 'independently motivated' and 'parameter-free' should be supported by explicit definitions; any implicit hyperparameters or normalization choices must be stated so readers can verify independence from downstream-task fitting.
  2. [§4] §4 and Table 2: the reported 'consistent' outperformance of intermediate layers requires per-task statistical significance tests (e.g., paired t-tests or Wilcoxon with correction) and effect-size reporting; aggregate win rates alone are insufficient to support the strong claim.
  3. [§5] §5: the cross-architecture and cross-domain generalization (transformers vs. state-space models, language vs. vision) is central; the authors should confirm that layer-indexing conventions and token-aggregation methods are identical across families so the comparison is apples-to-apples.
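The referee's second comment, per-task paired testing, could be answered with a simple nonparametric procedure. The sketch below runs a paired sign-flip permutation test on hypothetical per-task accuracy gaps (intermediate-layer minus final-layer); the Wilcoxon signed-rank test the referee names would be a drop-in alternative, and the gap values here are simulated, not taken from the paper.

```python
import numpy as np

def paired_sign_flip_test(diffs, n_perm=10000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-task score
    differences. Under H0 each task's difference is symmetric about zero,
    so we compare the observed mean difference against a null distribution
    built by randomly flipping the sign of each difference."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs, dtype=float)
    observed = abs(diffs.mean())
    flips = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = np.abs((flips * diffs).mean(axis=1))
    return float((null >= observed).mean())

# Hypothetical per-task accuracy gaps on a 32-task suite.
rng = np.random.default_rng(42)
gaps = rng.normal(loc=0.02, scale=0.03, size=32)
p = paired_sign_flip_test(gaps)
print(f"mean gap={gaps.mean():.3f}, p={p:.4f}")
```

Reporting such a p-value per layer choice, plus an effect size like the mean gap, would directly address the objection that aggregate win rates alone are insufficient.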

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary and significance assessment of the manuscript, as well as the recommendation for minor revision. No major comments were raised, so there are no points requiring detailed rebuttal. In the revised version we will address the three minor comments: stating explicit definitions and any implicit hyperparameters for the metrics, adding per-task significance tests and effect sizes alongside aggregate win rates, and documenting that layer indexing and token aggregation are consistent across architecture families.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper proposes a framework of representation quality metrics motivated independently by information theory, geometry, and invariance to perturbations. No equations, derivations, or fitted parameters are described that reduce predictions to inputs by construction. Experiments across 32 tasks on multiple architectures provide external validation of the claim that intermediate layers can outperform final layers. No self-citation chains, uniqueness theorems, or ansatz smuggling are referenced in the abstract or context. The central claim rests on empirical results rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5479 in / 1032 out tokens · 61913 ms · 2026-05-15T16:25:43.473004+00:00 · methodology

discussion (0)


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Inference-Time Machine Unlearning via Gated Activation Redirection

    cs.LG 2026-05 conditional novelty 8.0

    GUARD-IT performs machine unlearning in LLMs via inference-time gated activation redirection, matching or exceeding gradient-based baselines on TOFU and MUSE while preserving utility and working under quantization.

  2. UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.

  3. Intermediate Layers Encode Optimal Biological Representations in Single-Cell Foundation Models

    cs.AI 2026-04 unverdicted novelty 7.0

    Intermediate layers in single-cell foundation models encode optimal representations for biological tasks, outperforming final layers in a task- and context-dependent manner.

  4. Instruction Data Selection via Answer Divergence

    cs.CL 2026-04 unverdicted novelty 7.0

    ADG selects 10K instruction examples by scoring the geometric divergence of multiple high-temperature model outputs in embedding space, outperforming prior selectors on reasoning, knowledge, and coding benchmarks acro...

  5. Overcoming the Modality Gap in Context-Aided Forecasting

    cs.LG 2026-03 unverdicted novelty 7.0

    A semi-synthetic augmentation creates the CAF-7M dataset and demonstrates that improved context data enables multimodal models to outperform unimodal baselines in context-aided forecasting.

  6. A Comparative analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs

    cs.CL 2026-03 unverdicted novelty 7.0

    Diffusion language models form more global representations with early-layer redundancy compared to autoregressive models, allowing layer skipping for up to 18.75% FLOP savings while maintaining over 90% performance.

  7. Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.

  8. Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement

    cs.CV 2026-05 unverdicted novelty 6.0

    A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.

  9. Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.

  10. Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.

  11. FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashAR achieves up to 22.9x speedup in 512x512 autoregressive image generation by post-training a pre-trained model with a complementary vertical head and dynamic fusion using only 0.05% of original training data.

  12. FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.

  13. Large Vision-Language Models Get Lost in Attention

    cs.AI 2026-05 unverdicted novelty 6.0

    In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.

  14. Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs encode accurate but brittle internal beliefs about latent game states and convert them poorly into actions, creating systematic gaps that explain strategic failures.

  15. LLM Safety From Within: Detecting Harmful Content with Internal Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

  16. Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.

  17. The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...

  18. From Words to Amino Acids: Does the Curse of Depth Persist?

    cs.LG 2026-02 unverdicted novelty 6.0

    Protein language models exhibit consistent depth inefficiency where most task-relevant computation occurs in a subset of layers, mirroring patterns in large language models.

  19. Semantic Structure of Feature Space in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.

  20. Do Vision Language Models Need to Process Image Tokens?

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual representations in VLMs converge quickly to stable low-complexity forms while text continues evolving, with task-dependent needs for sustained image token access.

  21. LTX-2: Efficient Joint Audio-Visual Foundation Model

    cs.CV 2026-01 conditional novelty 5.0

    LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.

  22. Adaptive Forensic Feature Refinement via Intrinsic Importance Perception

    cs.CV 2026-04 unverdicted novelty 4.0

    I2P adaptively selects the most discriminative layers from visual foundation models for synthetic image detection and constrains task updates to low-sensitivity parameter subspaces to improve specificity without harmi...

Reference graph

Works this paper leans on

173 extracted references · 173 canonical work pages · cited by 20 Pith papers

  1. Agrawal, K. K., Mondal, A. K., Ghosh, A., and Richards, B. ReQ: Assessing representation quality in self-supervised learning by measuring eigenspectrum decay. NeurIPS, 2022.
  2. Alain, G. and Bengio, Y. Understanding intermediate layers using linear classifier probes. ICLR, 2017.
  3. Arefin, M. R., Subbaraj, G., Gontier, N., LeCun, Y., Rish, I., Shwartz-Ziv, R., and Pal, C. Seq-VCR: Preventing collapse in intermediate transformer representations for enhanced reasoning. ICLR, 2025.
  4. Bach, F. Information theory with kernel methods. IEEE Transactions on Information Theory, 2022.
  5. Bao, H., Dong, L., Piao, S., and Wei, F. BEiT: BERT pre-training of image transformers. ICLR, 2022.
  6. Barbero, F., Arroyo, A., Gu, X., Perivolaropoulos, C., Bronstein, M., Veličković, P., and Pascanu, R. Why do LLMs attend to the first token? arXiv, 2025.
  7. BehnamGhader, P., Adlakha, V., Mosbach, M., Bahdanau, D., Chapados, N., and Reddy, S. LLM2Vec: Large language models are secretly powerful text encoders. COLM, 2024.
  8. Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. Pythia: A suite for analyzing large language models across training and scaling. ICML, 2023.
  9. Boes, P., Eisert, J., Gallego, R., Müller, M. P., and Wilming, H. Von Neumann entropy from unitarity. Physical Review Letters, 2019.
  10. Bordes, F., Balestriero, R., Garrido, Q., Bardes, A., and Vincent, P. Guillotine regularization: Why removing layers is needed to improve generalization in self-supervised learning. TMLR, 2023.
  11. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. …
  12. Brunner, G., Liu, Y., Pascual, D., Richter, O., Ciaramita, M., and Wattenhofer, R. On identifiability in transformers. ICLR, 2020.
  13. Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. ICLR, 2023.
  14. Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., and Sutskever, I. Generative pretraining from pixels. ICML, 2020.
  15. Cheng, E., Doimo, D., Kervadec, C., Macocco, I., Yu, J., Laio, A., and Baroni, M. Emergence of a high-dimensional abstraction phase in language transformers. ICLR, 2025.
  16. Csordás, R., Manning, C. D., and Potts, C. Do language models use their depth efficiently? arXiv, 2025.
  17. DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv, 2025.
  18. Deletang, G., Ruoss, A., Duquenne, P.-A., Catt, E., Genewein, T., Mattern, C., Grau-Moya, J., Wenliang, L. K., Aitchison, M., Orseau, L., et al. Language modeling is compression. ICLR, 2024.
  19. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL, 2019.
  20. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
  21. Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv, 2024.
  22. El-Nouby, A., Klein, M., Zhai, S., Bautista, M. A., Toshev, A., Shankar, V., Susskind, J. M., and Joulin, A. Scalable pre-training of large autoregressive image models. ICML, 2024.
  23. Fan, S., Jiang, X., Li, X., Meng, X., Han, P., Shang, S., Sun, A., Wang, Y., and Wang, Z. Not all layers of LLMs are necessary during inference. arXiv, 2024.
  24. Fini, E., Shukor, M., Li, X., Dufter, P., Klein, M., Haldimann, D., Aitharaju, S., da Costa, V. G. T., Béthune, L., Gan, Z., et al. Multimodal autoregressive pre-training of large vision encoders. CVPR, 2025.
  25. Garrido, Q., Balestriero, R., Najman, L., and LeCun, Y. RankMe: Assessing the downstream performance of pretrained self-supervised representations by their rank. ICML, 2023.
  26. Giraldo, L. G. S., Rao, M., and Principe, J. C. Measures of entropy from data using infinitely divisible kernels. IEEE Transactions on Information Theory, 2014.
  27. Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. COLM, 2024.
  28. Gu, X., Pang, T., Du, C., Liu, Q., Zhang, F., Du, C., Wang, Y., and Lin, M. When attention sink emerges in language models: An empirical view. ICLR, 2025.
  29. Gurnee, W. and Tegmark, M. Language models represent space and time. arXiv, 2023.
  30. Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y. Training large language models to reason in a continuous latent space. arXiv, 2024.
  31. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. CVPR, 2022.
  32. Hosseini, E. and Fedorenko, E. Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language. NeurIPS, 2023.
  33. Jin, M., Yu, Q., Huang, J., Zeng, Q., Wang, Z., Hua, W., Zhao, H., Mei, K., Meng, Y., Ding, K., et al. Exploring concept depth: How large language models acquire knowledge at different layers? arXiv, 2024.
  34. Lad, V., Gurnee, W., and Tegmark, M. The remarkable robustness of LLMs: Stages of inference? arXiv, 2024.
  35. Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. Competition-level code generation with AlphaCode. Science, 2022.
  36. Liu, N. F., Gardner, M., Belinkov, Y., Peters, M. E., and Smith, N. A. Linguistic knowledge and transferability of contextual representations. NAACL, 2019.
  37. Ma, E. NLP Augmentation, 2019. URL https://github.com/makcedward/nlpaug
  38. Mallen, A. T. and Belrose, N. Eliciting latent knowledge from quirky language models. ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024.
  39. Mamou, J., Le, H., Del Rio, M. A., Stephenson, C., Tang, H., Kim, Y., and Chung, S. Emergence of separable manifolds in deep language representations. ICML, 2020.
  40. Marion, P., Wu, Y.-H., Sander, M. E., and Biau, G. Implicit regularization of deep residual networks towards neural ODEs. ICLR, 2024.
  41. Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. ICLR, 2017.
  42. Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. MTEB: Massive text embedding benchmark. EACL, 2022.
  43. Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. ICLR, 2018.
  44. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024.
  45. Park, K., Choe, Y. J., Jiang, Y., and Veitch, V. The geometry of categorical and hierarchical concepts in large language models. ICML 2024 Workshop on Mechanistic Interpretability, 2024a.
  46. Park, K., Choe, Y. J., and Veitch, V. The linear representation hypothesis and the geometry of large language models. ICML, 2024b.
  47. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. ICML, 2021.
  48. Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein, J. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. NeurIPS, 2017.
  49. Razzhigaev, A., Mikhalchuk, M., Goncharova, E., Oseledets, I., Dimitrov, D., and Kuznetsov, A. The shape of learning: Anisotropy and intrinsic dimensions in transformer-based models. EACL, 2024.
  50. Rényi, A. On measures of entropy and information. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1961.
  51. Roy, O. and Vetterli, M. The effective rank: A measure of effective dimensionality. European Signal Processing Conference, 2007.
  52. Saponati, M., Sager, P., Aceituno, P. V., Stadelmann, T., and Grewe, B. The underlying structures of self-attention: Symmetry, directionality, and emergent dynamics in transformer training. arXiv, 2025.
  53. Schölkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2018.
  54. Shwartz-Ziv, R. Information flow in deep neural networks. PhD thesis, Hebrew University, 2022.
  55. Shwartz-Ziv, R. and Tishby, N. Opening the black box of deep neural networks via information. Entropy, 2019.
  56. Shwartz-Ziv, R., Balestriero, R., Kawaguchi, K., Rudner, T. G., and LeCun, Y. An information theory perspective on variance-invariance-covariance regularization. NeurIPS, 2023.
  57. Skean, O., Osorio, J. K. H., Brockmeier, A. J., and Giraldo, L. G. S. DiME: Maximizing mutual information by a difference of matrix-based entropies. arXiv, 2023.
  58. Skean, O., Dhakal, A., Jacobs, N., and Giraldo, L. G. S. FroSSL: Frobenius norm minimization for self-supervised learning. ECCV, 2024.
  59. Sorscher, B., Ganguli, S., and Sompolinsky, H. Neural representational geometry underlies few-shot concept learning. Proceedings of the National Academy of Sciences, 2022.
  60. Tenney, I., Das, D., and Pavlick, E. BERT rediscovers the classical NLP pipeline. NAACL, 2019.
  61. Thilak, V., Huang, C., Saremi, O., Dinh, L., Goh, H., Nakkiran, P., Susskind, J. M., and Littwin, E. LiDAR: Sensing linear probing performance in joint embedding SSL architectures. ICLR, 2024.
  62. Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. ECCV, 2020.
  63. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv, 2023.
  64. Valeriani, L., Doimo, D., Cuturello, F., Laio, A., Ansuini, A., and Cazzaniga, A. The geometry of hidden representations of large transformer models. NeurIPS, 2023.
  65. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. NeurIPS, 2017.
  66. Voita, E., Sennrich, R., and Titov, I. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. EMNLP-IJCNLP, 2019.
  67. Wei, L., Tan, Z., Li, C., Wang, J., and Huang, W. Diff-eRank: A novel rank-based metric for evaluating large language models. NeurIPS, 2024.
  68. Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. ICLR, 2024.
  69. Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv, 2024.
  70. Zhao, Z., Ziser, Y., and Cohen, S. B. Layer by layer: Uncovering where multi-task learning happens in instruction-tuned large language models. EMNLP, 2024.
  71. Zhouyin, Z. and Liu, D. Understanding neural networks with logarithm determinant entropy estimator. arXiv, 2021.
  72. Székely, G. J., Rizzo, M. L., and Bakirov, N. K. 2008.
  73. Training Large Language Models to Reason in a Continuous Latent Space.
  74. Qwen2.5 Technical Report.
  75. DeepSeek-AI.
  76. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. 2019.
  77. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. NeurIPS.
  78. AI Medical Chatbot dataset.
  79. Pythia: A suite for analyzing large language models across training and scaling.
  80. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. NAACL.

Showing first 80 references.