Layer by Layer: Uncovering Hidden Representations in Language Models
Pith reviewed 2026-05-15 16:25 UTC · model grok-4.3
The pith
Intermediate layers in language models often encode richer representations than the final layer for downstream tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that intermediate layers balance information compression and signal preservation more effectively than the final layer, leading to stronger representations that improve results on downstream tasks. Their unified metrics quantify these properties layer by layer and confirm the pattern holds across architectures and domains through extensive testing on 32 tasks.
What carries the argument
A unified framework of representation-quality metrics grounded in information theory, geometry, and invariance to input perturbations, which tracks the compression-preservation trade-off at each depth.
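As a concrete illustration of such metrics, the sketch below computes a matrix-based Rényi entropy (information-theoretic) and the effective rank (geometric) of a layer's activation matrix. The function names and normalization choices are illustrative assumptions, not the paper's exact definitions.

```python
# Sketch of two spectrum-based layer diagnostics in the spirit of the
# framework; normalization choices here are illustrative assumptions.
import numpy as np

def matrix_entropy(X: np.ndarray, alpha: float = 2.0) -> float:
    """Matrix-based Renyi entropy of activations X with shape (n, d)."""
    Z = X - X.mean(axis=0)                       # center the activations
    Z = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)
    K = (Z @ Z.T) / Z.shape[0]                   # Gram matrix with unit trace
    lam = np.linalg.eigvalsh(K)
    lam = lam[lam > 1e-12]
    lam = lam / lam.sum()
    if np.isclose(alpha, 1.0):                   # Shannon / von Neumann limit
        return float(-(lam * np.log(lam)).sum())
    return float(np.log((lam ** alpha).sum()) / (1.0 - alpha))

def effective_rank(X: np.ndarray) -> float:
    """Effective rank (Roy & Vetterli, 2007): exp of singular-value entropy."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]
    return float(np.exp(-(p * np.log(p)).sum()))
```

Tracked across depth, falling entropy indicates compression, while a collapsing effective rank warns that signal is being discarded; the paper's invariance metrics would add perturbed inputs to this picture.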
If this is right
- Mid-layer embeddings improve accuracy on text and vision embedding tasks compared with final-layer use.
- The same layer-wise pattern appears in both transformer and state-space model families.
- Final-layer embeddings are not reliably optimal for feature extraction across tasks.
- Selecting representations from intermediate depths becomes a viable direction for more robust embeddings.
Where Pith is reading between the lines
- Users could routinely extract and compare activations from several layers before choosing the best one for a given task (a layer-sweep sketch follows this list).
- The compression-preservation balance observed here may appear in other neural architectures beyond language models.
- Task-specific layer selection might allow lighter inference by skipping deeper computations in some applications.
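A minimal sketch of that layer-sweep workflow, assuming a Hugging Face transformers model; the model name, mean pooling, and the downstream probe are placeholder choices, not the paper's exact MTEB protocol.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "gpt2"                                      # placeholder model choice
tok = AutoTokenizer.from_pretrained(name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token                  # gpt2 defines no pad token
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

@torch.no_grad()
def layer_embeddings(texts: list[str]) -> torch.Tensor:
    """Mean-pooled embeddings at every depth: (num_layers+1, batch, hidden)."""
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    hidden = torch.stack(model(**batch).hidden_states)   # (L+1, B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    return (hidden * mask).sum(dim=2) / mask.sum(dim=1).clamp(min=1)

embs = layer_embeddings(["an example sentence", "another one"])
# Fit a cheap linear probe per depth on a labeled split; keep the best layer.
```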
Load-bearing premise
The metrics from information theory, geometry, and perturbation invariance accurately reflect the qualities that determine usefulness for real downstream tasks.
What would settle it
An experiment on a new collection of tasks where final-layer embeddings match or exceed every intermediate layer on all metrics and actual task performance.
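Given a per-layer, per-task score matrix from such an experiment, the decisive check is mechanical; the sketch below uses a placeholder `scores` matrix purely to fix shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((13, 32))       # placeholder: rows = layers, cols = tasks

# The conventional view survives only if the final layer matches or beats
# every other depth on every task (and, ideally, on every metric).
final_never_beaten = bool((scores[-1] >= scores[:-1]).all())
best_depth_per_task = scores.argmax(axis=0)
print(final_never_beaten, best_depth_per_task)
```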
Original abstract
From extracting features to generating text, the outputs of large language models (LLMs) typically rely on the final layers, following the conventional wisdom that earlier layers capture only low-level cues. However, our analysis shows that intermediate layers can encode even richer representations, often improving performance on a range of downstream tasks. To explain and quantify these hidden-layer properties, we propose a unified framework of representation quality metrics based on information theory, geometry, and invariance to input perturbations. Our framework highlights how each layer balances information compression and signal preservation, revealing why mid-depth embeddings can exceed the last layer's performance. Through extensive experiments on 32 text-embedding tasks across various architectures (transformers, state-space models) and domains (language, vision), we demonstrate that intermediate layers consistently provide stronger features, challenging the standard view on final-layer embeddings and opening new directions on using mid-layer representations for more robust and accurate representations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that intermediate layers in LLMs and related architectures often encode richer representations than final layers for downstream tasks. It introduces a unified framework of representation-quality metrics grounded in information theory, geometry, and invariance to input perturbations, and supports the claim with experiments across 32 text-embedding tasks, multiple model families (transformers, state-space models), and domains (language and vision).
Significance. If the empirical findings and metric framework hold under scrutiny, the work would meaningfully challenge the default practice of relying on final-layer embeddings and could shift feature-extraction pipelines toward mid-layer representations, with potential gains in robustness and accuracy on embedding-based tasks.
Minor comments (3)
- [§3] Abstract and §3: the claim that the metrics are 'independently motivated' and 'parameter-free' should be supported by explicit definitions; any implicit hyperparameters or normalization choices must be stated so readers can verify independence from downstream-task fitting.
- [§4] §4 and Table 2: the reported 'consistent' outperformance of intermediate layers requires per-task statistical significance tests (e.g., paired t-tests or Wilcoxon with correction) and effect-size reporting; aggregate win rates alone are insufficient to support the strong claim (a testing sketch follows this list).
- [§5] §5: the cross-architecture and cross-domain generalization (transformers vs. state-space models, language vs. vision) is central; ensure that layer-indexing conventions and token-aggregation methods are identical across families so the comparison is apples-to-apples.
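A sketch of the testing regime the second comment asks for, assuming per-task scores for the best intermediate layer and the final layer in several model families; all numbers are simulated placeholders, not the paper's results.

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
pvals, effects = [], []
for _ in range(4):                                  # 4 hypothetical families
    final = rng.uniform(0.5, 0.8, size=32)          # final-layer task scores
    mid = final + rng.normal(0.03, 0.02, size=32)   # best-intermediate scores
    diff = mid - final
    pvals.append(wilcoxon(mid, final).pvalue)       # paired test per family
    effects.append(diff.mean() / diff.std(ddof=1))  # paired Cohen's d

# Holm correction across the family-level comparisons.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for d, p, r in zip(effects, p_adj, reject):
    print(f"d={d:.2f}  p_adj={p:.3g}  significant={r}")
```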
Simulated Author's Rebuttal
We thank the referee for the positive summary and significance assessment, and for the recommendation of minor revision. The report raises no major comments requiring point-by-point rebuttal. We will address the three minor comments in the revised version: stating all metric definitions and any implicit hyperparameters or normalization choices explicitly, reporting per-task significance tests and effect sizes alongside aggregate win rates, and documenting that layer-indexing and token-aggregation conventions are identical across model families.
Circularity Check
No significant circularity identified
Full rationale
The paper proposes a framework of representation quality metrics motivated independently by information theory, geometry, and invariance to perturbations. No equations, derivations, or fitted parameters are described that reduce predictions to inputs by construction. Experiments across 32 tasks on multiple architectures provide external validation of the claim that intermediate layers can outperform final layers. No self-citation chains, uniqueness theorems, or ansatz smuggling are referenced in the abstract or context. The central claim rests on empirical results rather than definitional equivalence.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 22 Pith papers
- Inference-Time Machine Unlearning via Gated Activation Redirection. GUARD-IT performs machine unlearning in LLMs via inference-time gated activation redirection, matching or exceeding gradient-based baselines on TOFU and MUSE while preserving utility and working under quantization.
- UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs. UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
- Intermediate Layers Encode Optimal Biological Representations in Single-Cell Foundation Models. Intermediate layers in single-cell foundation models encode optimal representations for biological tasks, outperforming final layers in a task- and context-dependent manner.
- Instruction Data Selection via Answer Divergence. ADG selects 10K instruction examples by scoring the geometric divergence of multiple high-temperature model outputs in embedding space, outperforming prior selectors on reasoning, knowledge, and coding benchmarks acro...
- Overcoming the Modality Gap in Context-Aided Forecasting. A semi-synthetic augmentation creates the CAF-7M dataset and demonstrates that improved context data enables multimodal models to outperform unimodal baselines in context-aided forecasting.
- A Comparative Analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs. Diffusion language models form more global representations with early-layer redundancy compared to autoregressive models, allowing layer skipping for up to 18.75% FLOP savings while maintaining over 90% performance.
- Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs. The LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.
- Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement. A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.
- Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation. On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
- Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation. On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.
- FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation. FlashAR achieves up to 22.9x speedup in 512x512 autoregressive image generation by post-training a pre-trained model with a complementary vertical head and dynamic fusion using only 0.05% of original training data.
- FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation. FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.
- Large Vision-Language Models Get Lost in Attention. In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
- Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions. LLMs encode accurate but brittle internal beliefs about latent game states and convert them poorly into actions, creating systematic gaps that explain strategic failures.
- LLM Safety From Within: Detecting Harmful Content with Internal Representations. SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
- Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models. Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
- The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment. The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
- From Words to Amino Acids: Does the Curse of Depth Persist? Protein language models exhibit consistent depth inefficiency where most task-relevant computation occurs in a subset of layers, mirroring patterns in large language models.
- Semantic Structure of Feature Space in Large Language Models. LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.
- Do Vision Language Models Need to Process Image Tokens? Visual representations in VLMs converge quickly to stable low-complexity forms while text continues evolving, with task-dependent needs for sustained image token access.
- LTX-2: Efficient Joint Audio-Visual Foundation Model. LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.
- Adaptive Forensic Feature Refinement via Intrinsic Importance Perception. I2P adaptively selects the most discriminative layers from visual foundation models for synthetic image detection and constrains task updates to low-sensitivity parameter subspaces to improve specificity without harmi...
Reference graph
Works this paper leans on
- [1] Agrawal, K. K., Mondal, A. K., Ghosh, A., and Richards, B. ReQ: Assessing representation quality in self-supervised learning by measuring eigenspectrum decay. NeurIPS, 2022.
- [2] Alain, G. and Bengio, Y. Understanding intermediate layers using linear classifier probes. ICLR, 2017.
- [3] Arefin, M. R., Subbaraj, G., Gontier, N., LeCun, Y., Rish, I., Shwartz-Ziv, R., and Pal, C. Seq-VCR: Preventing collapse in intermediate transformer representations for enhanced reasoning. ICLR, 2025.
- [4] Bach, F. Information theory with kernel methods. IEEE Transactions on Information Theory, 2022.
- [5] Bao, H., Dong, L., Piao, S., and Wei, F. BEiT: BERT pre-training of image transformers. ICLR, 2022.
- [6] Barbero, F., Arroyo, A., Gu, X., Perivolaropoulos, C., Bronstein, M., Veličković, P., and Pascanu, R. Why do LLMs attend to the first token? arXiv, 2025.
- [7] BehnamGhader, P., Adlakha, V., Mosbach, M., Bahdanau, D., Chapados, N., and Reddy, S. LLM2Vec: Large language models are secretly powerful text encoders. COLM, 2024.
- [8] Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. Pythia: A suite for analyzing large language models across training and scaling. ICML, 2023.
- [9] Boes, P., Eisert, J., Gallego, R., Müller, M. P., and Wilming, H. Von Neumann entropy from unitarity. Physical Review Letters, 2019.
- [10] Bordes, F., Balestriero, R., Garrido, Q., Bardes, A., and Vincent, P. Guillotine regularization: Why removing layers is needed to improve generalization in self-supervised learning. TMLR, 2023.
- [11] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. NeurIPS, 2020.
- [12] Brunner, G., Liu, Y., Pascual, D., Richter, O., Ciaramita, M., and Wattenhofer, R. On identifiability in transformers. ICLR, 2020.
- [13] Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. ICLR, 2023.
- [14] Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., and Sutskever, I. Generative pretraining from pixels. ICML, 2020.
- [15] Cheng, E., Doimo, D., Kervadec, C., Macocco, I., Yu, J., Laio, A., and Baroni, M. Emergence of a high-dimensional abstraction phase in language transformers. ICLR, 2025.
- [16] Csordás, R., Manning, C. D., and Potts, C. Do language models use their depth efficiently? arXiv, 2025.
- [17] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv, 2025.
- [18] Deletang, G., Ruoss, A., Duquenne, P.-A., Catt, E., Genewein, T., Mattern, C., Grau-Moya, J., Wenliang, L. K., Aitchison, M., Orseau, L., et al. Language modeling is compression. ICLR, 2024.
- [19] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL, 2019.
- [20] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
- [21] Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv, 2024.
- [22] El-Nouby, A., Klein, M., Zhai, S., Bautista, M. A., Toshev, A., Shankar, V., Susskind, J. M., and Joulin, A. Scalable pre-training of large autoregressive image models. ICML, 2024.
- [23] Fan, S., Jiang, X., Li, X., Meng, X., Han, P., Shang, S., Sun, A., Wang, Y., and Wang, Z. Not all layers of LLMs are necessary during inference. arXiv, 2024.
- [24] Fini, E., Shukor, M., Li, X., Dufter, P., Klein, M., Haldimann, D., Aitharaju, S., da Costa, V. G. T., Béthune, L., Gan, Z., et al. Multimodal autoregressive pre-training of large vision encoders. CVPR, 2025.
- [25] Garrido, Q., Balestriero, R., Najman, L., and LeCun, Y. RankMe: Assessing the downstream performance of pretrained self-supervised representations by their rank. ICML, 2023.
- [26] Giraldo, L. G. S., Rao, M., and Principe, J. C. Measures of entropy from data using infinitely divisible kernels. IEEE Transactions on Information Theory, 2014.
- [27] Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. COLM, 2024.
- [28] Gu, X., Pang, T., Du, C., Liu, Q., Zhang, F., Du, C., Wang, Y., and Lin, M. When attention sink emerges in language models: An empirical view. ICLR, 2025.
- [29] Gurnee, W. and Tegmark, M. Language models represent space and time. arXiv, 2023.
- [30] Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y. Training large language models to reason in a continuous latent space. arXiv, 2024.
- [31] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. CVPR, 2022.
- [32] Hosseini, E. and Fedorenko, E. Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language. NeurIPS, 2023.
- [33] Jin, M., Yu, Q., Huang, J., Zeng, Q., Wang, Z., Hua, W., Zhao, H., Mei, K., Meng, Y., Ding, K., et al. Exploring concept depth: How large language models acquire knowledge at different layers? arXiv, 2024.
- [34] Lad, V., Gurnee, W., and Tegmark, M. The remarkable robustness of LLMs: Stages of inference? arXiv, 2024.
- [35] Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. Competition-level code generation with AlphaCode. Science, 2022.
- [36] Liu, N. F., Gardner, M., Belinkov, Y., Peters, M. E., and Smith, N. A. Linguistic knowledge and transferability of contextual representations. NAACL, 2019.
- [37] Ma, E. NLP Augmentation, 2019. URL https://github.com/makcedward/nlpaug.
- [38] Mallen, A. T. and Belrose, N. Eliciting latent knowledge from quirky language models. ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024.
- [39] Mamou, J., Le, H., Del Rio, M. A., Stephenson, C., Tang, H., Kim, Y., and Chung, S. Emergence of separable manifolds in deep language representations. ICML, 2020.
- [40] Marion, P., Wu, Y.-H., Sander, M. E., and Biau, G. Implicit regularization of deep residual networks towards neural ODEs. ICLR, 2024.
- [41] Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. ICLR, 2017.
- [42] Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. MTEB: Massive text embedding benchmark. EACL, 2022.
- [43] Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. ICLR, 2018.
- [44] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024.
- [45] Park, K., Choe, Y. J., Jiang, Y., and Veitch, V. The geometry of categorical and hierarchical concepts in large language models. ICML 2024 Workshop on Mechanistic Interpretability, 2024a.
- [46] Park, K., Choe, Y. J., and Veitch, V. The linear representation hypothesis and the geometry of large language models. ICML, 2024b.
- [47] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. ICML, 2021.
- [48] Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein, J. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. NeurIPS, 2017.
- [49] Razzhigaev, A., Mikhalchuk, M., Goncharova, E., Oseledets, I., Dimitrov, D., and Kuznetsov, A. The shape of learning: Anisotropy and intrinsic dimensions in transformer-based models. EACL, 2024.
- [50] Rényi, A. On measures of entropy and information. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1961.
- [51] Roy, O. and Vetterli, M. The effective rank: A measure of effective dimensionality. European Signal Processing Conference, 2007.
- [52] Saponati, M., Sager, P., Aceituno, P. V., Stadelmann, T., and Grewe, B. The underlying structures of self-attention: symmetry, directionality, and emergent dynamics in transformer training. arXiv, 2025.
- [53] Schölkopf, B. and Smola, A. J. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2018.
- [54] Shwartz-Ziv, R. Information flow in deep neural networks. PhD thesis, Hebrew University, 2022.
- [55] Shwartz-Ziv, R. and Tishby, N. Opening the black box of deep neural networks via information. Entropy, 2019.
- [56] Shwartz-Ziv, R., Balestriero, R., Kawaguchi, K., Rudner, T. G., and LeCun, Y. An information theory perspective on variance-invariance-covariance regularization. NeurIPS, 2023.
- [57] Skean, O., Osorio, J. K. H., Brockmeier, A. J., and Giraldo, L. G. S. DiME: Maximizing mutual information by a difference of matrix-based entropies. arXiv, 2023.
- [58] Skean, O., Dhakal, A., Jacobs, N., and Giraldo, L. G. S. FroSSL: Frobenius norm minimization for self-supervised learning. ECCV, 2024.
- [59] Sorscher, B., Ganguli, S., and Sompolinsky, H. Neural representational geometry underlies few-shot concept learning. Proceedings of the National Academy of Sciences, 2022.
- [60] Tenney, I., Das, D., and Pavlick, E. BERT rediscovers the classical NLP pipeline. NAACL, 2019.
- [61] Thilak, V., Huang, C., Saremi, O., Dinh, L., Goh, H., Nakkiran, P., Susskind, J. M., and Littwin, E. LiDAR: Sensing linear probing performance in joint embedding SSL architectures. ICLR, 2024.
- [62] Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. ECCV, 2020.
- [63] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv, 2023.
- [64] Valeriani, L., Doimo, D., Cuturello, F., Laio, A., Ansuini, A., and Cazzaniga, A. The geometry of hidden representations of large transformer models. NeurIPS, 2023.
- [65] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. NeurIPS, 2017.
- [66] Voita, E., Sennrich, R., and Titov, I. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. EMNLP-IJCNLP, 2019.
- [67] Wei, L., Tan, Z., Li, C., Wang, J., and Huang, W. Diff-eRank: A novel rank-based metric for evaluating large language models. NeurIPS, 2024.
- [68] Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. ICLR, 2024.
- [69] Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv, 2024.
- [70] Zhao, Z., Ziser, Y., and Cohen, S. B. Layer by layer: Uncovering where multi-task learning happens in instruction-tuned large language models. EMNLP-IJCNLP, 2024.
- [71] Zhouyin, Z. and Liu, D. Understanding neural networks with logarithm determinant entropy estimator. arXiv, 2021.
- [72] Székely, G. J., Rizzo, M. L., and Bakirov, N. K. 2008.
- [73] Training large language models to reason in a continuous latent space (see [30]).
- [74] Qwen2.5 technical report (see [69]).
- [75]
- [76] Wainwright, M. J. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019.
- [77] Attention is all you need (see [65]).
- [78] AI Medical Chatbot dataset.
- [79] Pythia: A suite for analyzing large language models across training and scaling (see [8]).
- [80] BERT: Pre-training of deep bidirectional transformers for language understanding (see [19]).