Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity
Pith reviewed 2026-05-15 05:01 UTC · model grok-4.3
The pith
Cosine similarity can be arbitrarily low for a layer that is still essential to an LLM's performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our theoretical analysis shows that a layer can exhibit an arbitrarily low cosine similarity score while still being crucial to the model's performance. Empirical evidence from a range of LLMs confirms that the correlation between cosine similarity and actual performance degradation is often weak or moderate, leading to misleading interpretations of a transformer's internal mechanisms. We propose a more robust metric for assessing layer relevance: the actual drop in model accuracy resulting from the removal of a layer.
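One way to see how such a construction can work (a minimal illustration of the stated claim, not necessarily the paper's own proof): if a layer rotates the hidden state by an angle θ, the cosine similarity between its input and output is cos θ, which can be pushed anywhere in [-1, 1] while the layer remains invertible and load-bearing.

```latex
% Minimal illustration (our sketch, not the paper's construction):
% a layer acting as a rotation R_theta in a plane through h.
\[
  h' = R_\theta\, h
  \quad\Longrightarrow\quad
  \cos(h, h') \;=\; \frac{h^{\top} h'}{\lVert h \rVert \, \lVert h' \rVert}
             \;=\; \cos\theta ,
\]
% The similarity score is therefore arbitrary, yet a readout aligned with
% h' loses its signal entirely if the layer is skipped and receives h instead.
```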
What carries the argument
The accuracy-drop metric, obtained by removing one layer from the otherwise intact model and measuring the resulting change in task performance.
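A minimal sketch of that metric on a toy residual stack (the model, data, and task below are stand-ins, not the paper's experimental setup):

```python
# Sketch of the accuracy-drop metric: DeltaAcc_l = Acc(full model) - Acc(model without layer l).
# Toy PyTorch stack; model, data, and task are placeholders, not the paper's setup.
import torch
import torch.nn as nn

class TinyStack(nn.Module):
    def __init__(self, dim=32, n_layers=6, n_classes=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_layers)
        )
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x, skip=None):
        for i, layer in enumerate(self.layers):
            if i == skip:            # ablate layer i by skipping it outright
                continue
            x = x + layer(x)         # residual connection, as in a transformer block
        return self.head(x)

@torch.no_grad()
def accuracy(model, x, y, skip=None):
    return (model(x, skip=skip).argmax(-1) == y).float().mean().item()

torch.manual_seed(0)
x, y = torch.randn(512, 32), torch.randint(0, 4, (512,))
model = TinyStack()

base = accuracy(model, x, y)
drops = {l: base - accuracy(model, x, y, skip=l) for l in range(len(model.layers))}
print(sorted(drops.items(), key=lambda kv: -kv[1]))  # layers ranked by measured impact
```

In the paper's setting the model would be a pretrained LLM and the evaluation a held-out benchmark; the mechanics of the metric are the same.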
If this is right
- Pruning decisions based on cosine similarity alone can retain unimportant layers or discard essential ones.
- Mechanistic interpretability studies that rely on similarity measures may miss layers that drive actual capability.
- Lightweight model construction improves when layers are ranked by measured performance impact rather than similarity.
- Intervention methods such as targeted removal become preferable to passive similarity checks for assessing component importance.
Where Pith is reading between the lines
- The accuracy-drop test could be applied to smaller units such as individual attention heads or MLP blocks (see the sketch after this list).
- For models too large for repeated full evaluations, cheap surrogate tests that approximate the drop would be valuable next steps.
- Existing pruning literature that used cosine similarity may need re-examination with the performance-based ranking.
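A hypothetical sketch of the first bullet, using a toy attention block of our own design (the paper ablates whole layers, not heads):

```python
# Hypothetical finer-grained ablation: a toy self-attention block in which any
# single head can be zeroed before the output projection. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySelfAttention(nn.Module):
    def __init__(self, dim=32, n_heads=4):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, dead_head=None):               # x: (batch, seq, dim)
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v)  # (b, heads, s, head_dim)
        if dead_head is not None:
            attn = attn.clone()
            attn[:, dead_head] = 0.0                    # ablate one head's contribution
        return self.out(attn.transpose(1, 2).reshape(b, s, d))
```

Sweeping dead_head over all indices and recording the accuracy drop ranks heads the same way the layer-level test ranks layers.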
Load-bearing premise
That removing a single layer and measuring accuracy drop on held-out tasks gives a faithful picture of that layer's contribution inside the intact model without major compensatory effects from remaining layers.
What would settle it
Finding a consistently strong correlation (Pearson's |r| above 0.7) between cosine similarity scores and accuracy drops across diverse LLMs and tasks would challenge the claim that cosine similarity is a poor proxy.
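The settling test itself is a single correlation over per-layer scores; a sketch, with placeholder numbers standing in for measured values:

```python
# Sketch of the settling test: correlate per-layer cosine similarity with the
# per-layer accuracy drop. The arrays are placeholders, not measured results.
import numpy as np
from scipy.stats import pearsonr

cos_sim  = np.array([0.98, 0.95, 0.40, 0.97, 0.92])   # cos(layer input, layer output)
acc_drop = np.array([0.01, 0.02, 0.35, 0.01, 0.03])   # base accuracy - ablated accuracy

r, p = pearsonr(cos_sim, acc_drop)
print(f"Pearson r = {r:+.2f} (p = {p:.3f})")
# If similarity tracked importance, |r| would stay high (here it is strongly
# negative: low similarity coincides with large drops). The paper's claim is
# challenged only if that holds consistently across models and tasks.
```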
Original abstract
Large language models (LLMs) have revolutionized natural language processing. Understanding their internal mechanisms is crucial for developing more interpretable and optimized architectures. Mechanistic interpretability has led to the development of various methods for assessing layer relevance, with cosine similarity being a widely used tool in the field. In this work, we demonstrate that cosine similarity is a poor proxy for the actual performance degradation caused by layer removal. Our theoretical analysis shows that a layer can exhibit an arbitrarily low cosine similarity score while still being crucial to the model's performance. Moreover, empirical evidence from a range of LLMs confirms that the correlation between cosine similarity and actual performance degradation is often weak or moderate, leading to misleading interpretations of a transformer's internal mechanisms. We propose a more robust metric for assessing layer relevance: the actual drop in model accuracy resulting from the removal of a layer. Even though it is a computationally costly metric, this approach offers a more accurate picture of layer importance, allowing for more informed pruning strategies and lightweight models. Our findings have significant implications for the development of interpretable LLMs and highlight the need to move beyond cosine similarity in assessing layer relevance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that cosine similarity is a poor proxy for layer relevance in LLMs. It presents a theoretical construction showing that a layer can have arbitrarily low cosine similarity yet remain crucial to performance, and reports empirical evidence of only weak or moderate correlation between cosine similarity and the accuracy drop caused by removing that layer. The authors propose the accuracy drop upon single-layer removal as a more faithful (though expensive) metric for assessing importance and guiding pruning.
Significance. If the central claim holds, the work would usefully caution against over-reliance on cosine similarity in mechanistic interpretability and motivate more direct ablation-based diagnostics for layer importance. The theoretical counter-example is a clear strength, and the suggestion of a performance-based metric has practical value for model compression. However, the significance is limited by the unexamined assumption that single-layer ablation faithfully isolates a layer’s contribution without compensatory effects from the remaining network.
Major comments (2)
- [Empirical evidence] The central empirical claim—that cosine similarity correlates only weakly with actual layer importance—rests on treating post-removal accuracy drop as the unbiased ground-truth metric. The manuscript does not discuss or control for the possibility that remaining layers can compensate for the removed layer, which is a known concern in overparameterized transformers; this assumption is load-bearing for the reported weak correlation.
- [Theoretical analysis] The theoretical analysis constructs a case of low cosine similarity yet high importance, but the argument only demonstrates that cosine similarity can be misleading if the ablation metric is accepted as faithful. No quantitative bound or example is given showing how large the discrepancy can be under realistic transformer dynamics.
Minor comments (2)
- [Abstract] The abstract and introduction should explicitly state the models, datasets, and number of layers tested in the empirical section to allow readers to assess the scope of the weak-correlation claim.
- [Methods] Notation for the proposed accuracy-drop metric should be introduced formally (e.g., as ΔAcc_l) and distinguished from cosine similarity in all equations and figures; one possible form is sketched below.
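One concrete form the requested notation could take (our suggestion, not the authors' final choice):

```latex
% Suggested notation distinguishing the two per-layer quantities.
\[
  \Delta\mathrm{Acc}_{\ell}
    \;=\; \mathrm{Acc}(\mathcal{M}) \;-\; \mathrm{Acc}\big(\mathcal{M}_{\setminus \ell}\big),
  \qquad
  \mathrm{sim}_{\ell}
    \;=\; \frac{\big\langle h^{(\ell-1)},\, h^{(\ell)} \big\rangle}
               {\lVert h^{(\ell-1)} \rVert \, \lVert h^{(\ell)} \rVert},
\]
```

where M∖ℓ denotes the model with layer ℓ removed and h^(ℓ) is the hidden state after layer ℓ.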
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify key assumptions underlying our claims. We address each major point below and outline targeted revisions to improve clarity and rigor.
Point-by-point responses
Referee: [Empirical evidence] The central empirical claim—that cosine similarity correlates only weakly with actual layer importance—rests on treating post-removal accuracy drop as the unbiased ground-truth metric. The manuscript does not discuss or control for the possibility that remaining layers can compensate for the removed layer, which is a known concern in overparameterized transformers; this assumption is load-bearing for the reported weak correlation.
Authors: We agree that compensatory effects among remaining layers constitute a genuine limitation when interpreting accuracy drop as an isolated measure of layer importance. The original manuscript did not explicitly address this issue. In the revision we will add a new paragraph in the Discussion section that acknowledges this possibility, references prior work on ablation compensation in overparameterized networks, and clarifies that the accuracy-drop metric is advanced as a more direct performance-based alternative to cosine similarity rather than an absolute ground truth. We will also note that future multi-layer ablation studies could further isolate contributions. revision: partial
Referee: [Theoretical analysis] The theoretical analysis constructs a case of low cosine similarity yet high importance, but the argument only demonstrates that cosine similarity can be misleading if the ablation metric is accepted as faithful. No quantitative bound or example is given showing how large the discrepancy can be under realistic transformer dynamics.
Authors: The theoretical construction is deliberately general, showing that cosine similarity can be made arbitrarily low while a layer remains essential to the output, without depending on specific transformer dynamics. We accept that this does not supply quantitative bounds or realistic-dynamics examples. The revised manuscript will include a short numerical illustration using synthetic linear layers to demonstrate the scale of possible discrepancy, and will explicitly state that the argument establishes the possibility of failure rather than providing bounds for all practical cases. revision: partial
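For scale, the promised illustration could be as small as this (our sketch with a synthetic rotation layer, not the authors' planned revision):

```python
# Synthetic linear-layer illustration: a layer whose output is orthogonal to its
# input (cosine similarity ~ 0) yet whose removal erases the readout's signal.
import numpy as np

rng = np.random.default_rng(0)
d = 64
h = rng.normal(size=d)

# Build a 90-degree rotation in a plane containing h: R maps u -> v, v -> -u.
u = h / np.linalg.norm(h)
v = rng.normal(size=d)
v -= (v @ u) * u
v /= np.linalg.norm(v)
R = np.eye(d) - np.outer(u, u) - np.outer(v, v) + np.outer(v, u) - np.outer(u, v)

h_out = R @ h                                  # layer output, orthogonal to h
w = h_out / np.linalg.norm(h_out)              # readout aligned with the output

cos_sim = (h @ h_out) / (np.linalg.norm(h) * np.linalg.norm(h_out))
print(f"cosine(input, output) = {cos_sim:+.3f}")    # ~ 0.000
print(f"readout with layer    = {w @ h_out:+.3f}")  # large
print(f"readout without layer = {w @ h:+.3f}")      # ~ 0.000
```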
Circularity Check
No significant circularity; ablation metric is independently defined
Full rationale
The paper defines its proposed layer-relevance metric directly as the observed accuracy drop after single-layer removal, an empirical quantity measured on held-out tasks and not constructed from cosine similarity, fitted parameters, or prior self-citations. The theoretical claim constructs an explicit counter-example showing arbitrarily low cosine similarity can coexist with high importance, without any definitional loop. Empirical correlations are then computed between cosine similarity and this external ablation drop; no step renames a known result, imports a uniqueness theorem from the authors' prior work, or smuggles an ansatz via citation. The derivation chain is therefore self-contained against the chosen ground-truth metric.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Transformer layers can be removed individually without retraining while still allowing meaningful accuracy measurement.