A general tensor-structured compression scheme for efficient large language models

Gang Su; Pan Zhang; Peng-Fei Zhou; Qi-Xuan Fang; Shi-Ju Ran; Ying Lu

arxiv: 2605.25344 · v1 · pith:NB5VLRMZnew · submitted 2026-05-25 · 💻 cs.CL · cs.AI · cs.LG · quant-ph

Ying Lu, Peng-Fei Zhou, Qi-Xuan Fang, Pan Zhang, Shi-Ju Ran, Gang Su

Pith reviewed 2026-06-29 22:57 UTC · model grok-4.3

keywords: tensor compression, large language models, model compression, tensor operators, LLM efficiency, linear layer replacement, parameter reduction, entropy geometry transition

The pith

MixT replaces dense linear layers in LLMs with mixtures of tensor operators, largely preserving MMLU accuracy up to model-specific boundaries; at the LLaMA2-7B boundary, the parameter count drops by 47.5%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MixT as a general compression method that substitutes targeted dense linear projections in Transformer-based models with mixtures of tensor operators that run natively. Evaluation on Qwen3-8B and LLaMA2-7B reveals a wide regime of compression where MMLU accuracy stays largely intact, followed by an abrupt performance transition. This transition aligns with simultaneous changes in output entropy, prediction entropy, and inter-layer geometry. At the LLaMA2-7B boundary the approach delivers 47.5% fewer parameters, 37.1% lower inference FLOPs, 52.1% lower training FLOPs, and 60.4% less peak inference memory. The scheme operates on generic linear layers rather than architecture-specific modules, opening the possibility of broader use.

Core claim

MixT replaces targeted dense linear layers with natively executable mixtures of tensor operators. On Qwen3-8B and LLaMA2-7B a broad compressible regime exists in which MMLU accuracy is largely preserved before an abrupt transition at model-specific boundaries; the transition coincides with coordinated shifts in output entropy, prediction entropy and inter-layer geometry. At the LLaMA2-7B boundary this yields 47.5% parameter reduction, 37.1% inference FLOP reduction, 52.1% training FLOP reduction and 60.4% lower peak inference memory.
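The four percentages are easier to compare once converted into remaining fractions and equivalent shrink factors. The short calculation below does only that arithmetic on the reported LLaMA2-7B boundary figures; the function names are illustrative, and no baseline sizes are assumed.

```python
# Reported reductions at the LLaMA2-7B transition boundary (percent).
reported_reduction_pct = {
    "parameters": 47.5,
    "inference FLOPs": 37.1,
    "training FLOPs": 52.1,
    "peak inference memory": 60.4,
}


def remaining_fraction(reduction_pct):
    """Fraction of the original cost that is left after the reduction."""
    return 1.0 - reduction_pct / 100.0


def shrink_factor(reduction_pct):
    """How many times smaller the compressed cost is than the original."""
    return 1.0 / remaining_fraction(reduction_pct)


summary = {
    name: (remaining_fraction(pct), shrink_factor(pct))
    for name, pct in reported_reduction_pct.items()
}

for name, (left, factor) in summary.items():
    print(f"{name}: {left:.3f} of original remains, about {factor:.2f}x smaller")
```

Read this way, peak inference memory shrinks by roughly 2.5x, while inference FLOPs shrink by roughly 1.6x, so the memory saving is the larger of the two inference-side gains.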

What carries the argument

MixT, the tensor mixture that replaces dense linear projections with combinations of tensor operators.
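The material above does not say which tensor operators MixT mixes. As a minimal sketch, assuming each operator is a Kronecker product of two small matrices and the mixture is a weighted sum, the code below shows how such a replacement could be applied without forming the dense matrix, and how the parameter count compares. The names `kron_mixture_apply`, `param_counts`, `factors` and `weights` are hypothetical, and this is not claimed to be the authors' construction.

```python
import numpy as np


def kron_mixture_apply(x, factors, weights):
    """Compute y = sum_k w_k (A_k kron B_k) x without building the dense matrix.

    Uses the identity (A kron B) vec(X) = vec(A X B^T) for row-major vec.
    Each A_k has shape (m1, n1) and each B_k has shape (m2, n2).
    """
    n1 = factors[0][0].shape[1]
    n2 = factors[0][1].shape[1]
    X = x.reshape(n1, n2)
    Y = sum(w * (A @ X @ B.T) for w, (A, B) in zip(weights, factors))
    return Y.reshape(-1)


def param_counts(m1, m2, n1, n2, k):
    """Parameters of a dense (m1*m2) x (n1*n2) layer vs. a k-term mixture."""
    dense = (m1 * m2) * (n1 * n2)
    mixture = k * (m1 * n1 + m2 * n2) + k  # factor entries plus mixture weights
    return dense, mixture


# Small random check against the explicitly built dense matrix.
rng = np.random.default_rng(0)
m1, m2, n1, n2, k = 3, 4, 5, 2, 3
factors = [
    (rng.standard_normal((m1, n1)), rng.standard_normal((m2, n2)))
    for _ in range(k)
]
weights = rng.standard_normal(k)
x = rng.standard_normal(n1 * n2)

y_fast = kron_mixture_apply(x, factors, weights)
W_dense = sum(w * np.kron(A, B) for w, (A, B) in zip(weights, factors))
y_dense = W_dense @ x

print(np.allclose(y_fast, y_dense))
print(param_counts(64, 64, 64, 64, 8))
```

Under these assumptions, a 4096-by-4096 projection factored as 64 x 64 blocks with eight mixture terms stores tens of thousands of numbers instead of millions. Whether accuracy survives such a replacement is exactly what the paper's recovery protocol and sweep are meant to establish.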

If this is right

Accuracy on MMLU remains largely preserved throughout the compressible regime before the transition boundary.
Full-model parameter counts, inference FLOPs, training FLOPs and peak inference memory all decrease substantially once the identified boundary is reached.
Because the method targets generic linear projections it extends in principle to other Transformer-based LLMs without model-specific redesign.
The coordinated entropy and geometry shifts supply a detectable marker for locating the transition boundary on new models.
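The abstract does not define output entropy or prediction entropy precisely. As one plausible reading, the sketch below computes the Shannon entropy of a softmax distribution over logits, which is the kind of quantity that could be tracked across compression levels. The function names and the toy logits are illustrative assumptions, not values from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Toy logits (made up): an undecided distribution and a confident one.
flat_entropy = shannon_entropy(softmax([0.0, 0.0, 0.0, 0.0]))
peaked_entropy = shannon_entropy(softmax([10.0, 0.0, 0.0, 0.0]))

print(flat_entropy, peaked_entropy)
```

A uniform distribution over four outcomes gives the maximum entropy, ln 4, and a sharply peaked one gives a value near zero, so a sudden change in this quantity along a compression sweep would be straightforward to detect.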

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the entropy-geometry marker generalizes, it could serve as an early diagnostic for compressibility during pre-training rather than after full evaluation.
The same replacement of linear layers might be applied selectively inside individual blocks to create hybrid dense-compressed models that trade accuracy for speed in a graded way.
Because the operators are natively executable, the compression could reduce the cost of both training and inference on the same hardware without requiring custom kernels.

Load-bearing premise

The abrupt performance transition observed on the two evaluated models coincides with coordinated shifts in output entropy, prediction entropy and inter-layer geometry, and this boundary can be reliably identified in advance for practical use.

What would settle it

Apply the same recovery protocol and boundary detection to a third LLM, such as a model of a different scale. If accuracy drops immediately, rather than staying stable up to the predicted entropy-geometry transition point, the claimed compressible regime is falsified.

Original abstract

Large language models (LLMs) are dominated by dense linear transformations, whose storage, memory and computational overheads hinder efficient adaptation and deployment while masking the functional impacts of structural simplification. Here we present Tensor Mixture (MixT), a general tensor-structured compression scheme that replaces targeted dense linear layers with natively executable mixtures of tensor operators. Operating directly on generic linear projections instead of model-specific components, MixT is potentially applicable across Transformer-based LLMs and other dense neural mappings. We evaluate MixT on Qwen3-8B and LLaMA2-7B under a unified recovery protocol, identifying a broad compressible regime in which MMLU accuracy is largely preserved before an abrupt transition at model-specific boundaries. This transition coincides with coordinated shifts in output entropy, prediction entropy and inter-layer geometry. At the LLaMA2-7B transition boundary, MixT reduces full-model parameters by 47.5%, inference FLOPs by 37.1%, training FLOPs by 52.1% and peak inference memory by 60.4%, demonstrating its practical potential for lower-cost LLM compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MixT reports solid compression numbers on LLaMA2-7B right where MMLU holds up, but locating that operating point appears to require the accuracy measurements the method is meant to avoid.

The main thing to know is that this paper puts numbers on a tensor-mixture replacement for linear layers and reports a 47.5% parameter cut, a 37.1% inference-FLOP drop, and a 60.4% peak-memory drop on LLaMA2-7B at the point where MMLU accuracy is still largely intact. The authors also run the same protocol on Qwen3-8B and note an abrupt drop after a compressible regime.

MixT works by swapping selected dense projections for mixtures of tensor operators that stay executable. The claim is that the approach is general rather than tied to attention or feed-forward specifics, which is the part that could travel to other models. The empirical section maps the compression ratio against accuracy and ties the transition to shifts in output entropy, prediction entropy, and layer geometry. Those observations are at least recorded on real models instead of synthetic cases.
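The layer-geometry measure is not specified in the material reviewed here. One widely used way to compare representations between layers is linear centered kernel alignment (CKA); the sketch below uses it purely as a stand-in to show what a geometry signal could look like. The function name and the random activations are assumptions for illustration, not the paper's metric or data.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (samples, features)."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)


# Synthetic activations (made up): a rotated, rescaled copy and an unrelated one.
rng = np.random.default_rng(0)
acts_a = rng.standard_normal((50, 8))
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # random orthogonal matrix
acts_rotated = 2.0 * acts_a @ Q
acts_unrelated = rng.standard_normal((50, 8))

same = linear_cka(acts_a, acts_rotated)
different = linear_cka(acts_a, acts_unrelated)

print(same, different)
```

Linear CKA is invariant to rotation and uniform rescaling, so the rotated copy scores 1 up to rounding while unrelated activations score lower. A sharp change in such a score between adjacent layers along a compression sweep is one form a geometry shift could take.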

The soft spot is the boundary itself. The abstract and stress-test note both indicate the transition is model-specific and identified where performance changes. If finding the right ratio still needs sweeps or downstream checks on the target model, then the practical cost saving shrinks because you pay the evaluation price upfront. Nothing in the description turns the entropy or geometry signals into a parameter-free predictor that would let you pick the ratio without running the model first.
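To make the cost concrete, here is a minimal sketch of the sweep-based procedure the concern describes: evaluate accuracy at several compression ratios, then take the last ratio before the largest drop. The function `locate_boundary` and the sweep values are hypothetical; the numbers are made up for illustration and are not results from the paper.

```python
def locate_boundary(sweep):
    """Return (boundary_ratio, drop) for the largest accuracy drop in a sweep.

    sweep: iterable of (compression_ratio, accuracy) pairs.
    boundary_ratio is the last ratio measured before the largest drop.
    """
    pts = sorted(sweep)
    if len(pts) < 2:
        raise ValueError("need at least two sweep points")
    drops = [
        (pts[i][1] - pts[i + 1][1], pts[i][0])
        for i in range(len(pts) - 1)
    ]
    largest_drop, boundary_ratio = max(drops)
    return boundary_ratio, largest_drop


# Made-up sweep: accuracy is flat, then collapses between ratios 0.4 and 0.5.
sweep = [
    (0.1, 0.60), (0.2, 0.60), (0.3, 0.59),
    (0.4, 0.58), (0.5, 0.30), (0.6, 0.26),
]
boundary, drop = locate_boundary(sweep)

print(boundary, round(drop, 2))
```

Every point in `sweep` stands for a full compress, recover and evaluate cycle on the target model, which is the upfront price in question. A predictor built from entropy or geometry signals would have to replace most of those points to change the picture.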

The citation pattern looks standard for the area and the numbers are given with concrete percentages rather than vague claims. The work is honest about the limits it hits.

This is for groups already running tensor decompositions or LLM serving experiments. It has enough concrete results on standard models to justify sending it to referees, though the boundary selection procedure needs clearer description before the method can be used without extra measurement cost.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Tensor Mixture (MixT), a general compression scheme that replaces selected dense linear projections in Transformer-based LLMs with natively executable mixtures of tensor operators. Evaluated under a unified recovery protocol on Qwen3-8B and LLaMA2-7B, the work identifies a broad compressible regime in which MMLU accuracy remains largely preserved, followed by an abrupt performance transition at model-specific boundaries; these boundaries are stated to coincide with coordinated shifts in output entropy, prediction entropy, and inter-layer geometry. At the LLaMA2-7B boundary the method reports 47.5% parameter reduction, 37.1% inference-FLOP reduction, 52.1% training-FLOP reduction, and 60.4% peak-inference-memory reduction.

Significance. If an a-priori, parameter-free predictor of the transition boundary exists and the entropy/geometry coincidence can be turned into a practical diagnostic, MixT would constitute a genuinely model-agnostic compression technique with substantial efficiency gains across LLMs. The reported reductions at the boundary are large enough to be practically relevant provided the identification cost does not offset them.

major comments (2)

[Abstract] The transition boundary is described as coinciding with entropy and geometry shifts and as 'model-specific,' yet no independent, pre-compression criterion or derivation is supplied for locating it; if identification requires sweeping compression ratios and measuring downstream accuracy on the target model, the scheme cannot be deployed on a new LLM without incurring the evaluation cost the compression is intended to avoid.
[Abstract] The claims of 'largely preserved' MMLU accuracy in the compressible regime and of an 'abrupt transition' are presented without quantitative scores, error bars, number of runs, or protocol details, so the width of the regime and the sharpness of the boundary cannot be assessed from the reported evidence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. Below we provide point-by-point responses to the major comments.

Referee: [Abstract] The transition boundary is described as coinciding with entropy and geometry shifts and as 'model-specific,' yet no independent, pre-compression criterion or derivation is supplied for locating it; if identification requires sweeping compression ratios and measuring downstream accuracy on the target model, the scheme cannot be deployed on a new LLM without incurring the evaluation cost the compression is intended to avoid.

Authors: We acknowledge that the manuscript does not supply an independent, pre-compression criterion or derivation for locating the transition boundary. The boundaries are identified empirically by sweeping compression ratios and evaluating downstream MMLU accuracy under the unified protocol. The noted coincidence with entropy and geometry shifts is observational rather than predictive. As such, applying the method to a new LLM incurs an identification cost through evaluation. This is a genuine limitation for fully model-agnostic, zero-cost deployment. The compression scheme itself is general, but the boundary location remains model-specific and empirically determined.
Revision: no

Referee: [Abstract] The claims of 'largely preserved' MMLU accuracy in the compressible regime and of an 'abrupt transition' are presented without quantitative scores, error bars, number of runs, or protocol details, so the width of the regime and the sharpness of the boundary cannot be assessed from the reported evidence.

Authors: The abstract is a concise summary and does not include the quantitative details. However, the full manuscript provides MMLU accuracy scores for the compressible regime, the location and sharpness of the transition, error bars from repeated runs, and complete protocol specifications in the experiments section. To address this concern directly in the abstract, we will add specific quantitative values and protocol references in the revised version.
Revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivation self-contained

The abstract and description present MixT as an empirical compression method with performance measured at an observed transition boundary on two models, validated against external MMLU accuracy. No equations, fitted parameters called predictions, self-citations, or ansatzes are quoted that would reduce any claimed result to its own inputs by construction. The boundary is described via observed coincidences with entropy and geometry shifts rather than derived from self-defined quantities, and reductions are reported as direct measurements, leaving the claims independent of circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms or invented entities are stated. The method implicitly assumes that tensor-operator mixtures can be chosen to approximate the original linear maps without additional learned parameters beyond the mixture weights.


[1] [1]

& Zettlemoyer, L

Dettmers, T., Lewis, M., Belkada, Y. & Zettlemoyer, L. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. In Koyejo, S.et al.(eds.)Advances in Neu- ral Information Processing Systems, vol. 35, 30318–30332 (Curran Associates, Inc., 2022). https://proceedings.neurips.cc/paper_files /paper/2022/file/c3ba4962c05c49636d4c62 06a97e9c8a-Paper-Co...

2022

[2] [2]

In Koyejo, S.et al.(eds.)Advances in Neural Information Processing Systems, vol

Hoffmann, J.et al.An empirical analysis of compute-optimal large language model train- ing. In Koyejo, S.et al.(eds.)Advances in Neural Information Processing Systems, vol. 35, 30016–30030 (Curran Associates, Inc., 2022). https://proceedings.neurips.cc/p aper_files/paper/2022/file/c1e2faff6f58887 0935f114ebe04a3e5-Paper-Conference.pdf

2022

[3] [3]

ht tp://dx.doi.org/10.1038/s42256-025-01137 -0

Xiao, C.et al.Densing law of llms.Nature Machine Intelligence7, 1823–1833 (2025). ht tp://dx.doi.org/10.1038/s42256-025-01137 -0

work page doi:10.1038/s42256-025-01137 2025

[4] [4]

http://dx.doi.org/10.1038/s4225 6-023-00626-4

Ding, N.et al.Parameter-efficient fine- tuning of large-scale pre-trained language models.Nature Machine Intelligence5, 220– 235 (2023). http://dx.doi.org/10.1038/s4225 6-023-00626-4

work page doi:10.1038/s4225 2023

[5] [5]

In Gibbons, P., Pekhi- menko, G

Lin, J.et al.Awq: Activation-aware weight quantization for on-device llm compression and acceleration. In Gibbons, P., Pekhi- menko, G. & Sa, C. D. (eds.)Proceedings of Machine Learning and Systems, vol. 6, 87– 100 (2024). https://proceedings.mlsys.org/ paper_files/paper/2024/file/42a452cbafa9 dd64e9ba4aa95cc1ef21-Paper-Conference.pd f

2024

[6] [6]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Frantar, E., Ashkboos, S., Hoefler, T. & Alis- tarh,D. Gptq:Accuratepost-trainingquanti- zationforgenerativepre-trainedtransformers (2023). https://arxiv.org/abs/2210.17323. 2210.17323

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [7]

In Krause, A.et al.(eds.) Proceedings of the 40th International Con- ference on Machine Learning, vol

Xiao, G.et al.SmoothQuant: Accurate and efficient post-training quantization for large language models. In Krause, A.et al.(eds.) Proceedings of the 40th International Con- ference on Machine Learning, vol. 202 of Proceedings of Machine Learning Research, 38087–38099 (PMLR, 2023). https://procee dings.mlr.press/v202/xiao23c.html

2023

[8] [8]

& Zettlemoyer, L

Dettmers, T., Pagnoni, A., Holtzman, A. & Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. In Oh, A.et al.(eds.) Advances in Neural Information Processing Systems, vol. 36, 10088–10115 (Curran Asso- ciates, Inc., 2023). https://proceedings.neur ips.cc/paper_files/paper/2023/file/1feb8787 1436031bdc0f2beaa62a049b-Paper-Confere nce.pdf

2023

[9] [9]

& De Sa, C

Chee, J., Cai, Y., Kuleshov, V. & De Sa, C. M. Quip: 2-bit quantization of large lan- guage models with guarantees. In Oh, A. et al.(eds.)Advances in Neural Information Processing Systems, vol. 36, 4396–4429 (Cur- ran Associates, Inc., 2023). https://proceedi ngs.neurips.cc/paper_files/paper/2023/file/ 0df38cd13520747e1e64e5b123a78ef8-Paper-C onference.pdf

2023

[10] [10]

& Alistarh, D

Frantar, E. & Alistarh, D. SparseGPT: Massive language models can be accurately pruned in one-shot. In Krause, A.et al. (eds.)Proceedings of the 40th International Conference on Machine Learning, vol. 202 of Proceedings of Machine Learning Research, 10323–10337 (PMLR, 2023). https://procee dings.mlr.press/v202/frantar23a.html

2023

[11] [11]

A Simple and Effective Pruning Approach for Large Language Models

Sun, M., Liu, Z., Bair, A. & Kolter, J. Z. A simple and effective pruning approach for large language models (2024). https://arxiv. org/abs/2306.11695. 2306.11695

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Ma,X., Fang,G. &Wang,X. Llm-pruner:On the structural pruning of large language mod- els. In Oh, A.et al.(eds.)Advances in Neu- ral Information Processing Systems, vol. 36, 21702–21720 (Curran Associates, Inc., 2023). https://proceedings.neurips.cc/paper_files 10 /paper/2023/file/44956951349095f74492a547 1128a7e0-Paper-Conference.pdf

work page arXiv 2023

[13] [13]

L., do Nascimento, M

Ashkboos, S., Croci, M. L., do Nascimento, M. G., Hoefler, T. & Hensman, J. Slicegpt: Compress large language models by deleting rows and columns (2024). https://arxiv.org/ abs/2401.15024. 2401.15024

work page arXiv 2024

[14] [14]

Sheared llama: Accelerating language model pre-training via structured pruning

Xia, M., Gao, T., Zeng, Z. & Chen, D. Sheared llama: Accelerating language model pre-trainingviastructuredpruning(2024). ht tps://arxiv.org/abs/2310.06694. 2310.06694

work page arXiv 2024

[15] [15]

ht tps://arxiv.org/abs/2403.19135

Chen, X.et al.Streamlining redundant layers to compress large language models (2025). ht tps://arxiv.org/abs/2403.19135. 2403.19135

work page arXiv 2025

[16] [16]

& Roberts, D

Gromov, A., Tirumala, K., Shapourian, H., Glorioso, P. & Roberts, D. A. The unrea- sonable ineffectiveness of the deeper layers (2025). https://arxiv.org/abs/2403.17887. 2403.17887

work page arXiv 2025

[17] [17]

& Kawaguchi, K

Zhang, Y., Dong, Y. & Kawaguchi, K. Inves- tigating layer importance in large language models. In Belinkov, Y.et al.(eds.)Pro- ceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 469–479 (Association for Computa- tional Linguistics, Miami, Florida, US, 2024). https://aclanthology.org/2024.blackboxnl p-1.29/

2024

[18] [18]

In Che, W., Nabende, J., Shutova, E

Men, X.et al.ShortGPT: Layers in large lan- guage models are more redundant than you expect. In Che, W., Nabende, J., Shutova, E. &Pilehvar,M.T.(eds.)Findings of the Asso- ciation for Computational Linguistics: ACL 2025, 20192–20204 (Association for Compu- tational Linguistics, Vienna, Austria, 2025). https://aclanthology.org/2025.findings-acl.1 035/

2025

[19] [19]

http://dx.doi.org/10.1038/s 41467-025-65518-0

Goldstein, A.et al.Temporal structure of natural language processing in the human brain corresponds to layered hierarchy of large language models.Nature Communica- tions16(2025). http://dx.doi.org/10.1038/s 41467-025-65518-0

work page doi:10.1038/s 2025

[20]

Zhu, X., Li, J., Liu, Y., Ma, C. & Wang, W. A survey on model compression for large language models. Transactions of the Association for Computational Linguistics 12, 1556–1577 (2024). https://aclanthology.org/2024.tacl-1.85/

[21]

Vaswani, A. et al. Attention is all you need. In Guyon, I. et al. (eds.) Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Inc., 2017). https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

[22]

Orús, R. A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics 349, 117–158 (2014). https://www.sciencedirect.com/science/article/pii/S0003491614001596

[23]

Ran, S.-J. & Su, G. Tensor networks for interpretable and efficient quantum-inspired machine learning. Intelligent Computing 2, 0061 (2023). https://spj.science.org/doi/abs/10.34133/icomputing.0061. https://spj.science.org/doi/pdf/10.34133/icomputing.0061

[24]

Kolda, T. G. & Bader, B. W. Tensor decompositions and applications. SIAM Review 51, 455–500 (2009). https://doi.org/10.1137/07070111X

[25]

Novikov, A., Podoprikhin, D., Osokin, A. & Vetrov, D. P. Tensorizing neural networks. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M. & Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28 (Curran Associates, Inc., 2015). https://proceedings.neurips.cc/paper_files/paper/2015/file/6855456e2fe46a9d49d3d3af4f57443d-Paper.pdf

[26]

Stoudenmire, E. & Schwab, D. J. Supervised learning with tensor networks. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I. & Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29 (Curran Associates, Inc., 2016). https://proceedings.neurips.cc/paper_files/paper/2016/file/5314b9674c86e3f9d1ba25ef9bb32895-Paper.pdf

[27]

Ma, X. et al. A tensorized transformer for language modeling. In Wallach, H. et al. (eds.) Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Inc., 2019). https://proceedings.neurips.cc/paper_files/paper/2019/file/dc960c46c38bd16e953d97cdeefdbc68-Paper.pdf

[28]

Hrinchuk, O., Khrulkov, V., Mirvakhabova, L., Orlova, E. & Oseledets, I. Tensorized embedding layers. In Cohn, T., He, Y. & Liu, Y. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2020, 4847–4860 (Association for Computational Linguistics, Online, 2020). https://aclanthology.org/2020.findings-emnlp.436/

[29]

Qing, Y., Li, K., Zhou, P.-F. & Ran, S.-J. Compressing neural networks using tensor networks with exponentially fewer variational parameters. Intelligent Computing 4, 0123 (2025). https://spj.science.org/doi/abs/10.34133/icomputing.0123. https://spj.science.org/doi/pdf/10.34133/icomputing.0123

[30]

Yang, Y., Zhou, J., Wong, N. & Zhang, Z. LoRETTA: Low-rank economic tensor-train adaptation for ultra-low-parameter fine-tuning of large language models. In Duh, K., Gomez, H. & Bethard, S. (eds.) Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Lon...

[31]

Javanmard, Y., Pandit, T. & Mardani, M. Compressing transformer language models via matrix product operator decomposition: A case study on picogpt (2026). https://arxiv.org/abs/2603.28534. arXiv:2603.28534

[32]

Yang, A. et al. Qwen3 technical report (2025). https://arxiv.org/abs/2505.09388. arXiv:2505.09388

[33]

Touvron, H. et al. Llama 2: Open foundation and fine-tuned chat models (2023). https://arxiv.org/abs/2307.09288. arXiv:2307.09288

[34]

Hendrycks, D. et al. Measuring massive multitask language understanding. In International Conference on Learning Representations (2021). https://openreview.net/forum?id=d7KBjmI3GmQ

[35]

Raghavan, G. et al. Engineering flexible machine learning systems by traversing functionally invariant paths. Nature Machine Intelligence 6, 1179–1196 (2024). http://dx.doi.org/10.1038/s42256-024-00902-x

[36]

Biderman, S. et al. Lessons from the trenches on reproducible evaluation of language models (2024). https://arxiv.org/abs/2405.147

[37]

Pesce, D., He, Y.-H. & Caldarelli, G. Phase transitions in neural networks pruning (2026). https://arxiv.org/abs/2602.15224. arXiv:2602.15224

[38]

Kornblith, S., Norouzi, M., Lee, H. & Hinton, G. Similarity of neural network representations revisited. In Chaudhuri, K. & Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, vol. 97 of Proceedings of Machine Learning Research, 3519–3529 (PMLR, 2019). https://proceedings.mlr.press/v97/kornblith19a.html

[39]

Garrido, Q., Balestriero, R., Najman, L. & Lecun, Y. RankMe: Assessing the downstream performance of pretrained self-supervised representations by their rank. In Krause, A. et al. (eds.) Proceedings of the 40th International Conference on Machine Learning, vol. 202 of Proceedings of Machine Learning Research, 10929–10974 (PMLR, 2023). https://proceedings.ml...