NeuronMLP: Efficient LLM Inference via Singular Value Decomposition Compression and Tiling on AWS Trainium
Pith reviewed 2026-05-18 02:47 UTC · model grok-4.3
The pith
SVD compression of MLP layers with custom tiling delivers 1.35x kernel speedup and 1.21x end-to-end LLM inference speedup on Trainium at 0.05 compression ratio.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NeuronMLP applies singular value decomposition compression to MLP layers at a 0.05 ratio and pairs it with tiling, kernel fusion, and caching strategies tailored to Trainium's architecture. These changes reduce data movement across the memory hierarchy, maximize SRAM bandwidth, and avoid matrix transpose operations. On this basis the method records an average 1.35x speedup over the existing NKI-based matrix multiplication kernel at the kernel level, which produces an average 1.21x end-to-end inference speedup across the evaluated models and datasets.
What carries the argument
SVD compression of MLP weight matrices combined with tiling and kernel fusion that respects Trainium's systolic array layout and software-managed memory hierarchy to cut data movement.
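To make the mechanism concrete, here is a minimal sketch of the arithmetic behind replacing the dense product X W with the factored product (X U) V. It is not the authors' code. The function names, the example layer shape, and the reading of "compression ratio" as the fraction of parameters kept are all assumptions made for illustration; the summary above does not define the ratio, and some SVD compression work uses the opposite convention (fraction removed), so both readings are printed.

```python
def rank_for_kept_fraction(d_in: int, d_out: int, kept: float) -> int:
    """Largest rank r such that U (d_in x r) and V (r x d_out) together
    hold at most `kept` of the parameters of the dense W (d_in x d_out)."""
    if not 0.0 < kept <= 1.0:
        raise ValueError("kept must be in (0, 1]")
    r = int(kept * d_in * d_out / (d_in + d_out))
    return max(r, 1)


def dense_flops(m: int, d_in: int, d_out: int) -> int:
    """FLOPs (2 per multiply-add) for X (m x d_in) @ W (d_in x d_out)."""
    return 2 * m * d_in * d_out


def factored_flops(m: int, d_in: int, d_out: int, r: int) -> int:
    """FLOPs for (X @ U) @ V with U (d_in x r) and V (r x d_out)."""
    return 2 * m * r * (d_in + d_out)


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    m, d_in, d_out = 128, 4096, 11008
    for kept in (0.05, 0.95):  # the two possible readings of "0.05"
        r = rank_for_kept_fraction(d_in, d_out, kept)
        ratio = factored_flops(m, d_in, d_out, r) / dense_flops(m, d_in, d_out)
        print(f"kept={kept:.2f} rank={r} flop_ratio={ratio:.3f}")
```

The factored form only saves arithmetic when r is below d_in·d_out / (d_in + d_out). A FLOP ratio is also not a speedup: the review attributes the measured gains to reduced data movement as well, which this count does not model.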
If this is right
- MLP layers can be replaced by their compressed versions on Trainium while still producing usable inference results across multiple recent LLMs.
- Hardware-specific tiling and fusion reduce the cost of data movement enough to yield both kernel and end-to-end gains.
- Avoiding explicit matrix transpose through layout choices improves throughput on systolic-array accelerators.
- The same compression-plus-tiling pattern can be applied to other matrix-heavy kernels inside LLM inference pipelines on Trainium.
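The tiling-and-fusion idea in the list above can be sketched in plain Python. This is a stand-in under stated assumptions, not an NKI kernel and not the authors' implementation: `fused_low_rank_matmul` and `row_tile` are hypothetical names, and the sketch does not model the systolic array, SRAM capacity, or transpose avoidance. It only shows the structural point that the small intermediate (X U) can be produced and consumed per tile instead of being written out in full.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain reference matmul: (n x k) @ (k x p)."""
    k, p = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(k)) for j in range(p)]
            for row in a]


def fused_low_rank_matmul(x: Matrix, u: Matrix, v: Matrix,
                          row_tile: int = 2) -> Matrix:
    """Compute (x @ u) @ v one row tile at a time.

    The intermediate tile (row_tile x r) is consumed immediately and never
    stored as a full (m x r) matrix, which stands in for keeping it in
    on-chip memory between the two matmuls.
    """
    out: Matrix = []
    for start in range(0, len(x), row_tile):
        x_tile = x[start:start + row_tile]
        t_tile = matmul(x_tile, u)      # small intermediate, local to the tile
        out.extend(matmul(t_tile, v))   # second matmul consumes it right away
    return out
```

For any conforming shapes, the result should equal `matmul(matmul(x, u), v)`; only the order in which intermediate data is produced and discarded changes.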
Where Pith is reading between the lines
- Similar SVD-plus-tiling recipes could be tested on other systolic or dataflow accelerators that expose comparable memory hierarchies.
- The accuracy impact might be larger on tasks or domains outside the nine evaluation datasets, suggesting a need for targeted recovery techniques.
- The reported speedups assume the compression ratio stays fixed; varying the ratio per layer or model size could trade accuracy for further gains.
- If the caching strategy generalizes, it might reduce memory traffic in other multi-layer neural network workloads on Trainium.
Load-bearing premise
The SVD-compressed MLP layers keep acceptable accuracy on the nine test datasets without any fine-tuning or accuracy-recovery steps.
What would settle it
Measure the accuracy of the SVD-compressed models against the uncompressed baselines on the same nine datasets and six LLMs to check for unacceptable drops at the 0.05 compression ratio.
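One way to organize that comparison is sketched below. The function name and the `max_drop` tolerance are hypothetical, and no scores from the paper are used; the caller supplies per-dataset accuracies for one model before and after compression.

```python
from typing import Dict, List, Tuple


def accuracy_drops(baseline: Dict[str, float],
                   compressed: Dict[str, float],
                   max_drop: float) -> List[Tuple[str, float]]:
    """Return (dataset, drop) pairs whose accuracy drop exceeds max_drop.

    Both dicts map dataset name -> accuracy in [0, 1] for one model.
    `max_drop` is a tolerance the evaluator must choose; it is not a
    value taken from the paper.
    """
    missing = set(baseline) ^ set(compressed)
    if missing:
        raise ValueError(f"datasets not present in both runs: {sorted(missing)}")
    flagged = []
    for name in sorted(baseline):
        drop = baseline[name] - compressed[name]
        if drop > max_drop:
            flagged.append((name, drop))
    return flagged
```

Running this once per model, with the same nine datasets in both dicts, would yield the list of model and dataset pairs where the 0.05 setting costs more accuracy than the chosen tolerance.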
Original abstract
Emerging AI accelerators have started to gain attention and offer new opportunities for efficient inference of large language models (LLMs). Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM inference through its heterogeneous architecture. However, leveraging Trainium architecture for high performance can be challenging because of its systolic array architecture and special requirement on data layout. In this paper, we propose NeuronMLP, an efficient LLM inference method based on Singular Value Decomposition (SVD) compression and tiling on AWS Trainium. We introduce a series of techniques customized to Trainium based on kernel fusion and novel caching strategies to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transpose. The proposed method is specifically optimized for multi-layer perceptron (MLP) layers in LLMs, which serve as a critical computational kernel for inference on Trainium. Evaluating on nine datasets and six recent LLMs, we show that NeuronMLP significantly outperforms the state-of-the-art Neuron Kernel Interface (NKI)-based matrix multiplication (matmul) kernel implemented by AWS on Trainium: at the kernel level, it achieves an average 1.35x speedup, which translates to an average 1.21x speedup for end-to-end LLM inference, under a compression ratio of 0.05.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes NeuronMLP, an SVD-based compression and tiling approach for accelerating MLP layers in LLMs on AWS Trainium. It introduces kernel fusion, caching, and data-layout optimizations tailored to Trainium's systolic arrays and software-managed memory. The central empirical claim is that at a fixed compression ratio of 0.05, NeuronMLP delivers an average 1.35× kernel-level speedup over the AWS NKI matmul baseline, which translates to a 1.21× end-to-end inference speedup across six recent LLMs and nine datasets.
Significance. If the accuracy of the SVD-compressed models is shown to remain comparable to the uncompressed baselines, the work would offer a concrete, hardware-specific recipe for reducing inference latency on Trainium without requiring model fine-tuning. The direct timing measurements against an external production kernel and the multi-model, multi-dataset evaluation are positive attributes that would make the result useful to practitioners targeting this accelerator.
major comments (2)
- [Evaluation] Evaluation section: the headline speedups (1.35× kernel, 1.21× end-to-end) at compression ratio 0.05 are reported without any perplexity, accuracy, or zero-shot scores for the compressed models versus the original six LLMs. At this aggressive rank reduction, MLP layers are known to be accuracy-sensitive; the absence of these metrics leaves open the possibility that the observed speedups apply only to a lower-quality model, undermining the claim of practical efficient inference.
- [Abstract and §3] Abstract and §3: the compression ratio is fixed at 0.05 with no description of how it was selected, no sensitivity analysis across ratios, and no statement of whether accuracy-recovery steps (fine-tuning or calibration) were applied. This choice is load-bearing for the reported performance numbers.
minor comments (2)
- [Methods] Notation for the SVD rank and the resulting compression ratio should be defined explicitly in the methods section rather than only in the abstract.
- [Figures] Figure captions for the kernel-level and end-to-end timing plots should state the exact models, datasets, and batch sizes used in each bar.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below and describe the corresponding revisions.
- Referee: [Evaluation] Evaluation section: the headline speedups (1.35× kernel, 1.21× end-to-end) at compression ratio 0.05 are reported without any perplexity, accuracy, or zero-shot scores for the compressed models versus the original six LLMs. At this aggressive rank reduction, MLP layers are known to be accuracy-sensitive; the absence of these metrics leaves open the possibility that the observed speedups apply only to a lower-quality model, undermining the claim of practical efficient inference.
  Authors: We agree that the absence of accuracy and perplexity metrics is a significant omission. The manuscript as submitted emphasizes the kernel-level and end-to-end latency improvements but does not report model quality. In the revised version we will add a new table in the Evaluation section that reports perplexity on the nine datasets and zero-shot accuracy on standard benchmarks for both the original and SVD-compressed models at the 0.05 ratio. These measurements will be included so that readers can directly evaluate the quality-speedup trade-off.
  Revision: yes
- Referee: [Abstract and §3] Abstract and §3: the compression ratio is fixed at 0.05 with no description of how it was selected, no sensitivity analysis across ratios, and no statement of whether accuracy-recovery steps (fine-tuning or calibration) were applied. This choice is load-bearing for the reported performance numbers.
  Authors: The ratio 0.05 was selected after preliminary profiling experiments that identified it as the point at which Trainium-specific tiling and caching deliver substantial kernel speedups while the resulting model remains usable for inference. No fine-tuning or calibration was performed after the SVD decomposition. We acknowledge that the manuscript provides insufficient justification. In the revision we will expand §3 with (i) an explicit statement that no post-SVD recovery steps were used, (ii) a description of the profiling process that led to 0.05, and (iii) a sensitivity plot showing kernel speedup versus compression ratio over the range 0.01–0.20. This will make the parameter choice transparent.
  Revision: yes
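The arithmetic side of the promised sensitivity analysis can be sketched ahead of any hardware run. The sweep below is an illustration only: `sweep_ranks` is a hypothetical helper, the layer shape is made up, and the ratio is assumed to mean the fraction of parameters kept. It reports the ideal FLOP reduction per ratio, which is an upper bound from operation counts and not a measured kernel speedup.

```python
from typing import List, Tuple


def sweep_ranks(d_in: int, d_out: int,
                percents: range = range(1, 21)) -> List[Tuple[float, int, float]]:
    """For each ratio in 0.01..0.20, return (ratio, rank, dense/factored FLOPs).

    Assumes the ratio is the fraction of parameters kept (hypothetical
    convention). Integer percent steps avoid floating-point drift.
    """
    rows = []
    for pct in percents:
        rank = max((pct * d_in * d_out) // (100 * (d_in + d_out)), 1)
        flop_gain = (d_in * d_out) / (rank * (d_in + d_out))
        rows.append((pct / 100, rank, flop_gain))
    return rows


if __name__ == "__main__":
    # Hypothetical layer shape for illustration only.
    for ratio, rank, gain in sweep_ranks(4096, 11008):
        print(f"ratio={ratio:.2f} rank={rank:4d} flop_gain={gain:6.2f}x")
```

Plotting measured kernel time against the same ratios would show how far the real kernel sits from this arithmetic bound, which is the gap a sensitivity plot is meant to expose.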
Circularity Check
No circularity: empirical speedups rest on direct hardware measurements against external baseline
The paper presents an engineering implementation of SVD-based compression plus custom tiling, kernel fusion, and caching for MLP layers on Trainium. Its load-bearing claims are measured kernel-level (1.35x) and end-to-end (1.21x) speedups at a fixed 0.05 compression ratio, obtained by timing runs against the AWS-provided NKI matmul kernel. No equations, first-principles derivations, or fitted parameters are shown that reduce these timing results to the inputs by construction; the results are external-benchmark comparisons rather than self-referential predictions. Self-citations, if present, are not load-bearing for the performance numbers.
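A direct timing comparison of this kind has a simple generic shape, sketched here under assumptions. The helper names are hypothetical, the harness is host-side only, and a real accelerator measurement would also need device synchronization or a vendor profiler, which this sketch does not attempt. The review says "average" without stating which mean was used; the geometric mean below is one common choice for aggregating ratios, not a claim about the paper.

```python
import statistics
import time
from typing import Callable, List


def median_seconds(fn: Callable[[], object], warmup: int = 3,
                   repeats: int = 20) -> float:
    """Median wall-clock time of fn() after discarding warmup runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline: Callable[[], object],
            candidate: Callable[[], object]) -> float:
    """Baseline time divided by candidate time; > 1 means candidate is faster."""
    return median_seconds(baseline) / median_seconds(candidate)


def geometric_mean_speedup(speedups: List[float]) -> float:
    """Aggregate per-model speedups so that one outlier does not dominate."""
    return statistics.geometric_mean(speedups)
```

Reporting the per-model speedups alongside the aggregate lets a reader check whether a 1.35x average hides a wide spread.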
Axiom & Free-Parameter Ledger
free parameters (1)
- compression_ratio
axioms (1)
- domain assumption: SVD provides a sufficiently accurate low-rank approximation for MLP weight matrices without retraining
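What SVD does guarantee can be written down exactly, which helps separate the guaranteed part of this assumption from the empirical part. The statement below is the standard Eckart-Young-Mirsky result; the symbols (an m × n weight matrix W with singular values σ_i, singular vectors u_i and v_i, and truncation rank r) are notation introduced here and not taken from the paper.

```latex
W = \sum_{i=1}^{\min(m,n)} \sigma_i\, u_i v_i^{\top}, \qquad
W_r = \sum_{i=1}^{r} \sigma_i\, u_i v_i^{\top}, \qquad
\lVert W - W_r \rVert_F^2 = \sum_{i>r} \sigma_i^2
  = \min_{\operatorname{rank}(B)\le r} \lVert W - B \rVert_F^2 .
```

The truncated SVD is therefore the best rank-r approximation of the weights in Frobenius norm, but the bound concerns the weights alone. It says nothing direct about the error of the layer output on real activations or about task accuracy, which is why the assumption has to be checked empirically.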
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We apply SVD to the weight matrices in LLM... W ≈ U V ... transforms the original matmul (X W) into X U V ... TrainiumFusion introduces an SRAM-capacity-aware caching strategy...
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Evaluating on nine datasets and six recent LLMs... at a compression ratio of 0.05... 1.35× kernel speedup... 1.21× end-to-end
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.