NeuronMLP: Efficient LLM Inference via Singular Value Decomposition Compression and Tiling on AWS Trainium

Dinghong Song; Dong Li; Jierui Xu; Pengfei Su; Weichu Yang

arxiv: 2510.25977 · v4 · submitted 2025-10-29 · 💻 cs.CL

Dinghong Song, Jierui Xu, Weichu Yang, Pengfei Su, Dong Li

Pith reviewed 2026-05-18 02:47 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM inference · SVD compression · AWS Trainium · MLP layers · kernel optimization · tiling · matrix multiplication · AI accelerators

The pith

SVD compression of MLP layers with custom tiling delivers 1.35x kernel speedup and 1.21x end-to-end LLM inference speedup on Trainium at 0.05 compression ratio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NeuronMLP as a method to speed up large language model inference on AWS Trainium by applying singular value decomposition to compress the weight matrices in multi-layer perceptron layers. It adds tiling, kernel fusion, and specialized caching to fit Trainium's systolic array hardware and reduce data movement through its software-managed memory. A sympathetic reader would care because MLP layers form a large part of inference compute, and the approach shows measurable gains from hardware-specific optimizations without extra training steps. The reported results span nine datasets and six recent LLMs, focusing on practical end-to-end improvements at the stated compression level.

Core claim

NeuronMLP applies singular value decomposition compression to MLP layers at a 0.05 ratio and pairs it with tiling, kernel fusion, and caching strategies tailored to Trainium's architecture. These changes reduce data movement across the memory hierarchy, maximize SRAM bandwidth, and avoid matrix transpose operations. On this basis the method records an average 1.35x speedup over the existing NKI-based matrix multiplication kernel at the kernel level, which produces an average 1.21x end-to-end inference speedup across the evaluated models and datasets.
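As a rough consistency check on the two headline numbers, an Amdahl-style model relates a kernel speedup to an end-to-end speedup through the fraction of baseline runtime spent in the accelerated kernel. The sketch below is not from the paper: the function names and the single-fraction model are assumptions, and because both figures are averages over models and datasets the result is only indicative.

```python
def implied_kernel_fraction(kernel_speedup: float, e2e_speedup: float) -> float:
    """Fraction f of baseline runtime spent in the accelerated kernel.

    Assumes Amdahl's law: 1 / e2e = (1 - f) + f / kernel.
    """
    if kernel_speedup <= 1.0 or e2e_speedup < 1.0:
        raise ValueError("expects kernel_speedup > 1 and e2e_speedup >= 1")
    return (1.0 - 1.0 / e2e_speedup) / (1.0 - 1.0 / kernel_speedup)


def e2e_speedup(kernel_speedup: float, fraction: float) -> float:
    """End-to-end speedup when `fraction` of runtime is sped up by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# Headline averages quoted above: 1.35x kernel, 1.21x end-to-end.
f = implied_kernel_fraction(1.35, 1.21)
print(f"implied kernel share of baseline runtime: {f:.2f}")
```

Under this simplified model, the two averages would be consistent with roughly two-thirds of baseline runtime being spent in the accelerated matmuls.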

What carries the argument

SVD compression of MLP weight matrices combined with tiling and kernel fusion that respects Trainium's systolic array layout and software-managed memory hierarchy to cut data movement.
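To make the mechanism concrete, the block below writes out the standard cost accounting for a rank-r factorization. The symbols m, d, n, and r are introduced here for illustration only, and this summary does not state how the paper maps its compression ratio to r.

```latex
\[
W \in \mathbb{R}^{d \times n}, \qquad
W \approx U V, \qquad
U \in \mathbb{R}^{d \times r}, \; V \in \mathbb{R}^{r \times n}
\]
\[
\text{parameters: } dn \;\longrightarrow\; r(d+n), \qquad
\text{FLOPs for } X \in \mathbb{R}^{m \times d}: \; 2mdn \;\longrightarrow\; 2mr(d+n)
\]
\[
\text{both shrink only when } r < \frac{dn}{d+n}
\]
```

The factorized form replaces one large matmul with two skinny ones, so the arithmetic saving is realized in practice only if the intermediate product XU does not have to travel back through slow memory. That is presumably the role the summary assigns to fusion and caching.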

If this is right

MLP layers can be replaced by their compressed versions on Trainium while still producing usable inference results across multiple recent LLMs.
Hardware-specific tiling and fusion reduce the cost of data movement enough to yield both kernel and end-to-end gains.
Avoiding explicit matrix transpose through layout choices improves throughput on systolic-array accelerators.
The same compression-plus-tiling pattern can be applied to other matrix-heavy kernels inside LLM inference pipelines on Trainium.
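One way a layout choice can remove a transpose is sketched below in plain Python. It assumes a toy engine whose matmul primitive returns the transpose of the stationary operand times the moving operand, and it assumes activations are already stored transposed. The names `engine_matmul` and `fused_xuv` are hypothetical; this is not NKI code and not necessarily the paper's kernel.

```python
from typing import List

Matrix = List[List[float]]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def engine_matmul(stationary: Matrix, moving: Matrix) -> Matrix:
    """Toy model of a systolic matmul that returns stationary^T @ moving.

    Both operands carry the contraction dimension on their rows.
    """
    if len(stationary) != len(moving):
        raise ValueError("contraction dimensions differ")
    k = len(stationary)
    rows, cols = len(stationary[0]), len(moving[0])
    return [
        [sum(stationary[p][i] * moving[p][j] for p in range(k)) for j in range(cols)]
        for i in range(rows)
    ]


def fused_xuv(x_t: Matrix, u: Matrix, v: Matrix) -> Matrix:
    """Compute X @ U @ V from X^T without transposing the intermediate."""
    # Using U as the stationary operand yields U^T @ X^T = (X @ U)^T directly ...
    y_t = engine_matmul(u, x_t)
    # ... which is already the layout the second matmul needs as stationary.
    return engine_matmul(y_t, v)


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # X is 3 x 2
u = [[1.0], [1.0]]                          # U is 2 x 1
v = [[1.0, 2.0, 3.0]]                       # V is 1 x 3
result = fused_xuv(transpose(x), u, v)
```

The only transpose in the example is the one that prepares the input layout. The intermediate (XU)^T is produced in the orientation the second call consumes, which is the kind of saving the third bullet describes.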

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar SVD-plus-tiling recipes could be tested on other systolic or dataflow accelerators that expose comparable memory hierarchies.
The accuracy impact might be larger on tasks or domains outside the nine evaluation datasets, suggesting a need for targeted recovery techniques.
The reported speedups assume the compression ratio stays fixed; varying the ratio per layer or model size could trade accuracy for further gains.
If the caching strategy generalizes, it might reduce memory traffic in other multi-layer neural network workloads on Trainium.

Load-bearing premise

The SVD-compressed MLP layers keep acceptable accuracy on the nine test datasets without any fine-tuning or accuracy-recovery steps.

What would settle it

Measure the accuracy of the SVD-compressed models against the uncompressed baselines on the same nine datasets and six LLMs to check for unacceptable drops at the 0.05 compression ratio.
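A check of this kind reduces to a per-dataset comparison. The sketch below is a minimal illustration with hypothetical dataset names and made-up scores; none of the numbers come from the paper, and the helper names are assumptions.

```python
from typing import Dict


def accuracy_drops(baseline: Dict[str, float], compressed: Dict[str, float]) -> Dict[str, float]:
    """Per-dataset accuracy drop (baseline minus compressed), in absolute points."""
    missing = set(baseline) ^ set(compressed)
    if missing:
        raise ValueError(f"datasets not present in both runs: {sorted(missing)}")
    return {name: baseline[name] - compressed[name] for name in baseline}


def mean_drop(drops: Dict[str, float]) -> float:
    return sum(drops.values()) / len(drops)


# Hypothetical numbers for illustration only; not results from the paper.
baseline = {"dataset_a": 0.70, "dataset_b": 0.40, "dataset_c": 0.60}
compressed = {"dataset_a": 0.64, "dataset_b": 0.37, "dataset_c": 0.57}

drops = accuracy_drops(baseline, compressed)
worst = max(drops, key=drops.get)
print(f"mean drop {mean_drop(drops):.3f}, worst on {worst} ({drops[worst]:.3f})")
```

Reporting the worst-case dataset alongside the mean matters because an average can hide a single task on which the compressed model collapses.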

Figures

Figures reproduced from arXiv: 2510.25977 by Dinghong Song, Dong Li, Jierui Xu, Pengfei Su, Weichu Yang.

**Figure 1.** NeuronCore memory hierarchy on Trainium with bandwidth and memory size. The tensor engine is organized as a 128 × 128 systolic array of processing elements, defining a partition dimension (𝑃 = 128), where each partition maps to a memory partition in SBUF or PSUM. To fully exploit parallelism across the 128 processing units, the contraction dimension of a matmul must align with the partition dimension, al…

**Figure 2.** Matmul tiling on Trainium (mathematical view). [Diagram labels: SBUF, Tensor Engine, PSUM, stationary.T, moving, Accumulate; tiles x1T to x3T and y1 to y3; tile extents 128 (P), 128 (F), 512 (F); running sum output0 + output1 + output2 + output3.]

**Figure 3.** Matmul tiling on Trainium (hardware view).

**Figure 4.** The Neuron Profiler view of the up_projection matmul in DeepSeek-V3 with SVD compression. Directly computing on the SVD-compressed weight matrices sequentially leads to low SBUF and PSUM utilization and reduced Model Float Utilization (MFU). Frequent idle periods in the MFU indicate that the tensor engine is underutilized while waiting for data transfers and data preparation to complete.

**Figure 5.** The overview of NeuronMM. (a) Block-aligned SVD. The weight parameters of the attention layers remain unchanged, while only the large matrices 𝑊 in the MLP layers are compressed using SVD. (b) TrainiumFusion. The weight 𝑊 is decomposed into 𝑈 and 𝑉, and the original matmul 𝑋𝑊 turns into 𝑋𝑈𝑉. The kernel leverages caching, implicit transposition, and blocking to enable efficient matmul, thereby reducing data m…

**Figure 6.** Execution time and HBM-SBUF memory traffic of different matmul implementations across input sequence lengths.

**Figure 7.** Model degradation with increasing compression ratios. [Bar charts of accuracy (0% to 80%) on Openb., Arc_e, Arc_c, WinoG., HellaS., and MathQA, one panel each for compression ratios 0.10 and 0.20; legend entries include Full Model and NeuronMM w/o Lor…]

**Figure 8.** Accuracy degradation and recovery of Qwen-3-1.7B under different compression ratios on six common-sense reasoning datasets.

… the original, with mAcc drop ≤ 0.10 in every case, a level of loss generally considered acceptable [32, 49–51, 58]. Meanwhile, NeuronMM achieves significant end-to-end inference speedup (1.21×–2.49×), while 𝛾 remains low, ranging from 3.24% to 25.27% (shown in …
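The tiling idea in Figures 1 to 3 can be sketched in plain Python: split the contraction dimension into tiles of at most 128 rows and accumulate each partial product into one output buffer. The tile size 128 comes from the Figure 1 caption; everything else (the function name, the list-of-lists representation, and the omission of free-dimension tiling) is a simplification for illustration, not NKI code.

```python
from typing import List

Matrix = List[List[float]]


def tiled_matmul(x_t: Matrix, y: Matrix, k_tile: int = 128) -> Matrix:
    """Compute X @ Y from X^T (K x M) and Y (K x N), one contraction tile at a time.

    Each pass consumes at most `k_tile` rows of the contraction dimension and
    adds its partial product into `acc`, mimicking accumulation in PSUM.
    """
    k_total = len(x_t)
    if k_total != len(y):
        raise ValueError("contraction dimensions differ")
    m, n = len(x_t[0]), len(y[0])
    acc = [[0.0] * n for _ in range(m)]          # stands in for a PSUM tile
    for k0 in range(0, k_total, k_tile):
        k1 = min(k0 + k_tile, k_total)           # last tile may be partial
        for i in range(m):
            for j in range(n):
                acc[i][j] += sum(x_t[k][i] * y[k][j] for k in range(k0, k1))
    return acc


# Tiny check with a tile of 2 instead of 128 so the partial last tile is exercised.
x_t = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # X^T, so K = 3 and M = 2
y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # K = 3, N = 2
product = tiled_matmul(x_t, y, k_tile=2)
```

Both operands are indexed by the contraction dimension on their rows, which mirrors the caption's requirement that the contraction dimension align with the partition dimension.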

Original abstract

Emerging AI accelerators have started to gain attention and offer new opportunities for efficient inference of large language models (LLMs). Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM inference through its heterogeneous architecture. However, leveraging Trainium architecture for high performance can be challenging because of its systolic array architecture and special requirement on data layout. In this paper, we propose NeuronMLP, an efficient LLM inference method based on Singular Value Decomposition (SVD) compression and tiling on AWS Trainium. We introduce a series of techniques customized to Trainium based on kernel fusion and novel caching strategies to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transpose. The proposed method is specifically optimized for multi-layer perceptron (MLP) layers in LLMs, which serve as a critical computational kernel for inference on Trainium. Evaluating on nine datasets and six recent LLMs, we show that NeuronMLP significantly outperforms the state-of-the-art Neuron Kernel Interface (NKI)-based matrix multiplication (matmul) kernel implemented by AWS on Trainium: at the kernel level, it achieves an average 1.35x speedup, which translates to an average 1.21x speedup for end-to-end LLM inference, under a compression ratio of 0.05.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

NeuronMLP reports 1.35x kernel and 1.21x end-to-end speedups on Trainium via SVD compression at 0.05 ratio plus custom tiling and caching, but skips accuracy numbers for the compressed models.

The main thing to know is that this paper claims solid speedups for LLM inference on AWS Trainium by compressing the MLP layers with SVD down to a 0.05 ratio and then applying custom tiling, kernel fusion, and caching to fit the hardware's systolic array and memory setup. The numbers are 1.35 times faster at the kernel level and 1.21 times end-to-end across several models and datasets.

What is actually new here is the specific combination of these techniques tuned to Trainium's architecture. SVD compression itself is not novel, nor is tiling for systolic arrays, but the way they avoid expensive transposes, maximize SRAM bandwidth with caching, and fuse operations for the compressed MLPs shows careful attention to the data layout and software-managed memory on this accelerator. The paper does well in providing direct comparisons to the state-of-the-art NKI-based matmul kernel from AWS. The evaluation covers six recent LLMs and nine datasets, which gives a reasonable sense of the runtime benefits.

The soft spots are around the accuracy side. At such a low compression ratio, the MLP layers are heavily approximated, yet the results section appears to report only timing data without any perplexity, accuracy, or zero-shot performance numbers for the compressed versions versus the originals. This leaves open whether the speedups apply to models that still perform well on the tasks. The choice of compression ratio also lacks details on how it was selected or if any recovery steps were used.

Overall, this paper is for hardware engineers and practitioners working on efficient inference specifically on Trainium or similar custom accelerators. A reader looking for practical kernel optimizations and implementation tricks for this platform could find value in the details. It deserves a serious referee because the core engineering work is grounded in measurements against a real baseline, even if the accuracy gap needs to be addressed in revisions.

Referee Report

2 major / 2 minor

Summary. The paper proposes NeuronMLP, an SVD-based compression and tiling approach for accelerating MLP layers in LLMs on AWS Trainium. It introduces kernel fusion, caching, and data-layout optimizations tailored to Trainium's systolic arrays and software-managed memory. The central empirical claim is that at a fixed compression ratio of 0.05, NeuronMLP delivers an average 1.35× kernel-level speedup over the AWS NKI matmul baseline, which translates to a 1.21× end-to-end inference speedup across six recent LLMs and nine datasets.

Significance. If the accuracy of the SVD-compressed models is shown to remain comparable to the uncompressed baselines, the work would offer a concrete, hardware-specific recipe for reducing inference latency on Trainium without requiring model fine-tuning. The direct timing measurements against an external production kernel and the multi-model, multi-dataset evaluation are positive attributes that would make the result useful to practitioners targeting this accelerator.

major comments (2)

[Evaluation] Evaluation section: the headline speedups (1.35× kernel, 1.21× end-to-end) at compression ratio 0.05 are reported without any perplexity, accuracy, or zero-shot scores for the compressed models versus the original six LLMs. At this aggressive rank reduction, MLP layers are known to be accuracy-sensitive; the absence of these metrics leaves open the possibility that the observed speedups apply only to a lower-quality model, undermining the claim of practical efficient inference.
[Abstract and §3] Abstract and §3: the compression ratio is fixed at 0.05 with no description of how it was selected, no sensitivity analysis across ratios, and no statement of whether accuracy-recovery steps (fine-tuning or calibration) were applied. This choice is load-bearing for the reported performance numbers.
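A sensitivity analysis of the kind requested could start from a simple arithmetic cost model before any hardware measurement. The sketch below counts parameters (equivalently, matmul FLOPs per token) for a rank-r factorization relative to the dense layer. The layer shape is hypothetical, the function names are assumptions, and the model says nothing about measured latency or about how the paper defines its compression ratio.

```python
def low_rank_cost_ratio(d: int, n: int, r: int) -> float:
    """Parameters (and matmul FLOPs per token) of U, V relative to the dense W."""
    if not (0 < r <= min(d, n)):
        raise ValueError("rank must satisfy 0 < r <= min(d, n)")
    return r * (d + n) / (d * n)


def break_even_rank(d: int, n: int) -> int:
    """Largest rank for which the factorized form is strictly cheaper than dense."""
    # r * (d + n) < d * n  <=>  r < d * n / (d + n)
    return (d * n - 1) // (d + n)


# Hypothetical layer shape, chosen only for illustration.
d, n = 4096, 11008
for r in (64, 256, 1024, break_even_rank(d, n)):
    print(f"rank {r:5d}: cost ratio {low_rank_cost_ratio(d, n, r):.3f}")
```

Plotting measured kernel time against this ratio would show how much of the theoretical saving survives the memory hierarchy, which is the gap a sensitivity study would need to expose.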

minor comments (2)

[Methods] Notation for the SVD rank and the resulting compression ratio should be defined explicitly in the methods section rather than only in the abstract.
[Figures] Figure captions for the kernel-level and end-to-end timing plots should state the exact models, datasets, and batch sizes used in each bar.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and describe the corresponding revisions.

Referee: [Evaluation] Evaluation section: the headline speedups (1.35× kernel, 1.21× end-to-end) at compression ratio 0.05 are reported without any perplexity, accuracy, or zero-shot scores for the compressed models versus the original six LLMs. At this aggressive rank reduction, MLP layers are known to be accuracy-sensitive; the absence of these metrics leaves open the possibility that the observed speedups apply only to a lower-quality model, undermining the claim of practical efficient inference.

Authors: We agree that the absence of accuracy and perplexity metrics is a significant omission. The manuscript as submitted emphasizes the kernel-level and end-to-end latency improvements but does not report model quality. In the revised version we will add a new table in the Evaluation section that reports perplexity on the nine datasets and zero-shot accuracy on standard benchmarks for both the original and SVD-compressed models at the 0.05 ratio. These measurements will be included so that readers can directly evaluate the quality-speedup trade-off. revision: yes
Referee: [Abstract and §3] Abstract and §3: the compression ratio is fixed at 0.05 with no description of how it was selected, no sensitivity analysis across ratios, and no statement of whether accuracy-recovery steps (fine-tuning or calibration) were applied. This choice is load-bearing for the reported performance numbers.

Authors: The ratio 0.05 was selected after preliminary profiling experiments that identified it as the point at which Trainium-specific tiling and caching deliver substantial kernel speedups while the resulting model remains usable for inference. No fine-tuning or calibration was performed after the SVD decomposition. We acknowledge that the manuscript provides insufficient justification. In the revision we will expand §3 with (i) an explicit statement that no post-SVD recovery steps were used, (ii) a description of the profiling process that led to 0.05, and (iii) a sensitivity plot showing kernel speedup versus compression ratio over the range 0.01–0.20. This will make the parameter choice transparent. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical speedups rest on direct hardware measurements against external baseline

The paper presents an engineering implementation of SVD-based compression plus custom tiling, kernel fusion, and caching for MLP layers on Trainium. Its load-bearing claims are measured kernel-level (1.35x) and end-to-end (1.21x) speedups at a fixed 0.05 compression ratio, obtained by timing runs against the AWS-provided NKI matmul kernel. No equations, first-principles derivations, or fitted parameters are shown that reduce these timing results to the inputs by construction; the results are external-benchmark comparisons rather than self-referential predictions. Self-citations, if present, are not load-bearing for the performance numbers.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper rests on standard assumptions about SVD being a valid low-rank approximation for weight matrices and on the existence of a performant NKI baseline. No new physical constants or invented entities are introduced.

free parameters (1)

compression_ratio
Fixed at 0.05 in the reported experiments; the value is chosen to balance speed and (unstated) accuracy.

axioms (1)

domain assumption: SVD provides a sufficiently accurate low-rank approximation for MLP weight matrices without retraining.
Invoked implicitly when claiming end-to-end speedups; accuracy impact is not quantified in the abstract.
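The assumption can be stated more precisely with the Eckart–Young theorem, which is a standard linear-algebra result rather than something taken from the paper. For a weight matrix W of size d × n with singular values σ₁ ≥ σ₂ ≥ …, the rank-r truncation W_r is the best rank-r approximation in the Frobenius norm (the symbols here are introduced for illustration):

```latex
\[
W = \sum_{i=1}^{\min(d,n)} \sigma_i \, u_i v_i^{\top}, \qquad
W_r = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top}
\]
\[
\min_{\operatorname{rank}(B) \le r} \lVert W - B \rVert_F
  = \lVert W - W_r \rVert_F
  = \Bigl( \sum_{i > r} \sigma_i^{2} \Bigr)^{1/2}
\]
```

This bounds the error in the weights only. It says nothing directly about task accuracy, which depends on how the discarded singular directions interact with real activations, and that is the step the axiom takes on trust.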

pith-pipeline@v0.9.0 · 5783 in / 1359 out tokens · 18193 ms · 2026-05-18T02:47:21.030690+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We apply SVD to the weight matrices in LLM... W ≈ U V ... transforms the original matmul (X W) into X U V ... TrainiumFusion introduces an SRAM-capacity-aware caching strategy...
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Evaluating on nine datasets and six recent LLMs... at a compression ratio of 0.05... 1.35× kernel speedup... 1.21× end-to-end

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 7 internal anchors

[1] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024.
[2] Amazon Web Services. AWS Trainium, 2023. Accessed: 2025-09-23.
[3] Amazon Web Services. AWS Neuron introduces Neuron Kernel Interface (NKI), NxD Training, and JAX support for training, 2024. Accessed: 2025-09-23.
[4] Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. SliceGPT: Compress large language models by deleting rows and columns. In The Twelfth International Conference on Learning Representations, 2024.
[5] AWS Neuron SDK Documentation. NKI Matrix multiplication, 2025. Accessed: 2025-09-13.
[6] AWS Neuron SDK Documentation. Trainium and Inferentia2 Architecture, 2025. Accessed: 2025-07-28.
[7] Yueyin Bai, Hao Zhou, Keqing Zhao, Jianli Chen, Jun Yu, and Kun Wang. Transformer-opu: An fpga-based overlay processor for transformer networks. In 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 221–221. IEEE, 2023.
[8] Yahav Biran and Imry Kissos. Adaptive orchestration for large-scale inference on heterogeneous accelerator systems balancing cost, performance, and resilience, 2025.
[9] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020.
[10] Zeming Chen, Qiyue Gao, Antoine Bosselut, Ashish Sabharwal, and Kyle Richardson. Disco: Distilling counterfactuals with large language models. arXiv preprint arXiv:2212.10534, 2022.
[11] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
[12] Joel Coburn, Chunqiang Tang, Sameer Abu Asal, Neeraj Agrawal, Raviteja Chinta, Harish Dixit, Brian Dodds, Saritha Dwarakapuram, Amin Firoozshahian, Cao Gao, et al. Meta’s second generation ai chip: Model-chip co-design and productionization experiences. In Proceedings of the 52nd Annual International Symposium on Computer Architecture, pages 1689–1702, 2025.
[13] James W Demmel. Applied numerical linear algebra. SIAM, 1997.
[14] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in neural information processing systems, 35:30318–30332, 2022.
[15] Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. arXiv preprint arXiv:2110.02861, 2021.
[16] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023.
[17] Xuan Ding, Rui Sun, Yunjian Zhang, Xiu Yan, Yueqi Zhou, Kaihao Huang, Suzhong Fu, Chuanlong Xie, and Yao Zhu. Dipsvd: Dual-importance protected svd for efficient llm compression. arXiv preprint arXiv:2506.20353, 2025.
[18] Haozheng Fan, Hao Zhou, Guangtai Huang, Parameswaran Raman, Xinwei Fu, Gaurav Gupta, Dhananjay Ram, Yida Wang, and Jun Huan. Hlat: High-quality large language model pre-trained on aws trainium. In 2024 IEEE International Conference on Big Data (BigData), pages 2100–2109. IEEE, 2024.
[19] Amin Firoozshahian, Joel Coburn, Roman Levenstein, Rakesh Nattoji, Ashwin Kamath, Olivia Wu, Gurdeepak Grewal, Harish Aepala, Bhasker Jakka, Bob Dreyer, et al. Mtia: First generation silicon targeting meta’s recommendation systems. In Proceedings of the 50th Annual International Symposium on Computer Architecture, pages 1–13, 2023.
[20] Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International conference on machine learning, pages 10323–10337. PMLR, 2023.
[21] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Optq: Accurate quantization for generative pre-trained transformers. In International Conference on Learning Representations, 2023.
[22] Xinwei Fu, Zhen Zhang, Haozheng Fan, Guangtai Huang, Mohammad El-Shabani, Randy Huang, Rahul Solanki, Fei Wu, Ron Diamant, and Yida Wang. Distributed training of large language models on aws trainium. In Proceedings of the 2024 ACM Symposium on Cloud Computing, pages 961–976, 2024.
[23] Song Guo, Jiahang Xu, Li Lyna Zhang, and Mao Yang. Compresso: Structured pruning with collaborative prompting learns compact large language models, 2023.
[24] Shwai He, Guoheng Sun, Zheyu Shen, and Ang Li. What matters in transformers? not all attention is needed. arXiv preprint arXiv:2406.15786, 2024.
[25] Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dynabert: dynamic bert with adaptive width and depth. NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc.
[26] Yen-Chang Hsu, Ting Hua, Sungen Chang, Qian Lou, Yilin Shen, and Hongxia Jin. Language model compression with weighted low-rank factorization. arXiv preprint arXiv:2207.00112, 2022.
[27] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
[28] Yuhao Ji, Chao Fang, and Zhongfeng Wang. Beta: Binarized energy-efficient transformer accelerator at the edge. In 2024 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5. IEEE, 2024.
[29] Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, et al. Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the 50th annual international symposium on computer architecture, pages 1–14, 2023.
[30] Norman P Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David Patterson. A domain-specific supercomputer for training deep neural networks. Communications of the ACM, 63(7):67–78, 2020.
[31] Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, and Yejin Choi. Symbolic chain-of-thought distillation: Small models can also "think" step-by-step. arXiv preprint arXiv:2306.14050, 2023.
[32] Zhiteng Li, Mingyuan Xia, Jingyuan Zhang, Zheng Hui, Linghe Kong, Yulun Zhang, and Xiaokang Yang. Adasvd: Adaptive singular value decomposition for large language models. arXiv preprint arXiv:2502.01403, 2025.
[33] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[34] Shiwei Liu, Chen Mu, Hao Jiang, Yunzhengmao Wang, Jinshan Zhang, Feng Lin, Keji Zhou, Qi Liu, and Chixiao Chen. Hardsea: Hybrid analog-reram clustering and digital-sram in-memory computing accelerator for dynamic sparse self-attention in transformer. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 32(2):269–282, 2023.
[35] Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models, 2023.
[36] Mary Ann Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. Using Large Corpora, 273:31, 1994.
[37] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
[38] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
[39] Roy Miles, Adrian Lopez Rodriguez, and Krystian Mikolajczyk. Information theoretic representation distillation, 2022.
[40]

neuronx-distributed-inference, 2025

AWS Neuron. neuronx-distributed-inference, 2025. Accessed: 2025- 09-24

work page 2025
[41]

Neuron kernel interface.https://awsdocs- neuron.readthedocs-hosted.com/en/latest/general/nki/index.html,

Neuron Kernel Interface. Neuron kernel interface.https://awsdocs- neuron.readthedocs-hosted.com/en/latest/general/nki/index.html,

work page
[42]

Accessed: August 1, 2025

work page 2025
[43]

Neuron kernel interface mm

Neuron Kernel Interface. Neuron kernel interface mm. https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/ nki/tutorials/matrix_multiplication.html, 2025. Accessed: August 1, 2025

work page 2025
[44]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

work page 2020
[45]

Winogrande: An adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

work page 2021
[46]

Movement pruning: Adaptive sparsity by fine-tuning.Advances in neural information processing systems, 33:20378–20389, 2020

Victor Sanh, Thomas Wolf, and Alexander Rush. Movement pruning: Adaptive sparsity by fine-tuning.Advances in neural information processing systems, 33:20378–20389, 2020

work page 2020
[47]

Shikhar Tuli and Niraj K Jha. Acceltran: A sparsity-aware accelera- tor for dynamic inference with transformers.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 42(11):4038– 4051, 2023

[48]

Georgios Tzanos, Christoforos Kachris, and Dimitrios Soudris. Hardware acceleration of transformer networks using fpgas. In 2022 Panhellenic Conference on Electronics & Telecommunications (PACET), pages 1–5. IEEE, 2022.

[49]

Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, et al. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863, 2023.

[50]

Qinsi Wang, Jinghan Ke, Masayoshi Tomizuka, Yiran Chen, Kurt Keutzer, and Chenfeng Xu. Dobi-SVD: Differentiable SVD for LLM compression and some new perspectives. arXiv preprint arXiv:2502.02723, 2025.

[51]

Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, and Mi Zhang. SVD-LLM V2: Optimizing singular value truncation for large language model compression. arXiv preprint arXiv:2503.12340, 2025.

[52]

Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. SVD-LLM: Truncation-aware singular value decomposition for large language model compression. In The Thirteenth International Conference on Learning Representations, 2025.

[53]

Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.

[54]

Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, and Yuxiong He. Understanding int4 quantization for transformer models: Latency speedup, composability, and failure cases, 2023.

[55]

Tengfei Xue, Xuefeng Li, Roman Smirnov, Tahir Azim, Arash Sadrieh, and Babak Pahlavan. Ninjallm: Fast, scalable and cost-effective rag using amazon sagemaker and aws trainium and inferentia2, 2024.

[56]

Yahma. Alpaca cleaned dataset. https://huggingface.co/datasets/yahma/alpaca-cleaned, 2023. Accessed: 2025-07-28.

[57]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[58]

Shuo Yang, Sujay Sanghavi, Holakou Rahmanian, Jan Bakus, and S. V. N. Vishwanathan. Toward understanding privileged features distillation in learning-to-rank, 2022.

[59]

Zhihang Yuan, Yuzhang Shang, Yue Song, Qiang Wu, Yan Yan, and Guangyu Sun. ASVD: Activation-aware singular value decomposition for compressing large language models. arXiv preprint arXiv:2312.05821, 2023.

[60]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.

[61]

Minjia Zhang and Yuxiong He. Accelerating training of transformer-based language models with progressive layer dropping. Advances in neural information processing systems, 33:14011–14023, 2020.

[62]

Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models. Transactions of the Association for Computational Linguistics, 12:1556–1577, 2024.

