ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
hub
Svd-llm: Truncation-aware singular value decomposition for large language model compression.arXiv preprint arXiv:2403.07378
18 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
LASER introduces curvature-weighted SVD from second-order loss approximation and loss-aware rank allocation to compress VLMs, reporting over 2.3x decoding speedup under low-precision settings.
DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85x speedup over standard methods.
SAFE-SVD introduces a sensitivity-aware fidelity-enforcing SVD framework for compressing physics foundation models that maintains higher accuracy than standard methods at greater compression ratios.
OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.
DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
Bayesian fine-tuning of large models can be done efficiently by projecting uncertainties into low-dimensional subspaces, yielding improved calibration and generalization while keeping computational costs low.
Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.
ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.
SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.
RUQuant uses block-wise composite orthogonal matrices from Householder reflections and Givens rotations plus a fine-tuned global reflection to achieve 99.8% full-precision accuracy at W6A6 and 97% at W4A4 for 13B LLMs in about one minute.
SCT pre-trains LLMs by keeping weights as compact SVD factors with Stiefel QR retraction, delivering up to 199x memory reduction per layer and allowing 70B-parameter training on a Steam Deck.
A3 splits Transformer layers into QK, OV, and MLP components and derives analytical low-rank approximations that reduce hidden dimensions while minimizing each component's functional loss, yielding better perplexity than prior low-rank methods on LLaMA models.
ASVD compresses LLMs by 10-30% and KV caches by 50% via activation-aware SVD that absorbs outliers into transformed weights and calibrates per-layer sensitivity.
Unifying cross-layer SVD compression for LLMs improves weight reconstruction error by up to 46% on Pythia models but causes severe degradation in perplexity and accuracy due to residual stream decoupling.
A slice-wise feature distillation framework for independent tensorization of neural network slices to achieve scalable compression with reduced fine-tuning costs.
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
citing papers explorer
-
Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction
SCT pre-trains LLMs by keeping weights as compact SVD factors with Stiefel QR retraction, delivering up to 199x memory reduction per layer and allowing 70B-parameter training on a Steam Deck.
-
A3 : an Analytical Low-Rank Approximation Framework for Attention
A3 splits Transformer layers into QK, OV, and MLP components and derives analytical low-rank approximations that reduce hidden dimensions while minimizing each component's functional loss, yielding better perplexity than prior low-rank methods on LLaMA models.