DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 3roles
method 1polarities
background 1representative citing papers
The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.
NeuronMLP applies SVD-based compression and Trainium-specific tiling and caching to MLP layers, delivering 1.35x kernel speedup and 1.21x end-to-end inference speedup at 0.05 compression ratio versus AWS NKI baseline.
citing papers explorer
-
Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
-
From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs
The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.
-
NeuronMLP: Efficient LLM Inference via Singular Value Decomposition Compression and Tiling on AWS Trainium
NeuronMLP applies SVD-based compression and Trainium-specific tiling and caching to MLP layers, delivering 1.35x kernel speedup and 1.21x end-to-end inference speedup at 0.05 compression ratio versus AWS NKI baseline.