
arxiv: 2306.00978 · v6 · submitted 2023-06-01 · 💻 cs.CL

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Pith reviewed 2026-05-24 08:25 UTC · model grok-4.3

classification 💻 cs.CL
keywords activation-aware quantization · LLM compression · weight-only quantization · model acceleration · 4-bit quantization · on-device inference · salient weights

The pith

Protecting 1% of salient weights via activation scaling sharply reduces LLM quantization error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that weight-only quantization of large language models to low bits can preserve accuracy by protecting a small subset of important weights rather than treating all weights equally. It claims that the right weights to protect are those with high activation impact, identified offline, and that an equivalent scaling transformation can shield them during quantization without any retraining or reconstruction. If this holds, models can be compressed to 4 bits and run much faster on edge devices while maintaining performance across language tasks, coding, math, instruction tuning, and even multi-modal settings.

Core claim

AWQ shows that activation distributions, not weight magnitudes, identify the 1% of salient weight channels whose protection sharply cuts quantization error. An equivalent transformation scales these channels up before quantization and compensates on the activation side, so the computed output is unchanged; the scale factor is derived from offline activation statistics. The method requires no backpropagation and avoids overfitting the calibration set, which allows direct application to instruction-tuned and multi-modal models.

What carries the argument

An equivalent transformation that scales salient weight channels according to activation statistics to protect them during quantization.
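
The transformation can be illustrated on a single output neuron. The sketch below is a hand-picked toy case, not the paper's procedure: the weights, the activations, the fixed scale `s = 2`, and the helper names `quantize_rtn` and `dot` are all assumptions made for illustration, whereas the paper derives its scales from calibration statistics.

```python
def quantize_rtn(weights, n_bits=4):
    """Symmetric round-to-nearest: quantize one weight group, then dequantize."""
    qmax = 2 ** (n_bits - 1) - 1                          # 7 for 4 bits
    delta = (max(abs(w) for w in weights) / qmax) or 1.0  # step size set by the largest weight
    return [max(-qmax - 1, min(qmax, round(w / delta))) * delta for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# One output neuron; input channel 2 carries a much larger activation (the "salient" channel).
w = [0.70, 0.33, -0.14, 0.52]
x = [1.0, 1.0, 20.0, 1.0]
s = [1.0, 1.0, 2.0, 1.0]  # illustrative scale applied to the salient channel only

# Equivalent transformation: multiply the weight by s, divide the activation by s.
w_scaled = [wi * si for wi, si in zip(w, s)]
x_scaled = [xi / si for xi, si in zip(x, s)]

y_ref = dot(w, x)
y_equiv = dot(w_scaled, x_scaled)  # matches y_ref before any quantization

err_plain = abs(dot(quantize_rtn(w), x) - y_ref)
err_scaled = abs(dot(quantize_rtn(w_scaled), x_scaled) - y_ref)
print(f"plain RTN error: {err_plain:.2f}, scaled RTN error: {err_scaled:.2f}")
```

In this toy case the output error falls from about 0.75 to about 0.25, because the rounding error on the salient weight is multiplied by an activation that is half as large while the quantization step size stays the same. A single example proves nothing about the average case; it only shows the mechanism.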

If this is right

  • 4-bit quantized LLMs match or exceed prior methods on language modeling, coding, and math benchmarks.
  • The same weights-only approach works without modification on instruction-tuned and multi-modal models.
  • Kernel-fused inference yields more than 3x speedup over FP16 on both desktop and mobile GPUs.
  • 70B-scale models become deployable on mobile GPUs.
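
A back-of-the-envelope calculation gives a rough sense of the memory side of these claims. The sketch below counts weight storage only (no KV cache or activations). The nominal parameter counts and the per-group metadata cost of 20 bits (one 16-bit scale plus one 4-bit zero point per group of 128 weights) are assumptions, not figures from the paper, and `weight_gib` is a hypothetical helper.

```python
def weight_gib(n_params, bits_per_weight, group_size=None, meta_bits_per_group=0):
    """Approximate weight storage in GiB, with optional per-group metadata."""
    total_bits = n_params * bits_per_weight
    if group_size:
        total_bits += (n_params / group_size) * meta_bits_per_group
    return total_bits / 8 / 2**30


# Nominal parameter counts (assumed, rounded).
for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_gib(n_params, 16)
    int4 = weight_gib(n_params, 4, group_size=128, meta_bits_per_group=20)
    print(f"{name}: FP16 {fp16:.1f} GiB, INT4-g128 {int4:.1f} GiB, ratio {fp16 / int4:.2f}x")
```

Under these assumptions the weight footprint shrinks by a little under 4x, which is consistent in spirit with a 13B model fitting on an 8GB GPU, although the actual headroom depends on everything this estimate leaves out.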

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Activation magnitude may serve as a general proxy for parameter importance in other compression methods such as pruning.
  • Offline calibration could simplify deployment pipelines by removing the need for per-domain retraining of quantized models.
  • The scaling idea might transfer to reducing quantization error in non-transformer architectures.

Load-bearing premise

Activation statistics collected from a calibration set remain representative when the quantized model encounters new domains or fine-tuned versions.

What would settle it

Apply AWQ to an instruction-tuned model whose fine-tuning data lies far outside the calibration distribution and check whether perplexity or task accuracy drops more than with prior quantization methods.
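
One way to score such a test is to compare relative perplexity degradation against the FP16 model for each quantization method. The sketch below assumes per-token natural-log probabilities are already available from an evaluation run; the function names are hypothetical and no real measurements are included.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of each evaluated token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_degradation(ppl_fp16, ppl_quant):
    """Fractional perplexity increase of a quantized model over its FP16 baseline."""
    return (ppl_quant - ppl_fp16) / ppl_fp16


# Placeholder log-probabilities: every token predicted with probability 0.5.
print(perplexity([math.log(0.5)] * 4))
```

Comparing `relative_degradation` for AWQ against the same quantity for a baseline method, on data far from the calibration distribution, would show whether the premise holds.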

Figures

Figures reproduced from arXiv: 2306.00978 by Chuang Gan, Guangxuan Xiao, Haotian Tang, Jiaming Tang, Ji Lin, Shang Yang, Song Han, Wei-Chen Wang, Wei-Ming Chen, Xingyu Dang.

Figure 1
We introduce AWQ, a versatile weight quantization method for LLM. To implement AWQ, we developed TinyChat to deploy 4-bit quantized LLMs into various edge platforms, achieving a 3-4× performance boost compared to FP16. Notably, we’ve also manufactured a TinyChat computer, powered by TinyChat, which contains an NVIDIA Jetson Orin Nano with only 8GB of memory and 15W power consumption. Demo: https://youtu.b…
Figure 2
We observe that we can find 1% of the salient weights in LLMs based on the activation distribution (middle). Keeping the salient weights in FP16 can significantly improve the quantized performance (PPL from 43.2 (left) to 13.0 (middle)), but the mixed-precision format is not hardware-efficient. We follow the activation-awareness principle and propose AWQ (right). AWQ performs per-channel scaling to protect…
Figure 3
Bottleneck analysis for Llama-2-7B on NVIDIA RTX 4090. Left: In on-device LLM applications, generation stage is much slower than the context stage. Middle: The generation stage is memory bound and has low arithmetic intensity. W4A16 quantization can effectively improve the arithmetic intensity by 4×. Right: The amount of weight access is orders of magnitude larger than the amount of activation access. Thus…
Figure 4
SIMD-aware weight packing for ARM NEON with 128-bit SIMD units. Original weights are reordered and packed to align with the bit width so that the weights can be unpacked into bytes at runtime using AND and shift bitwise operations with a 128-bit mask. with 200 tokens only takes 10 ms. Consequently, the generation phase is substantially slower than the context stage, particularly for on-device interactive …
Figure 5
Comparing INT3-g128 quantized Vicuna models with FP16 counterparts under GPT-4 evaluation protocol (Chiang et al., 2023). More winning cases (in blue) indicate better performance. AWQ consistently improves the quantized performance compared to RTN and GPTQ (Frantar et al., 2022), showing generalization to instruction-tuned models. Evaluations. Following previous literature (Dettmers et al., 2022; Xiao et a…
Figure 6
Visual reasoning examples from LLaVA-13B model (Liu et al., 2023a). AWQ improves over the round-to-nearest (RTN) baseline, providing more reasonable answers. We color the text to show the correct or wrong responses. W4-RTN: A man and a dog walking past some bushes. W4-AWQ: Two dogs are walking on the street. W4-RTN: A man is holding a baby elephant in his arms. W4-AWQ: A man and his daughter pose with an e…
Figure 7
Qualitative results of quantized OpenFlamingo-9B (Awadalla et al., 2023) on COCO captioning dataset (4-shot, INT4-g128 quantization). Our method significantly improves the captioning quality compared to the round-to-nearest (RTN) baseline. We color the text to show the correct or wrong captions. to provide accurate and efficient quantization. We perform experiments with the OpenFlamingo-9B model (Awadalla …
Figure 8
Left: AWQ needs a much smaller calibration set to reach a good quantized performance. It can achieve better perplexity using 10× smaller calibration set compared to GPTQ. Right: Our method is more robust to the calibration set distribution. Overall, using the same calibration and evaluation distribution works the best (PubMed-PubMed, Enron-Enron). But when using a different calibration distribution (PubMed…
Figure 9
TinyChat provides a turn-key solution to transform the theoretical memory footprint reduction into a quantifiable speedup. As a result, TinyChat is up to 3.9× and 3.5× faster than the FP16 implementation from Huggingface on 4090 (desktop GPU) and Orin (mobile GPU), respectively. AWQ also democratizes Llama-2-13B deployment on laptop GPUs (4070) with merely 8GB memory. RTN completely fails, and AWQ brings s…
Figure 10
TinyChat offers 1.2-3.0× speedup over existing systems when running 4-bit quantized Llama models on NVIDIA Jetson Orin. It also supports a diverse range of general-purpose and coding-specific LLMs with at least 2.6× speedup over AutoGPTQ, which also supports all these workloads. Moreover, TinyChat seamlessly operates on Raspberry Pi and enables the deployment of LLMs with up to 7 billion parameters on ext…
original abstract

Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hardware resource pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important. Protecting only 1% salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not weights. To avoid the hardware-inefficient mix-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation to scale the salient weight channels to protect them. The scale is determined by collecting the activation statistics offline. AWQ does not rely on any backpropagation or reconstruction, so it generalizes to different domains and modalities without overfitting the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs. With kernel fusion and platform-aware weight packing, TinyChat offers more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.
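
The weight packing mentioned in the abstract builds on a simple idea: two 4-bit values share one byte and are recovered at runtime with AND and shift operations. The sketch below shows only that basic nibble packing. It does not reproduce TinyChat's platform-aware or SIMD-aware reordering, and `pack_int4` and `unpack_int4` are hypothetical helper names.

```python
def pack_int4(values):
    """Pack unsigned 4-bit values (0..15) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed):
    """Recover the 4-bit values with a mask (AND) and a shift."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble
        out.append(byte >> 4)    # high nibble
    return out


codes = [1, 15, 0, 7]
packed = pack_int4(codes)
print(packed.hex(), unpack_int4(packed))
```

Four 4-bit codes occupy two bytes here, and the round trip returns the original list. A production kernel would additionally choose the storage order to match the target's vector width, which this sketch leaves out.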

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Activation-aware Weight Quantization (AWQ) for low-bit weight-only quantization of LLMs. It claims that protecting only 1% of salient weight channels—identified from activation magnitudes rather than weight magnitudes—greatly reduces quantization error. An equivalent transformation scales these channels, with per-channel scales derived from offline activation statistics on a calibration set. The method requires no backpropagation or reconstruction, asserts generalization across domains and modalities without overfitting, and reports superior results on language modeling, coding, math, instruction-tuned, and multi-modal benchmarks. It also introduces the TinyChat inference engine for >3x speedup on 4-bit models.

Significance. If the central claims hold, AWQ offers a practical, hardware-friendly quantization technique that avoids reconstruction and mixed precision while leveraging activation statistics for salience. The reconstruction-free design and reported generalization to instruction-tuned and multi-modal models are strengths that could facilitate on-device LLM deployment. The accompanying TinyChat framework adds engineering value for efficient inference.

major comments (2)
  1. [Abstract and Section 3] Salient channel identification: the claim that protecting only 1% salient weights suffices is load-bearing, yet the fraction is a free parameter with no reported ablation on its sensitivity or justification for the specific 1% value across model scales.
  2. [Section 3.2] Activation statistics and scaling derivation: the offline calibration-set procedure for selecting the 1% channels and computing scales is load-bearing for the generalization claim. The paper should demonstrate stability of the selected channels under distribution shift (e.g., via cross-domain or cross-calibration-set experiments), as mismatch would render the fixed scaling suboptimal even though the mathematical equivalence holds for the chosen scales.
minor comments (2)
  1. [Abstract] The abstract states that a mathematical derivation exists for the scaling transformation but does not present it; the main text should explicitly reference the relevant equation(s) so readers can verify the equivalence without mixed-precision hardware.
  2. [Experiments] Benchmark tables would benefit from error bars or multiple random seeds to allow assessment of whether reported gains are statistically reliable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback and the recommendation for minor revision. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

point-by-point responses
  1. Referee: [Abstract and Section 3] Salient channel identification: the claim that protecting only 1% salient weights suffices is load-bearing, yet the fraction is a free parameter with no reported ablation on its sensitivity or justification for the specific 1% value across model scales.

    Authors: We selected the 1% fraction based on the observation that a small percentage of channels have significantly larger activation magnitudes, as shown in our analysis. This value provides an effective trade-off and has been validated across various model sizes in our experiments. We agree that an ablation study on the sensitivity to this hyperparameter would be beneficial and will include it in the revised version, along with results for different fractions. revision: yes

  2. Referee: [Section 3.2] Activation statistics and scaling derivation: the offline calibration-set procedure for selecting the 1% channels and computing scales is load-bearing for the generalization claim. The paper should demonstrate stability of the selected channels under distribution shift (e.g., via cross-domain or cross-calibration-set experiments), as mismatch would render the fixed scaling suboptimal even though the mathematical equivalence holds for the chosen scales.

    Authors: While the paper shows strong generalization to instruction-tuned and multi-modal models using a fixed calibration set, we acknowledge the value of explicit experiments on channel stability. We will add results demonstrating the overlap of selected salient channels across different calibration sets and domains in the revision to further support the robustness of our approach. revision: yes
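
The channel-overlap measurement promised in the second response could be sketched as follows. This assumes salience is ranked by mean absolute activation per input channel and that overlap is reported as Jaccard similarity of the top-fraction sets; both choices, the function names, and the tiny hand-written activations are assumptions for illustration, not the authors' protocol.

```python
def mean_abs_per_channel(activations):
    """activations: list of token vectors; returns mean |x| for each channel."""
    n_tokens = len(activations)
    n_channels = len(activations[0])
    return [sum(abs(tok[c]) for tok in activations) / n_tokens for c in range(n_channels)]


def salient_channels(activations, fraction=0.01):
    """Indices of the top `fraction` of channels by mean absolute activation."""
    stats = mean_abs_per_channel(activations)
    k = max(1, round(fraction * len(stats)))
    ranked = sorted(range(len(stats)), key=lambda c: stats[c], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    return len(a & b) / len(a | b)


# Toy "calibration sets" with 4 channels each (fraction=0.25 selects one channel).
calib_a = [[0.1, 5.0, 0.2, 0.1], [0.2, -4.0, 0.1, 0.3]]
calib_b = [[0.3, 6.0, 0.1, 0.2]]
calib_c = [[9.0, 0.1, 0.1, 0.1]]

sal_a = salient_channels(calib_a, fraction=0.25)
print(jaccard(sal_a, salient_channels(calib_b, fraction=0.25)))  # same outlier channel
print(jaccard(sal_a, salient_channels(calib_c, fraction=0.25)))  # different outlier channel
```

An overlap near 1 across calibration domains would support the stability premise; an overlap near 0 would indicate that the selected channels depend strongly on the calibration data.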

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper computes per-channel scaling factors directly from offline activation statistics on a calibration set and applies an equivalent transformation whose error-reduction property is derived mathematically. No step uses the final quantization error, downstream performance metric, or reconstruction loss to set the scaling; the 1% salient-channel selection is likewise a direct magnitude computation on the collected activations. The text explicitly states the method avoids backpropagation or reconstruction. No self-citations, self-definitional loops, or fitted-input-called-prediction patterns appear in the provided derivation chain. The central claim therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the assumption that activation statistics provide a stable signal for importance and that the scaling transformation is mathematically equivalent without side effects on model output.

free parameters (1)
  • salient weight fraction
    The 1% figure used to select channels for protection is stated without derivation from first principles.
axioms (1)
  • domain assumption: Scaling salient weight channels via equivalent transformation reduces quantization error without altering the model's computation graph.
    Invoked to justify avoiding mixed-precision hardware costs.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 53 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

    cs.DC 2026-05 conditional novelty 7.0

    LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browse...

  2. When Bits Break Recourse: Counterfactual-Faithful Quantization

    cs.LG 2026-05 unverdicted novelty 7.0

    CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.

  3. Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

    stat.ML 2026-05 unverdicted novelty 7.0

    MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 f...

  4. When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

    cs.PF 2026-05 unverdicted novelty 7.0

    A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.

  5. Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

    cs.LG 2026-04 unverdicted novelty 7.0

    High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perp...

  6. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

  7. CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Transformers

    cs.LG 2026-02 unverdicted novelty 7.0

    CORP performs one-shot structured pruning of Transformers by modeling removed components as affine functions of retained ones and solving closed-form ridge regressions on calibration data to fold compensation into wei...

  8. Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

    cs.CL 2025-12 conditional novelty 7.0

    Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

  9. Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs

    cs.LG 2025-11 unverdicted novelty 7.0

    Low-rank compression preserves training-data privacy and improves adversarial robustness but weakens personal-information protection, reduces ethical behavior in zero-shot use, and harms fairness.

  10. SpinQuant: LLM quantization with learned rotations

    cs.LG 2024-05 conditional novelty 7.0

    SpinQuant learns optimal rotations to enable accurate 4-bit quantization of LLM weights, activations, and KV cache, reducing the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.

  11. RouterBench: A Benchmark for Multi-LLM Routing System

    cs.LG 2024-03 unverdicted novelty 7.0

    RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing systems.

  12. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

    cs.CL 2024-02 unverdicted novelty 7.0

    BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.

  13. Massive Activations in Large Language Models

    cs.CL 2024-02 unverdicted novelty 7.0

    Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

  14. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  15. LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

    cs.LG 2026-05 unverdicted novelty 6.0

    The Shannon Scaling Law treats LLM training as noisy-channel transmission and predicts U-shaped performance degradation when signal-to-noise ratio falls below a threshold, outperforming monotonic scaling laws on Pythi...

  16. A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU + RMSNorm Block under Ternary Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    Sign-flip perturbations produce π/(π-2) ≈ 2.75 times more transverse output energy than equal-norm sign-preserving perturbations in a ReLU + RMSNorm block because ReLU creates directional asymmetry that RMSNorm's tran...

  17. Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

    cs.CV 2026-05 conditional novelty 6.0

    VLA-AD distills 7B VLA teachers into 158M students using offline VLM semantic guidance on task phases and directions, matching teacher performance on LIBERO with 44x size reduction and 3.28x speedup.

  18. Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study

    cs.SE 2026-05 accept novelty 6.0

    Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.

  19. OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.

  20. OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.

  21. Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization

    cs.LG 2026-04 unverdicted novelty 6.0

    ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.

  22. BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment

    cs.LG 2026-04 unverdicted novelty 6.0

    BitRL enables on-device RL agents via 1-bit quantized language models, delivering 10-16x memory reduction and 3-5x energy efficiency gains with 85-98% retained performance.

  23. MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_...

  24. Parcae: Scaling Laws For Stable Looped Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

  25. Quantization Dominates Rank Reduction for KV-Cache Compression

    cs.LG 2026-04 conditional novelty 6.0

    Quantization of the KV cache beats rank reduction for matched storage budgets by 4-364 PPL, because dimension removal can flip attention token selection under softmax while bounded quantization noise usually preserves...

  26. Rethinking Residual Errors in Compensation-based LLM Quantization

    cs.LG 2026-04 conditional novelty 6.0

    Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.

  27. FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

    cs.LG 2026-04 unverdicted novelty 6.0

    Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

  28. RUQuant: Towards Refining Uniform Quantization for Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    RUQuant uses block-wise composite orthogonal matrices from Householder reflections and Givens rotations plus a fine-tuned global reflection to achieve 99.8% full-precision accuracy at W6A6 and 97% at W4A4 for 13B LLMs...

  29. Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

    cs.LG 2025-12 unverdicted novelty 6.0

    A post-training 1-bit quantization method for LLMs that fixes error accumulation and anisotropic representation distortion to outperform prior weight-driven and naive output-driven baselines.

  30. You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations

    cs.CL 2025-11 conditional novelty 6.0

    TAQ estimates per-layer importance from hidden representations and output sensitivity on task calibration data to allocate mixed precision in a training-free PTQ setting, outperforming task-agnostic baselines on accur...

  31. LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

    cs.LG 2025-03 unverdicted novelty 6.0

    LogQuant applies log-based filtering for 2-bit KV cache quantization in LLMs, claiming 25% higher throughput, 60% larger batches, and 40-200% accuracy gains on math/code tasks versus existing compression approaches.

  32. Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    eess.AS 2024-06 unverdicted novelty 6.0

    Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.

  33. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    cs.CL 2024-02 conditional novelty 6.0

    KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.

  34. SGLang: Efficient Execution of Structured Language Model Programs

    cs.AI 2023-12 conditional novelty 6.0

    SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

  35. ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

    cs.CL 2023-12 unverdicted novelty 6.0

    ASVD compresses LLMs by 10-30% and KV caches by 50% via activation-aware SVD that absorbs outliers into transformed weights and calibrates per-layer sensitivity.

  36. H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

    cs.LG 2023-06 unverdicted novelty 6.0

    H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.

  37. Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

    cs.CE 2026-05 unverdicted novelty 5.0

    LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

  38. RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI

    cs.CL 2026-05 unverdicted novelty 5.0

    LoRA fine-tuning of 3-4B SLMs on 162K multi-task radiology data yields strong performance deployable on consumer CPUs at 4-8 tokens/second.

  39. Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    Orchestrating one 8B model in three roles at inference time doubles task completion on AppWorld from 5.4% to 8.9%, surpassing a 33B baseline.

  40. Fast NF4 Dequantization Kernels for Large Language Model Inference

    cs.LG 2026-04 unverdicted novelty 5.0

    A lightweight shared-memory technique for NF4 dequantization kernels yields 2.0-2.2x kernel speedup and 1.54x end-to-end gains on models up to 70B parameters while using only 64 bytes of shared memory per block.

  41. Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence

    cs.SE 2026-03 unverdicted novelty 5.0

    Empirical case study on a flagship Android device profiles energy, latency, and quality trade-offs across eight LLMs, revealing a quantization energy paradox and identifying mid-sized models as practical sweet spots.

  42. Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

    cs.LG 2025-12 unverdicted novelty 5.0

    A post-training quantization technique for 1-bit LLMs that corrects layer-wise error accumulation and anisotropic representation distortion to preserve output behavior more effectively than existing methods.

  43. ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

    cs.PF 2025-08 unverdicted novelty 5.0

    ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference wh...

  44. AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning

    cs.CL 2024-10 unverdicted novelty 5.0

    AdaSwitch improves small local LLM performance on reasoning tasks by adaptively switching to a large cloud LLM upon detected errors, sometimes matching cloud results with far less overhead.

  45. StatQAT: Statistical Quantizer Optimization for Deep Networks

    stat.ML 2026-05 unverdicted novelty 4.0

    A statistical error analysis framework yields iterative and analytic quantizers that improve accuracy and stability when incorporated into quantization-aware training for integer and floating-point formats.

  46. DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization

    cs.CV 2026-04 unverdicted novelty 4.0

    DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3...

  47. Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference

    cs.AI 2026-04 unverdicted novelty 4.0

    A quantized int4 version of Nemotron ASR runs faster than real-time on CPU at 8.20% WER and 0.67 GB size, setting a new efficiency point for on-device streaming speech recognition.

  48. Secure On-Premise Deployment of Open-Weights Large Language Models in Radiology: An Isolation-First Architecture with Prospective Pilot Evaluation

    cs.CY 2026-03 conditional novelty 4.0

    An isolation-first on-premise architecture for open-weights LLMs in radiology achieved regulatory approval for processing PHI and showed good utility for text-anchored tasks in a one-week pilot with 22 users.

  49. Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding

    cs.CL 2026-05 unverdicted novelty 3.0

    A RAG pipeline with contextual PDF chunking, question-and-answer-aware retrieval and reranking using Qwen3 models reaches 0.96 accuracy on a Ukrainian multi-domain document QA shared task.

  50. LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems

    cs.LG 2026-01 unverdicted novelty 3.0

    A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.

  51. Precision or Peril: A PoC of Python Code Quality from Quantized Large Language Models

    cs.SE 2024-11 unverdicted novelty 3.0

    Smaller LLMs produce functional but limited Python code with variable quantization effects and quality/maintainability concerns that require validation before use.

  52. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

  53. Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security

    cs.HC 2024-01 unverdicted novelty 3.0

    This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 51 Pith papers · 22 internal anchors

  1. [1]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432,

  2. [2]

    GPT-NeoX-20B: An Open-Source Autoregressive Language Model

    Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., et al. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745,

  3. [3]

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A....

  4. [4]

    TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

    Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI),

  5. [5]

    Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., and Zitnick, C. L. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325,

  6. [6]

    PACT: Parameterized Clipping Activation for Quantized Neural Networks

    Choi, J., Wang, Z., Venkataramani, S., Chuang, P. I.-J., Srinivasan, V., and Gopalakrishnan, K. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085,

  7. [7]

    Scaling Instruction-Finetuned Language Models

    Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416,

  8. [8]

    The case for 4-bit precision: k-bit inference scaling laws

    Dettmers, T. and Zettlemoyer, L. The case for 4-bit precision: k-bit inference scaling laws. arXiv preprint arXiv:2212.09720,

  9. [9]

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. Llm.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339,

  10. [10]

    PaLM-E: An Embodied Multimodal Language Model

    Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378,

  11. [11]

    Learned step size quantization

    Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. arXiv preprint arXiv:1902.08153,

  12. [12]

    The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

    Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635,

  13. [13]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323,

  14. [14]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., and Ji, R. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394,

  15. [15]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,

  16. [16]

    A survey of quantization methods for efficient neural network inference

    Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., and Keutzer, K. A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630,

  17. [17]

    Mistral 7B

    Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825,

  18. [18]

    Who says elephants can’t run: Bringing large scale moe models into cloud scale production

    Kim, Y. J., Henry, R., Fahim, R., and Awadalla, H. H. Who says elephants can’t run: Bringing large scale moe models into cloud scale production. arXiv preprint arXiv:2211.10017,

  19. [19]

    The enron corpus: A new dataset for email classification research

    Klimt, B. and Yang, Y. The enron corpus: A new dataset for email classification research. In Machine Learning: ECML 2004: 15th European Conference on Machine Learning, Pisa, Italy, September 20-24,

  20. [20]

    Grounding language models to images for multimodal generation

    Koh, J. Y., Salakhutdinov, R., and Fried, D. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823,

  21. [21]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., and Shan, Y. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023a. Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, ...

  22. [22]

    Evaluating Object Hallucination in Large Vision-Language Models

    Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023d. Lin, J., Chen, W.-M., Lin, Y., Gan, C., Han, S., et al. Mcunet: Tiny deep learning on iot devices. Advances in Neural Information Processing Systems, 33:11711–11722,

  23. [23]

    Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. 2023a. Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023b. Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu, S.-C., Tafjord, O., Cl...

  24. [24]

    A White Paper on Neural Network Quantization

    Nagel, M., Fournarakis, M., Amjad, R. A., Bondarenko, Y., Van Baalen, M., and Blankevoort, T. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295,

  25. [25]

    nuqmm: Quantized matmul for efficient inference of large-scale generative language models

    Park, G., Park, B., Kwon, S. J., Kim, B., Lee, Y., and Lee, D. nuqmm: Quantized matmul for efficient inference of large-scale generative language models. arXiv preprint arXiv:2206.09557,

  26. [26]

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116,

  27. [27]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207,

  28. [28]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100,

  29. [29]

    High-throughput generative inference of large language models with a single gpu

    Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Fu, D. Y., Xie, Z., Chen, B., Barrett, C., Gonzalez, J. E., et al. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865,

  30. [30]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, ...

  31. [32]

    HAQ: Hardware-Aware Automated Quantization with Mixed Precision

    Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-Aware Automated Quantization with Mixed Precision. In CVPR,

  32. [33]

    Finetuned Language Models Are Zero-Shot Learners

    Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652,

  33. [34]

    Outlier suppression: Pushing the limit of low-bit transformer language models

    Wei, X., Zhang, Y., Zhang, X., Gong, R., Zhang, S., Zhang, Q., Yu, F., and Liu, X. Outlier suppression: Pushing the limit of low-bit transformer language models, 2022a. URL https://arxiv.org/abs/2209.13325. Wei, X., Zhang, Y., Zhang, X., Gong, R., Zhang, S., Zhang, Q., Yu, F., and Liu, X. Outlier suppression: Pushing the limit of low-bit transformer lan...

  34. [35]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    Xiao, G., Lin, J., Seznec, M., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. arXiv preprint arXiv:2211.10438,

  35. [36]

    Mm-vet: Evaluating large multimodal models for integrated capabilities

    Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490,

  36. [37]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Zhang, R., Han, J., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., Gao, P., and Qiao, Y. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199,