AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Pith reviewed 2026-05-24 08:25 UTC · model grok-4.3
The pith
Protecting 1% of salient weights via activation scaling sharply reduces LLM quantization error.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AWQ shows that referring to activation distributions, not weight magnitudes, identifies the 1% of salient channels whose protection cuts quantization error dramatically. An equivalent transformation scales these channels to reduce error while keeping the computation unchanged, with the scale factor derived from offline activation statistics. The method requires no backpropagation and avoids overfitting the calibration set, allowing direct application to instruction-tuned and multi-modal models.
What carries the argument
An equivalent transformation that scales salient weight channels according to activation statistics to protect them during quantization.
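The transformation described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the symmetric group-wise round-to-nearest quantizer, the synthetic data, the planted outlier channels, and the fixed scale of 2 are all assumptions chosen for illustration. It checks that scaling weight columns by s and dividing the matching activations by s leaves the full-precision output unchanged, then compares quantization error with and without the scaling.

```python
import numpy as np

def quantize_rtn(w, n_bits=4, group_size=32):
    """Symmetric round-to-nearest quantization, one step size per group of
    `group_size` consecutive input channels (a simplifying assumption)."""
    out_dim, in_dim = w.shape
    groups = w.reshape(out_dim, in_dim // group_size, group_size)
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(groups).max(axis=-1, keepdims=True) / qmax
    q = np.clip(np.round(groups / step), -qmax, qmax)
    return (q * step).reshape(out_dim, in_dim)

rng = np.random.default_rng(0)
out_dim, in_dim, n_tokens = 256, 512, 128
w = rng.normal(size=(out_dim, in_dim))          # synthetic weights
x = rng.normal(size=(n_tokens, in_dim))         # synthetic calibration activations
planted = rng.choice(in_dim, size=5, replace=False)
x[:, planted] *= 50.0                           # hypothetical outlier channels

# Pick the top 1% of input channels by mean absolute activation.
act_mag = np.abs(x).mean(axis=0)
k = max(1, in_dim // 100)
salient = np.argsort(act_mag)[-k:]

# Per-channel scale: 2 on salient channels, 1 elsewhere (illustrative value).
s = np.ones(in_dim)
s[salient] = 2.0

y_ref = x @ w.T
y_equiv = (x / s) @ (w * s).T                   # same function in full precision
y_rtn = x @ quantize_rtn(w).T                   # plain quantization
y_scaled = (x / s) @ quantize_rtn(w * s).T      # quantize the scaled weights

err_rtn = float(np.mean((y_rtn - y_ref) ** 2))
err_scaled = float(np.mean((y_scaled - y_ref) ** 2))
print(f"plain RTN error: {err_rtn:.4f}, scaled error: {err_scaled:.4f}")
```

In this toy setting the division by s can be folded into the preceding operation, so the scaled model needs no mixed-precision storage; whether the error gap carries over to a real model depends on its actual weight and activation distributions.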
If this is right
- 4-bit quantized LLMs match or exceed prior methods on language modeling, coding, and math benchmarks.
- The same weights-only approach works without modification on instruction-tuned and multi-modal models.
- Kernel-fused inference yields more than 3x speedup over the Hugging Face FP16 implementation on both desktop and mobile GPUs.
- The 70B Llama-2 model becomes deployable on mobile GPUs.
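A rough back-of-the-envelope calculation shows why 4-bit weights matter for the last point. The sketch below assumes a nominal 70e9 parameters, a group size of 128, and one 16-bit scale per group; none of these figures is stated on this page, and activations and the KV cache are ignored.

```python
def weight_memory_gb(n_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).
    Optionally adds one scale of `scale_bits` bits per quantization group."""
    total_bits = n_params * bits_per_weight
    if group_size is not None:
        total_bits += (n_params / group_size) * scale_bits
    return total_bits / 8 / 1e9

n_params = 70e9  # nominal parameter count, an assumption
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4, group_size=128)
print(f"FP16: {fp16_gb:.2f} GB, 4-bit with group scales: {int4_gb:.2f} GB")
```

Under these assumptions the weights shrink from about 140 GB to about 36 GB, roughly a 3.9x reduction.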
Where Pith is reading between the lines
- Activation magnitude may serve as a general proxy for parameter importance in other compression methods such as pruning.
- Offline calibration could simplify deployment pipelines by removing the need for per-domain retraining of quantized models.
- The scaling idea might transfer to reducing quantization error in non-transformer architectures.
Load-bearing premise
Activation statistics collected from a calibration set remain representative when the quantized model encounters new domains or fine-tuned versions.
What would settle it
Apply AWQ to an instruction-tuned model whose fine-tuning data lies far outside the calibration distribution and check whether perplexity or task accuracy drops more than with prior quantization methods.
Original abstract
Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hardware resource pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important. Protecting only 1% salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not weights. To avoid the hardware-inefficient mix-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation to scale the salient weight channels to protect them. The scale is determined by collecting the activation statistics offline. AWQ does not rely on any backpropagation or reconstruction, so it generalizes to different domains and modalities without overfitting the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs. With kernel fusion and platform-aware weight packing, TinyChat offers more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.
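The abstract mentions weight packing without describing the layout. As a generic illustration only, the sketch below packs two unsigned 4-bit values into each byte, low nibble first. The function names and the nibble order are hypothetical; TinyChat's platform-aware layout is not described on this page and may differ.

```python
import numpy as np

def pack_int4(q):
    """Pack an even-length array of values in [0, 15] into bytes,
    two values per byte, low nibble first (an assumed layout)."""
    q = np.asarray(q, dtype=np.uint8)
    return (q[0::2] | (q[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Inverse of pack_int4."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

rng = np.random.default_rng(0)
values = rng.integers(0, 16, size=64, dtype=np.uint8)
packed = pack_int4(values)
restored = unpack_int4(packed)
print(values.nbytes, "bytes unpacked ->", packed.nbytes, "bytes packed")
```

The packed array uses half the bytes of the unpacked one, and unpacking restores the original values exactly.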
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Activation-aware Weight Quantization (AWQ) for low-bit weight-only quantization of LLMs. It claims that protecting only 1% of salient weight channels—identified from activation magnitudes rather than weight magnitudes—greatly reduces quantization error. An equivalent transformation scales these channels, with per-channel scales derived from offline activation statistics on a calibration set. The method requires no backpropagation or reconstruction, asserts generalization across domains and modalities without overfitting, and reports superior results on language modeling, coding, math, instruction-tuned, and multi-modal benchmarks. It also introduces the TinyChat inference engine for >3x speedup on 4-bit models.
Significance. If the central claims hold, AWQ offers a practical, hardware-friendly quantization technique that avoids reconstruction and mixed precision while leveraging activation statistics for salience. The reconstruction-free design and reported generalization to instruction-tuned and multi-modal models are strengths that could facilitate on-device LLM deployment. The accompanying TinyChat framework adds engineering value for efficient inference.
major comments (2)
- [Abstract and Section 3] Salient channel identification: the claim that protecting only 1% of salient weights suffices is load-bearing, yet the fraction is a free parameter with no reported ablation on its sensitivity and no justification for the specific 1% value across model scales.
- [Section 3.2] Activation statistics and scaling derivation: the offline calibration-set procedure for selecting the 1% channels and computing scales is load-bearing for the generalization claim. The paper should demonstrate stability of the selected channels under distribution shift (e.g., via cross-domain or cross-calibration-set experiments), as a mismatch would render the fixed scaling suboptimal even though the mathematical equivalence holds for the chosen scales.
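The ablation requested in the first major comment could take roughly the following shape. This is a toy sketch on synthetic data, not a result from the paper: the quantizer, the planted outlier channels, and the list of fractions are assumptions. It keeps a chosen fraction of input channels in full precision and compares selecting them by activation magnitude with selecting them by weight magnitude.

```python
import numpy as np

def quantize_rtn(w, n_bits=3, group_size=64):
    """Symmetric group-wise round-to-nearest quantizer (a simplification)."""
    out_dim, in_dim = w.shape
    g = w.reshape(out_dim, in_dim // group_size, group_size)
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(g).max(axis=-1, keepdims=True) / qmax
    return (np.clip(np.round(g / step), -qmax, qmax) * step).reshape(out_dim, in_dim)

def top_fraction(scores, fraction):
    """Indices of the highest-scoring channels; empty when fraction is 0."""
    k = int(round(fraction * scores.size))
    return np.argsort(scores)[::-1][:k]

def error_with_protection(w, x, keep_idx):
    """Output MSE when the channels in keep_idx stay in full precision."""
    w_mixed = quantize_rtn(w)
    w_mixed[:, keep_idx] = w[:, keep_idx]
    return float(np.mean((x @ w_mixed.T - x @ w.T) ** 2))

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 1024))
x = rng.normal(size=(64, 1024))
x[:, rng.choice(1024, size=10, replace=False)] *= 30.0  # hypothetical outliers

act_score = np.abs(x).mean(axis=0)      # activation-based salience
weight_score = np.abs(w).mean(axis=0)   # weight-based salience

fractions = [0.0, 0.001, 0.01, 0.03]
err_by_act = {f: error_with_protection(w, x, top_fraction(act_score, f)) for f in fractions}
err_by_weight = {f: error_with_protection(w, x, top_fraction(weight_score, f)) for f in fractions}
for f in fractions:
    print(f"fraction={f:.3f}  by activation: {err_by_act[f]:.3f}  by weight: {err_by_weight[f]:.3f}")
```

On real models the curve would need to be measured per layer and per model scale, which is what the referee is asking the authors to report.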
minor comments (2)
- [Abstract] The abstract states that a mathematical derivation exists for the scaling transformation but does not present it; the main text should explicitly reference the relevant equation(s) so readers can verify the equivalence without mixed-precision hardware.
- [Experiments] Benchmark tables would benefit from error bars or multiple random seeds to allow assessment of whether reported gains are statistically reliable.
Simulated Author's Rebuttal
Thank you for the constructive feedback and the recommendation for minor revision. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.
- Referee: [Abstract and Section 3] Salient channel identification: the claim that protecting only 1% of salient weights suffices is load-bearing, yet the fraction is a free parameter with no reported ablation on its sensitivity and no justification for the specific 1% value across model scales.
  Authors: We selected the 1% fraction based on the observation that a small percentage of channels have significantly larger activation magnitudes, as shown in our analysis. This value provides an effective trade-off and has been validated across various model sizes in our experiments. We agree that an ablation study on the sensitivity to this hyperparameter would be beneficial and will include it in the revised version, along with results for different fractions.
  Revision: yes
- Referee: [Section 3.2] Activation statistics and scaling derivation: the offline calibration-set procedure for selecting the 1% channels and computing scales is load-bearing for the generalization claim. The paper should demonstrate stability of the selected channels under distribution shift (e.g., via cross-domain or cross-calibration-set experiments), as a mismatch would render the fixed scaling suboptimal even though the mathematical equivalence holds for the chosen scales.
  Authors: While the paper shows strong generalization to instruction-tuned and multi-modal models using a fixed calibration set, we acknowledge the value of explicit experiments on channel stability. We will add results demonstrating the overlap of selected salient channels across different calibration sets and domains in the revision to further support the robustness of our approach.
  Revision: yes
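The channel-overlap measurement mentioned in the second response could be computed along these lines. The sketch uses purely synthetic "calibration sets" and hypothetical helper names (`salient_channels`, `jaccard`); it says nothing about how stable the channels are in a real model.

```python
import numpy as np

def salient_channels(acts, fraction=0.01):
    """Set of channel indices with the largest mean absolute activation."""
    mag = np.abs(acts).mean(axis=0)
    k = max(1, int(round(fraction * mag.size)))
    return set(np.argsort(mag)[::-1][:k].tolist())

def jaccard(a, b):
    """Intersection over union of two index sets."""
    return len(a & b) / len(a | b)

rng = np.random.default_rng(0)
n_channels = 1000

# Two synthetic sets sharing the same heavy channels (0..9) but drawn from
# different distributions, standing in for two calibration domains.
scale_shared = np.ones(n_channels)
scale_shared[np.arange(10)] = 20.0
set_a = rng.normal(size=(64, n_channels)) * scale_shared
set_b = rng.laplace(size=(64, n_channels)) * scale_shared

# A third synthetic set whose heavy channels (10..19) are different.
scale_shifted = np.ones(n_channels)
scale_shifted[np.arange(10, 20)] = 20.0
set_c = rng.normal(size=(64, n_channels)) * scale_shifted

overlap_ab = jaccard(salient_channels(set_a), salient_channels(set_b))
overlap_ac = jaccard(salient_channels(set_a), salient_channels(set_c))
print(f"overlap A/B: {overlap_ab:.2f}, overlap A/C: {overlap_ac:.2f}")
```

A high overlap across real calibration sets would support the generalization claim; a low one would indicate that the fixed scales depend on the calibration data.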
Circularity Check
No significant circularity; derivation is self-contained
The paper computes per-channel scaling factors directly from offline activation statistics on a calibration set and applies an equivalent transformation whose error-reduction property is derived mathematically. No step uses the final quantization error, downstream performance metric, or reconstruction loss to set the scaling; the 1% salient-channel selection is likewise a direct magnitude computation on the collected activations. The text explicitly states the method avoids backpropagation or reconstruction. No self-citations, self-definitional loops, or fitted-input-called-prediction patterns appear in the provided derivation chain. The central claim therefore remains independent of its own outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- salient weight fraction
axioms (1)
- domain assumption: Scaling salient weight channels via equivalent transformation reduces quantization error without altering the model's computation graph.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: s∗ = arg min_s L(s) … s = s_X^α, α∗ = arg min_α L(s_X^α) … scale is determined by collecting the activation statistics offline
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: AWQ does not rely on any backpropagation or reconstruction, so it generalizes … without overfitting the calibration set
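The first quoted passage writes the scale as s = s_X^α, with α chosen to minimize a loss L. A minimal sketch of such a search follows, assuming that L is the mean squared error of the layer output on calibration activations, that s_X is the per-channel mean absolute activation, and that α ranges over 11 evenly spaced values in [0, 1]. These choices, the quantizer, and the function name `search_alpha` are illustrative assumptions. No gradients are involved; each candidate is scored with a forward computation only.

```python
import numpy as np

def quantize_rtn(w, n_bits=4, group_size=32):
    """Symmetric group-wise round-to-nearest quantizer (a simplification)."""
    out_dim, in_dim = w.shape
    g = w.reshape(out_dim, in_dim // group_size, group_size)
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(g).max(axis=-1, keepdims=True) / qmax
    return (np.clip(np.round(g / step), -qmax, qmax) * step).reshape(out_dim, in_dim)

def search_alpha(w, x, n_grid=11):
    """Grid-search alpha in [0, 1] for s = s_X ** alpha.
    Returns the best alpha and the loss for every grid point."""
    act_mag = np.abs(x).mean(axis=0)            # s_X, per input channel
    y_ref = x @ w.T
    alphas = np.linspace(0.0, 1.0, n_grid)
    losses = []
    for alpha in alphas:
        s = act_mag ** alpha                    # alpha = 0 gives plain quantization
        y_q = (x / s) @ quantize_rtn(w * s).T
        losses.append(float(np.mean((y_q - y_ref) ** 2)))
    best = int(np.argmin(losses))
    return float(alphas[best]), losses

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 512))
x = rng.normal(size=(64, 512))
x[:, rng.choice(512, size=5, replace=False)] *= 30.0  # hypothetical outliers

best_alpha, losses = search_alpha(w, x)
print(f"best alpha: {best_alpha:.1f}, loss at alpha=0: {losses[0]:.4f}, best loss: {min(losses):.4f}")
```

Because α = 0 is on the grid, the selected scale can never score worse than unscaled quantization on the calibration data under this loss.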
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 53 Pith papers
-
Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browse...
-
When Bits Break Recourse: Counterfactual-Faithful Quantization
CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.
-
Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference
MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 f...
-
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
-
Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales
High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perp...
-
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
-
CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Transformers
CORP performs one-shot structured pruning of Transformers by modeling removed components as affine functions of retained ones and solving closed-form ridge regressions on calibration data to fold compensation into wei...
-
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
-
Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs
Low-rank compression preserves training-data privacy and improves adversarial robustness but weakens personal-information protection, reduces ethical behavior in zero-shot use, and harms fairness.
-
SpinQuant: LLM quantization with learned rotations
SpinQuant learns optimal rotations to enable accurate 4-bit quantization of LLM weights, activations, and KV cache, reducing the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.
-
RouterBench: A Benchmark for Multi-LLM Routing System
RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing systems.
-
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.
-
Massive Activations in Large Language Models
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
The Shannon Scaling Law treats LLM training as noisy-channel transmission and predicts U-shaped performance degradation when signal-to-noise ratio falls below a threshold, outperforming monotonic scaling laws on Pythi...
-
A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU + RMSNorm Block under Ternary Quantization
Sign-flip perturbations produce π/(π-2) ≈ 2.75 times more transverse output energy than equal-norm sign-preserving perturbations in a ReLU + RMSNorm block because ReLU creates directional asymmetry that RMSNorm's tran...
-
Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation
VLA-AD distills 7B VLA teachers into 158M students using offline VLM semantic guidance on task phases and directions, matching teacher performance on LIBERO with 44x size reduction and 3.28x speedup.
-
Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study
Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.
-
Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.
-
BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment
BitRL enables on-device RL agents via 1-bit quantized language models, delivering 10-16x memory reduction and 3-5x energy efficiency gains with 85-98% retained performance.
-
MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_...
-
Parcae: Scaling Laws For Stable Looped Language Models
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
-
Quantization Dominates Rank Reduction for KV-Cache Compression
Quantization of the KV cache beats rank reduction for matched storage budgets by 4-364 PPL, because dimension removal can flip attention token selection under softmax while bounded quantization noise usually preserves...
-
Rethinking Residual Errors in Compensation-based LLM Quantization
Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.
-
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
-
RUQuant: Towards Refining Uniform Quantization for Large Language Models
RUQuant uses block-wise composite orthogonal matrices from Householder reflections and Givens rotations plus a fine-tuned global reflection to achieve 99.8% full-precision accuracy at W6A6 and 97% at W4A4 for 13B LLMs...
-
Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models
A post-training 1-bit quantization method for LLMs that fixes error accumulation and anisotropic representation distortion to outperform prior weight-driven and naive output-driven baselines.
-
You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations
TAQ estimates per-layer importance from hidden representations and output sensitivity on task calibration data to allocate mixed precision in a training-free PTQ setting, outperforming task-agnostic baselines on accur...
-
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
LogQuant applies log-based filtering for 2-bit KV cache quantization in LLMs, claiming 25% higher throughput, 60% larger batches, and 40-200% accuracy gains on math/code tasks versus existing compression approaches.
-
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.
-
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.
-
SGLang: Efficient Execution of Structured Language Model Programs
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
-
ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models
ASVD compresses LLMs by 10-30% and KV caches by 50% via activation-aware SVD that absorbs outliers into transformed weights and calibrates per-layer sensitivity.
-
H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.
-
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
-
RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI
LoRA fine-tuning of 3-4B SLMs on 162K multi-task radiology data yields strong performance deployable on consumer CPUs at 4-8 tokens/second.
-
Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents
Orchestrating one 8B model in three roles at inference time doubles task completion on AppWorld from 5.4% to 8.9%, surpassing a 33B baseline.
-
Fast NF4 Dequantization Kernels for Large Language Model Inference
A lightweight shared-memory technique for NF4 dequantization kernels yields 2.0-2.2x kernel speedup and 1.54x end-to-end gains on models up to 70B parameters while using only 64 bytes of shared memory per block.
-
Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence
Empirical case study on a flagship Android device profiles energy, latency, and quality trade-offs across eight LLMs, revealing a quantization energy paradox and identifying mid-sized models as practical sweet spots.
-
Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models
A post-training quantization technique for 1-bit LLMs that corrects layer-wise error accumulation and anisotropic representation distortion to preserve output behavior more effectively than existing methods.
-
ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference wh...
-
AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning
AdaSwitch improves small local LLM performance on reasoning tasks by adaptively switching to a large cloud LLM upon detected errors, sometimes matching cloud results with far less overhead.
-
StatQAT: Statistical Quantizer Optimization for Deep Networks
A statistical error analysis framework yields iterative and analytic quantizers that improve accuracy and stability when incorporated into quantization-aware training for integer and floating-point formats.
-
DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization
DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3...
-
Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference
A quantized int4 version of Nemotron ASR runs faster than real-time on CPU at 8.20% WER and 0.67 GB size, setting a new efficiency point for on-device streaming speech recognition.
-
Secure On-Premise Deployment of Open-Weights Large Language Models in Radiology: An Isolation-First Architecture with Prospective Pilot Evaluation
An isolation-first on-premise architecture for open-weights LLMs in radiology achieved regulatory approval for processing PHI and showed good utility for text-anchored tasks in a one-week pilot with 22 users.
-
Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding
A RAG pipeline with contextual PDF chunking, question-and-answer-aware retrieval and reranking using Qwen3 models reaches 0.96 accuracy on a Ukrainian multi-domain document QA shared task.
-
LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems
A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.
-
Precision or Peril: A PoC of Python Code Quality from Quantized Large Language Models
Smaller LLMs produce functional but limited Python code with variable quantization effects and quality/maintainability concerns that require validation before use.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
-
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.
Reference graph
Works this paper leans on
-
[1]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
URL https: //doi.org/10.5281/zenodo.7733589. A WQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration Bengio, Y ., L´eonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for con- ditional computation. arXiv preprint arXiv:1308.3432,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.5281/zenodo.7733589
-
[2]
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., et al. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-V oss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A....
work page 1901
-
[4]
neurips.cc/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper
URL https://proceedings. neurips.cc/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper. pdf. Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y ., Ceze, L., et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI),
work page 2020
-
[5]
Chen, X., Fang, H., Lin, T.-Y ., Vedantam, R., Gupta, S., Doll´ar, P., and Zitnick, C. L. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
PACT: Parameterized Clipping Activation for Quantized Neural Networks
URL https://lmsys.org/blog/ 2023-03-30-vicuna/ . Choi, J., Wang, Z., Venkataramani, S., Chuang, P. I.-J., Srini- vasan, V ., and Gopalakrishnan, K. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085,
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[7]
Scaling Instruction-Finetuned Language Models
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y ., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Dettmers, T. and Zettlemoyer, L. The case for 4-bit pre- cision: k-bit inference scaling laws. arXiv preprint arXiv:2212.09720,
-
[9]
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Dettmers, T., Lewis, M., Belkada, Y ., and Zettlemoyer, L. Llm.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
PaLM-E: An Embodied Multimodal Language Model
Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. arXiv preprint arXiv:1902.08153,
-
[12]
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre- trained transformers. arXiv preprint arXiv:2210.17323,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Fu, C., Chen, P., Shen, Y ., Qin, Y ., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y ., and Ji, R. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., and Keutzer, K. A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630,
-
[17]
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
J., Henry, R., Fahim, R., and Awadalla, H
Kim, Y . J., Henry, R., Fahim, R., and Awadalla, H. H. Who says elephants can’t run: Bringing large scale moe models into cloud scale production. arXiv preprint arXiv:2211.10017,
-
[19]
Klimt, B. and Yang, Y . The enron corpus: A new dataset for email classification research. In Machine Learning: ECML 2004: 15th European Conference on Machine Learning, Pisa, Italy, September 20-24,
work page 2004
-
[20]
Y ., Salakhutdinov, R., and Fried, D
Koh, J. Y ., Salakhutdinov, R., and Fried, D. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823,
-
[21]
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Li, B., Wang, R., Wang, G., Ge, Y ., Ge, Y ., and Shan, Y . Seed-bench: Benchmarking multimodal llms with gener- ative comprehension. arXiv preprint arXiv:2307.16125, 2023a. Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Boot- strapping language-image pre-training with frozen im- age encoders and large language models. arXiv preprint arXiv:2301.12597, ...
-
[22]
Evaluating Object Hallucination in Large Vision-Language Models
Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023d. Lin, J., Chen, W.-M., Lin, Y., Gan, C., Han, S., et al. Mcunet: Tiny deep learning on iot devices. Advances in Neural Information Processing Systems, 33:11711–11722,
-
[23]
Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. 2023a. Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023b. Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu, S.-C., Tafjord, O., Cl...
-
[24]
A White Paper on Neural Network Quantization
Nagel, M., Fournarakis, M., Amjad, R. A., Bondarenko, Y., Van Baalen, M., and Blankevoort, T. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295,
-
[25]
Park, G., Park, B., Kwon, S. J., Kim, B., Lee, Y., and Lee, D. nuqmm: Quantized matmul for efficient inference of large-scale generative language models. arXiv preprint arXiv:2206.09557,
-
[26]
Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116,
-
[27]
Multitask Prompted Training Enables Zero-Shot Task Generalization
Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207,
-
[28]
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100,
-
[29]
Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Fu, D. Y., Xie, Z., Chen, B., Barrett, C., Gonzalez, J. E., et al. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865,
-
[30]
LLaMA: Open and Efficient Foundation Language Models
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, ...
-
[32]
URL https://arxiv.org/abs/2012.09852. Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-Aware Automated Quantization with Mixed Precision. In CVPR,
-
[33]
Finetuned Language Models Are Zero-Shot Learners
Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652,
-
[34]
Outlier suppression: Pushing the limit of low-bit transformer language models
Wei, X., Zhang, Y., Zhang, X., Gong, R., Zhang, S., Zhang, Q., Yu, F., and Liu, X. Outlier suppression: Pushing the limit of low-bit transformer language models, 2022a. URL https://arxiv.org/abs/2209.13325. Wei, X., Zhang, Y., Zhang, X., Gong, R., Zhang, S., Zhang, Q., Yu, F., and Liu, X. Outlier suppression: Pushing the limit of low-bit transformer lan...
-
[35]
Smoothquant: Accurate and efficient post-training quantization for large language models
Xiao, G., Lin, J., Seznec, M., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. arXiv preprint arXiv:2211.10438,
-
[36]
URL https://arxiv.org/abs/2206.01861. Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. Mm-vet: Evaluating large multi-modal models for integrated capabilities. arXiv preprint arXiv:2308.02490,
-
[37]
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Zhang, R., Han, J., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., Gao, P., and Qiao, Y. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199,